[ANN] pandas 0.1, a new NumPy-based data analysis library


Wes McKinney
Hello all,

(resending, as this didn't make it through to the ML the first time)

I'm very happy to announce the release of a new data analysis library
that many of you will hopefully find useful. This release is the
product of a long period of development and use; hence, despite the
low version number, it is quite suitable for general use. The
documentation is still a bit sparse but will become much more complete
in the coming weeks and months.

Info / Documentation: http://pandas.sourceforge.net/
Overview slides: http://pandas.googlecode.com/files/nyfpug.pdf

What it is
==========

pandas is a library for pan-el da-ta analysis, i.e. multidimensional
time series and cross-sectional data sets commonly found in
statistics, econometrics, or finance. It provides convenient and
easy-to-understand NumPy-based data structures for generic labeled
data, with focus on automatically aligning data based on its label(s)
and handling missing observations. One major goal of the library is to
simplify the implementation of statistical models on unreliable data.

Main Features
=============

* Data structures: for 1-, 2-, and 3-dimensional labeled data
 sets. Some of their main features include:

   * Automatically aligning data
   * Handling missing observations in calculations
   * Convenient slicing and reshaping ("reindexing") functions
   * 'Group by' aggregation or transformation functionality
   * Tools for merging / joining together data sets
   * Simple matplotlib integration for plotting

* Date tools: objects for expressing date offsets or generating date
 ranges; some functionality similar to scikits.timeseries

* Statistical models: convenient ordinary least squares and panel OLS
 implementations for in-sample or rolling time series /
 cross-sectional regressions. These will hopefully be the starting
 point for implementing other models
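
As a rough illustration of the "reindexing" and missing-observation items in the list above, here is a minimal plain-Python sketch. It does not use pandas and is not the pandas 0.1 API; the helper names `reindex` and `nanmean` and the sample labels are assumptions made up for this example.

```python
import math

def reindex(series, new_labels):
    """Conform a label -> value mapping to new_labels, with NaN where absent."""
    return {label: series.get(label, math.nan) for label in new_labels}

def nanmean(series):
    """Mean of the observed (non-NaN) values; NaN if nothing was observed."""
    observed = [v for v in series.values() if not math.isnan(v)]
    return sum(observed) / len(observed) if observed else math.nan

# Hypothetical labeled data: "a" is dropped, "d" has no observation.
series = {"a": 1.0, "b": 2.0, "c": 3.0}
reindexed = reindex(series, ["b", "c", "d"])

print(reindexed)           # "d" is present but holds NaN
print(nanmean(reindexed))  # mean over the observed values only
```

The point of the sketch is only that labels, not positions, decide which values line up, and that missing labels become explicit NaN entries that later calculations can skip.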

pandas is not necessarily intended as a standalone library but rather
as something which can be used in tandem with other NumPy-based
packages like scikits.statsmodels. Wheel reinvention has been avoided
wherever possible. Also, its time series manipulation capability is
not as extensive as that of scikits.timeseries, though pandas does
have its own time series object which fits into the unified data model.

Some other useful tools for time series data (moving average, standard
deviation, etc.) are available in the codebase but do not yet have a
convenient interface. These will be highlighted in a future release.
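
For readers unfamiliar with these tools, a moving average or moving standard deviation can be sketched in plain Python as below. This is only an assumption about the general technique, not the code in the pandas codebase; the function name `rolling` and the choice to emit `None` until the window fills are hypothetical.

```python
import statistics

def rolling(values, window, func):
    """Apply func to each full trailing window; None until the window fills."""
    out = []
    for i in range(len(values)):
        if i + 1 < window:
            out.append(None)  # not enough observations yet
        else:
            out.append(func(values[i + 1 - window : i + 1]))
    return out

data = [1.0, 2.0, 3.0, 4.0, 5.0]
moving_avg = rolling(data, 3, statistics.mean)   # 3-period moving average
moving_std = rolling(data, 3, statistics.stdev)  # 3-period sample std dev

print(moving_avg)
print(moving_std)
```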

Where to get it
===============

The source code is currently hosted on googlecode at:

http://pandas.googlecode.com

Releases can currently be downloaded from the Python Package Index or
installed using easy_install:

PyPI: http://pypi.python.org/pypi/pandas/

License
=======

BSD

Documentation
=============

The official documentation is hosted on SourceForge.

http://pandas.sourceforge.net/

The sphinx documentation is still in an incomplete state, but it
should provide a good starting point for learning how to use the
library. Expect the docs to continue to expand as time goes on.

Background
==========

Work on pandas started at AQR (a quantitative hedge fund) in 2008 and
has been under active development since then.

Discussion and Development
==========================

Since pandas development is related to a number of other scientific
Python projects, questions are welcome on the scipy-user mailing
list. Specialized discussions or design issues should take place on
the pystatsmodels mailing list / google group, where
scikits.statsmodels and other libraries will also be discussed:

http://groups.google.com/group/pystatsmodels

Best regards,

Wes McKinney
_______________________________________________
SciPy-User mailing list
[hidden email]
http://mail.scipy.org/mailman/listinfo/scipy-user

Re: [ANN] pandas 0.1, a new NumPy-based data analysis library

Timmie
Hello,
thanks for the announcement.

> * Date tools: objects for expressing date offsets or generating date
>  ranges; some functionality similar to scikits.timeseries
Why do you create data structures similar to scikits.timeseries?
Couldn't you reuse the functionality from scikits.timeseries?

Could you see chances to design an interface between both packages?
I have a lot of timeseries code. I would love to reuse that together
with your package.

Thanks in advance for clarifications,
Timmie


Re: [ANN] pandas 0.1, a new NumPy-based data analysis library

Wes McKinney
On Wed, Dec 30, 2009 at 8:36 AM, Tim Michelsen
<[hidden email]> wrote:
> Hello,
> thanks for the announcement.
>
>> * Date tools: objects for expressing date offsets or generating date
>>  ranges; some functionality similar to scikits.timeseries
> Why do you create data structures similar to scikits.timeseries?
> Couldn't you reuse the functionality from scikits.timeseries?

I think there are two relevant questions here:

  - Why don't I use scikits.timeseries's Date and DateArray objects

In pandas I wanted to stick with working with python datetime objects,
and I needed objects to encapsulate generic date shifts (like "add 5
business days"). The idea was to extend the dateutil.relativedelta
concept to handle business days, last business day of month, etc. Once
you've done that, generating date ranges is a fairly trivial (albeit
not super efficient) next step. The DateRange class is also a valid
Index for a Series or DataFrame object and requires no conversion
(I plan to write more about this in the docs when I get a chance).
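
To illustrate the kind of date shift described here, the following is a minimal sketch using only the standard library `datetime` module. It is not the pandas offset or DateRange implementation; the function names `add_business_days` and `business_day_range` are hypothetical, and holidays are deliberately ignored (only weekends are skipped).

```python
from datetime import date, timedelta

def add_business_days(start, n):
    """Shift start by n business days (Mon-Fri); n may be negative."""
    step = timedelta(days=1 if n >= 0 else -1)
    current = start
    remaining = abs(n)
    while remaining > 0:
        current += step
        if current.weekday() < 5:  # Monday=0 ... Friday=4
            remaining -= 1
    return current

def business_day_range(start, end):
    """Yield every business day from start to end, inclusive."""
    current = start
    while current <= end:
        if current.weekday() < 5:
            yield current
        current += timedelta(days=1)

start = date(2009, 12, 30)               # a Wednesday
shifted = add_business_days(start, 5)    # "add 5 business days"
days = list(business_day_range(start, date(2010, 1, 5)))

print(shifted)
print(days)
```

Stepping one day at a time keeps the logic obvious at the cost of speed, which matches the "fairly trivial (albeit not super efficient)" remark above.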

  - Why don't I use scikits.timeseries for the time series data itself

I don't think you were asking this, but I have gotten this question
from others. We should probably have a broader discussion about
handling time series data particularly given the recent datetime dtype
addition to NumPy. In any case, there are many reasons why I didn't
use it-- the main one is that I wanted to have a unified data model
(i.e. use the same basic class) for both time series and
cross-sectional data. The scikits.timeseries TimeSeries object behaves
too differently. Here's one example:

http://pytseries.sourceforge.net/core.timeseries.operations.html#binary-operations

for adding two scikits.timeseries.TimeSeries
"
When the second input is another TimeSeries object, the two series
must satisfy the following conditions:

        * they must have the same frequency;
        * they must be sorted in chronological order;
        * they must have matching dates;
        * they must have the same shape.
"

pandas does not know or care about the frequency, shape, or sortedness
of the two TimeSeries. If the above conditions are met, it will bypass
the "matching logic" and go at NumPy vectorized binary op speed. But
if you break one of the above conditions, it will still match dates
and produce a TimeSeries result. If you break the conditions above
with scikits.timeseries, you will get a MaskedArray result and lose
all of your date information (correct me if I'm wrong).
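
The behaviour described above can be sketched in plain Python as follows. This is an illustration of the idea only, not pandas code: the name `add_series` is hypothetical, series are modelled as date -> value dicts, and filling unmatched dates with NaN over the union of dates is an assumption (the text above only says that dates are matched).

```python
import math
from datetime import date

def add_series(left, right):
    """Add two date -> value mappings, matching on dates."""
    if list(left) == list(right):
        # Fast path: identical dates in identical order, so pair positionally.
        return {d: a + b for d, a, b in zip(left, left.values(), right.values())}
    # Slow path: match on the sorted union of dates (assumed behaviour);
    # a date missing from either side yields NaN.
    all_dates = sorted(set(left) | set(right))
    return {d: left.get(d, math.nan) + right.get(d, math.nan) for d in all_dates}

a = {date(2009, 12, 28): 1.0, date(2009, 12, 29): 2.0, date(2009, 12, 30): 3.0}
# b is unsorted and covers a different set of dates.
b = {date(2009, 12, 30): 10.0, date(2009, 12, 29): 20.0, date(2009, 12, 31): 30.0}

total = add_series(a, b)
for d, value in total.items():
    print(d, value)
```

Either way the result keeps its dates, which is the contrast being drawn with falling back to a bare MaskedArray.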

pandas's TimeSeries-specific functionality could definitely be much
improved, but I think the easier option for now would be to provide an
interface between the two libraries, as you suggest:

> Could you see chances to design an interface between both packages?
> I have a lot of timeseries code. I would love to reuse that together
> with your package.

Designing a bridge interface between the two packages would probably
be pretty easy and fairly desirable. If you could give me some
examples of what you're doing in your time series code that would be
helpful to know.

> Thanks in advance for clarifications,
> Timmie

Re: [ANN] pandas 0.1, a new NumPy-based data analysis library

Matt Knox-4
Wes McKinney <wesmckinn <at> gmail.com> writes:

> I don't think you were asking this, but I have gotten this question
> from others. We should probably have a broader discussion about
> handling time series data particularly given the recent datetime dtype
> addition to NumPy.

Agreed. I think once the numpy datetime dtype matures a bit, it would be
worthwhile to have a "meeting of the minds" on the future of time series
data in python in general. In the meantime, I think it is very healthy to have
some different approaches out in the wild (scikits.timeseries, pandas, nipy
timeseries) to allow people to flesh out ideas, see what works, what doesn't,
where there is overlap, etc. Hopefully we can then unite the efforts and not
end up with a confusing landscape of multiple time series packages like R has.

However, I think any specific interoperability work between the packages is a
bit premature at this point until the final vision is a bit clearer.


> for adding two scikits.timeseries.TimeSeries
> "
> When the second input is another TimeSeries object, the two series
> must satisfy the following conditions:
>
>         * they must have the same frequency;
>         * they must be sorted in chronological order;
>         * they must have matching dates;
>         * they must have the same shape.
> "
>
> pandas does not know or care about the frequency, shape, or sortedness
> of the two TimeSeries. If the above conditions are met, it will bypass
> the "matching logic" and go at NumPy vectorized binary op speed. But
> if you break one of the above conditions, it will still match dates
> and produce a TimeSeries result.

Believe it or not, what you just described is along the lines of how the
original scikits.timeseries prototype behaved. It drew inspiration from the
"FAME 4GL" time series language. FAME does all of the frequency / shape
matching implicitly. It was decided (by the two-person committee of Pierre and
me) that this behaviour felt a little too alien relative to the standard numpy
array objects so we went back to the drawing board and used a more conservative
approach. That is to say, frequency conversion and alignment must be done
explicitly in the scikits.timeseries module. In practice, I don't find this to
be a burden and like the extra clarity in the code, but it really depends what
kind of problems you are solving, and certainly personal preference and
experience play a big role.

At any rate, looking forward to seeing how the pandas module evolves and
hopefully we can collaborate at some point in the future.

- Matt

