Hello all,
(resending, as this didn't make it through to the ML the first time)

I'm very happy to announce the release of a new data analysis library that many of you will hopefully find useful. This release is the product of a long period of development and use; hence, despite the low version number, it is quite suitable for general use. The documentation is still a bit sparse but will become much more complete in the coming weeks and months.

Info / Documentation: http://pandas.sourceforge.net/
Overview slides: http://pandas.googlecode.com/files/nyfpug.pdf

What it is
==========

pandas is a library for panel data analysis, i.e. multidimensional time series and cross-sectional data sets commonly found in statistics, econometrics, or finance. It provides convenient and easy-to-understand NumPy-based data structures for generic labeled data, with focus on automatically aligning data based on its label(s) and handling missing observations. One major goal of the library is to simplify the implementation of statistical models on unreliable data.

Main Features
=============

* Data structures: for 1, 2, and 3 dimensional labeled data sets. Some of their main features include:
    * Automatically aligning data
    * Handling missing observations in calculations
    * Convenient slicing and reshaping ("reindexing") functions
    * Provide 'group by' aggregation or transformation functionality
    * Tools for merging / joining together data sets
    * Simple matplotlib integration for plotting
* Date tools: objects for expressing date offsets or generating date ranges; some functionality similar to scikits.timeseries
* Statistical models: convenient ordinary least squares and panel OLS implementations for in-sample or rolling time series / cross-sectional regressions. These will hopefully be the starting point for implementing other models

pandas is not necessarily intended as a standalone library but rather as something which can be used in tandem with other NumPy-based packages like scikits.statsmodels.
Where possible wheel-reinvention has largely been avoided. Also, its time series manipulation capability is not as extensive as scikits.timeseries; pandas does have its own time series object which fits into the unified data model.

Some other useful tools for time series data (moving average, standard deviation, etc.) are available in the codebase but do not yet have a convenient interface. These will be highlighted in a future release.

Where to get it
===============

The source code is currently hosted on googlecode at:

http://pandas.googlecode.com

Releases can be downloaded currently on the Python package index or using easy_install

PyPi: http://pypi.python.org/pypi/pandas/

License
=======

BSD

Documentation
=============

The official documentation is hosted on SourceForge.

http://pandas.sourceforge.net/

The sphinx documentation is still in an incomplete state, but it should provide a good starting point for learning how to use the library. Expect the docs to continue to expand as time goes on.

Background
==========

Work on pandas started at AQR (a quantitative hedge fund) in 2008 and has been under active development since then.

Discussion and Development
==========================

Since pandas development is related to a number of other scientific Python projects, questions are welcome on the scipy-user mailing list. Specialized discussions or design issues should take place on the pystatsmodels mailing list / google group, where scikits.statsmodels and other libraries will also be discussed:

http://groups.google.com/group/pystatsmodels

Best regards,
Wes McKinney

_______________________________________________
SciPy-User mailing list
[hidden email]
http://mail.scipy.org/mailman/listinfo/scipyuser
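
To make "automatically aligning data based on its label(s) and handling missing observations" a little more concrete, here is a minimal pure-Python sketch of the idea. It is only an illustration of the concept and does not use or reproduce the pandas API; the helper names `align_add` and `nanmean` and the sample data are hypothetical.

```python
import math


def align_add(left, right):
    """Add two label -> value mappings, aligning on the union of labels.

    A label that is absent from either side produces NaN, which stands in
    for a missing observation. (Hypothetical helper, not a pandas function.)
    """
    labels = sorted(set(left) | set(right))
    return {
        label: left.get(label, math.nan) + right.get(label, math.nan)
        for label in labels
    }


def nanmean(values):
    """Mean that skips missing (NaN) observations instead of propagating them."""
    present = [v for v in values if not math.isnan(v)]
    return sum(present) / len(present) if present else math.nan


# Two made-up data sets whose labels only partially overlap.
series_a = {"2009-12-28": 1.0, "2009-12-29": 2.0, "2009-12-30": 3.0}
series_b = {"2009-12-29": 10.0, "2009-12-30": 20.0, "2009-12-31": 30.0}

total = align_add(series_a, series_b)
print(total)                    # labels present on only one side come out as nan
print(nanmean(total.values()))  # the statistic is computed over observed values only
```

The point of the sketch is that the caller never has to trim or sort the inputs by hand: matching is done on labels, and gaps are carried along as missing values that later calculations can skip.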

Hello,
thanks for the announcement.

> * Date tools: objects for expressing date offsets or generating date
> ranges; some functionality similar to scikits.timeseries

Why do you create data structures similar to scikits.timeseries?
Couldn't you reuse the functionality from scikits.timeseries?

Could you see chances to design an interface between both packages?
I have a lot of timeseries code. I would love to reuse that together with your package.

Thanks in advance for clarifications,
Timmie
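
For readers unfamiliar with the terms in the feature list quoted above, the following is a rough pure-Python sketch of what a "date offset" object and a "date range" built from it could look like. It is an assumption-laden illustration, not the pandas implementation or API: the class name `BusinessDays` and the function `date_range` are hypothetical, and a business day is taken to mean Monday to Friday with no holiday calendar.

```python
from datetime import date, timedelta


class BusinessDays:
    """Hypothetical offset object meaning 'shift by n business days'.

    Business days are assumed to be Monday-Friday; holidays are ignored.
    """

    def __init__(self, n):
        self.n = n

    def apply(self, day):
        """Return the date n business days after (or before, if n < 0) day."""
        step = timedelta(days=1 if self.n >= 0 else -1)
        remaining = abs(self.n)
        while remaining:
            day += step
            if day.weekday() < 5:  # 0-4 are Monday-Friday
                remaining -= 1
        return day


def date_range(start, end, offset):
    """Yield dates from start to end (inclusive) by repeatedly applying offset.

    The start date is yielded as given, even if it is not a business day.
    """
    current = start
    while current <= end:
        yield current
        following = offset.apply(current)
        if following <= current:
            raise ValueError("offset must move forward in time")
        current = following


start = date(2009, 12, 30)  # a Wednesday
print(BusinessDays(5).apply(start))
print(list(date_range(start, date(2010, 1, 5), BusinessDays(1))))
```

Once an offset knows how to shift a single date, generating a range is just repeated application, which is simple but not especially efficient.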
On Wed, Dec 30, 2009 at 8:36 AM, Tim Michelsen <[hidden email]> wrote:
> Hello,
> thanks for the announcement.
>
>> * Date tools: objects for expressing date offsets or generating date
>> ranges; some functionality similar to scikits.timeseries
> Why do you create data structures similar to scikits.timeseries?
> Couldn't you reuse the functionality from scikits.timeseries?

I think there are two relevant questions here:

- Why don't I use scikits.timeseries's Date and DateArray objects

In pandas I wanted to stick with working with python datetime objects, and I needed objects to encapsulate generic date shifts (like "add 5 business days"). The idea was to extend the dateutil.relativedelta concept to handle business days, last business day of month, etc. Once you've done that, generating date ranges is a fairly trivial (albeit not super efficient) next step. The DateRange class is also a valid Index for a Series or DataFrame object and requires no conversion (plan to write more about this in the docs when I get a chance).

- Why don't I use the scikits.timeseries for time series data itself

I don't think you were asking this, but I have gotten this question from others. We should probably have a broader discussion about handling time series data, particularly given the recent datetime dtype addition to NumPy. In any case, there are many reasons why I didn't use it; the main one is that I wanted to have a unified data model (i.e. use the same basic class) for both time series and cross-sectional data. The scikits.timeseries TimeSeries object behaves too differently. Here's one example:

http://pytseries.sourceforge.net/core.timeseries.operations.html#binaryoperations

for adding two scikits.timeseries.TimeSeries

"
When the second input is another TimeSeries object, the two series must satisfy the following conditions:

* they must have the same frequency;
* they must be sorted in chronological order;
* they must have matching dates;
* they must have the same shape.
"

pandas does not know or care about the frequency, shape, or sortedness of the two TimeSeries. If the above conditions are met, it will bypass the "matching logic" and go at NumPy vectorized binary op speed. But if you break one of the above conditions, it will still match dates and produce a TimeSeries result. If you break the conditions above with scikits.timeseries, you will get a MaskedArray result and lose all of your date information (correct me if I'm wrong).

pandas's TimeSeries-specific functionality could definitely be much improved, but I think the easier option for now would be to provide an interface between the two libraries, as you suggest:

> Could you see chances to design an interface between both packages?
> I have a lot of timeseries code. I would love to reuse that together
> with your package.

Designing a bridge interface between the two packages would probably be pretty easy and fairly desirable. If you could give me some examples of what you're doing in your time series code, that would be helpful to know.

> Thanks in advance for clarifications,
> Timmie
Wes McKinney <wesmckinn <at> gmail.com> writes:

> I don't think you were asking this, but I have gotten this question
> from others. We should probably have a broader discussion about
> handling time series data particularly given the recent datetime dtype
> addition to NumPy.

Agreed. I think once the numpy datetime dtype matures a bit, it would be worthwhile to have a "meeting of the minds" on the future of time series data in python in general. In the meantime, I think it is very healthy to have some different approaches out in the wild (scikits.timeseries, pandas, nipy timeseries) to allow people to flesh out ideas, see what works, what doesn't, where there is overlap, etc. Hopefully we can then unite the efforts and not end up with a confusing landscape of multiple time series packages like R has. However, I think any specific interoperability work between the packages is a bit premature at this point until the final vision is a bit clearer.

> for adding two scikits.timeseries.TimeSeries
> "
> When the second input is another TimeSeries object, the two series
> must satisfy the following conditions:
>
> * they must have the same frequency;
> * they must be sorted in chronological order;
> * they must have matching dates;
> * they must have the same shape.
> "
>
> pandas does not know or care about the frequency, shape, or sortedness
> of the two TimeSeries. If the above conditions are met, it will bypass
> the "matching logic" and go at NumPy vectorized binary op speed. But
> if you break one of the above conditions, it will still match dates
> and produce a TimeSeries result.

Believe it or not, what you just described is along the lines of how the original scikits.timeseries prototype behaved. It drew inspiration from the "FAME 4GL" time series language. FAME does all of the frequency / shape matching implicitly.
It was decided (by the two-person committee of Pierre and me) that this behaviour felt a little too alien relative to the standard numpy array objects, so we went back to the drawing board and used a more conservative approach. That is to say, frequency conversion and alignment must be done explicitly in the scikits.timeseries module. In practice, I don't find this to be a burden and like the extra clarity in the code, but it really depends on what kind of problems you are solving, and certainly personal preference and experience play a big role.

At any rate, looking forward to seeing how the pandas module evolves and hopefully we can collaborate at some point in the future.

- Matt
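
The two policies debated in this thread, implicit date matching versus explicit alignment by the caller, can be contrasted in a small pure-Python sketch. This is a simplified illustration under stated assumptions and reproduces neither library's actual behaviour or API: a series is modeled as a plain list of dates plus a list of values, frequency is ignored, and the function names `reindex`, `add_explicit`, and `add_implicit` are hypothetical.

```python
import math
from datetime import date


def reindex(dates, values, new_dates):
    """Explicitly conform a series to new_dates, filling gaps with NaN."""
    lookup = dict(zip(dates, values))
    return [lookup.get(d, math.nan) for d in new_dates]


def add_explicit(a_dates, a_values, b_dates, b_values):
    """Conservative policy: refuse to add unless the dates already match."""
    if a_dates != b_dates:
        raise ValueError("dates differ; align the series explicitly first")
    return a_dates, [x + y for x, y in zip(a_values, b_values)]


def add_implicit(a_dates, a_values, b_dates, b_values):
    """Implicit policy: fast path when dates match, else align on the union."""
    if a_dates == b_dates:
        # Nothing to match up, so go straight to the element-wise addition.
        return add_explicit(a_dates, a_values, b_dates, b_values)
    union = sorted(set(a_dates) | set(b_dates))
    return add_explicit(
        union, reindex(a_dates, a_values, union),
        union, reindex(b_dates, b_values, union),
    )


# Made-up series whose dates only partially overlap.
a_dates = [date(2009, 12, 28), date(2009, 12, 29)]
a_values = [1.0, 2.0]
b_dates = [date(2009, 12, 29), date(2009, 12, 30)]
b_values = [10.0, 20.0]

print(add_implicit(a_dates, a_values, b_dates, b_values))

try:
    add_explicit(a_dates, a_values, b_dates, b_values)
except ValueError as exc:
    print("explicit policy:", exc)
```

Under the explicit policy the caller would run `reindex` on both inputs before adding, which costs a line or two but makes the alignment step visible in the code; the implicit policy hides that step behind the operator.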