Status of TimeSeries SciKit


Re: Status of TimeSeries SciKit

Matt Knox-4

Wes McKinney <wesmckinn <at> gmail.com> writes:

> > Frequency conversion flexibility:
> >    - allow you to specify where to place the value - the start or end of the
> >      period - when converting from lower frequency to higher frequency (eg.
> >      monthly to daily)
>
> I'll make sure to make this available as an option. When going
> low-to-high you have two interpolation options: forward fill (aka
> "pad") and back fill, which I think is what you're saying?
>

I guess I had a bit of a misunderstanding when I wrote this comment because I
was framing things in the context of how I think about the scikits.timeseries
module. Monthly frequency dates (or TimeSeries) in the scikit don't have any
day information at all. So when converting to daily you need to tell it
where to place the value (eg. Jan 1, or Jan 31). Note that this is a SEPARATE
decision from wanting to back fill or forward fill.

However, since pandas uses regular datetime objects, the day of the month is
already embedded in it. A potential drawback of this approach is that to
support "start of period" stuff you need to add a separate frequency,
effectively doubling the number of frequencies. And if you account for
"business day end of month" and "regular day end of month", then you have to
quadruple the number of frequencies. You'd have "EOM", "SOM", "BEOM", "BSOM".
Similarly for all the quarterly frequencies, annual frequencies, and so on.
Whether this is a major problem in practice or not, I don't know.
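For illustration, the place-at-start-or-end decision described above can be expressed with the Period API in today's pandas, where asfreq takes a how='start'/'end' argument separate from any fill choice. A minimal sketch, assuming a reasonably recent pandas:

```python
import pandas as pd

# A monthly period carries no day-of-month information by itself.
p = pd.Period('2009-01', freq='M')

# Converting to daily means choosing where to place the value:
start = p.asfreq('D', how='start')  # Period('2009-01-01', 'D')
end = p.asfreq('D', how='end')      # Period('2009-01-31', 'D')
```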

> >    - support of a larger number of frequencies
>
> Which ones are you thinking of? Currently I have:
>
> - hourly, minutely, secondly (and things like 5-minutely can be done,
> e.g. Minute(5))
> - daily / business daily
> - weekly (anchored on a particular weekday)
> - monthly / business month-end
> - (business) quarterly, anchored on jan/feb/march
> - annual / business annual (start and end)

I think it is missing quarterly frequencies anchored at the other 9 months of
the year. If, for example, you work at a weird Canadian Bank like me, then your
fiscal year end is October.

Other than that, it has all the frequencies I care about. Semi-annual would be
a nice touch, but it's not that important to me, and the timeseries module doesn't have
it either. People have also asked for higher frequencies in the timeseries
module before (eg. millisecond), but that is not something I personally care
about.

> > Indexing:
> >    - slicing with dates (looks like "truncate" method does this, but would
> >      be nice to be able to just use slicing directly)
>
> you can use fancy indexing to do this now, e.g:
>
> ts.ix[d1:d2]
>
> I could push this down into __getitem__ and __setitem__ too without much work

I see. I'd be +1 on pushing it down into __getitem__ and __setitem__
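As a hedged sketch of how date-based slicing looks on a date-indexed Series in today's pandas, where it did move into plain indexing:

```python
import pandas as pd
from datetime import datetime

idx = pd.date_range('2000-01-01', periods=10, freq='D')
ts = pd.Series(range(10), index=idx)

# Slicing directly with dates via __getitem__; both endpoints are inclusive
d1, d2 = datetime(2000, 1, 3), datetime(2000, 1, 6)
chunk = ts[d1:d2]      # Jan 3 through Jan 6 -> 4 values
same = ts.loc[d1:d2]   # equivalent explicit label-based spelling
```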

> > - full missing value support (TimeSeries class is a subclass of MaskedArray)
>
> I challenge you to find a (realistic) use case where the missing value
> support in pandas is inadequate. I'm being completely serious =) But
> I've been very vocal about my dislike of MaskedArrays in the missing
> data discussions. They're hard for (normal) people to use, degrade
> performance, use extra memory, etc. They add a layer of complication
> for working with time series that strikes me as completely
> unnecessary.

From my understanding, pandas just uses NaNs for missing values. So that means
strings, ints, or anything besides floats are not supported. So that
is my major issue with it. I agree that masked arrays are overly complicated
and it is not ideal. Hopefully the improved missing value support in numpy will
provide the best of both worlds.
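Matt's point is easy to demonstrate: because NaN is a float, introducing a missing value into an integer column forces an upcast. A small sketch with today's pandas:

```python
import pandas as pd

s = pd.Series([1, 2, 3])        # dtype: int64
s2 = s.reindex([0, 1, 2, 3])    # label 3 has no value, so NaN is inserted
# The integer data are silently upcast to float64 to hold the NaN.
print(s.dtype, s2.dtype)        # int64 float64
```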

- Matt


_______________________________________________
SciPy-User mailing list
[hidden email]
http://mail.scipy.org/mailman/listinfo/scipy-user

Re: Status of TimeSeries SciKit

Wes McKinney
On Wed, Jul 27, 2011 at 1:54 PM, Matt Knox <[hidden email]> wrote:

>
> Wes McKinney <wesmckinn <at> gmail.com> writes:
>
>> > Frequency conversion flexibility:
>> >    - allow you to specify where to place the value - the start or end of the
>> >      period - when converting from lower frequency to higher frequency (eg.
>> >      monthly to daily)
>>
>> I'll make sure to make this available as an option. When going
>> low-to-high you have two interpolation options: forward fill (aka
>> "pad") and back fill, which I think is what you're saying?
>>
>
> I guess I had a bit of a misunderstanding when I wrote this comment because I
> was framing things in the context of how I think about the scikits.timeseries
> module. Monthly frequency dates (or TimeSeries) in the scikit don't have any
> day information at all. So when converting to daily you need to tell it
> where to place the value (eg. Jan 1, or Jan 31). Note that this is a SEPARATE
> decision from wanting to back fill or forward fill.
>
> However, since pandas uses regular datetime objects, the day of the month is
> already embedded in it. A potential drawback of this approach is that to
> support "start of period" stuff you need to add a separate frequency,
> effectively doubling the number of frequencies. And if you account for
> "business day end of month" and "regular day end of month", then you have to
> quadruple the number of frequencies. You'd have "EOM", "SOM", "BEOM", "BSOM".
> Similarly for all the quarterly frequencies, annual frequencies, and so on.
> Whether this is a major problem in practice or not, I don't know.

I see what you mean. I'm going to wait until the dust on the NumPy
stuff settles and then figure out what to do. Using datetime objects
is good and bad-- it makes life a lot easier in many ways but some
things are less clean as a result. Should start documenting all the
use cases on a wiki somewhere.

>> >    - support of a larger number of frequencies
>>
>> Which ones are you thinking of? Currently I have:
>>
>> - hourly, minutely, secondly (and things like 5-minutely can be done,
>> e.g. Minute(5))
>> - daily / business daily
>> - weekly (anchored on a particular weekday)
>> - monthly / business month-end
>> - (business) quarterly, anchored on jan/feb/march
>> - annual / business annual (start and end)
>
> I think it is missing quarterly frequencies anchored at the other 9 months of
> the year. If, for example, you work at a weird Canadian Bank like me, then your
> fiscal year end is October.

For quarterly you need only anchor on Jan/Feb/March right?

In [76]: list(DateRange('1/1/2000', '1/1/2002',
offset=datetools.BQuarterEnd(startingMonth=1)))
Out[76]:
[datetime.datetime(2000, 1, 31, 0, 0),
 datetime.datetime(2000, 4, 28, 0, 0),
 datetime.datetime(2000, 7, 31, 0, 0),
 datetime.datetime(2000, 10, 31, 0, 0),
 datetime.datetime(2001, 1, 31, 0, 0),
 datetime.datetime(2001, 4, 30, 0, 0),
 datetime.datetime(2001, 7, 31, 0, 0),
 datetime.datetime(2001, 10, 31, 0, 0)]

(I know, I'm trying to get rid of the camel casing floating around...)

> Other than that, it has all the frequencies I care about. Semi-annual would be
> a nice touch, but it's not that important to me, and the timeseries module doesn't have
> it either. People have also asked for higher frequencies in the timeseries
> module before (eg. millisecond), but that is not something I personally care
> about.

numpy.datetime64 will help here. I've a mind to start playing with TAQ
(US equity tick data) in the near future in which case my requirements
will change.

>> > Indexing:
>> >    - slicing with dates (looks like "truncate" method does this, but would
>> >      be nice to be able to just use slicing directly)
>>
>> you can use fancy indexing to do this now, e.g:
>>
>> ts.ix[d1:d2]
>>
>> I could push this down into __getitem__ and __setitem__ too without much work
>
> I see. I'd be +1 on pushing it down into __getitem__ and __setitem__

I agree, little harm done. The main annoying detail here is working
with integer labels. __getitem__ needs to be integer-based when you
have integers, while using .ix[...] will do label-based always.
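That ambiguity is, incidentally, why later pandas split the indexer in two: .loc is always label-based and .iloc always position-based. A sketch, assuming a pandas recent enough to have both:

```python
import pandas as pd

# Integer labels where position and label disagree
s = pd.Series([10, 20, 30], index=[2, 1, 0])

a = s.loc[0]    # label-based lookup    -> 30
b = s.iloc[0]   # position-based lookup -> 10
```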

>> > - full missing value support (TimeSeries class is a subclass of MaskedArray)
>>
>> I challenge you to find a (realistic) use case where the missing value
>> support in pandas is inadequate. I'm being completely serious =) But
>> I've been very vocal about my dislike of MaskedArrays in the missing
>> data discussions. They're hard for (normal) people to use, degrade
>> performance, use extra memory, etc. They add a layer of complication
>> for working with time series that strikes me as completely
>> unnecessary.
>
> From my understanding, pandas just uses NaNs for missing values. So that means
> strings, ints, or anything besides floats are not supported. So that
> is my major issue with it. I agree that masked arrays are overly complicated
> and it is not ideal. Hopefully the improved missing value support in numpy will
> provide the best of both worlds.

It's admittedly a kludge but I use NaN as the universal missing-data
marker for lack of a better alternative (basically I'm trying to
emulate R as much as I can). So you can literally have:

In [93]: df2
Out[93]:
    A     B       C        D         E
0   foo   one    -0.7883   0.7743    False
1   NaN   one    -0.5866   0.06009   False
2   foo   two     0.9312   1.2       True
3   NaN   three  -0.6417   0.3444    False
4   foo   two    -0.8841  -0.08126   False
5   bar   two     1.194   -0.7933    True
6   foo   one    -1.624   -0.1403    NaN
7   foo   three   0.5046   0.5833    True

To cope with this there are functions isnull and notnull which work on
every dtype and can recognize NaNs in non-floating point arrays:

In [96]: df2[notnull(df2['A'])]
Out[96]:
    A     B       C        D         E
0   foo   one    -0.7883   0.7743    False
2   foo   two     0.9312   1.2       True
4   foo   two    -0.8841  -0.08126   False
5   bar   two     1.194   -0.7933    True
6   foo   one    -1.624   -0.1403    NaN
7   foo   three   0.5046   0.5833    True

In [98]: df2['E'].fillna('missing')
Out[98]:
0    foo
1    missing
2    foo
3    missing
4    foo
5    bar
6    foo
7    foo

trying to index with a "boolean" array with NAs gives a slightly
helpful error message:

In [101]: df2[df2['E']]
ValueError: cannot index with vector containing NA / NaN values

but

In [102]: df2[df2['E'].fillna(False)]
Out[102]:
    A     B       C        D        E
2   foo   two     0.9312   1.2      True
5   bar   two     1.194   -0.7933   True
7   foo   three   0.5046   0.5833   True

Really crossing my fingers for favorable NA support in NumPy.

> - Matt
>
>

Re: Status of TimeSeries SciKit

Pierre GM-2

On Jul 27, 2011, at 8:09 PM, Wes McKinney wrote:

> On Wed, Jul 27, 2011 at 1:54 PM, Matt Knox <[hidden email]> wrote:
>>
>> Wes McKinney <wesmckinn <at> gmail.com> writes:
>>
>>>> Frequency conversion flexibility:
>>>>    - allow you to specify where to place the value - the start or end of the
>>>>      period - when converting from lower frequency to higher frequency (eg.
>>>>      monthly to daily)
>>>
>>> I'll make sure to make this available as an option. When going
>>> low-to-high you have two interpolation options: forward fill (aka
>>> "pad") and back fill, which I think is what you're saying?
>>>
>>
>> I guess I had a bit of a misunderstanding when I wrote this comment because I
>> was framing things in the context of how I think about the scikits.timeseries
>> module. Monthly frequency dates (or TimeSeries) in the scikit don't have any
>> day information at all. So when converting to daily you need to tell it
>> where to place the value (eg. Jan 1, or Jan 31). Note that this is a SEPARATE
>> decision from wanting to back fill or forward fill.
>>
>> However, since pandas uses regular datetime objects, the day of the month is
>> already embedded in it. A potential drawback of this approach is that to
>> support "start of period" stuff you need to add a separate frequency,
>> effectively doubling the number of frequencies. And if you account for
>> "business day end of month" and "regular day end of month", then you have to
>> quadruple the number of frequencies. You'd have "EOM", "SOM", "BEOM", "BSOM".
>> Similarly for all the quarterly frequencies, annual frequencies, and so on.
>> Whether this is a major problem in practice or not, I don't know.
>
> I see what you mean. I'm going to wait until the dust on the NumPy
> stuff settles and then figure out what to do. Using datetime objects
> is good and bad-- it makes life a lot easier in many ways but some
> things are less clean as a result. Should start documenting all the
> use cases on a wiki somewhere.

That's why we used integers to represent dates. We have rules to convert from integers to datetimes and back.
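The integer scheme can be sketched in plain Python for a monthly frequency; month_to_int/int_to_month are hypothetical names, but the arithmetic is the essence of the representation:

```python
# Hypothetical helpers sketching an integer-backed monthly Date:
# each month maps to one integer, so date arithmetic and frequency
# conversion reduce to integer arithmetic.
def month_to_int(year, month):
    return year * 12 + (month - 1)

def int_to_month(n):
    year, m = divmod(n, 12)
    return year, m + 1

n = month_to_int(2011, 7)                 # July 2011
assert int_to_month(n) == (2011, 7)       # round-trips
assert int_to_month(n + 1) == (2011, 8)   # "next month" is just n + 1
```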

>>
>> I think it is missing quarterly frequencies anchored at the other 9 months of
>> the year. If, for example, you work at a weird Canadian Bank like me, then your
>> fiscal year end is October.
>
> For quarterly you need only anchor on Jan/Feb/March right?

No. You need to be able to define your own quarters. For example, it's fairly common in climatology to define winter as DJF, so your year actually starts on March 1st.
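A hedged aside: in today's pandas this kind of custom quarter is spelled as an anchored frequency, e.g. 'Q-FEB', which makes Dec/Jan/Feb a single quarter. A sketch, assuming a recent pandas:

```python
import pandas as pd

# Quarters anchored so the fiscal year ends in February:
# quarters end in Feb/May/Aug/Nov, making DJF one quarter.
winter = pd.Period('2000Q4', freq='Q-FEB')
print(winter.start_time)  # first moment of December 1999
print(winter.end_time)    # last moment of February 2000
```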



>
>>>> Indexing:
>>>>    - slicing with dates (looks like "truncate" method does this, but would
>>>>      be nice to be able to just use slicing directly)
>>>
>>> you can use fancy indexing to do this now, e.g:
>>>
>>> ts.ix[d1:d2]
>>>
>>> I could push this down into __getitem__ and __setitem__ too without much work
>>
>> I see. I'd be +1 on pushing it down into __getitem__ and __setitem__
>
> I agree, little harm done. The main annoying detail here is working
> with integer labels. __getitem__ needs to be integer-based when you
> have integers, while using .ix[...] will do label-based always.

Overloading __getitem__/__setitem__ isn't always ideal in Python. That was one aspect I tried to push to C, but it still needs a lot of work.


>
>>>> - full missing value support (TimeSeries class is a subclass of MaskedArray)
>>>
>>> I challenge you to find a (realistic) use case where the missing value
>>> support in pandas is inadequate. I'm being completely serious =) But
>>> I've been very vocal about my dislike of MaskedArrays in the missing
>>> data discussions. They're hard for (normal) people to use, degrade
>>> performance, use extra memory, etc. They add a layer of complication
>>> for working with time series that strikes me as completely
>>> unnecessary.

</sigh>
Let's wait a bit and see how missing/ignored values are getting supported, shall we?



Re: Status of TimeSeries SciKit

Matt Knox-4
In reply to this post by Wes McKinney

Wes McKinney <wesmckinn <at> gmail.com> writes:

> > I think it is missing quarterly frequencies anchored at the other 9 months of
> > the year. If, for example, you work at a weird Canadian Bank like me, then
> > your fiscal year end is October.
>
> For quarterly you need only anchor on Jan/Feb/March right?
>
> In [76]: list(DateRange('1/1/2000', '1/1/2002',
> offset=datetools.BQuarterEnd(startingMonth=1)))
> Out[76]:
> [datetime.datetime(2000, 1, 31, 0, 0),
>  datetime.datetime(2000, 4, 28, 0, 0),
>  datetime.datetime(2000, 7, 31, 0, 0),
>  datetime.datetime(2000, 10, 31, 0, 0),
>  datetime.datetime(2001, 1, 31, 0, 0),
>  datetime.datetime(2001, 4, 30, 0, 0),
>  datetime.datetime(2001, 7, 31, 0, 0),
>  datetime.datetime(2001, 10, 31, 0, 0)]

I guess this again gets back to the fact that it is datetime objects being used
and the series itself doesn't really have any "frequency" information
contained in it in pandas. So in pandas, a March based quarterly frequency
really is identical to a June based quarterly frequency.

My use case for this type of stuff would be "calendarizing" things like
earnings.

For example, let's say I had the following data:
   
Company A - fiscal year end October
2009q1 15.7
2009q2 16.1
2009q3 16.6
etc...

Company B - fiscal year end April
2009q1 12.9
2009q2 11.2
2009q3 13.5
etc...

In the first case, 2009q1 is Nov 2008 - Jan 2009. In the second case it is
May 2008 - July 2008. This can be handled without too much extra work in
pandas, by preconverting your quarters to actual dates. I think it is a bit
less clean than in the timeseries module where I would just specify Q-OCT for
the frequency and then everything is done for me. But it is not something I
would lose sleep over. And the workaround is not that onerous.
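For what it's worth, anchored quarterly periods in today's pandas express exactly this (the 'Q-OCT' spelling Matt mentions); a sketch of the Company A case, assuming a recent pandas:

```python
import pandas as pd

# Company A: fiscal year ends in October, so 2009Q1 = Nov 2008 - Jan 2009
q = pd.PeriodIndex(['2009Q1', '2009Q2', '2009Q3'], freq='Q-OCT')
earnings = pd.Series([15.7, 16.1, 16.6], index=q)

print(q[0].start_time.date())  # 2008-11-01
print(q[0].end_time.date())    # 2009-01-31
```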



Re: Status of TimeSeries SciKit

Wes McKinney
On Wed, Jul 27, 2011 at 3:42 PM, Matt Knox <[hidden email]> wrote:

>
> Wes McKinney <wesmckinn <at> gmail.com> writes:
>
>> > I think it is missing quarterly frequencies anchored at the other 9 months of
>> > the year. If, for example, you work at a weird Canadian Bank like me, then
>> > your fiscal year end is October.
>>
>> For quarterly you need only anchor on Jan/Feb/March right?
>>
>> In [76]: list(DateRange('1/1/2000', '1/1/2002',
>> offset=datetools.BQuarterEnd(startingMonth=1)))
>> Out[76]:
>> [datetime.datetime(2000, 1, 31, 0, 0),
>>  datetime.datetime(2000, 4, 28, 0, 0),
>>  datetime.datetime(2000, 7, 31, 0, 0),
>>  datetime.datetime(2000, 10, 31, 0, 0),
>>  datetime.datetime(2001, 1, 31, 0, 0),
>>  datetime.datetime(2001, 4, 30, 0, 0),
>>  datetime.datetime(2001, 7, 31, 0, 0),
>>  datetime.datetime(2001, 10, 31, 0, 0)]
>
> I guess this again gets back to the fact that it is datetime objects being used
> and the series itself doesn't really have any "frequency" information
> contained in it in pandas. So in pandas, a March based quarterly frequency
> really is identical to a June based quarterly frequency.
>
> My use case for this type of stuff would be "calendarizing" things like
> earnings.
>
> For example, let's say I had the following data:
>
> Company A - fiscal year end October
> 2009q1 15.7
> 2009q2 16.1
> 2009q3 16.6
> etc...
>
> Company B - fiscal year end April
> 2009q1 12.9
> 2009q2 11.2
> 2009q3 13.5
> etc...
>
> In the first case, 2009q1 is Nov 2008 - Jan 2009. In the second case it is
> May 2008 - July 2008. This can be handled without too much extra work in
> pandas, by preconverting your quarters to actual dates. I think it is a bit
> less clean than in the timeseries module where I would just specify Q-OCT for
> the frequency and then everything is done for me. But it is not something I
> would lose sleep over. And the workaround is not that onerous.
>
>

Thanks, I've got it now. A little slow on the uptake =)

Re: Status of TimeSeries SciKit

Timmie
Administrator
In reply to this post by Pierre GM-2
Hello all,
similar to Dharhas, I was a strong user of the time series scikit from
the very beginning.
Since most of my code for meteorological data evaluations is based on
it, I would be happy to receive information on the conclusion and how I
need to adjust my code to keep up with new developments.

>>>>> - full missing value support (TimeSeries class is a subclass of MaskedArray)
>>>>
>>>> I challenge you to find a (realistic) use case where the missing value
>>>> support in pandas is inadequate. I'm being completely serious =) But
>>>> I've been very vocal about my dislike of MaskedArrays in the missing
>>>> data discussions. They're hard for (normal) people to use, degrade
>>>> performance, use extra memory, etc. They add a layer of complication
>>>> for working with time series that strikes me as completely
>>>> unnecessary.
>
> </sigh>
> Let's wait a bit and see how missing/ignored values are getting supported, shall we?
How does Pandas deal with missing values?

This page:
http://pandas.sourceforge.net/missing_data.html?highlight=missing
is empty.

The convenient support for missing data (once date converters were
out) in timeseries helps a lot to quickly deal with measurement logs or
incomplete data.

Best regards,
Timmie


Re: Status of TimeSeries SciKit

Wes McKinney
On Wed, Jul 27, 2011 at 6:31 PM, Tim Michelsen
<[hidden email]> wrote:
> Hello all,
> similar to Dharhas, I was a strong user of the time series scikit from
> the very beginning.
> Since most of my code for meteorological data evaluations is based on
> it, I would be happy to receive information on the conclusion and how I
> need to adjust my code to keep up with new developments.

When it gets to that point I'd be happy to help (including looking at
some of your existing code and data).

>>>>>> - full missing value support (TimeSeries class is a subclass of MaskedArray)
>>>>>
>>>>> I challenge you to find a (realistic) use case where the missing value
>>>>> support in pandas is inadequate. I'm being completely serious =) But
>>>>> I've been very vocal about my dislike of MaskedArrays in the missing
>>>>> data discussions. They're hard for (normal) people to use, degrade
>>>>> performance, use extra memory, etc. They add a layer of complication
>>>>> for working with time series that strikes me as completely
>>>>> unnecessary.
>>
>> </sigh>
>> Let's wait a bit and see how missing/ignored values are getting supported, shall we?
> How does Pandas deal with missing values?

discussed a bit in my reply here:

http://article.gmane.org/gmane.comp.python.scientific.user/29661

In short using NaN across the dtypes with special functions
isnull/notnull to detect NaN in dtype=object arrays. I'm hopeful this
can be replaced with native NumPy NA support in the relatively near
future...

> This page:
> http://pandas.sourceforge.net/missing_data.html?highlight=missing
> is empty.
>
> The convenient support for missing data (once date converters were
> out) in timeseries helps a lot to quickly deal with measurement logs or
> incomplete data.
>
> Best regards,
> Timmie
>

Re: Status of TimeSeries SciKit

Timmie
Administrator
>> Since most of my code for meteorological data evaluations is based on
>> it, I would be happy to receive information on the conclusion and how I
>> need to adjust my code to keep up with new developments.
>
> When it gets to that point I'd be happy to help (including looking at
> some of your existing code and data).
In short, my process goes like this:
* QC of incoming measurement data
* visualisation and statistics (basics, distribution analysis)
* reporting
* back- & forecasting with other (modeled) data
* preparation of result data sets

When it comes to QC I would need:
* a check on missing dates (i.e. failure of acquisition equipment)
* a check on double dates (= failure of the data logger)
* data integrity and plausibility tests with certain filters/flags

All these need to be reported on:
* data recovery
* invalid data by filter/flag type
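(A hedged sketch of those checks with today's pandas, on hypothetical hourly logger data; all names and values are illustrative:)

```python
import pandas as pd

# Hypothetical logger output: a duplicated stamp at 01:00, a gap at 02:00
idx = pd.to_datetime(['2011-01-01 00:00', '2011-01-01 01:00',
                      '2011-01-01 01:00', '2011-01-01 03:00'])
ts = pd.Series([4.2, 4.4, 4.4, 5.0], index=idx)

doubles = ts.index[ts.index.duplicated()]         # double dates -> logger failure
expected = pd.date_range(idx.min(), idx.max(), freq='h')
missing = expected.difference(ts.index)           # missing dates -> acquisition failure
recovery = 1 - len(missing) / len(expected)       # data-recovery ratio for reporting
```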

So far, I have been using masked arrays, mainly because they are heavily
used in the time series scikit and transferring masks from one array to
another is quite easy once you have learned the basics.

Would you work these items out in pandas, as well?

P.S. Your presentation "Time series analysis in Python with statsmodels"
is really cool and has shown me some good aspects of the HP filters.

Regards,
Timmie


Re: Status of TimeSeries SciKit

Timmie
Administrator
In reply to this post by Wes McKinney
>>> It just so happens that Wes' use cases (from my understanding) are
>>> basically the same as mine (finance, etc). So from my own selfish point
>>> of view, the idea of pandas swallowing up the timeseries module and
>>> incorporating its functionality sounds kind of nice since that would
>>> give ME (and probably most of the people that work in the finance
>>> domain)
>>
>> I think that it is really great if the different packages doing time
>> series analysis unite. It will probably give better packages technically,
>> and there is a lot of value to the community in such work.
>
> I agree. I already have 50% or more of the features in
> scikits.timeseries, so this gets back to my fragmentation argument
> (users being stuck with a confusing choice between multiple
> libraries). Let's make it happen!
So what needs to be done to move things forward?
Do we need to draw up a roadmap?
A table with functions that respond to common use cases in natural
science, computing, and economics?


Re: Status of TimeSeries SciKit

Wes McKinney
On Sat, Jul 30, 2011 at 11:50 AM, Tim Michelsen
<[hidden email]> wrote:

>>>> It just so happens that Wes' use cases (from my understanding) are
>>>> basically the same as mine (finance, etc). So from my own selfish point
>>>> of view, the idea of pandas swallowing up the timeseries module and
>>>> incorporating its functionality sounds kind of nice since that would
>>>> give ME (and probably most of the people that work in the finance
>>>> domain)
>>>
>>> I think that it is really great if the different packages doing time
>>> series analysis unite. It will probably give better packages technically,
>>> and there is a lot of value to the community in such work.
>>
>> I agree. I already have 50% or more of the features in
>> scikits.timeseries, so this gets back to my fragmentation argument
>> (users being stuck with a confusing choice between multiple
>> libraries). Let's make it happen!
> So what needs to be done to move things forward?
> Do we need to draw up a roadmap?
> A table with functions that respond to common use cases in natural
> science, computing, and economics?
>

Having a place to collect concrete use cases (like your list from the
prior e-mail, but with illustrative code snippets) would be good.
You're welcome to start doing it here:

https://github.com/wesm/pandas/wiki

A good place to start, which I can do when I have some time, would be
to start moving the scikits.timeseries code into pandas. There are
several key components:

- Date and DateArray stuff, frequency implementations
- masked array time series implementations (record array and not)
- plotting
- reporting, moving window functions, etc.

We need to evaluate Date/DateArray as they relate to numpy.datetime64
and see what can be done. I haven't looked closely but I'm not sure if
all the convenient attribute access stuff (day, month, day_of_week,
weekday, etc.) is available in NumPy yet. I suspect it would be
reasonably straightforward to wrap DateArray so it can be an Index for
a pandas object.

I won't have much time for this until mid-August, but a couple days'
hacking should get most of the pieces into place. I guess we can just
keep around the masked array classes for legacy API support and for
feature completeness.

- Wes

Re: Status of TimeSeries SciKit

Timmie
Administrator
> >> I agree. I already have 50% or more of the features in
> >> scikits.timeseries, so this gets back to my fragmentation argument
> >> (users being stuck with a confusing choice between multiple
> >> libraries). Let's make it happen!
> > So what needs to be done to move things forward?
> > Do we need to draw up a roadmap?
> > A table with functions that respond to common use cases in natural
> > science, computing, and economics?
> Having a place to collect concrete use cases (like your list from the
> prior e-mail, but with illustrative code snippets) would be good.
> You're welcome to start doing it here:
>
> https://github.com/wesm/pandas/wiki
Here goes:
https://github.com/wesm/pandas/wiki/Time-Series-Manipulation 

I will fill it with my stuff.
Shall we file feature requests directly as issues?

> A good place to start, which I can do when I have some time, would be
> to start moving the scikits.timeseries code into pandas. There are
> several key components
>
> - Date and DateArray stuff, frequency implementations
> - masked array time series implementations (record array and not)
> - plotting
> - reporting, moving window functions, etc.
>
> We need to evaluate Date/DateArray as they relate to numpy.datetime64
> and see what can be done. I haven't looked closely but I'm not sure if
> all the convenient attribute access stuff (day, month, day_of_week,
> weekday, etc.) is available in NumPy yet. I suspect it would be
> reasonably straightforward to wrap DateArray so it can be an Index for
> a pandas object.
>
> I won't have much time for this until mid-August, but a couple days'
> hacking should get most of the pieces into place. I guess we can just
> keep around the masked array classes for legacy API support and for
> feature completeness.
I value the work of Pierre and Matt very much.
But my difficulty with the scikit was that the code is too complex, so I was
only able to contribute helper functions or doc fixes.
Please, let's make it happen that this effort is not a one- or three-man show but
results in something which can be maintained by the whole community.

Nevertheless, the timeseries scikit made my work more comfortable and
understandable than I was able to manage with R.

Regards,
Timmie


Re: Status of TimeSeries SciKit

Pierre GM-2

On Aug 2, 2011, at 9:37 AM, Tim Michelsen wrote:

>>>> I agree. I already have 50% or more of the features in
>>>> scikits.timeseries, so this gets back to my fragmentation argument
>>>> (users being stuck with a confusing choice between multiple
>>>> libraries). Let's make it happen!
>>> So what needs to be done to move things forward?
>>> Do we need to draw up a roadmap?
>>> A table with functions that respond to common use cases in natural
>>> science, computing, and economics?
>> Having a place to collect concrete use cases (like your list from the
>> prior e-mail, but with illustrative code snippets) would be good.
>> You're welcome to start doing it here:
>>
>> https://github.com/wesm/pandas/wiki
> Here goes:
> https://github.com/wesm/pandas/wiki/Time-Series-Manipulation 
>
> I will fill it with my stuff.
> Shall we file feature requests directly as issues?
>
>> A good place to start, which I can do when I have some time, would be
>> to start moving the scikits.timeseries code into pandas. There are
>> several key components
>>
>> - Date and DateArray stuff, frequency implementations
>> - masked array time series implementations (record array and not)
>> - plotting
>> - reporting, moving window functions, etc.
>>
>> We need to evaluate Date/DateArray as they relate to numpy.datetime64
>> and see what can be done. I haven't looked closely but I'm not sure if
>> all the convenient attribute access stuff (day, month, day_of_week,
>> weekday, etc.) is available in NumPy yet. I suspect it would be
>> reasonably straightforward to wrap DateArray so it can be an Index for
>> a pandas object.
>>
>> I won't have much time for this until mid-August, but a couple days'
>> hacking should get most of the pieces into place. I guess we can just
>> keep around the masked array classes for legacy API support and for
>> feature completeness.
> I value very much the work of Pierre and Matt.
> But my difficulty with the scikit was that the code is too complex, so I was
> only able to contribute helper functions and doc fixes.
> Please, let's make it happen that this effort is not a one- or three-man show but
> results in something which can be maintained by the whole community.

The apparent complexity of the code likely comes from the fact that some features were coded directly in C (not even Cython) for efficiency. That, and that it relied on MaskedArray, of course ;)


> Nevertheless, the timeseries scikit made my work more comfortable and
> understandable than I was able to manage with R.

Great! That was the purpose.

Re: Status of TimeSeries SciKit

Wes McKinney
In reply to this post by Timmie
On Sat, Jul 30, 2011 at 7:40 AM, Tim Michelsen
<[hidden email]> wrote:
>>> Since most of my code for meteorological data evaluations is based on
>>> it, I would be happy to receive information on the conclusion and how I
>>> need to adjust my code to upkeep with new developments.
>>
>> When it gets to that point I'd be happy to help (including looking at
>> some of your existing code and data).

Sorry I've been out of commission for the last week or so.

> In short my process goes like:
> * QC of incoming measurements data
> * visualisation and statistics (basics, distribution analysis)
> * reporting
> * back- & forecasting with other (modeled) data
> * preparation of result data sets
>
> When it comes to QC I would need:
> * check on missing dates (i.e. failure of acquisition equipment)
> * check on double dates (= failure of data logger)
> * data integrity and plausibility tests with certain filters/flags
>
> All these need to be reported on:
> * data recovery
> * invalid data by filter/flag type
>
> So far, I have been using the masked arrays, mainly because they are heavily
> used in the time series scikit and transferring masks from one array to
> another is quite easy once you learn the basics.
>
> Would you work these items out in pandas, as well?
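The QC checks listed above can be sketched in a few lines; this is only an illustration with made-up data (the -999.0 invalid flag and all names here are hypothetical), using a pandas date index for the missing/duplicate checks and numpy.ma to transfer a mask between arrays:

```python
import numpy as np
import pandas as pd

# Hypothetical logger output: daily readings with one gap and one duplicate stamp.
stamps = pd.DatetimeIndex(['2011-07-01', '2011-07-02', '2011-07-02', '2011-07-04'])
readings = np.array([1.2, 3.4, 3.4, -999.0])  # -999.0: logger's invalid-data flag

# Check on double dates (= failure of the data logger).
dupes = stamps[stamps.duplicated()]

# Check on missing dates (failure of acquisition equipment):
# compare against the full expected daily range.
expected = pd.date_range(stamps.min(), stamps.max(), freq='D')
missing = expected.difference(stamps)

# Plausibility filter via a masked array; the same mask can then be
# transferred to companion arrays from the same logger.
flagged = np.ma.masked_where(readings <= -999.0, readings)
other = np.ma.array([0.1, 0.2, 0.3, 0.4], mask=np.ma.getmask(flagged))

# Report: data recovery = valid readings over expected stamps.
recovery = flagged.count() / len(expected)  # here: 3 valid of 4 expected
print(list(missing), list(dupes), recovery)
```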

I would need to look at code and see the concrete use cases. As with
anything else, you adapt solutions to your problems based on your
available tools.

> P.S. Your presentation "Time series analysis in Python with statsmodels"
> is really cool and has shown me good aspects about the HP filters
>

Thanks...still lots to do on the TSA front. The filtering work has all
been Skipper's.

> Regards,
> Timmie
>
>

Re: Status of TimeSeries SciKit

Wes McKinney
In reply to this post by Timmie
On Tue, Aug 2, 2011 at 3:37 AM, Tim Michelsen
<[hidden email]> wrote:

>> >> I agree. I already have 50% or more of the features in
>> >> scikits.timeseries, so this gets back to my fragmentation argument
>> >> (users being stuck with a confusing choice between multiple
>> >> libraries). Let's make it happen!
>> > So what needs to be done to move things forward?
>> > Do we need to draw up a roadmap?
>> > A table with functions that respond to common use cases in natural
>> > science, computing, and economics?
>> Having a place to collect concrete use cases (like your list from the
>> prior e-mail, but with illustrative code snippets) would be good.
>> You're welcome to start doing it here:
>>
>> https://github.com/wesm/pandas/wiki
> Here goes:
> https://github.com/wesm/pandas/wiki/Time-Series-Manipulation
>
> I will fill it with my stuff.
> Shall we file feature requests directly as issues?

Cool, I will start adding things when I have some time. Feel free to
file feature requests as issues tagged with "Enhancement".

>> A good place to start, which I can do when I have some time, would be
>> to start moving the scikits.timeseries code into pandas. There are
>> several key components
>>
>> - Date and DateArray stuff, frequency implementations
>> - masked array time series implementations (record array and not)
>> - plotting
>> - reporting, moving window functions, etc.
>>
>> We need to evaluate Date/DateArray as they relate to numpy.datetime64
>> and see what can be done. I haven't looked closely but I'm not sure if
>> all the convenient attribute access stuff (day, month, day_of_week,
>> weekday, etc.) is available in NumPy yet. I suspect it would be
>> reasonably straightforward to wrap DateArray so it can be an Index for
>> a pandas object.
>>
>> I won't have much time for this until mid-August, but a couple days'
>> hacking should get most of the pieces into place. I guess we can just
>> keep around the masked array classes for legacy API support and for
>> feature completeness.
> I value very much the work of Pierre and Matt.
> But my difficulty with the scikit was that the code is too complex, so I was
> only able to contribute helper functions and doc fixes.
> Please, let's make it happen that this effort is not a one- or three-man show but
> results in something which can be maintained by the whole community.

Yes, I agree. I am painfully aware of being one of the only people
consistently working on the data structure front (judging from commit
activity at least) but I would like to get more people involved. I'm
hopeful that increasing awareness of what we're working on (e.g. I've
started blogging about pandas and related things) will draw new people
into the projects.

> Nevertheless, the timeseries scikit made my work more comfortable and
> understandable than I was able to manage with R.
>
> Regards,
> Timmie
>
>