I want to resample a large (400k+) dataset where x are datetime objects and y are floats. The x data are epoch seconds from the past week. For the purposes of this example, I've crudely downsampled them, taking every 10th element (Python prompt changed to "... " to fool Gmane).

... len(t)
43051
... len(x)
43051
... pprint([datetime.datetime.fromtimestamp(_) for _ in t[:10]])
[datetime.datetime(2015, 1, 12, 0, 0),
 datetime.datetime(2015, 1, 12, 0, 0, 46, 742044),
 datetime.datetime(2015, 1, 12, 0, 1, 3, 320089),
 datetime.datetime(2015, 1, 12, 0, 1, 23, 700560),
 datetime.datetime(2015, 1, 12, 0, 1, 44, 583401),
 datetime.datetime(2015, 1, 12, 0, 1, 57, 733937),
 datetime.datetime(2015, 1, 12, 0, 2, 38, 30245),
 datetime.datetime(2015, 1, 12, 0, 3, 35, 336342),
 datetime.datetime(2015, 1, 12, 0, 4, 23, 833251),
 datetime.datetime(2015, 1, 12, 0, 4, 48, 272131)]
... pprint([datetime.datetime.fromtimestamp(_) for _ in t[-10:]])
[datetime.datetime(2015, 1, 19, 23, 56, 9, 996926),
 datetime.datetime(2015, 1, 19, 23, 56, 12, 104080),
 datetime.datetime(2015, 1, 19, 23, 56, 12, 158963),
 datetime.datetime(2015, 1, 19, 23, 56, 12, 280701),
 datetime.datetime(2015, 1, 19, 23, 56, 12, 337853),
 datetime.datetime(2015, 1, 19, 23, 56, 22, 169709),
 datetime.datetime(2015, 1, 19, 23, 56, 29, 676865),
 datetime.datetime(2015, 1, 19, 23, 57, 14, 570601),
 datetime.datetime(2015, 1, 19, 23, 58, 56, 394975),
 datetime.datetime(2015, 1, 19, 23, 59, 37, 707367)]

So, let's get started, downsampling our 43k points to 250:

... res_x, res_t = signal.resample(x, 250, t)

(Final Jeopardy tune plays...)

If I understand correctly, signal.resample should generate 250 evenly spaced points from each of the inputs.

... len(res_x)
250
... len(res_t)
250

So far, so good. Now, look at the range of res_t:

... pprint([datetime.datetime.fromtimestamp(_) for _ in res_t[:10]])
[datetime.datetime(2015, 1, 12, 0, 0),
 datetime.datetime(2015, 1, 12, 2, 14, 9, 166940),
 datetime.datetime(2015, 1, 12, 4, 28, 18, 333880),
 datetime.datetime(2015, 1, 12, 6, 42, 27, 500820),
 datetime.datetime(2015, 1, 12, 8, 56, 36, 667761),
 datetime.datetime(2015, 1, 12, 11, 10, 45, 834701),
 datetime.datetime(2015, 1, 12, 13, 24, 55, 1641),
 datetime.datetime(2015, 1, 12, 15, 39, 4, 168581),
 datetime.datetime(2015, 1, 12, 17, 53, 13, 335521),
 datetime.datetime(2015, 1, 12, 20, 7, 22, 502461)]
... pprint([datetime.datetime.fromtimestamp(_) for _ in res_t[-10:]])
[datetime.datetime(2015, 2, 3, 8, 36, 40, 65638),
 datetime.datetime(2015, 2, 3, 10, 50, 49, 232578),
 datetime.datetime(2015, 2, 3, 13, 4, 58, 399518),
 datetime.datetime(2015, 2, 3, 15, 19, 7, 566458),
 datetime.datetime(2015, 2, 3, 17, 33, 16, 733398),
 datetime.datetime(2015, 2, 3, 19, 47, 25, 900338),
 datetime.datetime(2015, 2, 3, 22, 1, 35, 67279),
 datetime.datetime(2015, 2, 4, 0, 15, 44, 234219),
 datetime.datetime(2015, 2, 4, 2, 29, 53, 401159),
 datetime.datetime(2015, 2, 4, 4, 44, 2, 568099)]

That doesn't look right at all. I'm sure I'm using an outdated version of scipy:

... scipy.version.version
'0.9.0'

but it's what I have available (it's a long story). If this is a bug requiring upgrade, I'll beat on the powers that be to get a newer version of scipy. I'm happy to provide my data to anyone who would be willing to try this exercise out using a more recent version.

Thanks,

Skip Montanaro
_______________________________________________
SciPy-User mailing list
[hidden email]
http://mail.scipy.org/mailman/listinfo/scipy-user

On Tue, Jan 20, 2015 at 6:14 PM, Skip Montanaro <[hidden email]> wrote:
> I want to resample a large (400k+) dataset where x are datetime

I doubt that an upgrade will fix your issue; I don't see any bug fixes to signal.resample since 0.9.0 that look relevant.

I don't understand how this works for you at all; a quick test with ``t = list_of_datetimes`` gives me:

    TypeError: unsupported operand type(s) for /: 'datetime.timedelta' and 'float'

If you can provide a reproducible example on a generated set of data, that would be the easiest (we can use that as a regression test). Otherwise providing your code with your actual dataset is also OK - if you send me a link or email it to me I'll have a look.

Cheers,
Ralf
In reply to this post by Skip Montanaro
I managed to download, build and install scipy 0.15.1. I get a similar (though quantitatively different) result.

>>> len(t)
430509
>>> len(x)
430509
>>> res_x, res_t = signal.resample(x[::100], 250, t[::100])
>>> len(res_x)
250
>>> len(res_t)
250
>>> t[-1]
1421733595.509921
>>> res_t[-1]
1422456460.5224724
>>> pprint([datetime.datetime.fromtimestamp(t[0]), datetime.datetime.fromtimestamp(t[-1])])
[datetime.datetime(2015, 1, 12, 0, 0),
 datetime.datetime(2015, 1, 19, 23, 59, 55, 509921)]
>>> pprint([datetime.datetime.fromtimestamp(res_t[0]), datetime.datetime.fromtimestamp(res_t[-1])])
[datetime.datetime(2015, 1, 12, 0, 0),
 datetime.datetime(2015, 1, 28, 8, 47, 40, 522472)]

I assume I'm doing something wrong to cause it to expand the range like that. I didn't see any arguments in the help() output which obviously suggested I could change this particular behavior though.

On Tue, Jan 20, 2015 at 2:36 PM, Skip Montanaro <[hidden email]> wrote:
> I managed to download, build and install scipy 0.15.1. I get a

In case it's rounding issues (my guess), you could try to subtract t[0] from t, and add it again after the resample. There is a small chance it helps.

Josef
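
The shift-and-restore idea can be tried on the time axis alone, without scipy. This is only a sketch under assumptions: `new_times` is a hypothetical stand-in that replays the time-axis formula quoted later in this thread, `base` is an assumed epoch value, and the gaps are the first few from the first post. It says nothing about the resampled y values.

```python
def new_times(t, num):
    """Hypothetical stand-in for resample's time axis (formula quoted later in the thread)."""
    step = (t[1] - t[0]) * len(t) / float(num)
    return [i * step + t[0] for i in range(num)]

# Illustrative epoch seconds: an assumed base plus the first gaps from the first post.
base = 1421042400.0
t = [base + s for s in (0.0, 46.742044, 63.320089, 83.700560, 104.583401)]

# Route 1: feed the raw epoch seconds straight in.
direct = new_times(t, 2)

# Route 2: subtract t[0] first, then add it back afterwards.
shifted = [v - t[0] for v in t]
restored = [v + t[0] for v in new_times(shifted, 2)]

# Largest disagreement between the two routes, in seconds.
max_diff = max(abs(a - b) for a, b in zip(direct, restored))
print(max_diff)
```

On these values the two routes agree to within float rounding, which suggests that rounding in the timestamps is unlikely to account for a range that grows by days.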
In reply to this post by Skip Montanaro
On Tue, Jan 20, 2015 at 2:36 PM, Skip Montanaro <[hidden email]> wrote:
> I managed to download, build and install scipy 0.15.1. I get a

`resample` assumes the samples are uniformly spaced, but your timestamps are not. Here are your first timestamps (from your first email):

    In [40]: t
    Out[40]:
    [datetime.datetime(2015, 1, 12, 0, 0),
     datetime.datetime(2015, 1, 12, 0, 0, 46, 742044),
     datetime.datetime(2015, 1, 12, 0, 1, 3, 320089),
     datetime.datetime(2015, 1, 12, 0, 1, 23, 700560),
     datetime.datetime(2015, 1, 12, 0, 1, 44, 583401),
     datetime.datetime(2015, 1, 12, 0, 1, 57, 733937),
     datetime.datetime(2015, 1, 12, 0, 2, 38, 30245),
     datetime.datetime(2015, 1, 12, 0, 3, 35, 336342),
     datetime.datetime(2015, 1, 12, 0, 4, 23, 833251),
     datetime.datetime(2015, 1, 12, 0, 4, 48, 272131)]

`dt` holds the intervals between each timestamp. For `resample` to work as expected, these should all be the same:

    In [41]: dt = np.array([delta.total_seconds() for delta in np.diff(t)])

    In [42]: dt
    Out[42]:
    array([ 46.742044,  16.578045,  20.380471,  20.882841,  13.150536,
            40.296308,  57.306097,  48.496909,  24.43888 ])

By the way, it might be just luck that `resample` didn't crash when given a sequence of `datetime.datetime` objects for `t`. I don't think any of the functions in scipy.signal were explicitly designed to handle `datetime` objects. (There are no tests of such input in the test suite.) In this case, it "works" because of the formula used to create the new time values. Because `resample` assumes the input is uniformly sampled, it needs only the first time difference to figure out the new timestamps. Here's how the new time values are computed in `resample` (`Nx` and `num` are the old and new number of samples, respectively):

    new_t = arange(0, num) * (t[1] - t[0]) * Nx / float(num) + t[0]

I.e.

    new_t = arange(0, num) * new_dt + t[0]

where

    new_dt = (t[1] - t[0]) * Nx / float(num)

`t[1] - t[0]` is a `datetime.timedelta` object, and `new_t` ends up as an array (with object dtype) of `datetime.datetime` instances.

Warren
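
The formula above can be replayed with nothing but the standard library, using the first two timestamps and the sample counts from the first post (43051 points down to 250). This is a sketch, not scipy's code: `replay_new_t` is a hypothetical helper, and it assumes Python 3, where a `timedelta` can be divided by a float.

```python
import datetime

def replay_new_t(t0, t1, nx, num):
    """Replay the quoted new_t formula; only t[0] and t[1] are ever used."""
    new_dt = (t1 - t0) * nx / float(num)  # first gap scaled by Nx / num
    return [t0 + i * new_dt for i in range(num)]

# First two timestamps and sample counts as posted in the thread.
t0 = datetime.datetime(2015, 1, 12, 0, 0)
t1 = datetime.datetime(2015, 1, 12, 0, 0, 46, 742044)
new_t = replay_new_t(t0, t1, nx=43051, num=250)

print(new_t[1])   # second resampled time
print(new_t[-1])  # last resampled time
```

The second value lands a little over two and a quarter hours after the start, and the last one falls on February 4, in line with the `res_t` output in the first post. Because the first gap (about 46.7 s) is larger than the average spacing over the week, scaling it by `Nx / num` stretches the new time axis well past the real end of the data.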