This thread started on the numpy list:
http://mail.scipy.org/pipermail/numpy-discussion/2010-November/053958.html

I think we should narrow the focus of the package by only including functions that operate on numpy arrays. That would cut out date utilities, label indexing utilities, and binary operations with various join methods on the labels. It would leave us with three categories: faster versions of numpy/scipy nan functions, moving window statistics, and group functions.

I suggest we add a fourth category: normalization.

FASTER NUMPY/SCIPY NAN FUNCTIONS

This work is already underway: http://github.com/kwgoodman/nanny

The function signatures for these are easy: we copy numpy, scipy. (I am tempted to change nanstd from scipy's bias=False to ddof=0.)

I'd like to use a partial sort for nanmedian. Anyone interested in coding that?

dtype: int32, int64, float64 for now
ndim: 1, 2, 3 (need some recursive magic for nd > 3; that's an open project for anyone)

MOVING WINDOW STATISTICS

I already have doc strings and unit tests (https://github.com/kwgoodman/la/blob/master/la/farray/mov.py). And I have a cython prototype that moves the window backwards so that the stats can be filled in place. (This assumes we make a copy of the data at the top of the function: arr = arr.astype(float))

Proposed function signature: mov_sum(arr, window, axis=-1), mov_nansum(arr, window, axis=-1)

If you don't like mov, then: move? roll?

I think requesting a minimum number of non-nan elements in a window or else returning NaN is clever. But I do like the simple signature above.

Binary moving window functions: mov_nancorr(arr1, arr2, window, axis=-1), etc.

Optional: moving window bootstrap estimate of error (std) of the moving statistic. So, what's the std of each estimate in the mov_median output? Too specialized?

dtype: float64
ndim: 1, 2, 3, recursive for nd > 0

NORMALIZATION

I already have nd versions of ranking, zscore, quantile, demean, demedian, etc in larry. We should rename to nanzscore etc.

ranking and quantile could use some cython love.

I don't know, should we cut this category?

GROUP FUNCTIONS

Input: array, sequence of labels such as a list, axis.

For an array of shape (n,m), axis=0, and a list of n labels with d distinct values, group_nanmean would return a (d,m) array. I'd also like a groupfilter_nanmean which would return a (n,m) array and would have an additional, optional input: exclude_self=False.

NAME

What should we call the package?

Numa, numerical analysis with numpy arrays
Dana, data analysis with numpy arrays

import dana as da (da=data analysis)

ARE YOU CRAZY?

If you read this far, you are crazy and would be a good fit for this project.

_______________________________________________
SciPy-User mailing list
[hidden email]
http://mail.scipy.org/mailman/listinfo/scipy-user
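The proposed mov_nansum(arr, window, axis=-1) signature can be sketched in plain NumPy as follows. This is only an illustration under stated assumptions, not the cython prototype mentioned above: it uses a cumulative-sum trick rather than a backwards-moving window, fills the first window-1 positions with NaN, and returns 0 for a window that contains only NaNs (mirroring np.nansum). Those edge-case choices are assumptions, not something settled in the post.

```python
import numpy as np


def mov_nansum(arr, window, axis=-1):
    """Moving-window sum along `axis`, treating NaN as zero.

    Illustrative sketch only; edge-case behaviour is an assumption.
    """
    a = np.array(arr, dtype=float)      # copy, as the post suggests
    a = np.moveaxis(a, axis, -1)        # work along the last axis
    n = a.shape[-1]
    if not 1 <= window <= n:
        raise ValueError("window must be between 1 and the axis length")

    filled = np.where(np.isnan(a), 0.0, a)
    csum = np.cumsum(filled, axis=-1)

    out = np.full(a.shape, np.nan)      # first window-1 outputs stay NaN
    out[..., window - 1:] = csum[..., window - 1:]
    out[..., window:] -= csum[..., :-window]
    return np.moveaxis(out, -1, axis)


print(mov_nansum([1.0, 2.0, np.nan, 4.0], window=2))
```

A minimum-count option (return NaN unless a window holds enough non-NaN values) could be added by running the same cumulative sum over the boolean mask of valid elements.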
On Mon, Nov 22, 2010 at 10:35 AM, Keith Goodman <[hidden email]> wrote:
> This thread started on the numpy list:
> http://mail.scipy.org/pipermail/numpy-discussion/2010-November/053958.html
>
> I think we should narrow the focus of the package by only including
> functions that operate on numpy arrays. That would cut out date
> utilities, label indexing utilities, and binary operations with
> various join methods on the labels. It would leave us with three
> categories: faster versions of numpy/scipy nan functions, moving
> window statistics, and group functions.
>
> I suggest we add a fourth category: normalization.
>
> FASTER NUMPY/SCIPY NAN FUNCTIONS
>
> This work is already underway: http://github.com/kwgoodman/nanny
>
> The function signatures for these are easy: we copy numpy, scipy. (I
> am tempted to change nanstd from scipy's bias=False to ddof=0.)

scipy.stats.nanstd is supposed to switch to ddof, so don't copy inconsistent signatures that are supposed to be deprecated.

I would like statistics (scipy.stats and statsmodels) to stick with default axis=0.

I would be in favor of axis=None for nan extended versions of numpy functions and axis=0 for stats functions as defaults, but since it will be a standalone package with wider usage, I will be able to keep track of axis=-1.

Josef

>
> I'd like to use a partial sort for nanmedian. Anyone interested in coding that?
>
> dtype: int32, int64, float64 for now
> ndim: 1, 2, 3 (need some recursive magic for nd > 3; that's an open
> project for anyone)
>
> MOVING WINDOW STATISTICS
>
> I already have doc strings and unit tests
> (https://github.com/kwgoodman/la/blob/master/la/farray/mov.py). And I
> have a cython prototype that moves the window backwards so that the
> stats can be filled in place. (This assumes we make a copy of the data
> at the top of the function: arr = arr.astype(float))
>
> Proposed function signature: mov_sum(arr, window, axis=-1),
> mov_nansum(arr, window, axis=-1)
>
> If you don't like mov, then: move? roll?
>
> I think requesting a minimum number of non-nan elements in a window or
> else returning NaN is clever. But I do like the simple signature
> above.
>
> Binary moving window functions: mov_nancorr(arr1, arr2, window, axis=-1), etc.
>
> Optional: moving window bootstrap estimate of error (std) of the
> moving statistic. So, what's the std of each estimate in the
> mov_median output? Too specialized?
>
> dtype: float64
> ndim: 1, 2, 3, recursive for nd > 0
>
> NORMALIZATION
>
> I already have nd versions of ranking, zscore, quantile, demean,
> demedian, etc in larry. We should rename to nanzscore etc.
>
> ranking and quantile could use some cython love.
>
> I don't know, should we cut this category?
>
> GROUP FUNCTIONS
>
> Input: array, sequence of labels such as a list, axis.
>
> For an array of shape (n,m), axis=0, and a list of n labels with d
> distinct values, group_nanmean would return a (d,m) array. I'd also
> like a groupfilter_nanmean which would return a (n,m) array and would
> have an additional, optional input: exclude_self=False.
>
> NAME
>
> What should we call the package?
>
> Numa, numerical analysis with numpy arrays
> Dana, data analysis with numpy arrays
>
> import dana as da (da=data analysis)
>
> ARE YOU CRAZY?
>
> If you read this far, you are crazy and would be a good fit for this project.
On Mon, Nov 22, 2010 at 11:52 PM, <[hidden email]> wrote:
I added a patch for nanstd to make this switch to http://projects.scipy.org/scipy/ticket/1200 just yesterday. Unfortunately this cannot be done in a backwards-compatible way. So it would be helpful to deprecate the current signature in 0.9.0 if this change is to be made.

Ralf
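To make the bias/ddof distinction concrete, here is a small sketch using np.nanstd, which takes a ddof argument in current NumPy. It assumes, as I read the old scipy.stats.nanstd signature, that bias=False meant the unbiased estimate (normalizing by N-1, i.e. ddof=1) and bias=True meant normalizing by N (ddof=0). The array is made up for illustration.

```python
import numpy as np

a = np.array([1.0, 2.0, np.nan, 4.0])

# ddof=0: divide by N (the number of non-NaN values); the "biased" estimate.
std_population = np.nanstd(a, ddof=0)

# ddof=1: divide by N - 1; the "unbiased" estimate.
std_sample = np.nanstd(a, ddof=1)

print(std_population, std_sample)
```

Because the two spellings also imply different default normalizations, switching from bias=False to ddof=0 changes results for callers who rely on the default, which is why the change is not backwards compatible.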
In reply to this post by josef.pktd
On Mon, Nov 22, 2010 at 7:52 AM, <[hidden email]> wrote:
> On Mon, Nov 22, 2010 at 10:35 AM, Keith Goodman <[hidden email]> wrote:
>> The function signatures for these are easy: we copy numpy, scipy. (I
>> am tempted to change nanstd from scipy's bias=False to ddof=0.)
>
> scipy.stats.nanstd is supposed to switch to ddof, so don't copy
> inconsistent signatures that are supposed to be deprecated.

Great, I'll use ddof then.

> I would like statistics (scipy.stats and statsmodels) to stick with
> default axis=0.

I put my dates on axis=-1. It is much faster:

>> a = np.random.rand(1000,1000)
>> timeit a.sum(0)
100 loops, best of 3: 9.01 ms per loop
>> timeit a.sum(1)
1000 loops, best of 3: 1.17 ms per loop
>> timeit a.std(0)
10 loops, best of 3: 27.2 ms per loop
>> timeit a.std(1)
100 loops, best of 3: 11.5 ms per loop

But I'd like the default axis to be what a numpy user would expect it to be.

> I would be in favor of axis=None for nan extended versions of numpy
> functions and axis=0 for stats functions as defaults, but since it
> will be a standalone package with wider usage, I will be able to keep
> track of axis=-1.

What default axis would a numpy/scipy user expect for mov_sum? group_mean?
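The axis timing comparison above can be reproduced outside IPython with the standard timeit module. This is a rough sketch; the helper name seconds_per_call is made up here, and the numbers depend heavily on the machine and NumPy version, so the gap may be smaller (or different) than the 2010 figures quoted in the message.

```python
import timeit

import numpy as np


def seconds_per_call(func, number=20):
    """Average wall-clock seconds per call of a zero-argument callable."""
    return timeit.timeit(func, number=number) / number


a = np.random.rand(1000, 1000)  # C-contiguous: the last axis is contiguous in memory

for name in ("sum", "std"):
    reduce = getattr(a, name)
    for axis in (0, 1):
        t = seconds_per_call(lambda: reduce(axis))
        print(f"a.{name}({axis}): {t * 1e3:.2f} ms per call")
```

The usual explanation for any difference is memory layout: for a C-ordered array, reducing along the last axis walks through contiguous memory.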
In reply to this post by josef.pktd
On Mon, Nov 22, 2010 at 7:52 AM, <[hidden email]> wrote:
> I would like statistics (scipy.stats and statsmodels) to stick with
> default axis=0.
> I would be in favor of axis=None for nan extended versions of numpy
> functions and axis=0 for stats functions as defaults, but since it
> will be a standalone package with wider usage, I will be able to keep
> track of axis=-1.

Please let's keep everything using the same default -- it doesn't actually make life simpler if for every function I have to squint and try to remember whether or not it's a "stats function". (Like, what's "mean"?)

I think the world already has a sufficient supply of arbitrarily inconsistent scientific APIs.

-- Nathaniel
In reply to this post by Keith Goodman
On Mon, Nov 22, 2010 at 9:35 AM, Keith Goodman <[hidden email]> wrote:
> This thread started on the numpy list:
> http://mail.scipy.org/pipermail/numpy-discussion/2010-November/053958.html
>
> I think we should narrow the focus of the package by only including
> functions that operate on numpy arrays.

That might be overly restrictive. What about fast incremental code that is not array based (i.e. it is real time streaming rather than a post hoc computation on arrays)? E.g., a cython ringbuffer with support for nan, percentiles, min, max, mean, std, median, etc. Eric Firing wrote a ringbuf class that provides this functionality that is very useful, and this package seems like a perfect place to host something like that.

JDH
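For readers unfamiliar with the idea, a streaming ring buffer with NaN-aware statistics might look roughly like the sketch below. This is a hypothetical pure-NumPy illustration, not Eric Firing's ringbuf class and not cython; the class and method names are invented for this example. Unfilled slots are stored as NaN so the NaN-aware reductions skip them automatically.

```python
import numpy as np


class RingBuffer:
    """Fixed-size buffer that overwrites its oldest value (illustrative only)."""

    def __init__(self, size):
        self._data = np.full(size, np.nan)  # NaN marks "no value yet"
        self._next = 0                      # index of the slot to overwrite next

    def push(self, value):
        self._data[self._next] = value
        self._next = (self._next + 1) % self._data.size

    def mean(self):
        return np.nanmean(self._data)

    def max(self):
        return np.nanmax(self._data)

    def median(self):
        return np.nanmedian(self._data)


buf = RingBuffer(3)
for x in (1.0, 2.0, np.nan, 4.0):
    buf.push(x)
print(buf.mean(), buf.max())
```

This version recomputes each statistic over the whole buffer on every call; a truly incremental implementation would update running sums on each push instead.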
In reply to this post by Nathaniel Smith
On Mon, Nov 22, 2010 at 8:14 AM, Nathaniel Smith <[hidden email]> wrote:
> On Mon, Nov 22, 2010 at 7:52 AM, <[hidden email]> wrote:
>> I would like statistics (scipy.stats and statsmodels) to stick with
>> default axis=0.
>> I would be in favor of axis=None for nan extended versions of numpy
>> functions and axis=0 for stats functions as defaults, but since it
>> will be a standalone package with wider usage, I will be able to keep
>> track of axis=-1.
>
> Please let's keep everything using the same default -- it doesn't
> actually make life simpler if for every function I have to squint and
> try to remember whether or not it's a "stats function". (Like, what's
> "mean"?)
>
> I think the world already has a sufficient supply of arbitrarily
> inconsistent scientific APIs.

nanstd, nanmean, etc use axis=None for the default. What would axis=None mean for a moving window sum?
In reply to this post by John Hunter-4
On Mon, Nov 22, 2010 at 8:16 AM, John Hunter <[hidden email]> wrote:
> On Mon, Nov 22, 2010 at 9:35 AM, Keith Goodman <[hidden email]> wrote:
>> This thread started on the numpy list:
>> http://mail.scipy.org/pipermail/numpy-discussion/2010-November/053958.html
>>
>> I think we should narrow the focus of the package by only including
>> functions that operate on numpy arrays.
>
> That might be overly restrictive. What about fast incremental code
> that is not array based (i.e. it is real time streaming rather than a
> post hoc computation on arrays)? E.g., a cython ringbuffer with support
> for nan, percentiles, min, max, mean, std, median, etc. Eric
> Firing wrote a ringbuf class that provides this functionality that is
> very useful, and this package seems like a perfect place to host
> something like that.

That's a new idea to me. My first reaction is that it belongs in a separate package for streaming data. Large packages get tough to maintain and to use. What do others think?
In reply to this post by Keith Goodman
Keith Goodman <kwgoodman <at> gmail.com> writes:
> NAME
>
> What should we call the package?
>
> Numa, numerical analysis with numpy arrays
> Dana, data analysis with numpy arrays
>
> import dana as da (da=data analysis)
>
> ARE YOU CRAZY?
>
> If you read this far, you are crazy and would be a good fit for this project.

Sounds like a useful toolbox. As it's focused on calculating various statistics on arrays in the presence of NaNs I would find nanstats an informative (if boring) name.

-Dave
In reply to this post by Keith Goodman
On Mon, Nov 22, 2010 at 10:35 AM, Keith Goodman <[hidden email]> wrote:
> This thread started on the numpy list:
> http://mail.scipy.org/pipermail/numpy-discussion/2010-November/053958.html
>
> I think we should narrow the focus of the package by only including
> functions that operate on numpy arrays. That would cut out date
> utilities, label indexing utilities, and binary operations with
> various join methods on the labels. It would leave us with three
> categories: faster versions of numpy/scipy nan functions, moving
> window statistics, and group functions.

Returning to the integer question: it would be nice to have nan handling for integer arrays with a user defined nan, e.g. -9999. That would allow faster operations or avoid having to use floats.

Josef
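One way to picture the user-defined integer "nan" is a mean that skips a sentinel value. The sketch below is an assumption-laden illustration: the function name sentinel_mean and its missing parameter are hypothetical, and it does the masking in plain NumPy rather than in the fast compiled loop the message has in mind.

```python
import numpy as np


def sentinel_mean(arr, missing=-9999, axis=None):
    """Mean of an integer array, ignoring entries equal to `missing`.

    Hypothetical helper for illustration; returns NaN where no valid
    entries exist along the reduced axis.
    """
    arr = np.asarray(arr)
    valid = arr != missing
    total = np.where(valid, arr, 0).sum(axis=axis, dtype=np.int64)
    count = valid.sum(axis=axis)
    with np.errstate(invalid="ignore", divide="ignore"):
        return np.true_divide(total, count)  # 0/0 becomes NaN


print(sentinel_mean([1, -9999, 3]))
print(sentinel_mean([[1, -9999], [3, -9999]], axis=0))
```

The sums stay in integer arithmetic until the final division, which is the part of the idea that avoids converting the whole input to floats.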
In reply to this post by dhirschfeld
On Mon, Nov 22, 2010 at 8:52 AM, Dave Hirschfeld <[hidden email]> wrote:
> Keith Goodman <kwgoodman <at> gmail.com> writes:
>>
>> NAME
>>
>> What should we call the package?
>>
>> Numa, numerical analysis with numpy arrays
>> Dana, data analysis with numpy arrays
>>
>> import dana as da (da=data analysis)
>>
>> ARE YOU CRAZY?
>>
>> If you read this far, you are crazy and would be a good fit for this project.
>>
>
> Sounds like a useful toolbox. As it's focused on calculating various statistics
> on arrays in the presence of NaNs I would find nanstats an informative (if
> boring) name.

I like the idea of narrowing the focus to NaNs. Then maybe we could drop the nan prefix from the function names. So std instead of nanstd. How about Nancy (NAN + CYthon)? But nanstats is more descriptive.
In reply to this post by Keith Goodman
On Mon, Nov 22, 2010 at 8:22 AM, Keith Goodman <[hidden email]> wrote:
> On Mon, Nov 22, 2010 at 8:14 AM, Nathaniel Smith <[hidden email]> wrote:
>> On Mon, Nov 22, 2010 at 7:52 AM, <[hidden email]> wrote:
>>> I would like statistics (scipy.stats and statsmodels) to stick with
>>> default axis=0.
>>> I would be in favor of axis=None for nan extended versions of numpy
>>> functions and axis=0 for stats functions as defaults, but since it
>>> will be a standalone package with wider usage, I will be able to keep
>>> track of axis=-1.
>>
>> Please let's keep everything using the same default -- it doesn't
>> actually make life simpler if for every function I have to squint and
>> try to remember whether or not it's a "stats function". (Like, what's
>> "mean"?)
>>
>> I think the world already has a sufficient supply of arbitrarily
>> inconsistent scientific APIs.
>
> nanstd, nanmean, etc use axis=None for the default.

Great -- I understood Josef as arguing that they shouldn't.

> What would
> axis=None mean for a moving window sum?

Well, the same as mov_sum(arr.ravel()), I suppose. Probably not very useful for multidimensional arrays, but I'm not sure there's a better default.

-- Nathaniel
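The "axis=None means ravel" reading can be written down directly. The following is a sketch of one possible mov_sum, not the package's implementation: it relies on numpy.lib.stride_tricks.sliding_window_view (available in NumPy 1.20 and later), does no NaN handling, and pads the first window-1 positions with NaN, all of which are assumptions made for the example.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def mov_sum(arr, window, axis=None):
    """Moving-window sum; axis=None flattens the input first (sketch only)."""
    arr = np.asarray(arr, dtype=float)
    if axis is None:
        arr = arr.ravel()   # the interpretation suggested in the thread
        axis = 0
    # The window dimension is appended as the last axis of the view.
    sums = sliding_window_view(arr, window, axis=axis).sum(axis=-1)
    pad = [(0, 0)] * arr.ndim
    pad[axis] = (window - 1, 0)  # NaN where the window is incomplete
    return np.pad(sums, pad, mode="constant", constant_values=np.nan)


x = np.array([[1.0, 2.0], [3.0, 4.0]])
print(mov_sum(x, 2))          # same as mov_sum(x.ravel(), 2, axis=0)
print(mov_sum(x, 2, axis=0))
```

With axis=None the window runs across row boundaries of the flattened array, which is the behaviour described above as probably not very useful for multidimensional input.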
In reply to this post by Keith Goodman
On Mon, Nov 22, 2010 at 12:10 PM, Keith Goodman <[hidden email]> wrote:
> On Mon, Nov 22, 2010 at 8:52 AM, Dave Hirschfeld
> <[hidden email]> wrote:
>> Keith Goodman <kwgoodman <at> gmail.com> writes:
>>>
>>> NAME
>>>
>>> What should we call the package?
>>>
>>> Numa, numerical analysis with numpy arrays
>>> Dana, data analysis with numpy arrays
>>>
>>> import dana as da (da=data analysis)
>>>
>>> ARE YOU CRAZY?
>>>
>>> If you read this far, you are crazy and would be a good fit for this project.
>>>
>>
>> Sounds like a useful toolbox. As it's focused on calculating various statistics
>> on arrays in the presence of NaNs I would find nanstats an informative (if
>> boring) name.
>
> I like the idea of narrowing the focus to NaNs. Then maybe we could
> drop the nan prefix from the function names. So std instead of nanstd.
> How about Nancy (NAN + CYthon)?

The devs could be known as nancy-boys. (sorry, I couldn't help myself.)
In reply to this post by Nathaniel Smith
On Mon, Nov 22, 2010 at 9:28 AM, Nathaniel Smith <[hidden email]> wrote:
> On Mon, Nov 22, 2010 at 8:22 AM, Keith Goodman <[hidden email]> wrote:
>> What would axis=None mean for a moving window sum?
>
> Well, the same as mov_sum(arr.ravel()), I suppose. Probably not very
> useful for multidimensional arrays, but I'm not sure there's a better
> default.

I guess the choices for the default axis for moving statistics are 0, -1, None. I'd throw out None and then pick either 0 or -1.

For group_mean I think axis=0 makes more sense. Wes and Josef prefer axis=0, I think. I'm fine with that but would like to hear more opinions.
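As a reference point for the group function discussion, here is a minimal sketch of group_nanmean as described in the original post: an (n, m) array with n labels containing d distinct values reduces to a (d, m) array when axis=0. The implementation details are assumptions: groups are ordered by np.unique, the distinct labels are returned alongside the result, and an all-NaN group yields NaN.

```python
import numpy as np


def group_nanmean(arr, labels, axis=0):
    """NaN-aware mean of `arr` within each label group along `axis` (sketch)."""
    arr = np.asarray(arr, dtype=float)
    labels = np.asarray(labels)
    if labels.ndim != 1 or labels.shape[0] != arr.shape[axis]:
        raise ValueError("need one label per element along `axis`")

    uniq = np.unique(labels)          # sorted distinct labels
    moved = np.moveaxis(arr, axis, 0)
    out = np.empty((len(uniq),) + moved.shape[1:])
    for i, lab in enumerate(uniq):
        block = moved[labels == lab]
        valid = ~np.isnan(block)
        total = np.where(valid, block, 0.0).sum(axis=0)
        count = valid.sum(axis=0)
        with np.errstate(invalid="ignore", divide="ignore"):
            out[i] = total / count    # 0/0 gives NaN for all-NaN groups
    return np.moveaxis(out, 0, axis), uniq


data = np.array([[1.0, 2.0], [3.0, np.nan], [5.0, 6.0]])
means, groups = group_nanmean(data, ["a", "b", "a"], axis=0)
print(groups)
print(means)
```

A groupfilter_nanmean variant would broadcast each group's mean back to the original (n, m) shape instead of collapsing to (d, m).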
In reply to this post by Nathaniel Smith
On Mon, Nov 22, 2010 at 12:28 PM, Nathaniel Smith <[hidden email]> wrote:
> On Mon, Nov 22, 2010 at 8:22 AM, Keith Goodman <[hidden email]> wrote:
>> On Mon, Nov 22, 2010 at 8:14 AM, Nathaniel Smith <[hidden email]> wrote:
>>> On Mon, Nov 22, 2010 at 7:52 AM, <[hidden email]> wrote:
>>>> I would like statistics (scipy.stats and statsmodels) to stick with
>>>> default axis=0.
>>>> I would be in favor of axis=None for nan extended versions of numpy
>>>> functions and axis=0 for stats functions as defaults, but since it
>>>> will be a standalone package with wider usage, I will be able to keep
>>>> track of axis=-1.
>>>
>>> Please let's keep everything using the same default -- it doesn't
>>> actually make life simpler if for every function I have to squint and
>>> try to remember whether or not it's a "stats function". (Like, what's
>>> "mean"?)
>>>
>>> I think the world already has a sufficient supply of arbitrarily
>>> inconsistent scientific APIs.
>>
>> nanstd, nanmean, etc use axis=None for the default.
>
> Great -- I understood Josef as arguing that they shouldn't.

I think nanmean, nanvar, nanstd, nanmax should belong in numpy and follow numpy convention.

But when I import scipy.stats, I expect axis=0 as default, especially for statistical tests, and similar, where I usually assume we have observations in rows and variables in columns as in structured arrays or record arrays.

np.cov, np.corrcoef usually throw me off, and I am surprised if it prints a 1000x1000 array instead of 4x4. I have a hard time remembering rowvar=1. I would prefer axis=0 or axis=1 for correlations and covariances.

So it's mainly a question about the default when axis=None doesn't make much sense.

Josef

>
>> What would
>> axis=None mean for a moving window sum?
>
> Well, the same as mov_sum(arr.ravel()), I suppose. Probably not very
> useful for multidimensional arrays, but I'm not sure there's a better
> default.
>
> -- Nathaniel
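The np.cov surprise mentioned above is easy to reproduce. In this sketch the data are random and the 1000 x 4 shape is chosen to match the example in the message: with the default rowvar=True each row is treated as a variable, so observations-in-rows data produce a 1000 x 1000 matrix unless rowvar=False is passed.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((1000, 4))  # 1000 observations (rows), 4 variables (columns)

cov_default = np.cov(x)                  # rows treated as variables
cov_columns = np.cov(x, rowvar=False)    # columns treated as variables
corr_columns = np.corrcoef(x, rowvar=False)

print(cov_default.shape, cov_columns.shape, corr_columns.shape)
```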
In reply to this post by Keith Goodman
On 11/22/2010 12:47 PM, Keith Goodman wrote:
> For group_mean I think axis=0 makes more sense. Wes and Josef prefer
> axis=0, I think. I'm fine with that but would like to hear more
> opinions.

I'd prefer the following.

1. Whenever the operation can sensibly be applied to a 1d array, make the default: axis=None.

2. If the operation cannot sensibly be applied to a 1d array, provide no default. (I.e., force axis specification.)

In other words: remove guessing by the user.

Alan Isaac
On Mon, Nov 22, 2010 at 1:32 PM, Alan G Isaac <[hidden email]> wrote:
> On 11/22/2010 12:47 PM, Keith Goodman wrote:
>> For group_mean I think axis=0 makes more sense. Wes and Josef prefer
>> axis=0, I think. I'm fine with that but would like to hear more
>> opinions.
>
> I'd prefer the following.
>
> 1. Whenever the operation can sensibly be applied to a 1d array,
> make the default: axis=None.
>
> 2. If the operation cannot sensibly be applied to a 1d array,
> provide no default. (I.e., force axis specification.)
>
> In other words: remove guessing by the user.

I like it. Cleaner.
In reply to this post by Keith Goodman
On Mon, Nov 22, 2010 at 7:35 AM, Keith Goodman <[hidden email]> wrote:
> This thread started on the numpy list:
> http://mail.scipy.org/pipermail/numpy-discussion/2010-November/053958.html

Based on the feedback I got on the scipy and numpy lists, I expanded the focus of the Nanny project from A to B, where

A = Faster, drop-in replacement of the NaN functions in Numpy and Scipy
B = Fast, NaN-aware descriptive statistics of NumPy arrays

I also renamed the project from Nanny to dsna (descriptive statistics of numpy arrays) and dropped the nan prefix from all function names (the package is simpler if all functions are NaN aware). A description of the project can be found in the readme file here:

http://github.com/kwgoodman/dsna
On Tue, Nov 23, 2010 at 8:23 PM, Keith Goodman <[hidden email]> wrote:
> On Mon, Nov 22, 2010 at 7:35 AM, Keith Goodman <[hidden email]> wrote:
>> This thread started on the numpy list:
>> http://mail.scipy.org/pipermail/numpy-discussion/2010-November/053958.html
>
> Based on the feedback I got on the scipy and numpy lists, I expanded
> the focus of the Nanny project from A to B, where
>
> A = Faster, drop-in replacement of the NaN functions in Numpy and Scipy
> B = Fast, NaN-aware descriptive statistics of NumPy arrays
>
> I also renamed the project from Nanny to dsna (descriptive statistics
> of numpy arrays) and dropped the nan prefix from all function names
> (the package is simpler if all functions are NaN aware). A description
> of the project can be found in the readme file here:
>
> http://github.com/kwgoodman/dsna

Nanny did have the advantage of being "catchy" and easy to remember. There is no chance of remembering a four-letter ("random") sequence. If you want to change the name, I suggest including the idea of speed/cython or so -- wasn't that the original idea?

- Sebastian Haase
On Tue, Nov 23, 2010 at 1:09 PM, Sebastian Haase <[hidden email]> wrote:
> On Tue, Nov 23, 2010 at 8:23 PM, Keith Goodman <[hidden email]> wrote:
>> On Mon, Nov 22, 2010 at 7:35 AM, Keith Goodman <[hidden email]> wrote:
>>> This thread started on the numpy list:
>>> http://mail.scipy.org/pipermail/numpy-discussion/2010-November/053958.html
>>
>> Based on the feedback I got on the scipy and numpy lists, I expanded
>> the focus of the Nanny project from A to B, where
>>
>> A = Faster, drop-in replacement of the NaN functions in Numpy and Scipy
>> B = Fast, NaN-aware descriptive statistics of NumPy arrays
>>
>> I also renamed the project from Nanny to dsna (descriptive statistics
>> of numpy arrays) and dropped the nan prefix from all function names
>> (the package is simpler if all functions are NaN aware). A description
>> of the project can be found in the readme file here:
>>
>> http://github.com/kwgoodman/dsna
>
> Nanny did have the advantage of being "catchy" - and easy to remember... !
> no chance of remembering a 4 ("random") letter sequence....
> If you want to change the name, I suggest including the idea of
> speed/cython/.. or so -- wasn't that the original idea ....

"disnay" maybe?