I'm finding the scipy.stats documentation somewhat difficult to
follow, so maybe the answer to this question is in there... I can't
really find it, though.

What I have is a sequence of numbers X_i. Two things I'd like to be
able to do with this:

1. Create a discrete probability distribution (class rv_discrete) from
this data so as to use the utility functions that take rv_discrete
objects. The rv_discrete documentation suggests this should be easy. I
did the following:

>>> ddist = rv_discrete(values=(x, [1/len(x) for i in x]), name='test')
>>> ddist.pmf(50)
array(0.0)

Any value I try to get out of the pmf seems to be 0. Do I have to
explicitly subclass rv_discrete with my data and a _pmf method or
something? This seems like a very natural thing to want to do, and
hence it seems odd not to have some helper like
make_dist(x, name='whatever'). I can take a shot at creating such a
function, but I don't want to do so if one already exists.

2. Create a continuous probability distribution from something like
spline fitting or simple linear interpolation of the data in X_i. Does
this require explicit subclassing, or is there a straightforward
built-in way to do it? I'm not sure if this step is strictly necessary
- what I really want is to be able to draw from the discrete
distribution in 1 just by sampling the cdf... maybe this is how it's
supposed to work with the discrete distribution, but when I tried to
sample it using ddist.rvs, I would always get the input values I
specified rather than random values sampled from the cdf.

I'm on scipy 0.6.0 and numpy 1.0.4

_______________________________________________
SciPy-user mailing list
[hidden email]
http://projects.scipy.org/mailman/listinfo/scipy-user
Hi Erik
2008/4/28 Erik Tollerud <[hidden email]>:
> I'm finding the scipy.stats documentation somewhat difficult to
> follow, so maybe the answer to this question is in there... I can't
> really find it, though.
>
> What I have is a sequence of numbers X_i. Two things I'd like to be
> able to do with this:
> 1. Create a discrete probability distribution (class rv_discrete) from
> this data so as to use the utility functions that take rv_discrete
> objects.
> The rv_discrete documentation suggests this should be easy. I did the
> following:
>
> >>> ddist = rv_discrete(values=(x, [1/len(x) for i in x]), name='test')
> >>> ddist.pmf(50)
> array(0.0)

That should be 1.0/len(x) - under Python 2, 1/len(x) is integer
division and truncates to 0, so all the probabilities are 0.

Cheers
Stéfan
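As a minimal sketch of Stéfan's fix, with made-up data standing in for
the poster's X_i (on Python 3 the division is floating-point either
way; the float literal matters on Python 2):

```python
from scipy.stats import rv_discrete

# Hypothetical data standing in for the poster's X_i.
x = [10, 20, 30, 50]

# 1.0/len(x) forces floating-point division; with Python 2 integer
# division every weight would be 0 and the pmf would vanish everywhere.
probs = [1.0 / len(x) for _ in x]
ddist = rv_discrete(name='test', values=(x, probs))

print(ddist.pmf(50))   # 0.25 -- nonzero at a support point
print(ddist.pmf(15))   # 0.0  -- the pmf is zero off the support
```

Note that the pmf of such a distribution is, by construction, zero
everywhere except exactly at the support points - a point that comes up
again later in the thread.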
rv_discrete: most (if not all) of the scipy.stats functions +
numpy.random cannot handle zeros as _inputs_ (don't know whether this
is related to getting zeros _out_, but it might be). The zero problem
is, I am told, due to the underlying C code, not Python. A quick
workaround is to substitute any zeros with a small number like 1e-16.

On Sun, 2008-04-27 at 16:29 -0700, Erik Tollerud wrote:
> I'm finding the scipy.stats documentation somewhat difficult to
> follow, so maybe the answer to this question is in there... I can't
> really find it, though.
> What I have is a sequence of numbers X_i. Two things I'd like to be
> able to do with this:
> 1. Create a discrete probability distribution (class rv_discrete) from
> this data so as to use the utility functions that take rv_discrete
> objects.
> The rv_discrete documentation suggests this should be easy. I did the
> following:
> >>> ddist = rv_discrete(values=(x, [1/len(x) for i in x]), name='test')
> >>> ddist.pmf(50)
> array(0.0)

There are at least 2 ways of using rv_discrete, e.g. 2 ways to
calculate the next element of a simple Markov chain with
x(n+1) = Norm(0.5 x(n), 1):

from scipy import stats
from numpy.random import multinomial

x = 3
n1 = stats.rv_continuous.rvs(stats.norm, 0.5 * x, 1.0)[0]
print n1

n2 = stats.rv_discrete.rvs(
    stats.rv_discrete(name='sample',
                      values=([0, 1, 2], [3/10., 5/10., 2/10.])))[0]
print n2

sample = stats.rv_discrete(name='sample',
                           values=([0, 1, 2], [3/10., 5/10., 2/10.])
                           ).rvs(size=10)
print sample

The multinomial distribution from numpy.random is somewhat faster (40
times or so) but has a different idiom:

SIZE = 100000
VALUES = [0, 1, 2, 3, 4, 5, 6, 7]
PROBS = [1/8., 1/8., 1/8., 1/8., 1/8., 1/8., 1/8., 1/8.]
The idiom for rv_discrete is

rv_discrete(name='sample', values=(VALUES, PROBS))

The idiom for numpy.random's multinomial is different; if memory
serves, you get frequencies as output instead of the actual values:

multinomial(SIZE, PROBS)

>>> from numpy.random import multinomial
>>> multinomial(100, [0.2, 0.4, 0.1, 0.3])
array([12, 44, 10, 34])
>>> multinomial(100, [0.2, 0.0, 0.8, 0.0])    # <-- don't do this
...
>>> multinomial(100, [0.2, 1e-16, 0.8, 1e-16])    # <-- or this
>>> multinomial(100, [0.2-1e-16, 1e-16, 0.8-1e-16, 1e-16])    # <-- ok
array([21, 0, 79, 0])

The last one is ok since the probabilities add up to 1... painful, but
it works.

> Any value I try to get out of the pmf seems to be 0. Do I have to
> explicitly subclass rv_discrete with my data and a _pmf method or
> something? This seems like a very natural thing to want to do, and
> hence it seems odd not to have some helper like
> make_dist(x, name='whatever'). I can take a shot at creating such a
> function, but I don't want to do so if one exists.
>
> 2. Create a continuous probability distribution from something like
> spline fitting or simple linear interpolation of the data in X_i.
> Does this require explicit subclassing, or is there a straightforward
> way to do it that's built in? I'm not sure if this step is strictly
> necessary - what I really want is to be able to draw from the
> discrete distribution in 1 just by sampling the cdf... maybe this is
> how it's supposed to work with the discrete distribution, but when I
> tried to sample it using ddist.rvs, I would always get the input
> values I specified rather than random values sampled from the cdf.
Continuous vs discrete: I found this in ./stats/scstats.py

from scipy import stats, r_
from pylab import show, plot

# SOURCE: ./stats/scstats.py

SPREAD = 10

class cdf(object):
    """Base class for the task of determining a sequence of numbers
    {vi} which is distributed as a random variable X.
    """
    def integerDensityFunction(self):
        """Outputs an integer density function: xs (ints) and ys
        (probabilities), which are the correspondence between the whole
        numbers on the x axis and the probabilities on the y axis,
        according to a normal distribution.
        """
        opt = []
        for i in r_[-SPREAD:SPREAD:100j]:        # 2-tailed test (?)
            opt.append((i, stats.norm.cdf(i)))   # (int, P(int))
        return zip(*opt)                         # [(int...), (P...)]

    def display(self):
        xs, ys = self.integerDensityFunction()
        plot(xs, ys)
        show()

if __name__ == '__main__':
    d = cdf()
    d.display()

Continuous: I can only suggest using rv_continuous

stats.rv_continuous.rvs(stats.norm, 0.5 * x, 1.0)

.rvs(shape, loc, scale) gives the random variates
.pdf(x, shape, loc, scale) is the probability density function which,
I think, is or should be genuinely continuous

> I'm on scipy 0.6.0 and numpy 1.0.4
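To illustrate the idiom difference described above - multinomial
returns one frequency per entry of PROBS rather than the draws
themselves - here is a small sketch; the variable values are made up,
and the expansion via np.repeat is one way (not the only way) to turn
counts back into samples:

```python
import numpy as np

VALUES = np.array([0, 1, 2, 3])
PROBS = [0.2, 0.4, 0.1, 0.3]
SIZE = 100000

# multinomial gives a count for each entry of PROBS, not the draws.
counts = np.random.multinomial(SIZE, PROBS)

# The counts can be expanded back into an array of sampled values;
# np.repeat emits them in blocks, so shuffle if order matters.
samples = np.repeat(VALUES, counts)
np.random.shuffle(samples)
print(samples[:10])
```

This is essentially why the multinomial route is fast: a single C-level
call produces all the frequencies at once.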
On Mon, Apr 28, 2008 at 12:29 AM, Stéfan van der Walt <[hidden email]> wrote:
> That should be 1.0/len(x), otherwise all the probabilities are 0.
>
> Cheers
> Stéfan

I had

>>> from __future__ import division

above when I actually tested this, so the probabilities aren't zero -
but you're right that they would be if this were run as-is. I figured
the pmf part out, though, based on Michael's examples: the pmf SHOULD
be zero everywhere other than exactly at one of the values... when I
generate the cdf, it has non-zero values.

On Mon, Apr 28, 2008 at 2:02 AM, Michael <[hidden email]> wrote:
> There are at least 2 ways of using rv_discrete
>
> e.g. 2 ways to calculate the next element of a simple Markov chain
> with x(n+1) = Norm(0.5 x(n), 1)
> [...]
> >>> multinomial( 100, [0.2-1e-16, 1e-16, 0.8-1e-16, 1e-16] ) <-- ok
> array([21, 0, 79, 0])
>
> the last one is ok since the probability adds up to 1...
> painful, but it works

As explained above, I think I got the discrete distribution to work
(spurred on by your simpler example) - but what's the utility of using
the multinomial idiom over the rv_discrete syntax? Is it faster?

> Continuous v's discrete: i found this in ./stats/scstats.py
> [...]

I don't really understand what you're suggesting, but in any case I
don't see a stats/scstats.py file in the scipy directory (at least in
0.6)...

> Continuous: i can only suggest using rv_continuous
>
> stats.rv_continuous.rvs( stats.norm, 0.5*x, 1.0 ).whatever
>
> .rvs( shape, loc, scale ) is the random variates
> .pdf( x, shape, loc, scale ) is the probability density function
> which, i think, is or should be genuinely continuous

My interpretation of this is that it is using the normal distribution -
I want a distribution that is a smoothed/interpolated version of the
discrete distribution I generated above. I take this to mean there's no
built-in utility to do this, so I just have to make my own - this seems
like a useful thing for data analysis, though, so I may submit it later
to be added to SVN.
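For what it's worth, the "draw from the discrete distribution by
sampling the cdf" idea from the original question can be done by hand
with numpy's searchsorted. A sketch with illustrative data (the values
and probabilities are made up):

```python
import numpy as np

values = np.array([10, 20, 30, 50])          # support of the distribution
probs = np.array([0.25, 0.25, 0.25, 0.25])   # matching probabilities
cdf = np.cumsum(probs)                       # [0.25, 0.5, 0.75, 1.0]

# Inverse-cdf sampling: draw uniforms on [0, 1) and find where each
# one lands in the cumulative distribution.
u = np.random.random(1000)
samples = values[np.searchsorted(cdf, u)]
print(samples[:10])
```

This returns random draws from the support (not the input sequence
itself), which is the behaviour the original poster was after.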
2008/5/2 Erik Tollerud <[hidden email]>:
> My interpretation of this is that it is using the normal distribution
> - I want a distribution that is a smoothed/interpolated version of the
> discrete distribution I generated above. I take this to mean there's
> no built-in utility to do this, so I just have to make my own - this
> seems like a useful thing for data analysis, though, so I may submit
> it later to be added to SVN.

Well, this is tricky. You need to decide what it is you want from your
distribution: the *way* in which you smooth your distribution function
can make a tremendous difference to the results you get. If you're
doing statistics, coping with this can be a real challenge.

There is one family of techniques, called kernel density estimators,
for "smoothing" distributions in a controlled and statistically
well-understood fashion. They are implemented in scipy as well. They
are used specifically for reconstructing a distribution when you have
a collection of samples drawn from it; the resulting pdf is a sum of
Gaussians, one centered at each sample. The width is automatically
chosen based on the samples.

You can also, of course, construct an arbitrary pdf as a function, and
then use numerical integration on it. (If you use splines it can be
integrated even more efficiently, though I don't know that you can
ensure that they are everywhere positive; I'd be inclined to
interpolate the log instead, but then you lose easy integration.)

I don't know whether scipy.stats allows you to construct a
distribution object, with all its standard methods, from a given pdf,
but this would be a very useful feature.

Anne
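The kernel density estimators Anne describes are available as
scipy.stats.gaussian_kde. A minimal sketch of reconstructing a smooth
pdf from samples (the data here is made up for illustration):

```python
import numpy as np
from scipy.stats import gaussian_kde

# Made-up samples standing in for the poster's X_i.
samples = np.random.normal(loc=5.0, scale=2.0, size=500)

# gaussian_kde places a Gaussian at each sample and chooses the
# bandwidth automatically from the data.
kde = gaussian_kde(samples)

xs = np.linspace(samples.min() - 1, samples.max() + 1, 200)
density = kde(xs)            # smooth pdf evaluated on a grid
draws = kde.resample(100)    # new draws from the smoothed distribution
print(density.max(), draws.shape)
```

The resample method gives exactly what the original question asked for:
new random values drawn from a smoothed version of the empirical
distribution rather than the input values themselves.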