scipy.stats rv objects from data

classic Classic list List threaded Threaded
5 messages Options
Reply | Threaded
Open this post in threaded view
|

scipy.stats rv objects from data

Erik Tollerud-2
I'm finding the scipy.stats documentation somewhat difficult to
follow, so maybe the answer to this question is in there... I can't
really find it, though.

What I have is a sequence of numbers X_i . Two things I'd like to be
able to do with this:
1. Create a discrete probability distribution (class rv_discrete) from
this data so as to use the utility functions that take rv_discrete
objects.
The rv_discrete documentation suggests should be easy.  I did the following
>>>ddist=rv_discrete(values=(x,[1/len(x) for i in x]),name='test')
>>>ddist.pmf(50)
array(0.0)

Any value I try to get of the pmf seems to be 0.  Do I have to
explicitly subclass rv_discrete with my data and a _pmf method or
something? This seems like a very natural thing to want to do, and
hence it seems odd to not have some helper like
make_dist(x,name='whatever') .  I can take a shot at creating such a
function, but I don't want to do so if one exists.

2. Create a continuous probability distribution from something like
spline fitting or simple linear interpolation of a the data in X_i.
Does this require explict subclassing, or is there a straightforward
way to do it that's builtin?  I'm not sure if this step is strictly
necessary - what I really want to do is be able to draw from the
discrete distribution in 1 just by sampling the cdf... maybe this is
how it's supposed to work with the discrete distribution, but when I
tried to sample it using ddist.rvs, I would always get the input
values I specified rather random values sampled from the cdf.

I'm on scipy 0.6.0 and numpy 1.0.4
_______________________________________________
SciPy-user mailing list
[hidden email]
http://projects.scipy.org/mailman/listinfo/scipy-user
Reply | Threaded
Open this post in threaded view
|

Re: scipy.stats rv objects from data

Stéfan van der Walt
Hi Erik

2008/4/28 Erik Tollerud <[hidden email]>:

> I'm finding the scipy.stats documentation somewhat difficult to
>  follow, so maybe the answer to this question is in there... I can't
>  really find it, though.
>
>  What I have is a sequence of numbers X_i . Two things I'd like to be
>  able to do with this:
>  1. Create a discrete probability distribution (class rv_discrete) from
>  this data so as to use the utility functions that take rv_discrete
>  objects.
>  The rv_discrete documentation suggests should be easy.  I did the following
>  >>>ddist=rv_discrete(values=(x,[1/len(x) for i in x]),name='test')
>  >>>ddist.pmf(50)
>  array(0.0)

That should be 1.0/len(x), otherwise all the probabilities are 0.

Cheers
Stéfan
_______________________________________________
SciPy-user mailing list
[hidden email]
http://projects.scipy.org/mailman/listinfo/scipy-user
Reply | Threaded
Open this post in threaded view
|

Re: scipy.stats rv objects from data

Michael-285
In reply to this post by Erik Tollerud-2
rv_discrete: most (if not all) of the scipy.stats functions +
numpy.random cannot handle zeros as _inputs_ (don't know whether this is
related to getting zero's _out_, but it might be). The zero problem is,
I am told, due to the underlying c code, not python. A quick workaround
is to substitute any zeros for a small number like 1e-16

On Sun, 2008-04-27 at 16:29 -0700, Erik Tollerud wrote:
> I'm finding the scipy.stats documentation somewhat difficult to
> follow, so maybe the answer to this question is in there... I can't
> really find it, though.

> What I have is a sequence of numbers X_i . Two things I'd like to be
> able to do with this:
> 1. Create a discrete probability distribution (class rv_discrete) from
> this data so as to use the utility functions that take rv_discrete
> objects.
> The rv_discrete documentation suggests should be easy.  I did the following
> >>>ddist=rv_discrete(values=(x,[1/len(x) for i in x]),name='test')
> >>>ddist.pmf(50)
> array(0.0)

There are at least 2 ways of using rv_discrete

e.g. 2 ways to calculate the next element of a simple Markov Chain with
x(n+1)=Norm(0.5 x(n),1)

from scipy.stats import rv_discrete
from numpy.random import multinomial

    x = 3
    n1 = stats.rv_continuous.rvs( stats.norm, 0.5*x, 1.0 )[0]
    print n1
    n2 = stats.rv_discrete.rvs( stats.rv_discrete( name='sample',
values=([0,1,2],[3/10.,5/10.,2/10.])), 0.5*x, 1.0 )[0]
    print n2
    print
    sample = stats.rv_discrete( name='sample',
values=([0,1,2],[3/10.,5/10.,2/10.]) ).rvs( size=10 )
    print sample

The multinomial distribution from numpy.random is somewhat faster (40
times or so) but has a different idiom:

SIZE = 100000
VALUES = [0,1,2,3,4,5,6,7]
PROBS = [1/8.,1/8.,1/8.,1/8.,1/8.,1/8.,1/8.,1/8.]

The idiom for rv_discrete is
        rv_discrete( name='sample', values=(VALUES,PROBS) )

The idiom for numpy.multinomial is different; if memory serves, you get
frequencies as output instead of the actual values
        multinomial( SIZE, PROBS )

>>> from numpy.random import multinomial
>>> multinomial(100,[ 0.2, 0.4, 0.1, 0.3 ])
array([12, 44, 10, 34])
>>> multinomial( 100, [0.2, 0.0, 0.8, 0.0] ) <-- don't do this
...
>>> multinomial( 100, [0.2, 1e-16, 0.8, 1e-16] ) <-- or this
>>> multinomial( 100, [0.2-1e-16, 1e-16, 0.8-1e-16, 1e-16] ) <-- ok
array([21,  0, 79,  0])

the last one is ok since the probability adds up to 1... painful, but it
works

> Any value I try to get of the pmf seems to be 0.  Do I have to
> explicitly subclass rv_discrete with my data and a _pmf method or
> something? This seems like a very natural thing to want to do, and
> hence it seems odd to not have some helper like
> make_dist(x,name='whatever') .  I can take a shot at creating such a
> function, but I don't want to do so if one exists.
>
> 2. Create a continuous probability distribution from something like
> spline fitting or simple linear interpolation of a the data in X_i.
> Does this require explict subclassing, or is there a straightforward
> way to do it that's builtin?  I'm not sure if this step is strictly
> necessary - what I really want to do is be able to draw from the
> discrete distribution in 1 just by sampling the cdf... maybe this is
> how it's supposed to work with the discrete distribution, but when I
> tried to sample it using ddist.rvs, I would always get the input
> values I specified rather random values sampled from the cdf.

Continuous v's discrete: i found this in ./stats/scstats.py

from scipy import stats, r_
from pylab import show, plot
import copy

# SOURCE: ./stats/scstats.py

SPREAD = 10

class cdf( object ):
    """ Baseclass for the task of determining a sequence of numbers {vi}
which is distributed as a random variable X
    """
    def integerDensityFunction( self ):
    """
  Outputs an integer density function: xs (ints) and ys (probabilities)
which are the correspondence between the whole numbers on the x axis
to the probabilities on the y axis, according to a normal distribution.
        """
    opt = []
    for i in r_[-SPREAD:SPREAD:100j]: # 2-tailed test (?)
            opt.append(( i, stats.norm.cdf(i) )) # ( int, P(int) )
    return zip(*opt) # [ (int...), (P...) ]

    def display( self ):
    xs, ys = self.integerDensityFunction()  
        plot( xs, ys )
        show()  

if __name__=='__main__':
    d = cdf()
    d.display()

Continuous: i can only suggest using rv_continuous

        stats.rv_continuous.rvs( stats.norm, 0.5*x, 1.0 ).whatever

        .rvs( shape, loc, scale ) is the random variates
        .pdf( x, shape, loc, scale ) is the probability density function which,
i think, is or should be genuinely continuous

> I'm on scipy 0.6.0 and numpy 1.0.4
> _______________________________________________
> SciPy-user mailing list
> [hidden email]
> http://projects.scipy.org/mailman/listinfo/scipy-user

_______________________________________________
SciPy-user mailing list
[hidden email]
http://projects.scipy.org/mailman/listinfo/scipy-user
Reply | Threaded
Open this post in threaded view
|

Re: scipy.stats rv objects from data

Erik Tollerud-2
On Mon, Apr 28, 2008 at 12:29 AM, Stéfan van der Walt <[hidden email]> wrote:
>  That should be 1.0/len(x), otherwise all the probabilities are 0.
>
>  Cheers
>  Stéfan

I had  >>>from __future__ import division above when I actually tested
this, so they aren't zero (although you're right they would be
otherwise), but you're right that they would be if this was run as-is.
 I figured the pmf part out, though based on Michael's examples - the
pmf SHOULD be zero everywhere other then exactly at one of the
values... when I generate the cdf, it has non-zero values.

On Mon, Apr 28, 2008 at 2:02 AM, Michael <[hidden email]> wrote:

>  There are at least 2 ways of using rv_discrete
>
>  e.g. 2 ways to calculate the next element of a simple Markov Chain with
>  x(n+1)=Norm(0.5 x(n),1)
>
>  from scipy.stats import rv_discrete
>  from numpy.random import multinomial
>
>     x = 3
>     n1 = stats.rv_continuous.rvs( stats.norm, 0.5*x, 1.0 )[0]
>     print n1
>     n2 = stats.rv_discrete.rvs( stats.rv_discrete( name='sample',
>  values=([0,1,2],[3/10.,5/10.,2/10.])), 0.5*x, 1.0 )[0]
>     print n2
>     print
>     sample = stats.rv_discrete( name='sample',
>  values=([0,1,2],[3/10.,5/10.,2/10.]) ).rvs( size=10 )
>     print sample
>
>  The multinomial distribution from numpy.random is somewhat faster (40
>  times or so) but has a different idiom:
>
>  SIZE = 100000
>  VALUES = [0,1,2,3,4,5,6,7]
>  PROBS = [1/8.,1/8.,1/8.,1/8.,1/8.,1/8.,1/8.,1/8.]
>
>  The idiom for rv_discrete is
>         rv_discrete( name='sample', values=(VALUES,PROBS) )
>
>  The idiom for numpy.multinomial is different; if memory serves, you get
>  frequencies as output instead of the actual values
>         multinomial( SIZE, PROBS )
>
>  >>> from numpy.random import multinomial
>  >>> multinomial(100,[ 0.2, 0.4, 0.1, 0.3 ])
>  array([12, 44, 10, 34])
>  >>> multinomial( 100, [0.2, 0.0, 0.8, 0.0] ) <-- don't do this
>  ...
>  >>> multinomial( 100, [0.2, 1e-16, 0.8, 1e-16] ) <-- or this
>  >>> multinomial( 100, [0.2-1e-16, 1e-16, 0.8-1e-16, 1e-16] ) <-- ok
>  array([21,  0, 79,  0])
>
>  the last one is ok since the probability adds up to 1... painful, but it
>  works

As explained above, I think I got the discrete to work (spurred on by
your simpler example) - but what's the utility of using the
multinomial idiom over the rv_discrete syntax? Is it faster?.

>  Continuous v's discrete: i found this in ./stats/scstats.py
>
>  from scipy import stats, r_
>  from pylab import show, plot
>  import copy
>
>  # SOURCE: ./stats/scstats.py
>
>  SPREAD = 10
>
>  class cdf( object ):
>     """ Baseclass for the task of determining a sequence of numbers {vi}
>  which is distributed as a random variable X
>     """
>     def integerDensityFunction( self ):
>         """
>         Outputs an integer density function: xs (ints) and ys (probabilities)
>  which are the           correspondence between the whole numbers on the x axis
>  to the probabilities on the y axis, according to a normal distribution.
>         """
>         opt = []
>         for i in r_[-SPREAD:SPREAD:100j]: # 2-tailed test (?)
>             opt.append(( i, stats.norm.cdf(i) )) # ( int, P(int) )
>         return zip(*opt)                        # [ (int...), (P...) ]
>
>     def display( self ):
>         xs, ys = self.integerDensityFunction()
>         plot( xs, ys )
>         show()
>
>  if __name__=='__main__':
>     d = cdf()
>     d.display()

I don't really understand what you're suggesting, but anyway, I don't
see a stats/scstats.py file in the scipy directory (at least in
0.6)...

>  Continuous: i can only suggest using rv_continuous
>
>         stats.rv_continuous.rvs( stats.norm, 0.5*x, 1.0 ).whatever
>
>         .rvs( shape, loc, scale ) is the random variates
>         .pdf( x, shape, loc, scale ) is the probability density function which,
>  i think, is or should be genuinely continuous

My interpretation of this is that it is using the normal distribution
- I want a distribution that is a smoothed/interpolated version of the
discrete distribution I generated above.  I take this to mean there's
no built-in utility to do this, so I just have to make my own - this
seems like a useful thing for data analysis, though, so I may submit
it later to be added to SVN.
_______________________________________________
SciPy-user mailing list
[hidden email]
http://projects.scipy.org/mailman/listinfo/scipy-user
Reply | Threaded
Open this post in threaded view
|

Re: scipy.stats rv objects from data

Anne Archibald
2008/5/2 Erik Tollerud <[hidden email]>:

>  My interpretation of this is that it is using the normal distribution
>  - I want a distribution that is a smoothed/interpolated version of the
>  discrete distribution I generated above.  I take this to mean there's
>  no built-in utility to do this, so I just have to make my own - this
>  seems like a useful thing for data analysis, though, so I may submit
>  it later to be added to SVN.

Well, this is tricky. You need to decide what it is you want from your
distribution: the *way* in which you smooth your distribution function
can make a tremendous difference to the results you get. If you're
doing statistics, coping with this can be a real challenge.

There is one family of techniques, called kernel density estimators,
for "smoothing" distributions in a controlled and statistically
well-understood fashion. They are implemented in scipy as well. They
are used specifically for reconstructing a distribution when you have
a collection of samples drawn from it; the resulting pdf is a sum of
Gaussians, one centered at each sample. The width is automatically
chosen based on the samples.

You can also, of course, consruct an arbitrary pdf as a function, and
then use numerical integration on it. (If you use splines it can be
integrated even more efficiently, though I don't know that you can
ensure that they are everywhere positive; I'd be inclined to
interpolate the log instead, but then you lose easy integration.) I
don't know whether scipy.stats allows you to construct a distribution
object, with all its standard methods, from a given pdf, but this
would be a vary useful feature.

Anne
_______________________________________________
SciPy-user mailing list
[hidden email]
http://projects.scipy.org/mailman/listinfo/scipy-user