[SciPy-User] fitting discrete probability distributions to data


[SciPy-User] fitting discrete probability distributions to data

c
hi,

i have some data:

A) a 1d array (dimensions 1x50), made by summing the columns of a 2d array (dimensions ~20k x 50). 

B) a 1D array that is just a particular row of that 2d array

i need to fit a sum of 2 negative binomial distributions to A), and to fit a single negative binomial distrib. to B).

i have spent a while now reading the documentation for scipy.stats and the statsmodels package and various stack overflow posts, etc., but i do not yet understand how to go about fitting a discrete probability distribution to a vector of data.

specific subquestions:

- do i need to load data in as a pandas df? an ndarray? does it not matter?

- i understand endog and exog in the context of the examples given in the docs (where you have one column that you want to use to predict some other column) but not what they should be in the case where i basically am trying to fit a curve to the normalized histogram of my data

- if someone can explain how to fit with statsmodels' "NegativeBinomial" (http://statsmodels.sourceforge.net/devel/generated/statsmodels.discrete.discrete_model.NegativeBinomial.html#statsmodels.discrete.discrete_model.NegativeBinomial) that would be a good start. but i do also need to know how to fit to a sum of two of these, or possibly a sum of two other discrete distributions

- is the patsy formula syntax relevant here? i have never used R and could not find an example of the "R-like" syntax that is similar enough to my use case to parse how it works

- honestly i don't know what i'm doing, please help!

if these questions reveal grave ignorance, or are not directly relevant enough to scipy for this mailing list, i apologize and thanks for bearing with me. i barely know how to flip a coin, this stuff is new to me.

thanks a lot
c

_______________________________________________
SciPy-User mailing list
[hidden email]
http://mail.scipy.org/mailman/listinfo/scipy-user
Re: fitting discrete probability distributions to data

josef.pktd


On Wed, Mar 11, 2015 at 7:14 PM, c <[hidden email]> wrote:
hi,

i have some data:

A) a 1d array (dimensions 1x50), made by summing the columns of a 2d array (dimensions ~20k x 50). 

B) a 1D array that is just a particular row of that 2d array

i need to fit a sum of 2 negative binomial distributions to A), and to fit a single negative binomial distrib. to B).

i have spent a while now reading the documentation for scipy.stats and the statsmodels package and various stack overflow posts, etc., but i do not yet understand how to go about fitting a discrete probability distribution to a vector of data.

Do you have the data in the form of histograms (counts) or the original data?

statsmodels can only estimate based on the original data, which is assumed to consist of observations drawn from a Negative Binomial distribution. Fitting histograms and fitting mixtures of distributions are not supported "out of the box", and would require some custom models.

If you just want to fit a distribution to a histogram or discrete counts, then using curve_fit or leastsq is one possibility.

Josef
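
A minimal sketch of the curve_fit suggestion above, under some assumptions: the data here is synthetic (not the poster's), the pmf uses the (n, p) parameterization of scipy.stats.nbinom, and the starting values come from a method-of-moments guess, which only works when the sample variance exceeds the sample mean. The names nbinom_pmf, values and freqs are illustrative only.

```python
import numpy as np
from scipy import stats
from scipy.optimize import curve_fit

# Synthetic stand-in for the real observations (an assumption for illustration).
rng = np.random.default_rng(0)
data = rng.negative_binomial(n=5, p=0.3, size=5000)

# Normalized histogram: relative frequency of each integer 0..max(data).
values = np.arange(data.max() + 1)
freqs = np.bincount(data) / data.size

def nbinom_pmf(k, n, p):
    """Negative binomial pmf in the scipy.stats.nbinom parameterization."""
    return stats.nbinom.pmf(k, n, p)

# Method-of-moments starting values (requires variance > mean).
m, v = data.mean(), data.var()
start = [m * m / (v - m), m / v]

# Bounds keep n > 0 and 0 < p < 1 during the least-squares search.
(n_hat, p_hat), _ = curve_fit(
    nbinom_pmf, values, freqs,
    p0=start,
    bounds=([1e-6, 1e-6], [np.inf, 1 - 1e-6]),
)
print("n =", n_hat, "p =", p_hat)
```

Least squares on the histogram treats every bin equally, so it is not the same as a maximum likelihood fit to the original observations; it is just the quickest way to get a curve through the normalized counts.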

 


Re: fitting discrete probability distributions to data

c
yup, i have the original data


Re: fitting discrete probability distributions to data

josef.pktd
(please comment inline or post at the bottom in scipy related mailing lists)

On Wed, Mar 11, 2015 at 7:49 PM, c <[hidden email]> wrote:
yup, i have the original data

To estimate a single Negative Binomial, you can use the statsmodels NegativeBinomial model and regress on a constant.
endog is your negative binomial data, and exog is a column of ones: exog = np.ones(len(data))

and then (with import numpy as np and import statsmodels.api as sm)
result = sm.NegativeBinomial(data, exog).fit()

result.params has the estimated parameters, but they are in the mean-dispersion parameterization used for regression, not in the "standard" parameterization of a Negative Binomial distribution.

There is somewhere (!?) a helper function to transform the params into the standard form as used, for example, by scipy.stats.nbinom.
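
Until that helper turns up, the conversion can be sketched by hand. This assumes the default NB2 form, where a constant-only fit returns params = [const, alpha], the mean is exp(const), and the variance is mean + alpha * mean**2. The function name nb2_to_scipy is made up for this example and is not part of statsmodels or scipy.

```python
import numpy as np
from scipy import stats

def nb2_to_scipy(const, alpha):
    """Map NB2 regression params (log-mean intercept, dispersion alpha)
    to the (n, p) arguments of scipy.stats.nbinom.

    Hypothetical helper, assuming variance = mu + alpha * mu**2.
    """
    mu = np.exp(const)            # mean of the distribution
    n = 1.0 / alpha               # size parameter (may be non-integer)
    p = 1.0 / (1.0 + alpha * mu)  # success probability
    return n, p

# Example values standing in for result.params of a constant-only fit.
n, p = nb2_to_scipy(const=np.log(4.0), alpha=0.5)
dist = stats.nbinom(n, p)

# The frozen distribution should reproduce mean mu and variance mu + alpha * mu**2.
print(dist.mean(), dist.var())
```

With a fitted model this would be called as nb2_to_scipy(result.params[0], result.params[1]), again assuming that parameter order.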

estimating the mixture of two or more NegativeBinomial distributions takes a bit of work.
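
One possible way to do that work, which the thread itself does not spell out, is to write the mixture log-likelihood directly and hand it to scipy.optimize.minimize. The sketch below is only that: the data is synthetic, the starting values are arbitrary guesses, and the exp/expit transforms are one choice among several for keeping the parameters in range. A real fit would need several starting points and a check for convergence.

```python
import numpy as np
from scipy import stats
from scipy.optimize import minimize
from scipy.special import expit

# Synthetic two-component data (an assumption for illustration).
rng = np.random.default_rng(1)
data = np.concatenate([
    rng.negative_binomial(n=3, p=0.5, size=1500),   # component with mean 3
    rng.negative_binomial(n=10, p=0.2, size=1500),  # component with mean 40
])

def neg_loglike(theta, k):
    """Negative log-likelihood of a two-component negative binomial mixture."""
    # Unconstrained parameters are mapped to valid ranges:
    # exp() keeps n > 0, expit() keeps p and the weight w inside (0, 1).
    n1, n2 = np.exp(theta[0]), np.exp(theta[1])
    p1, p2, w = expit(theta[2]), expit(theta[3]), expit(theta[4])
    pmf = w * stats.nbinom.pmf(k, n1, p1) + (1 - w) * stats.nbinom.pmf(k, n2, p2)
    # The tiny constant guards against log(0) for very unlikely observations.
    return -np.sum(np.log(pmf + 1e-300))

# Arbitrary starting guess on the unconstrained scale.
theta0 = np.array([np.log(2.0), np.log(5.0), 0.0, -1.0, 0.0])
res = minimize(neg_loglike, theta0, args=(data,), method="Nelder-Mead",
               options={"maxiter": 5000, "maxfev": 5000})

# Transform the solution back to the scipy.stats.nbinom parameterization.
n1_hat, n2_hat = np.exp(res.x[0]), np.exp(res.x[1])
p1_hat, p2_hat, w_hat = expit(res.x[2]), expit(res.x[3]), expit(res.x[4])
print("component 1:", n1_hat, p1_hat, "weight", w_hat)
print("component 2:", n2_hat, p2_hat, "weight", 1 - w_hat)
```

Note that this fits a mixture (each observation comes from one of two distributions), which is how the reply reads the question; if the data is literally a sum of two negative binomial variables, the likelihood would be a convolution instead.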

Josef



 
