[SciPy-User] Robust fitting of an exponential distribution subpopulation

Antonino Ingargiola
Hi to the list,

I'm seeking the advice of the scientific Python community to solve the following fitting problem. Suggestions on both the methodology and particular software packages are appreciated.

I often need to fit a sample containing a (dominant) exponentially distributed sub-population. The non-exponential samples (from an unknown distribution) mostly lie close to the origin of the exponential distribution, so the simple approach I have used so far is to select all the samples above a threshold and fit the exponential "tail" with MLE.
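
For readers following along, the threshold-plus-MLE approach can be sketched as below. This is only an illustration, not the poster's code: the helper name `fit_exp_tail`, the synthetic data, and the threshold value are all hypothetical. It relies on the memoryless property of the exponential distribution: the excess over any threshold is again exponential with the same scale, so the MLE of the scale is simply the mean excess.

```python
import numpy as np

def fit_exp_tail(x, threshold):
    """MLE of the exponential scale from the samples above `threshold`.

    Hypothetical helper: returns the scale estimate and its asymptotic
    standard error (scale / sqrt(n_tail)).
    """
    x = np.asarray(x, dtype=float)
    tail = x[x > threshold]
    if tail.size == 0:
        raise ValueError("no samples above the threshold")
    # Memoryless property: the excess over the threshold is exponential
    # with the same scale, so the MLE is the mean excess.
    excess = tail - threshold
    scale = excess.mean()
    stderr = scale / np.sqrt(tail.size)
    return scale, stderr

# Synthetic example (assumed): exponential with scale 2 plus contamination near 0.
rng = np.random.default_rng(0)
x = np.concatenate([rng.exponential(2.0, 3000), rng.uniform(0.0, 0.3, 600)])
scale, stderr = fit_exp_tail(x, threshold=0.5)
print(f"scale = {scale:.3f} (standard error {stderr:.3f})")
```

Note that the estimate is a plain mean of the tail, so a handful of extreme values on the right can still pull it upward.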

The problem is that the choice of the threshold is somewhat arbitrary and, moreover, there can be a small set of outliers on the extreme right side of the distribution that would bias the MLE fit.

To improve the accuracy, I'm thinking of using (and if necessary implementing) some kind of robust fitting procedure. For example, a scheme in which the outliers are identified by putting a threshold on the residuals, and this threshold is then optimized using some "goodness of fit" cost function. Is this approach reasonable?

I am surely not the first to tackle this problem, so I would appreciate suggestions and specific pointers to help me get started.

Thank you,
Antonio

_______________________________________________
SciPy-User mailing list
[hidden email]
http://mail.scipy.org/mailman/listinfo/scipy-user
Re: Robust fitting of an exponential distribution subpopulation

kgulliks
Antonio,

The statsmodels package has a robust linear model module that I have used before. You will have to transform your data to be linear first by taking the log of the y-axis.



Kevin Gullikson

Re: Robust fitting of an exponential distribution subpopulation

Antonino Ingargiola
Hi Kevin,

If I apply the log transform to the sample to linearize the model, what is the correct way to weight the residuals? Without weighting, residuals close to the tail will be amplified and bias the fit.

Antonio

Re: Robust fitting of an exponential distribution subpopulation

josef.pktd


On Wed, Mar 11, 2015 at 7:36 PM, Antonino Ingargiola <[hidden email]> wrote:
Hi Kevin,

If I apply the log transform to the sample to linearize the model, what is the correct way to weight the residuals? Without weighting, residuals close to the tail will be amplified and bias the fit.

In RLM (the robust linear model), the weights are chosen automatically to downweight extreme residuals. The weighting scheme depends on the "norm", which defines the shape of the objective and of the weight function.

RLM produces an unbiased estimator of the mean or mean function for symmetric distributions and is calibrated for the normal distribution. I don't know how well this is approximated by the log of an exponentially distributed variable, but it won't exactly satisfy the assumptions.
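
To make the automatic downweighting concrete, here is a hand-rolled sketch of iteratively reweighted least squares with Huber weights, written with NumPy only. It is an illustration of the idea, not the statsmodels implementation: the function name `huber_irls`, the fixed iteration count, and the synthetic data are assumptions. The tuning constant 1.345 is the usual Huber choice calibrated for normally distributed errors.

```python
import numpy as np

def huber_irls(x, y, t=1.345, n_iter=50):
    """Fit y = b0 + b1 * x by iteratively reweighted least squares (Huber weights).

    Illustrative sketch only; returns the coefficients and the weights
    computed in the last iteration.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    X = np.column_stack([np.ones_like(x), x])
    beta = np.linalg.lstsq(X, y, rcond=None)[0]  # ordinary least squares start
    weights = np.ones_like(y)
    for _ in range(n_iter):
        resid = y - X @ beta
        # Robust scale: median absolute residual, normalized for the normal distribution.
        scale = np.median(np.abs(resid)) / 0.6745
        if scale == 0:
            break
        u = np.abs(resid) / scale
        # Huber weight function: 1 for small residuals, t / |u| for large ones.
        weights = np.where(u <= t, 1.0, t / np.maximum(u, t))
        sw = np.sqrt(weights)
        beta = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)[0]
    return beta, weights

# Synthetic straight line (assumed) with every tenth point turned into a gross outlier.
rng = np.random.default_rng(1)
x = np.linspace(0, 10, 100)
y = 1.0 + 2.0 * x + rng.normal(0, 0.5, x.size)
y[::10] += 30.0
beta, weights = huber_irls(x, y)
print("intercept, slope:", beta)
print("largest weight among the outliers:", weights[::10].max())
```

The weights depend only on the size of the scaled residual, which is why the symmetry and normal-calibration caveats above matter when the residuals come from log-transformed exponential data.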

There should be a more direct way of estimating the parameter of the exponential distribution in a robust way, but I never tried.
(One idea would be to estimate a trimmed mean and use the estimated distribution to correct for the trimming. The scipy.stats distributions have an `expect` method that can be used to calculate the mean of a trimmed distribution, i.e. conditional on lower and upper bounds.)
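
One possible reading of the trimmed-mean idea is sketched below, under stated assumptions: the trimming bounds `lb` and `ub` are fixed in data units, the helper name `trimmed_mean_scale` and the root-finding bracket are hypothetical, and the data are synthetic. The scale is chosen so that the mean of an exponential distribution conditional on lying in [lb, ub], computed with `expect`, matches the observed mean of the trimmed sample.

```python
import numpy as np
from scipy import optimize, stats

def trimmed_mean_scale(x, lb, ub):
    """Estimate the exponential scale from the mean of the samples in [lb, ub]."""
    x = np.asarray(x, dtype=float)
    kept = x[(x >= lb) & (x <= ub)]
    observed = kept.mean()

    def mismatch(scale):
        # Mean of an exponential variable conditional on lb <= X <= ub.
        expected = stats.expon.expect(
            lambda v: v, scale=scale, lb=lb, ub=ub, conditional=True
        )
        return expected - observed

    # Hypothetical bracket, expressed relative to the width of the window.
    # brentq raises ValueError if the observed mean is not compatible with it,
    # for example when it is at or above the midpoint of [lb, ub].
    width = ub - lb
    return optimize.brentq(mismatch, 0.05 * width, 50.0 * width)

# Synthetic example (assumed): contamination below lb, exponential scale 2.
rng = np.random.default_rng(2)
x = np.concatenate([rng.exponential(2.0, 5000), rng.uniform(0.0, 0.3, 800)])
scale_hat = trimmed_mean_scale(x, lb=0.5, ub=6.0)
print(f"estimated scale: {scale_hat:.3f}")
```

The bounds still have to be chosen, but the estimate is corrected for them, and the upper bound also removes the influence of extreme values on the right.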


What's your sample size?
(For very large sample sizes, one approach that is sometimes used is to fit a distribution to the central part of a histogram.)

Josef

 

Re: Robust fitting of an exponential distribution subpopulation

Mark Daoust
In reply to this post by Antonino Ingargiola
What about the EM algorithm? You could fit a mixture of an exponential and a [whatever] distribution, since you seem to already believe that that's what it is.

Isn't it almost what you already have in mind, just with soft thresholding?
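
As a rough sketch of what such a mixture fit could look like, the code below runs EM on an exponential component plus a contaminant. The thread leaves the contaminant unspecified, so the choice here is purely an assumption for illustration: a uniform distribution on [0, c] with c known. The function name `em_exp_uniform` and the synthetic data are hypothetical as well. The per-sample responsibilities play the role of the soft threshold.

```python
import numpy as np

def em_exp_uniform(x, c, n_iter=500):
    """EM for the mixture pi * Exponential(scale) + (1 - pi) * Uniform(0, c).

    Illustrative sketch: the uniform contaminant and its fixed width `c`
    are assumptions, not something the data tell us.
    """
    x = np.asarray(x, dtype=float)
    pi, scale = 0.5, x.mean()  # crude starting values
    # Contaminant density: 1/c on [0, c], zero elsewhere.
    f_unif = np.where((x >= 0) & (x <= c), 1.0 / c, 0.0)
    for _ in range(n_iter):
        # E-step: probability that each sample belongs to the exponential component.
        f_exp = np.exp(-x / scale) / scale
        resp = pi * f_exp / (pi * f_exp + (1.0 - pi) * f_unif)
        # M-step: weighted maximum-likelihood updates.
        pi = resp.mean()
        scale = np.sum(resp * x) / np.sum(resp)
    return pi, scale, resp

# Synthetic example (assumed): 3000 exponential samples, 600 uniform on [0, 0.5].
rng = np.random.default_rng(3)
x = np.concatenate([rng.exponential(2.0, 3000), rng.uniform(0.0, 0.5, 600)])
pi_hat, scale_hat, resp = em_exp_uniform(x, c=0.5)
print(f"exponential fraction = {pi_hat:.3f}, scale = {scale_hat:.3f}")
```

If the contaminant is modeled badly, the exponential parameters absorb the error, so this only helps when a plausible contaminant model is available.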






Mark Daoust

Re: Robust fitting of an exponential distribution subpopulation

Andrew Nelson
I'm not sure if this helps, but you can transform standard deviations as follows. If

f = a * np.log(b * A)

then

sigma_f = np.abs(a * sigma_A / A)
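
A sketch of how this propagation rule could be used to weight a log-linear fit of histogram counts follows. The assumptions are mine, not the poster's: A is taken to be the count in each histogram bin with Poisson noise (sigma_A = sqrt(A)), a = b = 1 in the formula above, and the helper name `fit_log_histogram` and the synthetic data are hypothetical.

```python
import numpy as np

def fit_log_histogram(x, bins=30, x_range=None):
    """Weighted straight-line fit to the log of histogram counts.

    Returns the exponential scale (-1 / slope) and the fitted intercept.
    """
    counts, edges = np.histogram(x, bins=bins, range=x_range)
    centers = 0.5 * (edges[:-1] + edges[1:])
    keep = counts > 0  # the log is undefined for empty bins
    A = counts[keep].astype(float)
    f = np.log(A)
    # Assumed Poisson counting noise: sigma_A = sqrt(A), hence sigma_f = sigma_A / A.
    sigma_f = np.abs(np.sqrt(A) / A)
    # np.polyfit takes weights of 1 / sigma for Gaussian uncertainties.
    slope, intercept = np.polyfit(centers[keep], f, 1, w=1.0 / sigma_f)
    return -1.0 / slope, intercept

# Synthetic example (assumed): pure exponential sample with scale 2.
rng = np.random.default_rng(4)
x = rng.exponential(2.0, 20000)
scale_hat, intercept = fit_log_histogram(x, bins=30, x_range=(0, 10))
print(f"estimated scale: {scale_hat:.3f}")
```

With these weights the sparsely populated tail bins, whose logarithms are the noisiest, contribute the least to the fit.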
 

--
_____________________________________
Dr. Andrew Nelson


_____________________________________

Re: Robust fitting of an exponential distribution subpopulation

Antonino Ingargiola
In reply to this post by josef.pktd
On Wed, Mar 11, 2015 at 5:15 PM, <[hidden email]> wrote:
 
RLM produces an unbiased estimator of the mean or mean function for symmetric distributions and is calibrated for the normal distribution. I don't know how well this is approximated by the log of an exponentially distributed variable, but it won't exactly satisfy the assumptions.

Yes, RLM will heavily weight the tail, which is almost 0 before the transform and becomes arbitrarily large negative values after it.

There should be a more direct way of estimating the parameter of the exponential distribution in a robust way, but I never tried.
(One idea would be to estimate a trimmed mean and use the estimated distribution to correct for the trimming. The scipy.stats distributions have an `expect` method that can be used to calculate the mean of a trimmed distribution, i.e. conditional on lower and upper bounds.)

Thanks, I wasn't aware of the `expect` method. However, I already tried trimming the distribution (on the left side), but the thresholds are arbitrary and I would like the fit not to depend on them.

What's your sample size?
(For very large sample sizes, one approach that is sometimes used is to fit a distribution to the central part of a histogram.)

The sample size ranges from a few hundred to a few thousand, not really huge. The problem is that I don't know whether there is a "robust" criterion for trimming the distribution, or what the accuracy of such a fit would be.
 
Antonio

Re: Robust fitting of an exponential distribution subpopulation

Antonino Ingargiola
In reply to this post by Mark Daoust
On Wed, Mar 11, 2015 at 5:15 PM, Mark Daoust <[hidden email]> wrote:
What about the EM algorithm? You could fit a mixture of an exponential and a [whatever] distribution, since you seem to already believe that that's what it is.

Yes, EM would be great, except that I don't know how to model the outliers :(. And although I'm always looking at the exponential tail, the distribution of the outliers can be completely different from sample to sample.

Isn't it almost what you already have in mind, just with soft thresholding?

Yes almost. If only I had a way to identify the "sweet" purely exponential part of the distribution in some robust way :).

Honestly, I was thinking of doing a curve fit of the empirical CDF, selecting the samples with residuals below a threshold, and re-performing the fit on the sub-population iteratively in some way. But there are a few important details I'm unsure of. Is using the ECDF a good approach when the sample is contaminated? Should I use a test of exponentiality (like Kolmogorov-Smirnov) for the sample selection?
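
One way the Kolmogorov-Smirnov idea could be tried is sketched below, as an assumption-laden illustration rather than a recommended procedure. It scans candidate left thresholds, fits the tail above each one by MLE, and keeps the smallest threshold whose tail is not rejected by `scipy.stats.kstest`. The function name `scan_thresholds`, the threshold grid, and the acceptance level `alpha` are hypothetical. Because the scale is estimated from the same data that are tested, the KS p-values are too optimistic, so `alpha` should be read as a tuning knob and not as a true significance level.

```python
import numpy as np
from scipy import stats

def scan_thresholds(x, thresholds, alpha=0.1):
    """Return (threshold, scale, ks_statistic, pvalue) for the smallest
    threshold whose tail passes a KS test against a fitted exponential."""
    x = np.asarray(x, dtype=float)
    results = []
    for thr in np.sort(np.asarray(thresholds, dtype=float)):
        excess = x[x > thr] - thr
        if excess.size < 2:
            continue
        scale = excess.mean()  # MLE of the exponential scale for this tail
        stat, pvalue = stats.kstest(excess, "expon", args=(0, scale))
        results.append((thr, scale, stat, pvalue))
    if not results:
        raise ValueError("no threshold leaves enough samples in the tail")
    accepted = [r for r in results if r[3] > alpha]
    # Fall back to the best-fitting threshold if none passes.
    return accepted[0] if accepted else max(results, key=lambda r: r[3])

# Synthetic example (assumed): contamination confined to [0, 0.3).
rng = np.random.default_rng(6)
x = np.concatenate([rng.exponential(2.0, 3000), rng.uniform(0.0, 0.3, 600)])
thr, scale_hat, stat, pvalue = scan_thresholds(x, np.linspace(0.0, 2.0, 21))
print(f"threshold = {thr:.2f}, scale = {scale_hat:.3f}, KS p-value = {pvalue:.3f}")
```

This scan only addresses the left threshold; outliers on the far right would still need an upper cut or a robust estimator.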

I could try to delve into these details but I don't want to reinvent the wheel or end up with a custom sub-optimal solution.

Antonio
