[SciPy-User] adding linear fitting routine

[SciPy-User] adding linear fitting routine

David Pine
I would like to get some feedback and generate some discussion about a least squares fitting routine I submitted last Friday [please see adding linear fitting routine (29 Nov 2013)].  I know that everybody is very busy, but some feedback would be helpful and, I hope, will eventually get this routine added to one of the basic numpy/scipy libraries.

David Pine

Re: adding linear fitting routine

Daniele Nicolodi
On 03/12/2013 11:19, David J Pine wrote:
> I would like to get some feedback and generate some discussion about a
> least squares fitting routine I submitted last Friday  [please
> see adding linear fitting routine
> <http://comments.gmane.org/gmane.comp.python.scientific.devel/18442> (29
> Nov 2013)].  I know that everybody is very busy, but it would be helpful
> to get some feedback and, I hope, eventually to get this routine added
> to one of the basic numpy/scipy libraries.


I think that adding a least squares fitting routine which correctly
handles uncertainties and computes the covariance matrix is a good
idea. I have wanted to do that myself for quite a while.

However, I think that a generalization to arbitrary degree polynomials
would be much more useful.  A linfit function may be added as a
convenience wrapper.  Actually it would be nice to have something that
works on arbitrary orthogonal bases, but it may be difficult to design a
general interface for such a thing.
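
[For what it's worth, a general-basis fit is straightforward to sketch on top of numpy.linalg.lstsq; the function below and its calling convention are hypothetical, a minimal illustration rather than a proposed interface:

    import numpy as np

    def basis_fit(x, y, basis_funcs, yerr=None):
        """Hypothetical sketch: weighted linear least squares on an
        arbitrary function basis.  The model is
        y = sum_j c_j * f_j(x) for the callables in basis_funcs."""
        A = np.column_stack([f(x) for f in basis_funcs])  # design matrix
        if yerr is not None:
            yerr = np.asarray(yerr)
            A = A / yerr[:, None]     # weight each row by 1/sigma_i
            y = y / yerr
        coeffs, *_ = np.linalg.lstsq(A, y, rcond=None)
        return coeffs

    # a straight line is just the basis {1, x}
    x = np.linspace(0.0, 1.0, 21)
    y = 2.0 + 3.0 * x
    print(basis_fit(x, y, [np.ones_like, lambda t: t]))  # ~[2.0, 3.0]
]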

Regarding your pull request, I don't really think that your code can be
much faster than the general purpose least squares fitting already in
scipy or numpy, modulo some bug somewhere.  You justify this by saying
that your solution is faster because it does not invert a matrix, but
that is exactly what you are doing, except that you do not write the
math in a matrix formalism.

Furthermore, I didn't have a very close look but I don't understand what
the `relsigma` parameter is supposed to do, and I would rename the
`sigmay` parameter `yerr`.

Cheers,
Daniele

Re: adding linear fitting routine

David Pine
Daniele,

Thank you for your feedback. Regarding the points you raise:

1. Generalization to arbitrary degree polynomials.  This already exists in numpy.polyfit.  One limitation of polyfit is that it does not currently allow the user to provide absolute uncertainties in the data, but there has been some discussion of adding this capability.

2. Generalization to arbitrary orthogonal bases. There currently exist in numpy fitting routines for various polynomial bases, including chebfit, legfit, lagfit, hermfit, and hermefit.  I am not aware of a fitting routine in numpy/scipy that works on arbitrary bases.

3. Speed. As far as I know there is no bug in any of the tested software.  The unit test test_linfit.py (https://github.com/djpine/linfit) times the various fitting routines.  You can run it yourself (and check the code--maybe you can spot some errors) but on my laptop I get the results printed at the end of this message.

linfit does not call a matrix inversion routine.  Instead it calculates the best fit slope and y-intercept directly.  By contrast, polyfit does call a matrix inversion routine (numpy.linalg.lstsq), which has a certain amount of overhead that linfit avoids.  This may be why polyfit is slower than linfit.
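
[For reference, the direct algebra for the straight-line case is the standard textbook closed form; a minimal sketch of the idea (not linfit's actual code):

    import numpy as np

    def line_fit_direct(x, y, sigmay=None):
        """Textbook closed-form weighted fit of y = a + b*x: solves the
        2x2 normal equations by hand, with weights w_i = 1/sigma_i**2,
        instead of calling a general matrix routine."""
        w = np.ones_like(y) if sigmay is None else 1.0 / np.asarray(sigmay) ** 2
        S, Sx, Sy = w.sum(), (w * x).sum(), (w * y).sum()
        Sxx, Sxy = (w * x * x).sum(), (w * x * y).sum()
        delta = S * Sxx - Sx ** 2
        slope = (S * Sxy - Sx * Sy) / delta
        intercept = (Sxx * Sy - Sx * Sxy) / delta
        return slope, intercept
]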

4. relsigma.  Other than using no weighting at all, there are basically two ways that people weight data in a least squares fit.
    (a) provide explicit absolute estimates of the errors (uncertainties) for each data point.  This is what physical scientists often do.  Setting relsigma=False tells linfit to use this method of weighting the data.  If the error estimates are accurate, then the covariance matrix provides estimates of the uncertainties in the fitting parameters (the slope & y-intercept).
    (b) provide relative estimates of the errors (uncertainties) for each data point (it is assumed that the absolute errors are not known but that the relative uncertainties between different data points are known).  This is what social scientists often do.  When only the relative uncertainties are known, the covariance matrix needs to be rescaled in order to obtain accurate estimates of the uncertainties in the fitting parameters.  Setting relsigma=True tells linfit to use this method of weighting the data.

5.  Renaming `sigmay` parameter `yerr`.  Either choice is fine with me but I used `sigmay` to be (mostly) consistent with scipy.optimize.curve_fit.
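
[As a usage sketch of the two weighting modes, based on the linfit signature quoted later in the thread (exactly what is returned depends on the optional flags):

    import numpy as np
    from linfit import linfit  # the package under discussion (github.com/djpine/linfit)

    x = np.array([0.0, 1.0, 2.0, 3.0])
    y = np.array([0.1, 2.1, 3.9, 6.1])
    sig = np.array([0.2, 0.2, 0.3, 0.3])

    # (a) absolute error bars: the covariance matrix is taken at face value
    fit_abs = linfit(x, y, sigmay=sig, relsigma=False, cov=True)

    # (b) only relative error bars known: the covariance matrix is
    # rescaled before being returned
    fit_rel = linfit(x, y, sigmay=sig, relsigma=True, cov=True)
]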

---------------------------------

Results of timing tests from test_linfit.py

test_linfit.py
....
Compare linfit to scipy.linalg.lstsq with relative individually weighted data points
     10 data points: linfit is faster than scipy.linalg.lstsq by 1.26 times
    100 data points: linfit is faster than scipy.linalg.lstsq by 2.33 times
   1000 data points: linfit is faster than scipy.linalg.lstsq by 12 times
  10000 data points: linfit is faster than scipy.linalg.lstsq by 31.8 times
.
Compare linfit to scipy.linalg.lstsq with unweighted data points
     10 data points: linfit is faster than scipy.linalg.lstsq by 2.4 times
    100 data points: linfit is faster than scipy.linalg.lstsq by 2.5 times
   1000 data points: linfit is faster than scipy.linalg.lstsq by 2.9 times
  10000 data points: linfit is faster than scipy.linalg.lstsq by 3.5 times
 100000 data points: linfit is faster than scipy.linalg.lstsq by 4.4 times
1000000 data points: linfit is faster than scipy.linalg.lstsq by 4.6 times
.
Compare linfit to scipy.stats.linregress with unweighted data points
     10 data points: linfit is faster than scipy.stats.linregress by 5.2 times
    100 data points: linfit is faster than scipy.stats.linregress by 5.1 times
   1000 data points: linfit is faster than scipy.stats.linregress by 4.7 times
  10000 data points: linfit is faster than scipy.stats.linregress by 2.9 times
 100000 data points: linfit is faster than scipy.stats.linregress by 1.8 times
1000000 data points: linfit is faster than scipy.stats.linregress by 1.1 times
.
Compare linfit to polyfit with relative individually weighted data points
     10 data points: linfit is faster than numpy.polyfit by 2.6 times
    100 data points: linfit is faster than numpy.polyfit by 2.5 times
   1000 data points: linfit is faster than numpy.polyfit by 4.4 times
  10000 data points: linfit is faster than numpy.polyfit by 3.1 times
 100000 data points: linfit is faster than numpy.polyfit by 3.5 times
1000000 data points: linfit is faster than numpy.polyfit by 1.9 times
.
Compare linfit to polyfit with unweighted data points
     10 data points: linfit is faster than numpy.polyfit by 3 times
    100 data points: linfit is faster than numpy.polyfit by 3.5 times
   1000 data points: linfit is faster than numpy.polyfit by 4.3 times
  10000 data points: linfit is faster than numpy.polyfit by 6 times
 100000 data points: linfit is faster than numpy.polyfit by 9.5 times
1000000 data points: linfit is faster than numpy.polyfit by 7.1 times
.....
----------------------------------------------------------------------





Re: adding linear fitting routine

Daniele Nicolodi
On 04/12/2013 13:43, David J Pine wrote:
> linfit does not call a matrix inversion routine.  Instead it calculates
> the best fit slope and y-intercept directly.  By contrast, polyfit does
> call a matrix inversion routine (numpy.linalg.lstsq), which has a
> certain amount of overhead that linfit avoids.  This may be why polyfit
> is slower than linfit.

A least squares fit is a matrix inversion. What you do is a matrix
inversion, except that the notation you use does not make this clear.
What you can discuss is the method you use for the inversion.  I would
have to have a closer look at the tests...
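
[In matrix notation, the point is that the weighted least squares solution is beta = (A^T W A)^{-1} A^T W y, with A = [1, x] and W = diag(1/sigma_i^2); the scalar algebra evaluates the same 2x2 inverse by hand.  A minimal numpy sketch (an illustration, not either implementation):

    import numpy as np

    def line_fit_matrix(x, y, sigmay):
        """Weighted straight-line fit via the normal equations."""
        A = np.column_stack([np.ones_like(x), x])
        w = 1.0 / np.asarray(sigmay) ** 2
        AtWA = A.T @ (w[:, None] * A)        # 2x2 normal-equations matrix
        AtWy = A.T @ (w * y)
        beta = np.linalg.solve(AtWA, AtWy)   # [intercept, slope]
        cov = np.linalg.inv(AtWA)            # parameter covariance
        return beta, cov
]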

> 4. relsigma.  Other than using no weighting at all, there are basically
> two ways that people weight data in a least squares fit.
>     (a) provide explicit absolute estimates of the errors
> (uncertainties) for each data point.  This is what physical scientists
> often do.  Setting relsigma=False tells linfit to use this method of
> weighting the data.  If the error estimates are accurate, then the
> covariance matrix provides estimates of the uncertainties in the fitting
> parameters (the slope & y-intercept).
>     (b) provide relative estimates of the errors (uncertainties) for
> each data point (it's assumed that the absolute errors are not known but
> relative uncertainties between difference data points is known).  This
> is what social scientists often do.  When only the relative
> uncertainties are known, the covariance matrix needs to be rescaled in
> order to obtain accurate estimates of the uncertainties in the fitting
> parameters.  Setting relsigma=True tells linfit to use this method of
> weighting the data.

This is not really clear from the docstring (plus, the parameter is
optional but no default value is specified in the docstring), and it is
made even less obvious by the name of the parameter used to specify the
uncertainties.

I would prefer two independent and mutually exclusive parameters for
the two cases; 'sigma' and 'relsigma' are one option if you want to be
compatible with the (ugly, IMHO) parameter name used by curve_fit.

Cheers,
Daniele

Re: adding linear fitting routine

Daniele Nicolodi
In reply to this post by David Pine
On 04/12/2013 13:43, David J Pine wrote:
> 1. Generalization to arbitrary degree polynomials.  This already exists
> in numpy.polyfit.  One limitation of polyfit is that it does not
> currently allow the user to provide absolute uncertainties in the data,
> but there has been some discussion of adding this capability.

This is a huge limitation IMHO.  Furthermore, polyfit() only allows
fitting full polynomials up to a given order, not polynomials with an
arbitrary subset of terms (it is not possible to fit y = d * x**3
alone, only y = a + b * x + c * x**2 + d * x**3).
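
[That single-term case is easy enough with a hand-built design matrix; a minimal sketch:

    import numpy as np

    # Fit y = d * x**3 alone (no lower-order terms), which polyfit
    # cannot express:
    rng = np.random.default_rng(0)
    x = np.linspace(0.0, 2.0, 50)
    y = 1.5 * x**3 + 0.05 * rng.standard_normal(50)

    A = (x ** 3)[:, None]                    # one-column design matrix
    d, *_ = np.linalg.lstsq(A, y, rcond=None)
    print(d[0])                              # close to 1.5
]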

Cheers,
Daniele

Re: [SciPy-Dev] adding linear fitting routine

Daπid
In reply to this post by David Pine
On 3 December 2013 11:19, David J Pine <[hidden email]> wrote:
> I would like to get some feedback and generate some discussion about a least squares fitting routine I submitted last Friday

On the wishlist level, I would like to see complete model fitting, considering errors in both axes and their correlation, and the option for a robust fitting method. See details, for example, here:

http://arxiv.org/abs/1008.4686

I haven't really needed it myself, so I haven't taken the time to implement it yet.


/David.

Re: adding linear fitting routine

David Pine
In reply to this post by Daniele Nicolodi
Daniele


On Dec 4, 2013, at 1:55 PM, Daniele Nicolodi <[hidden email]> wrote:

> On 04/12/2013 13:43, David J Pine wrote:
>> linfit does not call a matrix inversion routine.  Instead it calculates
>> the best fit slope and y-intercept directly.  By contrast, polyfit does
>> call a matrix inversion routine (numpy.linalg.lstsq), which has a
>> certain amount of overhead that linfit avoids.  This may be why polyfit
>> is slower than linfit.
>
> A least squares fit is a matrix inversion. What you do is a matrix
> inversion, except that the notation you use does not make this clear.
> What you can discuss is the method you use for the inversion.  I would
> have to have a closer look to the test...

I assure you that I understand the mathematics.  Specifically, I understand
that you can view the mathematics used in linfit as implementing matrix
inversion.  That is not the point.  The point is that polyfit calls a matrix
inversion routine, which invokes computational machinery that is slow
compared to just doing the algebra directly without calling a matrix
inversion routine.  I hope this is clear.

>
>> 4. relsigma.  Other than using no weighting at all, there are basically
>> two ways that people weight data in a least squares fit.
>>    (a) provide explicit absolute estimates of the errors
>> (uncertainties) for each data point.  This is what physical scientists
>> often do.  Setting relsigma=False tells linfit to use this method of
>> weighting the data.  If the error estimates are accurate, then the
>> covariance matrix provides estimates of the uncertainties in the fitting
>> parameters (the slope & y-intercept).
>>    (b) provide relative estimates of the errors (uncertainties) for
>> each data point (it's assumed that the absolute errors are not known but
>> relative uncertainties between difference data points is known).  This
>> is what social scientists often do.  When only the relative
>> uncertainties are known, the covariance matrix needs to be rescaled in
>> order to obtain accurate estimates of the uncertainties in the fitting
>> parameters.  Setting relsigma=True tells linfit to use this method of
>> weighting the data.
>
> This is not really clear from the docstring (plus it optional but no
> default value is specified in the docstring), and made even less obvious
> by the name of the parameter used to specify the uncertainties.

It's specified in the function definition:

def linfit(x, y, sigmay=None, relsigma=True, cov=False, chisq=False, residuals=False)

which is the way it's always done in the online numpy and scipy documentation.
However, I can additionally specify it in the docstring under the parameter
definition.

>
> I would prefer two independent and mutually exclusive parameters for the
> two cases, 'sigma' and 'relsigma' are one option if you want to be
> compatible with the (ugly, IMHO) parameter name used by curve_fit.

Here I disagree.  sigmay is an array of error values.  relsigma is a boolean that simply
tells linfit whether to treat the sigmay values as relative (relsigma=True, the default)
or as absolute (relsigma=False).

David
Re: adding linear fitting routine

josef.pktd
linfit looks like an enhanced version of linregress, which also has
only one regressor, but doesn't have weights
http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.linregress.html

relsigma is similar to the new `absolute_sigma` in curve_fit
https://github.com/scipy/scipy/pull/3098


I think linregress could be rewritten to include these improvements.

Otherwise I keep out of any fitting debates, because I think `odr` is
better for handling measurement errors in the x variables, and
statsmodels is better for everything else (mainly linear only so far)
and `lmfit` for nonlinear LS.
There might be a case for stripped down convenience functions or
special case functions.

Josef

Re: adding linear fitting routine

David Pine

On Dec 4, 2013, at 3:24 PM, [hidden email] wrote:


> linfit looks like an enhanced version of linregress, which also has
> only one regressor, but doesn't have weights
> http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.linregress.html

The problem with this is that the statistical tests that linregress uses--the r-value, p-value, & stderr--are not really compatible with the weighted chi-squared fitting that linfit does.  The r-value, p-value, & stderr are statistics used mostly in the social sciences (see http://en.wikipedia.org/wiki/Coefficient_of_determination).  Looking at linregress, it's clear that it was written with that community in mind.

By contrast, linfit (and curve_fit) use the chi-squared measure of goodness of fit, which is explicitly made to be used with weighted data.  In my opinion, trying to satisfy the needs of both communities with one function will result in inefficient code and confusion in both user communities.  linfit naturally goes with the curve_fit and polyfit functions, and is implemented consistently with those fitting routines.  linregress is really a different animal, with statistical tests normally used with unweighted data, and I suspect that the community that uses it would be put off by the "improvements" made by linfit.
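
[For concreteness, the reduced chi-squared statistic referred to here is, for a straight-line fit, the standard formula (a sketch, not linfit's internals):

    import numpy as np

    def reduced_chisq(x, y, sigmay, slope, intercept):
        """Reduced chi-squared of a straight-line fit to weighted data.
        A value near 1 means the scatter is consistent with the quoted
        error bars; two fitted parameters are subtracted from the dof."""
        resid = (y - (slope * x + intercept)) / sigmay
        return np.sum(resid ** 2) / (len(y) - 2)
]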


> relsigma is similar to the new `absolute_sigma` in curve_fit
> https://github.com/scipy/scipy/pull/3098

That's right.  linfit implements essentially the same functionality that is being implemented in curve_fit.



Re: adding linear fitting routine

josef.pktd
On Wed, Dec 4, 2013 at 11:29 AM, David Pine <[hidden email]> wrote:

>
> On Dec 4, 2013, at 3:24 PM, [hidden email] wrote:
>
>
> linfit looks like an enhanced version of linregress, which also has
> only one regressor, but doesn't have weights
> http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.linregress.html
>
>
> The problem with this is that the statistical tests that linregress
> uses--the r-value, p-value, & stderr-- are not really compatible with the
> weighted chi-squared fitting that linfit does.  The  r-value, p-value, &
> stderr are statistical tests that are used mostly in the social sciences
> (see http://en.wikipedia.org/wiki/Coefficient_of_determination).  Looking at
> linregress, it's clear that it was written with that community in mind.
>
> By contrast, linfit (and curve_fit) use the chi-squared measure of goodness
> of fit, which is explicitly made to be used with weighted data.  In my
> opinion, trying to satisfy the needs of both communities with one function
> will result in inefficient code and confusion in both user communities.
> linfit naturally goes with the curve_fit and polyfit functions, and is
> implemented consistent with those fitting routines.  linregress is really a
> different animal, with statistical tests normally used with unweighted data,
> and I suspect that the community that uses it will be put off by the
> "improvements" made by linfit.

Except for setting absolute_sigma to True or relsigma to False, and
returning redchisq instead of rsquared, there is no real difference.
It's still just weighted least squares with a fixed or estimated scale.
(In statsmodels we return most of the same statistics after WLS as
after OLS. However, allowing for a fixed scale is still not built in.)

You still return the cov of the parameter estimates, so users can
still calculate std_err and pvalue themselves in `linfit`.

In my interpretation of the discussions around curve_fit, it seems to
me that it is now a version that both communities can use.
The only problem I see is that linfit/linregress get a bit ugly if
there are many optional returns.

Josef

Re: adding linear fitting routine

David Pine
Josef,

Ok, so what would you propose?  That we essentially replace linregress with linfit, and then let people calculate std_err and pvalue themselves from the covariance matrix that `linfit` returns?  Or something else?  By the way, that's what I chose to do for the estimates of the uncertainties in the fitting parameters--to let the user calculate them from the square roots of the diagonal elements of the covariance matrix.  In my opinion, that results in a cleaner, less cluttered function.
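
[The calculation being delegated to the user is a one-liner; a short sketch with a made-up covariance matrix:

    import numpy as np

    # e.g. a 2x2 covariance matrix as returned by a two-parameter fit
    cov = np.array([[4.0e-4, -2.0e-4],
                    [-2.0e-4, 3.0e-4]])
    dslope, dintercept = np.sqrt(np.diag(cov))  # 1-sigma uncertainties
]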

David


Re: adding linear fitting routine

David Pine
Josef,

Actually, I just rechecked curve_fit and it returns only the optimal fitting parameters and the covariance matrix.  I could pare down linfit so that it returns only those quantities and leave it to the user to calculate chi-squared and the residuals.  I suppose that's the cleanest way to go.

David


Re: adding linear fitting routine

josef.pktd
On Wed, Dec 4, 2013 at 12:53 PM, David J Pine <[hidden email]> wrote:

> Josef,
>
> Actually, I just rechecked curve_fit and it returns only the optimal fitting
> parameters and the covariance matrix.  I could pare down linfit so that it
> returns only those quantities and leave it to the user to calculate
> chi0squared and the residuals.  I suppose that's the cleanest way to go.
>
> David
>
>
> On Wed, Dec 4, 2013 at 6:47 PM, David J Pine <[hidden email]> wrote:
>>
>> Josef,
>>
>> Ok, so what would you propose?  That we essentially replace linregress
>> with linfit, and then let people calculate std_err and pvalue themselves
>> from the covariance matrix that `linfit` returns?  or something else?  By
>> the way, that's what I chose to do for the estimates of the uncertainties in
>> the fitting parameters--to let the user calculate the uncertainties in the
>> fitting parameters from square roots the diagonal elements of the covariance
>> matrix.  In my opinion, that results in a cleaner less cluttered function.

Please reply inline so we have the sub-threads together.

two thoughts:

- I'm getting more and more averse to functions that return "numbers".
  scipy.optimize minimize returns a dictionary.
  In statsmodels we return a special class instance that can lazily
calculate all the extra things a user might want (see the sketch after
this list).
  And where we don't do that yet, like in hypothesis tests, we want to
change it soon.
  The two main problems with returning numbers are, first, that the
return cannot be changed in a backwards-compatible way and, second,
that if we want to let the user calculate additional optional results,
we need return_this, return_that, return_something_else, ....

- The main usage of stats.linregress that I have seen, in random looks
at various packages, is just to get a quick fit of a line without
(m)any extras. In this case just returning the parameters and maybe
some other minimal cheap extras is fine.
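
[A minimal sketch of the kind of lazy results object described in the first point above (all names hypothetical):

    import numpy as np

    class FitResult:
        """Hypothetical lazy results holder: cheap values stored
        eagerly, derived statistics computed on first access."""

        def __init__(self, x, y, slope, intercept, cov):
            self.x, self.y = x, y
            self.slope, self.intercept, self.cov = slope, intercept, cov
            self._residuals = None

        @property
        def residuals(self):
            if self._residuals is None:   # computed lazily, then cached
                self._residuals = self.y - (self.slope * self.x
                                            + self.intercept)
            return self._residuals

        @property
        def stderr(self):
            return np.sqrt(np.diag(self.cov))  # 1-sigma parameter errors
]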


I don't know if we want linfit to provide a one-stop shopping center,
or just to provide some minimal results and leave the rest to the
user.

(In statsmodels I also often don't know what I should do. I follow the
scipy tradition and return some numbers, only to change my mind later
when I see what additional results could easily be calculated within
the function, but by then I no longer have access to the required
intermediate calculations.)

Josef

Re: adding linear fitting routine

Matt Newville
Hi David, Josef,

I'd very much like to see linfit() available in scipy.
Would it be reasonable to add David's linfit() as "the more complete"
version, and refactor linregress() to use linfit() and return its
current return tuple derived from the linfit results?  Perhaps that's
what Josef is suggesting too.  FWIW, I would generally prefer getting
a dictionary of results as a return value instead of a tuple with more
than 2 items.

--Matt Newville
Re: adding linear fitting routine

David Pine
I guess my preference would be to have linfit() be as similar to curve_fit() as possible in its outputs (and inputs, insofar as that makes sense), and then, if we later decide we prefer another way of doing either the inputs or the outputs, to change them in concert.  I think there is real value in making the user interfaces of linfit() and curve_fit() consistent--it makes the user's experience much less confusing.  As of right now, I am agnostic about whether or not the function returns a dictionary of results--although I am unsure of what you have in mind.  How would you structure a dictionary of results?

David


Re: adding linear fitting routine

Matt Newville
Hi David,

On Wed, Dec 4, 2013 at 1:13 PM, David J Pine <[hidden email]> wrote:

> I guess my preference would be to write have linfit() be as similar to
> curve_fit() in outputs (and inputs in so far as it makes sense), and then if
> we decide we prefer another way of doing either the inputs or the outputs,
> then to do them in concert.  I think there is real value in making the user
> interfaces of linfit() and curve_fit() consistent--it make the user's
> experience so much less confusing. As of right now, I am agnostic about
> whether or not the function returns a dictionary of results--although I am
> unsure of what you have in mind.  How would you structure a dictionary of
> results?
>

Using `return (pbest, covar)` seems reasonable.  But, if you
returned a dictionary, you could include a chi-square statistic and a
residuals array.

scipy.optimize.leastsq() returns 5 items: (pbest, covar, infodict, mesg, ier),
with infodict being a dict with items 'nfev', 'fvec', 'fjac', 'ipvt',
and 'qtf'.  I think it's too late to change it, but it would have
been nicer (IMHO) if it had returned a single dict instead:

    return {'best_values': pbest, 'covar': covar,
            'nfev': infodict['nfev'], 'fvec': infodict['fvec'],
            'fjac': infodict['fjac'], 'ipvt': infodict['ipvt'],
            'qtf': infodict['qtf'], 'mesg': mesg, 'ier': ier}

Similarly, linregress() returns a 5-element tuple.  The problem with
these is that you end up with long assignments:

    slope, intercept, r_value, p_value, stderr = \
        scipy.stats.linregress(xdata, ydata)

In fact, you sort of have to do this, even for a quick-and-dirty
result when slope and intercept are all that would be used later on.
The central problem is that these 5 returned values are now in your
local namespace, but they are not really independent values.  Instead,
you could think about

    regression = scipy.stats.linregress(xdata, ydata)

and get at any of the values computed by the regression that you want.
In short, if you had linfit() return a dictionary of values, you could
put many statistics in it, and people who wanted to ignore some of them
would be able to do so.

FWIW, a named tuple would be a fine alternative.  I don't know if
backward compatibility would prevent that in scipy.  Anyway, it's
just a suggestion....

--Matt

Re: adding linear fitting routine

David Pine
Ok, here are my thoughts about how to do the returns.  They are informed by (1) the speed of linfit and (2) the above discussion.

(1) Speed. linfit runs fastest when the residuals are not calculated.  Calculating the residuals generally slows down linfit by a factor of 2 to 3 -- and it's the only thing that really slows it down.  After that, all additional calculations consume negligible time.  The residuals are calculated if (a) residuals=True or chisq=True, or (b) cov=True AND relsigma=True.  Note that (b) means the residuals are not calculated when cov=True AND relsigma=False (and residuals=False AND chisq=False).
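
In code, that condition would look something like this (a sketch only;
the function and variable names here are made up):

    import numpy as np

    def compute_residuals_if_needed(x, y, slope, yint,
                                    residuals=False, chisq=False,
                                    cov=False, relsigma=True):
        # residuals are only needed for chisq, for the residuals
        # themselves, or to rescale the covariance matrix when the
        # uncertainties are relative (relsigma=True)
        if residuals or chisq or (cov and relsigma):
            return y - (slope * np.asarray(x) + yint)
        return None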

(2) The consensus of the discussion seems to be that when a lot of things are returned by linfit, it's better to return everything as a dictionary.

So here is what I propose:

If the user only wants the optimal fitting parameters, or the user wants only the optimal fitting parameters and the covariance matrix, these can be returned as arrays.

Otherwise, everything is returned, and returned as a dictionary.

If we adopted this, then the only question is what the default setting would be: return_all=False or return_all=True.  I guess I would opt for return_all=False, the less verbose return option.

Adopting this way of doing things would simplify the arguments of linfit, which would now look like

linfit(x, y, sigmay=None, relsigma=True, return_all=False)


I would also modify linfit to calculate the r-value, p-value, and the stderr, which would all be returned in dictionary format when return_all=True.
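
A minimal sketch of what the return logic might look like (the local
names and dictionary keys here are made up, not settled):

    def linfit(x, y, sigmay=None, relsigma=True, return_all=False):
        ...  # fitting calculations set fit, cvm, chisq, resids, etc.
        if not return_all:
            # default: best-fit parameters and covariance matrix only
            return fit, cvm
        # return_all=True: everything in a single dictionary
        return {'fit': fit, 'cvm': cvm, 'chisq': chisq,
                'residuals': resids, 'rvalue': rvalue,
                'pvalue': pvalue, 'stderr': stderr}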


How does this sound?


David







Re: adding linear fitting routine

Daniele Nicolodi
On 04/12/2013 23:00, David J Pine wrote:
> Otherwise, everything is returned, and returned as a dictionary.

I'll repeat myself: a named tuple is the way to go, not a dictionary.
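
For instance, something along these lines (the type and field names
are just an illustration):

    from collections import namedtuple

    LinfitResult = namedtuple('LinfitResult', ['fit', 'cvm'])
    result = LinfitResult(fit=fit, cvm=cvm)
    # indexes and unpacks like a tuple, but fields are also
    # accessible by name: result.fit, result.cvm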

Cheers,
Daniele


Re: adding linear fitting routine

josef.pktd
On Wed, Dec 4, 2013 at 5:29 PM, Daniele Nicolodi <[hidden email]> wrote:
> On 04/12/2013 23:00, David J Pine wrote:
>> Otherwise, everything is returned, and returned as a dictionary.
>
> I'll repeat myself: a named tuple is the way to go, not a dictionary.

namedtuples have the disadvantage that users use tuple unpacking, and
then it again breaks backwards compatibility if any return value is
changed in the future.
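
For example (field names hypothetical):

    from collections import namedtuple

    # version 1 of the result type: unpacking works
    Result = namedtuple('Result', ['fit', 'cvm'])
    fit, cvm = Result(fit=[2.0, 0.1], cvm=[[0.1, 0.0], [0.0, 0.1]])

    # version 2 adds a field: the same unpacking now raises
    # ValueError: too many values to unpack
    Result = namedtuple('Result', ['fit', 'cvm', 'chisq'])
    # fit, cvm = Result([2.0, 0.1], [[0.1, 0.0], [0.0, 0.1]], 3.2)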

Josef



Re: adding linear fitting routine

Daniele Nicolodi
On 04/12/2013 23:42, [hidden email] wrote:
> On Wed, Dec 4, 2013 at 5:29 PM, Daniele Nicolodi <[hidden email]> wrote:
>> On 04/12/2013 23:00, David J Pine wrote:
>>> Otherwise, everything is returned, and returned as a dictionary.
>>
>> I'll repeat myself: a named tuple is the way to go, not a dictionary.
>
> namedtuples have the disadvantage that users use tuple unpacking, and
> then it again breaks backwards compatibility if any return value is
> changed in the future.

I frankly don't see why someone would want to extend the interface of a
linear fitting routine to return more information in the future.  I
think all the information required to design the interface right is
already available.

Cheers,
Daniele
