I would like to get some feedback and generate some discussion about a least squares fitting routine I submitted last Friday [please see adding linear fitting routine (29 Nov 2013)]. I know that everybody is very busy, but it would be helpful to get some feedback and, I hope, eventually to get this routine added to one of the basic numpy/scipy libraries.
David Pine

_______________________________________________
SciPy-User mailing list
[hidden email]
http://mail.scipy.org/mailman/listinfo/scipy-user
On 03/12/2013 11:19, David J Pine wrote:
> I would like to get some feedback and generate some discussion about a
> least squares fitting routine I submitted last Friday [please see adding
> linear fitting routine
> <http://comments.gmane.org/gmane.comp.python.scientific.devel/18442>
> (29 Nov 2013)].

I think that adding a least squares fitting routine which correctly handles uncertainties and computes the covariance matrix is a good idea; I have wanted to do that myself for quite a while. However, I think that a generalization to arbitrary degree polynomials would be much more useful, and a linfit function could be added as a convenience wrapper. Actually, it would be nice to have something that works on arbitrary orthogonal bases, but it may be difficult to design a general interface for such a thing.

Regarding your pull request, I don't really think that your code can be much faster than the general purpose least squares fitting already in scipy or numpy, modulo some bug somewhere. You justify this by saying that your solution is faster because it does not invert a matrix, but that is exactly what you are doing, except that you do not write the math in matrix formalism.

Furthermore, I didn't have a very close look, but I don't understand what the `relsigma` parameter is supposed to do, and I would rename the `sigmay` parameter `yerr`.

Cheers,
Daniele
Daniele,
Thank you for your feedback. Regarding the points you raise:

1. Generalization to arbitrary degree polynomials. This already exists in numpy.polyfit. One limitation of polyfit is that it does not currently allow the user to provide absolute uncertainties in the data, but there has been some discussion of adding this capability.
2. Generalization to arbitrary orthogonal bases. numpy currently provides fitting routines for various polynomial bases, including chebfit, legfit, lagfit, hermfit, and hermefit. I am not aware of a fitting routine in numpy/scipy that works on arbitrary bases.
3. Speed. As far as I know there is no bug in any of the tested software. The unit test test_linfit.py (https://github.com/djpine/linfit) times the various fitting routines. You can run it yourself (and check the code--maybe you can spot some errors) but on my laptop I get the results printed at the end of this message.
linfit does not call a matrix inversion routine. Instead it calculates the best fit slope and y-intercept directly. By contrast, polyfit does call a matrix inversion routine (numpy.linalg.lstsq), which has a certain amount of overhead that linfit avoids. This may be why polyfit is slower than linfit.
4. relsigma. Other than using no weighting at all, there are basically two ways that people weight data in a least squares fit:

   (a) Provide explicit absolute estimates of the errors (uncertainties) for each data point. This is what physical scientists often do. Setting relsigma=False tells linfit to use this method of weighting the data. If the error estimates are accurate, then the covariance matrix provides estimates of the uncertainties in the fitting parameters (the slope and y-intercept).

   (b) Provide relative estimates of the errors (uncertainties) for each data point (it's assumed that the absolute errors are not known, but the relative uncertainties between different data points are known). This is what social scientists often do. When only the relative uncertainties are known, the covariance matrix needs to be rescaled in order to obtain accurate estimates of the uncertainties in the fitting parameters. Setting relsigma=True tells linfit to use this method of weighting the data.
5. Renaming `sigmay` parameter `yerr`. Either choice is fine with me but I used `sigmay` to be (mostly) consistent with scipy.optimize.curve_fit.
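For readers following points 3 and 4 above, the direct closed-form solution and the relsigma rescaling can be sketched as follows. This is a textbook sketch of the algebra, not linfit's actual source; the function and variable names are illustrative.

```python
import numpy as np

def linfit_sketch(x, y, sigmay=None, relsigma=True):
    """Weighted least-squares fit of y = a + b*x, solved in closed form.

    A sketch of the standard textbook formulas; not linfit's actual code.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    w = np.ones_like(x) if sigmay is None else 1.0 / np.asarray(sigmay, float) ** 2

    S = w.sum()
    Sx, Sy = (w * x).sum(), (w * y).sum()
    Sxx, Sxy = (w * x * x).sum(), (w * x * y).sum()
    delta = S * Sxx - Sx ** 2          # determinant of the 2x2 normal matrix

    a = (Sxx * Sy - Sx * Sxy) / delta  # intercept
    b = (S * Sxy - Sx * Sy) / delta    # slope
    cov = np.array([[Sxx, -Sx], [-Sx, S]]) / delta

    if relsigma or sigmay is None:
        # Only relative weights known: rescale cov by the reduced chi-squared.
        resid = y - (a + b * x)
        cov *= (w * resid ** 2).sum() / (len(x) - 2)
    return (a, b), cov

# Parameter uncertainties are the square roots of the diagonal elements
# of the covariance matrix:
#   (a, b), cov = linfit_sketch(x, y, sigmay=yerr, relsigma=False)
#   da, db = np.sqrt(np.diag(cov))
```

No matrix inversion routine is called; the 2x2 normal equations are solved explicitly, which is where the speed advantage claimed above would come from.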
---------------------------------
Results of timing tests from test_linfit.py

Compare linfit to scipy.linalg.lstsq with relative individually weighted data points:
      10 data points: linfit is faster than scipy.linalg.lstsq by 1.26 times
     100 data points: linfit is faster than scipy.linalg.lstsq by 2.33 times
    1000 data points: linfit is faster than scipy.linalg.lstsq by 12 times
   10000 data points: linfit is faster than scipy.linalg.lstsq by 31.8 times

Compare linfit to scipy.linalg.lstsq with unweighted data points:
      10 data points: linfit is faster than scipy.linalg.lstsq by 2.4 times
     100 data points: linfit is faster than scipy.linalg.lstsq by 2.5 times
    1000 data points: linfit is faster than scipy.linalg.lstsq by 2.9 times
   10000 data points: linfit is faster than scipy.linalg.lstsq by 3.5 times
  100000 data points: linfit is faster than scipy.linalg.lstsq by 4.4 times
 1000000 data points: linfit is faster than scipy.linalg.lstsq by 4.6 times

Compare linfit to scipy.stats.linregress with unweighted data points:
      10 data points: linfit is faster than scipy.stats.linregress by 5.2 times
     100 data points: linfit is faster than scipy.stats.linregress by 5.1 times
    1000 data points: linfit is faster than scipy.stats.linregress by 4.7 times
   10000 data points: linfit is faster than scipy.stats.linregress by 2.9 times
  100000 data points: linfit is faster than scipy.stats.linregress by 1.8 times
 1000000 data points: linfit is faster than scipy.stats.linregress by 1.1 times

Compare linfit to numpy.polyfit with relative individually weighted data points:
      10 data points: linfit is faster than numpy.polyfit by 2.6 times
     100 data points: linfit is faster than numpy.polyfit by 2.5 times
    1000 data points: linfit is faster than numpy.polyfit by 4.4 times
   10000 data points: linfit is faster than numpy.polyfit by 3.1 times
  100000 data points: linfit is faster than numpy.polyfit by 3.5 times
 1000000 data points: linfit is faster than numpy.polyfit by 1.9 times

Compare linfit to numpy.polyfit with unweighted data points:
      10 data points: linfit is faster than numpy.polyfit by 3 times
     100 data points: linfit is faster than numpy.polyfit by 3.5 times
    1000 data points: linfit is faster than numpy.polyfit by 4.3 times
   10000 data points: linfit is faster than numpy.polyfit by 6 times
  100000 data points: linfit is faster than numpy.polyfit by 9.5 times
 1000000 data points: linfit is faster than numpy.polyfit by 7.1 times
----------------------------------------------------------------------
On Wed, Dec 4, 2013 at 12:13 PM, Daniele Nicolodi <[hidden email]> wrote:
On 04/12/2013 13:43, David J Pine wrote:
> linfit does not call a matrix inversion routine. Instead it calculates
> the best fit slope and y-intercept directly. By contrast, polyfit does
> call a matrix inversion routine (numpy.linalg.lstsq), which has a
> certain amount of overhead that linfit avoids.

A least squares fit is a matrix inversion. What you do is a matrix inversion, except that the notation you use does not make this clear. What you can discuss is the method you use for the inversion. I would have to have a closer look at the test...

> 4. relsigma. Other than using no weighting at all, there are basically
> two ways that people weight data in a least squares fit. [...]
> Setting relsigma=True tells linfit to use this method of weighting the data.

This is not really clear from the docstring (plus it is optional but no default value is specified in the docstring), and made even less obvious by the name of the parameter used to specify the uncertainties.
I would prefer two independent and mutually exclusive parameters for the two cases; 'sigma' and 'relsigma' are one option if you want to be compatible with the (ugly, IMHO) parameter name used by curve_fit.

Cheers,
Daniele
On 04/12/2013 13:43, David J Pine wrote:
> 1. Generalization to arbitrary degree polynomials. This already exists
> in numpy.polyfit. One limitation of polyfit is that it does not
> currently allow the user to provide absolute uncertainties in the data,
> but there has been some discussion of adding this capability.

This is a huge limitation, IMHO. Furthermore, polyfit() only allows fitting full polynomials up to an arbitrary degree, not polynomials with arbitrary terms (it is not possible to fit y = d * x**3, only y = a + b * x + c * x**2 + d * x**3).

Cheers,
Daniele
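Daniele's example of a fit polyfit cannot express (y = d * x**3 with no lower-order terms) is straightforward with numpy.linalg.lstsq and a hand-built design matrix; the data here are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.0, 2.0, 50)
y = 3.0 * x**3 + rng.normal(scale=0.1, size=x.size)  # true d = 3

# Design matrix with a single basis function x**3 -- no lower-order terms.
A = x[:, np.newaxis] ** 3
d, residuals, rank, sv = np.linalg.lstsq(A, y, rcond=None)
print(d[0])  # close to 3.0
```

The same pattern extends to any subset of basis functions by stacking one column per term in `A`.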
On 3 December 2013 11:19, David J Pine <[hidden email]> wrote:

> I would like to get some feedback and generate some discussion about a
> least squares fitting routine I submitted last Friday

On the wishlist level, I would like to see complete model fitting, considering errors on both axes and correlations, and the option for a robust fitting method. See details, for example, here: http://arxiv.org/abs/1008.4686

I haven't really needed it myself, so I haven't taken the time to implement it yet.

/David.
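Part of this wishlist already exists in scipy: scipy.odr performs orthogonal distance regression, which accounts for errors on both axes. A minimal sketch with made-up data:

```python
import numpy as np
from scipy import odr

rng = np.random.default_rng(1)
x_true = np.linspace(0.0, 10.0, 30)
x = x_true + rng.normal(scale=0.1, size=x_true.size)            # noisy x
y = 2.0 * x_true + 1.0 + rng.normal(scale=0.2, size=x_true.size)  # noisy y

def line(beta, x):
    return beta[0] * x + beta[1]

# RealData takes standard errors on both coordinates (sx and sy).
data = odr.RealData(x, y, sx=0.1, sy=0.2)
fit = odr.ODR(data, odr.Model(line), beta0=[1.0, 0.0]).run()
print(fit.beta)     # slope and intercept, close to [2, 1]
print(fit.sd_beta)  # their standard errors
```

Robust (outlier-resistant) fitting, as in the arXiv reference above, is not covered by scipy.odr and would still be a separate addition.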
Daniele,
On Dec 4, 2013, at 1:55 PM, Daniele Nicolodi <[hidden email]> wrote:

> A least squares fit is a matrix inversion. What you do is a matrix
> inversion, except that the notation you use does not make this clear.
> What you can discuss is the method you use for the inversion.

I assure you that I understand the mathematics. Specifically, I understand that you can view the mathematics used in linfit as implementing a matrix inversion. That is not the point. The point is that polyfit calls a matrix inversion routine, which invokes computational machinery that is slow compared to just doing the algebra directly without calling a matrix inversion routine. I hope this is clear.

> This is not really clear from the docstring (plus it is optional but no
> default value is specified in the docstring), and made even less obvious
> by the name of the parameter used to specify the uncertainties.

It's specified in the function definition:

    def linfit(x, y, sigmay=None, relsigma=True, cov=False, chisq=False, residuals=False)

which is the way it's always done in the online numpy and scipy documentation. However, I can additionally specify it in the docstring under the parameter definition.

> I would prefer two independent and mutually exclusive parameters for the
> two cases; 'sigma' and 'relsigma' are one option if you want to be
> compatible with the (ugly, IMHO) parameter name used by curve_fit.

Here I disagree. sigmay is an array of error values. relsigma is a boolean that simply tells linfit whether to treat the sigmay values as relative (relsigma=True, the default) or as absolute (relsigma=False).

David
On Wed, Dec 4, 2013 at 8:58 AM, David Pine <[hidden email]> wrote:
> Here I disagree. sigmay is an array of error values. relsigma is a boolean
> that simply tells linfit whether to treat the sigmay values as relative
> (relsigma=True, the default) or as absolute (relsigma=False).

linfit looks like an enhanced version of linregress, which also has only one regressor, but doesn't have weights:
http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.linregress.html

relsigma is similar to the new `absolute_sigma` in curve_fit:
https://github.com/scipy/scipy/pull/3098

I think linregress could be rewritten to include these improvements.

Otherwise I keep out of any fitting debates, because I think `odr` is better for handling measurement errors in the x variables, statsmodels is better for everything else (mainly linear only so far), and `lmfit` for nonlinear LS. There might be a case for stripped-down convenience functions or special-case functions.

Josef
On Dec 4, 2013, at 3:24 PM, [hidden email] wrote:
> linfit looks like an enhanced version of linregress, which also has
> only one regressor, but doesn't have weights

The problem with this is that the statistical tests that linregress uses (the r-value, p-value, and stderr) are not really compatible with the weighted chi-squared fitting that linfit does. The r-value, p-value, and stderr are statistical measures used mostly in the social sciences (see http://en.wikipedia.org/wiki/Coefficient_of_determination). Looking at linregress, it's clear that it was written with that community in mind.

By contrast, linfit (and curve_fit) use the chi-squared measure of goodness of fit, which is explicitly made to be used with weighted data. In my opinion, trying to satisfy the needs of both communities with one function will result in inefficient code and confusion in both user communities. linfit naturally goes with the curve_fit and polyfit functions and is implemented consistently with those fitting routines. linregress is really a different animal, with statistical tests normally used with unweighted data, and I suspect that the community that uses it will be put off by the "improvements" made by linfit.
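The two goodness-of-fit measures being contrasted here, reduced chi-squared and the coefficient of determination, can be computed side by side from the same residuals; a sketch with made-up data and error estimates:

```python
import numpy as np

rng = np.random.default_rng(2)
x = np.linspace(0.0, 5.0, 40)
yerr = 0.3                                   # assumed absolute error per point
y = 1.5 * x + 0.5 + rng.normal(scale=yerr, size=x.size)

b, a = np.polyfit(x, y, 1)                   # slope, intercept
resid = y - (a + b * x)

# Reduced chi-squared: requires absolute error estimates; ~1 for a good fit.
redchisq = np.sum((resid / yerr) ** 2) / (len(x) - 2)

# Coefficient of determination (r-squared): needs no error estimates.
rsq = 1.0 - np.sum(resid ** 2) / np.sum((y - y.mean()) ** 2)
```

The two numbers answer different questions: redchisq checks consistency with the stated measurement errors, while rsq measures the fraction of variance explained, which is why the two communities favor different statistics.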
> relsigma is similar to the new `absolute_sigma` in curve_fit

That's right. linfit implements essentially the same functionality that is being implemented in curve_fit.
On Wed, Dec 4, 2013 at 11:29 AM, David Pine <[hidden email]> wrote:
> linregress is really a different animal, with statistical tests normally
> used with unweighted data, and I suspect that the community that uses it
> will be put off by the "improvements" made by linfit.

Except for setting absolute_sigma to True or relsigma to False, and returning redchisq instead of rsquared, there is no real difference. It's still just weighted least squares with fixed or estimated scale. (In statsmodels we have most of the same statistics returned after WLS as after OLS. However, allowing for a fixed scale is still not built in.)

You still return the cov of the parameter estimates, so users can still calculate std_err and the p-value themselves from `linfit`.

In my interpretation of the discussions around curve_fit, it seems to me that it is now a version that both communities can use. The only problem I see is that linfit/linregress gets a bit ugly if there are many optional returns.

Josef
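Josef's point that std_err and the p-value are recoverable from the covariance matrix can be sketched with numpy.polyfit standing in for linfit (made-up data; the slope test mirrors what linregress reports):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
x = np.linspace(0.0, 1.0, 25)
y = 0.8 * x + rng.normal(scale=0.1, size=x.size)

# Fit a line and get the (scaled) covariance of the parameters.
p, cov = np.polyfit(x, y, 1, cov=True)   # p[0] is the slope
std_err = np.sqrt(cov[0, 0])             # standard error of the slope

# Two-sided t-test of slope == 0, as linregress would report it.
t = p[0] / std_err
pvalue = 2.0 * stats.t.sf(abs(t), df=len(x) - 2)
```

So a function returning only parameters and covariance does not cut the social-science users off; it just moves these two lines to their side.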
Josef,

Ok, so what would you propose? That we essentially replace linregress with linfit, and then let people calculate std_err and pvalue themselves from the covariance matrix that `linfit` returns? Or something else?

By the way, that's what I chose to do for the estimates of the uncertainties in the fitting parameters: to let the user calculate them from the square roots of the diagonal elements of the covariance matrix. In my opinion, that results in a cleaner, less cluttered function.
David
Josef,

Actually, I just rechecked curve_fit and it returns only the optimal fitting parameters and the covariance matrix. I could pare down linfit so that it returns only those quantities and leave it to the user to calculate chi-squared and the residuals. I suppose that's the cleanest way to go.
David
On Wed, Dec 4, 2013 at 12:53 PM, David J Pine <[hidden email]> wrote:
> Actually, I just rechecked curve_fit and it returns only the optimal
> fitting parameters and the covariance matrix. I could pare down linfit so
> that it returns only those quantities and leave it to the user to calculate
> chi-squared and the residuals. I suppose that's the cleanest way to go.

Please reply inline so we have the sub-threads together.

Two thoughts:

- I'm getting more and more averse to functions that return "numbers". scipy.optimize's minimize returns a dictionary. In statsmodels we return a special class instance that can lazily calculate all the extra things a user might want. And where we don't do that yet, as in hypothesis tests, we want to change it soon. The two main problems with returning numbers are that it cannot be changed in a backwards compatible way, and, second, if we want to offer the user additional optional results, then we need return_this, return_that, return_something_else, ....

- The main usage of stats.linregress that I have seen in random looks at various packages is just to get a quick fit of a line without (m)any extras. In this case just returning the parameters and maybe some other minimal cheap extras is fine.

I don't know if we want linfit to provide a one-stop shopping center, or just to provide some minimal results and leave the rest to the user.
(In statsmodels I also often don't know what I should do. I follow the scipy tradition and return some numbers, only to change my mind later when I see what additional results could be easily calculated within the function, but I don't get access to the required calculations.)

Josef
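The lazy result-object pattern Josef describes might look like the following sketch; FitResult is a hypothetical name, not an existing scipy or statsmodels class:

```python
import numpy as np

class FitResult:
    """Hypothetical lazy result object for a straight-line fit y = a + b*x."""

    def __init__(self, x, y, params, cov):
        self.x, self.y = np.asarray(x, float), np.asarray(y, float)
        self.params, self.cov = params, cov

    @property
    def residuals(self):
        a, b = self.params
        return self.y - (a + b * self.x)

    @property
    def redchisq(self):
        # Computed on demand, only for callers that want it.
        return np.sum(self.residuals ** 2) / (len(self.x) - 2)

    @property
    def std_err(self):
        return np.sqrt(np.diag(self.cov))

# New extras can be added as properties later without breaking callers
# that only ever used result.params and result.cov -- unlike a fixed
# tuple of numbers, which cannot grow backwards-compatibly.
```

This sidesteps the return_this/return_that flag proliferation mentioned above, at the cost of returning a class instance rather than plain arrays.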
Hi David, Josef,
On Wed, Dec 4, 2013 at 12:15 PM, <[hidden email]> wrote:
> On Wed, Dec 4, 2013 at 12:53 PM, David J Pine <[hidden email]> wrote:
>> Josef,
>>
>> Actually, I just rechecked curve_fit and it returns only the optimal
>> fitting parameters and the covariance matrix. I could pare down linfit
>> so that it returns only those quantities and leave it to the user to
>> calculate chi-squared and the residuals. I suppose that's the cleanest
>> way to go.
>>
>> David
>>
>> On Wed, Dec 4, 2013 at 6:47 PM, David J Pine <[hidden email]> wrote:
>>> Josef,
>>>
>>> Ok, so what would you propose? That we essentially replace linregress
>>> with linfit, and then let people calculate std_err and pvalue
>>> themselves from the covariance matrix that `linfit` returns? Or
>>> something else? By the way, that's what I chose to do for the
>>> estimates of the uncertainties in the fitting parameters--to let the
>>> user calculate them from the square roots of the diagonal elements of
>>> the covariance matrix. In my opinion, that results in a cleaner, less
>>> cluttered function.
>
> Please reply inline so we have the sub-threads together.
>
> Two thoughts:
>
> - I'm getting more and more averse to functions that return "numbers".
>   scipy.optimize.minimize returns a dictionary. In statsmodels we
>   return a special class instance that can lazily calculate all the
>   extra things a user might want. And where we don't do that yet, as in
>   hypothesis tests, we want to change it soon. The two main problems
>   with returning numbers are that it cannot be changed in a
>   backwards-compatible way, and, second, if we want to let a user
>   calculate additional optional results, then we need return_this,
>   return_that, return_something_else, ...
>
> - The main usage of stats.linregress that I have seen in random looks
>   at various packages is just to get a quick fit of a line without
>   (m)any extras. In this case just returning the parameters and maybe
>   some other minimal cheap extras is fine.
>
> I don't know if we want linfit to provide a one-stop shopping center,
> or just to provide some minimal results and leave the rest to the user.
>
> (In statsmodels I also often don't know what I should do. I follow the
> scipy tradition and return some numbers, only to change my mind later
> when I see what additional results could be easily calculated within
> the function, but I don't get access to the required calculations.)
>
> Josef
>
>>> David
>>>
>>> On Wed, Dec 4, 2013 at 6:03 PM, <[hidden email]> wrote:
>>>> On Wed, Dec 4, 2013 at 11:29 AM, David Pine <[hidden email]> wrote:
>>>> > On Dec 4, 2013, at 3:24 PM, [hidden email] wrote:
>>>> >
>>>> > linfit looks like an enhanced version of linregress, which also
>>>> > has only one regressor, but doesn't have weights:
>>>> >
>>>> > http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.linregress.html
>>>> >
>>>> > The problem with this is that the statistical tests that
>>>> > linregress uses--the r-value, p-value, & stderr--are not really
>>>> > compatible with the weighted chi-squared fitting that linfit does.
>>>> > The r-value, p-value, & stderr are statistical tests that are used
>>>> > mostly in the social sciences (see
>>>> > http://en.wikipedia.org/wiki/Coefficient_of_determination).
>>>> > Looking at linregress, it's clear that it was written with that
>>>> > community in mind.
>>>> >
>>>> > By contrast, linfit (and curve_fit) use the chi-squared measure of
>>>> > goodness of fit, which is explicitly made to be used with weighted
>>>> > data. In my opinion, trying to satisfy the needs of both
>>>> > communities with one function will result in inefficient code and
>>>> > confusion in both user communities. linfit naturally goes with the
>>>> > curve_fit and polyfit functions, and is implemented consistently
>>>> > with those fitting routines. linregress is really a different
>>>> > animal, with statistical tests normally used with unweighted data,
>>>> > and I suspect that the community that uses it will be put off by
>>>> > the "improvements" made by linfit.
>>>>
>>>> Except for setting absolute_sigma to True or relsigma to False and
>>>> returning redchisq instead of rsquared, there is no real difference.
>>>> It's still just weighted least squares with fixed or estimated
>>>> scale. (In statsmodels we have most of the same statistics returned
>>>> after WLS as after OLS. However, allowing for a fixed scale is still
>>>> not built in.)
>>>>
>>>> You still return the cov of the parameter estimates, so users can
>>>> still calculate std_err and pvalue themselves in `linfit`.
>>>>
>>>> In my interpretation of the discussions around curve_fit, it seems
>>>> to me that it is now a version that both communities can use. The
>>>> only problem I see is that linfit/linregress get a bit ugly if there
>>>> are many optional returns.
>>>>
>>>> Josef
>>>>
>>>> > relsigma is similar to the new `absolute_sigma` in curve_fit:
>>>> > https://github.com/scipy/scipy/pull/3098
>>>> >
>>>> > That's right. linfit implements essentially the same functionality
>>>> > that is being implemented in curve_fit.
>>>> >
>>>> > I think linregress could be rewritten to include these
>>>> > improvements.
>>>> >
>>>> > Otherwise I keep out of any fitting debates, because I think `odr`
>>>> > is better for handling measurement errors in the x variables,
>>>> > statsmodels is better for everything else (mainly linear only so
>>>> > far), and `lmfit` for nonlinear LS. There might be a case for
>>>> > stripped-down convenience functions or special-case functions.
>>>> >
>>>> > Josef

I'd very much like to see linfit() available in scipy. Would it be
reasonable to add David's linfit() as "the more complete" version, and
refactor linregress() to use linfit() and return its current return tuple
derived from the linfit results? Perhaps that's what Josef is suggesting
too.

FWIW, I would generally prefer getting a dictionary of results as a
return value instead of a tuple with more than 2 items.

--Matt Newville

_______________________________________________
SciPy-User mailing list
[hidden email]
http://mail.scipy.org/mailman/listinfo/scipy-user |
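[Editorial aside: the single-regressor weighted fit discussed above has a closed-form solution. The sketch below illustrates that math (normal-equation sums, treating sigmay as absolute y-errors) and is not the actual linfit submission.]

```python
import numpy as np

def weighted_line_fit(x, y, sigmay=None):
    """Weighted least-squares fit of y = a + b*x.

    Closed-form normal-equation sums for a straight line; a sketch of
    the kind of single-regressor weighted fit discussed in this thread,
    not the submitted linfit routine.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Weights are 1/sigma^2; unweighted fit if no errors are given.
    if sigmay is None:
        w = np.ones_like(x)
    else:
        w = 1.0 / np.asarray(sigmay, dtype=float) ** 2

    S = w.sum()
    Sx, Sy = (w * x).sum(), (w * y).sum()
    Sxx, Sxy = (w * x * x).sum(), (w * x * y).sum()
    delta = S * Sxx - Sx * Sx

    a = (Sxx * Sy - Sx * Sxy) / delta   # intercept
    b = (S * Sxy - Sx * Sy) / delta     # slope
    # Covariance of (a, b), valid when sigmay are absolute errors.
    cov = np.array([[Sxx, -Sx], [-Sx, S]]) / delta
    return np.array([a, b]), cov
```

Because the solution is a handful of weighted sums, no explicit matrix inversion or iterative solver is needed, which is where a dedicated line-fit routine can gain speed over a general-purpose one.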
I guess my preference would be to have linfit() be as similar to
curve_fit() in outputs (and in inputs insofar as it makes sense), and
then, if we decide we prefer another way of doing either the inputs or
the outputs, to do them in concert. I think there is real value in making
the user interfaces of linfit() and curve_fit() consistent--it makes the
user's experience much less confusing. As of right now, I am agnostic
about whether or not the function returns a dictionary of
results--although I am unsure of what you have in mind. How would you
structure a dictionary of results?

David |
Hi David,
On Wed, Dec 4, 2013 at 1:13 PM, David J Pine <[hidden email]> wrote:
> I guess my preference would be to have linfit() be as similar to
> curve_fit() in outputs (and in inputs insofar as it makes sense), and
> then, if we decide we prefer another way of doing either the inputs or
> the outputs, to do them in concert. I think there is real value in
> making the user interfaces of linfit() and curve_fit() consistent--it
> makes the user's experience much less confusing. As of right now, I am
> agnostic about whether or not the function returns a dictionary of
> results--although I am unsure of what you have in mind. How would you
> structure a dictionary of results?

Using return (pbest, covar) seems reasonable. But, if you returned a
dictionary, you could include a chi-square statistic and a residuals
array.

scipy.optimize.leastsq() returns 5 items: (pbest, covar, infodict, mesg,
ier), with infodict being a dict with items 'nfev', 'fvec', 'fjac',
'ipvt', and 'qtf'. I think it's too late to change it, but it would have
been nicer (IMHO) if it had returned a single dict instead:

    return {'best_values': pbest, 'covar': covar,
            'nfev': infodict['nfev'], 'fvec': infodict['fvec'],
            'fjac': infodict['fjac'], 'ipvt': infodict['ipvt'],
            'qtf': infodict['qtf'], 'mesg': mesg, 'ier': ier}

Similarly, linregress() returns a 5-element tuple. The problem with these
is that you end up with long assignments:

    slope, intercept, r_value, p_value, stderr = scipy.stats.linregress(xdata, ydata)

In fact, you sort of have to do this, even for a quick-and-dirty result
when slope and intercept are all that would be used later on. The central
problem is that these 5 returned values are now in your local namespace,
but they are not really independent values. Instead, you could think
about:

    regression = scipy.stats.linregress(xdata, ydata)

and get at any of the values computed in the regression that you want.
In short, if you had linfit() return a dictionary of values, you could
put many statistics in it, and people who wanted to ignore some of them
would be able to do so. FWIW, a named tuple would be a fine alternative.
I don't know if backward compatibility would prevent that in scipy.
Anyway, it's just a suggestion....

--Matt |
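[Editorial aside: to make the dictionary-versus-tuple trade-off concrete, here is a small sketch using a namedtuple. The container and field names are hypothetical, not an agreed scipy interface; the point is that a namedtuple supports both named access and, for quick use, positional unpacking.]

```python
from collections import namedtuple

# Hypothetical result container for the return-value discussion.
LinfitResult = namedtuple("LinfitResult",
                          ["fit", "cov", "redchisq", "residuals"])

def summarize(res):
    """Callers can pick fields by name instead of counting positions."""
    return "slope=%.3f redchisq=%.3f" % (res.fit[0], res.redchisq)

res = LinfitResult(fit=(2.0, 1.0), cov=((0.1, 0.0), (0.0, 0.2)),
                   redchisq=1.1, residuals=(0.1, -0.1))
print(summarize(res))                  # prints "slope=2.000 redchisq=1.100"
fit, cov, redchisq, residuals = res    # positional unpacking still works
```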
Ok, here are my thoughts about how to handle the returns. They are informed by (1) the speed of linfit and (2) the above discussion. (1) Speed. linfit runs fastest when the residuals are not calculated. Calculating the residuals generally slows linfit down by a factor of 2 to 3--and it's the only thing that really slows it down. After that, all additional calculations take negligible time. The residuals are calculated if (a) residuals=True or chisq=True, or (b) cov=True AND relsigma=True. Note that (b) means the residuals are not calculated when cov=True AND relsigma=False (and residuals=False and chisq=False).
(2) The consensus of the discussion seems to be that when a lot of things are returned by linfit, it's better to return everything as a dictionary. So here is what I propose:
If the user wants only the optimal fitting parameters, or only the optimal fitting parameters and the covariance matrix, these can be returned as arrays. Otherwise, everything is returned, and returned as a dictionary.
If we adopted this, then the only question is what the default setting would be, say return_all=False or return_all=True. I guess I would opt for return_all=False, the less verbose return option.
Adopting this way of doing things would simplify the arguments of linfit, which would now look like

    linfit(x, y, sigmay=None, relsigma=True, return_all=False)

I would also modify linfit to calculate the r-value, p-value, and stderr, which would all be returned in the dictionary when return_all=True. How does this sound? David
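[Editorial aside: a sketch of how the proposed interface might be called. The linfit below is a hypothetical stand-in built on numpy.polyfit--unweighted, so sigmay and relsigma are accepted but ignored--purely to illustrate the two proposed call patterns; it is not the submitted routine.]

```python
import numpy as np

def linfit(x, y, sigmay=None, relsigma=True, return_all=False):
    """Stand-in with the proposed signature (linfit is not in scipy).

    Unweighted sketch: sigmay and relsigma are ignored here, only the
    two proposed return modes are demonstrated.
    """
    fit, cov = np.polyfit(x, y, 1, cov=True)
    if not return_all:
        # Minimal return: best-fit parameters and covariance as arrays.
        return fit, cov
    # Verbose return: everything in one dictionary.
    residuals = y - np.polyval(fit, x)
    redchisq = (residuals ** 2).sum() / (len(x) - 2)
    return {"fit": fit, "cov": cov,
            "residuals": residuals, "redchisq": redchisq}

x = np.arange(6, dtype=float)
y = 2.0 * x + 1.0 + np.array([0.1, -0.1, 0.05, -0.05, 0.1, -0.1])
fit, cov = linfit(x, y)                  # default: just the arrays
results = linfit(x, y, return_all=True)  # everything, as a dictionary
```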
On 04/12/2013 23:00, David J Pine wrote:
> Otherwise, everything is returned, and returned as a dictionary.

I'll repeat myself: a named tuple is the way to go, not a dictionary.

Cheers,
Daniele |
On Wed, Dec 4, 2013 at 5:29 PM, Daniele Nicolodi <[hidden email]> wrote:
> On 04/12/2013 23:00, David J Pine wrote:
>> Otherwise, everything is returned, and returned as a dictionary.
>
> I'll repeat myself: a named tuple is the way to go, not a dictionary.

namedtuples have the disadvantage that users use tuple unpacking, and
then it again breaks backwards compatibility if any return is changed in
the future.

Josef |
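[Editorial aside: the backward-compatibility concern can be demonstrated in a few lines (type and field names hypothetical): positional unpacking hard-codes the number of fields, so adding a field later breaks existing callers, while attribute access keeps working.]

```python
from collections import namedtuple

# Version 1 of a hypothetical result type: two fields.
FitV1 = namedtuple("FitV1", ["slope", "intercept"])
slope, intercept = FitV1(2.0, 1.0)   # 2-field unpacking works today

# Version 2 adds a field, as a future release might.
FitV2 = namedtuple("FitV2", ["slope", "intercept", "redchisq"])
res = FitV2(2.0, 1.0, 1.1)
print(res.slope)                     # attribute access survives the change
try:
    slope, intercept = res           # old 2-field unpacking now fails
except ValueError:
    print("tuple unpacking broke after adding a field")
```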
On 04/12/2013 23:42, [hidden email] wrote:
> On Wed, Dec 4, 2013 at 5:29 PM, Daniele Nicolodi <[hidden email]> wrote:
>> On 04/12/2013 23:00, David J Pine wrote:
>>> Otherwise, everything is returned, and returned as a dictionary.
>>
>> I'll repeat myself: a named tuple is the way to go, not a dictionary.
>
> namedtuples have the disadvantage that users use tuple unpacking and
> then it breaks again backwards compatibility if any return is changed
> in future.

I frankly don't see how someone may want to extend the interface of a
linear fitting routine to return more information in the future. I think
all the information required to design the interface right is already
available.

Cheers,
Daniele |