I would like to submit a routine to scipy for performing least squares fitting of a straight line f(x) = ax + b to an (x,y) data set. There are a number of ways of doing this currently using scipy or numpy but all have serious drawbacks. Here is what is currently available, as far as I can tell, and what seem to me to be their drawbacks.
1. numpy.polyfit : a. It is slower than it needs to be. polyfit uses matrix methods that are needed to find best fits to general polynomials (quadratic, cubic, quartic, and higher orders), but matrix methods are overkill when you just want to fit a straight line f(x)= ax + b to data set. A direct approach can yield fits significantly faster. b. polyfit currently does not allow using absolute error estimates for weighting the data; only relative error estimates are currently possible. This can be fixed, but for the moment it's a problem. c. New or inexperienced uses are unlikely to look for a routine to fit straight lines in a routine that is advertised as being for polynomials. This is a more important point than it may seem. Fitting data to a straight line is probably the most common curve fitting task performed, and the only one that many users will ever use. It makes sense to cater to such users by providing them with a routine that does what they want in as clear and straightforward a manner as possible. I am a physics professor and have seen the confusion first hand with a wide spectrum of students who are new to Python. It should not be this hard for them. 2. scipy.linalg.lstsq a. Using linalg.lstsq to fit a straight line is clunky and very slow (on the order of 10 times slower than polyfit, which is already slower than it needs to be). b. While linalg.lstsq can be used to fit data with error estimates (i.e. using weighting), how to do this is far from obvious. It's unlikely that anyone but an expert would figure out how to do it. c. linalg.lstsq requires the use of matrices, which will be unfamiliar to some users. Moreover, it should not be necessary to use matrices when the task at hand only involves one-dimensional arrays. 3. scipy.curve_fit a. This is a nonlinear fitting routine. As such, it searches for the global minimum in the objective function (chi-squared) rather than just calculating where the global minimum is using the analytical expressions for the best fits. It's the wrong method for the problem, although it will work. Questions: What do others in the scientific Python community think about the need for such a routine? Where should routine to fit data to a straight line go? It would seem to me that it should go in the scipy.optimize package, but I wonder what others think. David Pine _______________________________________________ SciPy-User mailing list [hidden email] http://mail.scipy.org/mailman/listinfo/scipy-user |
On Sun, Mar 24, 2013 at 7:53 PM, David Pine <[hidden email]> wrote:
> I would like to submit a routine to scipy for performing least squares fitting of a straight line f(x) = ax + b to an (x,y) data set. There are a number of ways of doing this currently using scipy or numpy but all have serious drawbacks. Here is what is currently available, as far as I can tell, and what seem to me to be their drawbacks. > > 1. numpy.polyfit : > a. It is slower than it needs to be. polyfit uses matrix methods that are needed to find best fits to general polynomials (quadratic, cubic, quartic, and higher orders), but matrix methods are overkill when you just want to fit a straight line f(x)= ax + b to data set. A direct approach can yield fits significantly faster. > b. polyfit currently does not allow using absolute error estimates for weighting the data; only relative error estimates are currently possible. This can be fixed, but for the moment it's a problem. > c. New or inexperienced uses are unlikely to look for a routine to fit straight lines in a routine that is advertised as being for polynomials. This is a more important point than it may seem. Fitting data to a straight line is probably the most common curve fitting task performed, and the only one that many users will ever use. It makes sense to cater to such users by providing them with a routine that does what they want in as clear and straightforward a manner as possible. I am a physics professor and have seen the confusion first hand with a wide spectrum of students who are new to Python. It should not be this hard for them. > > 2. scipy.linalg.lstsq > a. Using linalg.lstsq to fit a straight line is clunky and very slow (on the order of 10 times slower than polyfit, which is already slower than it needs to be). > b. While linalg.lstsq can be used to fit data with error estimates (i.e. using weighting), how to do this is far from obvious. It's unlikely that anyone but an expert would figure out how to do it. > c. linalg.lstsq requires the use of matrices, which will be unfamiliar to some users. Moreover, it should not be necessary to use matrices when the task at hand only involves one-dimensional arrays. > > 3. scipy.curve_fit > a. This is a nonlinear fitting routine. As such, it searches for the global minimum in the objective function (chi-squared) rather than just calculating where the global minimum is using the analytical expressions for the best fits. It's the wrong method for the problem, although it will work. > > Questions: What do others in the scientific Python community think about the need for such a routine? Where should routine to fit data to a straight line go? It would seem to me that it should go in the scipy.optimize package, but I wonder what others think. scipy.stats.linregress if there is only one x Josef > David Pine > _______________________________________________ > SciPy-User mailing list > [hidden email] > http://mail.scipy.org/mailman/listinfo/scipy-user _______________________________________________ SciPy-User mailing list [hidden email] http://mail.scipy.org/mailman/listinfo/scipy-user |
In reply to this post by David Pine
On 3/24/13, David Pine <[hidden email]> wrote:
> I would like to submit a routine to scipy for performing least squares > fitting of a straight line f(x) = ax + b to an (x,y) data set. There are a > number of ways of doing this currently using scipy or numpy but all have > serious drawbacks. Here is what is currently available, as far as I can > tell, and what seem to me to be their drawbacks. > > 1. numpy.polyfit : > a. It is slower than it needs to be. polyfit uses matrix methods that > are needed to find best fits to general polynomials (quadratic, cubic, > quartic, and higher orders), but matrix methods are overkill when you just > want to fit a straight line f(x)= ax + b to data set. A direct approach can > yield fits significantly faster. > b. polyfit currently does not allow using absolute error estimates for > weighting the data; only relative error estimates are currently possible. > This can be fixed, but for the moment it's a problem. > c. New or inexperienced uses are unlikely to look for a routine to fit > straight lines in a routine that is advertised as being for polynomials. > This is a more important point than it may seem. Fitting data to a straight > line is probably the most common curve fitting task performed, and the only > one that many users will ever use. It makes sense to cater to such users by > providing them with a routine that does what they want in as clear and > straightforward a manner as possible. I am a physics professor and have > seen the confusion first hand with a wide spectrum of students who are new > to Python. It should not be this hard for them. > > 2. scipy.linalg.lstsq > a. Using linalg.lstsq to fit a straight line is clunky and very slow > (on the order of 10 times slower than polyfit, which is already slower than > it needs to be). > b. While linalg.lstsq can be used to fit data with error estimates > (i.e. using weighting), how to do this is far from obvious. It's unlikely > that anyone but an expert would figure out how to do it. > c. linalg.lstsq requires the use of matrices, which will be unfamiliar > to some users. Moreover, it should not be necessary to use matrices when > the task at hand only involves one-dimensional arrays. > > 3. scipy.curve_fit > a. This is a nonlinear fitting routine. As such, it searches for the > global minimum in the objective function (chi-squared) rather than just > calculating where the global minimum is using the analytical expressions for > the best fits. It's the wrong method for the problem, although it will > work. > > Questions: What do others in the scientific Python community think about > the need for such a routine? Where should routine to fit data to a > straight line go? It would seem to me that it should go in the > scipy.optimize package, but I wonder what others think. > > David Pine David, There is also scipy.stats.linregress, which is a basic 1-D (ie. x and y are 1-D vectors) linear regression. Warren > _______________________________________________ > SciPy-User mailing list > [hidden email] > http://mail.scipy.org/mailman/listinfo/scipy-user > _______________________________________________ SciPy-User mailing list [hidden email] http://mail.scipy.org/mailman/listinfo/scipy-user |
In reply to this post by josef.pktd
scipy.stats.linregress does not do the job. First, it does not allow for weighting of the data, relative, absolute, or otherwise. It does not return the covariance matrix, which provides estimates of the uncertainties in the fitting parameters, nor does it return chi-squared, which is the standard measure of the quality of the fit in the physical sciences.
David On Mar 24, 2013, at 8:19 PM, [hidden email] wrote: > On Sun, Mar 24, 2013 at 7:53 PM, David Pine <[hidden email]> wrote: >> I would like to submit a routine to scipy for performing least squares fitting of a straight line f(x) = ax + b to an (x,y) data set. There are a number of ways of doing this currently using scipy or numpy but all have serious drawbacks. Here is what is currently available, as far as I can tell, and what seem to me to be their drawbacks. >> >> 1. numpy.polyfit : >> a. It is slower than it needs to be. polyfit uses matrix methods that are needed to find best fits to general polynomials (quadratic, cubic, quartic, and higher orders), but matrix methods are overkill when you just want to fit a straight line f(x)= ax + b to data set. A direct approach can yield fits significantly faster. >> b. polyfit currently does not allow using absolute error estimates for weighting the data; only relative error estimates are currently possible. This can be fixed, but for the moment it's a problem. >> c. New or inexperienced uses are unlikely to look for a routine to fit straight lines in a routine that is advertised as being for polynomials. This is a more important point than it may seem. Fitting data to a straight line is probably the most common curve fitting task performed, and the only one that many users will ever use. It makes sense to cater to such users by providing them with a routine that does what they want in as clear and straightforward a manner as possible. I am a physics professor and have seen the confusion first hand with a wide spectrum of students who are new to Python. It should not be this hard for them. >> >> 2. scipy.linalg.lstsq >> a. Using linalg.lstsq to fit a straight line is clunky and very slow (on the order of 10 times slower than polyfit, which is already slower than it needs to be). >> b. While linalg.lstsq can be used to fit data with error estimates (i.e. using weighting), how to do this is far from obvious. It's unlikely that anyone but an expert would figure out how to do it. >> c. linalg.lstsq requires the use of matrices, which will be unfamiliar to some users. Moreover, it should not be necessary to use matrices when the task at hand only involves one-dimensional arrays. >> >> 3. scipy.curve_fit >> a. This is a nonlinear fitting routine. As such, it searches for the global minimum in the objective function (chi-squared) rather than just calculating where the global minimum is using the analytical expressions for the best fits. It's the wrong method for the problem, although it will work. >> >> Questions: What do others in the scientific Python community think about the need for such a routine? Where should routine to fit data to a straight line go? It would seem to me that it should go in the scipy.optimize package, but I wonder what others think. > > scipy.stats.linregress if there is only one x > > Josef > > >> David Pine >> _______________________________________________ >> SciPy-User mailing list >> [hidden email] >> http://mail.scipy.org/mailman/listinfo/scipy-user > _______________________________________________ > SciPy-User mailing list > [hidden email] > http://mail.scipy.org/mailman/listinfo/scipy-user _______________________________________________ SciPy-User mailing list [hidden email] http://mail.scipy.org/mailman/listinfo/scipy-user |
On Sun, 24 Mar 2013 21:27:09 -0400
David Pine <[hidden email]> wrote: > scipy.stats.linregress does not do the job. First, it does not allow for weighting of the data, relative, absolute, or otherwise. It does not return the covariance matrix, which provides estimates of the uncertainties in the fitting parameters, nor does it return chi-squared, which is the standard measure of the quality of the fit in the physical sciences. I implemented a linear regression according to what I found in stats textbooks (and various pages of wikipedia) ... (arrays are 2D, considered as m datasets of length n Sx = (w * x).sum(axis= -1) Sy = (w * y).sum(axis= -1) Sxx = (w * x * x).sum(axis= -1) Sxy = (w * y * x).sum(axis= -1) Syy = (w * y * y).sum(axis= -1) Sw = w.sum(axis= -1) slope = (Sw * Sxy - Sx * Sy) / (Sw * Sxx - Sx * Sx) intercept = (Sy - Sx * slope) / Sw df = n - 2 r_num = ssxym = (Sw * Sxy) - (Sx * Sy) ssxm = Sw * Sxx - Sx * Sx ssym = Sw * Syy - Sy * Sy r_den = numpy.sqrt(ssxm * ssym) correlationR = r_num / r_den correlationR[r_den == 0] = 0.0 correlationR[correlationR > 1.0] = 1.0 # Numerical errors correlationR[correlationR < -1.0] = -1.0 # Numerical errors sterrest = numpy.sqrt((1.0 - correlationR * correlationR) * ssym / ssxm / df) Hope this helps -- Jérôme Kieffer Data analysis unit - ESRF _______________________________________________ SciPy-User mailing list [hidden email] http://mail.scipy.org/mailman/listinfo/scipy-user |
In reply to this post by David Pine
On Mon, Mar 25, 2013 at 2:27 AM, David Pine <[hidden email]> wrote: scipy.stats.linregress does not do the job. First, it does not allow for weighting of the data, relative, absolute, or otherwise. Weighting could be easily added to linregress, that would be useful. It does not return the covariance matrix, which provides estimates of the uncertainties in the fitting parameters, nor does it return chi-squared, which is the standard measure of the quality of the fit in the physical sciences. These are a little trickier to add, would require adding a full_output=False input because just adding a return value would break backwards compatibility. But it's also possible. Would you like to have a go at adding your suggestions to linregress? There's still the issue that if you enhance stats.linregress it will be hard to find. Fitting tools are scattered all over, and it's not clear what the best solution is. A doc section giving an overview of all fit functionality would be a good start. See https://github.com/scipy/scipy/pull/448 for a recent discussion on that topic. Ralf David _______________________________________________ SciPy-User mailing list [hidden email] http://mail.scipy.org/mailman/listinfo/scipy-user |
Free forum by Nabble | Edit this page |