# [SciPy-User] calculating numerous linear regressions quickly Classic List Threaded 9 messages Open this post in threaded view
|

## [SciPy-User] calculating numerous linear regressions quickly

 Given some geospatial grid with a time dimension V[t, lat, lon], I want to compute the trend at each spatial point in the domain. Essentially I am trying to compute many linear regressions in the form: y = mx+b where y is the predicted value of V, x is the time coordinate array. The coordinates t, lat, lon at all equispaced 1-D arrays, so the predictor (x, or t) will be the same for each regression. I want to gather the regression coefficients (m,b), correlation, and p-value for the temporal trend at each spatial point. This can be directly accomplished by repeatedly calling stats.linregress inside of a loop for every [lat, lon] point in the domain, but it is not efficient. The challenge is that I need to compute a lot of them quickly and a python loop is proving very slow. I feel like there should be some version of stats.linregress that accepts and returns multidimensional without being forced into using a python loop. Suggestions? Thanks, Bryan _______________________________________________ SciPy-User mailing list [hidden email] http://mail.scipy.org/mailman/listinfo/scipy-user bwoods.vcf (355 bytes) Download Attachment
Open this post in threaded view
|

## Re: calculating numerous linear regressions quickly

 Hi Bryan, The discussion/links here might be useful: http://stackoverflow.com/questions/20343500/efficient-1d-linear-regression-for-each-element-of-3d-numpy-array-David On Jan 13, 2014, at 11:49 AM, Bryan Woods <[hidden email]> wrote: > Given some geospatial grid with a time dimension V[t, lat, lon], I want to compute the trend at each spatial point in the domain. Essentially I am trying to compute many linear regressions in the form: > y = mx+b > where y is the predicted value of V, x is the time coordinate array. The coordinates t, lat, lon at all equispaced 1-D arrays, so the predictor (x, or t) will be the same for each regression. I want to gather the regression coefficients (m,b), correlation, and p-value for the temporal trend at each spatial point. This can be directly accomplished by repeatedly calling stats.linregress inside of a loop for every [lat, lon] point in the domain, but it is not efficient. > > The challenge is that I need to compute a lot of them quickly and a python loop is proving very slow. I feel like there should be some version of stats.linregress that accepts and returns multidimensional without being forced into using a python loop. Suggestions? > > Thanks, > Bryan > _______________________________________________ > SciPy-User mailing list > [hidden email] > http://mail.scipy.org/mailman/listinfo/scipy-user_______________________________________________ SciPy-User mailing list [hidden email] http://mail.scipy.org/mailman/listinfo/scipy-user
Open this post in threaded view
|

## Re: calculating numerous linear regressions quickly

 In reply to this post by Bryan Woods-3 On Mon, Jan 13, 2014 at 2:49 PM, Bryan Woods <[hidden email]> wrote: > Given some geospatial grid with a time dimension V[t, lat, lon], I want to > compute the trend at each spatial point in the domain. Essentially I am > trying to compute many linear regressions in the form: > y = mx+b > where y is the predicted value of V, x is the time coordinate array. The > coordinates t, lat, lon at all equispaced 1-D arrays, so the predictor (x, > or t) will be the same for each regression. I want to gather the regression > coefficients (m,b), correlation, and p-value for the temporal trend at each > spatial point. This can be directly accomplished by repeatedly calling > stats.linregress inside of a loop for every [lat, lon] point in the domain, > but it is not efficient. > > The challenge is that I need to compute a lot of them quickly and a python > loop is proving very slow. I feel like there should be some version of > stats.linregress that accepts and returns multidimensional without being > forced into using a python loop. Suggestions? That can be done completely without loops. reshape the grid to 2d (t, nlat*nlong)  -> Y trend = np.vander(t, 2) (m,b) = np.linalg.pinv(trend).dot(Y) and then a few more array operations to get the other statistics. I can try to do it later if needed. Josef > > Thanks, > Bryan > > _______________________________________________ > SciPy-User mailing list > [hidden email] > http://mail.scipy.org/mailman/listinfo/scipy-user> _______________________________________________ SciPy-User mailing list [hidden email] http://mail.scipy.org/mailman/listinfo/scipy-user
Open this post in threaded view
|

## Re: calculating numerous linear regressions quickly

 In reply to this post by David Shean-2 Since the points are equally spaced and you want the coefficients of a low order polynomial you should be able to use a analytical solution to the linear least squares problem.  The Savitzky-Golay filter can be used to calculate the derivatives which are related to the polynomials coefficients. S-G isn't in Scipy but there is a cookbook page in the wiki: http://wiki.scipy.org/Cookbook/SavitzkyGolay     - Jonathan Helmus On 01/13/2014 02:02 PM, David Shean wrote: > Hi Bryan, > The discussion/links here might be useful: > http://stackoverflow.com/questions/20343500/efficient-1d-linear-regression-for-each-element-of-3d-numpy-array> -David > > On Jan 13, 2014, at 11:49 AM, Bryan Woods <[hidden email]> wrote: > >> Given some geospatial grid with a time dimension V[t, lat, lon], I want to compute the trend at each spatial point in the domain. Essentially I am trying to compute many linear regressions in the form: >> y = mx+b >> where y is the predicted value of V, x is the time coordinate array. The coordinates t, lat, lon at all equispaced 1-D arrays, so the predictor (x, or t) will be the same for each regression. I want to gather the regression coefficients (m,b), correlation, and p-value for the temporal trend at each spatial point. This can be directly accomplished by repeatedly calling stats.linregress inside of a loop for every [lat, lon] point in the domain, but it is not efficient. >> >> The challenge is that I need to compute a lot of them quickly and a python loop is proving very slow. I feel like there should be some version of stats.linregress that accepts and returns multidimensional without being forced into using a python loop. Suggestions? >> >> Thanks, >> Bryan >> _______________________________________________ >> SciPy-User mailing list >> [hidden email] >> http://mail.scipy.org/mailman/listinfo/scipy-user> _______________________________________________ > SciPy-User mailing list > [hidden email] > http://mail.scipy.org/mailman/listinfo/scipy-user_______________________________________________ SciPy-User mailing list [hidden email] http://mail.scipy.org/mailman/listinfo/scipy-user
Open this post in threaded view
|

## Re: calculating numerous linear regressions quickly

 Le 13/01/2014 15:09, Jonathan Helmus a écrit : > Since the points are equally spaced and you want the coefficients of a > low order polynomial you should be able to use a analytical solution to > the linear least squares problem.  The Savitzky-Golay filter can be used > to calculate the derivatives which are related to the polynomials > coefficients. S-G isn't in Scipy but there is a cookbook page in the > wiki: http://wiki.scipy.org/Cookbook/SavitzkyGolayIt's in master actually: https://github.com/scipy/scipy/blob/ed6b0fbd339fa3b48e56116760e2fba8d583fcaa/scipy/signal/_savitzky_golay.pyBest. François. -- François Boulogne. http://www.sciunto.orgGPG fingerprint: 25F6 C971 4875 A6C1 EDD1  75C8 1AA7 216E 32D5 F22F _______________________________________________ SciPy-User mailing list [hidden email] http://mail.scipy.org/mailman/listinfo/scipy-user
Open this post in threaded view
|

## Re: calculating numerous linear regressions quickly

 In reply to this post by josef.pktd On Mon, Jan 13, 2014 at 3:06 PM,  <[hidden email]> wrote: > On Mon, Jan 13, 2014 at 2:49 PM, Bryan Woods <[hidden email]> wrote: >> Given some geospatial grid with a time dimension V[t, lat, lon], I want to >> compute the trend at each spatial point in the domain. Essentially I am >> trying to compute many linear regressions in the form: >> y = mx+b >> where y is the predicted value of V, x is the time coordinate array. The >> coordinates t, lat, lon at all equispaced 1-D arrays, so the predictor (x, >> or t) will be the same for each regression. I want to gather the regression >> coefficients (m,b), correlation, and p-value for the temporal trend at each >> spatial point. This can be directly accomplished by repeatedly calling >> stats.linregress inside of a loop for every [lat, lon] point in the domain, >> but it is not efficient. >> >> The challenge is that I need to compute a lot of them quickly and a python >> loop is proving very slow. I feel like there should be some version of >> stats.linregress that accepts and returns multidimensional without being >> forced into using a python loop. Suggestions? > > That can be done completely without loops. > > reshape the grid to 2d (t, nlat*nlong)  -> Y > trend = np.vander(t, 2) > (m,b) = np.linalg.pinv(trend).dot(Y) > > and then a few more array operations to get the other statistics. > > I can try to do it later if needed. some quickly written version this is for general x matrix as long as it is common to all regression That's pretty much how statsmodels calculates OLS (sm.OLS is not vectorized but has many other goodies instead). A version that uses that there is only one slope coefficient might be a bit faster, but I don't have that memorized. That would be copying linregress and vectorizing it. Josef ------------------------------- # -*- coding: utf-8 -*- """multivariate OLS, vectorized independent OLS with common exog Created on Mon Jan 13 19:33:44 2014 Author: Josef Perktold """ import numpy as np from statsmodels.regression.linear_model import OLS nobs = 20 k_y = 30 # generate trend trend = np.linspace(-1, 1, nobs) x = np.vander(trend, 2) assert x.shape == (nobs, 2) # generate random sample beta = 1 + np.random.randn(2, k_y) y_true = np.dot(x, beta) y = y_true + 0.5 * np.random.randn(*y_true.shape) ######## estimate # x is common design matrix (nobs, k_vars) # y are independent/response variables (nobs, k_y) xpinv = np.linalg.pinv(x) params = xpinv.dot(y) # get some additional results xxinv = np.dot(xpinv, xpinv.T) fitted = np.dot(x, params) resid = y - fitted y_mean = y.mean(0) tss = ((y - y_mean)**2).sum(0) rss = (resid**2).sum(0) r_squared = 1 - rss / tss df_resid = nobs - 2 mse = rss / df_resid bse = np.sqrt(np.diag(xxinv)[:, None] * mse)  # standard error of params ######## # reference statsmodels OLS from collections import defaultdict as DefaultDict results = DefaultDict(list) for yj in y.T:     res = OLS(yj, x).fit()     results['params'].append(res.params)     results['bse'].append(res.bse)     results['r_squared'].append(res.rsquared)     results['mse'].append(res.mse_resid) print '\nparameters' print np.column_stack((params.T, results['params'])) print '\nstandard error of parameter estimates' print np.column_stack((bse.T, results['bse'])) print '\nR-squared' print np.column_stack((r_squared.T, results['r_squared'])) from numpy.testing import assert_allclose assert_allclose(params.T, results['params'], rtol=1e-13) assert_allclose(bse.T, results['bse'], rtol=1e-13) assert_allclose(r_squared.T, results['r_squared'], rtol=1e-13) --------------- > > Josef > >> >> Thanks, >> Bryan >> >> _______________________________________________ >> SciPy-User mailing list >> [hidden email] >> http://mail.scipy.org/mailman/listinfo/scipy-user>> _______________________________________________ SciPy-User mailing list [hidden email] http://mail.scipy.org/mailman/listinfo/scipy-user
Open this post in threaded view
|

## Re: calculating numerous linear regressions quickly

 In reply to this post by François Boulogne-3 On Mon, 13 Jan 2014 17:10:40 -0500 François Boulogne <[hidden email]> wrote: > It's in master actually: > https://github.com/scipy/scipy/blob/ed6b0fbd339fa3b48e56116760e2fba8d583fcaa/scipy/signal/_savitzky_golay.pyHi, I was wondering if the 2D-Savitsly-Golay were of some interest for some of us: http://research.microsoft.com/en-us/um/people/jckrumm/SavGol/SavGol.htmI never used them personnaly but I often use the 1D version and I found the idea interesting for cheap de-noising continuous signal. Cheers, -- Jérôme Kieffer Data analysis unit - ESRF _______________________________________________ SciPy-User mailing list [hidden email] http://mail.scipy.org/mailman/listinfo/scipy-user