[SciPy-User] calculating numerous linear regressions quickly

classic Classic list List threaded Threaded
9 messages Options
Reply | Threaded
Open this post in threaded view
|

[SciPy-User] calculating numerous linear regressions quickly

Bryan Woods-3
Given some geospatial grid with a time dimension V[t, lat, lon], I want
to compute the trend at each spatial point in the domain. Essentially I
am trying to compute many linear regressions in the form:
y = mx+b
where y is the predicted value of V, x is the time coordinate array. The
coordinates t, lat, lon at all equispaced 1-D arrays, so the predictor
(x, or t) will be the same for each regression. I want to gather the
regression coefficients (m,b), correlation, and p-value for the temporal
trend at each spatial point. This can be directly accomplished by
repeatedly calling stats.linregress inside of a loop for every [lat,
lon] point in the domain, but it is not efficient.

The challenge is that I need to compute a lot of them quickly and a
python loop is proving very slow. I feel like there should be some
version of stats.linregress that accepts and returns multidimensional
without being forced into using a python loop. Suggestions?

Thanks,
Bryan

_______________________________________________
SciPy-User mailing list
[hidden email]
http://mail.scipy.org/mailman/listinfo/scipy-user

bwoods.vcf (355 bytes) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: calculating numerous linear regressions quickly

David Shean-2
Hi Bryan,
The discussion/links here might be useful:
http://stackoverflow.com/questions/20343500/efficient-1d-linear-regression-for-each-element-of-3d-numpy-array
-David

On Jan 13, 2014, at 11:49 AM, Bryan Woods <[hidden email]> wrote:

> Given some geospatial grid with a time dimension V[t, lat, lon], I want to compute the trend at each spatial point in the domain. Essentially I am trying to compute many linear regressions in the form:
> y = mx+b
> where y is the predicted value of V, x is the time coordinate array. The coordinates t, lat, lon at all equispaced 1-D arrays, so the predictor (x, or t) will be the same for each regression. I want to gather the regression coefficients (m,b), correlation, and p-value for the temporal trend at each spatial point. This can be directly accomplished by repeatedly calling stats.linregress inside of a loop for every [lat, lon] point in the domain, but it is not efficient.
>
> The challenge is that I need to compute a lot of them quickly and a python loop is proving very slow. I feel like there should be some version of stats.linregress that accepts and returns multidimensional without being forced into using a python loop. Suggestions?
>
> Thanks,
> Bryan
> <bwoods.vcf>_______________________________________________
> SciPy-User mailing list
> [hidden email]
> http://mail.scipy.org/mailman/listinfo/scipy-user

_______________________________________________
SciPy-User mailing list
[hidden email]
http://mail.scipy.org/mailman/listinfo/scipy-user
Reply | Threaded
Open this post in threaded view
|

Re: calculating numerous linear regressions quickly

josef.pktd
In reply to this post by Bryan Woods-3
On Mon, Jan 13, 2014 at 2:49 PM, Bryan Woods <[hidden email]> wrote:

> Given some geospatial grid with a time dimension V[t, lat, lon], I want to
> compute the trend at each spatial point in the domain. Essentially I am
> trying to compute many linear regressions in the form:
> y = mx+b
> where y is the predicted value of V, x is the time coordinate array. The
> coordinates t, lat, lon at all equispaced 1-D arrays, so the predictor (x,
> or t) will be the same for each regression. I want to gather the regression
> coefficients (m,b), correlation, and p-value for the temporal trend at each
> spatial point. This can be directly accomplished by repeatedly calling
> stats.linregress inside of a loop for every [lat, lon] point in the domain,
> but it is not efficient.
>
> The challenge is that I need to compute a lot of them quickly and a python
> loop is proving very slow. I feel like there should be some version of
> stats.linregress that accepts and returns multidimensional without being
> forced into using a python loop. Suggestions?

That can be done completely without loops.

reshape the grid to 2d (t, nlat*nlong)  -> Y
trend = np.vander(t, 2)
(m,b) = np.linalg.pinv(trend).dot(Y)

and then a few more array operations to get the other statistics.

I can try to do it later if needed.

Josef

>
> Thanks,
> Bryan
>
> _______________________________________________
> SciPy-User mailing list
> [hidden email]
> http://mail.scipy.org/mailman/listinfo/scipy-user
>
_______________________________________________
SciPy-User mailing list
[hidden email]
http://mail.scipy.org/mailman/listinfo/scipy-user
Reply | Threaded
Open this post in threaded view
|

Re: calculating numerous linear regressions quickly

Jonathan Helmus
In reply to this post by David Shean-2
Since the points are equally spaced and you want the coefficients of a
low order polynomial you should be able to use a analytical solution to
the linear least squares problem.  The Savitzky-Golay filter can be used
to calculate the derivatives which are related to the polynomials
coefficients. S-G isn't in Scipy but there is a cookbook page in the
wiki: http://wiki.scipy.org/Cookbook/SavitzkyGolay

     - Jonathan Helmus

On 01/13/2014 02:02 PM, David Shean wrote:

> Hi Bryan,
> The discussion/links here might be useful:
> http://stackoverflow.com/questions/20343500/efficient-1d-linear-regression-for-each-element-of-3d-numpy-array
> -David
>
> On Jan 13, 2014, at 11:49 AM, Bryan Woods <[hidden email]> wrote:
>
>> Given some geospatial grid with a time dimension V[t, lat, lon], I want to compute the trend at each spatial point in the domain. Essentially I am trying to compute many linear regressions in the form:
>> y = mx+b
>> where y is the predicted value of V, x is the time coordinate array. The coordinates t, lat, lon at all equispaced 1-D arrays, so the predictor (x, or t) will be the same for each regression. I want to gather the regression coefficients (m,b), correlation, and p-value for the temporal trend at each spatial point. This can be directly accomplished by repeatedly calling stats.linregress inside of a loop for every [lat, lon] point in the domain, but it is not efficient.
>>
>> The challenge is that I need to compute a lot of them quickly and a python loop is proving very slow. I feel like there should be some version of stats.linregress that accepts and returns multidimensional without being forced into using a python loop. Suggestions?
>>
>> Thanks,
>> Bryan
>> <bwoods.vcf>_______________________________________________
>> SciPy-User mailing list
>> [hidden email]
>> http://mail.scipy.org/mailman/listinfo/scipy-user
> _______________________________________________
> SciPy-User mailing list
> [hidden email]
> http://mail.scipy.org/mailman/listinfo/scipy-user

_______________________________________________
SciPy-User mailing list
[hidden email]
http://mail.scipy.org/mailman/listinfo/scipy-user
Reply | Threaded
Open this post in threaded view
|

Re: calculating numerous linear regressions quickly

François Boulogne-3
Le 13/01/2014 15:09, Jonathan Helmus a écrit :
> Since the points are equally spaced and you want the coefficients of a
> low order polynomial you should be able to use a analytical solution to
> the linear least squares problem.  The Savitzky-Golay filter can be used
> to calculate the derivatives which are related to the polynomials
> coefficients. S-G isn't in Scipy but there is a cookbook page in the
> wiki: http://wiki.scipy.org/Cookbook/SavitzkyGolay

It's in master actually:
https://github.com/scipy/scipy/blob/ed6b0fbd339fa3b48e56116760e2fba8d583fcaa/scipy/signal/_savitzky_golay.py

Best.
François.

--
François Boulogne.
http://www.sciunto.org
GPG fingerprint: 25F6 C971 4875 A6C1 EDD1  75C8 1AA7 216E 32D5 F22F

_______________________________________________
SciPy-User mailing list
[hidden email]
http://mail.scipy.org/mailman/listinfo/scipy-user
Reply | Threaded
Open this post in threaded view
|

Re: calculating numerous linear regressions quickly

josef.pktd
In reply to this post by josef.pktd
On Mon, Jan 13, 2014 at 3:06 PM,  <[hidden email]> wrote:

> On Mon, Jan 13, 2014 at 2:49 PM, Bryan Woods <[hidden email]> wrote:
>> Given some geospatial grid with a time dimension V[t, lat, lon], I want to
>> compute the trend at each spatial point in the domain. Essentially I am
>> trying to compute many linear regressions in the form:
>> y = mx+b
>> where y is the predicted value of V, x is the time coordinate array. The
>> coordinates t, lat, lon at all equispaced 1-D arrays, so the predictor (x,
>> or t) will be the same for each regression. I want to gather the regression
>> coefficients (m,b), correlation, and p-value for the temporal trend at each
>> spatial point. This can be directly accomplished by repeatedly calling
>> stats.linregress inside of a loop for every [lat, lon] point in the domain,
>> but it is not efficient.
>>
>> The challenge is that I need to compute a lot of them quickly and a python
>> loop is proving very slow. I feel like there should be some version of
>> stats.linregress that accepts and returns multidimensional without being
>> forced into using a python loop. Suggestions?
>
> That can be done completely without loops.
>
> reshape the grid to 2d (t, nlat*nlong)  -> Y
> trend = np.vander(t, 2)
> (m,b) = np.linalg.pinv(trend).dot(Y)
>
> and then a few more array operations to get the other statistics.
>
> I can try to do it later if needed.


some quickly written version
this is for general x matrix as long as it is common to all regression
That's pretty much how statsmodels calculates OLS (sm.OLS is not
vectorized but has many other goodies instead).

A version that uses that there is only one slope coefficient might be
a bit faster, but I don't have that memorized.
That would be copying linregress and vectorizing it.

Josef

-------------------------------
# -*- coding: utf-8 -*-
"""multivariate OLS, vectorized independent OLS with common exog
Created on Mon Jan 13 19:33:44 2014
Author: Josef Perktold
"""
import numpy as np
from statsmodels.regression.linear_model import OLS

nobs = 20
k_y = 30
# generate trend
trend = np.linspace(-1, 1, nobs)
x = np.vander(trend, 2)
assert x.shape == (nobs, 2)

# generate random sample
beta = 1 + np.random.randn(2, k_y)
y_true = np.dot(x, beta)
y = y_true + 0.5 * np.random.randn(*y_true.shape)

######## estimate
# x is common design matrix (nobs, k_vars)
# y are independent/response variables (nobs, k_y)
xpinv = np.linalg.pinv(x)
params = xpinv.dot(y)

# get some additional results
xxinv = np.dot(xpinv, xpinv.T)
fitted = np.dot(x, params)
resid = y - fitted
y_mean = y.mean(0)
tss = ((y - y_mean)**2).sum(0)
rss = (resid**2).sum(0)
r_squared = 1 - rss / tss
df_resid = nobs - 2
mse = rss / df_resid
bse = np.sqrt(np.diag(xxinv)[:, None] * mse)  # standard error of params
########

# reference statsmodels OLS
from collections import defaultdict as DefaultDict
results = DefaultDict(list)
for yj in y.T:
    res = OLS(yj, x).fit()
    results['params'].append(res.params)
    results['bse'].append(res.bse)
    results['r_squared'].append(res.rsquared)
    results['mse'].append(res.mse_resid)


print '\nparameters'
print np.column_stack((params.T, results['params']))
print '\nstandard error of parameter estimates'
print np.column_stack((bse.T, results['bse']))
print '\nR-squared'
print np.column_stack((r_squared.T, results['r_squared']))

from numpy.testing import assert_allclose
assert_allclose(params.T, results['params'], rtol=1e-13)
assert_allclose(bse.T, results['bse'], rtol=1e-13)
assert_allclose(r_squared.T, results['r_squared'], rtol=1e-13)

---------------




>
> Josef
>
>>
>> Thanks,
>> Bryan
>>
>> _______________________________________________
>> SciPy-User mailing list
>> [hidden email]
>> http://mail.scipy.org/mailman/listinfo/scipy-user
>>
_______________________________________________
SciPy-User mailing list
[hidden email]
http://mail.scipy.org/mailman/listinfo/scipy-user
Reply | Threaded
Open this post in threaded view
|

Re: calculating numerous linear regressions quickly

Jerome Kieffer
In reply to this post by François Boulogne-3
On Mon, 13 Jan 2014 17:10:40 -0500
François Boulogne <[hidden email]> wrote:

> It's in master actually:
> https://github.com/scipy/scipy/blob/ed6b0fbd339fa3b48e56116760e2fba8d583fcaa/scipy/signal/_savitzky_golay.py
Hi,

I was wondering if the 2D-Savitsly-Golay were of some interest for some of us:
http://research.microsoft.com/en-us/um/people/jckrumm/SavGol/SavGol.htm

I never used them personnaly but I often use the 1D version and I found the idea interesting for cheap de-noising continuous signal.

Cheers,

--
Jérôme Kieffer
Data analysis unit - ESRF
_______________________________________________
SciPy-User mailing list
[hidden email]
http://mail.scipy.org/mailman/listinfo/scipy-user
Reply | Threaded
Open this post in threaded view
|

Re: calculating numerous linear regressions quickly

François Boulogne-3
Le 14/01/2014 01:26, Jerome Kieffer a écrit :

>
> Hi,
>
> I was wondering if the 2D-Savitsly-Golay were of some interest for some of us:
> http://research.microsoft.com/en-us/um/people/jckrumm/SavGol/SavGol.htm
>
> I never used them personnaly but I often use the 1D version and I found the idea interesting for cheap de-noising continuous signal.
>
> Cheers,
>
I think it would be a great addition and definitely useful!

Cheers.


--
François Boulogne.
http://www.sciunto.org
GPG fingerprint: 25F6 C971 4875 A6C1 EDD1  75C8 1AA7 216E 32D5 F22F

_______________________________________________
SciPy-User mailing list
[hidden email]
http://mail.scipy.org/mailman/listinfo/scipy-user
Reply | Threaded
Open this post in threaded view
|

Re: calculating numerous linear regressions quickly

Sturla Molden-3
In reply to this post by Bryan Woods-3
I had the same issue when computing "lowess" regression. I ended up using a
Fortran subroutine that called the LAPACK  subroutine DGELS. (It is
possible to vectorize for multiple cores using OpenMP or link with a
multithreaded LAPACK/BLAS. What works best is dependent on the problem.)

Sturla

_______________________________________________
SciPy-User mailing list
[hidden email]
http://mail.scipy.org/mailman/listinfo/scipy-user