[SciPy-User] scipy.stats.linregress

classic Classic list List threaded Threaded
11 messages Options
Reply | Threaded
Open this post in threaded view
|

[SciPy-User] scipy.stats.linregress

Jerome Kieffer

Dear Scipy Community,

I am looking at the scipy.stats.linregress code and see:
    # average sum of squares:
    ssxm, ssxym, ssyxm, ssym = np.cov(x, y, bias=1).flat
    r_num = ssxym
    r_den = np.sqrt(ssxm*ssym)
    if r_den == 0.0:
        r = 0.0
    else:
        r = r_num / r_den
        if (r > 1.0): r = 1.0 # from numerical error

if the slope is negative, the correlation factor R is -1, not one so one should add:
       
        if (r < -1.0): r = -1.0 # from numerical error

or did I completely mis-understood the code ?

Cheers,


--
Jérôme Kieffer
On-Line Data analysis / Software Group
ISDD / ESRF
tel +33 476 882 445
_______________________________________________
SciPy-User mailing list
[hidden email]
http://mail.scipy.org/mailman/listinfo/scipy-user
Reply | Threaded
Open this post in threaded view
|

Re: scipy.stats.linregress

Daπid

This parameter is R**2, the square of R. You are computing it using the squares of the residuals, so everything should be positive. If you get r negative, something  has gone terribly wrong.

On Feb 12, 2013 8:59 AM, "Jerome Kieffer" <[hidden email]> wrote:

Dear Scipy Community,

I am looking at the scipy.stats.linregress code and see:
    # average sum of squares:
    ssxm, ssxym, ssyxm, ssym = np.cov(x, y, bias=1).flat
    r_num = ssxym
    r_den = np.sqrt(ssxm*ssym)
    if r_den == 0.0:
        r = 0.0
    else:
        r = r_num / r_den
        if (r > 1.0): r = 1.0 # from numerical error

if the slope is negative, the correlation factor R is -1, not one so one should add:

        if (r < -1.0): r = -1.0 # from numerical error

or did I completely mis-understood the code ?

Cheers,


--
Jérôme Kieffer
On-Line Data analysis / Software Group
ISDD / ESRF
tel <a href="tel:%2B33%20476%20882%20445" value="+33476882445">+33 476 882 445
_______________________________________________
SciPy-User mailing list
[hidden email]
http://mail.scipy.org/mailman/listinfo/scipy-user

_______________________________________________
SciPy-User mailing list
[hidden email]
http://mail.scipy.org/mailman/listinfo/scipy-user
Reply | Threaded
Open this post in threaded view
|

Re: scipy.stats.linregress

Jerome Kieffer
On Tue, 12 Feb 2013 11:55:31 +0100
Daπid <[hidden email]> wrote:

> This parameter is R**2, the square of R. You are computing it using the
> squares of the residuals, so everything should be positive. If you get r
> negative, something  has gone terribly wrong.

In [3]: scipy.stats.linregress([1,2,3],[3,2,1])
Out[3]: (-1.0, 4.0, -1.0, 9.0031759825137669e-11, 0.0)

I am not very confident in my knowledge in statistics but here R = -1 and it does not look like an error.
I implemented a method to perform thousands of linear regression and few of them returned -1.00001 (or so) which later gave NaN as stderr.

so either the test should be:
if (r*r > 1.0): r = r/abs(r)
or:
if (r > 1.0): r = 1
elif (r < -1.0): r = -1

Cheers,

--
Jérôme Kieffer
On-Line Data analysis / Software Group
ISDD / ESRF
tel +33 476 882 445
_______________________________________________
SciPy-User mailing list
[hidden email]
http://mail.scipy.org/mailman/listinfo/scipy-user
Reply | Threaded
Open this post in threaded view
|

Re: scipy.stats.linregress

josef.pktd
On Tue, Feb 12, 2013 at 7:19 AM, Jerome Kieffer <[hidden email]> wrote:

> On Tue, 12 Feb 2013 11:55:31 +0100
> Daπid <[hidden email]> wrote:
>
>> This parameter is R**2, the square of R. You are computing it using the
>> squares of the residuals, so everything should be positive. If you get r
>> negative, something  has gone terribly wrong.
>
> In [3]: scipy.stats.linregress([1,2,3],[3,2,1])
> Out[3]: (-1.0, 4.0, -1.0, 9.0031759825137669e-11, 0.0)
>
> I am not very confident in my knowledge in statistics but here R = -1 and it does not look like an error.
> I implemented a method to perform thousands of linear regression and few of them returned -1.00001 (or so) which later gave NaN as stderr.
>
> so either the test should be:
> if (r*r > 1.0): r = r/abs(r)
> or:
> if (r > 1.0): r = 1
> elif (r < -1.0): r = -1

The correlation coefficient that is reported is the signed
correlation, the docstring has an example that takes the square to get
R_squared.

I don't know whether it's the R_squared, I need to check.

But because r is signed we should have the check r<-1 then -1

Josef

>
> Cheers,
>
> --
> Jérôme Kieffer
> On-Line Data analysis / Software Group
> ISDD / ESRF
> tel +33 476 882 445
> _______________________________________________
> SciPy-User mailing list
> [hidden email]
> http://mail.scipy.org/mailman/listinfo/scipy-user
_______________________________________________
SciPy-User mailing list
[hidden email]
http://mail.scipy.org/mailman/listinfo/scipy-user
Reply | Threaded
Open this post in threaded view
|

Re: scipy.stats.linregress

Daπid
In reply to this post by Jerome Kieffer
On 12 February 2013 13:19, Jerome Kieffer <[hidden email]> wrote:
> I am not very confident in my knowledge in statistics but here R = -1 and it does not look like an error.
> I implemented a method to perform thousands of linear regression and few of them returned -1.00001 (or so) which later gave NaN as stderr.

You are absolutely right, my mistake. I got the formula confused in my
head, I should have looked it up. Indeed, the covariance can be
negative, giving you the r<0.


David.
_______________________________________________
SciPy-User mailing list
[hidden email]
http://mail.scipy.org/mailman/listinfo/scipy-user
Reply | Threaded
Open this post in threaded view
|

Re: scipy.stats.linregress

josef.pktd
In reply to this post by josef.pktd
On Tue, Feb 12, 2013 at 7:37 AM,  <[hidden email]> wrote:

> On Tue, Feb 12, 2013 at 7:19 AM, Jerome Kieffer <[hidden email]> wrote:
>> On Tue, 12 Feb 2013 11:55:31 +0100
>> Daπid <[hidden email]> wrote:
>>
>>> This parameter is R**2, the square of R. You are computing it using the
>>> squares of the residuals, so everything should be positive. If you get r
>>> negative, something  has gone terribly wrong.
>>
>> In [3]: scipy.stats.linregress([1,2,3],[3,2,1])
>> Out[3]: (-1.0, 4.0, -1.0, 9.0031759825137669e-11, 0.0)
>>
>> I am not very confident in my knowledge in statistics but here R = -1 and it does not look like an error.
>> I implemented a method to perform thousands of linear regression and few of them returned -1.00001 (or so) which later gave NaN as stderr.
>>
>> so either the test should be:
>> if (r*r > 1.0): r = r/abs(r)
>> or:
>> if (r > 1.0): r = 1
>> elif (r < -1.0): r = -1
>
> The correlation coefficient that is reported is the signed
> correlation, the docstring has an example that takes the square to get
> R_squared.
>
> I don't know whether it's the R_squared, I need to check.

>>> sm.OLS([1,2,3],sm.add_constant([3.1,2,1])).fit().rsquared
0.99924471299093653

>>> stats.linregress([3.1,2,1], [1,2,3])[2]
-0.99962228516121865
>>> stats.linregress([3.1,2,1], [1,2,3])[2]**2
0.99924471299093676

Josef
Josef

>
> But because r is signed we should have the check r<-1 then -1
>
> Josef
>
>>
>> Cheers,
>>
>> --
>> Jérôme Kieffer
>> On-Line Data analysis / Software Group
>> ISDD / ESRF
>> tel +33 476 882 445
>> _______________________________________________
>> SciPy-User mailing list
>> [hidden email]
>> http://mail.scipy.org/mailman/listinfo/scipy-user
_______________________________________________
SciPy-User mailing list
[hidden email]
http://mail.scipy.org/mailman/listinfo/scipy-user
Reply | Threaded
Open this post in threaded view
|

Re: scipy.stats.linregress

josef.pktd
On Tue, Feb 12, 2013 at 7:46 AM,  <[hidden email]> wrote:

> On Tue, Feb 12, 2013 at 7:37 AM,  <[hidden email]> wrote:
>> On Tue, Feb 12, 2013 at 7:19 AM, Jerome Kieffer <[hidden email]> wrote:
>>> On Tue, 12 Feb 2013 11:55:31 +0100
>>> Daπid <[hidden email]> wrote:
>>>
>>>> This parameter is R**2, the square of R. You are computing it using the
>>>> squares of the residuals, so everything should be positive. If you get r
>>>> negative, something  has gone terribly wrong.
>>>
>>> In [3]: scipy.stats.linregress([1,2,3],[3,2,1])
>>> Out[3]: (-1.0, 4.0, -1.0, 9.0031759825137669e-11, 0.0)
>>>
>>> I am not very confident in my knowledge in statistics but here R = -1 and it does not look like an error.
>>> I implemented a method to perform thousands of linear regression and few of them returned -1.00001 (or so) which later gave NaN as stderr.
>>>
>>> so either the test should be:
>>> if (r*r > 1.0): r = r/abs(r)
>>> or:
>>> if (r > 1.0): r = 1
>>> elif (r < -1.0): r = -1
>>
>> The correlation coefficient that is reported is the signed
>> correlation, the docstring has an example that takes the square to get
>> R_squared.
>>
>> I don't know whether it's the R_squared, I need to check.
>
>>>> sm.OLS([1,2,3],sm.add_constant([3.1,2,1])).fit().rsquared
> 0.99924471299093653
>
>>>> stats.linregress([3.1,2,1], [1,2,3])[2]
> -0.99962228516121865
>>>> stats.linregress([3.1,2,1], [1,2,3])[2]**2
> 0.99924471299093676
>
> Josef
> Josef
>>
>> But because r is signed we should have the check r<-1 then -1

Volunteers for a pull request ?
and for checking that the test suite has a case with negative r, so
this doesn't get changed by accident.

Josef


>>
>> Josef
>>
>>>
>>> Cheers,
>>>
>>> --
>>> Jérôme Kieffer
>>> On-Line Data analysis / Software Group
>>> ISDD / ESRF
>>> tel +33 476 882 445
>>> _______________________________________________
>>> SciPy-User mailing list
>>> [hidden email]
>>> http://mail.scipy.org/mailman/listinfo/scipy-user
_______________________________________________
SciPy-User mailing list
[hidden email]
http://mail.scipy.org/mailman/listinfo/scipy-user
Reply | Threaded
Open this post in threaded view
|

Re: scipy.stats.linregress

Jerome Kieffer
On Tue, 12 Feb 2013 07:54:38 -0500
[hidden email] wrote:

> Volunteers for a pull request ?
> and for checking that the test suite has a case with negative r, so
> this doesn't get changed by accident.

I can ... but I am not used to the tests in scipy


--
Jérôme Kieffer
On-Line Data analysis / Software Group
ISDD / ESRF
tel +33 476 882 445
_______________________________________________
SciPy-User mailing list
[hidden email]
http://mail.scipy.org/mailman/listinfo/scipy-user
Reply | Threaded
Open this post in threaded view
|

Re: scipy.stats.linregress

josef.pktd
On Tue, Feb 12, 2013 at 10:33 AM, Jerome Kieffer <[hidden email]> wrote:
> On Tue, 12 Feb 2013 07:54:38 -0500
> [hidden email] wrote:
>
>> Volunteers for a pull request ?
>> and for checking that the test suite has a case with negative r, so
>> this doesn't get changed by accident.
>
> I can ... but I am not used to the tests in scipy

Usually I just have to search the test folder by function.

most of the linregress tests are from an old benchmark test suite,
checking mainly difficult cases.
This https://github.com/scipy/scipy/blob/master/scipy/stats/tests/test_stats.py#L736
is a test that I added.
Similar we could add one with a negative coefficient.

statsmodels OLS is verified against other packages, so we can get the
correct test results from there.

Thank you,

Josef



>
>
> --
> Jérôme Kieffer
> On-Line Data analysis / Software Group
> ISDD / ESRF
> tel +33 476 882 445
> _______________________________________________
> SciPy-User mailing list
> [hidden email]
> http://mail.scipy.org/mailman/listinfo/scipy-user
_______________________________________________
SciPy-User mailing list
[hidden email]
http://mail.scipy.org/mailman/listinfo/scipy-user
Reply | Threaded
Open this post in threaded view
|

Re: scipy.stats.linregress

Jerome Kieffer
On Tue, 12 Feb 2013 10:48:33 -0500
[hidden email] wrote:

> On Tue, Feb 12, 2013 at 10:33 AM, Jerome Kieffer <[hidden email]> wrote:
> > On Tue, 12 Feb 2013 07:54:38 -0500
> > [hidden email] wrote:
> >
> >> Volunteers for a pull request ?
> >> and for checking that the test suite has a case with negative r, so
> >> this doesn't get changed by accident.
> >
> > I can ... but I am not used to the tests in scipy
>
> Usually I just have to search the test folder by function.
>

the patch for the code is trivial...
but get a valid test-case (currently breaking) is not trivial, especially that numpy.cov is casting to double precision (I have seen R<-1 only with single precision arithmetic)

I am preparing a pull-request.

Cheers,
--
Jérôme Kieffer
On-Line Data analysis / Software Group
ISDD / ESRF
tel +33 476 882 445
_______________________________________________
SciPy-User mailing list
[hidden email]
http://mail.scipy.org/mailman/listinfo/scipy-user
Reply | Threaded
Open this post in threaded view
|

Re: scipy.stats.linregress

josef.pktd
On Tue, Feb 12, 2013 at 11:45 AM, Jerome Kieffer <[hidden email]> wrote:

> On Tue, 12 Feb 2013 10:48:33 -0500
> [hidden email] wrote:
>
>> On Tue, Feb 12, 2013 at 10:33 AM, Jerome Kieffer <[hidden email]> wrote:
>> > On Tue, 12 Feb 2013 07:54:38 -0500
>> > [hidden email] wrote:
>> >
>> >> Volunteers for a pull request ?
>> >> and for checking that the test suite has a case with negative r, so
>> >> this doesn't get changed by accident.
>> >
>> > I can ... but I am not used to the tests in scipy
>>
>> Usually I just have to search the test folder by function.
>>
>
> the patch for the code is trivial...
> but get a valid test-case (currently breaking) is not trivial, especially that numpy.cov is casting to double
precision (I have seen R<-1 only with single precision arithmetic)

We should have a test for negative r, as backwards compatibility check
for the future,
but if it's too difficult to get one with r<-1, then we can leave two
lines untested.

I don't remember if I have ever seen a case with abs(r)>1.

Josef

>
> I am preparing a pull-request.
>
> Cheers,
> --
> Jérôme Kieffer
> On-Line Data analysis / Software Group
> ISDD / ESRF
> tel +33 476 882 445
> _______________________________________________
> SciPy-User mailing list
> [hidden email]
> http://mail.scipy.org/mailman/listinfo/scipy-user
_______________________________________________
SciPy-User mailing list
[hidden email]
http://mail.scipy.org/mailman/listinfo/scipy-user