[SciPy-User] np.corrcoef ddof is redundant?


[SciPy-User] np.corrcoef ddof is redundant?

Alistair Miles
Hi,

I'm trying to calculate correlation coefficients and looking at the np.corrcoef function. It has bias and ddof arguments, however when I try different values of ddof with test data the results are always the same, i.e., changing ddof has no effect. From some back-of-the-envelope algebra I reckon the n/(n-ddof) normalisations should get cancelled out when calculating correlation coefficients from a covariance matrix, and therefore the ddof (and bias) arguments to np.corrcoef are redundant.

I'd be very grateful if someone could verify this is true or tell me if I've missed something.

Thanks,
Alistair
--
Alistair Miles
Head of Epidemiological Informatics
Centre for Genomics and Global Health <http://cggh.org>
The Wellcome Trust Centre for Human Genetics
Roosevelt Drive
Oxford
OX3 7BN
United Kingdom
Web: http://purl.org/net/aliman
Email: [hidden email]
Tel: +44 (0)1865 287721

_______________________________________________
SciPy-User mailing list
[hidden email]
http://mail.scipy.org/mailman/listinfo/scipy-user

Re: np.corrcoef ddof is redundant?

Oleksandr Huziy
It does change for me, though very little....

import numpy as np

x = np.random.randn(50)
y = x * x * x * x
# Print r to 20 decimal places for each ddof passed to np.corrcoef.
for ddof in range(20):
    print("ddof = {}; r = {:.20f}".format(ddof, np.corrcoef(x, y, ddof=ddof)[0, 1]))


ddof = 0; r = 0.27115960925626320099
ddof = 1; r = 0.27115960925626320099
ddof = 2; r = 0.27115960925626314548
ddof = 3; r = 0.27115960925626320099
ddof = 4; r = 0.27115960925626320099
ddof = 5; r = 0.27115960925626314548
ddof = 6; r = 0.27115960925626320099
ddof = 7; r = 0.27115960925626320099
ddof = 8; r = 0.27115960925626320099
ddof = 9; r = 0.27115960925626320099
ddof = 10; r = 0.27115960925626314548
ddof = 11; r = 0.27115960925626320099
ddof = 12; r = 0.27115960925626320099
ddof = 13; r = 0.27115960925626320099
ddof = 14; r = 0.27115960925626314548
ddof = 15; r = 0.27115960925626314548
ddof = 16; r = 0.27115960925626314548
ddof = 17; r = 0.27115960925626320099
ddof = 18; r = 0.27115960925626320099
ddof = 19; r = 0.27115960925626320099

Cheers


--
Sasha


Re: np.corrcoef ddof is redundant?

Sturla Molden-3
In reply to this post by Alistair Miles
Alistair Miles <[hidden email]> wrote:

> I'm trying to calculate correlation coefficients and looking at the
> np.corrcoef function. It has bias and ddof arguments, however when I try
> different values of ddof with test data the results are always the same,
> i.e., changing ddof has no effect. From some back-of-the-envelope algebra I
> reckon the n/(n-ddof) normalisations should get cancelled out when
> calculating correlation coefficients from a covariance matrix, and
> therefore the ddof (and bias) arguments to np.corrcoef are redundant.
>
> I'd be very grateful if someone could verify this is true or tell me if
> I've missed something.

You are right. It should cancel out or np.corrcoef would be wrong. The
sample size does not go into the Pearson product-moment correlation.

Sturla
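
The cancellation can be written out in a few lines. This is a sketch of the algebra only; the symbols S_xy (sum of cross-products of deviations), d (the ddof value) and cov_d are notation introduced here, not taken from the thread, and d < n is assumed so the normalisation is defined.

```latex
\begin{align*}
S_{xy} &= \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}), \qquad
\operatorname{cov}_d(x, y) = \frac{S_{xy}}{n - d}, \\
r_{xy} &= \frac{\operatorname{cov}_d(x, y)}
              {\sqrt{\operatorname{cov}_d(x, x)\,\operatorname{cov}_d(y, y)}}
        = \frac{S_{xy} / (n - d)}{\sqrt{S_{xx} S_{yy}} / (n - d)}
        = \frac{S_{xy}}{\sqrt{S_{xx} S_{yy}}}.
\end{align*}
```

The factor 1/(n - d) appears once in the numerator and once (after the square root of its square) in the denominator, so r does not depend on d.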









Re: np.corrcoef ddof is redundant?

Sturla Molden-3
In reply to this post by Oleksandr Huziy
Oleksandr Huziy <[hidden email]> wrote:
> It does change for me, though very little....

Probably rounding error.
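
One way to check the rounding-error explanation is to compare the spread of r across ddof values with the machine epsilon. The sketch below normalises np.cov by hand so that ddof really is applied; the helper name corr_via_cov and the fixed seed are choices made for this example, not part of NumPy or of the original test.

```python
import numpy as np


def corr_via_cov(x, y, ddof):
    """Pearson r obtained by normalising np.cov computed with the given ddof."""
    c = np.cov(x, y, ddof=ddof)
    return c[0, 1] / np.sqrt(c[0, 0] * c[1, 1])


rng = np.random.default_rng(0)  # fixed seed, chosen only for reproducibility
x = rng.standard_normal(50)
y = x**4

# r for ddof = 0..19, as in the loop posted earlier in the thread.
r = np.array([corr_via_cov(x, y, ddof) for ddof in range(20)])
spread = r.max() - r.min()

print("spread of r across ddof values:", spread)
print("machine epsilon for float64:   ", np.finfo(float).eps)
```

If the spread is of the same order as the machine epsilon, the differences are floating-point noise rather than an effect of ddof.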


Re: np.corrcoef ddof is redundant?

Matthew Brett
In reply to this post by Sturla Molden-3
Hi,

On Tue, Mar 10, 2015 at 9:27 AM, Sturla Molden <[hidden email]> wrote:

> You are right. It should cancel out or np.corrcoef would be wrong. The
> sample size does not go into the Pearson product-moment correlation.

Oh dear - that's embarrassing.

https://en.wikipedia.org/wiki/Pearson_product-moment_correlation_coefficient

I guess we should deprecate the 'bias' and 'ddof' input arguments asap.

Cheers,

Matthew

Re: np.corrcoef ddof is redundant?

Alistair Miles
Thanks for the responses, glad to know I'm not going crazy. 

Cheers,
Alistair. 


Re: np.corrcoef ddof is redundant?

Sturla Molden-3
In reply to this post by Matthew Brett
On 10/03/15 21:12, Matthew Brett wrote:

> Oh dear - that's embarrassing.
>
> https://en.wikipedia.org/wiki/Pearson_product-moment_correlation_coefficient
>
> I guess we should deprecate the 'bias' and 'ddof' input arguments asap.

It is an unfortunate consequence of implementing np.corrcoef on top of
np.cov.

np.corrcoef should not be computed with np.cov because it just adds
additional rounding error to the result.

https://github.com/numpy/numpy/blob/32e23a1d52a05d3a56f693010eaf8d96826db75f/numpy/lib/function_base.py


Sturla
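
A minimal sketch of what computing the correlation without going through np.cov could look like: the unnormalised cross-product matrix is divided by the outer product of its own diagonal square roots, so no 1/(n - ddof) factor is ever applied. The function name corrcoef_direct and the rows-are-observations layout are assumptions for this example; this is not the NumPy implementation.

```python
import numpy as np


def corrcoef_direct(X):
    """Correlation matrix of the columns of X (observations in rows)."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)      # centre each column
    S = Xc.T @ Xc                # unnormalised sums of cross-products
    d = np.sqrt(np.diag(S))      # root sum of squares per column
    return S / np.outer(d, d)


rng = np.random.default_rng(1)
X = rng.standard_normal((30, 4))
print(np.allclose(corrcoef_direct(X), np.corrcoef(X, rowvar=False)))
```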



Re: np.corrcoef ddof is redundant?

Matthew Brett
On Tue, Mar 10, 2015 at 7:21 PM, Sturla Molden <[hidden email]> wrote:

> On 10/03/15 21:12, Matthew Brett wrote:
>
>> Oh dear - that's embarrassing.
>>
>> https://en.wikipedia.org/wiki/Pearson_product-moment_correlation_coefficient
>>
>> I guess we should deprecate the 'bias' and 'ddof' input arguments asap.
>
> It is an unfortunate consequence of implementing np.corrcoef on top of
> np.cov.

Except we should have realized that bias / ddof cancels and therefore
should not have implemented the bias / ddof input arguments (or passed
them to cov in the function).

> np.corrcoef should not be computed with np.cov because it just adds
> additional rounding error to the result.

What algorithm do you think we should use to minimize rounding error?

Cheers,

Matthew

Re: np.corrcoef ddof is redundant?

Sturla Molden-3
On 11/03/15 03:56, Matthew Brett wrote:

>> np.corrcoef should not be computed with np.cov because it just adds
>> additional rounding error to the result.
>
> What algorithm do you think we should use to minimize rounding error?

I was not actually thinking about that. I just thought we could reuse
some of the code from np.cov to avoid the redundant division and
multiplications.

But since you asked, to minimize rounding error there is a two-pass
method which can be used for both cov and corrcoef. Cf. this Matlab code:

http://home.online.no/~pjacklam/matlab/software/util/statutil/covmat.m

This would be very easy to use in NumPy.
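
As an illustration, here is a sketch of one common form of the idea, the corrected two-pass covariance. It is not a port of the linked Matlab file, whose details are not reproduced here; the name cov_two_pass, the rows-are-observations layout and the default ddof=1 are assumptions for this example.

```python
import numpy as np


def cov_two_pass(X, ddof=1):
    """Corrected two-pass covariance of the columns of X (observations in rows)."""
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    mean = X.mean(axis=0)        # first pass: column means
    Xc = X - mean                # second pass: deviations from the means
    # Column sums of the deviations are zero in exact arithmetic; subtracting
    # their outer product compensates for rounding error in the computed mean.
    resid = Xc.sum(axis=0)
    S = Xc.T @ Xc - np.outer(resid, resid) / n
    return S / (n - ddof)


rng = np.random.default_rng(2)
X = rng.standard_normal((40, 3)) + 1e6   # large offset relative to the spread
print(np.allclose(cov_two_pass(X), np.cov(X, rowvar=False)))
```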

Another method which is less known is to use the SVD. It can also be
used to compute the corrcoef. Here for real values and rowvar=False:

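import numpy as np

def cov(X, ddof):
    # X has shape (nx, p): observations in rows, variables in columns.
    nx, p = X.shape
    mean = X.mean(axis=0)
    CX = X - mean[None, :]  # centre each column
    # Squared singular values of the scaled, centred data are the
    # eigenvalues of the covariance matrix.
    u, s, pc = np.linalg.svd(CX / np.sqrt(nx - ddof), full_matrices=False)
    s2 = s**2
    # pc.T @ diag(s2) @ pc; scaling columns also works when nx < p.
    return np.dot(pc.T * s2, pc)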


Sturla
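
For the corrcoef case, a sketch under the same assumptions (real values, observations in rows): each centred column is scaled to unit Euclidean norm before the SVD, so no ddof-dependent factor is needed. The function name corrcoef_svd is hypothetical and this is one possible reading of the suggestion, not code from the thread.

```python
import numpy as np


def corrcoef_svd(X):
    """Correlation matrix of the columns of X via the SVD (observations in rows)."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)                    # centre each column
    Z = Xc / np.sqrt((Xc**2).sum(axis=0))      # scale columns to unit norm
    _, s, vt = np.linalg.svd(Z, full_matrices=False)
    # Z.T @ Z = V diag(s**2) V.T, which is the correlation matrix.
    return (vt.T * s**2) @ vt


rng = np.random.default_rng(3)
X = rng.standard_normal((25, 5))
print(np.allclose(corrcoef_svd(X), np.corrcoef(X, rowvar=False)))
```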


Re: np.corrcoef ddof is redundant?

Matthew Brett
In reply to this post by Matthew Brett
Hi,

On Tue, Mar 10, 2015 at 1:12 PM, Matthew Brett <[hidden email]> wrote:

> Oh dear - that's embarrassing.
>
> https://en.wikipedia.org/wiki/Pearson_product-moment_correlation_coefficient
>
> I guess we should deprecate the 'bias' and 'ddof' input arguments asap.

https://github.com/numpy/numpy/pull/5675

Cheers,

Matthew