ks_2samp is not giving the same results as ks.test in R

classic Classic list List threaded Threaded
6 messages Options
Reply | Threaded
Open this post in threaded view
|

ks_2samp is not giving the same results as ks.test in R

Peng Yu
Hi,

The ks_2samp function does not give the same answer as ks.test in R.
Does anybody know why they are different? Is ks_2samp compute
something different?

helium:~/linux/test/python/man/library/scipy/stats/ks_2samp$ Rscript main.R
> ks.test(1:5, 11:15)

        Two-sample Kolmogorov-Smirnov test

data:  1:5 and 11:15
D = 1, p-value = 0.007937
alternative hypothesis: two-sided

> ks.test(1:5, 11:15, alternative='less')

        Two-sample Kolmogorov-Smirnov test

data:  1:5 and 11:15
D^- = 0, p-value = 1
alternative hypothesis: the CDF of x lies below that of y

> ks.test(1:5, 11:15, alternative='greater')

        Two-sample Kolmogorov-Smirnov test

data:  1:5 and 11:15
D^+ = 1, p-value = 0.006738
alternative hypothesis: the CDF of x lies above that of y

>
>
helium:~/linux/test/python/man/library/scipy/stats/ks_2samp$ ./main.py
(1.0, 0.0037813540593701006)
helium:~/linux/test/python/man/library/scipy/stats/ks_2samp$ cat main.py
#!/usr/bin/env python

from scipy.stats import ks_2samp
print ks_2samp([1,2,3,4,5], [11,12,13,14,15])


--
Regards,
Peng
_______________________________________________
SciPy-User mailing list
[hidden email]
http://mail.scipy.org/mailman/listinfo/scipy-user
Reply | Threaded
Open this post in threaded view
|

Re: ks_2samp is not giving the same results as ks.test in R

josef.pktd
On Thu, Nov 1, 2012 at 8:28 PM, Peng Yu <[hidden email]> wrote:

> Hi,
>
> The ks_2samp function does not give the same answer as ks.test in R.
> Does anybody know why they are different? Is ks_2samp compute
> something different?
>
> helium:~/linux/test/python/man/library/scipy/stats/ks_2samp$ Rscript main.R
>> ks.test(1:5, 11:15)
>
>         Two-sample Kolmogorov-Smirnov test
>
> data:  1:5 and 11:15
> D = 1, p-value = 0.007937
> alternative hypothesis: two-sided
>
>> ks.test(1:5, 11:15, alternative='less')
>
>         Two-sample Kolmogorov-Smirnov test
>
> data:  1:5 and 11:15
> D^- = 0, p-value = 1
> alternative hypothesis: the CDF of x lies below that of y
>
>> ks.test(1:5, 11:15, alternative='greater')
>
>         Two-sample Kolmogorov-Smirnov test
>
> data:  1:5 and 11:15
> D^+ = 1, p-value = 0.006738
> alternative hypothesis: the CDF of x lies above that of y
>
>>
>>
> helium:~/linux/test/python/man/library/scipy/stats/ks_2samp$ ./main.py
> (1.0, 0.0037813540593701006)
> helium:~/linux/test/python/man/library/scipy/stats/ks_2samp$ cat main.py
> #!/usr/bin/env python
>
> from scipy.stats import ks_2samp
> print ks_2samp([1,2,3,4,5], [11,12,13,14,15])

R uses by default an "exact" distribution for small samples if there
are no ties.
If there are ties or with a large sample, R uses the asymptotic distribution.

If I read the function correctly, then scipy.stats is using a small
sample approximation by Stephens. (But I would have to look up the
formula to verify this.)

In the example below with a bit larger sample and no ties, our
approximation is closer to R's "exact" pvalue than the asymptotic
distribution if exact=FALSE.

>  ks.test(1:25, (10:30)-0.5, exact=FALSE)

        Two-sample Kolmogorov-Smirnov test

data:  1:25 and (10:30) - 0.5
D = 0.36, p-value = 0.1038
alternative hypothesis: two-sided

>  ks.test(1:25, (10:30)-0.5, exact=TRUE)

        Two-sample Kolmogorov-Smirnov test

data:  1:25 and (10:30) - 0.5
D = 0.36, p-value = 0.07608
alternative hypothesis: two-sided


>>> stats.ks_2samp(np.arange(1.,26), np.arange(10,31.)-0.5)
(0.35999999999999999, 0.078993426961291274)


For the 1 sample kstest I used (when I rewrote stats.kstest) an
approximation that is closer to the exact distribution than the
asymptotic distribution, but it's also not exact.

It would be good to have better small sample approximations or exact
distributions, but I worked on this in scipy.stats when I barely had
any idea about goodness-of-fit tests.
Also, ks_2samp never got the enhancement for one-sided alternatives.
(In statsmodels I have been working so far only on one sample tests,
but not on two-sample tests.)

(I don't remember if there is a minimum size recommendation, but the
examples I usually checked were larger.)


since it's a community project: Pull Request are welcome

Josef

>
>
> --
> Regards,
> Peng
> _______________________________________________
> SciPy-User mailing list
> [hidden email]
> http://mail.scipy.org/mailman/listinfo/scipy-user
_______________________________________________
SciPy-User mailing list
[hidden email]
http://mail.scipy.org/mailman/listinfo/scipy-user
Reply | Threaded
Open this post in threaded view
|

Re: ks_2samp is not giving the same results as ks.test in R

josef.pktd
On Thu, Nov 1, 2012 at 9:14 PM,  <[hidden email]> wrote:

> On Thu, Nov 1, 2012 at 8:28 PM, Peng Yu <[hidden email]> wrote:
>> Hi,
>>
>> The ks_2samp function does not give the same answer as ks.test in R.
>> Does anybody know why they are different? Is ks_2samp compute
>> something different?
>>
>> helium:~/linux/test/python/man/library/scipy/stats/ks_2samp$ Rscript main.R
>>> ks.test(1:5, 11:15)
>>
>>         Two-sample Kolmogorov-Smirnov test
>>
>> data:  1:5 and 11:15
>> D = 1, p-value = 0.007937
>> alternative hypothesis: two-sided
>>
>>> ks.test(1:5, 11:15, alternative='less')
>>
>>         Two-sample Kolmogorov-Smirnov test
>>
>> data:  1:5 and 11:15
>> D^- = 0, p-value = 1
>> alternative hypothesis: the CDF of x lies below that of y
>>
>>> ks.test(1:5, 11:15, alternative='greater')
>>
>>         Two-sample Kolmogorov-Smirnov test
>>
>> data:  1:5 and 11:15
>> D^+ = 1, p-value = 0.006738
>> alternative hypothesis: the CDF of x lies above that of y
>>
>>>
>>>
>> helium:~/linux/test/python/man/library/scipy/stats/ks_2samp$ ./main.py
>> (1.0, 0.0037813540593701006)
>> helium:~/linux/test/python/man/library/scipy/stats/ks_2samp$ cat main.py
>> #!/usr/bin/env python
>>
>> from scipy.stats import ks_2samp
>> print ks_2samp([1,2,3,4,5], [11,12,13,14,15])
>
> R uses by default an "exact" distribution for small samples if there
> are no ties.
> If there are ties or with a large sample, R uses the asymptotic distribution.
>
> If I read the function correctly, then scipy.stats is using a small
> sample approximation by Stephens. (But I would have to look up the
> formula to verify this.)

http://en.wikipedia.org/wiki/Kolmogorov%E2%80%93Smirnov_test#Two-sample_Kolmogorov.E2.80.93Smirnov_test
has the weighted sample size: en = np.sqrt(n1*n2/float(n1+n2))
the small sample weighting ((en+0.12+0.11/en)*d) is the same as in
Stephens (1970, 1985?) for the one sample test.
I don't have a reference for the two sample approximation right now.

(another bit of random information)
tables are often only available for 0.01 to 0.25 and approximations
are targeted on that range and might not be as accurate outside of it

Josef


>
> In the example below with a bit larger sample and no ties, our
> approximation is closer to R's "exact" pvalue than the asymptotic
> distribution if exact=FALSE.
>
>>  ks.test(1:25, (10:30)-0.5, exact=FALSE)
>
>         Two-sample Kolmogorov-Smirnov test
>
> data:  1:25 and (10:30) - 0.5
> D = 0.36, p-value = 0.1038
> alternative hypothesis: two-sided
>
>>  ks.test(1:25, (10:30)-0.5, exact=TRUE)
>
>         Two-sample Kolmogorov-Smirnov test
>
> data:  1:25 and (10:30) - 0.5
> D = 0.36, p-value = 0.07608
> alternative hypothesis: two-sided
>
>
>>>> stats.ks_2samp(np.arange(1.,26), np.arange(10,31.)-0.5)
> (0.35999999999999999, 0.078993426961291274)
>
>
> For the 1 sample kstest I used (when I rewrote stats.kstest) an
> approximation that is closer to the exact distribution than the
> asymptotic distribution, but it's also not exact.
>
> It would be good to have better small sample approximations or exact
> distributions, but I worked on this in scipy.stats when I barely had
> any idea about goodness-of-fit tests.
> Also, ks_2samp never got the enhancement for one-sided alternatives.
> (In statsmodels I have been working so far only on one sample tests,
> but not on two-sample tests.)
>
> (I don't remember if there is a minimum size recommendation, but the
> examples I usually checked were larger.)

matlab help: http://www.mathworks.com/help/stats/kstest2.html
"The asymptotic p value becomes very accurate for large sample sizes,
and is believed to be reasonably accurate for sample sizes n1 and n2
such that (n1*n2)/(n1 + n2) >= 4."

>
>
> since it's a community project: Pull Request are welcome
>
> Josef
>
>>
>>
>> --
>> Regards,
>> Peng
>> _______________________________________________
>> SciPy-User mailing list
>> [hidden email]
>> http://mail.scipy.org/mailman/listinfo/scipy-user
_______________________________________________
SciPy-User mailing list
[hidden email]
http://mail.scipy.org/mailman/listinfo/scipy-user
Reply | Threaded
Open this post in threaded view
|

Re: ks_2samp is not giving the same results as ks.test in R

josef.pktd
On Thu, Nov 1, 2012 at 9:41 PM,  <[hidden email]> wrote:

> On Thu, Nov 1, 2012 at 9:14 PM,  <[hidden email]> wrote:
>> On Thu, Nov 1, 2012 at 8:28 PM, Peng Yu <[hidden email]> wrote:
>>> Hi,
>>>
>>> The ks_2samp function does not give the same answer as ks.test in R.
>>> Does anybody know why they are different? Is ks_2samp compute
>>> something different?
>>>
>>> helium:~/linux/test/python/man/library/scipy/stats/ks_2samp$ Rscript main.R
>>>> ks.test(1:5, 11:15)
>>>
>>>         Two-sample Kolmogorov-Smirnov test
>>>
>>> data:  1:5 and 11:15
>>> D = 1, p-value = 0.007937
>>> alternative hypothesis: two-sided
>>>
>>>> ks.test(1:5, 11:15, alternative='less')
>>>
>>>         Two-sample Kolmogorov-Smirnov test
>>>
>>> data:  1:5 and 11:15
>>> D^- = 0, p-value = 1
>>> alternative hypothesis: the CDF of x lies below that of y
>>>
>>>> ks.test(1:5, 11:15, alternative='greater')
>>>
>>>         Two-sample Kolmogorov-Smirnov test
>>>
>>> data:  1:5 and 11:15
>>> D^+ = 1, p-value = 0.006738
>>> alternative hypothesis: the CDF of x lies above that of y
>>>
>>>>
>>>>
>>> helium:~/linux/test/python/man/library/scipy/stats/ks_2samp$ ./main.py
>>> (1.0, 0.0037813540593701006)
>>> helium:~/linux/test/python/man/library/scipy/stats/ks_2samp$ cat main.py
>>> #!/usr/bin/env python
>>>
>>> from scipy.stats import ks_2samp
>>> print ks_2samp([1,2,3,4,5], [11,12,13,14,15])
>>
>> R uses by default an "exact" distribution for small samples if there
>> are no ties.
>> If there are ties or with a large sample, R uses the asymptotic distribution.
>>
>> If I read the function correctly, then scipy.stats is using a small
>> sample approximation by Stephens. (But I would have to look up the
>> formula to verify this.)
>
> http://en.wikipedia.org/wiki/Kolmogorov%E2%80%93Smirnov_test#Two-sample_Kolmogorov.E2.80.93Smirnov_test
> has the weighted sample size: en = np.sqrt(n1*n2/float(n1+n2))
> the small sample weighting ((en+0.12+0.11/en)*d) is the same as in
> Stephens (1970, 1985?) for the one sample test.
> I don't have a reference for the two sample approximation right now.
>
> (another bit of random information)
> tables are often only available for 0.01 to 0.25 and approximations
(hit send too fast)  0.001 to 0.25

> are targeted on that range and might not be as accurate outside of it
>
> Josef
>
>
>>
>> In the example below with a bit larger sample and no ties, our
>> approximation is closer to R's "exact" pvalue than the asymptotic
>> distribution if exact=FALSE.
>>
>>>  ks.test(1:25, (10:30)-0.5, exact=FALSE)
>>
>>         Two-sample Kolmogorov-Smirnov test
>>
>> data:  1:25 and (10:30) - 0.5
>> D = 0.36, p-value = 0.1038
>> alternative hypothesis: two-sided
>>
>>>  ks.test(1:25, (10:30)-0.5, exact=TRUE)
>>
>>         Two-sample Kolmogorov-Smirnov test
>>
>> data:  1:25 and (10:30) - 0.5
>> D = 0.36, p-value = 0.07608
>> alternative hypothesis: two-sided
>>
>>
>>>>> stats.ks_2samp(np.arange(1.,26), np.arange(10,31.)-0.5)
>> (0.35999999999999999, 0.078993426961291274)
>>
>>
>> For the 1 sample kstest I used (when I rewrote stats.kstest) an
>> approximation that is closer to the exact distribution than the
>> asymptotic distribution, but it's also not exact.
>>
>> It would be good to have better small sample approximations or exact
>> distributions, but I worked on this in scipy.stats when I barely had
>> any idea about goodness-of-fit tests.
>> Also, ks_2samp never got the enhancement for one-sided alternatives.
>> (In statsmodels I have been working so far only on one sample tests,
>> but not on two-sample tests.)
>>
>> (I don't remember if there is a minimum size recommendation, but the
>> examples I usually checked were larger.)
>
> matlab help: http://www.mathworks.com/help/stats/kstest2.html
> "The asymptotic p value becomes very accurate for large sample sizes,
> and is believed to be reasonably accurate for sample sizes n1 and n2
> such that (n1*n2)/(n1 + n2) >= 4."
>>
>>
>> since it's a community project: Pull Request are welcome
>>
>> Josef
>>
>>>
>>>
>>> --
>>> Regards,
>>> Peng
>>> _______________________________________________
>>> SciPy-User mailing list
>>> [hidden email]
>>> http://mail.scipy.org/mailman/listinfo/scipy-user
_______________________________________________
SciPy-User mailing list
[hidden email]
http://mail.scipy.org/mailman/listinfo/scipy-user
Reply | Threaded
Open this post in threaded view
|

Re: ks_2samp is not giving the same results as ks.test in R

josef.pktd
On Thu, Nov 1, 2012 at 9:42 PM,  <[hidden email]> wrote:

> On Thu, Nov 1, 2012 at 9:41 PM,  <[hidden email]> wrote:
>> On Thu, Nov 1, 2012 at 9:14 PM,  <[hidden email]> wrote:
>>> On Thu, Nov 1, 2012 at 8:28 PM, Peng Yu <[hidden email]> wrote:
>>>> Hi,
>>>>
>>>> The ks_2samp function does not give the same answer as ks.test in R.
>>>> Does anybody know why they are different? Is ks_2samp compute
>>>> something different?
>>>>
>>>> helium:~/linux/test/python/man/library/scipy/stats/ks_2samp$ Rscript main.R
>>>>> ks.test(1:5, 11:15)
>>>>
>>>>         Two-sample Kolmogorov-Smirnov test
>>>>
>>>> data:  1:5 and 11:15
>>>> D = 1, p-value = 0.007937
>>>> alternative hypothesis: two-sided
>>>>
>>>>> ks.test(1:5, 11:15, alternative='less')
>>>>
>>>>         Two-sample Kolmogorov-Smirnov test
>>>>
>>>> data:  1:5 and 11:15
>>>> D^- = 0, p-value = 1
>>>> alternative hypothesis: the CDF of x lies below that of y
>>>>
>>>>> ks.test(1:5, 11:15, alternative='greater')
>>>>
>>>>         Two-sample Kolmogorov-Smirnov test
>>>>
>>>> data:  1:5 and 11:15
>>>> D^+ = 1, p-value = 0.006738
>>>> alternative hypothesis: the CDF of x lies above that of y
>>>>
>>>>>
>>>>>
>>>> helium:~/linux/test/python/man/library/scipy/stats/ks_2samp$ ./main.py
>>>> (1.0, 0.0037813540593701006)
>>>> helium:~/linux/test/python/man/library/scipy/stats/ks_2samp$ cat main.py
>>>> #!/usr/bin/env python
>>>>
>>>> from scipy.stats import ks_2samp
>>>> print ks_2samp([1,2,3,4,5], [11,12,13,14,15])
>>>
>>> R uses by default an "exact" distribution for small samples if there
>>> are no ties.
>>> If there are ties or with a large sample, R uses the asymptotic distribution.
>>>
>>> If I read the function correctly, then scipy.stats is using a small
>>> sample approximation by Stephens. (But I would have to look up the
>>> formula to verify this.)
>>
>> http://en.wikipedia.org/wiki/Kolmogorov%E2%80%93Smirnov_test#Two-sample_Kolmogorov.E2.80.93Smirnov_test
>> has the weighted sample size: en = np.sqrt(n1*n2/float(n1+n2))
>> the small sample weighting ((en+0.12+0.11/en)*d) is the same as in
>> Stephens (1970, 1985?) for the one sample test.
>> I don't have a reference for the two sample approximation right now.
>>
>> (another bit of random information)
>> tables are often only available for 0.01 to 0.25 and approximations
> (hit send too fast)  0.001 to 0.25
>> are targeted on that range and might not be as accurate outside of it
>>
>> Josef
>>
>>
>>>
>>> In the example below with a bit larger sample and no ties, our
>>> approximation is closer to R's "exact" pvalue than the asymptotic
>>> distribution if exact=FALSE.
>>>
>>>>  ks.test(1:25, (10:30)-0.5, exact=FALSE)
>>>
>>>         Two-sample Kolmogorov-Smirnov test
>>>
>>> data:  1:25 and (10:30) - 0.5
>>> D = 0.36, p-value = 0.1038
>>> alternative hypothesis: two-sided
>>>
>>>>  ks.test(1:25, (10:30)-0.5, exact=TRUE)
>>>
>>>         Two-sample Kolmogorov-Smirnov test
>>>
>>> data:  1:25 and (10:30) - 0.5
>>> D = 0.36, p-value = 0.07608
>>> alternative hypothesis: two-sided
>>>
>>>
>>>>>> stats.ks_2samp(np.arange(1.,26), np.arange(10,31.)-0.5)
>>> (0.35999999999999999, 0.078993426961291274)
>>>
>>>
>>> For the 1 sample kstest I used (when I rewrote stats.kstest) an
>>> approximation that is closer to the exact distribution than the
>>> asymptotic distribution, but it's also not exact.
>>>
>>> It would be good to have better small sample approximations or exact
>>> distributions, but I worked on this in scipy.stats when I barely had
>>> any idea about goodness-of-fit tests.
>>> Also, ks_2samp never got the enhancement for one-sided alternatives.
>>> (In statsmodels I have been working so far only on one sample tests,
>>> but not on two-sample tests.)
>>>
>>> (I don't remember if there is a minimum size recommendation, but the
>>> examples I usually checked were larger.)
>>
>> matlab help: http://www.mathworks.com/help/stats/kstest2.html
>> "The asymptotic p value becomes very accurate for large sample sizes,
>> and is believed to be reasonably accurate for sample sizes n1 and n2
>> such that (n1*n2)/(n1 + n2) >= 4."
>>>
>>>
>>> since it's a community project: Pull Request are welcome

(since I haven't spammed the mailing list in a while, and it's fast)

I think, except in very small sample, ks_2samp looks ok.

(However, independent of our p-values, the power of Kolmogorov-Smirnov
in small samples is awful.
>>> stats.ttest_ind(np.arange(1.,26), np.arange(10,31.)-0.5)
(-3.2015129142545442, 0.0025398589272691129)       reject Null
>>> stats.ks_2samp(np.arange(1.,26), np.arange(10,31.)-0.5)
(0.35999999999999999, 0.078993426961291274)        don't reject Null at 0.05
)

---------------------------
# -*- coding: utf-8 -*-
"""Comparing pvalues for ks_2samp in small sample, permutation test

Created on Thu Nov 01 22:42:30 2012
Author: Josef Perktold

Results:

sample: 1
permutation 0.07704
scipy       0.0789934269613
R exact     0.07608
R asymp     0.1038

subset results
[ 0.0774  0.0783  0.0732  0.0772  0.0775  0.0755  0.0791  0.0743  0.0816
  0.0763]
>>> execfile('permutations_ks1samp.py')

sample: 0
permutation 0.00791
scipy       0.00378135405937
R exact     0.007937
R asymp     0.01348

subset results
[ 0.008   0.0094  0.0074  0.0071  0.0071  0.0084  0.0073  0.0084  0.0083
  0.0077]

"""

import numpy as np
from scipy import stats

sample = 0

if sample == 0:
    x1, y1 = np.arange(1.,6), np.arange(11,16.)
else:
    x1, y1 = np.arange(1.,26), np.arange(10,31.)-0.5
nx = len(x1)

D, pval = stats.ks_2samp(x1, y1)

#permutation p-values
n_rep = 100000  #make it large
results = np.empty((n_rep, 2))
results.fill(np.nan)

x, y = x1.copy(), y1.copy()
xy = np.concatenate((x,y))
np.random.seed(2345678)
for ii in xrange(n_rep):
    np.random.shuffle(xy)
    results[ii] = stats.ks_2samp(xy[:nx], xy[nx:])

print '\nsample:', sample
print 'permutation', (results[:,0] >= D).mean()
print 'scipy      ', pval
if sample == 0:
    print 'R exact    ', 0.007937
    print 'R asymp    ', 0.01348
else:
    print 'R exact    ', 0.07608
    print 'R asymp    ', 0.1038

print '\nsubset results'
print (results[:,0].reshape(10000, -1) >= D).mean(0)
-------------------------

Josef


>>>
>>> Josef
>>>
>>>>
>>>>
>>>> --
>>>> Regards,
>>>> Peng
>>>> _______________________________________________
>>>> SciPy-User mailing list
>>>> [hidden email]
>>>> http://mail.scipy.org/mailman/listinfo/scipy-user
_______________________________________________
SciPy-User mailing list
[hidden email]
http://mail.scipy.org/mailman/listinfo/scipy-user
Reply | Threaded
Open this post in threaded view
|

Re: ks_2samp is not giving the same results as ks.test in R

josef.pktd
On Thu, Nov 1, 2012 at 11:27 PM,  <[hidden email]> wrote:

> On Thu, Nov 1, 2012 at 9:42 PM,  <[hidden email]> wrote:
>> On Thu, Nov 1, 2012 at 9:41 PM,  <[hidden email]> wrote:
>>> On Thu, Nov 1, 2012 at 9:14 PM,  <[hidden email]> wrote:
>>>> On Thu, Nov 1, 2012 at 8:28 PM, Peng Yu <[hidden email]> wrote:
>>>>> Hi,
>>>>>
>>>>> The ks_2samp function does not give the same answer as ks.test in R.
>>>>> Does anybody know why they are different? Is ks_2samp compute
>>>>> something different?
>>>>>
>>>>> helium:~/linux/test/python/man/library/scipy/stats/ks_2samp$ Rscript main.R
>>>>>> ks.test(1:5, 11:15)
>>>>>
>>>>>         Two-sample Kolmogorov-Smirnov test
>>>>>
>>>>> data:  1:5 and 11:15
>>>>> D = 1, p-value = 0.007937
>>>>> alternative hypothesis: two-sided
>>>>>
>>>>>> ks.test(1:5, 11:15, alternative='less')
>>>>>
>>>>>         Two-sample Kolmogorov-Smirnov test
>>>>>
>>>>> data:  1:5 and 11:15
>>>>> D^- = 0, p-value = 1
>>>>> alternative hypothesis: the CDF of x lies below that of y
>>>>>
>>>>>> ks.test(1:5, 11:15, alternative='greater')
>>>>>
>>>>>         Two-sample Kolmogorov-Smirnov test
>>>>>
>>>>> data:  1:5 and 11:15
>>>>> D^+ = 1, p-value = 0.006738
>>>>> alternative hypothesis: the CDF of x lies above that of y
>>>>>
>>>>>>
>>>>>>
>>>>> helium:~/linux/test/python/man/library/scipy/stats/ks_2samp$ ./main.py
>>>>> (1.0, 0.0037813540593701006)
>>>>> helium:~/linux/test/python/man/library/scipy/stats/ks_2samp$ cat main.py
>>>>> #!/usr/bin/env python
>>>>>
>>>>> from scipy.stats import ks_2samp
>>>>> print ks_2samp([1,2,3,4,5], [11,12,13,14,15])
>>>>
>>>> R uses by default an "exact" distribution for small samples if there
>>>> are no ties.
>>>> If there are ties or with a large sample, R uses the asymptotic distribution.
>>>>
>>>> If I read the function correctly, then scipy.stats is using a small
>>>> sample approximation by Stephens. (But I would have to look up the
>>>> formula to verify this.)
>>>
>>> http://en.wikipedia.org/wiki/Kolmogorov%E2%80%93Smirnov_test#Two-sample_Kolmogorov.E2.80.93Smirnov_test
>>> has the weighted sample size: en = np.sqrt(n1*n2/float(n1+n2))
>>> the small sample weighting ((en+0.12+0.11/en)*d) is the same as in
>>> Stephens (1970, 1985?) for the one sample test.
>>> I don't have a reference for the two sample approximation right now.
>>>
>>> (another bit of random information)
>>> tables are often only available for 0.01 to 0.25 and approximations
>> (hit send too fast)  0.001 to 0.25
>>> are targeted on that range and might not be as accurate outside of it
>>>
>>> Josef
>>>
>>>
>>>>
>>>> In the example below with a bit larger sample and no ties, our
>>>> approximation is closer to R's "exact" pvalue than the asymptotic
>>>> distribution if exact=FALSE.
>>>>
>>>>>  ks.test(1:25, (10:30)-0.5, exact=FALSE)
>>>>
>>>>         Two-sample Kolmogorov-Smirnov test
>>>>
>>>> data:  1:25 and (10:30) - 0.5
>>>> D = 0.36, p-value = 0.1038
>>>> alternative hypothesis: two-sided
>>>>
>>>>>  ks.test(1:25, (10:30)-0.5, exact=TRUE)
>>>>
>>>>         Two-sample Kolmogorov-Smirnov test
>>>>
>>>> data:  1:25 and (10:30) - 0.5
>>>> D = 0.36, p-value = 0.07608
>>>> alternative hypothesis: two-sided
>>>>
>>>>
>>>>>>> stats.ks_2samp(np.arange(1.,26), np.arange(10,31.)-0.5)
>>>> (0.35999999999999999, 0.078993426961291274)
>>>>
>>>>
>>>> For the 1 sample kstest I used (when I rewrote stats.kstest) an
>>>> approximation that is closer to the exact distribution than the
>>>> asymptotic distribution, but it's also not exact.
>>>>
>>>> It would be good to have better small sample approximations or exact
>>>> distributions, but I worked on this in scipy.stats when I barely had
>>>> any idea about goodness-of-fit tests.
>>>> Also, ks_2samp never got the enhancement for one-sided alternatives.
>>>> (In statsmodels I have been working so far only on one sample tests,
>>>> but not on two-sample tests.)
>>>>
>>>> (I don't remember if there is a minimum size recommendation, but the
>>>> examples I usually checked were larger.)
>>>
>>> matlab help: http://www.mathworks.com/help/stats/kstest2.html
>>> "The asymptotic p value becomes very accurate for large sample sizes,
>>> and is believed to be reasonably accurate for sample sizes n1 and n2
>>> such that (n1*n2)/(n1 + n2) >= 4."
>>>>
>>>>
>>>> since it's a community project: Pull Request are welcome
>
> (since I haven't spammed the mailing list in a while, and it's fast)
>
> I think, except in very small sample, ks_2samp looks ok.
>
> (However, independent of our p-values, the power of Kolmogorov-Smirnov
> in small samples is awful.
>>>> stats.ttest_ind(np.arange(1.,26), np.arange(10,31.)-0.5)
> (-3.2015129142545442, 0.0025398589272691129)       reject Null
>>>> stats.ks_2samp(np.arange(1.,26), np.arange(10,31.)-0.5)
> (0.35999999999999999, 0.078993426961291274)        don't reject Null at 0.05
> )

Anderson-Darling has in most cases higher power than Kolmogorov-Smirnov

on my wishlist for a long time:
"K-sample Anderson–Darling tests"
http://scholar.google.com/scholar?cluster=1766356837451056669&hl=en&as_sdt=0,5&sciodt=0,5

volunteers?

Josef

>
> ---------------------------
> # -*- coding: utf-8 -*-
> """Comparing pvalues for ks_2samp in small sample, permutation test
>
> Created on Thu Nov 01 22:42:30 2012
> Author: Josef Perktold
>
> Results:
>
> sample: 1
> permutation 0.07704
> scipy       0.0789934269613
> R exact     0.07608
> R asymp     0.1038
>
> subset results
> [ 0.0774  0.0783  0.0732  0.0772  0.0775  0.0755  0.0791  0.0743  0.0816
>   0.0763]
>>>> execfile('permutations_ks1samp.py')
>
> sample: 0
> permutation 0.00791
> scipy       0.00378135405937
> R exact     0.007937
> R asymp     0.01348
>
> subset results
> [ 0.008   0.0094  0.0074  0.0071  0.0071  0.0084  0.0073  0.0084  0.0083
>   0.0077]
>
> """
>
> import numpy as np
> from scipy import stats
>
> sample = 0
>
> if sample == 0:
>     x1, y1 = np.arange(1.,6), np.arange(11,16.)
> else:
>     x1, y1 = np.arange(1.,26), np.arange(10,31.)-0.5
> nx = len(x1)
>
> D, pval = stats.ks_2samp(x1, y1)
>
> #permutation p-values
> n_rep = 100000  #make it large
> results = np.empty((n_rep, 2))
> results.fill(np.nan)
>
> x, y = x1.copy(), y1.copy()
> xy = np.concatenate((x,y))
> np.random.seed(2345678)
> for ii in xrange(n_rep):
>     np.random.shuffle(xy)
>     results[ii] = stats.ks_2samp(xy[:nx], xy[nx:])
>
> print '\nsample:', sample
> print 'permutation', (results[:,0] >= D).mean()
> print 'scipy      ', pval
> if sample == 0:
>     print 'R exact    ', 0.007937
>     print 'R asymp    ', 0.01348
> else:
>     print 'R exact    ', 0.07608
>     print 'R asymp    ', 0.1038
>
> print '\nsubset results'
> print (results[:,0].reshape(10000, -1) >= D).mean(0)
> -------------------------
>
> Josef
>
>
>>>>
>>>> Josef
>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Regards,
>>>>> Peng
>>>>> _______________________________________________
>>>>> SciPy-User mailing list
>>>>> [hidden email]
>>>>> http://mail.scipy.org/mailman/listinfo/scipy-user
_______________________________________________
SciPy-User mailing list
[hidden email]
http://mail.scipy.org/mailman/listinfo/scipy-user