# ks_2samp is not giving the same results as ks.test in R

## ks_2samp is not giving the same results as ks.test in R

Hi,

The ks_2samp function does not give the same answer as ks.test in R. Does anybody know why they are different? Does ks_2samp compute something different?

    helium:~/linux/test/python/man/library/scipy/stats/ks_2samp$ Rscript main.R
    > ks.test(1:5, 11:15)

            Two-sample Kolmogorov-Smirnov test

    data:  1:5 and 11:15
    D = 1, p-value = 0.007937
    alternative hypothesis: two-sided

    > ks.test(1:5, 11:15, alternative='less')

            Two-sample Kolmogorov-Smirnov test

    data:  1:5 and 11:15
    D^- = 0, p-value = 1
    alternative hypothesis: the CDF of x lies below that of y

    > ks.test(1:5, 11:15, alternative='greater')

            Two-sample Kolmogorov-Smirnov test

    data:  1:5 and 11:15
    D^+ = 1, p-value = 0.006738
    alternative hypothesis: the CDF of x lies above that of y

    helium:~/linux/test/python/man/library/scipy/stats/ks_2samp$ ./main.py
    (1.0, 0.0037813540593701006)
    helium:~/linux/test/python/man/library/scipy/stats/ks_2samp$ cat main.py
    #!/usr/bin/env python

    from scipy.stats import ks_2samp
    print ks_2samp([1,2,3,4,5], [11,12,13,14,15])

--
Regards,
Peng

_______________________________________________
SciPy-User mailing list
[hidden email]
http://mail.scipy.org/mailman/listinfo/scipy-user

## Re: ks_2samp is not giving the same results as ks.test in R

On Thu, Nov 1, 2012 at 8:28 PM, Peng Yu <[hidden email]> wrote:

> The ks_2samp function does not give the same answer as ks.test in R.
> Does anybody know why they are different? Does ks_2samp compute
> something different?

By default, R uses an "exact" distribution for small samples if there are no ties. If there are ties, or with a large sample, R uses the asymptotic distribution.

If I read the function correctly, then scipy.stats is using a small sample approximation by Stephens. (But I would have to look up the formula to verify this.)

In the example below, with a somewhat larger sample and no ties, our approximation (0.0790) is closer to R's "exact" p-value (0.07608) than R's asymptotic p-value with exact=FALSE (0.1038).

    > ks.test(1:25, (10:30)-0.5, exact=FALSE)

            Two-sample Kolmogorov-Smirnov test

    data:  1:25 and (10:30) - 0.5
    D = 0.36, p-value = 0.1038
    alternative hypothesis: two-sided

    > ks.test(1:25, (10:30)-0.5, exact=TRUE)

            Two-sample Kolmogorov-Smirnov test

    data:  1:25 and (10:30) - 0.5
    D = 0.36, p-value = 0.07608
    alternative hypothesis: two-sided

    >>> stats.ks_2samp(np.arange(1.,26), np.arange(10,31.)-0.5)
    (0.35999999999999999, 0.078993426961291274)

For the one-sample kstest I used (when I rewrote stats.kstest) an approximation that is closer to the exact distribution than the asymptotic distribution, but it's also not exact.

It would be good to have better small sample approximations or exact distributions, but I worked on this in scipy.stats when I barely had any idea about goodness-of-fit tests. Also, ks_2samp never got the enhancement for one-sided alternatives. (In statsmodels I have so far been working only on one-sample tests, not on two-sample tests.)

(I don't remember if there is a minimum size recommendation, but the examples I usually checked were larger.)

Since it's a community project: pull requests are welcome.

Josef
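To make the "exact" distribution mentioned above concrete, here is a minimal brute-force sketch. It assumes no ties and treats every split of the pooled data into groups of the original sizes as equally likely under the null hypothesis. The function names are made up for illustration, and full enumeration is only feasible for very small samples; R presumably uses a far more efficient method.

```python
from itertools import combinations


def ks_statistic(x, y):
    """Two-sided two-sample KS statistic: max |F_x(t) - F_y(t)| over pooled points."""
    n, m = len(x), len(y)
    pooled = sorted(set(x) | set(y))
    return max(
        abs(sum(v <= t for v in x) / n - sum(v <= t for v in y) / m)
        for t in pooled
    )


def exact_two_sided_pvalue(x, y):
    """Fraction of all splits of the pooled sample whose D is >= the observed D.

    Illustrative only: the number of splits grows as C(n + m, n).
    """
    pooled = list(x) + list(y)
    n = len(x)
    d_obs = ks_statistic(x, y)
    indices = range(len(pooled))
    hits = 0
    total = 0
    for chosen in combinations(indices, n):
        chosen_set = set(chosen)
        xs = [pooled[i] for i in chosen]
        ys = [pooled[i] for i in indices if i not in chosen_set]
        # Small tolerance guards against floating-point noise in the comparison.
        if ks_statistic(xs, ys) >= d_obs - 1e-12:
            hits += 1
        total += 1
    return hits / total


if __name__ == "__main__":
    print(exact_two_sided_pvalue([1, 2, 3, 4, 5], [11, 12, 13, 14, 15]))
```

For the 1:5 versus 11:15 example, only two of the C(10, 5) = 252 splits separate the samples completely (D = 1), so the result is 2/252, which rounds to the 0.007937 that R printed.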

## Re: ks_2samp is not giving the same results as ks.test in R

On Thu, Nov 1, 2012 at 9:14 PM, <[hidden email]> wrote:

> If I read the function correctly, then scipy.stats is using a small
> sample approximation by Stephens. (But I would have to look up the
> formula to verify this.)

http://en.wikipedia.org/wiki/Kolmogorov%E2%80%93Smirnov_test#Two-sample_Kolmogorov.E2.80.93Smirnov_test has the weighted sample size:

    en = np.sqrt(n1*n2/float(n1+n2))

The small sample weighting ((en+0.12+0.11/en)*d) is the same as in Stephens (1970, 1985?) for the one-sample test. I don't have a reference for the two-sample approximation right now.

(Another bit of random information:) tables are often only available for 0.01 to 0.25, and approximations are targeted on that range and might not be as accurate outside of it.

Josef

> (I don't remember if there is a minimum size recommendation, but the
> examples I usually checked were larger.)

MATLAB help, http://www.mathworks.com/help/stats/kstest2.html:

"The asymptotic p value becomes very accurate for large sample sizes, and is believed to be reasonably accurate for sample sizes n1 and n2 such that (n1*n2)/(n1 + n2) >= 4."
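The formulas quoted in this message can be checked against the numbers posted earlier in the thread. The sketch below uses only the standard library and evaluates the limiting Kolmogorov survival function by its series; all function names are illustrative and are not scipy's or R's API. The one-sided formula exp(-2*en^2*d^2) is an assumption on my part: it is the standard asymptotic expression and it happens to reproduce R's printed one-sided value, but nothing in the thread confirms that R computes it this way.

```python
import math


def kolmogorov_sf(lam, terms=100):
    """Limiting Kolmogorov survival function:
    Q(lam) = 2 * sum_{k>=1} (-1)**(k-1) * exp(-2 * k**2 * lam**2).
    The truncated series is adequate for the moderate lam values used here.
    """
    if lam <= 0:
        return 1.0
    total = sum(
        (-1) ** (k - 1) * math.exp(-2.0 * k * k * lam * lam)
        for k in range(1, terms + 1)
    )
    return min(1.0, max(0.0, 2.0 * total))


def effective_n(n1, n2):
    """Weighted sample size en = sqrt(n1 * n2 / (n1 + n2))."""
    return math.sqrt(n1 * n2 / (n1 + n2))


def p_asymptotic(d, n1, n2):
    """Two-sided asymptotic p-value, no small-sample correction."""
    return kolmogorov_sf(effective_n(n1, n2) * d)


def p_stephens(d, n1, n2):
    """Two-sided p-value with the (en + 0.12 + 0.11/en) * d weighting."""
    en = effective_n(n1, n2)
    return kolmogorov_sf((en + 0.12 + 0.11 / en) * d)


def p_one_sided_asymptotic(d, n1, n2):
    """Assumed one-sided asymptotic p-value: exp(-2 * en**2 * d**2)."""
    return math.exp(-2.0 * effective_n(n1, n2) ** 2 * d * d)


def meets_rule_of_thumb(n1, n2):
    """The quoted MATLAB guideline: (n1 * n2) / (n1 + n2) >= 4."""
    return (n1 * n2) / (n1 + n2) >= 4


if __name__ == "__main__":
    for label, d, n1, n2 in [
        ("1:5 vs 11:15", 1.0, 5, 5),
        ("1:25 vs (10:30)-0.5", 0.36, 25, 21),
    ]:
        print(label,
              "asymptotic:", p_asymptotic(d, n1, n2),
              "Stephens:", p_stephens(d, n1, n2),
              "rule of thumb met:", meets_rule_of_thumb(n1, n2))
```

With these definitions, the Stephens-weighted values should land close to the two ks_2samp results posted in the thread (about 0.00378 and 0.0790), and the uncorrected asymptotic value for the larger example close to R's exact=FALSE result (0.1038). The 5-versus-5 example gives n1*n2/(n1+n2) = 2.5, below the quoted threshold of 4, which fits the large gap between the approximation and R's exact p-value there.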

## Re: ks_2samp is not giving the same results as ks.test in R

On Thu, Nov 1, 2012 at 9:41 PM, <[hidden email]> wrote:

> (another bit of random information)
> tables are often only available for 0.01 to 0.25 and approximations
> are targeted on that range and might not be as accurate outside of it

(hit send too fast) That should be 0.001 to 0.25.