Hi,
The ks_2samp function does not give the same answer as ks.test in R. Does anybody know why they are different? Is ks_2samp compute something different? helium:~/linux/test/python/man/library/scipy/stats/ks_2samp$ Rscript main.R > ks.test(1:5, 11:15) Two-sample Kolmogorov-Smirnov test data: 1:5 and 11:15 D = 1, p-value = 0.007937 alternative hypothesis: two-sided > ks.test(1:5, 11:15, alternative='less') Two-sample Kolmogorov-Smirnov test data: 1:5 and 11:15 D^- = 0, p-value = 1 alternative hypothesis: the CDF of x lies below that of y > ks.test(1:5, 11:15, alternative='greater') Two-sample Kolmogorov-Smirnov test data: 1:5 and 11:15 D^+ = 1, p-value = 0.006738 alternative hypothesis: the CDF of x lies above that of y > > helium:~/linux/test/python/man/library/scipy/stats/ks_2samp$ ./main.py (1.0, 0.0037813540593701006) helium:~/linux/test/python/man/library/scipy/stats/ks_2samp$ cat main.py #!/usr/bin/env python from scipy.stats import ks_2samp print ks_2samp([1,2,3,4,5], [11,12,13,14,15]) -- Regards, Peng _______________________________________________ SciPy-User mailing list [hidden email] http://mail.scipy.org/mailman/listinfo/scipy-user |
On Thu, Nov 1, 2012 at 8:28 PM, Peng Yu <[hidden email]> wrote:
> Hi, > > The ks_2samp function does not give the same answer as ks.test in R. > Does anybody know why they are different? Is ks_2samp compute > something different? > > helium:~/linux/test/python/man/library/scipy/stats/ks_2samp$ Rscript main.R >> ks.test(1:5, 11:15) > > Two-sample Kolmogorov-Smirnov test > > data: 1:5 and 11:15 > D = 1, p-value = 0.007937 > alternative hypothesis: two-sided > >> ks.test(1:5, 11:15, alternative='less') > > Two-sample Kolmogorov-Smirnov test > > data: 1:5 and 11:15 > D^- = 0, p-value = 1 > alternative hypothesis: the CDF of x lies below that of y > >> ks.test(1:5, 11:15, alternative='greater') > > Two-sample Kolmogorov-Smirnov test > > data: 1:5 and 11:15 > D^+ = 1, p-value = 0.006738 > alternative hypothesis: the CDF of x lies above that of y > >> >> > helium:~/linux/test/python/man/library/scipy/stats/ks_2samp$ ./main.py > (1.0, 0.0037813540593701006) > helium:~/linux/test/python/man/library/scipy/stats/ks_2samp$ cat main.py > #!/usr/bin/env python > > from scipy.stats import ks_2samp > print ks_2samp([1,2,3,4,5], [11,12,13,14,15]) R uses by default an "exact" distribution for small samples if there are no ties. If there are ties or with a large sample, R uses the asymptotic distribution. If I read the function correctly, then scipy.stats is using a small sample approximation by Stephens. (But I would have to look up the formula to verify this.) In the example below with a bit larger sample and no ties, our approximation is closer to R's "exact" pvalue than the asymptotic distribution if exact=FALSE. > ks.test(1:25, (10:30)-0.5, exact=FALSE) Two-sample Kolmogorov-Smirnov test data: 1:25 and (10:30) - 0.5 D = 0.36, p-value = 0.1038 alternative hypothesis: two-sided > ks.test(1:25, (10:30)-0.5, exact=TRUE) Two-sample Kolmogorov-Smirnov test data: 1:25 and (10:30) - 0.5 D = 0.36, p-value = 0.07608 alternative hypothesis: two-sided >>> stats.ks_2samp(np.arange(1.,26), np.arange(10,31.)-0.5) (0.35999999999999999, 0.078993426961291274) For the 1 sample kstest I used (when I rewrote stats.kstest) an approximation that is closer to the exact distribution than the asymptotic distribution, but it's also not exact. It would be good to have better small sample approximations or exact distributions, but I worked on this in scipy.stats when I barely had any idea about goodness-of-fit tests. Also, ks_2samp never got the enhancement for one-sided alternatives. (In statsmodels I have been working so far only on one sample tests, but not on two-sample tests.) (I don't remember if there is a minimum size recommendation, but the examples I usually checked were larger.) since it's a community project: Pull Request are welcome Josef > > > -- > Regards, > Peng > _______________________________________________ > SciPy-User mailing list > [hidden email] > http://mail.scipy.org/mailman/listinfo/scipy-user _______________________________________________ SciPy-User mailing list [hidden email] http://mail.scipy.org/mailman/listinfo/scipy-user |
On Thu, Nov 1, 2012 at 9:14 PM, <[hidden email]> wrote:
> On Thu, Nov 1, 2012 at 8:28 PM, Peng Yu <[hidden email]> wrote: >> Hi, >> >> The ks_2samp function does not give the same answer as ks.test in R. >> Does anybody know why they are different? Is ks_2samp compute >> something different? >> >> helium:~/linux/test/python/man/library/scipy/stats/ks_2samp$ Rscript main.R >>> ks.test(1:5, 11:15) >> >> Two-sample Kolmogorov-Smirnov test >> >> data: 1:5 and 11:15 >> D = 1, p-value = 0.007937 >> alternative hypothesis: two-sided >> >>> ks.test(1:5, 11:15, alternative='less') >> >> Two-sample Kolmogorov-Smirnov test >> >> data: 1:5 and 11:15 >> D^- = 0, p-value = 1 >> alternative hypothesis: the CDF of x lies below that of y >> >>> ks.test(1:5, 11:15, alternative='greater') >> >> Two-sample Kolmogorov-Smirnov test >> >> data: 1:5 and 11:15 >> D^+ = 1, p-value = 0.006738 >> alternative hypothesis: the CDF of x lies above that of y >> >>> >>> >> helium:~/linux/test/python/man/library/scipy/stats/ks_2samp$ ./main.py >> (1.0, 0.0037813540593701006) >> helium:~/linux/test/python/man/library/scipy/stats/ks_2samp$ cat main.py >> #!/usr/bin/env python >> >> from scipy.stats import ks_2samp >> print ks_2samp([1,2,3,4,5], [11,12,13,14,15]) > > R uses by default an "exact" distribution for small samples if there > are no ties. > If there are ties or with a large sample, R uses the asymptotic distribution. > > If I read the function correctly, then scipy.stats is using a small > sample approximation by Stephens. (But I would have to look up the > formula to verify this.) http://en.wikipedia.org/wiki/Kolmogorov%E2%80%93Smirnov_test#Two-sample_Kolmogorov.E2.80.93Smirnov_test has the weighted sample size: en = np.sqrt(n1*n2/float(n1+n2)) the small sample weighting ((en+0.12+0.11/en)*d) is the same as in Stephens (1970, 1985?) for the one sample test. I don't have a reference for the two sample approximation right now. (another bit of random information) tables are often only available for 0.01 to 0.25 and approximations are targeted on that range and might not be as accurate outside of it Josef > > In the example below with a bit larger sample and no ties, our > approximation is closer to R's "exact" pvalue than the asymptotic > distribution if exact=FALSE. > >> ks.test(1:25, (10:30)-0.5, exact=FALSE) > > Two-sample Kolmogorov-Smirnov test > > data: 1:25 and (10:30) - 0.5 > D = 0.36, p-value = 0.1038 > alternative hypothesis: two-sided > >> ks.test(1:25, (10:30)-0.5, exact=TRUE) > > Two-sample Kolmogorov-Smirnov test > > data: 1:25 and (10:30) - 0.5 > D = 0.36, p-value = 0.07608 > alternative hypothesis: two-sided > > >>>> stats.ks_2samp(np.arange(1.,26), np.arange(10,31.)-0.5) > (0.35999999999999999, 0.078993426961291274) > > > For the 1 sample kstest I used (when I rewrote stats.kstest) an > approximation that is closer to the exact distribution than the > asymptotic distribution, but it's also not exact. > > It would be good to have better small sample approximations or exact > distributions, but I worked on this in scipy.stats when I barely had > any idea about goodness-of-fit tests. > Also, ks_2samp never got the enhancement for one-sided alternatives. > (In statsmodels I have been working so far only on one sample tests, > but not on two-sample tests.) > > (I don't remember if there is a minimum size recommendation, but the > examples I usually checked were larger.) matlab help: http://www.mathworks.com/help/stats/kstest2.html "The asymptotic p value becomes very accurate for large sample sizes, and is believed to be reasonably accurate for sample sizes n1 and n2 such that (n1*n2)/(n1 + n2) >= 4." > > > since it's a community project: Pull Request are welcome > > Josef > >> >> >> -- >> Regards, >> Peng >> _______________________________________________ >> SciPy-User mailing list >> [hidden email] >> http://mail.scipy.org/mailman/listinfo/scipy-user SciPy-User mailing list [hidden email] http://mail.scipy.org/mailman/listinfo/scipy-user |
On Thu, Nov 1, 2012 at 9:41 PM, <[hidden email]> wrote:
> On Thu, Nov 1, 2012 at 9:14 PM, <[hidden email]> wrote: >> On Thu, Nov 1, 2012 at 8:28 PM, Peng Yu <[hidden email]> wrote: >>> Hi, >>> >>> The ks_2samp function does not give the same answer as ks.test in R. >>> Does anybody know why they are different? Is ks_2samp compute >>> something different? >>> >>> helium:~/linux/test/python/man/library/scipy/stats/ks_2samp$ Rscript main.R >>>> ks.test(1:5, 11:15) >>> >>> Two-sample Kolmogorov-Smirnov test >>> >>> data: 1:5 and 11:15 >>> D = 1, p-value = 0.007937 >>> alternative hypothesis: two-sided >>> >>>> ks.test(1:5, 11:15, alternative='less') >>> >>> Two-sample Kolmogorov-Smirnov test >>> >>> data: 1:5 and 11:15 >>> D^- = 0, p-value = 1 >>> alternative hypothesis: the CDF of x lies below that of y >>> >>>> ks.test(1:5, 11:15, alternative='greater') >>> >>> Two-sample Kolmogorov-Smirnov test >>> >>> data: 1:5 and 11:15 >>> D^+ = 1, p-value = 0.006738 >>> alternative hypothesis: the CDF of x lies above that of y >>> >>>> >>>> >>> helium:~/linux/test/python/man/library/scipy/stats/ks_2samp$ ./main.py >>> (1.0, 0.0037813540593701006) >>> helium:~/linux/test/python/man/library/scipy/stats/ks_2samp$ cat main.py >>> #!/usr/bin/env python >>> >>> from scipy.stats import ks_2samp >>> print ks_2samp([1,2,3,4,5], [11,12,13,14,15]) >> >> R uses by default an "exact" distribution for small samples if there >> are no ties. >> If there are ties or with a large sample, R uses the asymptotic distribution. >> >> If I read the function correctly, then scipy.stats is using a small >> sample approximation by Stephens. (But I would have to look up the >> formula to verify this.) > > http://en.wikipedia.org/wiki/Kolmogorov%E2%80%93Smirnov_test#Two-sample_Kolmogorov.E2.80.93Smirnov_test > has the weighted sample size: en = np.sqrt(n1*n2/float(n1+n2)) > the small sample weighting ((en+0.12+0.11/en)*d) is the same as in > Stephens (1970, 1985?) for the one sample test. > I don't have a reference for the two sample approximation right now. > > (another bit of random information) > tables are often only available for 0.01 to 0.25 and approximations > are targeted on that range and might not be as accurate outside of it > > Josef > > >> >> In the example below with a bit larger sample and no ties, our >> approximation is closer to R's "exact" pvalue than the asymptotic >> distribution if exact=FALSE. >> >>> ks.test(1:25, (10:30)-0.5, exact=FALSE) >> >> Two-sample Kolmogorov-Smirnov test >> >> data: 1:25 and (10:30) - 0.5 >> D = 0.36, p-value = 0.1038 >> alternative hypothesis: two-sided >> >>> ks.test(1:25, (10:30)-0.5, exact=TRUE) >> >> Two-sample Kolmogorov-Smirnov test >> >> data: 1:25 and (10:30) - 0.5 >> D = 0.36, p-value = 0.07608 >> alternative hypothesis: two-sided >> >> >>>>> stats.ks_2samp(np.arange(1.,26), np.arange(10,31.)-0.5) >> (0.35999999999999999, 0.078993426961291274) >> >> >> For the 1 sample kstest I used (when I rewrote stats.kstest) an >> approximation that is closer to the exact distribution than the >> asymptotic distribution, but it's also not exact. >> >> It would be good to have better small sample approximations or exact >> distributions, but I worked on this in scipy.stats when I barely had >> any idea about goodness-of-fit tests. >> Also, ks_2samp never got the enhancement for one-sided alternatives. >> (In statsmodels I have been working so far only on one sample tests, >> but not on two-sample tests.) >> >> (I don't remember if there is a minimum size recommendation, but the >> examples I usually checked were larger.) > > matlab help: http://www.mathworks.com/help/stats/kstest2.html > "The asymptotic p value becomes very accurate for large sample sizes, > and is believed to be reasonably accurate for sample sizes n1 and n2 > such that (n1*n2)/(n1 + n2) >= 4." >> >> >> since it's a community project: Pull Request are welcome >> >> Josef >> >>> >>> >>> -- >>> Regards, >>> Peng >>> _______________________________________________ >>> SciPy-User mailing list >>> [hidden email] >>> http://mail.scipy.org/mailman/listinfo/scipy-user SciPy-User mailing list [hidden email] http://mail.scipy.org/mailman/listinfo/scipy-user |
On Thu, Nov 1, 2012 at 9:42 PM, <[hidden email]> wrote:
> On Thu, Nov 1, 2012 at 9:41 PM, <[hidden email]> wrote: >> On Thu, Nov 1, 2012 at 9:14 PM, <[hidden email]> wrote: >>> On Thu, Nov 1, 2012 at 8:28 PM, Peng Yu <[hidden email]> wrote: >>>> Hi, >>>> >>>> The ks_2samp function does not give the same answer as ks.test in R. >>>> Does anybody know why they are different? Is ks_2samp compute >>>> something different? >>>> >>>> helium:~/linux/test/python/man/library/scipy/stats/ks_2samp$ Rscript main.R >>>>> ks.test(1:5, 11:15) >>>> >>>> Two-sample Kolmogorov-Smirnov test >>>> >>>> data: 1:5 and 11:15 >>>> D = 1, p-value = 0.007937 >>>> alternative hypothesis: two-sided >>>> >>>>> ks.test(1:5, 11:15, alternative='less') >>>> >>>> Two-sample Kolmogorov-Smirnov test >>>> >>>> data: 1:5 and 11:15 >>>> D^- = 0, p-value = 1 >>>> alternative hypothesis: the CDF of x lies below that of y >>>> >>>>> ks.test(1:5, 11:15, alternative='greater') >>>> >>>> Two-sample Kolmogorov-Smirnov test >>>> >>>> data: 1:5 and 11:15 >>>> D^+ = 1, p-value = 0.006738 >>>> alternative hypothesis: the CDF of x lies above that of y >>>> >>>>> >>>>> >>>> helium:~/linux/test/python/man/library/scipy/stats/ks_2samp$ ./main.py >>>> (1.0, 0.0037813540593701006) >>>> helium:~/linux/test/python/man/library/scipy/stats/ks_2samp$ cat main.py >>>> #!/usr/bin/env python >>>> >>>> from scipy.stats import ks_2samp >>>> print ks_2samp([1,2,3,4,5], [11,12,13,14,15]) >>> >>> R uses by default an "exact" distribution for small samples if there >>> are no ties. >>> If there are ties or with a large sample, R uses the asymptotic distribution. >>> >>> If I read the function correctly, then scipy.stats is using a small >>> sample approximation by Stephens. (But I would have to look up the >>> formula to verify this.) >> >> http://en.wikipedia.org/wiki/Kolmogorov%E2%80%93Smirnov_test#Two-sample_Kolmogorov.E2.80.93Smirnov_test >> has the weighted sample size: en = np.sqrt(n1*n2/float(n1+n2)) >> the small sample weighting ((en+0.12+0.11/en)*d) is the same as in >> Stephens (1970, 1985?) for the one sample test. >> I don't have a reference for the two sample approximation right now. >> >> (another bit of random information) >> tables are often only available for 0.01 to 0.25 and approximations > (hit send too fast) 0.001 to 0.25 >> are targeted on that range and might not be as accurate outside of it >> >> Josef >> >> >>> >>> In the example below with a bit larger sample and no ties, our >>> approximation is closer to R's "exact" pvalue than the asymptotic >>> distribution if exact=FALSE. >>> >>>> ks.test(1:25, (10:30)-0.5, exact=FALSE) >>> >>> Two-sample Kolmogorov-Smirnov test >>> >>> data: 1:25 and (10:30) - 0.5 >>> D = 0.36, p-value = 0.1038 >>> alternative hypothesis: two-sided >>> >>>> ks.test(1:25, (10:30)-0.5, exact=TRUE) >>> >>> Two-sample Kolmogorov-Smirnov test >>> >>> data: 1:25 and (10:30) - 0.5 >>> D = 0.36, p-value = 0.07608 >>> alternative hypothesis: two-sided >>> >>> >>>>>> stats.ks_2samp(np.arange(1.,26), np.arange(10,31.)-0.5) >>> (0.35999999999999999, 0.078993426961291274) >>> >>> >>> For the 1 sample kstest I used (when I rewrote stats.kstest) an >>> approximation that is closer to the exact distribution than the >>> asymptotic distribution, but it's also not exact. >>> >>> It would be good to have better small sample approximations or exact >>> distributions, but I worked on this in scipy.stats when I barely had >>> any idea about goodness-of-fit tests. >>> Also, ks_2samp never got the enhancement for one-sided alternatives. >>> (In statsmodels I have been working so far only on one sample tests, >>> but not on two-sample tests.) >>> >>> (I don't remember if there is a minimum size recommendation, but the >>> examples I usually checked were larger.) >> >> matlab help: http://www.mathworks.com/help/stats/kstest2.html >> "The asymptotic p value becomes very accurate for large sample sizes, >> and is believed to be reasonably accurate for sample sizes n1 and n2 >> such that (n1*n2)/(n1 + n2) >= 4." >>> >>> >>> since it's a community project: Pull Request are welcome (since I haven't spammed the mailing list in a while, and it's fast) I think, except in very small sample, ks_2samp looks ok. (However, independent of our p-values, the power of Kolmogorov-Smirnov in small samples is awful. >>> stats.ttest_ind(np.arange(1.,26), np.arange(10,31.)-0.5) (-3.2015129142545442, 0.0025398589272691129) reject Null >>> stats.ks_2samp(np.arange(1.,26), np.arange(10,31.)-0.5) (0.35999999999999999, 0.078993426961291274) don't reject Null at 0.05 ) --------------------------- # -*- coding: utf-8 -*- """Comparing pvalues for ks_2samp in small sample, permutation test Created on Thu Nov 01 22:42:30 2012 Author: Josef Perktold Results: sample: 1 permutation 0.07704 scipy 0.0789934269613 R exact 0.07608 R asymp 0.1038 subset results [ 0.0774 0.0783 0.0732 0.0772 0.0775 0.0755 0.0791 0.0743 0.0816 0.0763] >>> execfile('permutations_ks1samp.py') sample: 0 permutation 0.00791 scipy 0.00378135405937 R exact 0.007937 R asymp 0.01348 subset results [ 0.008 0.0094 0.0074 0.0071 0.0071 0.0084 0.0073 0.0084 0.0083 0.0077] """ import numpy as np from scipy import stats sample = 0 if sample == 0: x1, y1 = np.arange(1.,6), np.arange(11,16.) else: x1, y1 = np.arange(1.,26), np.arange(10,31.)-0.5 nx = len(x1) D, pval = stats.ks_2samp(x1, y1) #permutation p-values n_rep = 100000 #make it large results = np.empty((n_rep, 2)) results.fill(np.nan) x, y = x1.copy(), y1.copy() xy = np.concatenate((x,y)) np.random.seed(2345678) for ii in xrange(n_rep): np.random.shuffle(xy) results[ii] = stats.ks_2samp(xy[:nx], xy[nx:]) print '\nsample:', sample print 'permutation', (results[:,0] >= D).mean() print 'scipy ', pval if sample == 0: print 'R exact ', 0.007937 print 'R asymp ', 0.01348 else: print 'R exact ', 0.07608 print 'R asymp ', 0.1038 print '\nsubset results' print (results[:,0].reshape(10000, -1) >= D).mean(0) ------------------------- Josef >>> >>> Josef >>> >>>> >>>> >>>> -- >>>> Regards, >>>> Peng >>>> _______________________________________________ >>>> SciPy-User mailing list >>>> [hidden email] >>>> http://mail.scipy.org/mailman/listinfo/scipy-user SciPy-User mailing list [hidden email] http://mail.scipy.org/mailman/listinfo/scipy-user |
On Thu, Nov 1, 2012 at 11:27 PM, <[hidden email]> wrote:
> On Thu, Nov 1, 2012 at 9:42 PM, <[hidden email]> wrote: >> On Thu, Nov 1, 2012 at 9:41 PM, <[hidden email]> wrote: >>> On Thu, Nov 1, 2012 at 9:14 PM, <[hidden email]> wrote: >>>> On Thu, Nov 1, 2012 at 8:28 PM, Peng Yu <[hidden email]> wrote: >>>>> Hi, >>>>> >>>>> The ks_2samp function does not give the same answer as ks.test in R. >>>>> Does anybody know why they are different? Is ks_2samp compute >>>>> something different? >>>>> >>>>> helium:~/linux/test/python/man/library/scipy/stats/ks_2samp$ Rscript main.R >>>>>> ks.test(1:5, 11:15) >>>>> >>>>> Two-sample Kolmogorov-Smirnov test >>>>> >>>>> data: 1:5 and 11:15 >>>>> D = 1, p-value = 0.007937 >>>>> alternative hypothesis: two-sided >>>>> >>>>>> ks.test(1:5, 11:15, alternative='less') >>>>> >>>>> Two-sample Kolmogorov-Smirnov test >>>>> >>>>> data: 1:5 and 11:15 >>>>> D^- = 0, p-value = 1 >>>>> alternative hypothesis: the CDF of x lies below that of y >>>>> >>>>>> ks.test(1:5, 11:15, alternative='greater') >>>>> >>>>> Two-sample Kolmogorov-Smirnov test >>>>> >>>>> data: 1:5 and 11:15 >>>>> D^+ = 1, p-value = 0.006738 >>>>> alternative hypothesis: the CDF of x lies above that of y >>>>> >>>>>> >>>>>> >>>>> helium:~/linux/test/python/man/library/scipy/stats/ks_2samp$ ./main.py >>>>> (1.0, 0.0037813540593701006) >>>>> helium:~/linux/test/python/man/library/scipy/stats/ks_2samp$ cat main.py >>>>> #!/usr/bin/env python >>>>> >>>>> from scipy.stats import ks_2samp >>>>> print ks_2samp([1,2,3,4,5], [11,12,13,14,15]) >>>> >>>> R uses by default an "exact" distribution for small samples if there >>>> are no ties. >>>> If there are ties or with a large sample, R uses the asymptotic distribution. >>>> >>>> If I read the function correctly, then scipy.stats is using a small >>>> sample approximation by Stephens. (But I would have to look up the >>>> formula to verify this.) >>> >>> http://en.wikipedia.org/wiki/Kolmogorov%E2%80%93Smirnov_test#Two-sample_Kolmogorov.E2.80.93Smirnov_test >>> has the weighted sample size: en = np.sqrt(n1*n2/float(n1+n2)) >>> the small sample weighting ((en+0.12+0.11/en)*d) is the same as in >>> Stephens (1970, 1985?) for the one sample test. >>> I don't have a reference for the two sample approximation right now. >>> >>> (another bit of random information) >>> tables are often only available for 0.01 to 0.25 and approximations >> (hit send too fast) 0.001 to 0.25 >>> are targeted on that range and might not be as accurate outside of it >>> >>> Josef >>> >>> >>>> >>>> In the example below with a bit larger sample and no ties, our >>>> approximation is closer to R's "exact" pvalue than the asymptotic >>>> distribution if exact=FALSE. >>>> >>>>> ks.test(1:25, (10:30)-0.5, exact=FALSE) >>>> >>>> Two-sample Kolmogorov-Smirnov test >>>> >>>> data: 1:25 and (10:30) - 0.5 >>>> D = 0.36, p-value = 0.1038 >>>> alternative hypothesis: two-sided >>>> >>>>> ks.test(1:25, (10:30)-0.5, exact=TRUE) >>>> >>>> Two-sample Kolmogorov-Smirnov test >>>> >>>> data: 1:25 and (10:30) - 0.5 >>>> D = 0.36, p-value = 0.07608 >>>> alternative hypothesis: two-sided >>>> >>>> >>>>>>> stats.ks_2samp(np.arange(1.,26), np.arange(10,31.)-0.5) >>>> (0.35999999999999999, 0.078993426961291274) >>>> >>>> >>>> For the 1 sample kstest I used (when I rewrote stats.kstest) an >>>> approximation that is closer to the exact distribution than the >>>> asymptotic distribution, but it's also not exact. >>>> >>>> It would be good to have better small sample approximations or exact >>>> distributions, but I worked on this in scipy.stats when I barely had >>>> any idea about goodness-of-fit tests. >>>> Also, ks_2samp never got the enhancement for one-sided alternatives. >>>> (In statsmodels I have been working so far only on one sample tests, >>>> but not on two-sample tests.) >>>> >>>> (I don't remember if there is a minimum size recommendation, but the >>>> examples I usually checked were larger.) >>> >>> matlab help: http://www.mathworks.com/help/stats/kstest2.html >>> "The asymptotic p value becomes very accurate for large sample sizes, >>> and is believed to be reasonably accurate for sample sizes n1 and n2 >>> such that (n1*n2)/(n1 + n2) >= 4." >>>> >>>> >>>> since it's a community project: Pull Request are welcome > > (since I haven't spammed the mailing list in a while, and it's fast) > > I think, except in very small sample, ks_2samp looks ok. > > (However, independent of our p-values, the power of Kolmogorov-Smirnov > in small samples is awful. >>>> stats.ttest_ind(np.arange(1.,26), np.arange(10,31.)-0.5) > (-3.2015129142545442, 0.0025398589272691129) reject Null >>>> stats.ks_2samp(np.arange(1.,26), np.arange(10,31.)-0.5) > (0.35999999999999999, 0.078993426961291274) don't reject Null at 0.05 > ) Anderson-Darling has in most cases higher power than Kolmogorov-Smirnov on my wishlist for a long time: "K-sample Andersonâ€“Darling tests" http://scholar.google.com/scholar?cluster=1766356837451056669&hl=en&as_sdt=0,5&sciodt=0,5 volunteers? Josef > > --------------------------- > # -*- coding: utf-8 -*- > """Comparing pvalues for ks_2samp in small sample, permutation test > > Created on Thu Nov 01 22:42:30 2012 > Author: Josef Perktold > > Results: > > sample: 1 > permutation 0.07704 > scipy 0.0789934269613 > R exact 0.07608 > R asymp 0.1038 > > subset results > [ 0.0774 0.0783 0.0732 0.0772 0.0775 0.0755 0.0791 0.0743 0.0816 > 0.0763] >>>> execfile('permutations_ks1samp.py') > > sample: 0 > permutation 0.00791 > scipy 0.00378135405937 > R exact 0.007937 > R asymp 0.01348 > > subset results > [ 0.008 0.0094 0.0074 0.0071 0.0071 0.0084 0.0073 0.0084 0.0083 > 0.0077] > > """ > > import numpy as np > from scipy import stats > > sample = 0 > > if sample == 0: > x1, y1 = np.arange(1.,6), np.arange(11,16.) > else: > x1, y1 = np.arange(1.,26), np.arange(10,31.)-0.5 > nx = len(x1) > > D, pval = stats.ks_2samp(x1, y1) > > #permutation p-values > n_rep = 100000 #make it large > results = np.empty((n_rep, 2)) > results.fill(np.nan) > > x, y = x1.copy(), y1.copy() > xy = np.concatenate((x,y)) > np.random.seed(2345678) > for ii in xrange(n_rep): > np.random.shuffle(xy) > results[ii] = stats.ks_2samp(xy[:nx], xy[nx:]) > > print '\nsample:', sample > print 'permutation', (results[:,0] >= D).mean() > print 'scipy ', pval > if sample == 0: > print 'R exact ', 0.007937 > print 'R asymp ', 0.01348 > else: > print 'R exact ', 0.07608 > print 'R asymp ', 0.1038 > > print '\nsubset results' > print (results[:,0].reshape(10000, -1) >= D).mean(0) > ------------------------- > > Josef > > >>>> >>>> Josef >>>> >>>>> >>>>> >>>>> -- >>>>> Regards, >>>>> Peng >>>>> _______________________________________________ >>>>> SciPy-User mailing list >>>>> [hidden email] >>>>> http://mail.scipy.org/mailman/listinfo/scipy-user SciPy-User mailing list [hidden email] http://mail.scipy.org/mailman/listinfo/scipy-user |
Free forum by Nabble | Edit this page |