Most statistical tests and statistical inference in scipy.stats and
statsmodels rely on large-sample assumptions.

Everyone is talking about "Big data", but is anyone still interested in doing small sample statistics in Python?

I'd like to know whether it's worth spending any time on general purpose small sample statistics.

for example:

http://facultyweb.berry.edu/vbissonnette/statshw/doc/perm_2bs.html

```
Example homework problem:
Twenty participants were given a list of 20 words to process. The 20 participants were randomly assigned to one of two treatment conditions. Half were instructed to count the number of vowels in each word (shallow processing). Half were instructed to judge whether the object described by each word would be useful if one were stranded on a desert island (deep processing). After a brief distractor task, all subjects were given a surprise free recall task. The number of words correctly recalled was recorded for each subject. Here are the data:

Shallow Processing: 13 12 11 9 11 13 14 14 14 15
Deep Processing:    12 15 14 14 13 12 15 14 16 17
```

Josef

_______________________________________________
SciPy-User mailing list
[hidden email]
http://mail.scipy.org/mailman/listinfo/scipy-user
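As a concrete illustration of the kind of procedure the linked homework asks for (this sketch is not part of the original post; the data are from the problem above, everything else is an assumed implementation), a randomized two-sample permutation test on the difference in group means:

```python
import numpy as np

shallow = np.array([13, 12, 11, 9, 11, 13, 14, 14, 14, 15])
deep = np.array([12, 15, 14, 14, 13, 12, 15, 14, 16, 17])

def permutation_pvalue(x, y, n_resamples=20000, seed=0):
    """Two-sided permutation p-value for a difference in means."""
    rng = np.random.RandomState(seed)
    pooled = np.concatenate([x, y])
    observed = abs(x.mean() - y.mean())
    count = 0
    for _ in range(n_resamples):
        rng.shuffle(pooled)  # random relabeling of the 20 subjects
        stat = abs(pooled[:len(x)].mean() - pooled[len(x):].mean())
        if stat >= observed:
            count += 1
    return count / n_resamples

p = permutation_pvalue(shallow, deep)
```

The p-value is the fraction of random relabelings whose statistic is at least as extreme as the observed difference of 1.6 words.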
On Thu, Oct 11, 2012 at 10:57:23AM -0400, [hidden email] wrote:
> Everyone is talking about "Big data", but is anyone still interested
> in doing small sample statistics in python.

I am!

> I'd like to know whether it's worth spending any time on general
> purpose small sample statistics.

It is. Big data is a buzz, but few people have big data. In addition, what they don't realize is that it is often a small sample problem in terms of statistics, as the number of samples is often not much bigger than the number of features.

Thanks for all your work on scipy.stats!

Gael
In reply to this post by josef.pktd
On 11 October 2012 15:57, <[hidden email]> wrote:
> Everyone is talking about "Big data", but is anyone still interested
> in doing small sample statistics in python.
>
> I'd like to know whether it's worth spending any time on general
> purpose small sample statistics.

I'm certainly interested in that sort of thing - a lot of biology still revolves around simple, 'small data' stats.

Thanks,
Thomas
In reply to this post by josef.pktd
On Thu, Oct 11, 2012 at 7:57 AM, <[hidden email]> wrote:
> Most statistical tests and statistical inference in scipy.stats and
> statsmodels relies on large number assumptions.
>
> Everyone is talking about "Big data", but is anyone still interested
> in doing small sample statistics in python.

+1

--
Sergio (Serge) Rey
Professor, School of Geographical Sciences and Urban Planning
GeoDa Center for Geospatial Analysis and Computation
Arizona State University
http://geoplan.asu.edu/rey
Editor, International Regional Science Review
http://irx.sagepub.com
In reply to this post by josef.pktd
On 10/11/2012 04:57 PM, [hidden email] wrote:
> Most statistical tests and statistical inference in scipy.stats and
> statsmodels relies on large number assumptions.
>
> Everyone is talking about "Big data", but is anyone still interested
> in doing small sample statistics in python.
>
> I'd like to know whether it's worth spending any time on general
> purpose small sample statistics.
>
> for example:
>
> http://facultyweb.berry.edu/vbissonnette/statshw/doc/perm_2bs.html
>
> ```
> Example homework problem:
> [...]
> Shallow Processing: 13 12 11 9 11 13 14 14 14 15
> Deep Processing: 12 15 14 14 13 12 15 14 16 17
> ```

I am very interested in inference from small samples, but I have some concerns about both the example and the proposed approach based on the permutation test.

IMHO the question in the example at that URL, i.e. "Did the instructions given to the participants significantly affect their level of recall?", is not directly addressed by the permutation test. The permutation test is related to the question "how (un)likely is the collected dataset under the assumption that the instructions did not affect the level of recall?". In other words, the initial question is about quantifying how likely the hypothesis "the instructions do not affect the level of recall" (let's call it H_0) is given the collected dataset, with respect to how likely the hypothesis "the instructions affect the level of recall" (let's call it H_1) is given the data. In a bit more formal notation, the initial question is about estimating p(H_0|data) and p(H_1|data), while the permutation test provides a different quantity, which is related (see [0]) to p(data|H_0). Clearly p(data|H_0) is different from p(H_0|data). Literature on this point is for example http://dx.doi.org/10.1016/j.socec.2004.09.033

On a different side, I am also interested in understanding the assumptions under which the permutation test is expected to work. I am not an expert in that field but, as far as I know, the permutation test - and all resampling approaches in general - requires that the sample is "representative" of the underlying distribution of the problem. In my opinion this requirement is difficult to assess in practice, and it is even more troubling for the specific case of "small data" - of interest for this thread.

Any comment on these points is warmly welcome.

Best,

Emanuele

[0] A minor detail: I said "related" because the outcome of the permutation test, and of classical tests for hypothesis testing in general, is not precisely p(data|H_0). First of all, those tests rely on a statistic of the dataset and not on the dataset itself. In the example at the URL the statistic (called "criterion" there) is the difference between the means of the two groups. Second, and more important, the test provides an estimate of the probability of observing such a value for the statistic... "or a more extreme one". So if we call the statistic over the data T(data), then the classical tests provide p(t>T(data)|H_0), and not p(data|H_0). Anyway, even p(t>T(data)|H_0) is clearly different from the initial question, i.e. p(H_0|data).
On 12.10.2012 10:36, Emanuele Olivetti wrote:
> In other words the initial question is about quantifying how likely is the
> hypothesis "the instructions do not affect the level of recall"
> (let's call it H_0) given the collected dataset, with respect to how likely is the
> hypothesis "the instructions affect the level of recall" (let's call it H_1)
> given the data. In a bit more formal notation the initial question is about
> estimating p(H_0|data) and p(H_1|data), while the permutation test provides
> a different quantity, which is related (see [0]) to p(data|H_0). Clearly
> p(data|H_0) is different from p(H_0|data).

Here you must use Bayes' formula :)

p(H_0|data) is proportional to p(data|H_0) * p(H_0 a priori)

The scale factor is just a constant, so you can generate samples from p(H_0|data) simply by using a Markov chain (e.g. a Gibbs sampler) to sample from p(data|H_0) * p(H_0 a priori). And that is what we call "Bayesian statistics" :-)

The "classical statistics" (sometimes called "frequentist") is very different and deals with the long-run error rates you would get if the experiment and data collection were repeated. In this framework it is meaningless to speak about p(H_0|data) or p(H_0 a priori), because H_0 is not considered a random variable. Probabilities can only be assigned to random variables. The main difference from the Bayesian approach is thus that a Bayesian considers the collected data fixed and H_0 random, whereas a frequentist considers the data random and H_0 fixed.

To a Bayesian, the data are what you got and "the universal truth about H0" is unknown. Randomness is the uncertainty about this truth. Probability is a measurement of the precision or knowledge about H0. Applying the transform -log2(p) yields the Shannon information in bits.

To a frequentist, the data are random (i.e. collecting a new set will yield a different sample) and "the universal truth about H0" is fixed but unknown. Randomness is the process that gives you a different data set each time you draw a sample. It is not the uncertainty about H0.

Choosing sides is more a matter of religion than science. Both approaches have major flaws:

* The Bayesian approach is not scale invariant. A monotonic transform like y = f(x) can yield a different conclusion if we analyze y instead of x. For example, your null hypothesis can be true if you used a linear scale and false if you used a log-scale. Also, the conclusion depends on your prior opinion, which can be subjective.

* The frequentist approach makes it possible to collect too much data. If you just collect enough data, any correlation or two-sided test will be significant. Obviously collecting more data should always give you better information, not invariably lead to a fixed conclusion. Why do statistics if you know the conclusion in advance?

Sturla
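The proportionality p(H_0|data) ∝ p(data|H_0) * p(H_0 a priori) can be illustrated with a toy discrete example (not from the thread; the likelihood numbers are made up): with only two hypotheses the normalizing constant is a simple sum, so no Markov chain is needed.

```python
# Hypothetical likelihoods p(data|H) for two competing hypotheses,
# combined with a uniform prior via Bayes' rule.
likelihood = {"H0": 0.02, "H1": 0.20}   # made-up values for illustration
prior = {"H0": 0.5, "H1": 0.5}

unnormalized = {h: likelihood[h] * prior[h] for h in prior}
z = sum(unnormalized.values())           # the "scale factor" (a constant)
posterior = {h: v / z for h, v in unnormalized.items()}
# posterior["H0"] = 0.01 / 0.11 = 1/11, i.e. about 0.09
```

Note that the posterior depends on the prior, which is exactly the subjectivity objection raised below.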
In reply to this post by Emanuele Olivetti-3
On 12 Oct 2012 09:37, "Emanuele Olivetti" <[hidden email]> wrote:

> IMHO the question in the example at that URL, i.e. "Did the instructions
> given to the participants significantly affect their level of recall?" is
> not directly addressed by the permutation test.

In this sentence, the word "significantly" is a term of art used to refer exactly to the quantity p(t>T(data)|H_0). So, yes, the permutation test addresses the original question; you just have to be familiar with the field's particular jargon to understand what they're saying. :-)

> as far as I know, the permutation test - and all resampling approaches
> in general - requires that the sample is "representative" of the
> underlying distribution of the problem.

All tests require some kind of representativeness, and this isn't really a problem. The data are by definition representative (in the technical sense) of the distribution they were drawn from. (The trouble comes when you want to decide whether that distribution matches anything you care about, but looking at the data won't tell you that.) A well-designed test is one that is correct on average across samples.

The alternative to a permutation test here is to make very strong assumptions about the underlying distributions (e.g. with a t test), and these assumptions are often justified only for large samples. And resampling tests are computationally expensive, but this is no problem for small samples. So that's why non-parametrics are often better in this setting.

-n
In reply to this post by Sturla Molden-2
On 12.10.2012 13:12, Sturla Molden wrote:
> * The Bayesian approach is not scale invariable. A monotonic transform
> like y = f(x) can yield a different conclusion if we analyze y instead
> of x.

And this, by the way, is what really pissed off Ronald A. Fisher, the father of the "p-value". He constructed the p-value as a heuristic for assessing H0 specifically to avoid this issue. Ronald A. Fisher never accepted the significance testing (type-1 and type-2 error rates) of Neyman and Pearson, as experiments are seldom repeated. In fact the p-value has nothing to do with significance testing.

To correct the other issues of the p-value, Fisher later constructed a different kind of analysis he called "fiducial inference". It is not commonly used today. It depends on looking at hypothesis testing as signal processing:

measurement = signal + noise

The noise is considered random and the signal is the truth about H0. Fisher argued we can infer the truth about H0 by subtracting the random noise from the collected data. The method has none of the absurdities of Bayesian and classical statistics, but for some reason it never got popular among practitioners.

Sturla
In reply to this post by Sturla Molden-2
Hi Sturla,
Thanks for the brief review of the frequentist and Bayesian differences (I'll try to send a few comments in a future post).

The aim of my previous message was definitely more pragmatic, and it boiled down to two questions that stick with Josef's call:

1) In this thread people expressed interest in making hypothesis testing from small samples, so is the permutation test addressing the question of the accompanying motivating example? In my opinion it is not, and I hope I provided brief but compelling motivation to support this point of view.

2) What are the assumptions under which the permutation test is valid/acceptable (independently from the accompanying motivating example)? I have looked around on this topic but have found only generic desiderata for all resampling approaches, i.e. that the sample should be "representative" of the underlying distribution - whatever this means in practical terms.

What's your take on these two questions? I guess it would be nice to clarify/discuss the motivating questions and the assumptions in this thread before planning any coding.

Best,

Emanuele

On 10/12/2012 01:12 PM, Sturla Molden wrote:
> [...]
>
> The "classical statistics" (sometimes called "frequentist") is very
> different and deals with long-run error rates you would get if the
> experiment and data collection are repeated. In this framework it is
> meaningless to speak about p(H_0|data) or p(H_0 a priori), because H_0
> is not considered a random variable. Probabilities can only be assigned
> to random variables.
>
> [...]
>
> To a Bayesian the data are what you got and "the universal truth about
> H0" is unknown. Randomness is the uncertainty about this truth.
> Probability is a measurement of the precision or knowledge about H0.
>
> [...]
>
> Choosing sides is more a matter of religion than science.
On 12.10.2012 16:21, Emanuele Olivetti wrote:
> 1) In this thread people expressed interest in making hypothesis testing
> from small samples, so is permutation test addressing the question of
> the accompanying motivating example? In my opinion it is not and I hope I
> provided brief but compelling motivation to support this point of view.

For the problem Josef described, I'd analyze that as a two-sample goodness-of-fit test against a common bin(20,p) distribution.

> 2) What are the assumptions under which the permutation test is
> valid/acceptable (independently from the accompanying motivating example)?
> I have looked around on this topic but I had just found generic desiderata for
> all resampling approaches, i.e. that the sample should be "representative"
> of the underlying distribution - whatever this means in practical terms.

Ronald A. Fisher considered the permutation test to be the "exact procedure" the t-test should approximate. It has, in fact, all the assumptions of the t-test.

Surprisingly many think the t-test assumes normally distributed data. It does not. If you have this idea too, forget it please. The t-test only assumes that the large-sample "sampling distribution of the mean" (i.e. the mean you calculate, not the data points themselves) is a normal distribution. This is due to the central limit theorem. If you collect enough data, the distribution of the sample mean will converge towards a normal distribution. That is a mathematical necessity, and it can be proven to always be the case. But with small data samples, the sampling distribution of the mean can deviate from a normal distribution. That is when we need to use the permutation test instead. I.e.: the t-test is an approximation to the permutation test for "large enough" data samples.

What we mean by "large enough" is another story. We can e.g. estimate the sampling distribution of the mean using Efron's bootstrap, and run a goodness-of-fit test. What most practitioners do, though, is to check whether their data are approximately normally distributed. That usually signifies a lack of understanding of the t-test. They think the data must be normal. The data need not be. But if the data are normally distributed, we can be sure the sample mean is normal as well.

So under what circumstances are the assumptions for the permutation test not satisfied? One notable example is the Behrens-Fisher problem! That is, you want to compare the expectancy values of two distributions with different variances. The permutation test does not help to solve this problem any more than the t-test does. This is clearly a situation where distributions matter, showing that the permutation test is not a "distribution free" test.

Sturla
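The suggestion of estimating the sampling distribution of the mean with Efron's bootstrap can be sketched roughly like this (a made-up skewed sample; the sample size and resample count are arbitrary choices, not from the thread):

```python
import numpy as np

rng = np.random.RandomState(42)
sample = rng.exponential(scale=1.0, size=12)  # a small, skewed sample

# Bootstrap: resample with replacement and recompute the mean each time.
boot_means = np.array([
    rng.choice(sample, size=sample.size, replace=True).mean()
    for _ in range(10000)
])

# boot_means now approximates the sampling distribution of the mean;
# for a skewed parent and n = 12 it is typically still visibly non-normal,
# which is the point about small samples above.
```

A histogram of boot_means (or a goodness-of-fit test against the normal, as suggested above) then shows how far the sampling distribution still is from normality at this n.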
In reply to this post by Nathaniel Smith
On 10/12/2012 01:22 PM, Nathaniel Smith wrote:
Thanks Nathaniel for pointing that out. I guess I'm hardly familiar with such jargon ;-). Nevertheless, while reading the example I believed that the aim of the thought experiment was to decide among two competing theories/hypotheses, given the results of the experiment. But I share your point that the term "significant" turns it into a different question.

Indeed my wording was imprecise, so thanks once more for correcting it. Moreover you put it really well: "The trouble comes when you want to decide whether that distribution matches anything you care about, but looking at the data won't tell you that". Could you tell more about evaluating the correctness of a test across different samples? It sounds interesting.

I agree with you that strong assumptions about the underlying distributions, e.g. parametric modeling, may raise big practical concerns. The only pro is that at least you know the assumptions explicitly.

Best,

Emanuele
In reply to this post by Emanuele Olivetti-3
On Fri, Oct 12, 2012 at 10:21 AM, Emanuele Olivetti <[hidden email]> wrote:
> Hi Sturla,
>
> Thanks for the brief review of the frequentist and Bayesian differences
> (I'll try to send a few comments in a future post).
>
> The aim of my previous message was definitely more pragmatic
> and it boiled down to two questions that stick with Josef's call:

My aim is even more practical: if everyone else has it, and it's useful, then let's do it in Python.

As for mannwhitneyu, this would mean tables for very small samples, exact permutation for the next higher, and random permutation for medium sample sizes. (And advertise empirical likelihood in statsmodels.) And for other cases (somewhere in the future): bias correction and higher order expansions of the distribution of the test statistics or estimates.

http://www.alglib.net/hypothesistesting/mannwhitneyu.php

(Limitation: there are too many things for "let's make it available in python".)

> 1) In this thread people expressed interest in making hypothesis testing
> from small samples, so is permutation test addressing the question of
> the accompanying motivating example? In my opinion it is not and I hope I
> provided brief but compelling motivation to support this point of view.

I got two questions "wrong" in the survey, and had to struggle with several of these: http://en.wikipedia.org/wiki/P-value#Misunderstandings (especially because I was implicitly adding "if the Null is true" to some of the statements). I find the "at least one wrong answer" graph misleading compared to the breakdown by question.

Under the assumptions of the tests and the permutation distribution, I think the permutation tests answer the question whether there are statistically significant differences (in means, medians, distributions) across samples. But it's in the classical statistical test tradition. http://en.wikipedia.org/wiki/Uniformly_most_powerful_test consistency of test, ...

> 2) What are the assumptions under which the permutation test is
> valid/acceptable (independently from the accompanying motivating example)?
> I have looked around on this topic but I had just found generic desiderata for
> all resampling approaches, i.e. that the sample should be "representative"
> of the underlying distribution - whatever this means in practical terms.

I collected a few papers, but haven't read them yet or only partially: https://github.com/statsmodels/statsmodels/wiki/Permutation-Tests

One problem is that all tests rely on assumptions, and with small samples there is not enough information to test the underlying assumptions or to switch to something that requires even weaker assumptions and still has power.

For example, my small Monte Carlo with mannwhitneyu: the difference between permutation p-values and large sample normal distribution p-values is not large. I saw one recommendation that 7 observations for each sample is enough. One reference says the extreme tail probabilities are inaccurate. With only a few observations, the power of the test is very low and it only detects large differences.

If the distributions of the observations are symmetric and the sample sizes are the same, then both permutation and normal p-values are correctly sized (close to 0.05 under the null) even if the underlying distributions are different (t(2) versus normal). If the sample sizes are unequal, then differences in the distributions cause a bias in the test, under- or over-rejecting. From the references it sounds like, if the distributions are skewed, then the tests are also incorrectly sized.

The main problem I have in terms of interpretation is that we are in many cases not really estimating a mean or median shift, but more likely stochastic dominance. Under one condition the distribution has "higher" values than under the other condition, where "higher" could mean mean-shift or just some higher quantiles (more weight on larger values).

Thanks for the comments.

Josef
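The kind of size check Josef describes can be sketched along these lines (a hypothetical setup, not Josef's actual script): draw both samples from the same distribution, so the null is true, and see how often mannwhitneyu rejects at the 5% level.

```python
import numpy as np
from scipy import stats

def rejection_rate(n1=10, n2=10, alpha=0.05, n_sim=2000, seed=0):
    """Fraction of null simulations in which the test rejects at level alpha."""
    rng = np.random.RandomState(seed)
    rejections = 0
    for _ in range(n_sim):
        x = rng.normal(size=n1)   # both samples from the same
        y = rng.normal(size=n2)   # distribution: the null is true
        res = stats.mannwhitneyu(x, y, alternative="two-sided")
        if res.pvalue < alpha:
            rejections += 1
    return rejections / n_sim

rate = rejection_rate()
# A correctly sized test rejects close to alpha = 5% of the time;
# swapping in skewed distributions or unequal n1/n2 probes the
# under-/over-rejection Josef mentions.
```

The same harness with unequal sample sizes or different parent distributions (e.g. t(2) versus normal) reproduces the comparisons described above.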
In reply to this post by Emanuele Olivetti-3
On Fri, Oct 12, 2012 at 4:27 PM, Emanuele Olivetti <[hidden email]> wrote:
> On 10/12/2012 01:22 PM, Nathaniel Smith wrote:
>> In this sentence, the word "significantly" is a term of art used to refer
>> exactly to the quantity p(t>T(data)|H_0). So, yes, the permutation test
>> addresses the original question; you just have to be familiar with the
>> field's particular jargon to understand what they're saying. :-)
>
> Thanks Nathaniel for pointing that out. I guess I'll hardly be much familiar
> with such a jargon ;-). Nevertheless while reading the example I believed
> that the aim of the thought experiment was to decide among two competing
> theories/hypothesis, given the results of the experiment.

Well, it is, at some level. But in practice psychologists are not simple Bayesian updaters, and in the context of their field's practices, the way you make these decisions involves Neyman-Pearson significance tests as one component. Of course one can debate whether that is a good thing or not (I actually tend to fall on the side that says it *is* a good thing), but that's getting pretty far afield of Josef's question :-).

> Could you tell more about evaluating the correctness of a test across
> different samples? It sounds interesting.

Well, it's a relatively simple point, actually. The definition of a good frequentist significance test is a function f(data) which returns a p-value, and this p-value satisfies two rules:

1) When 'data' is sampled from the null hypothesis distribution, then f(data) is uniformly distributed between 0 and 1.

2) When 'data' is sampled from an alternative distribution of interest, then f(data) will have a distribution that is peaked near 0.

So the point is just that you can't tell whether a given function f(data) is well-behaved or not by looking at a single value for 'data', since the requirements for being well-behaved talk only about the distribution of f(data) given a distribution for 'data'.

-n
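Property (1) can be checked by simulation; the following sketch (not from the thread) draws many null datasets, reduces each to a t-test p-value, and looks at the resulting distribution:

```python
import numpy as np
from scipy import stats

rng = np.random.RandomState(0)

# 2000 datasets sampled from the null (two groups, same distribution),
# each reduced to its t-test p-value f(data).
pvals = np.array([
    stats.ttest_ind(rng.normal(size=15), rng.normal(size=15)).pvalue
    for _ in range(2000)
])

# Under the null, a well-behaved test yields p-values uniform on [0, 1]:
# the histogram should be flat and the mean close to 0.5.
```

Replacing the null draws with samples from an alternative (e.g. shifting one group's mean) shows property (2): the p-value distribution piles up near 0.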
In reply to this post by josef.pktd
On Thu, Oct 11, 2012 at 10:57 AM, <[hidden email]> wrote:
> Most statistical tests and statistical inference in scipy.stats and
> statsmodels relies on large number assumptions.
>
> Everyone is talking about "Big data", but is anyone still interested
> in doing small sample statistics in python.
>
> I'd like to know whether it's worth spending any time on general
> purpose small sample statistics.
>
> for example:
>
> http://facultyweb.berry.edu/vbissonnette/statshw/doc/perm_2bs.html
>
> ```
> Example homework problem:
> [...]
> Shallow Processing: 13 12 11 9 11 13 14 14 14 15
> Deep Processing: 12 15 14 14 13 12 15 14 16 17
> ```

example: R package coin
http://cran.r-project.org/web/packages/coin/vignettes/coin.pdf

found again while digging for an error in p-values in stats.wilcoxon in the presence of ties https://github.com/scipy/scipy/pull/338 and enhancements for it.

Josef
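For samples this small, the "exact permutation" option Josef mentions earlier in the thread is feasible by brute force: with 10+10 observations there are C(20,10) = 184756 relabelings, so the full permutation distribution can be enumerated instead of sampled. A sketch on the quoted data (the enumeration strategy is as described in the thread; the implementation details are assumptions):

```python
from itertools import combinations

shallow = [13, 12, 11, 9, 11, 13, 14, 14, 14, 15]
deep = [12, 15, 14, 14, 13, 12, 15, 14, 16, 17]
pooled = shallow + deep
n = len(shallow)
total = sum(pooled)
observed = abs(sum(shallow) / n - sum(deep) / n)

# Enumerate every way of assigning 10 of the 20 pooled values to group 1
# and count how often |difference in means| is at least the observed 1.6.
count = 0
n_splits = 0
for idx in combinations(range(len(pooled)), n):
    s1 = sum(pooled[i] for i in idx)
    stat = abs(s1 / n - (total - s1) / n)
    if stat >= observed - 1e-12:  # tolerance for float comparison of ties
        count += 1
    n_splits += 1

p_exact = count / n_splits  # exact two-sided permutation p-value
```

Unlike the randomized version, this p-value has no Monte Carlo error, which is exactly the regime (tables / exact / random permutation by sample size) Josef sketches above.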