# "small data" statistics Classic List Threaded 14 messages Open this post in threaded view
|

## "small data" statistics

Most statistical tests and statistical inference in scipy.stats and statsmodels rely on large-sample assumptions. Everyone is talking about "Big data", but is anyone still interested in doing small sample statistics in Python?

I'd like to know whether it's worth spending any time on general purpose small sample statistics. For example: http://facultyweb.berry.edu/vbissonnette/statshw/doc/perm_2bs.html

```
Example homework problem: Twenty participants were given a list of 20 words to
process. The 20 participants were randomly assigned to one of two treatment
conditions. Half were instructed to count the number of vowels in each word
(shallow processing). Half were instructed to judge whether the object described
by each word would be useful if one were stranded on a desert island (deep
processing). After a brief distractor task, all subjects were given a surprise
free recall task. The number of words correctly recalled was recorded for each
subject. Here are the data:

Shallow Processing: 13 12 11  9 11 13 14 14 14 15
Deep Processing:    12 15 14 14 13 12 15 14 16 17
```

Josef

_______________________________________________
SciPy-User mailing list
[hidden email]
http://mail.scipy.org/mailman/listinfo/scipy-user
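For samples this small, the permutation test from the linked homework page can even be run exactly, since all C(20, 10) = 184756 relabelings of the pooled scores are enumerable. Here is a minimal sketch in plain Python (the two-sided convention and the floating-point tie tolerance are choices made here, not taken from the linked page):

```python
from itertools import combinations

shallow = [13, 12, 11, 9, 11, 13, 14, 14, 14, 15]
deep = [12, 15, 14, 14, 13, 12, 15, 14, 16, 17]

pooled = shallow + deep
n = len(shallow)
total = sum(pooled)
observed = sum(deep) / n - sum(shallow) / n  # observed mean difference

# Under H_0 the group labels are exchangeable: enumerate every way to
# split the 20 pooled scores into two groups of 10 and count how often
# the mean difference is at least as extreme as the observed one.
extreme = 0
count = 0
for group in combinations(range(len(pooled)), n):
    s = sum(pooled[i] for i in group)
    diff = (total - s) / n - s / n
    if abs(diff) >= abs(observed) - 1e-12:
        extreme += 1
    count += 1

p_value = extreme / count
print(p_value)
```

The exact enumeration is only feasible because the sample is small; for larger samples one would fall back to Monte Carlo sampling of permutations.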
## Re: "small data" statistics

On Thu, Oct 11, 2012 at 10:57:23AM -0400, [hidden email] wrote:
> Everyone is talking about "Big data", but is anyone still interested
> in doing small sample statistics in python.

I am!

> I'd like to know whether it's worth spending any time on general
> purpose small sample statistics.

It is. Big data is a buzz, but few people have big data. In addition, what they don't realize is that it is often a small sample problem in terms of statistics, as the number of samples is often not much bigger than the number of features.

Thanks for all your work on scipy.stats!

Gael
## Re: "small data" statistics

In reply to this post by josef.pktd

On 11 October 2012 15:57, <[hidden email]> wrote:
> Everyone is talking about "Big data", but is anyone still interested
> in doing small sample statistics in python.
>
> I'd like to know whether it's worth spending any time on general
> purpose small sample statistics.

I'm certainly interested in that sort of thing - a lot of biology still revolves around simple, 'small data' stats.

Thanks,
Thomas
## Re: "small data" statistics

In reply to this post by josef.pktd

On Thu, Oct 11, 2012 at 7:57 AM, <[hidden email]> wrote:
> Most statistical tests and statistical inference in scipy.stats and
> statsmodels relies on large number assumptions.
>
> Everyone is talking about "Big data", but is anyone still interested
> in doing small sample statistics in python.

+1

--
Sergio (Serge) Rey
Professor, School of Geographical Sciences and Urban Planning
GeoDa Center for Geospatial Analysis and Computation
Arizona State University
http://geoplan.asu.edu/rey

Editor, International Regional Science Review
http://irx.sagepub.com
## Re: "small data" statistics

In reply to this post by josef.pktd

On 10/11/2012 04:57 PM, [hidden email] wrote:
> Most statistical tests and statistical inference in scipy.stats and
> statsmodels relies on large number assumptions.
>
> Everyone is talking about "Big data", but is anyone still interested
> in doing small sample statistics in python.
>
> I'd like to know whether it's worth spending any time on general
> purpose small sample statistics.
>
> for example:
>
> http://facultyweb.berry.edu/vbissonnette/statshw/doc/perm_2bs.html
>
> ```
> Example homework problem:
> [...]
> Shallow Processing: 13 12 11 9 11 13 14 14 14 15
> Deep Processing: 12 15 14 14 13 12 15 14 16 17
> ```

I am very interested in inference from small samples, but I have some concerns about both the example and the proposed approach based on the permutation test.

IMHO the question in the example at that URL, i.e. "Did the instructions given to the participants significantly affect their level of recall?", is not directly addressed by the permutation test. The permutation test is related to the question "how (un)likely is the collected dataset under the assumption that the instructions did not affect the level of recall?".

In other words, the initial question is about quantifying how likely the hypothesis "the instructions do not affect the level of recall" (let's call it H_0) is given the collected dataset, with respect to how likely the hypothesis "the instructions affect the level of recall" (let's call it H_1) is given the data. In slightly more formal notation, the initial question is about estimating p(H_0|data) and p(H_1|data), while the permutation test provides a different quantity, which is related (see the note below) to p(data|H_0). Clearly p(data|H_0) is different from p(H_0|data). Literature on this point is for example http://dx.doi.org/10.1016/j.socec.2004.09.033

On a different note, I am also interested in understanding the assumptions under which the permutation test is expected to work. I am not an expert in that field but, as far as I know, the permutation test - and all resampling approaches in general - requires that the sample is "representative" of the underlying distribution of the problem. In my opinion this requirement is difficult to assess in practice, and it is even more troubling for the specific case of "small data" - of interest for this thread.

Any comment on these points is warmly welcome.

Best,

Emanuele

A minor detail: I said "related" because the outcome of the permutation test, and of classical tests for hypothesis testing in general, is not precisely p(data|H_0). First of all, those tests rely on a statistic of the dataset and not on the dataset itself. In the example at the URL, the statistic (called "criterion" there) is the difference between the means of the two groups. Second, and more important, the test provides an estimate of the probability of observing such a value for the statistic... "or a more extreme one". So if we call the statistic over the data T(data), then the classical tests provide p(t>T(data)|H_0), and not p(data|H_0). Anyway, even p(t>T(data)|H_0) is clearly different from the initial question, i.e. p(H_0|data).
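The distinction between p(data|H_0) and p(H_0|data) drawn here can be made concrete with a toy two-hypothesis example. The data, hypotheses, and equal priors below are invented purely for illustration and are not part of the thread:

```python
from scipy.stats import binom

# Toy data: 8 successes out of 10 trials.
k, n = 8, 10

# Two competing point hypotheses about the success probability.
lik_h0 = binom.pmf(k, n, 0.5)  # p(data|H_0): success probability 0.5
lik_h1 = binom.pmf(k, n, 0.8)  # p(data|H_1): success probability 0.8

# p(H_0|data) additionally requires priors over the hypotheses
# (taken equal here, by assumption), via Bayes' rule.
post_h0 = lik_h0 * 0.5 / (lik_h0 * 0.5 + lik_h1 * 0.5)

print(lik_h0, post_h0)  # the likelihood and the posterior differ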
## Re: "small data" statistics

In reply to this post by Emanuele Olivetti-3

On 12 Oct 2012 09:37, "Emanuele Olivetti" <[hidden email]> wrote:
> IMHO the question in the example at that URL, i.e. "Did the instructions
> given to the participants significantly affect their level of recall?" is
> not directly addressed by the permutation test.

In this sentence, the word "significantly" is a term of art used to refer exactly to the quantity p(t>T(data)|H_0). So, yes, the permutation test addresses the original question; you just have to be familiar with the field's particular jargon to understand what they're saying. :-)

> On a different side, I am also interested in understanding which are the
> assumptions under which the permutation test is expected to work. I am not
> an expert in that field but, as far as I know, the permutation test - and
> all resampling approaches in general - requires that the sample is
> "representative" of the underlying distribution of the problem. In my
> opinion this requirement is difficult to assess in practice and it is even
> more troubling for the specific case of "small data" - of interest for
> this thread.

All tests require some kind of representativeness, and this isn't really a problem. The data are by definition representative (in the technical sense) of the distribution they were drawn from. (The trouble comes when you want to decide whether that distribution matches anything you care about, but looking at the data won't tell you that.) A well designed test is one that is correct on average across samples.

The alternative to a permutation test here is to make very strong assumptions about the underlying distributions (e.g. with a t test), and these assumptions are often justified only for large samples. Also, resampling tests are computationally expensive, but this is no problem for small samples. So that's why non-parametrics are often better in this setting.

-n

> Any comment on these points is warmly welcome.
>
> Best,
>
> Emanuele
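The two routes contrasted here - strong parametric assumptions versus resampling - can be compared directly on the thread's own data. A sketch (the choice of a Monte Carlo permutation count and the random seed are arbitrary, not from the thread):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
shallow = np.array([13, 12, 11, 9, 11, 13, 14, 14, 14, 15])
deep = np.array([12, 15, 14, 14, 13, 12, 15, 14, 16, 17])

# Parametric route: two-sample t test (strong distributional assumptions).
t_stat, t_pvalue = stats.ttest_ind(shallow, deep)

# Non-parametric route: Monte Carlo permutation test on the same statistic.
observed = deep.mean() - shallow.mean()
pooled = np.concatenate([shallow, deep])
n_perm = 10_000
hits = 0
for _ in range(n_perm):
    perm = rng.permutation(pooled)
    diff = perm[10:].mean() - perm[:10].mean()
    if abs(diff) >= abs(observed):
        hits += 1
perm_pvalue = hits / n_perm

print(t_pvalue, perm_pvalue)
```

With only ten observations per group, the 10,000 permutations run in well under a second, illustrating the point that the computational cost of resampling is no obstacle for small samples.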
## Re: "small data" statistics

In reply to this post by Sturla Molden-2

On 12.10.2012 13:12, Sturla Molden wrote:
> * The Bayesian approach is not scale invariant. A monotonic transform
> like y = f(x) can yield a different conclusion if we analyze y instead
> of x.

And this, by the way, is what really pissed off Ronald A. Fisher, the father of the "p-value". He constructed the p-value as a heuristic for assessing H0 specifically to avoid this issue. Ronald A. Fisher never accepted the significance testing (type-1 and type-2 error rates) of Neyman and Pearson, as experiments are seldom repeated. In fact the p-value has nothing to do with significance testing.

To correct the other issues of the p-value, Fisher later constructed a different kind of analysis he called "fiducial inference". It is not commonly used today. It depends on looking at hypothesis testing as signal processing:

    measurement = signal + noise

The noise is considered random and the signal is the truth about H0. Fisher argued we can infer the truth about H0 by subtracting the random noise from the collected data. The method has none of the absurdities of Bayesian and classical statistics, but for some reason it never became popular among practitioners.

Sturla
## Re: "small data" statistics

On 12.10.2012 16:21, Emanuele Olivetti wrote:
> 1) In this thread people expressed interest in making hypothesis testing
> from small samples, so is permutation test addressing the question of
> the accompanying motivating example? In my opinion it is not and I hope I
> provided brief but compelling motivation to support this point of view.

For the problem Josef described, I'd analyze that as a two-sample goodness-of-fit test against a common bin(20,p) distribution.

> 2) What are the assumptions under which the permutation test is
> valid/acceptable (independently from the accompanying motivating example)?
> I have looked around on this topic but I have just found generic desiderata
> for all resampling approaches, i.e. that the sample should be
> "representative" of the underlying distribution - whatever this means in
> practical terms.

Ronald A. Fisher considered the permutation test to be the "exact procedure" the t-test should approximate. It has, in fact, all the assumptions of the t-test.

Surprisingly, many think the t-test assumes normally distributed data. It does not. If you have this idea too, please forget it. The t-test only asserts that the large-sample "sampling distribution of the mean" (i.e. the mean you calculate, not the data points themselves) is a normal distribution. This is due to the central limit theorem: if you collect enough data, the distribution of the sample mean will converge towards a normal distribution. That is a mathematical necessity, and it can be proven to always be the case. But with small data samples, the sampling distribution of the mean can deviate from a normal distribution. That is when we need to use the permutation test instead. I.e.: the t-test is an approximation to the permutation test for "large enough" data samples.

What we mean by "large enough" is another story. We can e.g. estimate the sampling distribution of the mean using Efron's bootstrap, and run a goodness-of-fit test.

What most practitioners do, though, is to check if their data are approximately normally distributed. That usually signifies a lack of understanding of the t-test. They think the data must be normal. They need not be. But if the data are normally distributed, we can be sure the sample mean is normal as well.

So under what circumstances are the assumptions for the permutation test not satisfied? One notable example is the Behrens-Fisher problem! That is, you want to compare the expectation values of two distributions with different variance. The permutation test does not help to solve this problem any more than the t-test does. This is clearly a situation where distributions matter, showing that the permutation test is not a "distribution free" test.

Sturla
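The bootstrap-plus-goodness-of-fit idea mentioned above can be sketched as follows. This is a rough illustration, not code from the thread; the number of resamples and the use of a Shapiro-Wilk test as the normality check are choices made here:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
sample = np.array([13, 12, 11, 9, 11, 13, 14, 14, 14, 15])  # a small sample

# Efron's bootstrap: resample with replacement to estimate the
# sampling distribution of the mean.
boot_means = np.array([
    rng.choice(sample, size=sample.size, replace=True).mean()
    for _ in range(2000)
])

# Goodness-of-fit check: how close is the bootstrapped sampling
# distribution of the mean to a normal distribution?
stat, pvalue = stats.shapiro(boot_means)
print(stat, pvalue)
```

A small p-value here would suggest the normal approximation (and hence the t-test) is questionable for this sample size, while the permutation test remains applicable.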
## Re: "small data" statistics

In reply to this post by Nathaniel Smith

On 10/12/2012 01:22 PM, Nathaniel Smith wrote:
> In this sentence, the word "significantly" is a term of art used to refer
> exactly to the quantity p(t>T(data)|H_0). So, yes, the permutation test
> addresses the original question; you just have to be familiar with the
> field's particular jargon to understand what they're saying. :-)

Thanks Nathaniel for pointing that out. I guess I'll hardly be much familiar with such jargon ;-). Nevertheless, while reading the example I believed that the aim of the thought experiment was to decide among two competing theories/hypotheses, given the results of the experiment. But I share your point that the term "significant" turns it into a different question.

> All tests require some kind of representativeness, and this isn't really a
> problem. The data are by definition representative (in the technical sense)
> of the distribution they were drawn from. (The trouble comes when you want
> to decide whether that distribution matches anything you care about, but
> looking at the data won't tell you that.) A well designed test is one that
> is correct on average across samples.

Indeed my wording was imprecise, so thanks once more for correcting it. Moreover, you put it really well: "The trouble comes when you want to decide whether that distribution matches anything you care about, but looking at the data won't tell you that". Could you tell more about evaluating the correctness of a test across different samples? It sounds interesting.

> The alternative to a permutation test here is to make very strong
> assumptions about the underlying distributions (e.g. with a t test), and
> these assumptions are often justified only for large samples. And,
> resampling tests are computationally expensive, but this is no problem for
> small samples. So that's why non parametrics are often better in this
> setting.

I agree with you that strong assumptions about the underlying distributions, e.g. parametric modeling, may raise big practical concerns. The only pro is that at least you know the assumptions explicitly.

Best,

Emanuele