Dear all,

Hope you are doing very well. I am trying to apply a statistical normalization that requires the values to be normally distributed.
I have prepared a short notebook with all the details. It would be great if someone could help me out. Thanks.

Best regards,
Bakary N’tji DIALLO
PhD Student (Bioinformatics), Research Unit in Bioinformatics (RUBi)
Mail: [hidden email] | Skype: diallobakary4
Tel: +27798233845 | +223 74 56 57 22 | +223 97 39 77 14

_______________________________________________
SciPy-User mailing list
[hidden email]
https://mail.python.org/mailman/listinfo/scipy-user
I think you misunderstand the null hypothesis. The null hypothesis for this test is that the data are not normally distributed. Since the p-value in your example is 0.0003 (i.e., less than 0.001), you can reject the null hypothesis, suggesting that your data are normally distributed.

-Paul

On Fri, Dec 7, 2018 at 8:54 AM Bakary N'tji Diallo <[hidden email]> wrote:
On Fri, Dec 7, 2018 at 12:08 PM Paul Hobson <[hidden email]> wrote:
That's not correct. The null hypothesis is that the data come from a normal distribution. My guess is that because of the relatively large sample size, the power is quite large and the test detects relatively small deviations from normality.

len(x)
Out[8]: 1444

stats.skewtest(x)
Out[9]: SkewtestResult(statistic=1.79241121722139, pvalue=0.073067119279312559)

stats.kurtosistest(x)
Out[10]: KurtosistestResult(statistic=3.5348152259352097, pvalue=0.00040806039300234271)

According to the two separate tests that are combined in the normal test, the data have heavier tails (larger kurtosis) than the normal distribution. (Using kstest as a distance measure, however, shows that the normal distribution matches the data better than a t distribution with smaller df. Note: p-values for kstest don't apply because loc and scale are estimated.)

Josef
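[Editor's note: a minimal sketch, not part of the original thread, of the relationship Josef describes. `scipy.stats.normaltest` is the D'Agostino-Pearson test, whose statistic combines the `skewtest` and `kurtosistest` statistics. The heavy-tailed t-distributed sample below is illustrative, not the poster's data.]

```python
import numpy as np
from scipy import stats

# Illustrative heavy-tailed sample (t distribution, df=5), not the poster's data
rng = np.random.default_rng(0)
x = rng.standard_t(df=5, size=1444)

s, p_skew = stats.skewtest(x)        # tests skewness against a normal's
k, p_kurt = stats.kurtosistest(x)    # tests kurtosis against a normal's
stat, p_norm = stats.normaltest(x)   # combines both

# normaltest's statistic is s**2 + k**2, referred to a chi-squared
# distribution with 2 degrees of freedom
print(np.isclose(stat, s**2 + k**2))               # True
print(np.isclose(p_norm, stats.chi2.sf(stat, 2)))  # True
```

With a sample this large, even a modest deviation in either component can push the combined p-value below a conventional threshold, which is the power issue Josef raises.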
Thank you for your replies. About the large sample size, just for clarification: this is not a sample, these are all the scores. Should I do a random sampling?

Another approach I tried was to normalize the data as follows:

x = x - 2*x  # to transform scores into positive values so the log function can be applied
log_data = np.log(x)

The log_data was also found to be not normally distributed.

On Fri, Dec 7, 2018 at 10:05 PM, <[hidden email]> wrote:
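[Editor's note: an aside on the transform above, not part of the original thread. `x - 2*x` is simply `-x`, so `np.log(-x)` is only valid when every score is strictly negative. The sketch below, using hypothetical example values, compares that with a shift-based transform that works regardless of sign.]

```python
import numpy as np

x = np.array([-8.2, -7.5, -6.9, -9.1, -5.4])  # hypothetical docking-style scores

# x = x - 2*x is equivalent to negation; valid only if all scores are negative
neg_log = np.log(-x)

# Shift so the smallest value maps to exactly 1.0 before taking the log;
# this works whatever the sign of the original scores
shifted_log = np.log(x - x.min() + 1.0)
print(shifted_log.min())  # 0.0 (log of 1.0 at the minimum element)
```

Either way, a log transform changes the shape of the distribution but does not guarantee normality, so the result still has to be re-tested, as the poster found.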
I am reading this: "With large enough sample sizes (> 30 or 40), the violation of the normality assumption should not cause major problems (4); this implies that we can use parametric procedures even when the data are not normally distributed (8)." from this paper: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3693611/

Can I then use the normalization procedure given the large sample size? The normalization is simply calculating the z-score as in (here), using the mean and standard deviation.

On Sat, Dec 8, 2018 at 6:52 AM, Bakary N'tji Diallo <[hidden email]> wrote:
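[Editor's note: for reference, the z-score normalization mentioned here is available as `scipy.stats.zscore`, which is equivalent to subtracting the mean and dividing by the standard deviation. The sample below is synthetic, not the poster's data.]

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x = rng.normal(loc=-7.0, scale=1.5, size=1444)  # synthetic stand-in for the scores

z = stats.zscore(x)                  # default ddof=0, i.e. population std
z_manual = (x - x.mean()) / x.std()  # the same computation by hand

print(np.allclose(z, z_manual))  # True
# The result has mean ~0 and standard deviation ~1 by construction
print(np.isclose(z.mean(), 0.0), np.isclose(z.std(), 1.0))  # True True
```

Note that z-scoring is a linear transform: it rescales the data but does not change the shape of its distribution, so it cannot make non-normal data normal.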