[SciPy-User] Are the scores normally distributed?


[SciPy-User] Are the scores normally distributed?

Bakary N'tji Diallo
Dear all,
Hope you are doing very well.
 
I am trying to apply a statistical normalization which requires the values to be normally distributed.
I have prepared a short notebook with all details.

It would be great if someone could help me out.

Thanks
Best regards
--

Bakary N’tji DIALLO

PhD Student (Bioinformatics), Research Unit in Bioinformatics (RUBi)

Mail: [hidden email] |  Skype: diallobakary4

Tel: +27798233845 | +223 74 56 57 22 | +223 97 39 77 14


_______________________________________________
SciPy-User mailing list
[hidden email]
https://mail.python.org/mailman/listinfo/scipy-user

Re: Are the scores normally distributed?

Paul Hobson-2
I think you misunderstand the null hypothesis.

The null hypothesis for this test is that the data are not normally distributed.

Since the p-value in your example is 0.0003 (i.e., less than 0.001), you can reject the null hypothesis, suggesting that your data are normally distributed.
-Paul

On Fri, Dec 7, 2018 at 8:54 AM Bakary N'tji Diallo <[hidden email]> wrote:
> I am trying to apply a statistical normalization which requires the values to be normally distributed.
> I have prepared a short notebook with all details.

Re: Are the scores normally distributed?

josef.pktd


On Fri, Dec 7, 2018 at 12:08 PM Paul Hobson <[hidden email]> wrote:
> I think you misunderstand the null hypothesis.
>
> The null hypothesis for this test is that the data are not normally distributed.

That's not correct. The null hypothesis is that the data come from a normal distribution.

My guess is that, because of the relatively large sample size, the power is quite high and the test detects relatively small deviations from normality.

len(x)
Out[8]: 1444

stats.skewtest(x)
Out[9]: SkewtestResult(statistic=1.79241121722139, pvalue=0.073067119279312559)

stats.kurtosistest(x)
Out[10]: KurtosistestResult(statistic=3.5348152259352097, pvalue=0.00040806039300234271)

According to the two separate tests that are combined in the normal test, the data have heavier tails (larger kurtosis) than the normal distribution.
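As a minimal sketch of how the two tests combine, assuming the "normal test" here is `scipy.stats.normaltest`: its statistic is the sum of the squared skewness and kurtosis z-scores, compared against a chi-squared distribution with 2 degrees of freedom. The array `x` below is synthetic stand-in data (a hypothetical Student t sample of the same length), not the scores from the notebook; the last lines plug in the two statistics reported above, which gives a combined p-value near 0.0004.

```python
import numpy as np
from scipy import stats

# Synthetic stand-in for the scores (assumption: the real data are not available here).
rng = np.random.default_rng(1)
x = rng.standard_t(8, size=1444)

skew = stats.skewtest(x)
kurt = stats.kurtosistest(x)
omni = stats.normaltest(x)

# The omnibus statistic is the sum of the two squared z-scores,
# referred to a chi-squared distribution with 2 degrees of freedom.
combined = skew.statistic**2 + kurt.statistic**2
combined_p = stats.chi2.sf(combined, df=2)
print(omni.statistic, combined)
print(omni.pvalue, combined_p)

# Same combination applied to the statistics quoted in the thread.
thread_stat = 1.79241121722139**2 + 3.5348152259352097**2
thread_p = stats.chi2.sf(thread_stat, df=2)
print(thread_stat, thread_p)
```

Looking at the two component tests separately, as done above, shows which kind of deviation (asymmetry or tail weight) drives a rejection.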

(Using kstest as a distance measure, however, shows that the normal distribution matches the data better than a t distribution with smaller df.
Note that the p-values for kstest don't apply because loc and scale are estimated.)
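One way such a comparison could be set up is sketched below. It uses only the Kolmogorov-Smirnov statistic as a distance and ignores the p-value, for the reason given above. The data are synthetic, and the helper name `ks_distance_t` and the candidate df values are assumptions made for illustration.

```python
import numpy as np
from scipy import stats

# Synthetic stand-in for the scores (assumption: not the notebook data).
rng = np.random.default_rng(2)
x = rng.normal(loc=-7.0, scale=1.5, size=1444)

# Normal distribution with loc and scale estimated from the data.
loc, scale = stats.norm.fit(x)
d_norm = stats.kstest(x, "norm", args=(loc, scale)).statistic

def ks_distance_t(data, df):
    """KS distance to a Student t with fixed df and fitted loc/scale."""
    _, t_loc, t_scale = stats.t.fit(data, f0=df)  # f0 holds the shape (df) fixed
    return stats.kstest(data, "t", args=(df, t_loc, t_scale)).statistic

distances = {df: ks_distance_t(x, df) for df in (3, 5, 10)}
print("normal:", d_norm)
for df, d in distances.items():
    print(f"t(df={df}):", d)
```

A smaller distance means the fitted distribution tracks the empirical CDF more closely; because the parameters are estimated from the same data, the usual kstest p-values would be too optimistic.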

Josef


 


Re: Are the scores normally distributed?

Bakary N'tji Diallo
Thank you for your replies.
About the large sample size, just for clarification, this is not a sample, these are all the scores.
Should I do a random sampling?
Another approach I tried was to transform the data as follows:
x = x - 2*x            # same as x = -x: flips the sign so that the scores become positive
log_data = np.log(x)   # the log function requires positive values
The log-transformed data (log_data) were also found not to be normally distributed.
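A small sketch of that transformation on synthetic data, assuming the scores are all negative (which the sign flip implies). The array `scores` is a hypothetical stand-in, not the real data. It checks that `x - 2*x` is just `-x`, guards against non-positive values before taking the log, and reruns the normality test on the result.

```python
import numpy as np
from scipy import stats

# Hypothetical all-negative scores (assumption: stand-in for the real data).
rng = np.random.default_rng(3)
scores = -np.abs(rng.normal(7.0, 1.0, size=1444))

flipped = scores - 2 * scores          # algebraically identical to -scores
if np.any(flipped <= 0):
    raise ValueError("log is undefined for non-positive values")
log_data = np.log(flipped)

print("same as -scores:", np.allclose(flipped, -scores))
print("skewness before:", stats.skew(scores))
print("skewness after: ", stats.skew(log_data))
print("normaltest after log:", stats.normaltest(log_data))
```

A log transform mainly helps with right-skewed data; when the problem is tail weight rather than skewness, it may not bring the data closer to normal.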

On Fri, Dec 7, 2018 at 22:05, <[hidden email]> wrote:
> My guess is that because of the relatively large sample size, the power is quite large and the test detects relatively small deviation from normality.

Re: Are the scores normally distributed?

Bakary N'tji Diallo
I am reading this: "With large enough sample sizes (> 30 or 40), the violation of the normality assumption should not cause major problems (4); this implies that we can use parametric procedures even when the data are not normally distributed (8)."
Can I then use the normalization procedure given the large sample size? The normalization simply calculates the z-score, as in (here), using the mean and standard deviation.
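For reference, the z-score computation itself can be sketched as below, on synthetic stand-in data (the array `scores` is hypothetical). `scipy.stats.zscore` uses the population standard deviation (`ddof=0`) by default, which matches the case where the values are all the scores rather than a sample. The arithmetic works for any distribution; what depends on normality is reading the resulting z-scores as normal quantiles.

```python
import numpy as np
from scipy import stats

# Synthetic stand-in for the scores (assumption: not the notebook data).
rng = np.random.default_rng(4)
scores = rng.normal(-7.0, 1.5, size=1444)

# Manual z-score with the population standard deviation (ddof=0).
z_manual = (scores - scores.mean()) / scores.std(ddof=0)

# scipy.stats.zscore uses ddof=0 by default, so the two agree.
z_scipy = stats.zscore(scores)

print("max abs difference:", np.max(np.abs(z_manual - z_scipy)))
print("mean of z:", z_manual.mean())
print("std of z: ", z_manual.std())
```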



On Sat, Dec 8, 2018 at 06:52, Bakary N'tji Diallo <[hidden email]> wrote:
> About the large sample size, just for clarification, this is not a sample, these are all the scores.