How to find the p-value using mannwhitneyu


I am trying to do a hypothesis test on census data.

I'm trying to show that years of education has an effect on salary. Salary is defined as one of two categories: <=50k or >50k.

Can I do this to calculate the p-value?

import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from scipy.stats import mannwhitneyu

print('Reading datasets...')
df_trn = pd.read_csv('adult.trn', index_col=False, skipinitialspace=True)

# Keep only the two columns needed for the test.
wanted_cols = ['years-of-edu', 'salary']

new_trn1 = {}

# Label-encode both columns. LabelEncoder assigns codes in sorted order,
# so the ordering of years-of-edu is preserved and '<=50k' should map to 0.
for column in wanted_cols:
    le = LabelEncoder()
    new_trn1[column] = le.fit_transform(df_trn[column])
list1 = np.array(new_trn1['years-of-edu'])
list2 = np.array(new_trn1['salary'])

list3 = list1[list2 == 0]
list4 = list1[list2 == 1]

print('list3:', np.median(list3))
print('list4:', np.median(list4))

pp = mannwhitneyu(list3, list4)
print(pp)
It returns a p-value of zero.
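
One possible reason for the zero, sketched here under assumptions: with tens of thousands of rows and a clear difference between the groups, the true p-value can be smaller than the smallest positive float, so it prints as 0.0. The data below is synthetic (the group sizes and value ranges are made up, not taken from the census file), and the log p-value uses a plain normal approximation without tie correction, so treat it as a rough magnitude only.

```python
import numpy as np
from scipy.stats import mannwhitneyu, norm

# Synthetic stand-in for the two groups (assumed sizes and ranges).
rng = np.random.default_rng(0)
edu_low = rng.integers(1, 14, size=24000)   # hypothetical "<=50k" group
edu_high = rng.integers(6, 17, size=8000)   # hypothetical ">50k" group

u_stat, p_value = mannwhitneyu(edu_low, edu_high, alternative='two-sided')
print('U =', u_stat, 'p =', p_value)

# Normal approximation to U, ignoring ties (an approximation, not
# necessarily what SciPy computes internally).
n1, n2 = len(edu_low), len(edu_high)
mu = n1 * n2 / 2.0
sigma = np.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
z = abs(u_stat - mu) / sigma

# Work in log space so a tiny p-value stays representable.
log_p = np.log(2.0) + norm.logsf(z)
print('z =', z, 'approx. log(p) =', log_p)

# U / (n1 * n2) is an effect size between 0 and 1. Depending on the SciPy
# version, U refers to the first sample or to the smaller of the two U values.
auc = u_stat / (n1 * n2)
print('U / (n1 * n2) =', auc)
```

If something like this is what happens with the census data, a p-value of zero would just mean "too small to represent", and an effect size such as U / (n1 * n2) may be more informative than the p-value itself.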

My data set looks like