Chi-Square Test in Python
GOAL
To write a program that performs a chi-square test in Python.
What is chi-square test?
The chi-square test, which here means Pearson's chi-square test, is a statistical hypothesis test used for goodness of fit and for independence.
The goodness-of-fit test determines whether an observed frequency distribution matches a theoretical distribution.
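As a minimal sketch of a goodness-of-fit test, the snippet below uses `scipy.stats.chisquare`. The die-roll counts are hypothetical and exist only for illustration; they are not part of the data analyzed later in this post.

```python
from scipy import stats

# Hypothetical observed counts for 60 rolls of a six-sided die (illustrative only)
observed = [8, 9, 12, 11, 6, 14]

# Under the null hypothesis of a fair die, each face is expected 10 times
expected = [10] * 6

# chisquare returns the test statistic and the p-value
chi2, p = stats.chisquare(f_obs=observed, f_exp=expected)
print(chi2, p)
```

A large p-value here would mean the observed counts are consistent with the theoretical (uniform) distribution.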
The independence test determines whether two categorical variables, cross-tabulated in a contingency table (a 2×2 table in the simplest case), are independent of each other.
A full explanation would be lengthy, so for details please see the following sites and document.
Implementation
The following is an implementation of the chi-square test.
Import libraries
import numpy as np
import pandas as pd
import scipy as sp
from scipy import stats
Data preparation
| | group A | group B | group C |
| success | 23 | 65 | 158 |
| failure | 100 | 44 | 119 |
| success rate | 0.187 | 0.596 | 0.570 |
chi_square_data.csv
A,B,C
23,65,158
100,44,119
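To follow along, the file can be created from Python. The sketch below writes the three rows shown above to `chi_square_data.csv` in the current directory (the location is an assumption) and recomputes the success rates as a sanity check against the table.

```python
from pathlib import Path

groups = ['A', 'B', 'C']
success = [23, 65, 158]
failure = [100, 44, 119]

# One comma-separated line per row: header, successes, failures
lines = [','.join(map(str, row)) for row in (groups, success, failure)]
Path('chi_square_data.csv').write_text('\n'.join(lines) + '\n')

# Success rate per group = successes / (successes + failures)
success_rate = [s / (s + f) for s, f in zip(success, failure)]
print([round(r, 3) for r in success_rate])
```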
Read and Set Data
csv_line = []
with open('chi_square_data.csv') as f:
    for line in f:
        line = line.strip()  # drop the trailing newline
        if not line:
            continue  # skip blank lines
        # Keep fields as strings; the header row ('A', 'B', 'C') is not numeric
        csv_line.append(line.split(','))

group = csv_line[0]
success = [int(n) for n in csv_line[1]]
failure = [int(n) for n in csv_line[2]]
groups = []
result = []
count = []
for i in range(len(group)):
    groups += [group[i], group[i]]     # ['A', 'A', 'B', 'B', 'C', 'C']
    result += ['success', 'failure']   # ['success', 'failure', 'success', 'failure', 'success', 'failure']
    count += [success[i], failure[i]]  # [23, 100, 65, 44, 158, 119]

data = pd.DataFrame({
    'groups': groups,
    'result': result,
    'count': count
})

cross_data = pd.pivot_table(
    data=data,
    values='count',
    aggfunc='sum',
    index='groups',
    columns='result'
)
print(cross_data)
>>result  failure  success
groups
A             100       23
B              44       65
C             119      158
Chi-square test
print(stats.chi2_contingency(cross_data, correction=False))
>> (57.23616422920877, 3.726703617716424e-13, 2, array([[ 63.554,  59.446],
       [ 56.32 ,  52.68 ],
       [143.126, 133.874]]))
- chi2 : 57.23616422920877
  - The test statistic
- p : 3.726703617716424e-13
  - The p-value of the test
- dof : 2
  - Degrees of freedom
- expected : array
  - The expected frequencies, based on the marginal sums of the table.
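To make these four values less opaque, the following sketch recomputes them by hand with NumPy from the counts in the table. It assumes the same row and column order as the pivot table output (rows A, B, C; columns failure, success) and no continuity correction, as with `correction=False`.

```python
import numpy as np

# Observed counts: rows are groups A, B, C; columns are (failure, success)
observed = np.array([[100, 23],
                     [44, 65],
                     [119, 158]])

row_totals = observed.sum(axis=1, keepdims=True)  # shape (3, 1)
col_totals = observed.sum(axis=0, keepdims=True)  # shape (1, 2)
total = observed.sum()

# Expected count under independence: row total * column total / grand total
expected = row_totals * col_totals / total

# Pearson's chi-square statistic: sum over cells of (O - E)^2 / E
chi2 = ((observed - expected) ** 2 / expected).sum()

# Degrees of freedom: (rows - 1) * (columns - 1)
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)

print(expected.round(3))
print(chi2, dof)
```

The statistic, degrees of freedom, and expected frequencies should agree with the values reported by `stats.chi2_contingency`.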
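The smaller the p-value, the stronger the evidence against the null hypothesis.
When the p-value is below the chosen significance level (typically 0.05), the result is statistically significant, meaning the difference between groups is unlikely to be due to chance alone. Here p ≈ 3.7e-13, so the null hypothesis that the result is independent of the group is rejected.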
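The decision rule can be sketched in a few lines. The significance level `alpha = 0.05` is an assumed convention, and the statistic and degrees of freedom are copied from the output above. `stats.chi2.sf` gives the upper-tail probability of the chi-square distribution, which is the p-value, and `stats.chi2.ppf` gives the critical value the statistic must exceed.

```python
from scipy import stats

alpha = 0.05                    # significance level (assumed convention)
chi2_stat = 57.23616422920877   # test statistic from the output above
dof = 2                         # degrees of freedom from the output above

# p-value: probability of a statistic at least this large under the null hypothesis
p_value = stats.chi2.sf(chi2_stat, dof)

# Critical value: the statistic must exceed this to reject at level alpha
critical_value = stats.chi2.ppf(1 - alpha, dof)

reject_null = p_value < alpha
print(p_value, critical_value, reject_null)
```

Comparing the p-value with `alpha` and comparing the statistic with the critical value are equivalent ways of stating the same decision.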