Chi-Square Test in Python
GOAL
To write a program that performs a chi-square test in Python.
What is a chi-square test?
The chi-square test, which here means Pearson's chi-square test, is a statistical hypothesis test used for goodness of fit and for independence.
A goodness-of-fit test determines whether an observed frequency distribution matches a theoretical distribution.
An independence test determines whether two categorical variables, whose observed counts are arranged in a contingency table (for example, a 2×2 table), are independent of each other.
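The rest of this post implements the independence test. For the goodness-of-fit case, a minimal sketch might look like the following. The die-roll counts are hypothetical illustration data, not part of the original example, and `scipy.stats.chisquare` is used on the assumption that the expected counts are known in advance.

```python
from scipy import stats

# Hypothetical data: how often each face appeared in 120 rolls of a die.
observed = [18, 22, 16, 25, 20, 19]
# Theoretical distribution for a fair die: 120 rolls / 6 faces = 20 per face.
expected = [20] * 6

# chisquare compares observed counts with expected counts;
# the two lists must have the same total.
gof_stat, gof_p = stats.chisquare(f_obs=observed, f_exp=expected)
print(gof_stat, gof_p)
```

A large p-value here would mean the observed counts are consistent with the fair-die distribution.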
A full explanation would be lengthy, so please see the following sites and documents.
Implementation
The following is an implementation of the chi-square test.
Import libraries
import numpy as np
import pandas as pd
import scipy as sp
from scipy import stats
Data preparation
| | group A | group B | group C |
| success | 23 | 65 | 158 |
| failure | 100 | 44 | 119 |
| success rate | 0.187 | 0.596 | 0.570 |
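The success-rate row can be reproduced from the two count rows. This is a small sketch that assumes the rate is defined as success / (success + failure); the variable names are illustrative.

```python
import numpy as np

# Counts for groups A, B, C, taken from the table above
success = np.array([23, 65, 158])
failure = np.array([100, 44, 119])

# Success rate per group = successes / total trials in that group
success_rate = success / (success + failure)
print(np.round(success_rate, 3))
```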
chi_square_data.csv
A,B,C
23,65,158
100,44,119
Read and Set Data
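csv_line = []
with open('chi_square_data.csv') as f:
    for line in f:
        # Strip the trailing newline, then split the row into its fields.
        # Fields stay as strings: the header row ('A', 'B', 'C') is not numeric,
        # so calling float() on it would raise ValueError. The counts are
        # converted to int in the next step.
        items = line.strip().split(',')
        csv_line.append(items)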
group = csv_line[0]
success = [int(n) for n in csv_line[1]]
failure = [int(n) for n in csv_line[2]]

groups = []
result = []
count = []
for i in range(len(group)):
    groups += [group[i], group[i]]      # ['A', 'A', 'B', 'B', 'C', 'C']
    result += ['success', 'failure']    # ['success', 'failure', 'success', 'failure', 'success', 'failure']
    count += [success[i], failure[i]]   # [23, 100, 65, 44, 158, 119]

data = pd.DataFrame({
    'groups': groups,
    'result': result,
    'count': count
})
cross_data = pd.pivot_table(
    data=data,
    values='count',
    aggfunc='sum',
    index='groups',
    columns='result'
)
print(cross_data)
>>
result  failure  success
groups
A           100       23
B            44       65
C           119      158
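As an alternative to the manual parsing and pivot steps above, the same cross table could probably be built with pandas alone. The sketch below inlines the CSV contents through `io.StringIO` so that it runs without the file, and it assumes the first data row holds successes and the second holds failures, as in the table. The names `raw` and `cross_alt` are illustrative.

```python
import io
import pandas as pd

# Same contents as chi_square_data.csv, inlined so the sketch is self-contained
csv_text = "A,B,C\n23,65,158\n100,44,119\n"

raw = pd.read_csv(io.StringIO(csv_text))
# Assumption: row 0 is the success counts, row 1 is the failure counts
raw.index = ['success', 'failure']

# Transpose so that groups become rows and results become columns
cross_alt = raw.T[['failure', 'success']]
cross_alt.index.name = 'groups'
cross_alt.columns.name = 'result'
print(cross_alt)
```

With the real file, `pd.read_csv('chi_square_data.csv')` would replace the `io.StringIO` call.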
Chi-square test
print(stats.chi2_contingency(cross_data, correction=False))
>>
(57.23616422920877,
 3.726703617716424e-13,
 2,
 array([[ 63.554,  59.446],
        [ 56.32 ,  52.68 ],
        [143.126, 133.874]]))
- chi2 : 57.23616422920877
- The test statistic
- p : 3.726703617716424e-13
- The p-value of the test
- dof : 2
- Degrees of freedom
- expected : array
- The expected frequencies, based on the marginal sums of the table.
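To make the four returned values less of a black box, they can be recomputed by hand. This sketch assumes the standard Pearson formulas: each expected count is row total × column total / grand total, the statistic is the sum of (observed − expected)² / expected, and the degrees of freedom are (rows − 1) × (columns − 1). The `*_manual` names are illustrative.

```python
import numpy as np
from scipy import stats

# Rows: groups A, B, C; columns: failure, success (same layout as cross_data)
observed = np.array([[100, 23], [44, 65], [119, 158]])

row_totals = observed.sum(axis=1, keepdims=True)   # shape (3, 1)
col_totals = observed.sum(axis=0, keepdims=True)   # shape (1, 2)
grand_total = observed.sum()

# Expected frequencies under independence, from the marginal sums
expected_manual = row_totals * col_totals / grand_total

# Pearson's chi-square statistic
chi2_manual = ((observed - expected_manual) ** 2 / expected_manual).sum()

# Degrees of freedom and the upper-tail p-value of the chi-square distribution
dof_manual = (observed.shape[0] - 1) * (observed.shape[1] - 1)
p_manual = stats.chi2.sf(chi2_manual, dof_manual)

print(chi2_manual, p_manual, dof_manual)
print(np.round(expected_manual, 3))
```

These values should agree with the output of `stats.chi2_contingency(cross_data, correction=False)` shown above.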
The smaller the p-value, the stronger the evidence that you should reject the null hypothesis.
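When the p-value is below the chosen significance level (typically 0.05), the result is statistically significant. Here p ≈ 3.7e-13, so the null hypothesis that the result is independent of the group is rejected: the success rate differs between the groups.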
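The decision rule can be written out explicitly. This is a sketch that assumes a significance level of 0.05 chosen in advance; `alpha` and `significant` are illustrative names.

```python
import numpy as np
from scipy import stats

# Rows: groups A, B, C; columns: failure, success
observed = np.array([[100, 23], [44, 65], [119, 158]])

# Unpack the four results: statistic, p-value, degrees of freedom, expected counts
chi2, p, dof, expected = stats.chi2_contingency(observed, correction=False)

alpha = 0.05  # assumed significance level
significant = p < alpha

if significant:
    print(f"p = {p:.3g} < {alpha}: reject the null hypothesis of independence")
else:
    print(f"p = {p:.3g} >= {alpha}: fail to reject the null hypothesis")
```

A significant result only says that at least one group differs; it does not say which one.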