Chi-Square Test in R

This article is just a little note on R.

GOAL

To do chi-square test and residual analysis in R. If you want to know chi-square test and implement it in python, refer “Chi-Square Test in Python”.

Source Code

> test_data <- data.frame(
  groups = c("A","A", "B", "B", "C", "C"),
  result = c("success", "failure", "success", "failure", "success", 
 "failure"),
  number = c(23, 100, 65, 44, 158, 119)
)

> test_data
groups  result number
1      A success     23
2      A failure    100
3      B success     65
4      B failure     44
5      C success    158
6      C failure    119

> cross_data <- xtabs(number ~ ., test_data)
> cross_data
      result
groups failure success
     A     100      23
     B      44      65
     C     119     158

> result <- chisq.test(cross_data, correct=F)
> result
	Pearson's Chi-squared test

data:  cross_data
X-squared = 57.236, df = 2, p-value = 3.727e-13

> reesult$residuals
      result
groups   failure   success
     A  4.571703 -4.727030
     B -1.641673  1.697450
     C -2.016609  2.085125

> result$stdres
      result
groups   failure   success
     A  7.551524 -7.551524
     B -2.663833  2.663833
     C -4.296630  4.296630

> pnorm(abs(result$stdres), lower.tail = FALSE) * 2
      result
groups      failure      success
     A 4.301958e-14 4.301958e-14
     B 7.725587e-03 7.725587e-03
     C 1.734143e-05 1.734143e-05

functions

xtabs

xtabs() function is a function to create a contingency table from cross-classifying factors that contained in a data frame. “~” is the formula to specify variables that serve as aggregation criteria are described. And “~ .” means that this function use all variables (groups+result).

chisq.test

chisq.test() function is a function to return the test statistic, degree of freedom and p-value. The argument “correct” is continuity correction and set “correct” into F to suppress the continuity correction.

reesult$residuals

$residuals return standardized residuals.

result$stdres

$stdres return adjusted standardized residuals.

pnorm(abs(result$stdres), lower.tail = FALSE) * 2

This calculates p-value of standardized residuals.