Unpaired One-way ANOVA And Multiple Comparisons In Python

2020-01-27 DataAnalytics Python

GOAL

To write program of unpaired one-way ANOVA(analysis of variance) and multiple comparisons using python. Please refer another article “Paired One-way ANOVA And Multiple Comparisons In Python” for paired one-way ANOVA.

What is ANOVA?

ANOVE is is a method of statistical hypothesis testing that determines the effects of factors and interactions, which analyzes the differences between group means within a sample.
Details will be longer. Please see the following site.

One-way ANOVA is ANOVA test that compares the means of three or more samples. Null hypothesis is that samples in groups were taken from populations with the same mean.

Implementation

The following is implementation example of one-way ANOVA.

Import Libraries

Import libraries below for ANOVA test.

import pandas as pd
import numpy as np
import scipy as sp
import csv # when you need to read csv data
from scipy import stats as st

import statsmodels.formula.api as smf
import statsmodels.api as sm
import statsmodels.stats.anova as anova #for ANOVA
from statsmodels.stats.multicomp import pairwise_tukeyhsd #for Tukey's multiple comparisons

Data Preparing

group A	85	90	88	69	78	98	87
group B	55	82	67	64	78	54	49
group C	46	95	59	80	52	73	70

test_data.csv

85, 90, 88, 69, 78, 98, 87
55, 82, 67, 64, 78, 54, 49
46, 95, 59, 80, 52, 73, 70

Read and Set Data

csv_line = []
with open('test_data.csv', ) as f:
    for i in f:
        items = i.split(',')
        for j in range(len(items)):
            if '\n' in items[j]:
                items[j] =float(items[j][:-1])
            else:
                items[j] =float(items[j])
        print(items)
        csv_line.append(items)

groupA = csv_line [0]
groupB = csv_line [1]
groupC = csv_line [2]

tdata = pd.DataFrame({'A':groupA, 'B':groupB, 'C':groupC})
tdata.index = range(1,10)
tdata

If you want to display data summary, use DataFrame.describe().

tdata.describe()

ANOVA

f, p = st.f_oneway(tdata['A'],tdata['B'],tdata['C'])
print("F=%f, p-value = %f"%(f,p))

>> F=4.920498, p-value = 0.019737

The smaller the p-value, the stronger the evidence that you should reject the null hypothesis.
When statistically significant, that is, p-value is less than 0.05 (typically ≤ 0.05), perform a multiple comparison.

Tukey’s multiple comparisons

Use pairwise_tukeyhsd(endog, groups, alpha=0.05) for tuky’s HSD(honestly significant difference) test. Argument endog is response variable, array of data (A[0] A[1]… A[6] B[1] … B[6] C[1] … C[6]). Argument groups is list of names(A, A…A, B…B, C…C) that corresponds to response variable. Alpha is significance level.

def tukey_hsd(group_names , *args ):
    endog = np.hstack(args)
    groups_list = []
    for i in range(len(args)):
        for j in range(len(args[i])):
            groups_list.append(group_names[i])
    groups = np.array(groups_list)
    res = pairwise_tukeyhsd(endog, groups)
    print (res.pvalues) #print only p-value
    print(res) #print result

print(tukey_hsd(['A', 'B', 'C'], tdata['A'], tdata['B'],tdata['C']))
>>[0.02259466 0.06511251 0.85313142]
 Multiple Comparison of Means - Tukey HSD, FWER=0.05 
=====================================================
group1 group2 meandiff p-adj   lower    upper  reject
-----------------------------------------------------
     A      B -20.8571 0.0226 -38.9533 -2.7609   True
     A      C -17.1429 0.0651 -35.2391  0.9533  False
     B      C   3.7143 0.8531 -14.3819 21.8105  False
-----------------------------------------------------
None

Supplement

If you can’t find ‘pvalue’ key, check the version of statsmodels.

import statsmodels
statsmodels.__version__
>> 0.9.0

If the version is lower than 0.10.0, update statsmodels. Open command prompt or terminal and input the command below.

pip install --upgrade statsmodels
# or pip3 install --upgrade statsmodels

Nakoblog Engineering and Design for Storytelling