Category: DataAnalytics

How To Start Anaconda

GOAL

Today’s goal is to set up a development environment for data analytics and machine learning with Anaconda.

What is Anaconda?

Anaconda is an open-source Python distribution for data science. You can see the list of included packages here.

the open-source Individual Edition (Distribution) is the easiest way to perform Python/R data science and machine learning on a single machine. Developed for solo practitioners, it is the toolkit that equips you to work with thousands of open-source packages and libraries.

from anaconda.com/products/individual

Environment

macOS Catalina 10.15.5
Python 3.9.7

Method

Download & Install

Access the website anaconda.com/products/individual and click the “Download” button to download the installer for the edition you want.

Start the installer and click “Continue”.

Check that Anaconda was installed correctly from the terminal.
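
For example, the following standard conda commands print the installed version and environment information, which is enough to confirm the installation:

conda --version
# or, for more detail
conda info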

Start Anaconda Navigator

Open Applications > Anaconda-Navigator.

Create virtual environment

Click “Environments” and create a new environment.

I named the new environment “data_analysis”.
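
The same environment can also be created from the terminal with the conda command line. The environment name and Python version below are just examples:

# create and activate an environment (name and version are examples)
conda create -n data_analysis python=3.9
conda activate data_analysis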

Install libraries or modules with conda

Conda is an open-source package management and environment management system.

Open a terminal in the environment where you want to install libraries.

Then run the “conda install” command to install libraries.
For example, I installed PyTorch in the “pytorch” environment. The “-c” option specifies the channel (What is a “conda channel”?).

conda install pytorch torchvision -c pytorch
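
To confirm that the packages went into the active environment, you can list them with conda list; the package-name filter below is just an example:

conda list pytorch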

Start Application

Select the environment you want to use, then install or launch an application.

I launched “Jupyter Notebook” and checked that the “pytorch” library was installed successfully.
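
A quick way to check inside a notebook cell is to import the library and print its version (a minimal sketch; note that the import name of the pytorch package is torch):

# verify that PyTorch is importable in this environment
import torch
print(torch.__version__)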

Internal Representation of Numbers in Computers

I came across the following behavior. Why does it happen?

#source code
a = 1.2
b = 2.4
c = 3.6
print(a + b == c)
False

GOAL

To understand how integers and floating-point numbers are represented internally in computers.

Binary Number

Numbers are represented as binary numbers in computers.
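
As a small illustrative sketch, Python can convert between decimal and binary notation, which is handy for checking the representations discussed below:

print(bin(13))          # 0b1101
print(format(13, 'b'))  # 1101
print(int('1101', 2))   # 13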

Unsigned int

Signed int: Two’s Complement notation

What is a complement?

A complement has two definitions.
First, the radix complement (the “n’s complement” in base n) of a given number is the smallest number that, when added to the given number, carries into the next digit. Second, the diminished radix complement (the “(n-1)’s complement”) is the largest number that, when added to the given number, does not carry into the next digit. In decimal, the 10’s complement of 3 is 7 and the 9’s complement of 3 is 6. What about binary numbers?

One’s complement

The one’s complement of a number is obtained by inverting each of its bits: every 0 becomes 1 and every 1 becomes 0.

Two’s complement

The two’s complement of a number is its one’s complement plus 1; that is, invert each bit and then add 1. Equivalently, it can be calculated by subtracting 1 from the number first and then inverting each bit.
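
The following is a small sketch (not from the original article) that computes the 8-bit one’s and two’s complement of 5 in Python, using a bit mask to emulate a fixed width:

n = 5
bits = 8
mask = (1 << bits) - 1                 # 0b11111111

ones_complement = ~n & mask            # invert every bit
twos_complement = (~n + 1) & mask      # invert every bit, then add 1

print(format(n, '08b'))                # 00000101
print(format(ones_complement, '08b'))  # 11111010
print(format(twos_complement, '08b'))  # 11111011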

The range that can be expressed in two’s complement notation

With n bits, two’s complement notation can represent integers from -2^(n-1) to 2^(n-1) - 1, so the range is asymmetric: there is one more negative value than there are positive values.
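
For example, with 8 bits the range is -128 to 127, which can be checked in Python (numpy’s iinfo is used here just for illustration):

import numpy as np

n = 8
print(-(2 ** (n - 1)), 2 ** (n - 1) - 1)             # -128 127
print(np.iinfo(np.int8).min, np.iinfo(np.int8).max)  # -128 127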

Floating point

A floating-point number is stored in a computer as a sign, an exponent, and a mantissa.

The number of bits is limited, which means floating point cannot represent every decimal in a continuous range; each decimal is approximated by the closest representable value.
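
As an illustration (not part of the original article), Python can show the bit pattern actually stored for 1.2 as a 64-bit IEEE 754 double:

import struct

# hexadecimal view of the stored value; the mantissa is a repeating 0011 pattern
print((1.2).hex())                   # 0x1.3333333333333p+0
print(struct.pack('>d', 1.2).hex())  # 3ff3333333333333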

Why is float(1.2) + float(2.4) not equal to float(3.6)?

In a computer, a float cannot hold exactly 1.2, only the closest approximation that can be expressed in binary. You can see the error by printing more decimal places.

#source code
a = 1.2
b = 2.4
c = 3.6
print('{:.20f}'.format(a))
print('{:.20f}'.format(b))
print('{:.20f}'.format(c))
print('{:.20f}'.format(a+b))
1.19999999999999995559
2.39999999999999991118
3.60000000000000008882
3.59999999999999964473

*You can avoid this phenomenon by using the Decimal type in Python.

from decimal import Decimal
a = Decimal('1.2')
b = Decimal('2.4')
c = Decimal('3.6')
print(a+b == c)

print('{:.20f}'.format(a))
print('{:.20f}'.format(b))
print('{:.20f}'.format(c))
print('{:.20f}'.format(a+b))
True
1.20000000000000000000
2.40000000000000000000
3.60000000000000000000
3.60000000000000000000
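
Another common approach, not covered above, is to compare floats with a tolerance instead of requiring exact equality, for example with math.isclose:

import math

a, b, c = 1.2, 2.4, 3.6
print(a + b == c)              # False
print(math.isclose(a + b, c))  # True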

Unpaired One-way ANOVA And Multiple Comparisons In Python

GOAL

To write a program for unpaired one-way ANOVA (analysis of variance) and multiple comparisons using Python. Please refer to another article, “Paired One-way ANOVA And Multiple Comparisons In Python”, for paired one-way ANOVA.

What is ANOVA?

ANOVA is a method of statistical hypothesis testing that determines the effects of factors and their interactions by analyzing the differences between group means within a sample.
A full explanation would take longer; please see the following site.

One-way ANOVA is an ANOVA test that compares the means of three or more samples. The null hypothesis is that the samples in the groups were taken from populations with the same mean.

Implementation

The following is an implementation example of unpaired one-way ANOVA.

Import Libraries

Import the libraries below for the ANOVA test.

import pandas as pd
import numpy as np
import scipy as sp
import csv # when you need to read csv data
from scipy import stats as st

import statsmodels.formula.api as smf
import statsmodels.api as sm
import statsmodels.stats.anova as anova #for ANOVA
from statsmodels.stats.multicomp import pairwise_tukeyhsd #for Tukey's multiple comparisons

Data Preparing

group A  85  90  88  69  78  98  87
group B  55  82  67  64  78  54  49
group C  46  95  59  80  52  73  70

test_data.csv

85, 90, 88, 69, 78, 98, 87
55, 82, 67, 64, 78, 54, 49
46, 95, 59, 80, 52, 73, 70

Read and Set Data

csv_line = []
with open('test_data.csv') as f:
    for i in f:
        # strip the trailing newline and convert each value to float
        items = [float(x.strip()) for x in i.split(',')]
        print(items)
        csv_line.append(items)
groupA = csv_line[0]
groupB = csv_line[1]
groupC = csv_line[2]

tdata = pd.DataFrame({'A': groupA, 'B': groupB, 'C': groupC})
tdata.index = range(1, 8)  # 7 observations per group
tdata

If you want to display a data summary, use DataFrame.describe().

tdata.describe()

ANOVA

f, p = st.f_oneway(tdata['A'],tdata['B'],tdata['C'])
print("F=%f, p-value = %f"%(f,p))

>> F=4.920498, p-value = 0.019737

The smaller the p-value, the stronger the evidence against the null hypothesis.
When the result is statistically significant, that is, when the p-value is below the significance level (typically 0.05), perform a multiple comparison.
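
As a small sketch of that decision step (alpha = 0.05 is an assumed significance level), using the p value from st.f_oneway above:

alpha = 0.05
if p < alpha:
    print("Significant difference among groups: run a multiple comparison")
else:
    print("No significant difference detected at alpha =", alpha)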

Tukey’s multiple comparisons

Use pairwise_tukeyhsd(endog, groups, alpha=0.05) for Tukey’s HSD (honestly significant difference) test. The argument endog is the response variable, an array of the data (A[0], A[1], …, A[6], B[0], …, B[6], C[0], …, C[6]). The argument groups is a list of group names (A, A, …, A, B, …, B, C, …, C) that corresponds to the response variable. alpha is the significance level.

def tukey_hsd(group_names, *args):
    # stack the group data into one array and build the matching label array
    endog = np.hstack(args)
    groups_list = []
    for i in range(len(args)):
        for j in range(len(args[i])):
            groups_list.append(group_names[i])
    groups = np.array(groups_list)
    res = pairwise_tukeyhsd(endog, groups)
    print(res.pvalues)  # print only the p-values
    print(res)          # print the full result table

tukey_hsd(['A', 'B', 'C'], tdata['A'], tdata['B'], tdata['C'])
>>[0.02259466 0.06511251 0.85313142]
 Multiple Comparison of Means - Tukey HSD, FWER=0.05 
=====================================================
group1 group2 meandiff p-adj   lower    upper  reject
-----------------------------------------------------
     A      B -20.8571 0.0226 -38.9533 -2.7609   True
     A      C -17.1429 0.0651 -35.2391  0.9533  False
     B      C   3.7143 0.8531 -14.3819 21.8105  False
-----------------------------------------------------

Supplement

If the result object has no pvalues attribute, check the version of statsmodels.

import statsmodels
statsmodels.__version__
>> 0.9.0

If the version is lower than 0.10.0, update statsmodels. Open a command prompt or terminal and enter the command below.

pip install --upgrade statsmodels
# or pip3 install --upgrade statsmodels

Chi-Square Test in R

This article is just a little note on R.

GOAL

To perform a chi-square test and residual analysis in R. If you want to learn about the chi-square test and implement it in Python, refer to “Chi-Square Test in Python”.

Source Code

> test_data <- data.frame(
  groups = c("A","A", "B", "B", "C", "C"),
  result = c("success", "failure", "success", "failure", "success", 
 "failure"),
  number = c(23, 100, 65, 44, 158, 119)
)

> test_data
groups  result number
1      A success     23
2      A failure    100
3      B success     65
4      B failure     44
5      C success    158
6      C failure    119

> cross_data <- xtabs(number ~ ., test_data)
> cross_data
      result
groups failure success
     A     100      23
     B      44      65
     C     119     158

> result <- chisq.test(cross_data, correct=F)
> result
	Pearson's Chi-squared test

data:  cross_data
X-squared = 57.236, df = 2, p-value = 3.727e-13

> result$residuals
      result
groups   failure   success
     A  4.571703 -4.727030
     B -1.641673  1.697450
     C -2.016609  2.085125

> result$stdres
      result
groups   failure   success
     A  7.551524 -7.551524
     B -2.663833  2.663833
     C -4.296630  4.296630

> pnorm(abs(result$stdres), lower.tail = FALSE) * 2
      result
groups      failure      success
     A 4.301958e-14 4.301958e-14
     B 7.725587e-03 7.725587e-03
     C 1.734143e-05 1.734143e-05

functions

xtabs

The xtabs() function creates a contingency table from cross-classifying factors contained in a data frame. The “~” formula specifies the variables that serve as aggregation criteria, and “~ .” means that all remaining variables (groups and result) are used.

chisq.test

The chisq.test() function returns the test statistic, the degrees of freedom, and the p-value. The argument “correct” controls the continuity correction; set correct to F to suppress it.

result$residuals

$residuals returns the standardized residuals.

result$stdres

$stdres returns the adjusted standardized residuals.

pnorm(abs(result$stdres), lower.tail = FALSE) * 2

This calculates two-sided p-values for the adjusted standardized residuals.

Chi-Square Test in Python

GOAL

To write a program for the chi-square test using Python.

What is chi-square test?

The chi-square test, which here means “Pearson’s chi-square test”, is a method of statistical hypothesis testing for goodness of fit and independence.

The goodness-of-fit test determines whether an observed frequency distribution matches a theoretical distribution.
The independence test determines whether two variables, whose observations are summarized in a contingency table, are independent of each other.

A full explanation would take longer; please see the following sites and document.

Implementation

The following is an implementation of the chi-square test.

Import libraries

import numpy as np
import pandas as pd
import scipy as sp
from scipy import stats

Data preparing

              group A  group B  group C
success            23       65      158
failure           100       44      119
success rate    0.187    0.596    0.570

chi_square_data.csv

A,B,C
23,65,158
100,44,119

Read and Set Data

csv_line = []
with open('chi_square_data.csv') as f:
    for i in f:
        # strip the trailing newline and split the row into fields
        csv_line.append([x.strip() for x in i.split(',')])
group = csv_line[0]                      # header row: ['A', 'B', 'C']
success = [int(n) for n in csv_line[1]]  # [23, 65, 158]
failure = [int(n) for n in csv_line[2]]  # [100, 44, 119]

groups = []
result = []
count = []
for i in range(len(group)):
    groups += [group[i], group[i]] #['A','A', 'B', 'B', 'C', 'C']
    result += ['success', 'failure'] #['success', 'failure', 'success', 'failure', 'success', 'failure']
    count += [success[i], failure[i]] #[23, 100, 65, 44, 158, 119]
    
data =  pd.DataFrame({
    'groups' : groups,
    'result' : result,
    'count' : count
})
cross_data = pd.pivot_table(
    data = data,
    values ='count',
    aggfunc = 'sum',
    index = 'groups',
    columns = 'result'
)
print(cross_data)
>>result  failure  success
groups                  
A           100       23
B            44       65
C           119      158

Chi-square test

print(stats.chi2_contingency(cross_data, correction=False))
>> (57.23616422920877, 3.726703617716424e-13, 2, array([[ 63.554,  59.446],
       [ 56.32 ,  52.68 ],
       [143.126, 133.874]]))
  • chi2 : 57.23616422920877
    • The test statistic
  • p : 3.726703617716424e-13
    • The p-value of the test
  • dof : 2
    • Degrees of freedom
  • expected : array
    • The expected frequencies, based on the marginal sums of the table.
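
Instead of printing the raw tuple, the four return values can be unpacked by name, as in this minimal sketch (recent SciPy versions return a result object that still unpacks the same way):

chi2, p, dof, expected = stats.chi2_contingency(cross_data, correction=False)
print('chi2 =', chi2)
print('p-value =', p)
print('dof =', dof)
print('expected =', expected)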

The smaller the p-value, the stronger the evidence against the null hypothesis.
When the p-value is below the significance level (typically 0.05), the difference between the groups is statistically significant.

Paired One-way ANOVA And Multiple Comparisons In Python

GOAL

To write a program for paired one-way ANOVA (analysis of variance) and multiple comparisons using Python. Please refer to another article, “Unpaired One-way ANOVA And Multiple Comparisons In Python”, for unpaired one-way ANOVA.

What is ANOVA

ANOVA (analysis of variance) is a method of statistical hypothesis testing that determines the effects of factors and their interactions by analyzing the differences between group means within a sample.
A full explanation would take longer; please see the following site.

One-way ANOVA is an ANOVA test that compares the means of three or more samples. The null hypothesis is that the samples in the groups were taken from populations with the same mean.

Implementation

The following is an implementation example of paired one-way ANOVA.

Import Libraries

Import the libraries below for the ANOVA test.

import statsmodels.api as sm
from statsmodels.formula.api import ols
import pandas as pd
import numpy as np
import statsmodels.stats.anova as anova

Data Preparing

             id_1  id_2  id_3  id_4  id_5  id_6  id_7
condition A    85    90    88    69    78    98    87
condition B    55    82    67    64    78    54    49
condition C    46    95    59    80    52    73    70

test_data.csv

85, 90, 88, 69, 78, 98, 87
55, 82, 67, 64, 78, 54, 49
46, 95, 59, 80, 52, 73, 70

Read and Set Data

csv_line = []
with open('test_data.csv') as f:
    for i in f:
        # strip the trailing newline and convert each value to float
        items = [float(x.strip()) for x in i.split(',')]
        print(items)
        csv_line.append(items)
groupA = csv_line[0]
groupB = csv_line[1]
groupC = csv_line[2]
tdata = pd.DataFrame({'A': groupA, 'B': groupB, 'C': groupC})
tdata.index = range(1, 8)  # 7 observations per group
tdata

If you want to display a data summary, use DataFrame.describe().

tdata.describe()

ANOVA

subjects = ['id1', 'id2', 'id3', 'id4', 'id5', 'id6', 'id7']
points = np.array(groupA + groupB + groupC)
conditions = np.repeat(['A', 'B', 'C'], len(groupA))  # condition label for each point
subjects = np.array(subjects + subjects + subjects)
df = pd.DataFrame({'Point': points, 'Conditions': conditions, 'Subjects': subjects})
aov = anova.AnovaRM(df, 'Point', 'Subjects', ['Conditions'])
result = aov.fit()

print(result)

>>                 Anova
========================================
           F Value Num DF  Den DF Pr > F
----------------------------------------
Conditions  5.4182 2.0000 12.0000 0.0211
========================================

The smaller the p-value, the stronger the evidence against the null hypothesis.
When the result is statistically significant, that is, when the p-value is below the significance level (typically 0.05), perform a multiple comparison. Note that this p-value differs from the one obtained with the unpaired ANOVA.

Tukey’s multiple comparisons

Use pairwise_tukeyhsd(endog, groups, alpha=0.05) for Tukey’s HSD (honestly significant difference) test. The argument endog is the response variable, an array of the data (A[0], A[1], …, A[6], B[0], …, B[6], C[0], …, C[6]). The argument groups is a list of group names (A, A, …, A, B, …, B, C, …, C) that corresponds to the response variable. alpha is the significance level.

from statsmodels.stats.multicomp import pairwise_tukeyhsd  # not in the imports above

def tukey_hsd(group_names, *args):
    # stack the group data into one array and build the matching label array
    endog = np.hstack(args)
    groups_list = []
    for i in range(len(args)):
        for j in range(len(args[i])):
            groups_list.append(group_names[i])
    groups = np.array(groups_list)
    res = pairwise_tukeyhsd(endog, groups)
    print(res.pvalues)  # print only the p-values
    print(res)          # print the full result table

tukey_hsd(['A', 'B', 'C'], tdata['A'], tdata['B'], tdata['C'])
>> [0.02259466 0.06511251 0.85313142]
 Multiple Comparison of Means - Tukey HSD, FWER=0.05 
=====================================================
group1 group2 meandiff p-adj   lower    upper  reject
-----------------------------------------------------
     A      B -20.8571 0.0226 -38.9533 -2.7609   True
     A      C -17.1429 0.0651 -35.2391  0.9533  False
     B      C   3.7143 0.8531 -14.3819 21.8105  False
-----------------------------------------------------