Today’s goal is to set up a development environment for data analytics and machine learning with Anaconda.

What is Anaconda?

Anaconda is an open-source Python distribution for data science. You can see the list of included packages here.

The open-source Individual Edition (Distribution) is the easiest way to do Python/R data science and machine learning on a single machine. Developed for solo practitioners, it is a toolkit that equips you to work with thousands of open-source packages and libraries.

Access the website anaconda.com/products/individual and click the “Download” button to download the installer for the edition you want.

Start the installer and click “Continue”.

Check from the terminal that Anaconda is installed correctly.

Start Anaconda Navigator

Open Applications > Anaconda-Navigator.

Create virtual environment

Click “Environments” and create a new environment.

I named the new environment “data_analysis”.

Install libraries or modules with conda

Conda is an open source package management system and environment management system.

Open terminal in the environment where you want to install libraries.

Then run the “conda install” command to install libraries. For example, I installed PyTorch in the “pytorch” environment. The “-c” option specifies the channel (What is a “conda channel”?).

conda install pytorch torchvision -c pytorch

Start Application

Select the environment you want to use, then install or launch an application.

I launched “Jupyter Notebook” and checked that the “pytorch” library was installed successfully.
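
One lightweight way to confirm from a notebook cell or script that a library is importable is the standard library’s importlib; a minimal sketch (the helper name `is_installed` is my own, and the check below uses the stdlib module `math` so it runs anywhere, but the same call works for `torch`, PyTorch’s import name):

```python
import importlib.util

def is_installed(module_name):
    # True if the module can be found on the import path without importing it
    return importlib.util.find_spec(module_name) is not None

print(is_installed("math"))            # True: stdlib module, always available
print(is_installed("no_such_module"))  # False
```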

#source code
a = 1.2
b = 2.4
c = 3.6
print(a + b == c)

False

GOAL

To understand how integers and floating-point numbers are represented internally in a computer.

Binary Number

Numbers are represented as binary numbers in a computer.
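
As a quick sketch, Python’s built-in `bin()`, `format()`, and `int(s, 2)` convert between decimal and binary:

```python
n = 10
print(bin(n))          # '0b1010': binary string with prefix
print(format(n, 'b'))  # '1010': binary digits only
print(int('1010', 2))  # 10: parse a binary string back to an integer
```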

Unsigned int

Signed int: Two’s Complement notation

What is complement?

A complement has two definitions. First, the complement on n of a given number is the smallest number that, when added to the given number, increases its number of digits (i.e., produces a carry out of the highest digit). Second, it is also defined as the biggest number that, when added to the given number, does not increase its number of digits. In decimal, the 10’s complement of 3 is 7 and the 9’s complement of 3 is 6. How about in binary?
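
In decimal these complements are just subtraction from the base (or the base minus one), as a trivial sketch shows:

```python
n = 3
print(10 - n)  # 10's complement of a one-digit decimal number: 7
print(9 - n)   # 9's complement: 6
```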

One’s complement

The one’s complement of the input is the number generated by flipping each digit of the input: each 0 becomes 1 and each 1 becomes 0.

Two’s complement

The two’s complement of the input is the one’s complement plus 1, that is, the number generated by flipping each digit and then adding 1. Equivalently, it can be calculated by subtracting 1 from the input and then flipping each digit.

The range that can be expressed in two’s complement notation

The range is asymmetric: with n bits, two’s complement represents −2^(n−1) through 2^(n−1) − 1.
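
A minimal Python sketch of two’s complement within a fixed bit width (the helper name `twos_complement` is my own):

```python
def twos_complement(x, bits=8):
    # flip every bit (one's complement), add 1, and mask to the bit width
    return (~x + 1) & ((1 << bits) - 1)

# -3 in 8-bit two's complement
print(format(twos_complement(3), '08b'))  # 11111101

# the asymmetric 8-bit range: -128 .. 127
bits = 8
print(-(1 << (bits - 1)), (1 << (bits - 1)) - 1)
```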

Floating point

The following is how floating-point numbers are represented in a computer.

The number of digits is limited, which means floating point cannot represent every decimal in a continuous range; each decimal is approximated by the closest representable number.

Why is float(1.2) + float(2.4) not equal to float(3.6)?

In computers, a float cannot hold exactly 1.2, only the closest approximation that can be expressed in binary. You can see the error by increasing the number of decimal places printed.

#source code
a = 1.2
b = 2.4
c = 3.6
print('{:.20f}'.format(a))
print('{:.20f}'.format(b))
print('{:.20f}'.format(c))
print('{:.20f}'.format(a+b))
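
To see exactly what the float 1.2 stores, Decimal(float) converts the stored binary value without rounding (note this differs from Decimal('1.2'), which is exact), and struct exposes the raw IEEE 754 double-precision bit pattern; a small sketch:

```python
from decimal import Decimal
import struct

# exact decimal value of the double closest to 1.2
print(Decimal(1.2))
# raw IEEE 754 double-precision bit pattern of 1.2
print(struct.pack('>d', 1.2).hex())  # 3ff3333333333333
```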

*You can avoid this phenomenon by using the Decimal type in Python.

from decimal import Decimal
a = Decimal('1.2')
b = Decimal('2.4')
c = Decimal('3.6')
print(a+b == c)
print('{:.20f}'.format(a))
print('{:.20f}'.format(b))
print('{:.20f}'.format(c))
print('{:.20f}'.format(a+b))

ANOVA is a method of statistical hypothesis testing that determines the effects of factors and interactions by analyzing the differences between group means within a sample. The details would take too long here; please see the following site.

One-way ANOVA is an ANOVA test that compares the means of three or more samples. The null hypothesis is that the samples in all groups were taken from populations with the same mean.

Implementation

The following is an implementation example of one-way ANOVA.

Import Libraries

Import libraries below for ANOVA test.

import pandas as pd
import numpy as np
import scipy as sp
import csv # when you need to read csv data
from scipy import stats as st
import statsmodels.formula.api as smf
import statsmodels.api as sm
import statsmodels.stats.anova as anova #for ANOVA
from statsmodels.stats.multicomp import pairwise_tukeyhsd #for Tukey's multiple comparisons

csv_line = []
with open('test_data.csv') as f:
    for i in f:
        items = i.split(',')
        for j in range(len(items)):
            if '\n' in items[j]:
                items[j] = float(items[j][:-1])
            else:
                items[j] = float(items[j])
        print(items)
        csv_line.append(items)
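
The one-way test itself can be run with scipy’s f_oneway; here is a self-contained sketch with hypothetical data (stand-ins for the columns of test_data.csv, which are not shown in this post):

```python
from scipy import stats as st

# hypothetical scores for three groups (not the article's actual data)
a = [55, 60, 62, 58, 57, 59, 61]
b = [70, 78, 82, 75, 80, 77, 79]
c = [72, 75, 78, 74, 77, 76, 73]

f_value, p_value = st.f_oneway(a, b, c)
print(f_value, p_value)  # a small p-value suggests the group means differ
```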

The smaller the p-value, the stronger the evidence that you should reject the null hypothesis.
When the result is statistically significant, that is, the p-value is below the significance level (typically 0.05), perform a multiple comparison.

Tukey’s multiple comparisons

Use pairwise_tukeyhsd(endog, groups, alpha=0.05) for Tukey’s HSD (honestly significant difference) test. The argument endog is the response variable, an array of data (A[0] A[1] … A[6] B[0] … B[6] C[0] … C[6]). The argument groups is a list of names (A, A, …, A, B, …, B, C, …, C) corresponding to the response variable. alpha is the significance level.

def tukey_hsd(group_names, *args):
    endog = np.hstack(args)
    groups_list = []
    for i in range(len(args)):
        for j in range(len(args[i])):
            groups_list.append(group_names[i])
    groups = np.array(groups_list)
    res = pairwise_tukeyhsd(endog, groups)
    print(res.pvalues)  # print only the p-values
    print(res)          # print the full result

tukey_hsd(['A', 'B', 'C'], tdata['A'], tdata['B'], tdata['C'])
>> [0.02259466 0.06511251 0.85313142]
Multiple Comparison of Means - Tukey HSD, FWER=0.05
=====================================================
group1 group2 meandiff p-adj   lower    upper  reject
-----------------------------------------------------
     A      B -20.8571 0.0226 -38.9533 -2.7609   True
     A      C -17.1429 0.0651 -35.2391  0.9533  False
     B      C   3.7143 0.8531 -14.3819 21.8105  False
-----------------------------------------------------

Supplement

If you can’t find the ‘pvalues’ attribute, check the version of statsmodels.

To do a chi-square test and residual analysis in R. If you want to learn about the chi-square test and implement it in Python, refer to “Chi-Square Test in Python”.

Source Code

> test_data <- data.frame(
    groups = c("A", "A", "B", "B", "C", "C"),
    result = c("success", "failure", "success", "failure", "success", "failure"),
    number = c(23, 100, 65, 44, 158, 119)
  )
> test_data
groups result number
1 A success 23
2 A failure 100
3 B success 65
4 B failure 44
5 C success 158
6 C failure 119
> cross_data <- xtabs(number ~ ., test_data)
> cross_data
result
groups failure success
A 100 23
B 44 65
C 119 158
> result <- chisq.test(cross_data, correct=F)
> result
Pearson's Chi-squared test
data: cross_data
X-squared = 57.236, df = 2, p-value = 3.727e-13
> result$residuals
result
groups failure success
A 4.571703 -4.727030
B -1.641673 1.697450
C -2.016609 2.085125
> result$stdres
result
groups failure success
A 7.551524 -7.551524
B -2.663833 2.663833
C -4.296630 4.296630
> pnorm(abs(result$stdres), lower.tail = FALSE) * 2
result
groups failure success
A 4.301958e-14 4.301958e-14
B 7.725587e-03 7.725587e-03
C 1.734143e-05 1.734143e-05

functions

xtabs

The xtabs() function creates a contingency table by cross-classifying the factors contained in a data frame. “~” is the formula operator that specifies which variables serve as aggregation criteria, and “~ .” means the function uses all remaining variables (groups + result).
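
For comparison, the same cross table can be built in Python with pandas; a sketch using the same numbers as the R session:

```python
import pandas as pd

test_data = pd.DataFrame({
    'groups': ['A', 'A', 'B', 'B', 'C', 'C'],
    'result': ['success', 'failure'] * 3,
    'number': [23, 100, 65, 44, 158, 119],
})

# rough equivalent of R's xtabs(number ~ ., test_data)
cross = pd.pivot_table(test_data, values='number', index='groups',
                       columns='result', aggfunc='sum')
print(cross)
```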

chisq.test

The chisq.test() function returns the test statistic, degrees of freedom, and p-value. The argument “correct” controls the continuity correction; set it to F to suppress the correction.

result$residuals

$residuals returns the standardized residuals.

result$stdres

$stdres returns the adjusted standardized residuals.

pnorm(abs(result$stdres), lower.tail = FALSE) * 2

This calculates two-sided p-values for the adjusted standardized residuals.
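
The same calculation in Python uses scipy’s normal survival function; a sketch reusing the stdres values from the R session above:

```python
import numpy as np
from scipy import stats

# adjusted standardized residuals from the R output above
stdres = np.array([[7.551524, -7.551524],
                   [-2.663833, 2.663833],
                   [-4.296630, 4.296630]])

# equivalent of R's pnorm(abs(stdres), lower.tail = FALSE) * 2
pvals = stats.norm.sf(np.abs(stdres)) * 2
print(pvals)
```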

The chi-square test, which here means Pearson’s chi-square test, is a method of statistical hypothesis testing for goodness of fit and independence.

The goodness-of-fit test determines whether the observed frequency distribution is the same as a theoretical distribution.
The independence test determines whether two variables, with observations represented as a contingency table, are independent of each other.

The details would take too long here; please see the following sites and document.

The following is a Python implementation of the chi-square test.

Import libraries

import numpy as np
import pandas as pd
import scipy as sp
from scipy import stats

Data preparation

             group A  group B  group C
success           23       65      158
failure          100       44      119
success rate   0.187    0.596    0.570

chi_square_data.csv

A,B,C
23,65,158
100,44,119

Read and Set Data

csv_line = []
with open('chi_square_data.csv') as f:
    for i in f:
        # keep values as strings here: the header row ('A,B,C') can't be
        # converted to float; numeric conversion happens in the next step
        items = [s.rstrip('\n') for s in i.split(',')]
        csv_line.append(items)

group = csv_line[0]
success = [int(n) for n in csv_line[1]]
failure = [int(n) for n in csv_line[2]]
groups = []
result = []
count = []
for i in range(len(group)):
    groups += [group[i], group[i]]    # ['A', 'A', 'B', 'B', 'C', 'C']
    result += ['success', 'failure']  # ['success', 'failure', 'success', ...]
    count += [success[i], failure[i]] # [23, 100, 65, 44, 158, 119]
data = pd.DataFrame({
    'groups': groups,
    'result': result,
    'count': count
})

cross_data = pd.pivot_table(
    data=data,
    values='count',
    aggfunc='sum',
    index='groups',
    columns='result'
)
print(cross_data)
>> result  failure  success
groups
A             100       23
B              44       65
C             119      158

The expected frequencies are computed from the marginal sums of the table.
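
The test statistic, p-value, degrees of freedom, and these expected frequencies can all be obtained with scipy’s chi2_contingency; a sketch using the cross table above (correction=False corresponds to correct=F in the R call):

```python
import numpy as np
from scipy import stats

# contingency table: rows = groups A, B, C; columns = failure, success
observed = np.array([[100, 23],
                     [44, 65],
                     [119, 158]])

chi2, p, dof, expected = stats.chi2_contingency(observed, correction=False)
print(chi2, dof, p)  # matches the R result: X-squared = 57.236, df = 2
print(expected)      # expected frequencies from the marginal sums
```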

The smaller the p-value, the stronger the evidence that you should reject the null hypothesis. When the result is statistically significant, that is, the p-value is below the significance level (typically 0.05), the difference between groups is significant.

ANOVA (analysis of variance) is a method of statistical hypothesis testing that determines the effects of factors and interactions by analyzing the differences between group means within a sample. The details would take too long here; please see the following site.

One-way ANOVA is an ANOVA test that compares the means of three or more samples. The null hypothesis is that the samples in all groups were taken from populations with the same mean.

Implementation

The following is an implementation example of repeated measures (paired) one-way ANOVA.

Import Libraries

Import libraries below for ANOVA test.

import statsmodels.api as sm
from statsmodels.formula.api import ols
import pandas as pd
import numpy as np
import statsmodels.stats.anova as anova

csv_line = []
with open('test_data.csv') as f:
    for i in f:
        items = i.split(',')
        for j in range(len(items)):
            if '\n' in items[j]:
                items[j] = float(items[j][:-1])
            else:
                items[j] = float(items[j])
        print(items)
        csv_line.append(items)

aov = anova.AnovaRM(df, 'Point', 'Subjects', ['Conditions'])
result = aov.fit()
print(result)
>> Anova
========================================
F Value Num DF Den DF Pr > F
----------------------------------------
Conditions 5.4182 2.0000 12.0000 0.0211
========================================
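
Because the snippet above assumes a long-format DataFrame df built from test_data.csv, here is a self-contained sketch with hypothetical data (the column names Point, Subjects, and Conditions mirror the call above; the values are invented):

```python
import pandas as pd
import statsmodels.stats.anova as anova

# hypothetical repeated-measures data: 4 subjects, each measured under 3 conditions
df = pd.DataFrame({
    'Subjects':   [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4],
    'Conditions': ['X', 'Y', 'Z'] * 4,
    'Point':      [10, 12, 18, 11, 14, 17, 9, 13, 19, 12, 15, 16],
})

aov = anova.AnovaRM(df, 'Point', 'Subjects', ['Conditions'])
result = aov.fit()
print(result)  # F-test table with Num DF = 2, Den DF = 6 for Conditions
```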

The smaller the p-value, the stronger the evidence that you should reject the null hypothesis. When the result is statistically significant, that is, the p-value is below the significance level (typically 0.05), perform a multiple comparison. This p-value differs between paired and unpaired ANOVA.

Tukey’s multiple comparisons

Use pairwise_tukeyhsd(endog, groups, alpha=0.05) for Tukey’s HSD (honestly significant difference) test. The argument endog is the response variable, an array of data (A[0] A[1] … A[6] B[0] … B[6] C[0] … C[6]). The argument groups is a list of names (A, A, …, A, B, …, B, C, …, C) corresponding to the response variable. alpha is the significance level.

def tukey_hsd(group_names, *args):
    endog = np.hstack(args)
    groups_list = []
    for i in range(len(args)):
        for j in range(len(args[i])):
            groups_list.append(group_names[i])
    groups = np.array(groups_list)
    res = pairwise_tukeyhsd(endog, groups)
    print(res.pvalues)  # print only the p-values
    print(res)          # print the full result

tukey_hsd(['A', 'B', 'C'], tdata['A'], tdata['B'], tdata['C'])
>> [0.02259466 0.06511251 0.85313142]
Multiple Comparison of Means - Tukey HSD, FWER=0.05
=====================================================
group1 group2 meandiff p-adj   lower    upper  reject
-----------------------------------------------------
     A      B -20.8571 0.0226 -38.9533 -2.7609   True
     A      C -17.1429 0.0651 -35.2391  0.9533  False
     B      C   3.7143 0.8531 -14.3819 21.8105  False
-----------------------------------------------------