Data Analysis and Statistical Inference: A Quick Guide Part 2 (ANOVA)

Handling variables in data, Statistical Tests & Codes

Janpreet Singh
The Bit Theories

In my previous article, I discussed how to draw inferences between different types of variables. This time, I will cover making inferences between a categorical and a continuous variable using ANOVA (ANalysis Of VAriance).

You might be thinking, “Hey, we could simply use a t-test for that.” The problem is that the t-test becomes inefficient as the number of factor classes increases, because every pair of classes needs its own test. With 2 factor classes we need 1 t-test, with 3 classes we need 3, and with 4 classes the number rises to 6. So we need an efficient procedure for examining the data when we have more than 2 factor classes, hence the ANOVA test.
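As a quick check on these counts, the number of pairwise t-tests for k factor classes is "k choose 2". A minimal sketch in R (the variable names are only illustrative):

```r
# number of factor classes to consider
k <- 2:6

# each pair of classes needs its own t-test, so the count is "k choose 2"
n_tests <- choose(k, 2)

print(data.frame(classes = k, t_tests = n_tests))
```

For 2, 3 and 4 classes this gives 1, 3 and 6 tests, matching the numbers above, and it keeps growing quickly after that.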

When we talk about variance here, there are two kinds of variability to consider: within group variability and between group variability.

within group variability (Fig. 1)

In Fig. 1, which depicts within group variability, the sample on the left-hand side has less within group variability than the sample on the right-hand side.

between group variability (Fig. 2)

In Fig. 2, we have three samples shown in blue, maroon and green. When the distributions of the samples follow this pattern, it implies that they have less between group variability.

So, from the above discussion, we can say that less within group variability makes it more likely that the samples differ significantly, while less between group variability makes it less likely that they differ significantly.

ANOVA captures exactly what the statement above says. Basically, ANOVA separates within group variability from between group variability and lets us draw inferences from the two. For this purpose, we use the F ratio, which is defined as follows:

F ratio for ANOVA
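For reference, the standard textbook form of the one-way ANOVA F ratio is written out below. The notation is an assumption chosen here rather than taken from the original figure: k is the number of groups, N the total number of observations, n_i the number of observations in group i, \bar{x}_i the mean of group i and \bar{x} the overall mean.

```latex
F = \frac{\text{between group variability}}{\text{within group variability}}
  = \frac{\mathrm{SS}_{\text{between}} / (k - 1)}{\mathrm{SS}_{\text{within}} / (N - k)},
\qquad
\mathrm{SS}_{\text{between}} = \sum_{i=1}^{k} n_i \, (\bar{x}_i - \bar{x})^2,
\quad
\mathrm{SS}_{\text{within}} = \sum_{i=1}^{k} \sum_{j=1}^{n_i} (x_{ij} - \bar{x}_i)^2
```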

Now, having the intuition for the F ratio and not going much into the mathematics, let’s head straight for the R code.

The F ratio compares the between group variability of the samples with their within group variability: the larger the ratio, the stronger the evidence that the group means differ.
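To make the ratio concrete, here is a small sketch that computes it by hand as (between group mean square) / (within group mean square) and checks the result against R's own anova(). The toy numbers are made up purely for illustration and have nothing to do with the flower data.

```r
# toy data (hypothetical): three groups with three readings each
values <- c(1, 2, 3,  2, 3, 4,  5, 6, 7)
group  <- factor(rep(c("a", "b", "c"), each = 3))

k           <- nlevels(group)                 # number of groups
n_total     <- length(values)                 # total number of readings
grand_mean  <- mean(values)
group_means <- tapply(values, group, mean)    # one mean per group
group_sizes <- tapply(values, group, length)  # readings per group

# between group sum of squares: spread of the group means around the grand mean
ss_between <- sum(group_sizes * (group_means - grand_mean)^2)
# within group sum of squares: spread of each reading around its own group mean
ss_within  <- sum((values - ave(values, group))^2)

# F ratio = between group mean square / within group mean square
f_ratio <- (ss_between / (k - 1)) / (ss_within / (n_total - k))

# the same quantity as reported by R's ANOVA table
f_builtin <- anova(lm(values ~ group))$`F value`[1]

print(c(by_hand = f_ratio, builtin = f_builtin))
```

For this toy data both values come out as 13, because the group means (2, 3 and 6) are far apart compared with the spread inside each group.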

The null hypothesis here is that the means of all the groups are equal, i.e. if we have three groups then mean1 = mean2 = mean3. The alternative hypothesis is that at least one group mean differs from the others.

Here, if the p-value is greater than 0.05 we fail to reject the null hypothesis; otherwise we reject it.

Link for the data is here

CODE:

setwd("/home/datumx/test/bit-theories")
# loading the data
flower <- read.csv("flowers.csv")
# converting the flower types to numeric codes (1 to 4)
flower$flower_type <- as.numeric(as.factor(flower$flower_type))
# ordering the rows by group so that they line up with the labels created below
flower <- flower[order(flower$flower_type), ]
# creating a dummy group variable: 4 groups (a to d) with 6 readings each
groups <- factor(rep(letters[1:4], each = 6))
# running ANOVA on the groups
anova_fit <- lm(formula = flower$petal.size ~ groups)
# checking the p-value
anova(anova_fit)

Analysis of Variance Table

Response: flower$petal.size
          Df  Sum Sq Mean Sq F value  Pr(>F)
groups     3  610.46 203.486  3.6191 0.03096 *
Residuals 20 1124.50  56.225

The output of the function is a classical ANOVA table with the following columns:
Df = degrees of freedom
Sum Sq = sum of squares (between groups in the groups row, within groups in the Residuals row)
Mean Sq = mean square, i.e. Sum Sq / Df (between groups, and residual)
F value = the value of the F (Fisher) test statistic, computed as (Mean Sq between groups) / (Mean Sq residual)
Pr(>F) = p-value
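The columns are linked by simple arithmetic: each Mean Sq is its Sum Sq divided by its Df, F value is the ratio of the two Mean Sq values, and Pr(>F) is the upper tail of the F distribution at that value. As a rough sketch, the last three columns can be rebuilt from the first two using the numbers printed in the table (the variable names are only illustrative):

```r
# values copied from the Df and Sum Sq columns of the ANOVA table
df_groups <- 3
df_resid  <- 20
ss_groups <- 610.46
ss_resid  <- 1124.50

# Mean Sq = Sum Sq / Df
ms_groups <- ss_groups / df_groups
ms_resid  <- ss_resid / df_resid

# F value = Mean Sq (groups) / Mean Sq (Residuals)
f_value <- ms_groups / ms_resid

# Pr(>F) = probability of an F at least this large under the null hypothesis
p_value <- pf(f_value, df_groups, df_resid, lower.tail = FALSE)

print(round(c(F = f_value, p = p_value), 4))
```

Up to rounding, this should reproduce the F value of about 3.619 and the p-value of about 0.031 shown in the table.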

Here I have used data on 4 types of flowers and their petal sizes. Each flower type has 6 readings of petal size. The null hypothesis is that the mean petal size is the same for all the flower types.

In the analysis of variance table here, the p-value (0.03096) is less than 0.05, so we reject the null hypothesis that the mean petal size of all the flower types is the same.

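We can also compare the computed F-value with the tabulated F-value at the 5% level. Note that qf() takes the numerator degrees of freedom (3, for the groups) before the denominator degrees of freedom (20, for the residuals):

qf(0.950, 3, 20)
[1] 3.098391

The computed F-value (3.6191) is larger than this critical value, which again leads us to reject the null hypothesis.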

Note that each group here has the same number of readings. There are other ANOVA designs which can be used when the groups have unequal sample sizes.
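Since the flowers.csv file is not reproduced in this article, the same workflow can be tried on simulated data. The sketch below is only an illustration: the data frame sim, its group means, the spread and the seed are all made-up assumptions, not the real flower readings. It also passes the factor column straight into the formula, so no separate dummy group variable is needed, and shows that aov() gives the same F value as anova(lm()).

```r
set.seed(42)  # arbitrary seed so that the simulated numbers are repeatable

# hypothetical data: 4 flower types with 6 petal size readings each
sim <- data.frame(
  flower_type = factor(rep(c("type1", "type2", "type3", "type4"), each = 6)),
  petal.size  = rnorm(24, mean = rep(c(20, 25, 30, 22), each = 6), sd = 7)
)

# same approach as in the article: a linear model followed by anova()
fit_lm <- lm(petal.size ~ flower_type, data = sim)
print(anova(fit_lm))

# equivalent one-way ANOVA using aov()
fit_aov <- aov(petal.size ~ flower_type, data = sim)
print(summary(fit_aov))
```

Both tables should show 3 and 20 degrees of freedom, as in the flower example, because the design (4 groups of 6 readings) is the same.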

Takeaways:

  • Meaning of ANOVA.
  • Limitations of the t-test compared with ANOVA.
  • Making inferences when we have more than two groups of a categorical variable.

Team Cyber Labs (Website, Facebook)
