Analysis of Variance in R
After completing this tutorial, you will be able to identify reasons for using an Analysis of Variance (ANOVA) test in your data analysis.
You’ll also learn how to interpret the results of an ANOVA F-test.
Let’s imagine you want to look at a categorical variable and see how it relates to other variables.
Take, for example, the Airline dataset.
Step 1: Loading Data
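library(tidyverse)  # attaches dplyr and ggplot2, among others
library(dplyr)
library(ggplot2)

# Read the airline data; the first row holds the column names
data <- read.csv("D:/RStudio/Airlinedata.csv", header = TRUE)
head(data)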
“How do different categories of the reporting airline feature (as a categorical variable) affect flight delays?” is a question you might wish to consider.
The ANOVA method can be used to evaluate categorical variables like “Reporting_Airline.”
ANOVA can be used to test whether the mean of a numeric variable differs between two or more groups of a categorical variable.
You may use ANOVA to see if there is any difference in the average flight delays for the different airlines in the Airline dataset.
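Before running a formal test, it can help to look at the group means directly. The sketch below is only an illustration: it builds a small synthetic data frame (the values are invented and are not taken from the Airline dataset) that reuses the column names ArrDelay and Reporting_Airline, then computes the average delay and the number of flights per airline with base R.

```r
# Synthetic stand-in for the Airline data (values are made up for illustration)
set.seed(42)
toy <- data.frame(
  Reporting_Airline = rep(c("AA", "AS", "PA (1)"), each = 50),
  ArrDelay = c(rnorm(50, mean = 5,  sd = 30),
               rnorm(50, mean = 6,  sd = 30),
               rnorm(50, mean = 25, sd = 30))
)

# Average arrival delay and number of flights per airline
group_means <- tapply(toy$ArrDelay, toy$Reporting_Airline, mean)
group_sizes <- table(toy$Reporting_Airline)

print(round(group_means, 1))
print(group_sizes)
```

The group means alone do not say whether a difference is larger than chance would produce; that is the question the ANOVA test addresses.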
Step 2: Null Hypothesis
The null hypothesis for ANOVA is that the mean (here, each reporting airline’s average arrival delay) is the same for all groups.
The alternative, or research, hypothesis is that at least one group mean differs from the others.
Here we look at two cases with two groups each: AA vs AS and AA vs PA (1).
In the first case, the null hypothesis is that the mean arrival delays of ‘AA’ and ‘AS’ are the same, while the alternative hypothesis is that they are not.
The F-test score and the p-value are returned by the ANOVA test.
The F-test score is the ratio of the variation between the sample group means to the variation within each sample group.
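Written out, the usual one-way ANOVA F statistic can be expressed as follows. The symbols are introduced here for illustration and do not appear in the original text: k is the number of groups, N the total number of observations, n_i the size of group i, \bar{y}_i the mean of group i, and \bar{y} the overall mean.

```latex
F = \frac{MS_{\text{between}}}{MS_{\text{within}}}
  = \frac{\sum_{i=1}^{k} n_i \, (\bar{y}_i - \bar{y})^2 \,/\, (k - 1)}
         {\sum_{i=1}^{k} \sum_{j=1}^{n_i} (y_{ij} - \bar{y}_i)^2 \,/\, (N - k)}
```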
The p-value indicates whether or not the outcome is statistically significant.
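In general, you can consider a difference to be statistically significant if the p-value is less than 0.05.
A large F-test score means the group means differ by much more than the variation within the groups would explain, which is evidence of an association; an F-test score near or below 1 gives no evidence of one.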
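To make the link between the F-test score and the p-value concrete, the following sketch computes both by hand and compares them with what aov() reports. It assumes a synthetic two-group data frame; the names toy, group and y are hypothetical and the values are invented, not taken from the Airline dataset.

```r
# Synthetic two-group data (invented values, not the Airline dataset)
set.seed(1)
toy <- data.frame(
  group = factor(rep(c("A", "B"), each = 40)),
  y = c(rnorm(40, mean = 10, sd = 5), rnorm(40, mean = 13, sd = 5))
)

k <- nlevels(toy$group)                  # number of groups
n <- nrow(toy)                           # total number of observations
grand_mean <- mean(toy$y)
means <- tapply(toy$y, toy$group, mean)  # one mean per group
sizes <- tapply(toy$y, toy$group, length)

# Between-group and within-group sums of squares
ss_between <- sum(sizes * (means - grand_mean)^2)
ss_within  <- sum((toy$y - ave(toy$y, toy$group))^2)

# Mean squares, F statistic, and upper-tail p-value from the F distribution
f_manual <- (ss_between / (k - 1)) / (ss_within / (n - k))
p_manual <- pf(f_manual, df1 = k - 1, df2 = n - k, lower.tail = FALSE)

# The same quantities as reported by aov()
tab   <- summary(aov(y ~ group, data = toy))[[1]]
f_aov <- tab[["F value"]][1]
p_aov <- tab[["Pr(>F)"]][1]

print(c(f_manual = f_manual, f_aov = f_aov))
print(c(p_manual = p_manual, p_aov = p_aov))
```

The hand-computed values should agree with the aov() table up to floating-point rounding, which shows that the p-value is the upper-tail probability of the F distribution at the observed F-test score.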
Step 3: ANOVA comparison
The aov() function in the stats package can be used to perform the ANOVA test.
data1 <- data %>%
  select(ArrDelay, Reporting_Airline) %>%
  filter(Reporting_Airline == 'AA' | Reporting_Airline == 'AS')

AOV <- aov(ArrDelay ~ Reporting_Airline, data = data1)
summary(AOV)

                    Df  Sum Sq Mean Sq F value Pr(>F)
Reporting_Airline    1     126   125.7    0.13  0.718
Residuals         1139 1097707   963.7
aov() calculates the ANOVA results from the arrival delay data of the two airline groups you want to compare.
Because the F-test score of 0.13 is less than 1 and the p-value of 0.718 is greater than 0.05, the flight delays of “AA” and “AS” are not significantly different.
A similar analysis can be applied to “AA” and “PA (1).”
data1 <- data %>%
  select(ArrDelay, Reporting_Airline) %>%
  filter(Reporting_Airline == 'AA' | Reporting_Airline == 'PA (1)')

AOV <- aov(ArrDelay ~ Reporting_Airline, data = data1)
summary(AOV)

                    Df  Sum Sq Mean Sq F value   Pr(>F)
Reporting_Airline    1   24008   24008   17.95 2.45e-05 ***
Residuals         1127 1507339    1337
Because the F-test score of 17.95 is quite high and the P-value is 0.0000245, which is less than 0.05, the flight delays between “AA” and “PA (1)” are significantly different.
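As a quick check on how the table columns relate, the F-test score is the group Mean Sq divided by the residual Mean Sq, and the p-value is the upper tail of the F distribution with the two Df values. The sketch below redoes that arithmetic from the rounded figures printed in the two tables, so small rounding differences from the reported values are to be expected; the variable names are hypothetical.

```r
# Figures copied from the two ANOVA tables in Step 3 (rounded as printed)

# AA vs AS
ms_within_as <- 1097707 / 1139          # residual Sum Sq / residual Df
f_as <- 125.7 / ms_within_as            # group Mean Sq / residual Mean Sq
p_as <- pf(f_as, df1 = 1, df2 = 1139, lower.tail = FALSE)

# AA vs PA (1)
ms_within_pa <- 1507339 / 1127
f_pa <- 24008 / ms_within_pa
p_pa <- pf(f_pa, df1 = 1, df2 = 1127, lower.tail = FALSE)

print(round(c(f_as = f_as, f_pa = f_pa), 2))
print(signif(c(p_as = p_as, p_pa = p_pa), 3))
```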
Because the ANOVA test for “AA” and “PA (1)” produces a large F-test score and a small p-value, you can conclude that there is a statistically significant association between the categorical variable, Reporting_Airline, and the arrival delay.
You learned that an ANOVA test can be used to identify differences in the mean of a numeric variable between distinct groups of a categorical variable, and that the F-test score and p-value indicate whether those differences are statistically significant.