Grouping Data in R: in this tutorial, you’ll learn the fundamentals of grouping and how to use it to transform and visualize a dataset.
Think about the flight delays in the airline dataset that we discussed in the previous post. Suppose we want to know whether a flight’s delay time varies depending on the day of the week.
A common assumption is that people fly most frequently on Mondays and Fridays. Is that true? And if so, does it affect how long flights are delayed?
You can answer these questions by grouping the data by reporting airline and day of the week, and then comparing the results for each day with one another.
Grouping data in R
The group_by() function from dplyr, which is part of the tidyverse, can be used to accomplish this. When working with categorical variables, you can use group_by() to divide the data into subgroups based on the variable’s distinct categories.
You can group by a single variable, or by several variables by passing multiple variable names.
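To make the difference between one and several grouping variables concrete, here is a minimal sketch. The data frame flights and its columns carrier, day, and delay are invented for illustration and are not part of the airline dataset.

```r
library(dplyr)

# Toy data, invented for illustration only (not from the airline dataset)
flights <- data.frame(
  carrier = c("AA", "AA", "AA", "AS", "AS", "AS"),
  day     = c(1L, 1L, 2L, 1L, 2L, 2L),
  delay   = c(10, 20, 5, 0, 30, 50)
)

# One grouping variable: one summary row per carrier
by_carrier <- flights %>%
  group_by(carrier) %>%
  summarize(avg_delay = mean(delay))

# Two grouping variables: one summary row per carrier/day combination
by_carrier_day <- flights %>%
  group_by(carrier, day) %>%
  summarize(avg_delay = mean(delay), .groups = "drop")

print(by_carrier)
print(by_carrier_day)
```

With two carriers and two days, the first summary has two rows and the second has four, one for each combination that occurs in the data.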
Let’s imagine you want to get the average flight delay minutes and see how they differ depending on the “Reporting_Airline” and “DayOfWeek” variables.
Let’s load the dataset.
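library(tidyverse)  # also attaches dplyr and ggplot2
library(dplyr)
library(ggplot2)
# header = TRUE: the first row of the file holds the column names
data <- read.csv("D:/RStudio/Airlinedata.csv", header = TRUE)
head(data)  # preview the first six rows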
To do so, start with the dataset, then, in the second step of the pipeline, group the data by “Reporting_Airline” and “DayOfWeek.”
Because you want to see how average flight delays differ across these groups, the third step takes the mean of “ArrDelayMinutes” for each group.
delay <- data %>%
  group_by(Reporting_Airline, DayOfWeek) %>%
  summarize(avg_delay = mean(ArrDelayMinutes))
The data has now been divided into subgroups, and printing delay displays the average flight delay for each one.
   Reporting_Airline DayOfWeek avg_delay
   <chr>                 <int>     <dbl>
 1 AA                        1      9.19
 2 AA                        2      6.23
 3 AA                        3      7.29
 4 AA                        4     12.4
 5 AA                        5     15.9
 6 AA                        6      8
 7 AA                        7     13.2
 8 AS                        1     15
 9 AS                        2     24.8
10 AS                        3     19.1
# ... with 53 more rows
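Two details of summarize() may matter when you adapt this pipeline. The sketch below uses a small invented data frame and assumes, which the tutorial does not state, that some delay values could be missing. In that case mean() returns NA unless na.rm = TRUE is set. It also counts the rows per group with n(), since an average over very few flights is less reliable, and uses the .groups argument (available in dplyr 1.0.0 and later) to return an ungrouped table.

```r
library(dplyr)

# Toy data, invented for illustration; NA stands for a flight with no recorded delay
toy <- data.frame(
  Reporting_Airline = c("AA", "AA", "AA", "AS", "AS"),
  DayOfWeek         = c(1L, 1L, 2L, 1L, 1L),
  ArrDelayMinutes   = c(10, NA, 4, 20, 40)
)

toy_delay <- toy %>%
  group_by(Reporting_Airline, DayOfWeek) %>%
  summarize(
    avg_delay = mean(ArrDelayMinutes, na.rm = TRUE), # skip missing values
    n_flights = n(),                                 # rows in the group, NA rows included
    .groups   = "drop"                               # return an ungrouped table
  )

print(toy_delay)
```

Whether dropping missing values is appropriate depends on what they mean in the real data, so treat na.rm = TRUE as a choice to check rather than a default.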
To take the investigation further, sort the table by “avg_delay” so that the longest delays appear at the top.
The following example sorts the table by “avg_delay” in decreasing order with arrange(desc()) and prints the result.
delay %>%
  arrange(desc(avg_delay))

   Reporting_Airline DayOfWeek avg_delay
   <chr>                 <int>     <dbl>
 1 PA (1)                    6     122.
 2 PA (1)                    5      36.5
 3 TW                        6      34.8
 4 B6                        7      26.8
 5 HP                        5      26
 6 VX                        2      25.7
 7 HP                        2      25
 8 AS                        2      24.8
 9 HP                        1      23.3
10 HP                        6      23
# ... with 53 more rows
According to the data, PA (1) had the longest average delays of any airline, on Saturdays (day 6) and Fridays (day 5).
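Reading day numbers in a sorted table is easier with day names attached. The sketch below takes four rows copied from the sorted output above, adds a name column, and sorts them again. The DayName column and the day_names vector are additions for illustration, and the mapping assumes that 1 stands for Monday and 7 for Sunday, which matches reading days 5 and 6 as Friday and Saturday.

```r
library(dplyr)

# A few rows copied from the sorted output above, deliberately out of order
delay_sample <- data.frame(
  Reporting_Airline = c("B6", "PA (1)", "TW", "PA (1)"),
  DayOfWeek         = c(7L, 5L, 6L, 6L),
  avg_delay         = c(26.8, 36.5, 34.8, 122)
)

# Assumption: 1 = Monday, ..., 7 = Sunday
day_names <- c("Monday", "Tuesday", "Wednesday", "Thursday",
               "Friday", "Saturday", "Sunday")

ranked <- delay_sample %>%
  mutate(DayName = day_names[DayOfWeek]) %>%  # look up the name by position
  arrange(desc(avg_delay))                    # longest average delay first

print(ranked)
```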
You may turn this table into a heatmap to make the data easier to grasp.
A heatmap is a two-dimensional data visualization approach that displays the magnitude of a phenomenon as color.
The color may vary in hue or intensity, giving the reader clear visual cues about how the phenomenon is clustered or evolves over time.
It is a good way to visualize the relationship between a target variable and other factors, because the target variable is plotted against several variables at once.
We can make use of the ggplot() function to make the heatmap.
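delay %>%
  ggplot(aes(x = Reporting_Airline, y = DayOfWeek, fill = avg_delay)) +
  # white tile borders; in ggplot2 3.4.0 and later, linewidth replaces size for line widths
  geom_tile(color = "white", size = 0.3) +
  scale_fill_gradient(low = "blue", high = "red")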
The scale_fill_gradient() function defines the color scale as a two-color gradient that runs from the low color to the high color.
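Because “DayOfWeek” is stored as an integer, ggplot2 treats it as a continuous axis, so the axis may not show one tick per day. One possible refinement, sketched below on four rows copied from the grouped output earlier in this tutorial, is to wrap the column in factor() so that each day gets its own row of tiles. The axis labels passed to labs() are wording chosen for illustration.

```r
library(ggplot2)

# Four rows copied from the grouped output shown earlier
heat_data <- data.frame(
  Reporting_Airline = c("AA", "AA", "AS", "AS"),
  DayOfWeek         = c(1L, 2L, 1L, 2L),
  avg_delay         = c(9.19, 6.23, 15, 24.8)
)

p <- ggplot(heat_data,
            aes(x = Reporting_Airline,
                y = factor(DayOfWeek),  # discrete axis: one row of tiles per day
                fill = avg_delay)) +
  geom_tile(color = "white") +          # white borders between tiles
  scale_fill_gradient(low = "blue", high = "red") +
  labs(x = "Reporting airline", y = "Day of week", fill = "Avg delay (min)")

print(p)
```

The same factor() call can be applied inside aes() in the full pipeline, leaving the rest of the plot unchanged.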
In this tutorial, you learned that grouping and sorting categorical data can help you answer more complex questions about it, and that a visualization such as a heatmap makes the results easier to read.