Reshaping a dataframe / table from long to wide format or wide to long format is one of the daily tasks a Data Analyst / Data Scientist would be doing. The long format is similar to the
tidy format that the
tidyverse advocates. Even while, it’s been a very common task – the
tidyr package’s solution of using
gather() almost never was intuitive enough to be used in the code without SOing or Referring the documentation. While I used to think, It’s just me who didn’t
gather, It seemed to be mass problem when the package developer Hadley Wickham realized this to completely rewrite those functions into two new functions
pivot_longer() (which is also equivalent to
cast() – or simply
If you are an R-user, It’s highly likely that you’ve got
tidyr either as a standalone package or part of the
tidyverse packages. Please update the package the get the latest
tidyr to get these functions for reshaping.
install.packages("tidyr") or devtools::install_github("tidyverse/tidyr")
and load the library.
library(tidyr) packageVersion("tidyr") ##  '0.8.3.9000'
Let’s take this problem from a very popular Stack Overflow Question.
#data df <- read.table(textConnection( "Code Country 1950 1951 1952 1953 1954 AFG Afghanistan 20249 19000 22532 5000 24555 ALB Albania 8097 8986 3000 11123 12246"), header=TRUE) df ## Code Country X1950 X1951 X1952 X1953 X1954 ## 1 AFG Afghanistan 20249 19000 22532 5000 24555 ## 2 ALB Albania 8097 8986 3000 11123 12246
This wide format is good for some use-cases but for most of what we do with Data Analysis, it’d require the data to be in the long format.
Why Reshaping - The Problem
Given this data, Let’s try to make a Time-series Line Chart using
ggplot2. But the format in which the data is currently shaped (wide) can’t help us in building the line chart because for a line chart using
geom_line() we need the data in the long format - where the x-axis and y-axis are columns (ideally with x being a Time variable and y being a continuous variable) and also that we’ve got the country which could help us comparing both the countries.
Let’s do that!
library(tidyverse) df_pivoted <- pivot_longer(df, cols = -c("Code","Country"), names_to = "year") df_pivoted ## # A tibble: 10 x 4 ## Code Country year value ## <fct> <fct> <chr> <int> ## 1 AFG Afghanistan X1950 20249 ## 2 AFG Afghanistan X1951 19000 ## 3 AFG Afghanistan X1952 22532 ## 4 AFG Afghanistan X1953 5000 ## 5 AFG Afghanistan X1954 24555 ## 6 ALB Albania X1950 8097 ## 7 ALB Albania X1951 8986 ## 8 ALB Albania X1952 3000 ## 9 ALB Albania X1953 11123 ## 10 ALB Albania X1954 12246
I’m personally glad that I don’t have to check the confusing syntax of
gather() instead the intuitve name of
pivot_longer() (which is literally what we’ve to do) flows through my fingers.
As you can see above, the
pivot_longer() function as any tidyverse function supports
%>% which also means the first argument is the dataframe itself, then the list of
cols we would like to
pivot_longer() and the next essential argument
names_to - new column name under whcih these are rolled up. This gives us the reshaped data which we could use for plotting. If you came to this article, just to see reshaping, you’re good by this place. But if you want to connect better with the reason why we wanted to reshape in the first place, Let’s go do the line graph!
Line Graph with reshaped data
Now that we know how to reshape the data (from wide to long format), we can use
geom_line() to plot the (time-series) lines.
df %>% pivot_longer(-c("Code","Country"),"year") %>% mutate(year = as.integer(str_replace(year, "X",""))) %>% ggplot() + geom_line(aes(year, value, group = Country, color = Country)) + labs(title = "Line Graph after Reshaping", subtitle = "I love pivot_longer()", caption = "I was always confused with gather and spread") + theme_minimal()
In the above code, I’ve repeated the reshaping again - just to imply a point that this is why
%>%s are cool because we can build intuitive ETL (kind-of) pipeline and visualization together step-by-step.
Well, We finally ended up with a beautiful (opinionated) line graph to compare two countries’ (made-up values) which came from a wide format data originally which we reshaped to long format data.