How to reshape a dataframe from wide to long or long to wide format

Posted on July 31, 2019 by r-bloggers on Programming with R in R bloggers

[This article was first published on r-bloggers on Programming with R, and kindly contributed to R-bloggers].

Reshaping a dataframe / table from long to wide format or wide to long format is one of the daily tasks of a data analyst / data scientist. The long format is similar to the tidy format that the tidyverse advocates. Even though it’s a very common task, the tidyr package’s solution, spread() and gather(), was almost never intuitive enough to use without searching Stack Overflow or referring to the documentation. I used to think it was just me who didn’t get spread() and gather(), but it turned out to be a widespread problem: the package developer Hadley Wickham recognized it and completely rewrote those functions as two new functions, pivot_longer() and pivot_wider() (the equivalents of melt() and cast(), or simply unpivot and pivot).

Update your `tidyr`

If you are an R user, it’s highly likely that you’ve got tidyr either as a standalone package or as part of the tidyverse. Please update the package to the latest tidyr to get these reshaping functions.

install.packages("tidyr") 

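or, for the development version from GitHub (which was needed when this post was written, before these functions reached CRAN in tidyr 1.0.0):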

devtools::install_github("tidyverse/tidyr")

and load the library.

library(tidyr)

packageVersion("tidyr")
## [1] '0.8.3.9000'
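
If you are not sure whether your installed tidyr is recent enough, a quick check like the following can help. This is a minimal sketch that assumes tidyr is already installed; the variable names `installed` and `has_pivot` are only illustrative.

```r
# Look up the installed tidyr version
installed <- packageVersion("tidyr")

# Check whether both pivot functions are exported by the installed tidyr
has_pivot <- all(
  c("pivot_longer", "pivot_wider") %in% getNamespaceExports("tidyr")
)

if (!has_pivot) {
  message("tidyr ", as.character(installed),
          " has no pivot functions; please update it first.")
}
```

If `has_pivot` is `FALSE`, update tidyr with one of the install commands above.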

Data

Let’s take a problem from a very popular Stack Overflow question.

#data
df <- read.table(textConnection(
"Code Country        1950    1951    1952    1953    1954
AFG  Afghanistan    20249  19000  22532  5000  24555
ALB  Albania        8097   8986   3000  11123  12246"), header=TRUE)

df
##   Code     Country X1950 X1951 X1952 X1953 X1954
## 1  AFG Afghanistan 20249 19000 22532  5000 24555
## 2  ALB     Albania  8097  8986  3000 11123 12246

This wide format is fine for some use cases, but most of what we do in data analysis requires the data to be in long format.

Why Reshaping - The Problem

Given this data, let’s try to make a time-series line chart using ggplot2. The data’s current (wide) shape doesn’t help us here: for a line chart with geom_line() we need the data in long format, where the x-axis and y-axis variables are each a column (ideally with x being a time variable and y a continuous variable), along with a Country column that lets us compare the two countries.

Let’s do that!

library(tidyverse)

df_pivoted <- pivot_longer(df, cols = -c("Code","Country"), names_to = "year")

df_pivoted
## # A tibble: 10 x 4
##    Code  Country     year  value
##    <fct> <fct>       <chr> <int>
##  1 AFG   Afghanistan X1950 20249
##  2 AFG   Afghanistan X1951 19000
##  3 AFG   Afghanistan X1952 22532
##  4 AFG   Afghanistan X1953  5000
##  5 AFG   Afghanistan X1954 24555
##  6 ALB   Albania     X1950  8097
##  7 ALB   Albania     X1951  8986
##  8 ALB   Albania     X1952  3000
##  9 ALB   Albania     X1953 11123
## 10 ALB   Albania     X1954 12246

I’m personally glad that I no longer have to look up the confusing syntax of gather(); instead, the intuitive name pivot_longer() (which is literally what we have to do) flows through my fingers.

As you can see above, pivot_longer(), like any tidyverse function, takes the dataframe itself as its first argument (which also means it works with %>%). Next comes cols, the columns we would like to pivot longer, and then the essential argument names_to: the name of the new column under which the old column names are rolled up. This gives us the reshaped data, which we can use for plotting. If you came to this article just to see reshaping, you’re good to stop here. But if you want to connect better with the reason why we wanted to reshape in the first place, let’s go do the line graph!
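
To make the argument order concrete, here is a small sketch of the same reshape written with %>%. It assumes tidyr 1.0.0 or later. The `names_prefix` and `values_to` arguments are not used in the original example, and the column name "amount" is a made-up label for illustration only.

```r
library(tidyr)

# Same toy data as above; stringsAsFactors = FALSE keeps Code and Country
# as character columns regardless of the R version
df <- read.table(text = "Code Country 1950 1951 1952 1953 1954
AFG Afghanistan 20249 19000 22532 5000 24555
ALB Albania 8097 8986 3000 11123 12246",
  header = TRUE, stringsAsFactors = FALSE)

df_long <- df %>%
  pivot_longer(
    cols = -c(Code, Country),  # pivot every column except these two
    names_to = "year",         # old column names go into this column
    names_prefix = "X",        # drop the "X" that read.table() prepended
    values_to = "amount"       # hypothetical name for the value column
  )

# The year column is still character after pivoting; convert it for plotting
df_long$year <- as.integer(df_long$year)
```

Stripping the prefix inside pivot_longer() is one option; cleaning the column afterwards with mutate() works just as well.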

Line Graph with reshaped data

Now that we know how to reshape the data (from wide to long format), we can use ggplot2’s geom_line() to plot the (time-series) lines.

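df %>% 
  # name the arguments explicitly; newer tidyr versions do not accept names_to by position
  pivot_longer(cols = -c("Code","Country"), names_to = "year") %>% 
  # read.table() prefixed the year columns with "X"; strip it and convert to integer
  mutate(year = as.integer(str_replace(year, "X",""))) %>% 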
  ggplot() + geom_line(aes(year, value, group = Country, color = Country)) +
  labs(title =  "Line Graph after Reshaping",
       subtitle = "I love pivot_longer()",
       caption = "I was always confused with gather and spread") +
  theme_minimal()

In the above code, I’ve repeated the reshaping step to make a point: this is why %>% is cool. We can build an intuitive ETL-like pipeline and the visualization together, step by step.

The End

Well, we finally ended up with a beautiful (opinionated) line graph comparing two countries (with made-up values), starting from wide-format data that we reshaped into long format.
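
The post only walks through the wide-to-long direction. For the reverse, pivot_wider() is the counterpart. The following is a minimal sketch, assuming tidyr 1.0.0 or later, with a small long-format dataframe typed in by hand from the values used above (only three years, to keep it short).

```r
library(tidyr)

# A small long-format dataframe built by hand from the toy values above
df_long <- data.frame(
  Code    = rep(c("AFG", "ALB"), each = 3),
  Country = rep(c("Afghanistan", "Albania"), each = 3),
  year    = rep(1950:1952, times = 2),
  value   = c(20249, 19000, 22532, 8097, 8986, 3000),
  stringsAsFactors = FALSE
)

# Spread the year values back out into one column per year
df_wide <- pivot_wider(df_long, names_from = year, values_from = value)
```

Here `names_from` says which column supplies the new column names and `values_from` says which column fills them, so `df_wide` has one row per country again.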
