Learn Tidyverse: Pivot Functions

[This article was first published on Blog on Data Solutions | Dedicated to helping businesses making data-driven decisions, and kindly contributed to R-bloggers].

TL;DR:

We will be using the pivot_longer and pivot_wider functions to change the shape of our data frame between long and wide formats. You may be used to melt from the reshape2 package or gather and spread from tidyr, but those functions are no longer being actively developed, and pivot_longer and pivot_wider are the recommended replacements. I’m going to start off by creating a fake dataset to work with that has 3 categories (lake, beach, and park) with counts of the number of visitors to each location for each year from 2011 to 2020.

library(tidyverse)

dat. <- data.frame(year = rep(seq(2011, 2020), each = 3), 
                   location = rep(c("beach", "park", "lake"), 10), 
                   N = round(runif(30, 10,100))) 
head(dat.)
##   year location  N
## 1 2011    beach 61
## 2 2011     park 46
## 3 2011     lake 60
## 4 2012    beach 38
## 5 2012     park 41
## 6 2012     lake 31

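Currently, the data is in long format: for each year there are 3 separate rows, one per location, each with a count of the number of visitors. So we are going to first use the pivot_wider function to turn it into wide format. The names_from argument says which column supplies the new column names, and the values_from argument says which column supplies the values that fill them.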

dat.wider <- dat. %>% 
  pivot_wider(names_from = location, 
              values_from = N)

dat.wider
## # A tibble: 10 x 4
##     year beach  park  lake
##    <int> <dbl> <dbl> <dbl>
##  1  2011    61    46    60
##  2  2012    38    41    31
##  3  2013    23    64    69
##  4  2014    67    98    70
##  5  2015    97    48    39
##  6  2016    44    59    52
##  7  2017    88    19    26
##  8  2018    22    43    94
##  9  2019    17    61    62
## 10  2020    84    56    84

The new data frame is closer to square because it has more columns and fewer rows than our original data frame did. Now, each row holds 3 observations: the number of visitors to each location in that year.
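
If you want to check the change in shape yourself, here is a minimal sketch. It uses a small hand-typed data frame (the name visits_long and the numbers are stand-ins I made up for illustration, not the simulated data above) so the result is the same every time you run it.

```r
library(tidyr)

# Hypothetical stand-in for the visitor data: 2 years x 3 locations
visits_long <- data.frame(
  year     = rep(c(2011, 2012), each = 3),
  location = rep(c("beach", "park", "lake"), times = 2),
  N        = c(61, 46, 60, 38, 41, 31),
  stringsAsFactors = FALSE
)

# One column per location, one row per year
visits_wide <- pivot_wider(visits_long,
                           names_from  = location,
                           values_from = N)

dim(visits_long)  # 6 rows, 3 columns
dim(visits_wide)  # 2 rows, 4 columns
```

The long version has one row per year and location, while the wide version has one row per year and one column per location.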

To turn it back to long format, we will use pivot_longer and give the cols argument the columns we want to combine into one.

dat.wider %>% 
  pivot_longer(cols = c("beach", "park", "lake"))
## # A tibble: 30 x 3
##     year name  value
##    <int> <chr> <dbl>
##  1  2011 beach    61
##  2  2011 park     46
##  3  2011 lake     60
##  4  2012 beach    38
##  5  2012 park     41
##  6  2012 lake     31
##  7  2013 beach    23
##  8  2013 park     64
##  9  2013 lake     69
## 10  2014 beach    67
## # ... with 20 more rows

dat.wider %>%
  pivot_longer(cols = c("beach", "park", "lake"),
               names_to = "location",
               values_to = "N_visitors")
## # A tibble: 30 x 3
##     year location N_visitors
##    <int> <chr>         <dbl>
##  1  2011 beach            61
##  2  2011 park             46
##  3  2011 lake             60
##  4  2012 beach            38
##  5  2012 park             41
##  6  2012 lake             31
##  7  2013 beach            23
##  8  2013 park             64
##  9  2013 lake             69
## 10  2014 beach            67
## # ... with 20 more rows

In the first call I only told the function which columns to combine, so the new columns got the default names, name and value. In the second call, I specified the names I wanted those columns to have. The names_to argument gives the name of the new column that holds the old column names, and the values_to argument gives the name of the column that will hold the data from the combined columns.
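
Typing out every column name gets tedious when there are many of them. As a sketch of one alternative, cols also accepts tidyselect helpers, so you can say "everything except year" instead. The data frame below is a small made-up stand-in (visits_wide is a name I chose for illustration), not the simulated data above.

```r
library(tidyr)

# Hypothetical wide data: one row per year, one column per location
visits_wide <- data.frame(
  year  = c(2011, 2012),
  beach = c(61, 38),
  park  = c(46, 41),
  lake  = c(60, 31)
)

# !year selects every column except year
visits_long <- pivot_longer(visits_wide,
                            cols      = !year,
                            names_to  = "location",
                            values_to = "N_visitors")

visits_long
```

This should give the same result as listing c("beach", "park", "lake") by hand, and it keeps working if more location columns are added later.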

You may be wondering why long or wide format even matters. One reason is that if you use ggplot2, plotting is much easier when your data is in long format instead of wide.

dat. %>% 
  ggplot(aes(x = year, y = N, color = location)) +
  geom_line()

dat.wider %>% 
  ggplot(aes(x = year, y = beach)) +
  geom_line(color = 1) +
  geom_line(aes(x = year, y = park), color = 2) +
  geom_line(aes(x = year, y = lake), color = 3)

In the above examples you can see that with only 3 lines of code and the long data set, I create a graph with 3 lines, one for each location, colored according to location and labeled in a legend. If I use the wide data set, it takes 5 lines of code, I have to add each location separately, and there is no legend. If you are going to do it this way, you might as well use base R plotting. Also, if you want to use more advanced ggplot2 functions such as facet_wrap, having your data in long format makes it much easier.
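
To give an idea of what that looks like, here is a minimal facet_wrap sketch. It uses a small made-up long data frame (visits_long and its numbers are stand-ins for illustration) and draws one panel per location.

```r
library(ggplot2)

# Hypothetical long data: one row per year and location
visits_long <- data.frame(
  year     = rep(2011:2013, each = 3),
  location = rep(c("beach", "park", "lake"), times = 3),
  N        = c(61, 46, 60, 38, 41, 31, 23, 64, 69)
)

# Because location is a single column, one formula splits the plot into panels
p <- ggplot(visits_long, aes(x = year, y = N)) +
  geom_line() +
  facet_wrap(~ location)

p
```

With the wide data there is no single location column to facet on, so you would have to pivot it to long format first.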
