tidyr::complete to show all possible combinations of variables

Posted on July 22, 2020 by kjytay in R bloggers

[This article was first published on R – Statistical Odds & Ends, and kindly contributed to R-bloggers].

This is an issue I often face, so I thought it best to write it down. When doing data analysis, we often want to know how many observations there are in each subgroup. These subgroups can be defined by multiple variables. In the code example below, I want to know how many vehicles there are for each (cyl, gear) combination:

library(tidyverse)
data(mtcars)
mtcars %>%
    group_by(cyl, gear) %>%
    summarize(count = n())

# # A tibble: 8 x 3
# # Groups:   cyl [3]
#     cyl  gear count
#   <dbl> <dbl> <int>
# 1     4     3     1
# 2     4     4     8
# 3     4     5     2
# 4     6     3     2
# 5     6     4     4
# 6     6     5     1
# 7     8     3    12
# 8     8     5     2

If you look carefully, you will notice that there are no vehicles with cyl == 8 and gear == 4. In general it’s probably better to include this combination as a row in the tibble, with count as 0. This is especially important in data pipelines where future processes might expect there to be length(unique(cyl)) * length(unique(gear)) rows in the dataset.
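As a quick check of the row-count concern, the sketch below compares the number of rows in the summary with the size of the full (cyl, gear) grid and lists the combinations that are absent. The names counts, expected_rows and missing_combos are illustrative and not part of the original post, and .groups = "drop" assumes dplyr 1.0.0 or later.

```r
library(dplyr)
library(tidyr)

# Summary table as in the example above; .groups = "drop" returns it ungrouped
counts <- mtcars %>%
    group_by(cyl, gear) %>%
    summarize(count = n(), .groups = "drop")

# Number of rows a downstream step might expect: every cyl crossed with every gear
expected_rows <- length(unique(mtcars$cyl)) * length(unique(mtcars$gear))

nrow(counts)     # 8 rows actually present
expected_rows    # 9 rows in the full grid

# Combinations in the full grid that have no row in the summary
missing_combos <- expand_grid(
    cyl = sort(unique(mtcars$cyl)),
    gear = sort(unique(mtcars$gear))
) %>%
    anti_join(counts, by = c("cyl", "gear"))

missing_combos   # the single (cyl = 8, gear = 4) combination
```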

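We can add the missing combination by ungrouping the dataset and applying tidyr::complete(). The ungroup() step matters because, on a grouped data frame, complete() operates within each group rather than across the whole table. Once ungrouped, every possible (cyl, gear) combination gets a row.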

mtcars %>%
    group_by(cyl, gear) %>%
    summarize(count = n()) %>%
    ungroup() %>%
    complete(cyl, gear)

# # A tibble: 9 x 3
#     cyl  gear count
#   <dbl> <dbl> <int>
# 1     4     3     1
# 2     4     4     8
# 3     4     5     2
# 4     6     3     2
# 5     6     4     4
# 6     6     5     1
# 7     8     3    12
# 8     8     4    NA
# 9     8     5     2

For rows that didn’t appear in the original summary table, complete() fills the remaining columns with NA. We can specify the value complete() should use for these cells with the fill argument:

mtcars %>%
    group_by(cyl, gear) %>%
    summarize(count = n()) %>%
    ungroup() %>%
    complete(cyl, gear, fill = list(count = 0))

# # A tibble: 9 x 3
#     cyl  gear count
#   <dbl> <dbl> <int>
# 1     4     3     1
# 2     4     4     8
# 3     4     5     2
# 4     6     3     2
# 5     6     4     4
# 6     6     5     1
# 7     8     3    12
# 8     8     4     0
# 9     8     5     2
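A slightly shorter variant, offered as a sketch and not as the post's method, replaces the group_by/summarize/ungroup steps with dplyr::count(), which returns an ungrouped table when its input is ungrouped. The name full_counts and the checks at the end are illustrative assumptions.

```r
library(dplyr)
library(tidyr)

full_counts <- mtcars %>%
    count(cyl, gear, name = "count") %>%          # one row per observed (cyl, gear) pair
    complete(cyl, gear, fill = list(count = 0))   # add unobserved pairs with count 0

# Sanity checks a pipeline could run before handing the table on
stopifnot(
    nrow(full_counts) == n_distinct(mtcars$cyl) * n_distinct(mtcars$gear),
    !anyNA(full_counts$count),
    sum(full_counts$count) == nrow(mtcars)
)
```

The checks confirm that the table covers the full grid, contains no NA counts, and that the counts still sum to the number of rows in mtcars.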

References:

Reddit. Need help with dplyr: Show all possible group combinations.
