Better Grouped Summaries in dplyr

[This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

For R dplyr users one of the promises of the new rlang/tidyeval system is an improved ability to program over dplyr itself. In particular to add new verbs that encapsulate previously compound steps into better self-documenting atomic steps.

Let’s take a look at this capability.

First let’s start dplyr.

suppressPackageStartupMessages(library("dplyr"))
packageVersion("dplyr")

## [1] '0.7.1.9000'

A dplyr pattern that I have seen used often is the “group_by() %>% mutate()” pattern. This historically has been shorthand for a “group_by() %>% summarize()” followed by a join(). It is easiest to show by example.

The following code:

mtcars %>% 
  group_by(cyl, gear) %>%
  mutate(group_mean_mpg = mean(mpg), 
            group_mean_disp = mean(disp)) %>% 
  select(cyl, gear, mpg, disp, group_mean_mpg, group_mean_disp) %>%
  head()

## # A tibble: 6 x 6
## # Groups:   cyl, gear [4]
##     cyl  gear   mpg  disp group_mean_mpg group_mean_disp
##   <dbl> <dbl> <dbl> <dbl>          <dbl>           <dbl>
## 1     6     4  21.0   160         19.750        163.8000
## 2     6     4  21.0   160         19.750        163.8000
## 3     4     4  22.8   108         26.925        102.6250
## 4     6     3  21.4   258         19.750        241.5000
## 5     8     3  18.7   360         15.050        357.6167
## 6     6     3  18.1   225         19.750        241.5000

is taken to be shorthand for:

mtcars %>% 
  group_by(cyl, gear) %>%
  summarize(group_mean_mpg = mean(mpg), 
            group_mean_disp = mean(disp)) %>% 
  left_join(mtcars, ., by = c('cyl', 'gear')) %>%
  select(cyl, gear, mpg, disp, group_mean_mpg, group_mean_disp) %>%
  head()

##   cyl gear  mpg disp group_mean_mpg group_mean_disp
## 1   6    4 21.0  160         19.750        163.8000
## 2   6    4 21.0  160         19.750        163.8000
## 3   4    4 22.8  108         26.925        102.6250
## 4   6    3 21.4  258         19.750        241.5000
## 5   8    3 18.7  360         15.050        357.6167
## 6   6    3 18.1  225         19.750        241.5000

The advantages of the shorthand are:

  • The analyst only has to specify the grouping column once.
  • The data (mtcars) enters the pipeline only once.
  • The analyst doesn’t have to start thinking about joins immediately.

Frankly I’ve never liked the shorthand. I feel it is a “magic extra” that a new user would have no way of anticipating from common use of group_by() and summarize(). I very much like the idea of wrapping this important common use case into a single verb. Adjoining “windowed” or group-calculated columns is a common and important step in analysis, and well worth having its own verb.

Below is our attempt at elevating this pattern into a packaged verb.

#' Simulate the group_by/mutate pattern with an explicit summarize and join.
#' 
#' Group a data frame by the groupingVars and compute user summaries on 
#' this data frame (user summaries specified in ...), then join these new
#' columns back into the original data and return to the user.
#' This works around https://github.com/tidyverse/dplyr/issues/2960 .
#' And it is a demonstration of a higher-order dplyr verb.
#' Author: John Mount, Win-Vector LLC.
#' 
#' @param d data.frame
#' @param groupingVars character vector of column names to group by.
#' @param ... list of dplyr::mutate() expressions.
#' @value d with grouped summaries added as extra columns
#' 
#' @examples
#' 
#' add_group_summaries(mtcars, 
#'                     c("cyl", "gear"), 
#'                     group_mean_mpg = mean(mpg), 
#'                     group_mean_disp = mean(disp)) %>%
#'   head()
#' 
#' @export
#' 
add_group_summaries <- function(d, groupingVars, ...) {
  # convert char vector into quosure vector
  # These interfaces are still changing, so take care.
  groupingQuos <- lapply(groupingVars, 
                         function(si) { quo(!!as.name(si)) })
  dg <- group_by(d, !!!groupingQuos)
  ds <- summarize(dg, ...)
  ds <- ungroup(ds)
  left_join(d, ds, by= groupingVars)
}

This works as follows:

mtcars %>% 
  add_group_summaries(c("cyl", "gear"), 
                      group_mean_mpg = mean(mpg), 
                      group_mean_disp = mean(disp)) %>%
  select(cyl, gear, mpg, disp, group_mean_mpg, group_mean_disp) %>%
  head()

##   cyl gear  mpg disp group_mean_mpg group_mean_disp
## 1   6    4 21.0  160         19.750        163.8000
## 2   6    4 21.0  160         19.750        163.8000
## 3   4    4 22.8  108         26.925        102.6250
## 4   6    3 21.4  258         19.750        241.5000
## 5   8    3 18.7  360         15.050        357.6167
## 6   6    3 18.1  225         19.750        241.5000

And this also works on database-backed dplyr data (which the shorthand currently does not, please see dplyr 2887 issue and dplyr issue 2960).

con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
copy_to(con, mtcars)
mtcars2 <- tbl(con, "mtcars")

mtcars2 %>% 
  group_by(cyl, gear) %>%
  mutate(group_mean_mpg = mean(mpg), 
         group_mean_disp = mean(disp))

## Error: Window function `avg()` is not supported by this database

mtcars2 %>% 
  add_group_summaries(c("cyl", "gear"), 
                      group_mean_mpg = mean(mpg), 
                      group_mean_disp = mean(disp)) %>%
  select(cyl, gear, mpg, disp, group_mean_mpg, group_mean_disp) %>%
  head()

## # Source:   lazy query [?? x 6]
## # Database: sqlite 3.19.3 [:memory:]
##     cyl  gear   mpg  disp group_mean_mpg group_mean_disp
##   <dbl> <dbl> <dbl> <dbl>          <dbl>           <dbl>
## 1     6     4  21.0   160         19.750        163.8000
## 2     6     4  21.0   160         19.750        163.8000
## 3     4     4  22.8   108         26.925        102.6250
## 4     6     3  21.4   258         19.750        241.5000
## 5     8     3  18.7   360         15.050        357.6167
## 6     6     3  18.1   225         19.750        241.5000

To leave a comment for the author, please follow the link and comment on their blog: R – Win-Vector Blog.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)