Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
For R dplyr users one of the promises of the new rlang/tidyeval system is an improved ability to program over dplyr itself. In particular to add new verbs that encapsulate previously compound steps into better self-documenting atomic steps.
Let’s take a look at this capability.
First let’s start dplyr.
suppressPackageStartupMessages(library("dplyr"))
packageVersion("dplyr")
## [1] '0.7.1.9000'
A dplyr pattern that I have seen used often is the "group_by() %>% mutate()" pattern. This historically has been shorthand for a "group_by() %>% summarize()" followed by a join(). It is easiest to show by example.
The following code:
mtcars %>%
group_by(cyl, gear) %>%
mutate(group_mean_mpg = mean(mpg),
group_mean_disp = mean(disp)) %>%
select(cyl, gear, mpg, disp, group_mean_mpg, group_mean_disp) %>%
head()
## # A tibble: 6 x 6 ## # Groups: cyl, gear [4] ## cyl gear mpg disp group_mean_mpg group_mean_disp ## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 6 4 21.0 160 19.750 163.8000 ## 2 6 4 21.0 160 19.750 163.8000 ## 3 4 4 22.8 108 26.925 102.6250 ## 4 6 3 21.4 258 19.750 241.5000 ## 5 8 3 18.7 360 15.050 357.6167 ## 6 6 3 18.1 225 19.750 241.5000
is taken to be shorthand for:
mtcars %>%
group_by(cyl, gear) %>%
summarize(group_mean_mpg = mean(mpg),
group_mean_disp = mean(disp)) %>%
left_join(mtcars, ., by = c('cyl', 'gear')) %>%
select(cyl, gear, mpg, disp, group_mean_mpg, group_mean_disp) %>%
head()
## cyl gear mpg disp group_mean_mpg group_mean_disp ## 1 6 4 21.0 160 19.750 163.8000 ## 2 6 4 21.0 160 19.750 163.8000 ## 3 4 4 22.8 108 26.925 102.6250 ## 4 6 3 21.4 258 19.750 241.5000 ## 5 8 3 18.7 360 15.050 357.6167 ## 6 6 3 18.1 225 19.750 241.5000
The advantages of the shorthand are:
- The analyst only has to specify the grouping column once.
- The data (
mtcars) enters the pipeline only once. - The analyst doesn’t have to start thinking about joins immediately.
Frankly I’ve never liked the shorthand. I feel it is a "magic extra" that a new user would have no way of anticipating from common use of group_by() and summarize(). I very much like the idea of wrapping this important common use case into a single verb. Adjoining "windowed" or group-calculated columns is a common and important step in analysis, and well worth having its own verb.
Below is our attempt at elevating this pattern into a packaged verb.
#' Simulate the group_by/mutate pattern with an explicit summarize and join.
#'
#' Group a data frame by the groupingVars and compute user summaries on
#' this data frame (user summaries specified in ...), then join these new
#' columns back into the original data and return to the user.
#' This works around https://github.com/tidyverse/dplyr/issues/2960 .
#' And it is a demonstration of a higher-order dplyr verb.
#' Author: John Mount, Win-Vector LLC.
#'
#' @param d data.frame
#' @param groupingVars character vector of column names to group by.
#' @param ... list of dplyr::mutate() expressions.
#' @value d with grouped summaries added as extra columns
#'
#' @examples
#'
#' add_group_summaries(mtcars,
#' c("cyl", "gear"),
#' group_mean_mpg = mean(mpg),
#' group_mean_disp = mean(disp)) %>%
#' head()
#'
#' @export
#'
add_group_summaries <- function(d, groupingVars, ...) {
# convert char vector into quosure vector
# These interfaces are still changing, so take care.
groupingQuos <- lapply(groupingVars,
function(si) { quo(!!as.name(si)) })
dg <- group_by(d, !!!groupingQuos)
ds <- summarize(dg, ...)
ds <- ungroup(ds)
left_join(d, ds, by= groupingVars)
}
This works as follows:
mtcars %>%
add_group_summaries(c("cyl", "gear"),
group_mean_mpg = mean(mpg),
group_mean_disp = mean(disp)) %>%
select(cyl, gear, mpg, disp, group_mean_mpg, group_mean_disp) %>%
head()
## cyl gear mpg disp group_mean_mpg group_mean_disp ## 1 6 4 21.0 160 19.750 163.8000 ## 2 6 4 21.0 160 19.750 163.8000 ## 3 4 4 22.8 108 26.925 102.6250 ## 4 6 3 21.4 258 19.750 241.5000 ## 5 8 3 18.7 360 15.050 357.6167 ## 6 6 3 18.1 225 19.750 241.5000
And this also works on database-backed dplyr data (which the shorthand currently does not, please see dplyr 2887 issue and dplyr issue 2960).
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
copy_to(con, mtcars)
mtcars2 <- tbl(con, "mtcars")
mtcars2 %>%
group_by(cyl, gear) %>%
mutate(group_mean_mpg = mean(mpg),
group_mean_disp = mean(disp))
## Error: Window function `avg()` is not supported by this database
mtcars2 %>%
add_group_summaries(c("cyl", "gear"),
group_mean_mpg = mean(mpg),
group_mean_disp = mean(disp)) %>%
select(cyl, gear, mpg, disp, group_mean_mpg, group_mean_disp) %>%
head()
## # Source: lazy query [?? x 6] ## # Database: sqlite 3.19.3 [:memory:] ## cyl gear mpg disp group_mean_mpg group_mean_disp ## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 6 4 21.0 160 19.750 163.8000 ## 2 6 4 21.0 160 19.750 163.8000 ## 3 4 4 22.8 108 26.925 102.6250 ## 4 6 3 21.4 258 19.750 241.5000 ## 5 8 3 18.7 360 15.050 357.6167 ## 6 6 3 18.1 225 19.750 241.5000
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
