Programming with dplyr by using dplyr
The title may seem tautological, but since the arrival of dplyr 0.7.x, there have been some efforts at using dplyr without actually using it that I can’t quite understand. The tidyverse has raised passions, for and against it, for some time already. There are excellent alternatives out there, and I use them myself when I find them suitable. But when I choose to use dplyr, I find it most versatile, and I see no advantage in adding yet another layer that complicates things and makes problems even harder to debug.
Take the example of seplyr. It stands for standard evaluation dplyr, and enables us to program over dplyr without having “to bring in (or study) any deep-theory or heavy-weight tools such as rlang/tidyeval”. Let’s consider the following interactive pipeline:
library(dplyr)
starwars %>%
  group_by(homeworld) %>%
  summarise(mean_height = mean(height, na.rm = TRUE),
            mean_mass = mean(mass, na.rm = TRUE),
            count = n())
## # A tibble: 49 x 4
## homeworld mean_height mean_mass count
## <chr> <dbl> <dbl> <int>
## 1 Alderaan 176.3333 64.0 3
## 2 Aleen Minor 79.0000 15.0 1
## 3 Bespin 175.0000 79.0 1
## 4 Bestine IV 180.0000 110.0 1
## 5 Cato Neimoidia 191.0000 90.0 1
## 6 Cerea 198.0000 82.0 1
## 7 Champala 196.0000 NaN 1
## 8 Chandrila 150.0000 NaN 1
## 9 Concord Dawn 183.0000 79.0 1
## 10 Corellia 175.0000 78.5 2
## # ... with 39 more rows
Let’s say we want to parametrise the grouping variable and wrap the code above into a re-usable function. Apparently, this is difficult with dplyr. But is it? Not at all: we just need to add one line and a bang-bang (!!):
starwars_mean <- function(var) {
  var <- enquo(var)
  starwars %>%
    group_by(!!var) %>%
    summarise(mean_height = mean(height, na.rm = TRUE),
              mean_mass = mean(mass, na.rm = TRUE),
              count = n())
}
starwars_mean(homeworld)
## # A tibble: 49 x 4
## homeworld mean_height mean_mass count
## <chr> <dbl> <dbl> <int>
## 1 Alderaan 176.3333 64.0 3
## 2 Aleen Minor 79.0000 15.0 1
## 3 Bespin 175.0000 79.0 1
## 4 Bestine IV 180.0000 110.0 1
## 5 Cato Neimoidia 191.0000 90.0 1
## 6 Cerea 198.0000 82.0 1
## 7 Champala 196.0000 NaN 1
## 8 Chandrila 150.0000 NaN 1
## 9 Concord Dawn 183.0000 79.0 1
## 10 Corellia 175.0000 78.5 2
## # ... with 39 more rows
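To make the mechanics of the function above a bit more tangible, here is a minimal sketch that only captures its argument and reports what was captured. The helper name show_capture() is made up for illustration, and rlang::as_label() assumes a reasonably recent rlang; neither appears in the original pipeline.

```r
library(dplyr)

# Hypothetical helper: returns the captured expression as text
show_capture <- function(var) {
  var <- enquo(var)       # capture the caller's expression without evaluating it
  rlang::as_label(var)    # convert the captured expression to a string
}

# No object called `homeworld` exists here; that is fine,
# because enquo() never evaluates the argument
captured <- show_capture(homeworld)
captured
```

The point of the sketch is that the bare name survives the function call intact, which is what allows !! to hand it over to group_by() later on.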
The enquo() function quotes the expression we pass to our function (homeworld) instead of evaluating it, and the bang-bang unquotes it inside group_by(), so that the verb groups by homeworld rather than by a column literally called var. That’s it. What about seplyr? With seplyr, we just have to (and I quote):
- Change dplyr verbs to their matching seplyr “*_se()” adapters.
- Add quote marks around names and expressions.
- Convert sequences of expressions (such as in the summarize()) to explicit vectors by adding the “c()” notation.
- Replace “=” in expressions with “:=”.
This is the result:
library(seplyr)
starwars_mean <- function(my_var) {
  starwars %>%
    group_by_se(my_var) %>%
    summarize_se(c("mean_height" := "mean(height, na.rm = TRUE)",
                   "mean_mass" := "mean(mass, na.rm = TRUE)",
                   "count" := "n()"))
}
starwars_mean("homeworld")
## # A tibble: 49 x 4
## homeworld mean_height mean_mass count
## <chr> <dbl> <dbl> <int>
## 1 Alderaan 176.3333 64.0 3
## 2 Aleen Minor 79.0000 15.0 1
## 3 Bespin 175.0000 79.0 1
## 4 Bestine IV 180.0000 110.0 1
## 5 Cato Neimoidia 191.0000 90.0 1
## 6 Cerea 198.0000 82.0 1
## 7 Champala 196.0000 NaN 1
## 8 Chandrila 150.0000 NaN 1
## 9 Concord Dawn 183.0000 79.0 1
## 10 Corellia 175.0000 78.5 2
## # ... with 39 more rows
Basically, we had to change the entire pipeline. If re-usability was the goal, I think we lost some of it here. But, wait, we are still using non-standard evaluation in the first example. What if we really need to provide the grouping variable as a string? Easy enough, we just need to replace enquo() with as.name() to convert the string to a name:
starwars_mean <- function(var) {
  var <- as.name(var)
  starwars %>%
    group_by(!!var) %>%
    summarise(mean_height = mean(height, na.rm = TRUE),
              mean_mass = mean(mass, na.rm = TRUE),
              count = n())
}
starwars_mean("homeworld")
## # A tibble: 49 x 4
## homeworld mean_height mean_mass count
## <chr> <dbl> <dbl> <int>
## 1 Alderaan 176.3333 64.0 3
## 2 Aleen Minor 79.0000 15.0 1
## 3 Bespin 175.0000 79.0 1
## 4 Bestine IV 180.0000 110.0 1
## 5 Cato Neimoidia 191.0000 90.0 1
## 6 Cerea 198.0000 82.0 1
## 7 Champala 196.0000 NaN 1
## 8 Chandrila 150.0000 NaN 1
## 9 Concord Dawn 183.0000 79.0 1
## 10 Corellia 175.0000 78.5 2
## # ... with 39 more rows
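The as.name() approach above handles a single string. As a sketch of how the same idea might extend to several grouping columns, rlang::syms() converts a character vector into a list of names and !!! splices them into group_by() as separate arguments. The function name starwars_mean_by() and the choice of species as a second column are assumptions made for illustration.

```r
library(dplyr)

# Hypothetical variant: `vars` is a character vector of column names
starwars_mean_by <- function(vars) {
  vars <- rlang::syms(vars)    # one name (symbol) per string
  starwars %>%
    group_by(!!!vars) %>%      # splice the names in as separate arguments
    summarise(mean_height = mean(height, na.rm = TRUE),
              mean_mass = mean(mass, na.rm = TRUE),
              count = n())
}

by_world_species <- starwars_mean_by(c("homeworld", "species"))
names(by_world_species)
```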
But we can do even better if we remember that dplyr provides scoped variants (see ?dplyr::scoped) for most of the verbs. In this case, group_by_at() comes in handy:
starwars_mean <- function(var) {
  starwars %>%
    group_by_at(var) %>%
    summarise(mean_height = mean(height, na.rm = TRUE),
              mean_mass = mean(mass, na.rm = TRUE),
              count = n())
}
starwars_mean("homeworld")
## # A tibble: 49 x 4
## homeworld mean_height mean_mass count
## <chr> <dbl> <dbl> <int>
## 1 Alderaan 176.3333 64.0 3
## 2 Aleen Minor 79.0000 15.0 1
## 3 Bespin 175.0000 79.0 1
## 4 Bestine IV 180.0000 110.0 1
## 5 Cato Neimoidia 191.0000 90.0 1
## 6 Cerea 198.0000 82.0 1
## 7 Champala 196.0000 NaN 1
## 8 Chandrila 150.0000 NaN 1
## 9 Concord Dawn 183.0000 79.0 1
## 10 Corellia 175.0000 78.5 2
## # ... with 39 more rows
That’s it: no bang-bang, just strings and only one change to the original code. Let’s dwell on the potential of the scoped variants with a final example. We can make a completely generic re-usable “grouped mean” function using seplyr and R’s paste0() function to build up expressions:
grouped_mean <- function(data, grouping_variables, value_variables) {
  result_names <- paste0("mean_", value_variables)
  expressions <- paste0("mean(", value_variables, ", na.rm = TRUE)")
  data %>%
    group_by_se(grouping_variables) %>%
    summarize_se(c(result_names := expressions,
                   "count" := "n()"))
}
starwars %>%
  grouped_mean("eye_color", c("mass", "birth_year"))
## # A tibble: 15 x 4
## eye_color mean_mass mean_birth_year count
## <chr> <dbl> <dbl> <int>
## 1 black 76.28571 33.00000 10
## 2 blue 86.51667 67.06923 19
## 3 blue-gray 77.00000 57.00000 1
## 4 brown 66.09231 108.96429 21
## 5 dark NaN NaN 1
## 6 gold NaN NaN 1
## 7 green, yellow 159.00000 NaN 1
## 8 hazel 66.00000 34.50000 3
## 9 orange 282.33333 231.00000 8
## 10 pink NaN NaN 1
## 11 red 81.40000 33.66667 5
## 12 red, blue NaN NaN 1
## 13 unknown 31.50000 NaN 3
## 14 white 48.00000 NaN 1
## 15 yellow 81.11111 76.38000 11
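To see what the paste0() calls in the function above actually produce, the following base R sketch prints the generated names and expression strings, and then parses and evaluates one of them against a small data frame. The toy data is made up, and the eval(parse()) step is only an illustration of the kind of work a string-based interface has to do somewhere; it is not meant to describe seplyr's actual implementation.

```r
value_variables <- c("mass", "birth_year")

# The same two paste0() calls as in grouped_mean()
result_names <- paste0("mean_", value_variables)
expressions  <- paste0("mean(", value_variables, ", na.rm = TRUE)")

result_names   # "mean_mass" "mean_birth_year"
expressions    # "mean(mass, na.rm = TRUE)" "mean(birth_year, na.rm = TRUE)"

# Toy data, invented for this illustration
toy <- data.frame(mass = c(10, NA, 30), birth_year = c(1, 2, NA))

# Parse the first string into an expression and evaluate it
# with the data frame's columns in scope
toy_mean_mass <- eval(parse(text = expressions[1])[[1]], toy)
toy_mean_mass  # 20
```

Because the expressions are plain text until they are parsed, a typo in a column name only surfaces when the string is finally evaluated.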
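And the same with dplyr’s scoped verbs (note that I’ve added the last rename_at() on a whim, just to get the same column names as before, but it is not really necessary; the only remaining difference is that count, having gone through mean(), comes out as a double rather than an integer):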
grouped_mean <- function(data, grouping_variables, value_variables) {
  data %>%
    group_by_at(grouping_variables) %>%
    mutate(count = n()) %>%
    summarise_at(c(value_variables, "count"), mean, na.rm = TRUE) %>%
    # a plain function is enough here; funs() is deprecated in later dplyr releases
    rename_at(value_variables, function(x) paste0("mean_", x))
}
starwars %>%
  grouped_mean("eye_color", c("mass", "birth_year"))
## # A tibble: 15 x 4
## eye_color mean_mass mean_birth_year count
## <chr> <dbl> <dbl> <dbl>
## 1 black 76.28571 33.00000 10
## 2 blue 86.51667 67.06923 19
## 3 blue-gray 77.00000 57.00000 1
## 4 brown 66.09231 108.96429 21
## 5 dark NaN NaN 1
## 6 gold NaN NaN 1
## 7 green, yellow 159.00000 NaN 1
## 8 hazel 66.00000 34.50000 3
## 9 orange 282.33333 231.00000 8
## 10 pink NaN NaN 1
## 11 red 81.40000 33.66667 5
## 12 red, blue NaN NaN 1
## 13 unknown 31.50000 NaN 3
## 14 white 48.00000 NaN 1
## 15 yellow 81.11111 76.38000 11
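For readers on a newer dplyr, the scoped verbs have since been superseded by across(). The following is a sketch of the same function in that style, assuming dplyr 1.0 or later; the name grouped_mean_across() is made up here so that it does not clash with the versions above.

```r
library(dplyr)

# Hypothetical across()-based variant; assumes dplyr >= 1.0
grouped_mean_across <- function(data, grouping_variables, value_variables) {
  data %>%
    # all_of() selects columns named by a character vector
    group_by(across(all_of(grouping_variables))) %>%
    summarise(across(all_of(value_variables),
                     ~ mean(.x, na.rm = TRUE),
                     .names = "mean_{.col}"),   # prefix the result columns
              count = n(),
              .groups = "drop")
}

by_eye <- starwars %>%
  grouped_mean_across("eye_color", c("mass", "birth_year"))
names(by_eye)
```

In this form count is computed directly by n(), so it stays an integer, and the .names argument takes care of the renaming that rename_at() did before.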
Wrapping up, the tidyeval paradigm may seem difficult at first glance, but don’t miss the wood for the trees: the new version of dplyr is full of tools that will make your life easier, not harder.