Recently, I was trying to calculate the percentiles of a set of variables within a data set, grouped by another variable. However, I quickly realized that this is not very straightforward when using summarize. Before I demonstrate, let’s load the libraries that we will need.
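Judging from the functions used in the rest of this post (the pipe, group_by, and summarize_at from dplyr; map, map_chr, partial, and set_names from purrr), loading the following two packages should be enough. The package list is an inference from the code below, not a verbatim copy of the original block:

```r
# Assumed package list, inferred from the functions used later in the post
library(dplyr)  # %>%, group_by(), summarize(), summarize_at()
library(purrr)  # map(), map_chr(), partial(), set_names()
```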
If you don’t believe me when I say that it is not straightforward, go ahead and try to run the following block of code.
mtcars %>%
  dplyr::group_by(cyl) %>%
  dplyr::summarize(quants = quantile(mpg, probs = c(0.2, 0.5, 0.8)))
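If you ran the code with the version of dplyr that was current when this was written (releases from 1.0.0 onward no longer fail here), you will see that it throws the following error: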
Error in summarise_impl(.data, dots) : Column `quants` must be length 1 (a summary value), not 3
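The length mismatch is easy to confirm outside of dplyr. A minimal base R check, using the same column and the same probabilities as the failing call:

```r
# quantile() returns one value per requested probability
q <- quantile(mtcars$mpg, probs = c(0.2, 0.5, 0.8))

length(q)  # one element per probability, so 3 here
names(q)   # quantile() labels each element with its percentage
```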
This error is telling us that the call returns an object of length 3 (our three quantiles) when summarize expects a single value per group. A quick Google search comes up with numerous Stack Overflow questions and answers about this. Most of these solutions revolve around using the do function to calculate the quantiles on each of the groups. However, according to Hadley, do will eventually be “going away”. While there is no definite time frame on this, I try to use it as little as possible. The new recommended practice for most cases of grouping is to combine purrr::map with nested data (for example, via tidyr::nest and dplyr::mutate). I love this approach for most things (and it is even the accepted answer for one of the SO questions mentioned above), but I worked up a new solution that I think is useful for calculating any desired number of percentiles across multiple groups.
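For comparison, here is a rough sketch of the map-per-group idea. It is a simplified stand-in that uses base split() rather than nested data frames, and the object names (mpg_by_cyl, quants, quant_mat) are made up for this illustration:

```r
library(purrr)

p <- c(0.2, 0.5, 0.8)

# Split mpg into one vector per cyl group, then map quantile() over the groups
mpg_by_cyl <- split(mtcars$mpg, mtcars$cyl)
quants <- map(mpg_by_cyl, quantile, probs = p)

# Bind the per-group results into a matrix:
# one row per group, one column per percentile
quant_mat <- do.call(rbind, quants)
quant_mat
```

This works, but the result is a matrix keyed by row names rather than a grouped tibble, which is part of why a summarize-based solution is attractive.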
Let’s start by creating a vector of the desired percentiles to calculate. In this example, we will calculate the 20th, 50th, and 80th percentiles.
p <- c(0.2, 0.5, 0.8)
Now we can create a list of functions, with one for each quantile, using purrr::partial. We can also assign names to each function with purrr::set_names (useful for the output of summarize, where the names become the column names).
p_names <- map_chr(p, ~paste0(.x*100, "%"))

p_funs <- map(p, ~partial(quantile, probs = .x, na.rm = TRUE)) %>%
  set_names(nm = p_names)

p_funs
## $`20%`
## function (...)
## quantile(probs = .x, na.rm = TRUE, ...)
## <environment: 0x7fcf50757430>
##
## $`50%`
## function (...)
## quantile(probs = .x, na.rm = TRUE, ...)
## <environment: 0x7fcf50762c30>
##
## $`80%`
## function (...)
## quantile(probs = .x, na.rm = TRUE, ...)
## <environment: 0x7fcf51148830>
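Each element of this list is an ordinary function, so it can also be called directly on a vector. A small check, rebuilding the list so the snippet runs on its own (median_fun is just an illustrative name):

```r
library(purrr)

p <- c(0.2, 0.5, 0.8)
p_names <- map_chr(p, ~ paste0(.x * 100, "%"))
p_funs <- set_names(
  map(p, ~ partial(quantile, probs = .x, na.rm = TRUE)),
  nm = p_names
)

# Pull one function out of the list and call it like any other function
median_fun <- p_funs[["50%"]]
median_fun(mtcars$mpg)

# This should match calling quantile() with the argument filled in by hand
quantile(mtcars$mpg, probs = 0.5, na.rm = TRUE)
```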
Looking at p_funs, we can see that we have a named list in which each element is a function built from the quantile function. The beauty of this is that you can use this list in the same way you would define multiple functions in summarize_all and the other scoped summarize functions (e.g. funs(mean, sd)). The only difference is that we now have to use the “bang-bang-bang” operator (!!!) from rlang (it is also exported from dplyr). The final product looks like this.
mtcars %>%
  group_by(cyl) %>%
  summarize_at(vars(mpg), funs(!!!p_funs))
## # A tibble: 3 x 4
##     cyl `20%` `50%` `80%`
##   <dbl> <dbl> <dbl> <dbl>
## 1     4  22.8  26    30.4
## 2     6  18.3  19.7  21
## 3     8  13.9  15.2  16.8
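On newer versions of dplyr, funs() is deprecated and the scoped verbs have been superseded by across(). The sketch below assumes dplyr 1.0.0 or later and is offered as a possible equivalent rather than as part of the original solution. Note that across() names the columns by combining the column name and the function name, so the output columns are prefixed with mpg_ even for a single variable:

```r
library(dplyr)
library(purrr)

p <- c(0.2, 0.5, 0.8)
p_names <- map_chr(p, ~ paste0(.x * 100, "%"))
p_funs <- map(p, ~ partial(quantile, probs = .x, na.rm = TRUE)) %>%
  set_names(nm = p_names)

# Assumes dplyr >= 1.0.0: across() accepts the named list of functions directly,
# so no funs() or !!! splicing is needed
res <- mtcars %>%
  group_by(cyl) %>%
  summarize(across(mpg, p_funs))

res
```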
I think that this provides a pretty neat way to get the desired output in a format that does not require a large amount of post-calculation manipulation. In addition, it is, in my opinion, more straightforward than a lot of the do methods. This method also allows quantiles to be calculated for more than one variable, although some post-processing is necessary in that case to order the columns. Here is an example.
mtcars %>%
  group_by(cyl) %>%
  summarize_at(vars(mpg, hp), funs(!!!p_funs)) %>%
  select(cyl, contains("mpg"), contains("hp"))
## # A tibble: 3 x 7
##     cyl `mpg_20%` `mpg_50%` `mpg_80%` `hp_20%` `hp_50%` `hp_80%`
##   <dbl>     <dbl>     <dbl>     <dbl>    <dbl>    <dbl>    <dbl>
## 1     4      22.8      26        30.4       65      91        97
## 2     6      18.3      19.7      21        110     110       123
## 3     8      13.9      15.2      16.8      175     192.      245
partial is yet another tool from the purrr package that can greatly enhance your R coding abilities. While this is surely a basic application of its functionality, one can easily see how powerful this function can be.
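For readers curious about what partial is doing here, the same idea can be sketched in base R with a function factory. This is only an approximation of the behavior, not how purrr implements it, and make_quantile_fun and q_funs are hypothetical names introduced for this example:

```r
# Hypothetical helper: returns a function with probs and na.rm already filled in
make_quantile_fun <- function(prob) {
  force(prob)  # evaluate prob now so each function keeps its own value
  function(x, ...) quantile(x, probs = prob, na.rm = TRUE, ...)
}

p <- c(0.2, 0.5, 0.8)

# Build one function per percentile and name them like the purrr version
q_funs <- lapply(p, make_quantile_fun)
names(q_funs) <- paste0(p * 100, "%")

q_funs[["20%"]](mtcars$mpg)
```

Writing it out by hand shows how much boilerplate partial saves, since it handles the argument capture for any function without a dedicated wrapper.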