# Calculating quantiles for groups with dplyr::summarize and purrr::partial

First published on **Rstats on goonR blog** and kindly contributed to R-bloggers.


Recently, I was trying to calculate the percentiles of a set of variables within a data set grouped by another variable. However, I quickly realized that this is not very straightforward when using `dplyr`’s `summarize`. Before I demonstrate, let’s load the libraries that we will need.

    library(dplyr)
    library(purrr)

If you don’t believe me when I say that it is not straightforward, go ahead and try to run the following block of code.

    mtcars %>%
      dplyr::group_by(cyl) %>%
      dplyr::summarize(quants = quantile(mpg, probs = c(0.2, 0.5, 0.8)))

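If you ran the code with the `dplyr` version available when this was written (releases from 1.0.0 onward handle multi-value summaries differently), you will see that it throws the following error: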

    Error in summarise_impl(.data, dots) :
      Column `quants` must be length 1 (a summary value), not 3

This error is telling us that the expression returns an object of length 3 (our three quantiles) when `summarize` expects a single value per group. A quick Google search turns up numerous Stack Overflow questions and answers about this. Most of these solutions revolve around using the `do` function to calculate the quantiles on each of the groups. However, according to Hadley, `do` will eventually be “going away”. While there is no definite time frame on this, I try to use it as little as possible. The new recommended practice is a combination of `tidyr::nest`, `dplyr::mutate` and `purrr::map` for most cases of grouping. I love this approach for most things (and it is even the accepted answer for one of the SO questions mentioned above), but I worked up a new solution that I think is useful for calculating percentiles on multiple groups for any desired number of percentiles.
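For comparison, here is a minimal sketch of what the `nest` / `mutate` / `map` approach might look like for this problem. It is not taken from any particular Stack Overflow answer; the names `probs` and `nested_quants` are made up for illustration, and it assumes `tidyr` is installed alongside `dplyr` and `purrr`.

```r
library(dplyr)
library(purrr)
library(tidyr)

# Percentiles to compute (illustrative name, not from the original post)
probs <- c(0.2, 0.5, 0.8)

nested_quants <- mtcars %>%
  group_by(cyl) %>%
  nest() %>%      # one row per cyl; the other columns go into a list-column `data`
  ungroup() %>%
  mutate(quants = map(data, ~ quantile(.x$mpg, probs = probs)))

# Each element of `quants` is a named numeric vector with one entry per percentile
nested_quants$quants[[1]]
```

The quantiles end up in a list-column, so some further unnesting or reshaping is still needed to get one column per percentile.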

This method uses `purrr::map` and a function operator, `purrr::partial`, to create a list of functions that can then be applied to a data set using `dplyr::summarize_at` and a little magic from `rlang`.
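To make the role of `partial` concrete, here is a small sketch on a single function. The name `q20` is only illustrative: `partial` pre-fills some arguments of `quantile` and returns a new function that still accepts the remaining ones.

```r
library(purrr)

# Pre-fill `probs` and `na.rm`; the data vector is supplied later
q20 <- partial(quantile, probs = 0.2, na.rm = TRUE)

# These two calls should give the same result
q20(mtcars$mpg)
quantile(mtcars$mpg, probs = 0.2, na.rm = TRUE)
```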

Let’s start by creating a vector of the desired percentiles to calculate. In this example, we will calculate the 20th, 50th, and 80th percentiles.

    p <- c(0.2, 0.5, 0.8)

Now we can create a list of functions, with one for each quantile, using `purrr::map` and `purrr::partial`. We can also assign names to each function (useful for the output of `summarize`) using `purrr::set_names`.

    p_names <- map_chr(p, ~paste0(.x*100, "%"))

    p_funs <- map(p, ~partial(quantile, probs = .x, na.rm = TRUE)) %>%
      set_names(nm = p_names)

    p_funs
    ## $`20%`
    ## function (...)
    ## quantile(probs = .x, na.rm = TRUE, ...)
    ## <environment: 0x7fcf50757430>
    ##
    ## $`50%`
    ## function (...)
    ## quantile(probs = .x, na.rm = TRUE, ...)
    ## <environment: 0x7fcf50762c30>
    ##
    ## $`80%`
    ## function (...)
    ## quantile(probs = .x, na.rm = TRUE, ...)
    ## <environment: 0x7fcf51148830>
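As a quick sanity check, the functions in this list can also be called directly, outside of `summarize`. The sketch below rebuilds the same list so that it runs on its own, then calls one element by name and applies all of them to a single vector.

```r
library(purrr)

p <- c(0.2, 0.5, 0.8)
p_names <- map_chr(p, ~ paste0(.x * 100, "%"))
p_funs <- map(p, ~ partial(quantile, probs = .x, na.rm = TRUE)) %>%
  set_names(nm = p_names)

# Call a single function from the list by name
p_funs[["50%"]](mtcars$mpg)

# Or apply every function in the list to the same vector
all_quants <- map_dbl(p_funs, ~ .x(mtcars$mpg))
all_quants
```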

Looking at `p_funs`, we can see that we have a named list in which each element is a function that wraps `quantile` with one of the probabilities pre-filled. The beauty of this is that you can use this list in the same way you would define multiple functions in any other `summarize_at` or `summarize_all` call (e.g. `funs(mean, sd)`). The only difference is that we now have to use the “bang-bang-bang” operator (`!!!`) from `rlang` (it is also exported from `dplyr`) to splice the list elements into `funs()`. The final product looks like this.

    mtcars %>%
      group_by(cyl) %>%
      summarize_at(vars(mpg), funs(!!!p_funs))
    ## # A tibble: 3 x 4
    ##     cyl `20%` `50%` `80%`
    ##   <dbl> <dbl> <dbl> <dbl>
    ## 1     4  22.8  26    30.4
    ## 2     6  18.3  19.7  21
    ## 3     8  13.9  15.2  16.8

I think that this provides a pretty neat way to get the desired output in a format that does not require a large amount of post-calculation manipulation. In addition, it is, in my opinion, more straightforward than a lot of the `do` methods. This method also allows for quantiles to be calculated for more than one variable, although post-processing would be necessary in that case. Here is an example.

    mtcars %>%
      group_by(cyl) %>%
      summarize_at(vars(mpg, hp), funs(!!!p_funs)) %>%
      select(cyl, contains("mpg"), contains("hp"))
    ## # A tibble: 3 x 7
    ##     cyl `mpg_20%` `mpg_50%` `mpg_80%` `hp_20%` `hp_50%` `hp_80%`
    ##   <dbl>     <dbl>     <dbl>     <dbl>    <dbl>    <dbl>    <dbl>
    ## 1     4      22.8      26        30.4       65      91        97
    ## 2     6      18.3      19.7      21        110     110       123
    ## 3     8      13.9      15.2      16.8      175     192.      245
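One caveat for readers on a newer setup: `funs()` has been deprecated since `dplyr` 0.8.0. A sketch of the same idea using `across()` might look like the following. It assumes `dplyr` 1.0.0 or later, where `across()` accepts a named list of functions, and the name `quants_wide` is only illustrative.

```r
library(dplyr)
library(purrr)

p <- c(0.2, 0.5, 0.8)
p_names <- map_chr(p, ~ paste0(.x * 100, "%"))
p_funs <- map(p, ~ partial(quantile, probs = .x, na.rm = TRUE)) %>%
  set_names(nm = p_names)

# Assumption: dplyr >= 1.0.0. The named list is passed to across() as-is,
# so no `!!!` splicing is needed here.
quants_wide <- mtcars %>%
  group_by(cyl) %>%
  summarize(across(c(mpg, hp), p_funs))

quants_wide
```

With the default naming scheme of `across()`, the output columns combine the variable name and the function name, and they come out grouped by variable, so the extra `select()` step should not be needed.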

`partial` is *yet another* tool from the `purrr` package that can greatly enhance your R coding abilities. While this is surely a basic application of its functionality, one can easily see how powerful this function can be.
