# Calculating quantiles for groups with dplyr::summarize and purrr::partial

September 30, 2018

(This article was first published on Rstats on goonR blog, and kindly contributed to R-bloggers)

Recently, I was trying to calculate the percentiles of a set of variables within a data set grouped by another variable. However, I quickly realized that this is not very straightforward when using `dplyr`’s `summarize`. Before I demonstrate, let’s load the libraries that we will need.

```r
library(dplyr)
library(purrr)
```

If you don’t believe me when I say that it is not straightforward, go ahead and try to run the following block of code.

```r
mtcars %>%
  dplyr::group_by(cyl) %>%
  dplyr::summarize(quants = quantile(mpg, probs = c(0.2, 0.5, 0.8)))
```

If you ran the code, you will see that it throws the following error:

```
Error in summarise_impl(.data, dots) :
  Column `quants` must be length 1 (a summary value), not 3
```

This error is telling us that the expression returned an object of length 3 (our three quantiles) when `summarize` expects a single value per group. A quick Google search turns up numerous Stack Overflow questions and answers about this. Most of these solutions revolve around using the `do` function to calculate the quantiles on each of the groups. However, according to Hadley, `do` will eventually be “going away”. While there is no definite time frame on this, I try to use it as little as possible. The new recommended practice is a combination of `tidyr::nest`, `dplyr::mutate` and `purrr::map` for most cases of grouping. I love this approach for most things (and it is even the accepted answer for one of the SO questions mentioned above), but I worked up a new solution that I think is useful for calculating any desired number of percentiles on multiple groups.
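For comparison, the nest-and-map pattern mentioned above might look roughly like the following for the same three quantiles. This is only a sketch, not the code from any particular Stack Overflow answer: the use of `tibble::enframe`, the object name `nested_quants`, and the column names `quantile` and `mpg` are my own choices for illustration, and it assumes `tidyr` and `tibble` are installed.

```r
library(dplyr)
library(tidyr)
library(purrr)

nested_quants <- mtcars %>%
  group_by(cyl) %>%
  nest() %>%                       # one row per cyl, with a list-column `data`
  ungroup() %>%
  mutate(quants = map(data, ~ tibble::enframe(
    quantile(.x$mpg, probs = c(0.2, 0.5, 0.8)),
    name = "quantile", value = "mpg"   # illustrative column names
  ))) %>%
  select(-data) %>%
  unnest(quants)                   # long format: one row per cyl and quantile

nested_quants
```

The result is in long format (one row per group and quantile), so it would still need reshaping to get one column per quantile.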

This method uses `purrr::map` and a Function Operator, `purrr::partial`, to create a list of functions that can then be applied to a data set using `dplyr::summarize_at` and a little magic from `rlang`.

Let’s start by creating a vector of the desired percentiles to calculate. In this example, we will calculate the 20th, 50th, and 80th percentiles.

```r
p <- c(0.2, 0.5, 0.8)
```

Now we can create a list of functions, with one for each quantile, using `purrr::map` and `purrr::partial`. We can also assign names to each function (useful for the output of `summarize`) using `purrr::set_names`.

```r
p_names <- map_chr(p, ~paste0(.x*100, "%"))

p_funs <- map(p, ~partial(quantile, probs = .x, na.rm = TRUE)) %>%
  set_names(nm = p_names)

p_funs
```
```
## $`20%`
## function (...)
## quantile(probs = .x, na.rm = TRUE, ...)
##
##
## $`50%`
## function (...)
## quantile(probs = .x, na.rm = TRUE, ...)
##
##
## $`80%`
## function (...)
## quantile(probs = .x, na.rm = TRUE, ...)
##
```
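To see what a single partially applied function does on its own, here is a minimal sketch. The name `q80` and the vector `x` are made up for illustration; the point is only that `partial` pre-fills `probs` and `na.rm`, so the resulting function needs nothing but the data.

```r
library(purrr)

# partial() pre-fills some arguments and returns a new function
q80 <- partial(quantile, probs = 0.8, na.rm = TRUE)

# append an NA to show that na.rm = TRUE really is applied
x <- c(mtcars$mpg, NA)

q80(x)

# the same call written out in full, for comparison
quantile(x, probs = 0.8, na.rm = TRUE)
```

Both calls should return the same named value, which is why a list of such functions can stand in for a list of ordinary summary functions.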

Looking at `p_funs`, we can see that we have a named list in which each element is a partially applied version of the `quantile` function. The beauty of this is that you can use this list in the same way you would supply multiple functions to any other `summarize_at` or `summarize_all` call (e.g. `funs(mean, sd)`). The only difference is that we now have to use the “bang-bang-bang” operator (`!!!`) from `rlang` (it is also exported from `dplyr`) to splice the list into `funs`. The final product looks like this.

```r
mtcars %>%
  group_by(cyl) %>%
  summarize_at(vars(mpg), funs(!!!p_funs))
```
```
## # A tibble: 3 x 4
##     cyl `20%` `50%` `80%`
##   <dbl> <dbl> <dbl> <dbl>
## 1     4  22.8  26    30.4
## 2     6  18.3  19.7  21
## 3     8  13.9  15.2  16.8
```
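A note for readers on newer package versions: `funs()` was deprecated in `dplyr` 0.8.0, and `summarize_at` was superseded by `across()` in `dplyr` 1.0.0. The following is a sketch of how the same idea might be written there, assuming `dplyr` >= 1.0.0. It is not part of the original approach; `quants_across` is an illustrative name, and `.names = "{.fn}"` is used only so that the columns are named after the quantiles.

```r
library(dplyr)
library(purrr)

p <- c(0.2, 0.5, 0.8)
p_names <- map_chr(p, ~ paste0(.x * 100, "%"))
p_funs <- map(p, ~ partial(quantile, probs = .x, na.rm = TRUE)) %>%
  set_names(nm = p_names)

# across() accepts a named list of functions directly, so no !!! is needed
quants_across <- mtcars %>%
  group_by(cyl) %>%
  summarize(across(mpg, p_funs, .names = "{.fn}"))

quants_across
```

Without the `.names` argument, `across()` would prefix each column with the variable name (for example `mpg_20%`).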

I think that this provides a pretty neat way to get the desired output in a format that does not require a large amount of post-calculation manipulation. In addition, it is, in my opinion, more straightforward than a lot of the `do` methods. This method also allows for quantiles to be calculated for more than one variable, although post-processing would be necessary in that case. Here is an example.

```r
mtcars %>%
  group_by(cyl) %>%
  summarize_at(vars(mpg, hp), funs(!!!p_funs)) %>%
  select(cyl, contains("mpg"), contains("hp"))
```
```
## # A tibble: 3 x 7
##     cyl `mpg_20%` `mpg_50%` `mpg_80%` `hp_20%` `hp_50%` `hp_80%`
##   <dbl>     <dbl>     <dbl>     <dbl>    <dbl>    <dbl>    <dbl>
## 1     4      22.8      26        30.4       65      91        97
## 2     6      18.3      19.7      21        110     110       123
## 3     8      13.9      15.2      16.8      175     192.      245
```
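One possible form of that post-processing is to reshape the wide result into long format, with one row per group, variable and quantile. The sketch below assumes `tidyr` >= 1.0.0 (for `pivot_longer`) and `dplyr` >= 1.0.0 (for `across`). The names `quants_long`, `variable` and `quantile` are illustrative, and splitting on `"_"` only works because neither variable name contains an underscore.

```r
library(dplyr)
library(tidyr)
library(purrr)

p <- c(0.2, 0.5, 0.8)
p_funs <- map(p, ~ partial(quantile, probs = .x, na.rm = TRUE)) %>%
  set_names(nm = paste0(p * 100, "%"))

quants_long <- mtcars %>%
  group_by(cyl) %>%
  summarize(across(c(mpg, hp), p_funs)) %>%   # columns like `mpg_20%`, `hp_80%`
  pivot_longer(
    -cyl,
    names_to = c("variable", "quantile"),     # split "mpg_20%" into two columns
    names_sep = "_"
  )

quants_long
```

Whether the long or the wide layout is more convenient depends on what comes next; plotting usually prefers the long form.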

`partial` is yet another tool from the `purrr` package that can greatly enhance your R coding abilities. While this is surely a basic application of its functionality, one can easily see how powerful this function can be.
