# Aggregation with dplyr: summarise and summarise_each

*This article is an extract from the course “Efficient Data Manipulation with R” that the author, Andrea Spanò, kindly provided us.*

## Introduction

We use `summarise()` with aggregate functions, which take a vector of values and return a single number. Function `summarise_each()` offers an alternative approach to `summarise()` with identical results.
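To make the notion of an aggregate function concrete, here is a minimal base R sketch. The vector `x` and the helper names `n_mean`, `n_min` and `n_range` are made up for illustration and are not part of the original material.

```r
# Hypothetical values, chosen only for illustration
x <- c(21.0, 22.8, 18.7)

# mean() and min() each collapse the vector to a single value,
# so they qualify as aggregate functions for summarise()
n_mean <- length(mean(x))
n_min  <- length(min(x))

# range() returns two values (minimum and maximum), so it is not
# an aggregate function in this sense
n_range <- length(range(x))
```

Checking the length of the result is a quick way to tell whether a function returns a single value per input vector.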

This post compares the behavior of `summarise()` and `summarise_each()` with respect to two factors we can control:

- How many variables to manipulate
  - 1A. single variable
  - 1B. more than one variable
- How many functions to apply to each variable
  - 2A. single function
  - 2B. more than one function

resulting in the following four cases:

- Case 1: apply one function to one variable
- Case 2: apply many functions to one variable
- Case 3: apply one function to many variables
- Case 4: apply many functions to many variables

These four cases will also be tested with and without a `group_by()` option.

## The `mtcars` data frame

For this article we will use the well-known `mtcars` data frame.

We will first transform it into a `tbl_df` object; the underlying `data.frame` is not changed, but a much better print method becomes available.

Finally, to keep this article tidy and clean, we will select only three variables of interest:

```r
library(dplyr)

mtcars <- mtcars %>%
  tbl_df() %>%
  select(cyl, mpg, disp)
```

### Case 1: apply one function to one variable

In this case, `summarise()` is the simplest candidate.

```r
# without group
mtcars %>%
  summarise(mean_mpg = mean(mpg))
```

```
## Source: local data frame [1 x 1]
##
##   mean_mpg
##      (dbl)
## 1 20.09062
```

```r
# with group
mtcars %>%
  group_by(cyl) %>%
  summarise(mean_mpg = mean(mpg))
```

```
## Source: local data frame [3 x 2]
##
##     cyl mean_mpg
##   (dbl)    (dbl)
## 1     4 26.66364
## 2     6 19.74286
## 3     8 15.10000
```
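As a cross-check of what the grouped call computes, the same per-group means can be reproduced with base R's `tapply()`. This is only a sketch for comparison; `group_means` is a name introduced here for illustration, and the data are taken from `datasets::mtcars` so the snippet does not depend on the modified copy.

```r
# Mean of mpg within each level of cyl, using base R only
group_means <- tapply(datasets::mtcars$mpg, datasets::mtcars$cyl, mean)

# The result is a named vector with one entry per cyl value
group_means
```

The entries are named after the `cyl` values, and the numbers should agree with the `mean_mpg` column of the grouped output.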

We could use function `summarise_each()` as well, but at the cost of clarity.

```r
# without group
mtcars %>%
  summarise_each(funs(mean), mean_mpg = mpg)
```

```
## Source: local data frame [1 x 1]
##
##   mean_mpg
##      (dbl)
## 1 20.09062
```

```r
# with group
mtcars %>%
  group_by(cyl) %>%
  summarise_each(funs(mean), mean_mpg = mpg)
```

```
## Source: local data frame [3 x 2]
##
##     cyl mean_mpg
##   (dbl)    (dbl)
## 1     4 26.66364
## 2     6 19.74286
## 3     8 15.10000
```

### Case 2: apply many functions to one variable

In this case we can use both functions `summarise()` and `summarise_each()`.

Function `summarise()` has a more intuitive syntax:

```r
# without group
mtcars %>%
  summarise(min_mpg = min(mpg), max_mpg = max(mpg))
```

```
## Source: local data frame [1 x 2]
##
##   min_mpg max_mpg
##     (dbl)   (dbl)
## 1    10.4    33.9
```

```r
# with group
mtcars %>%
  group_by(cyl) %>%
  summarise(min_mpg = min(mpg), max_mpg = max(mpg))
```

```
## Source: local data frame [3 x 3]
##
##     cyl min_mpg max_mpg
##   (dbl)   (dbl)   (dbl)
## 1     4    21.4    33.9
## 2     6    17.8    21.4
## 3     8    10.4    19.2
```

The names of the output variables can be specified in simple forms like `max_mpg = max(mpg)`.

When we apply many functions to one variable, `summarise_each()` provides a more compact and tidy notation:

```r
# without group
mtcars %>%
  summarise_each(funs(min, max), mpg)
```

```
## Source: local data frame [1 x 2]
##
##     min   max
##   (dbl) (dbl)
## 1  10.4  33.9
```

```r
# with group
mtcars %>%
  group_by(cyl) %>%
  summarise_each(funs(min, max), mpg)
```

```
## Source: local data frame [3 x 3]
##
##     cyl   min   max
##   (dbl) (dbl) (dbl)
## 1     4  21.4  33.9
## 2     6  17.8  21.4
## 3     8  10.4  19.2
```

The names of the output variables are given by the names of the functions: `min` and `max`. In this case we lose the name of the variable the function is applied to. If we prefer something like `min_mpg` and `max_mpg`, we have to rename the **functions** we call within `funs()`:

```r
# without group
mtcars %>%
  summarise_each(funs(min_mpg = min, max_mpg = max), mpg)
```

```
## Source: local data frame [1 x 2]
##
##   min_mpg max_mpg
##     (dbl)   (dbl)
## 1    10.4    33.9
```

```r
# with group
mtcars %>%
  group_by(cyl) %>%
  summarise_each(funs(min_mpg = min, max_mpg = max), mpg)
```

```
## Source: local data frame [3 x 3]
##
##     cyl min_mpg max_mpg
##   (dbl)   (dbl)   (dbl)
## 1     4    21.4    33.9
## 2     6    17.8    21.4
## 3     8    10.4    19.2
```

### Case 3: apply one function to many variables

This case is very similar to case 2. Both functions `summarise()` and `summarise_each()` can be used.

Function `summarise()` again has a more intuitive syntax, and the names of the output variables can be specified in the usual simple form: `mean_mpg = mean(mpg)`

```r
# without group
mtcars %>%
  summarise(mean_mpg = mean(mpg), mean_disp = mean(disp))
```

```
## Source: local data frame [1 x 2]
##
##   mean_mpg mean_disp
##      (dbl)     (dbl)
## 1 20.09062  230.7219
```

```r
# with group
mtcars %>%
  group_by(cyl) %>%
  summarise(mean_mpg = mean(mpg), mean_disp = mean(disp))
```

```
## Source: local data frame [3 x 3]
##
##     cyl mean_mpg mean_disp
##   (dbl)    (dbl)     (dbl)
## 1     4 26.66364  105.1364
## 2     6 19.74286  183.3143
## 3     8 15.10000  353.1000
```

When we apply one function to many variables, `summarise_each()` provides a more compact and tidy notation:

```r
# without group
mtcars %>%
  summarise_each(funs(mean), mpg, disp)
```

```
## Source: local data frame [1 x 2]
##
##        mpg     disp
##      (dbl)    (dbl)
## 1 20.09062 230.7219
```

```r
# with group
mtcars %>%
  group_by(cyl) %>%
  summarise_each(funs(mean), mpg, disp)
```

```
## Source: local data frame [3 x 3]
##
##     cyl      mpg     disp
##   (dbl)    (dbl)    (dbl)
## 1     4 26.66364 105.1364
## 2     6 19.74286 183.3143
## 3     8 15.10000 353.1000
```

The names of the output variables are given by the names of the variables: `mpg` and `disp`. In this case we lose track of the name of the function applied to the variables: `mean()`. We would probably prefer something like `mean_mpg` and `mean_disp`. To achieve this result we have to rename the **variables** we pass to `...` within `summarise_each()`:

```r
# without group
mtcars %>%
  summarise_each(funs(mean), mean_mpg = mpg, mean_disp = disp)
```

```
## Source: local data frame [1 x 2]
##
##   mean_mpg mean_disp
##      (dbl)     (dbl)
## 1 20.09062  230.7219
```

```r
# with group
mtcars %>%
  group_by(cyl) %>%
  summarise_each(funs(mean), mean_mpg = mpg, mean_disp = disp)
```

```
## Source: local data frame [3 x 3]
##
##     cyl mean_mpg mean_disp
##   (dbl)    (dbl)     (dbl)
## 1     4 26.66364  105.1364
## 2     6 19.74286  183.3143
## 3     8 15.10000  353.1000
```

### Case 4: apply many functions to many variables

As in the previous cases, both `summarise()` and `summarise_each()` are valid alternatives.

Function `summarise()` again has a more intuitive syntax, and the names of the output variables can be specified in the usual simple form: `max_mpg = max(mpg)`

```r
# without group
mtcars %>%
  summarise(min_mpg = min(mpg), min_disp = min(disp),
            max_mpg = max(mpg), max_disp = max(disp))
```

```
## Source: local data frame [1 x 4]
##
##   min_mpg min_disp max_mpg max_disp
##     (dbl)    (dbl)   (dbl)    (dbl)
## 1    10.4     71.1    33.9      472
```

```r
# with a single group
mtcars %>%
  group_by(cyl) %>%
  summarise(min_mpg = min(mpg), min_disp = min(disp),
            max_mpg = max(mpg), max_disp = max(disp))
```

```
## Source: local data frame [3 x 5]
##
##     cyl min_mpg min_disp max_mpg max_disp
##   (dbl)   (dbl)    (dbl)   (dbl)    (dbl)
## 1     4    21.4     71.1    33.9    146.7
## 2     6    17.8    145.0    21.4    258.0
## 3     8    10.4    275.8    19.2    472.0
```

When we apply many functions to many variables, `summarise_each()` provides a more compact and tidy notation:

```r
# without group
mtcars %>%
  summarise_each(funs(min, max), mpg, disp)
```

```
## Source: local data frame [1 x 4]
##
##   mpg_min disp_min mpg_max disp_max
##     (dbl)    (dbl)   (dbl)    (dbl)
## 1    10.4     71.1    33.9      472
```

```r
# with a single group
mtcars %>%
  group_by(cyl) %>%
  summarise_each(funs(min, max), mpg, disp)
```

```
## Source: local data frame [3 x 5]
##
##     cyl mpg_min disp_min mpg_max disp_max
##   (dbl)   (dbl)    (dbl)   (dbl)    (dbl)
## 1     4    21.4     71.1    33.9    146.7
## 2     6    17.8    145.0    21.4    258.0
## 3     8    10.4    275.8    19.2    472.0
```

The names of the output variables follow the notation `variable_function`, i.e. `mpg_min`, `disp_min`, etc.

Naming output variables with a different notation, e.g. `function_variable`, does not appear to be possible within the call to `summarise_each()`. This goal has to be achieved with a separate instruction:

```r
# without group
mtcars %>%
  summarise_each(funs(min, max), mpg, disp) %>%
  setNames(c("min_mpg", "min_disp", "max_mpg", "max_disp"))
```

```
## Source: local data frame [1 x 4]
##
##   min_mpg min_disp max_mpg max_disp
##     (dbl)    (dbl)   (dbl)    (dbl)
## 1    10.4     71.1    33.9      472
```

```r
# with group: the first column is the grouping variable cyl
mtcars %>%
  group_by(cyl) %>%
  summarise_each(funs(min, max), mpg, disp) %>%
  setNames(c("cyl", "min_mpg", "min_disp", "max_mpg", "max_disp"))
```

```
## Source: local data frame [3 x 5]
##
##     cyl min_mpg min_disp max_mpg max_disp
##   (dbl)   (dbl)    (dbl)   (dbl)    (dbl)
## 1     4    21.4     71.1    33.9    146.7
## 2     6    17.8    145.0    21.4    258.0
## 3     8    10.4    275.8    19.2    472.0
```
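As an aside, later dplyr releases offer `across()`, whose `.names` argument controls the output names inside the `summarise()` call itself. The sketch below assumes dplyr 1.0.0 or later is installed, which is newer than the version this article was written against. It reads from `datasets::mtcars` so it does not depend on the modified copy, and `res` is a name introduced here for illustration.

```r
library(dplyr)

# Assumption: dplyr >= 1.0.0, where across() and .names are available.
# "{.fn}_{.col}" builds names as function_variable, e.g. min_mpg.
res <- datasets::mtcars %>%
  group_by(cyl) %>%
  summarise(across(c(mpg, disp),
                   list(min = min, max = max),
                   .names = "{.fn}_{.col}"))

names(res)
```

With this form the columns are produced variable by variable, so the order differs from the `setNames()` version: both `mpg` summaries come first, then both `disp` summaries.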

### Conclusions

When using functions returning results of length one we have two candidate verbs:

- `summarise()`
- `summarise_each()`

Function `summarise()` has a simpler syntax, while function `summarise_each()` has a more compact notation.

As a consequence, `summarise()` seems more appropriate when dealing with a single variable or a single function. As the number of variables or functions increases, `summarise_each()` becomes a better choice.

Function `summarise_each()` has its own way to assign names to the output variables:

#### Case 2: apply many functions to one variable

The names of the output variables are given by the names of the **functions**. In this case we lose the name of the variable the function is applied to.

#### Case 3: apply one function to many variables

The names of the output variables are given by the names of the **variables**. In this case we lose track of the name of the function applied to the variables.

#### Case 4: apply many functions to many variables

The names of the output variables follow the notation **variable_function**. Naming output variables with a different notation does not appear to be possible within the call to `summarise_each()`.

The post Aggregation with dplyr: summarise and summarise_each appeared first on MilanoR.
