Aggregation with dplyr: summarise and summarise_each

[This article was first published on MilanoR, and kindly contributed to R-bloggers.]

We use summarise() with aggregate functions, which take a vector of values and return a single value. The function summarise_each() offers an alternative to summarise() that produces identical results.
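As a quick illustration of what "aggregate function" means here, the base R sketch below checks the length of a few results. The variable names (x, n_in, m, r) are only for illustration and do not come from the article.

```r
# mtcars ships with base R in the datasets package
x <- datasets::mtcars$mpg

n_in <- length(x)   # number of input values
m    <- mean(x)     # aggregate: many values in, one value out
r    <- range(x)    # not an aggregate in this sense: two values out

length(m)   # 1
length(r)   # 2
```

Functions such as mean(), min() and max() fit the length-one requirement; range() does not, because it returns both the minimum and the maximum.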

This post compares the behavior of summarise() and summarise_each() along two factors that we can control:

  1. How many variables to manipulate
  • 1A. a single variable
  • 1B. more than one variable
  2. How many functions to apply to each variable
  • 2A. a single function
  • 2B. more than one function

resulting in the following four cases:

  • Case 1: apply one function to one variable
  • Case 2: apply many functions to one variable
  • Case 3: apply one function to many variables
  • Case 4: apply many functions to many variables

These four cases will also be tested with and without a group_by() option.

The mtcars data frame

For this article we will use the well-known mtcars data frame.

We will first transform it into a tbl_df object; no change will occur to the standard data.frame object but a much better print method will be available.

Finally, to keep this article tidy and clean, we will select only three variables of interest:

# load dplyr for %>%, tbl_df(), select() and the summarise verbs
library(dplyr)

mtcars <- mtcars %>%
  tbl_df() %>%
  select(cyl, mpg, disp)
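The claim that the conversion leaves the underlying data frame intact can be checked directly. This is a minimal sketch, assuming a recent dplyr in which as_tibble() (re-exported from the tibble package) plays the role of tbl_df(); cars_tbl is an illustrative name, not one used in the article.

```r
library(dplyr)

# convert a fresh copy of the built-in data set
cars_tbl <- as_tibble(datasets::mtcars)

is.data.frame(cars_tbl)                         # still a data.frame
inherits(cars_tbl, "tbl_df")                    # with the tbl_df class on top
identical(cars_tbl$mpg, datasets::mtcars$mpg)   # column values are untouched
```

One caveat: the conversion drops the row names (the car models), so it is the columns, not the row labels, that remain unchanged.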

Case 1: apply one function to one variable

In this case, summarise() is the simplest candidate.

# without group
mtcars %>% 
  summarise (mean_mpg = mean(mpg))

## Source: local data frame [1 x 1]
## 
##   mean_mpg
##      (dbl)
## 1 20.09062

# with  group
mtcars %>% 
  group_by(cyl) %>% 
  summarise (mean_mpg = mean(mpg))

## Source: local data frame [3 x 2]
## 
##     cyl mean_mpg
##   (dbl)    (dbl)
## 1     4 26.66364
## 2     6 19.74286
## 3     8 15.10000

We could use summarise_each() as well, but its usage results in a loss of clarity.

# without group
mtcars %>% 
  summarise_each (funs(mean) , mean_mpg = mpg)

## Source: local data frame [1 x 1]
## 
##   mean_mpg
##      (dbl)
## 1 20.09062

# with  group
mtcars %>% 
  group_by(cyl) %>% 
  summarise_each (funs(mean) , mean_mpg = mpg)

## Source: local data frame [3 x 2]
## 
##     cyl mean_mpg
##   (dbl)    (dbl)
## 1     4 26.66364
## 2     6 19.74286
## 3     8 15.10000

Case 2: apply many functions to one variable

In this case we can use both functions summarise() and summarise_each().

Function summarise() has a more intuitive syntax:

# without group
mtcars %>% 
  summarise (min_mpg = min(mpg), max_mpg = max(mpg))

## Source: local data frame [1 x 2]
## 
##   min_mpg max_mpg
##     (dbl)   (dbl)
## 1    10.4    33.9

# with  group
mtcars %>% 
  group_by(cyl) %>% 
  summarise (min_mpg = min(mpg), max_mpg = max(mpg))

## Source: local data frame [3 x 3]
## 
##     cyl min_mpg max_mpg
##   (dbl)   (dbl)   (dbl)
## 1     4    21.4    33.9
## 2     6    17.8    21.4
## 3     8    10.4    19.2

The names of the output variables can be specified in simple forms like: max_mpg = max(mpg)

When we apply many functions to one variable, the use of summarise_each() provides a more compact and tidy notation:

# without group
mtcars %>% 
  summarise_each (funs(min, max), mpg)

## Source: local data frame [1 x 2]
## 
##     min   max
##   (dbl) (dbl)
## 1  10.4  33.9

# with  group
mtcars %>% 
  group_by(cyl) %>% 
  summarise_each (funs(min, max), mpg)

## Source: local data frame [3 x 3]
## 
##     cyl   min   max
##   (dbl) (dbl) (dbl)
## 1     4  21.4  33.9
## 2     6  17.8  21.4
## 3     8  10.4  19.2

The names of the output variables are given by the names of the functions: min and max. In this case we lose the name of the variable the functions are applied to. If we prefer something like min_mpg and max_mpg, we have to rename the functions we call within funs():

# without group
mtcars %>% 
  summarise_each (funs(min_mpg = min, max_mpg = max), mpg)

## Source: local data frame [1 x 2]
## 
##   min_mpg max_mpg
##     (dbl)   (dbl)
## 1    10.4    33.9

# with  group
mtcars %>% 
  group_by(cyl) %>% 
  summarise_each (funs(min_mpg = min, max_mpg = max), mpg)

## Source: local data frame [3 x 3]
## 
##     cyl min_mpg max_mpg
##   (dbl)   (dbl)   (dbl)
## 1     4    21.4    33.9
## 2     6    17.8    21.4
## 3     8    10.4    19.2

Case 3: apply one function to many variables

This case is very similar to case 2. Both functions summarise() and summarise_each() can be used.

Function summarise() has again a more intuitive syntax and the names of output variables can be specified in the usual simple form: mean_mpg = mean(mpg)

# without group
mtcars %>% 
  summarise(mean_mpg = mean(mpg), mean_disp = mean(disp))

## Source: local data frame [1 x 2]
## 
##   mean_mpg mean_disp
##      (dbl)     (dbl)
## 1 20.09062  230.7219

# with  group
mtcars %>% 
  group_by(cyl) %>% 
  summarise(mean_mpg = mean(mpg), mean_disp = mean(disp))

## Source: local data frame [3 x 3]
## 
##     cyl mean_mpg mean_disp
##   (dbl)    (dbl)     (dbl)
## 1     4 26.66364  105.1364
## 2     6 19.74286  183.3143
## 3     8 15.10000  353.1000

When we apply one function to many variables, the use of summarise_each() provides a more compact and tidy notation:

# without group
mtcars %>% 
  summarise_each(funs(mean) , mpg, disp)

## Source: local data frame [1 x 2]
## 
##        mpg     disp
##      (dbl)    (dbl)
## 1 20.09062 230.7219

# with  group
mtcars %>% 
  group_by(cyl) %>% 
  summarise_each (funs(mean), mpg, disp)

## Source: local data frame [3 x 3]
## 
##     cyl      mpg     disp
##   (dbl)    (dbl)    (dbl)
## 1     4 26.66364 105.1364
## 2     6 19.74286 183.3143
## 3     8 15.10000 353.1000

The names of the output variables are given by the names of the variables: mpg and disp. In this case we lose track of the name of the function applied to the variables: mean(). We would probably prefer something like mean_mpg and mean_disp. To achieve this result we have to rename the variables we pass to ... within summarise_each():

# without group
mtcars %>% 
  summarise_each(funs(mean) , mean_mpg = mpg, mean_disp = disp)

## Source: local data frame [1 x 2]
## 
##   mean_mpg mean_disp
##      (dbl)     (dbl)
## 1 20.09062  230.7219

# with  group
mtcars %>% 
  group_by(cyl) %>% 
  summarise_each(funs(mean) , mean_mpg = mpg, mean_disp = disp)

## Source: local data frame [3 x 3]
## 
##     cyl mean_mpg mean_disp
##   (dbl)    (dbl)     (dbl)
## 1     4 26.66364  105.1364
## 2     6 19.74286  183.3143
## 3     8 15.10000  353.1000

Case 4: apply many functions to many variables

As in the previous cases, both summarise() and summarise_each() are valid alternatives.

Function summarise() has again a more intuitive syntax and the names of output variables can be specified in the usual simple form: max_mpg = max(mpg)

# without group
mtcars %>% 
  summarise(min_mpg = min(mpg) , min_disp = min(disp), max_mpg = max(mpg) , max_disp = max(disp))

## Source: local data frame [1 x 4]
## 
##   min_mpg min_disp max_mpg max_disp
##     (dbl)    (dbl)   (dbl)    (dbl)
## 1    10.4     71.1    33.9      472

# with a single group
mtcars %>% 
  group_by(cyl) %>% 
  summarise(min_mpg = min(mpg) , min_disp = min(disp), max_mpg = max(mpg) , max_disp = max(disp))

## Source: local data frame [3 x 5]
## 
##     cyl min_mpg min_disp max_mpg max_disp
##   (dbl)   (dbl)    (dbl)   (dbl)    (dbl)
## 1     4    21.4     71.1    33.9    146.7
## 2     6    17.8    145.0    21.4    258.0
## 3     8    10.4    275.8    19.2    472.0

When we apply many functions to many variables, the use of summarise_each() provides a more compact and tidy notation:

# without group
mtcars %>% 
  summarise_each(funs(min, max) , mpg, disp)

## Source: local data frame [1 x 4]
## 
##   mpg_min disp_min mpg_max disp_max
##     (dbl)    (dbl)   (dbl)    (dbl)
## 1    10.4     71.1    33.9      472

# with a single group
mtcars %>% 
  group_by(cyl) %>% 
  summarise_each(funs(min, max) , mpg, disp)

## Source: local data frame [3 x 5]
## 
##     cyl mpg_min disp_min mpg_max disp_max
##   (dbl)   (dbl)    (dbl)   (dbl)    (dbl)
## 1     4    21.4     71.1    33.9    146.7
## 2     6    17.8    145.0    21.4    258.0
## 3     8    10.4    275.8    19.2    472.0

The names of the output variables are given by the notation variable_function, i.e. mpg_min, disp_min, etc.

Naming output variables with a different notation, e.g. function_variable, does not appear to be possible within the call to summarise_each().

This goal has to be achieved with a separate instruction:

# without group
mtcars %>% 
  summarise_each(funs(min, max) , mpg, disp) %>%
  setNames(c("min_mpg", "min_disp", "max_mpg", "max_disp"))

## Source: local data frame [1 x 4]
## 
##   min_mpg min_disp max_mpg max_disp
##     (dbl)    (dbl)   (dbl)    (dbl)
## 1    10.4     71.1    33.9      472

# with  group
mtcars %>% 
  group_by(cyl) %>% 
  summarise_each(funs(min, max) , mpg, disp) %>%
  setNames(c("cyl", "min_mpg", "min_disp", "max_mpg", "max_disp"))

## Source: local data frame [3 x 5]
## 
##     cyl min_mpg min_disp max_mpg max_disp
##   (dbl)   (dbl)    (dbl)   (dbl)    (dbl)
## 1     4    21.4     71.1    33.9    146.7
## 2     6    17.8    145.0    21.4    258.0
## 3     8    10.4    275.8    19.2    472.0
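Typing the new names by hand becomes error-prone as the number of variables and functions grows. The base R sketch below builds the same name vectors programmatically; vars, fns, default_names and new_names are illustrative names, and the ordering is assumed to match the summarise_each() output shown above (all variables for the first function, then all variables for the second).

```r
vars <- c("mpg", "disp")
fns  <- c("min", "max")

# variable_function, in the order printed by summarise_each() above
default_names <- paste(rep(vars, times = length(fns)),
                       rep(fns,  each  = length(vars)),
                       sep = "_")

# function_variable, in the same column order
new_names <- paste(rep(fns,  each  = length(vars)),
                   rep(vars, times = length(fns)),
                   sep = "_")

default_names   # "mpg_min" "disp_min" "mpg_max" "disp_max"
new_names       # "min_mpg" "min_disp" "max_mpg" "max_disp"
```

The result can then be passed to setNames(new_names) in the ungrouped case, or to setNames(c("cyl", new_names)) when the data are grouped by cyl, since the grouping column comes first in the output.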

Conclusions

When using functions returning results of length one we have two possible candidate verbs:

  • summarise()
  • summarise_each()

Function summarise() has a simpler syntax while function summarise_each() has a more compact notation.

As a consequence, summarise() seems more appropriate when dealing with a single variable or a single function. As the number of variables or functions increases, summarise_each() becomes the better choice.

Function summarise_each() has its own way to assign names to the output variables:

Case 2: apply many functions to one variable

The names of the output variables are given by the names of the functions. In this case we lose the name of the variable the functions are applied to.

Case 3: apply one function to many variables

The names of the output variables are given by the names of the variables. In this case we lose track of the name of the function applied to the variables.

Case 4: apply many functions to many variables

The names of the output variables are given by the notation variable_function. Naming output variables with a different notation does not appear to be possible within the call to summarise_each().
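For readers on a newer dplyr, the naming limitation may no longer apply. The sketch below assumes dplyr 1.0.0 or later, where summarise_each() and funs() are deprecated and across() takes a .names pattern; it is not part of the original comparison, and res is an illustrative name.

```r
library(dplyr)

res <- datasets::mtcars %>%
  group_by(cyl) %>%
  summarise(across(c(mpg, disp),
                   list(min = min, max = max),
                   .names = "{.fn}_{.col}"))   # function_variable naming

# across() loops over columns first, then functions, so the column
# order differs from the summarise_each() output shown earlier
names(res)   # "cyl" "min_mpg" "max_mpg" "min_disp" "max_disp"
```

Swapping the pattern to "{.col}_{.fn}" gives names in the variable_function style, so either convention can be requested inside the summarise() call.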
