# Aggregate – A Powerful Tool for Data Frame in R

**DataScience+**, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

This post gives a short review of the aggregate function as used for *data.frame*s and presents some interesting uses: from the trivial but handy to the most complicated problems I have solved with aggregate.

## Aggregate (data.frame): Technical Overview

`Aggregate`

is a function in base R which can, as the name suggests, aggregate the inputted `data.frame`

d.f by applying a function specified by the `FUN`

parameter to each column of sub-data.frames defined by the `by`

input parameter.

The `by`

parameter has to be a `list`

. However, since `data.frame`

‘s are handled as (named) lists of columns, one or more columns of a `data.frame`

can also be passed as the `by`

parameter. Interestingly, if these columns are of the same `data.frame`

as the one inputted as `x`

, those columns are not passed on to the `FUN`

function.

The function to apply has to be able to accept a `vector`

(since it will be called with parts of a column of a `data.frame`

as input).

The sub-data.frames defined by the `by`

input parameter can be thought of as logical indexing:

d.f <- data.frame(rating = c("AAA", "A", "A", "AAA", "BB", "BB", "AAA", "A")) i <- 1 by <- d.f$rating sub.data.frame <- d.f[by == unique(by)[i], ]

and do this for every `i`

between 1 and `length(unique(by))`

. Note that the `by`

variable doesn’t have to agree with one (or more) column of the `data.frame`

but could be anything. Hence, one can reproduce the `aggregate`

functionality by a `for`

cycle running the cycle variable over the unique values of the variable passed as `by`

and an `sapply`

applying the function passed as `FUN`

to each column of the `data.frame`

sub.data.frame. Such a workaround however would be very difficult to document, as it would be unclear what (and why) this code is actually doing.

`Aggregate`

always returns a `data.frame`

as a result. This `data.frame`

will contain the (now unique) values from the input parameter `by`

as the first column and then columns containing the results of the call to the function in the `FUN`

parameter applied to the parts of the columns of the inputted `data.frame`

. It is interesting to note that if the function `FUN`

returns multiple values, the class of the columns of the result `data.frame`

will be `list`

or something a `list`

can be cast to (see the last example below).

It is important to note that the function call is applied to nameless vectors rather than named columns of a `data.frame`

and hence referring to the names of the `data.frame`

will not work, nor will column references such as `s.d.f[,1]`

.

## Basic Examples

The most basic uses of aggregate involve base functions such as `mean`

and `sd`

. It is indeed one of the most common uses of `aggregate`

to compare the mean or other properties of sample groups.

Recently I reproduced calculations from an Excel sheet. Most formulae were subtotals and grand totals. The Excel sheet was not very comfortably organized for this purpose: sums over rows, columns and totals of those sums were used. In R, I have changed the data to a star schema representation (when all metadata are represented row-wise and every value gets its own row) using `reshape2`

package and `melt`

then used `aggregate`

along different variables to get the different totals. The less variables you use in `by`

the more aggregated the end-result: the grand total along a dimension is simply using that dimension as “by”, while subtotals can be achieved using multiple variables as `by`

. The `FUN`

in this case was of course `sum`

.

One handy use of aggregate and a base function is getting the number of appearances of the various values:

values <- data.frame(value = c("a", "a", "a", "a", "a", "b", "b", "b", "c", "c", "c", "c")) nr.of.appearances <- aggregate(x = values, by = list(unique.values = values$value), FUN = length)

My favourite use of aggregate with a base function is getting the last day of each month in a series of dates. To do so, one can use the following code (assuming your dates are stored in a “YYYY-MM-DD” format as strings or as `Date`

):

dates <- data.frame(date = as.Date("2001-01-01", format = "%Y-%m-%d") + 0:729) dates date 1 2001-01-01 2 2001-01-02 3 2001-01-03 4 2001-01-04 ..... last.day <- aggregate(x = dates["date"], by = list(month = substr(dates$date, 1, 7)), FUN = max) last.day month date 1 2001-01 2001-01-31 2 2001-02 2001-02-28 3 2001-03 2001-03-31 4 2001-04 2001-04-30 .....

This came in very handy when working with banking information where the last day of the month depended on banking holidays as well as weekends.

## Advanced Uses

More advanced uses of `aggregate`

depend on writing your own `function`

, e.g. anonymous functions passed on as the `FUN`

parameter. To do so, one can use the syntax

# do not run the syntax aggregate(x = d.f, by = by.list, FUN = function(s.d.f){y <- s.d.f; return(y)}

The possible uses range from calling complex portfolio risk metrics for the homogeneous risk groups of a portfolio via fitting a distribution to categories of samples to anything you can image, really.

Here is an example with a “complex” portfolio risk metric (exposure to different counterparties in different asset classes):

assets <- data.frame(asset.class = c("equity", "equity","equity", "option","option","option", "bond", "bond"), rating = c("AAA", "A", "A", "AAA", "BB", "BB", "AAA", "A"), counterparty.a = c(runif(3), rnorm(5)), counterparty.b = c(runif(3), rnorm(5)), counterparty.c = c(runif(3), rnorm(5))) assets asset.class rating counterparty.a counterparty.b counterparty.c 1 equity AAA 0.9026004 0.6029417 0.8629453 2 equity A 0.8834034 0.5809589 0.4654721 3 equity A 0.1007586 0.9368537 0.3090811 4 option AAA -1.0508915 0.7171532 0.2224984 .....

Here is the use of `aggregate()`

function.

exposures <- aggregate(x = assets[c("counterparty.a", "counterparty.b", "counterparty.c")], by = assets[c("asset.class", "rating")], FUN = function(market.values){ sum(pmax(market.values, 0)) }) exposures asset.class rating counterparty.a counterparty.b counterparty.c 1 bond A 1.0038714 0.6382029 2.2822936 2 equity A 0.9841620 1.5178126 0.7745532 3 bond AAA 0.0000000 0.0000000 0.0000000 4 equity AAA 0.9026004 0.6029417 0.8629453 .....

Next up: fitting a Gaussian distribution to observations by categories:

library(MASS) categories <- data.frame(category = c("a", "a", "a", "a", "a", "b", "b", "b", "b", "b", "c", "c", "c", "c")) observations <- data.frame(observation = c(rnorm(5, mean = 3, sd = 0.2), rnorm(5, mean = -2, sd = 0.4), rnorm(4, mean = 0, sd = 1)))

Below we use the `aggregate()`

function to find the mean and standard deviation by categories.

distr.estimate <- aggregate(x = observations, by = categories, FUN = function(observations){ fitdistr(observations, densfun = "normal")$estimate }) distr.estimate category observation.mean observation.sd 1 a 3.0606926 0.1779962 2 b -2.1446040 0.1658481 3 c -0.1881841 0.5613013

This last example showcases several interesting properties. First, the `data.frame`

to `aggregate`

and the list of `by`

variables don’t have to be the same. While this is implied in other places of the post, this is an explicit example of such a setup. Secondly, the function passed as `FUN`

is not only an anonymous function, it is curried from a function with more than one input parameter. A function of a single input variable `observations`

has been created from the two-input variable function `fitdistr`

: fixing one of the input variables by setting `densfun = "normal"`

. Thirdly, rather than returning the full return value of the `fitdistr`

function, the return value is restricted to the element `estimate`

from the return value. And last but not least, the return value of the anonymous function passed to `FUN`

consists of two variables and not only one. Interestingly, aggregate casts the return value from `list`

to a `matrix`

and names the elements for us. However, these names can’t be used to reference the columns of the matrix. You can however reference them as follows:

distr.estimate$observation[1,][["mean"]] [1] 3.016988

## Closing Words

I hope that you have found the above useful. Now that you are more familiar with `aggregate`

, it is time for the truth: everything above and much more can be done with `data.table`

, and with a much faster performance. However, `data.table`

has a complex syntax and one really has to understand how things work under the hood, while `aggregate`

is simple and insightful. Until you are comfortable with both the logic of aggregation and the syntax of `data.table`

, it is a worthy investment to first write the code using `aggregate`

and then optimize it by rewriting it using `data.table`

.

For those of you who are interested, a dedicated post is coming where the above is redone with `data.table`

, along with some additional use cases specific to `data.table`

.

If you have any question leave a comment below.

**leave a comment**for the author, please follow the link and comment on their blog:

**DataScience+**.

R-bloggers.com offers

**daily e-mail updates**about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.