In my recent post I have written about the aggregate function in base R and gave some examples on its use. This post repeats the same examples using data.table instead, the most efficient implementation of the aggregation logic in R, plus some additional use cases showing the power of the data.table package.
This post focuses on the aggregation aspect of the data.table and only touches upon all other uses of this versatile tool. For a great resource on everything data.table, head to the authors’ own free training material.
All code snippets below require the data.table package to be installed and loaded:
Here is the example for the number of appearances of the unique values in the data:
values <- data.table(value = c("a", "a", "a", "a", "a", "b", "b", "b", "c", "c", "c", "c")) values value 1: a 2: a 3: a ..... nr.of.appearances <- values[, list(nr.appearances=length(value)), by = list(unique.values = value)] nr.of.appearances unique.values nr.appearances 1: a 5 2: b 3 3: c 4
You can notice a lot of differences here. First of all, no additional function was invoke. Instead, the
 operator has been overloaded for the
data.table class allowing for a different signature: it has three inputs instead of the usual two for a
data.frame. We will return to this in a moment. Secondly, the columns of the
data.table were not referenced by their name as a
string, but as a variable instead. This is a very important aspect of the
data.table syntax. Last but not least as implied by the fact that both the aggregating function and the grouping variable are passed on as a
list one can not only group by multiple variables as in
aggregate but you can also use multiple aggregation functions at the same time. I will show an example of that later.
Coming back to the overloading of the
 operator: a
data.table is at the same time also a
data.frame. See e.g.
class(values) "data.table" "data.frame"
This means that you can use all (or at least most of) the
data.frame functionality as well. Among others you can use
aggregate like you would use for a
values <- data.frame(value = c("a", "a", "a", "a", "a", "b", "b", "b", "c", "c", "c", "c")) nr.of.appearances <- aggregate(x = values, by = list(unique.values = values$value), FUN = length)
You can also use the
 operator in the classic
data.frame way by passing on only two input variables:
values[values$value == "a",] value 1: a 2: a 3: a ...
Another exciting possibility with data.table is creating a new column in a data.table derived from existing columns – with or without aggregation. Examples of both are shown below:
values[, new.col := paste0(value, value)] values value new.col 1: a aa 2: a aa 3: a aa 4: a aa ..... values[, new.col := paste0(value, length(value)), by = list(unique.values = value)] values value new.col 1: a a5 2: a a5 3: a a5 4: a a5 .....
Notice that in both cases the
data.table was directly modified, rather than left unchanged with the results returned. That’s right:
data.table creates side effect by using copy-by-reference rather than copy-by-value as (almost) everything else in R. It is arguable whether this is alien to the nature of a (more or less) functional language like R but one thing is sure: it is extremely efficient, especially when the variable hardly fits the memory to start with.
Back to the basic examples, here is the last (and first) day of the months in your data
dates <- data.frame(date = as.Date("2001-01-01", format = "%Y-%m-%d") + 0:729) dates date 1 2001-01-01 2 2001-01-02 3 2001-01-03 4 2001-01-04 ..... dates <- as.data.table(dates) dates date 1: 2001-01-01 2: 2001-01-02 3: 2001-01-03 --- 726: 2002-12-27 727: 2002-12-28 728: 2002-12-29 special.days <- dates[, list(first.day = min(date), last.day = max(date)), by = list(month = substr(date, 1, 7))] special.days month first.day last.day 1: 2001-01 2001-01-01 2001-01-31 2: 2001-02 2001-02-01 2001-02-28 3: 2001-03 2001-03-01 2001-03-31 .....
As you can see the syntax is the same as above – but now we can get the first and last days in a single command! Also note that you don’t have to know up front that you want to use
as.data.table command allows you to cast a
data.frame into a
data.table. Finally, notice how
data.table creates a summary of the head and the tail of the variable if it’s too long to show.
Just like in case of
aggregate, you can use anonymous functions to aggregate in
data.table as well. Let’s have a look at the example for fitting a Gaussian distribution to observations by categories:
library(MASS) categories <- data.table(category = c("a", "a", "a", "a", "a", "b", "b", "b", "b", "b", "c", "c", "c", "c")) observations <- data.table(observation = c(rnorm(5, mean = 3, sd = 0.2), rnorm(5, mean = -2, sd = 0.4), rnorm(4, mean = 0, sd = 1))) data <- cbind(categories, observations) data category observation 1: a 2.7446816 2: a 2.8853469 3: a 2.7550775 ..... distr.estimate <- data[, list(mean = fitdistr(observation, densfun = "normal")$estimate[], sd = fitdistr(observation, densfun = "normal")$estimate[]), by = list(category)] distr.estimate category mean sd 1: a 2.8332705 0.06882552 2: b -1.9678460 0.37420857 3: c 0.9233108 0.47680978
distr.estimate <- data[, list(mean = fitdistr(observation, densfun = "normal")$estimate, sd = fitdistr(observation, densfun = "normal")$estimate[]), by = list(category)]
This example shows some weaknesses of using
data.table compared to
aggregate, but it also shows that those weaknesses are nicely balanced by the strength of
data.table. One such weakness is that by design
data.table aggregation requires the variables to be coming from the same
data.table, so we had to
cbind the two variables. Also, the aggregation in
data.table returns only the first variable if the function invoked returns more than variable, hence the equivalence of the two syntaxes showed above. However, as multiple calls can be submitted in the list, this can easily be overcome. Finally note how much simpler the anonymous function construction works: rather than defining the function itself, we can simply pass the relevant variable.
If you have any question about this post please leave a comment below.