Efficient aggregation (and more) using data.table

[This article was first published on DataScience+, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

In my recent post I have written about the aggregate function in base R and gave some examples on its use. This post repeats the same examples using data.table instead, the most efficient implementation of the aggregation logic in R, plus some additional use cases showing the power of the data.table package.

This post focuses on the aggregation aspect of the data.table and only touches upon all other uses of this versatile tool. For a great resource on everything data.table, head to the authors’ own free training material.

All code snippets below require the data.table package to be installed and loaded:

install.packages("data.table")
library(data.table)

Basic examples

Here is the example for the number of appearances of the unique values in the data:

values <- data.table(value = c("a", "a", "a", "a", "a", 
                               "b", "b", "b", 
                               "c", "c", "c", "c"))
values
    value
 1:     a
 2:     a
 3:     a
.....

nr.of.appearances <- values[, list(nr.appearances=length(value)), 
                                by = list(unique.values = value)]
nr.of.appearances
   unique.values nr.appearances
1:             a              5
2:             b              3
3:             c              4

You can notice a lot of differences here. First of all, no additional function was invoke. Instead, the [] operator has been overloaded for the data.table class allowing for a different signature: it has three inputs instead of the usual two for a data.frame. We will return to this in a moment. Secondly, the columns of the data.table were not referenced by their name as a string, but as a variable instead. This is a very important aspect of the data.table syntax. Last but not least as implied by the fact that both the aggregating function and the grouping variable are passed on as a list one can not only group by multiple variables as in aggregate but you can also use multiple aggregation functions at the same time. I will show an example of that later.

Coming back to the overloading of the [] operator: a data.table is at the same time also a data.frame. See e.g.

class(values)
"data.table" "data.frame"

This means that you can use all (or at least most of) the data.frame functionality as well. Among others you can use aggregate like you would use for a data.frame:

values <- data.frame(value = c("a", "a", "a", "a", "a", 
                               "b", "b", "b", 
                               "c", "c", "c", "c"))

nr.of.appearances <- aggregate(x = values, 
                               by = list(unique.values = values$value), 
                               FUN = length)

You can also use the [] operator in the classic data.frame way by passing on only two input variables:

values[values$value == "a",]
   value
1:     a
2:     a
3:     a
...

Another exciting possibility with data.table is creating a new column in a data.table derived from existing columns – with or without aggregation. Examples of both are shown below:

values[, new.col := paste0(value, value)]
values
    value new.col
 1:     a      aa
 2:     a      aa
 3:     a      aa
 4:     a      aa
.....
values[, new.col := paste0(value, length(value)), by = list(unique.values = value)]
values
    value new.col
 1:     a      a5
 2:     a      a5
 3:     a      a5
 4:     a      a5
.....

Notice that in both cases the data.table was directly modified, rather than left unchanged with the results returned. That’s right: data.table creates side effect by using copy-by-reference rather than copy-by-value as (almost) everything else in R. It is arguable whether this is alien to the nature of a (more or less) functional language like R but one thing is sure: it is extremely efficient, especially when the variable hardly fits the memory to start with.
Back to the basic examples, here is the last (and first) day of the months in your data

dates <- data.frame(date = as.Date("2001-01-01", format = "%Y-%m-%d") + 0:729)
dates
          date
1   2001-01-01
2   2001-01-02
3   2001-01-03
4   2001-01-04
.....

dates <- as.data.table(dates)
dates
           date
  1: 2001-01-01
  2: 2001-01-02
  3: 2001-01-03
 ---           
726: 2002-12-27
727: 2002-12-28
728: 2002-12-29

special.days <- dates[, list(first.day = min(date), last.day = max(date)), 
                           by = list(month = substr(date, 1, 7))]
special.days
      month  first.day   last.day
 1: 2001-01 2001-01-01 2001-01-31
 2: 2001-02 2001-02-01 2001-02-28
 3: 2001-03 2001-03-01 2001-03-31
.....

As you can see the syntax is the same as above – but now we can get the first and last days in a single command! Also note that you don’t have to know up front that you want to use data.table: the as.data.table command allows you to cast a data.frame into a data.table. Finally, notice how data.table creates a summary of the head and the tail of the variable if it’s too long to show.

Advanced Uses

Just like in case of aggregate, you can use anonymous functions to aggregate in data.table as well. Let’s have a look at the example for fitting a Gaussian distribution to observations by categories:

library(MASS)

categories <- data.table(category = c("a", "a", "a", "a", "a", 
                                      "b", "b", "b", "b", "b",
                                      "c", "c", "c", "c"))

observations <- data.table(observation = c(rnorm(5, mean = 3, sd = 0.2),
                                           rnorm(5, mean = -2, sd = 0.4),
                                           rnorm(4, mean = 0, sd = 1)))

data <- cbind(categories, observations)
data
    category observation
 1:        a   2.7446816
 2:        a   2.8853469
 3:        a   2.7550775
.....
distr.estimate <- data[,
    list(mean = fitdistr(observation, densfun = "normal")$estimate[[1]],
         sd = fitdistr(observation, densfun = "normal")$estimate[[2]]),
    by = list(category)]
distr.estimate
   category       mean         sd
1:        a  2.8332705 0.06882552
2:        b -1.9678460 0.37420857
3:        c  0.9233108 0.47680978

or equivalently

distr.estimate <- data[, 
  list(mean = fitdistr(observation, densfun = "normal")$estimate,
       sd = fitdistr(observation, densfun = "normal")$estimate[[2]]),
  by = list(category)]

This example shows some weaknesses of using data.table compared to aggregate, but it also shows that those weaknesses are nicely balanced by the strength of data.table. One such weakness is that by design data.table aggregation requires the variables to be coming from the same data.table, so we had to cbind the two variables. Also, the aggregation in data.table returns only the first variable if the function invoked returns more than variable, hence the equivalence of the two syntaxes showed above. However, as multiple calls can be submitted in the list, this can easily be overcome. Finally note how much simpler the anonymous function construction works: rather than defining the function itself, we can simply pass the relevant variable.

If you have any question about this post please leave a comment below.

To leave a comment for the author, please follow the link and comment on their blog: DataScience+.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)