Selecting the max value from each group, a case study: data.table

[This article was first published on Random R Ramblings, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Introduction

In my last post we looked at how to slice a data.frame by group to obtain the rows for which a particular column in that group is at its maximum value sing base R. In this post, we will be taking a look at how to perform this task using data.table.

data.table Solution(s)

For this exercise we will be using datasets::mtcars and so first, we must convert mtcars to a data.table.

library(data.table)
mtcars_colnames <- colnames(mtcars)
mtcars[, "car"] <- rownames(mtcars)
mtcars <- mtcars[, c("car", mtcars_colnames)]
mtcars <- as.data.table(mtcars)

The current data.table syntax suggests that when grouping data, we should use the .SD syntax. .SD stands for “Subset of Data.table”, so when we group the data by a variable, we are creating subsets of the data. Note that there’s no significance to the initial ., except that it makes it slightly more unlikely that there will be a clash with a user-defined column name. So, one approach to solving our problem can be seen below.

mtcars[, .SD[which.max(mpg)], by = cyl]
#    cyl              car  mpg  disp  hp drat    wt  qsec vs am gear carb
# 1:   6   Hornet 4 Drive 21.4 258.0 110 3.08 3.215 19.44  1  0    3    1
# 2:   4   Toyota Corolla 33.9  71.1  65 4.22 1.835 19.90  1  1    4    1
# 3:   8 Pontiac Firebird 19.2 400.0 175 3.08 3.845 17.05  0  0    3    2

So this code essentially treats .SD as a “group”, one for each cylinder level (by), and subsets the row by the index where mpg is at its maximum.

However this wasn’t always the case with data.table, there is some legacy syntax which is still valid within the package. If we take a look at the following code, we will see that we can obtain the row where mpg is at its maximum using the .I syntax.

mtcars[, .I[which.max(mpg)]]
# [1] 20

In other words, here .I is a vector representing the row number where mpg is at its maximum in the original data.table. Now consider the case where we look at this by each cylinder group; we obtain a data.table whose column V1 represents the row indices for each cylinder group where mpg is at its maximum, i.e. one row for each group.

mtcars[, .I[which.max(mpg)], by = cyl]
#    cyl V1
# 1:   6  4
# 2:   4 20
# 3:   8 25

So using this data.table we can subset on the column V1 to extract these row indices and subset the original data.table for those rows.

mtcars[mtcars[, .I[which.max(mpg)], by = cyl]$V1]
#                 car  mpg cyl  disp  hp drat    wt  qsec vs am gear carb
# 1:   Hornet 4 Drive 21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
# 2:   Toyota Corolla 33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
# 3: Pontiac Firebird 19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2

The final way in which we could solve this problem with data.table is in fact very similar to the base R approach that we saw in the previous post. This involves splitting the data.table into three separate data.tables, one for each cylinder group, and applying a function to each group that finds the index of the maximum row, subsetting the grouped data.table on that index. We then bind these lists together using the data.table::rbindlist() function. In fact, we could have even used the do.call(rbind, .) approach we saw in the first post.

rbindlist(lapply(split(mtcars, mtcars[, cyl]), function(x) x[which.max(mpg)]))
#                 car  mpg cyl  disp  hp drat    wt  qsec vs am gear carb
# 1:   Toyota Corolla 33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
# 2:   Hornet 4 Drive 21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
# 3: Pontiac Firebird 19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2

So there we have it, three separate ways to solve the same problem using data.table.

Benchmarks

Now this wouldn’t be a completely informative blog post without some benchmarks. In fact these three options are what inspired me to write this series of posts. We can see the results of the benchmarking below.

We can see that, on average, the .I solution is fastest. So you may be wondering, if the .SD syntax is the newer syntax, why then is the .I code quicker? Well it’s simply because .SD has not yet been completely optimised. .I avoids constructing .SD, which is the bottleneck in the .SD solution.

As a side note, we could speed these results up even more with data.table::setkey(mtcars, cyl). I won’t show the benchmarking results here as the data is so small it isn’t really a useful representation but it is worth considering should you need to perform a similar task on a larger dataset.

Conclusion

If you need to apply some function to each group within a data.table, data.table has many solutions. The newest solution, .SD, may not always be the fastest approach and you should try the old .I approach if speed is important.

Are you aware of any other data.table solutions to this problem? If so, let me know in the comments!

Next time, we will take a look at how to solve this problem in dplyr as well as sparklyr.

To leave a comment for the author, please follow the link and comment on their blog: Random R Ramblings.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)