# Selecting the max value from each group, a case study: data.table

September 14, 2019
By

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

# Introduction

In my last post we looked at how to slice a `data.frame` by group to obtain the rows for which a particular column in that group is at its maximum value sing base R. In this post, we will be taking a look at how to perform this task using `data.table`.

# `data.table` Solution(s)

For this exercise we will be using `datasets::mtcars` and so first, we must convert `mtcars` to a `data.table`.

``````library(data.table)
mtcars_colnames <- colnames(mtcars)
mtcars[, "car"] <- rownames(mtcars)
mtcars <- mtcars[, c("car", mtcars_colnames)]
mtcars <- as.data.table(mtcars)``````

The current `data.table` syntax suggests that when grouping data, we should use the `.SD` syntax. `.SD` stands for “`S`ubset of `D`ata.table”, so when we group the data `by` a variable, we are creating subsets of the data. Note that there’s no significance to the initial `.`, except that it makes it slightly more unlikely that there will be a clash with a user-defined column name. So, one approach to solving our problem can be seen below.

``````mtcars[, .SD[which.max(mpg)], by = cyl]
#    cyl              car  mpg  disp  hp drat    wt  qsec vs am gear carb
# 1:   6   Hornet 4 Drive 21.4 258.0 110 3.08 3.215 19.44  1  0    3    1
# 2:   4   Toyota Corolla 33.9  71.1  65 4.22 1.835 19.90  1  1    4    1
# 3:   8 Pontiac Firebird 19.2 400.0 175 3.08 3.845 17.05  0  0    3    2``````

So this code essentially treats `.SD` as a “group”, one for each cylinder level (`by`), and subsets the row by the index where `mpg` is at its maximum.

However this wasn’t always the case with `data.table`, there is some legacy syntax which is still valid within the package. If we take a look at the following code, we will see that we can obtain the row where `mpg` is at its maximum using the `.I` syntax.

``````mtcars[, .I[which.max(mpg)]]
#  20``````

In other words, here `.I` is a vector representing the row number where `mpg` is at its maximum in the original `data.table`. Now consider the case where we look at this by each cylinder group; we obtain a `data.table` whose column `V1` represents the row indices for each cylinder group where `mpg` is at its maximum, i.e. one row for each group.

``````mtcars[, .I[which.max(mpg)], by = cyl]
#    cyl V1
# 1:   6  4
# 2:   4 20
# 3:   8 25``````

So using this `data.table` we can subset on the column `V1` to extract these row indices and subset the original `data.table` for those rows.

``````mtcars[mtcars[, .I[which.max(mpg)], by = cyl]\$V1]
#                 car  mpg cyl  disp  hp drat    wt  qsec vs am gear carb
# 1:   Hornet 4 Drive 21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
# 2:   Toyota Corolla 33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
# 3: Pontiac Firebird 19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2``````

The final way in which we could solve this problem with `data.table` is in fact very similar to the base R approach that we saw in the previous post. This involves `split`ting the `data.table` into three separate `data.table`s, one for each cylinder group, and applying a function to each group that finds the index of the maximum row, subsetting the grouped `data.table` on that index. We then bind these lists together using the `data.table::rbindlist()` function. In fact, we could have even used the `do.call(rbind, .)` approach we saw in the first post.

``````rbindlist(lapply(split(mtcars, mtcars[, cyl]), function(x) x[which.max(mpg)]))
#                 car  mpg cyl  disp  hp drat    wt  qsec vs am gear carb
# 1:   Toyota Corolla 33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
# 2:   Hornet 4 Drive 21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
# 3: Pontiac Firebird 19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2``````

So there we have it, three separate ways to solve the same problem using `data.table`.

# Benchmarks

Now this wouldn’t be a completely informative blog post without some benchmarks. In fact these three options are what inspired me to write this series of posts. We can see the results of the benchmarking below.

We can see that, on average, the `.I` solution is fastest. So you may be wondering, if the `.SD` syntax is the newer syntax, why then is the `.I` code quicker? Well it’s simply because `.SD` has not yet been completely optimised. `.I` avoids constructing `.SD`, which is the bottleneck in the `.SD` solution.

As a side note, we could speed these results up even more with `data.table::setkey(mtcars, cyl)`. I won’t show the benchmarking results here as the data is so small it isn’t really a useful representation but it is worth considering should you need to perform a similar task on a larger dataset.

# Conclusion

If you need to apply some function to each group within a `data.table`, `data.table` has many solutions. The newest solution, `.SD`, may not always be the fastest approach and you should try the old `.I` approach if speed is important.

Are you aware of any other `data.table` solutions to this problem? If so, let me know in the comments!

Next time, we will take a look at how to solve this problem in `dplyr` as well as `sparklyr`.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.