In my last post we looked at how to slice a
data.frame by group to obtain the rows for which a particular column in that group is at its maximum value sing base R. In this post, we will be taking a look at how to perform this task using
For this exercise we will be using
datasets::mtcars and so first, we must convert
mtcars to a
library(data.table) mtcars_colnames <- colnames(mtcars) mtcars[, "car"] <- rownames(mtcars) mtcars <- mtcars[, c("car", mtcars_colnames)] mtcars <- as.data.table(mtcars)
data.table syntax suggests that when grouping data, we should use the
.SD stands for “
Data.table”, so when we group the data
by a variable, we are creating subsets of the data. Note that there’s no significance to the initial
., except that it makes it slightly more unlikely that there will be a clash with a user-defined column name. So, one approach to solving our problem can be seen below.
mtcars[, .SD[which.max(mpg)], by = cyl] # cyl car mpg disp hp drat wt qsec vs am gear carb # 1: 6 Hornet 4 Drive 21.4 258.0 110 3.08 3.215 19.44 1 0 3 1 # 2: 4 Toyota Corolla 33.9 71.1 65 4.22 1.835 19.90 1 1 4 1 # 3: 8 Pontiac Firebird 19.2 400.0 175 3.08 3.845 17.05 0 0 3 2
So this code essentially treats
.SD as a “group”, one for each cylinder level (
by), and subsets the row by the index where
mpg is at its maximum.
However this wasn’t always the case with
data.table, there is some legacy syntax which is still valid within the package. If we take a look at the following code, we will see that we can obtain the row where
mpg is at its maximum using the
mtcars[, .I[which.max(mpg)]] #  20
In other words, here
.I is a vector representing the row number where
mpg is at its maximum in the original
data.table. Now consider the case where we look at this by each cylinder group; we obtain a
data.table whose column
V1 represents the row indices for each cylinder group where
mpg is at its maximum, i.e. one row for each group.
mtcars[, .I[which.max(mpg)], by = cyl] # cyl V1 # 1: 6 4 # 2: 4 20 # 3: 8 25
So using this
data.table we can subset on the column
V1 to extract these row indices and subset the original
data.table for those rows.
mtcars[mtcars[, .I[which.max(mpg)], by = cyl]$V1] # car mpg cyl disp hp drat wt qsec vs am gear carb # 1: Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1 # 2: Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1 # 3: Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2
The final way in which we could solve this problem with
data.table is in fact very similar to the base R approach that we saw in the previous post. This involves
data.table into three separate
data.tables, one for each cylinder group, and applying a function to each group that finds the index of the maximum row, subsetting the grouped
data.table on that index. We then bind these lists together using the
data.table::rbindlist() function. In fact, we could have even used the
do.call(rbind, .) approach we saw in the first post.
rbindlist(lapply(split(mtcars, mtcars[, cyl]), function(x) x[which.max(mpg)])) # car mpg cyl disp hp drat wt qsec vs am gear carb # 1: Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1 # 2: Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1 # 3: Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2
So there we have it, three separate ways to solve the same problem using
Now this wouldn’t be a completely informative blog post without some benchmarks. In fact these three options are what inspired me to write this series of posts. We can see the results of the benchmarking below.
We can see that, on average, the
.I solution is fastest. So you may be wondering, if the
.SD syntax is the newer syntax, why then is the
.I code quicker? Well it’s simply because
.SD has not yet been completely optimised.
.I avoids constructing
.SD, which is the bottleneck in the
As a side note, we could speed these results up even more with
data.table::setkey(mtcars, cyl). I won’t show the benchmarking results here as the data is so small it isn’t really a useful representation but it is worth considering should you need to perform a similar task on a larger dataset.
If you need to apply some function to each group within a
data.table has many solutions. The newest solution,
.SD, may not always be the fastest approach and you should try the old
.I approach if speed is important.
Are you aware of any other
data.table solutions to this problem? If so, let me know in the comments!
Next time, we will take a look at how to solve this problem in
dplyr as well as