
Benchmarking the six most used manipulations for data.tables in R

#### Comparing formulations of the data.table package with base R and dplyr formulations

Almost everybody who handles large data sets in R is familiar with the data.table package. It provides several functions to subset, merge, and manipulate tabular data. When we started using the package, we were quite confused by the different commands for, e.g., subsetting the columns of a table: dt[, .(c1, c2)] vs. dt[, c("c1", "c2"), with = F]. Once we were more familiar with data.tables, we always used the expression that is easier to write, dt[, .(c1, c2)], and never reconsidered that decision and those habits. Until now…

While profiling a Shiny application in which large data sets are aggregated and prepared for a dashboard, we put the performance of those old decisions and habits to the test. We want to share these insights with you.

The data.table used in all of the following benchmarks has 4 columns and 100,000 rows; all values are numeric.

dt
##                c1        c2         c3         c4
##      1: 0.7958596 0.7835304 0.62645236 0.33154659
##      2: 0.7644930 0.5094924 0.91021167 0.02995171
##      3: 0.6535826 0.4156394 0.13558778 0.55281306
##      4: 0.8383642 0.9559514 0.30899163 0.57564810
##      5: 0.7829403 0.2674898 0.45422002 0.86812407
##     ---
##  99996: 0.6348011 0.7716307 0.48852835 0.18436878
##  99997: 0.1145093 0.4689724 0.62741292 0.77253526
##  99998: 0.5849055 0.3143366 0.14897752 0.77522017
##  99999: 0.9822468 0.5498836 0.04301053 0.68252273
## 100000: 0.3571940 0.3759371 0.58782003 0.45850694
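For reproducibility, the setup could look like the following sketch; the random seed is an assumption on our part, so the actual values differ from the printout above.

```r
library(data.table)
library(dplyr)
library(microbenchmark)

set.seed(1)  # seed is an assumption; the original values differ
dt <- data.table(c1 = runif(1e5), c2 = runif(1e5),
                 c3 = runif(1e5), c4 = runif(1e5))
```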

##### No 1 – Subsetting columns in a data table

In general, there are three ways to select a subset of columns from a data.table: dt[, .(c1, c2)], dt[, c("c1", "c2"), with = F], and the select function from the dplyr package, select(dt, c("c1", "c2")).

microbenchmark(
data.table.V1 = dt[, .(c1, c2)],
data.table.V2 = dt[, c("c1", "c2"), with = F],
dplyr = dt %>% select(c("c1", "c2")) %>% as.data.table(),
times = 1000
)
## Unit: microseconds
##           expr      min        lq      mean    median       uq      max neval
##  data.table.V1  603.761  737.2435 1331.1952  838.8315 1139.178 47501.32  1000
##  data.table.V2  329.487  420.0970  924.7818  496.0470  698.195 42546.43  1000
##          dplyr 1004.111 1213.6990 1924.8967 1347.6495 1738.800 43928.10  1000

As the results show, the second formulation, dt[, c("c1", "c2"), with = F], is the fastest, at around 900 microseconds on average; its median of around 500 microseconds is even below the minimum time of the first formulation. The dplyr command is much slower on average, though relative to its median it shows the least variation in run times.
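Recent versions of data.table also offer the `..` prefix, which selects columns named in a variable without spelling out with = FALSE. A minimal sketch (the variable name `cols` is our own choice):

```r
library(data.table)
dt <- data.table(c1 = runif(10), c2 = runif(10),
                 c3 = runif(10), c4 = runif(10))

cols <- c("c1", "c2")
sub1 <- dt[, ..cols]                       # '..' looks up 'cols' in the calling scope
sub2 <- dt[, c("c1", "c2"), with = FALSE]  # equivalent spelling from the benchmark
identical(sub1, sub2)
```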

##### No 2 – Manipulating complete columns

A common way to manipulate complete columns is to change them by reference: dt[, c5 := runif(100000)]. The data.table package offers an alternative formulation with the function set(), so we also tested set(x = dt, j = "c5", value = runif(100000)) and, using dplyr, dt %>% mutate(c5 = newCol) %>% as.data.table(). As a base R baseline, we additionally include cbind(dt, newCol).

newCol <- runif(1:100000)
microbenchmark(
data.table.V1 = dt[, c5 := newCol],
data.table.V2 = set(x = dt, j = "c5", value = newCol),
dplyr = dt %>% mutate(c5 = newCol) %>% as.data.table(),
cbind = cbind(dt, newCol),
times = 1000
)
## Unit: microseconds
##           expr      min       lq      mean    median        uq      max neval
##  data.table.V1  278.343  422.800  958.1093  492.0660  861.9235 51296.59  1000
##  data.table.V2   61.976   95.912  618.7856  124.0265  554.5730 45369.20  1000
##          dplyr 1134.843 1401.289 3983.8024 2069.5225 3541.3365 69168.02  1000
##          cbind 1465.393 2822.361 8907.3423 4338.8280 6456.5795 68281.18  1000

Here, the most convenient formulation, :=, is also outperformed by the alternative data.table formulation: on average, the set function is nearly twice as fast, and its median is less than a fourth of that of :=. Again, the dplyr alternative is clearly outperformed by both data.table variants, and the base R function cbind is the slowest variant of all.
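set() pays off most when a column update sits inside a loop, since := would incur the [.data.table dispatch overhead on every iteration. A small sketch with hypothetical column names:

```r
library(data.table)
dt <- data.table(a = 1:5, b = 6:10)

# double every column in place, one set() call per column
for (j in names(dt)) {
  set(dt, j = j, value = dt[[j]] * 2L)
}
dt$a  # 2 4 6 8 10
```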

##### No 3 – Manipulating only a subset of a column with a constant

Sometimes we want to change some values in a column based on the values of another column. The formulations from the previous example are easily extended for this purpose: dt[c4 <= 0.5, c5 := 2], set(x = dt, i = which(dt$c4 <= 0.5), j = "c5", value = 2), and dt %>% mutate(c5 = replace(x = c5, list = c4 <= 0.5, values = 2)) %>% as.data.table().

microbenchmark(
data.table.V1 = dt[c4 <= 0.5, c5 := 2],
data.table.V2 = set(x = dt, i = which(dt$c4 <= 0.5), j = "c5", value = 2),
dplyr = dt %>% mutate(c5 = replace(x = c5, list = c4 <= 0.5, values = 2)) %>% as.data.table(),
times = 1000
)
## Unit: microseconds
##           expr      min       lq     mean   median       uq      max neval
##  data.table.V1 1106.839 1675.383 2279.759 1972.356 2205.444 47290.20  1000
##  data.table.V2  717.001 1001.305 1512.075 1259.916 1418.010 43642.51  1000
##          dplyr 2244.515 4628.627 5845.751 5652.471 6318.205 54228.00  1000

Here, again, the set function is the fastest formulation, at around 1,500 microseconds on average; the basic data.table formulation is somewhat slower, at around 2,300 microseconds, whereas the dplyr formulation takes about four times as long as set().

##### No 4 – Manipulating a subset of a column with a dynamic value

Instead of setting a constant as the new value, in most cases we want to set a dynamic value based on another column. The data.table command for this is dt[c4 <= 0.5, c5 := c5 * 2]. The corresponding set formulation is set(x = dt, i = which(dt$c4 <= 0.5), j = "c5", value = dt$c5[which(dt$c4 <= 0.5)] * 2). With dplyr, the formulation is a bit more involved: dt %>% mutate(c5 = replace(x = c5, list = c4 <= 0.5, values = dt$c5[which(dt$c4 <= 0.5)] * 2)) %>% as.data.table().

microbenchmark(
data.table.V1 = dt[c4 <= 0.5, c5 := c5 * 2],
data.table.V2 = set(x = dt, i = which(dt$c4 <= 0.5), j = "c5", value = dt$c5[which(dt$c4 <= 0.5)] * 2),
dplyr = dt %>% mutate(c5 = replace(x = c5, list = c4 <= 0.5, values = dt$c5[which(dt$c4 <= 0.5)] * 2)) %>% as.data.table(),
times = 1000
)
## Unit: milliseconds
##           expr      min       lq     mean   median       uq      max neval
##  data.table.V1 1.405963 2.075418 2.954876 2.514958 2.978378 52.97853  1000
##  data.table.V2 1.912861 2.726847 3.984105 3.386690 3.853750 74.27546  1000
##          dplyr 3.907851 7.147018 9.107210 8.235551 9.628549 52.82997  1000

The basic data.table formulation using := outperforms both other variants. The alternative data.table formulation is only slightly slower; the difference in medians is below one millisecond. The dplyr formulation, however, needs more than three times as long as data.table version 1.
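The dplyr call above can be written more directly with if_else(), which avoids indexing dt from inside the pipe. We did not benchmark this variant, so treat it as an untimed alternative sketched on toy data:

```r
library(data.table)
library(dplyr)

dt <- data.table(c4 = c(0.2, 0.8, 0.4), c5 = c(1, 1, 1))
res <- dt %>%
  mutate(c5 = if_else(c4 <= 0.5, c5 * 2, c5)) %>%  # double c5 where c4 <= 0.5
  as.data.table()
res$c5  # 2 1 2
```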

##### No 5 – Subsetting rows by value

Probably the most frequent manipulation of data.tables is selecting a subset of rows based on a condition. The common data.table formulation is dt[c4 <= 0.5]. For this purpose, base R provides the function subset(); with dplyr, the function filter() is used.

microbenchmark(
data.table.V1 = dt[c4 <= 0.5],
data.table.V2 = subset(dt, c4 <= 0.5),
dplyr = dt %>% filter(c4 <= 0.5) %>% as.data.table(),
times = 1000
)
## Unit: milliseconds
##           expr      min       lq     mean   median       uq      max neval
##  data.table.V1 1.531858 2.635110 3.621530 3.360079 3.885557 85.05864  1000
##  data.table.V2 2.411013 4.360392 5.212325 4.960347 5.617723 54.45643  1000
##          dplyr 2.574928 4.418532 5.741201 5.461809 6.096123 52.63277  1000

The results show a clear performance benefit for the data.table formulation. The dplyr function is even slower than the base function subset() on average, but it shows the smallest maximum run time of the three.

##### No 6 – Subsetting rows by row numbers

Another way to subset a data.table is by row numbers. The easiest way is the data.table formulation, e.g., dt[101:1100] to select the 1,000 rows numbered 101 to 1100. The comparable dplyr function is slice(): dt %>% slice(101:1100) %>% as.data.table().

microbenchmark(
data.table = dt[101:1100],
dplyr = dt %>% slice(101:1100) %>% as.data.table(),
times = 1000
)
## Unit: microseconds
##        expr     min       lq     mean   median       uq     max neval
##  data.table 183.044 235.3270 313.0127 271.9340  337.113 3099.24  1000
##       dplyr 616.666 751.6295 935.5561 844.9255 1001.760 6009.97  1000

The results show that the data.table formulation is on average about three times faster than dplyr, and the respective quantiles are also roughly three times lower. Relative to its median, however, the maximum run time of the dplyr function again varies less.

##### What can we learn from this benchmark?

We are aware that these benchmark examples do not represent an exhaustive comparison of the different approaches to manipulating data.tables, and there is no single conclusion to draw from them. But we learned that using a more sophisticated package is not always an improvement; sometimes it is better to stick to simple base R commands. With the microbenchmark package, it is easy to see how your implementation compares to others. We suggest considering all alternatives for your data handling problems and exploiting their respective advantages in different settings.

The post Benchmarking the six most used manipulations for data.tables in R appeared first on Opremic.