# Benchmarking the six most used manipulations for data.tables in R

Posted on April 6, 2018 by Sebastian Vock in R bloggers

This article was first published on **R – Opremic** and kindly contributed to R-bloggers.


#### Comparing formulations of the data.table package with base R and dplyr formulations

Almost everybody who handles large data sets in R is familiar with the data.table package. It provides several functions to subset, merge, and manipulate tabular data. When we started using the package, we were quite confused by the different commands that achieve the same result, e.g., for subsetting the columns of a table: `dt[, .(c1, c2)]` vs. `dt[, c("c1", "c2"), with = F]`. Once we were more familiar with data.tables, we always used whichever expression was easier to write, `dt[, .(c1, c2)]`, and never reconsidered that decision and those habits. Until now…

While profiling a Shiny application in which large data sets are aggregated and prepared for a dashboard, we finally put those old decisions and habits to a performance test. We want to share the insights with you.

The data.table used in all following benchmarks has 4 columns and 100,000 rows. All values are numeric.

```r
dt
##                c1        c2         c3         c4
##      1: 0.7958596 0.7835304 0.62645236 0.33154659
##      2: 0.7644930 0.5094924 0.91021167 0.02995171
##      3: 0.6535826 0.4156394 0.13558778 0.55281306
##      4: 0.8383642 0.9559514 0.30899163 0.57564810
##      5: 0.7829403 0.2674898 0.45422002 0.86812407
##     ---
##  99996: 0.6348011 0.7716307 0.48852835 0.18436878
##  99997: 0.1145093 0.4689724 0.62741292 0.77253526
##  99998: 0.5849055 0.3143366 0.14897752 0.77522017
##  99999: 0.9822468 0.5498836 0.04301053 0.68252273
## 100000: 0.3571940 0.3759371 0.58782003 0.45850694
```
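The post does not show how `dt` was built. A minimal sketch that produces a table of this shape (the seed and the use of `runif()` are our assumptions, not taken from the original):

```r
library(data.table)

# Sketch of a comparable setup; the original post does not show its
# data generation, so the seed and column values are assumptions.
set.seed(42)
n  <- 100000
dt <- data.table(c1 = runif(n), c2 = runif(n),
                 c3 = runif(n), c4 = runif(n))
dim(dt)  # 100000 rows, 4 columns
```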

##### No 1 – Subsetting columns in a data table

In general there are three ways to select a subset of columns from a data.table: `dt[, .(c1, c2)]`, `dt[, c("c1", "c2"), with = F]`, and the `select()` function from the dplyr package: `select(dt, c("c1", "c2"))`.

```r
microbenchmark(
  data.table.V1 = dt[, .(c1, c2)],
  data.table.V2 = dt[, c("c1", "c2"), with = F],
  dplyr         = dt %>% select(c("c1", "c2")) %>% as.data.table(),
  times = 1000
)
## Unit: microseconds
##           expr      min        lq      mean    median       uq      max neval
##  data.table.V1  603.761  737.2435 1331.1952  838.8315 1139.178 47501.32  1000
##  data.table.V2  329.487  420.0970  924.7818  496.0470  698.195 42546.43  1000
##          dplyr 1004.111 1213.6990 1924.8967 1347.6495 1738.800 43928.10  1000
```

As the results show, the second formulation, `dt[, c("c1", "c2"), with = F]`, is the fastest, at around 900 microseconds on average. Its median of around 500 microseconds is even lower than the minimum time of the first formulation. The dplyr command is much slower, although its spread relative to the median is somewhat smaller.
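Before comparing speed it is worth confirming that the formulations are actually interchangeable. A quick sanity check (our addition, not part of the original benchmark) could look like this:

```r
library(data.table)
library(dplyr)

dt <- data.table(c1 = runif(10), c2 = runif(10),
                 c3 = runif(10), c4 = runif(10))

# All three formulations should yield the same two-column data.table:
r1 <- dt[, .(c1, c2)]
r2 <- dt[, c("c1", "c2"), with = F]
r3 <- dt %>% select(c1, c2) %>% as.data.table()
isTRUE(all.equal(r1, r2)) && isTRUE(all.equal(r1, r3))
```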

##### No 2 – Manipulating complete columns

A common way to manipulate complete columns is to change them by reference: `dt[, c5 := runif(1:100000)]`. The data.table package also contains an alternative formulation via the function `set()`; we therefore tested `set(x = dt, j = "c5", value = runif(1:100000))` as well, and, using dplyr, `dt %>% mutate(c5 = newCol) %>% as.data.table()`. As a base R baseline we additionally compare `cbind(dt, newCol)`.

```r
newCol <- runif(1:100000)

microbenchmark(
  data.table.V1 = dt[, c5 := newCol],
  data.table.V2 = set(x = dt, j = "c5", value = newCol),
  dplyr         = dt %>% mutate(c5 = newCol) %>% as.data.table(),
  cbind         = cbind(dt, newCol),
  times = 1000
)
## Unit: microseconds
##           expr      min       lq      mean    median        uq      max neval
##  data.table.V1  278.343  422.800  958.1093  492.0660  861.9235 51296.59  1000
##  data.table.V2   61.976   95.912  618.7856  124.0265  554.5730 45369.20  1000
##          dplyr 1134.843 1401.289 3983.8024 2069.5225 3541.3365 69168.02  1000
##          cbind 1465.393 2822.361 8907.3423 4338.8280 6456.5795 68281.18  1000
```

Here, the most convenient formulation with `:=` is also outperformed by the alternative data.table formulation: on average, `set()` is nearly twice as fast, and its median is less than a fourth of the `:=` median. Again, the dplyr alternative is clearly outperformed by both data.table variants, and the base R function `cbind()` is the slowest of all.
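In our experience, the low per-call overhead of `set()` pays off most inside loops that touch many columns one at a time. An illustrative sketch (the table and transformation are hypothetical):

```r
library(data.table)

dt <- data.table(c1 = runif(1e5), c2 = runif(1e5), c3 = runif(1e5))

# set() skips the overhead of [.data.table, which adds up over iterations:
for (j in names(dt)) {
  set(dt, j = j, value = round(dt[[j]], 2))  # update each column by reference
}
```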

##### No 3 – Manipulating only a subset of a column with a constant

Sometimes we want to change some values in a column based on the values of another column. The formulations from the previous example are easily extended for this purpose: `dt[c4 <= 0.5, c5 := 2]`, `set(x = dt, i = which(dt$c4 <= 0.5), j = "c5", value = 2)`, and `dt %>% mutate(c5 = replace(x = c5, list = c4 <= 0.5, values = 2)) %>% as.data.table()`.

```r
newCol <- runif(1:100000)

microbenchmark(
  data.table.V1 = dt[c4 <= 0.5, c5 := 2],
  data.table.V2 = set(x = dt, i = which(dt$c4 <= 0.5), j = "c5", value = 2),
  dplyr         = dt %>% mutate(c5 = replace(x = c5, list = c4 <= 0.5,
                                             values = 2)) %>% as.data.table(),
  times = 1000
)
## Unit: microseconds
##           expr      min       lq     mean   median       uq      max neval
##  data.table.V1 1106.839 1675.383 2279.759 1972.356 2205.444 47290.20  1000
##  data.table.V2  717.001 1001.305 1512.075 1259.916 1418.010 43642.51  1000
##          dplyr 2244.515 4628.627 5845.751 5652.471 6318.205 54228.00  1000
```

Here, again, the `set()` function is the fastest formulation, at around 1,500 microseconds on average; the basic data.table formulation is slightly slower at around 2,300 microseconds, whereas the dplyr formulation takes almost four times as long as `set()`.

##### No 4 – Manipulating a subset of a column with a dynamic value

Instead of setting a constant as the new value, in most cases we want to set a dynamic value derived from another column. The data.table command for this is `dt[c4 <= 0.5, c5 := c5 * 2]`. The corresponding `set()` formulation is `set(x = dt, i = which(dt$c4 <= 0.5), j = "c5", value = dt$c5[which(dt$c4 <= 0.5)] * 2)`. With dplyr the formulation is a bit more complicated: `dt %>% mutate(c5 = replace(x = c5, list = c4 <= 0.5, values = dt$c5[which(dt$c4 <= 0.5)] * 2)) %>% as.data.table()`.

```r
microbenchmark(
  data.table.V1 = dt[c4 <= 0.5, c5 := c5 * 2],
  data.table.V2 = set(x = dt, i = which(dt$c4 <= 0.5), j = "c5",
                      value = dt$c5[which(dt$c4 <= 0.5)] * 2),
  dplyr         = dt %>% mutate(c5 = replace(x = c5, list = c4 <= 0.5,
                                             values = dt$c5[which(dt$c4 <= 0.5)] * 2)) %>%
                  as.data.table(),
  times = 1000
)
## Unit: milliseconds
##           expr      min       lq     mean   median       uq      max neval
##  data.table.V1 1.405963 2.075418 2.954876 2.514958 2.978378 52.97853  1000
##  data.table.V2 1.912861 2.726847 3.984105 3.386690 3.853750 74.27546  1000
##          dplyr 3.907851 7.147018 9.107210 8.235551 9.628549 52.82997  1000
```

This time the basic data.table formulation using `:=` outperforms both other variants. The alternative `set()` formulation is not much slower (the difference in the medians is below one millisecond), whereas the dplyr formulation needs more than twice the time of data.table version 1.

##### No 5 – Subsetting rows by value

Probably the most used manipulation of data.tables is selecting a subset of rows based on a condition. The common data.table formulation is `dt[c4 <= 0.5]`. For the same purpose, base R contains the function `subset()`, and with dplyr the function `filter()` has to be used.

```r
microbenchmark(
  data.table.V1 = dt[c4 <= 0.5],
  data.table.V2 = subset(dt, c4 <= 0.5),
  dplyr         = dt %>% filter(c4 <= 0.5) %>% as.data.table(),
  times = 1000
)
## Unit: milliseconds
##           expr      min       lq     mean   median       uq      max neval
##  data.table.V1 1.531858 2.635110 3.621530 3.360079 3.885557 85.05864  1000
##  data.table.V2 2.411013 4.360392 5.212325 4.960347 5.617723 54.45643  1000
##          dplyr 2.574928 4.418532 5.741201 5.461809 6.096123 52.63277  1000
```

The results show a clear performance benefit for the native data.table syntax. The dplyr function is even slower than base R's `subset()`, but it is the variant with the smallest maximum run time.
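Beyond the variants benchmarked above, data.table can speed up repeated subsetting on the same column via keys, which replace the vector scan with a binary search. This was not part of the post's benchmark; a small sketch with a hypothetical table:

```r
library(data.table)

# Illustrative only: keys are not covered by the benchmark above.
dt <- data.table(id = rep(c("a", "b", "c"), each = 4), x = 1:12)
setkey(dt, id)   # physically sorts dt by id and marks it as keyed
dt[.("b")]       # keyed subset: binary search on id == "b"
dt[id == "b"]    # recent data.table versions can also use an auto-built index
```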

##### No 6 – Subsetting rows by row numbers

Another way to subset a data.table is by row numbers. The easiest way is the data.table formulation, e.g., `dt[101:1100]` to select only the 1,000 rows with numbers 101 to 1100. The comparable dplyr function is `slice()`: `dt %>% slice(101:1100) %>% as.data.table()`.

```r
microbenchmark(
  data.table = dt[101:1100],
  dplyr      = dt %>% slice(101:1100) %>% as.data.table(),
  times = 1000
)
## Unit: microseconds
##        expr     min       lq     mean   median       uq     max neval
##  data.table 183.044 235.3270 313.0127 271.9340  337.113 3099.24  1000
##       dplyr 616.666 751.6295 935.5561 844.9255 1001.760 6009.97  1000
```

The results show that the data.table formulation is on average three times faster than dplyr, and the quartiles are roughly three times faster as well. Relative to its median, however, the dplyr function again shows the smaller worst-case spread.

##### What can we learn from this benchmark

We are aware that these benchmark examples do not represent an exhaustive comparison of the different approaches to manipulating data.tables, and there is no single conclusion to draw from them. But we learned that using a more sophisticated package is not always an improvement; sometimes it is better to stick to simple base R commands. With the microbenchmark package, it is easy to see how your implementation compares to others. We suggest considering all alternatives for your data handling problems and exploiting their respective advantages in different settings.
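For readers who want to run such comparisons themselves, a minimal microbenchmark template needs only the expressions to compare (the names and expressions here are arbitrary placeholders):

```r
library(microbenchmark)

x <- runif(1e5)
microbenchmark(
  sqrt_builtin = sqrt(x),   # candidate implementation 1
  power_op     = x ^ 0.5,   # candidate implementation 2
  times = 100               # repetitions per expression
)
```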
