I’d like to share some new timings on a grouped in-place aggregation task. A client of mine was seeing some slow performance, so I decided to time a very simple abstraction of one of the steps of their workflow.

Roughly, the task was to add some derived per-group aggregation columns to a data set of a few million rows. In the application, the groups tend to be small session logs from many users, so the groups are numerous and small.

We can create an abstract version of such data in R as follows.

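```
set.seed(2020)
n <- 1000000

# build n rows: a numeric value x and a group label g
mk_data <- function(n) {
  d <- data.frame(x = rnorm(n))
  # group ids are drawn with replacement, so many ids repeat or go unused
  d$g <- sprintf("level_%09g",
                 sample.int(n, size = n, replace = TRUE))
  return(d)
}

d <- mk_data(n)
```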

Sampling with replacement gives an expected number of unique IDs of about `n * (1 - 1/e)`, or roughly 63% of `n`, by the standard occupancy calculation (a relative of the coupon collector’s problem). So we expect lots of small groups in such data.
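As a rough check on the group structure, the following base-R sketch counts the distinct IDs in one draw and compares that count with the occupancy expectation `n * (1 - (1 - 1/n)^n)`. The variable names are illustrative and not from the original benchmark, and a smaller `n` is assumed to keep the check quick.

```r
# Illustrative check only; not part of the original benchmark code.
set.seed(2020)
n <- 100000  # smaller than the benchmark's n, assumed here for speed

# draw group ids the same way the data generator does
ids <- sample.int(n, size = n, replace = TRUE)

# distinct ids actually observed in this draw
observed_unique <- length(unique(ids))

# each id is missed with probability (1 - 1/n)^n, so the expected
# number of distinct ids is n times the complement of that
expected_unique <- n * (1 - (1 - 1/n)^n)

# average number of rows per observed group
mean_group_size <- n / observed_unique

print(c(observed = observed_unique, expected = round(expected_unique)))
print(mean_group_size)
```

The average group size comes out well under two rows, which is the "numerous and small" regime the timing is meant to probe.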

Our task can be specified in rquery/rqdatatable notation as follows.

```
library(rqdatatable)

ops_rqdatatable <- local_td(d, name = 'd') %.>%
  extend(.,
         rn %:=% row_number(),
         cs %:=% cumsum(x),
         partitionby = 'g',
         orderby = 'x') %.>%
  order_rows(.,
             c('g', 'x'))
```

The key step is the `extend()`, which adds the new columns `rn` and `cs` in a per-`g` group manner in a by-`x` order. We feel the notation is learnable and expressive. (Note: normally we would use `:=` for assignment, but as we are also running direct data.table examples we didn’t load this operator and instead used `%:=%` to stay out of data.table’s way.)
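To make the intended result concrete, here is a minimal base-R sketch of the same calculation on a tiny hand-made data frame. It is not the rquery/rqdatatable implementation; the data frame `d_small` and the use of `ave()` are assumptions chosen purely for illustration.

```r
# Hypothetical toy data: two groups, values deliberately out of order.
d_small <- data.frame(
  x = c(3, 1, 2, 5, 4),
  g = c("a", "a", "b", "b", "a"),
  stringsAsFactors = FALSE
)

# order by group, then by x within group
d_small <- d_small[order(d_small$g, d_small$x), , drop = FALSE]
rownames(d_small) <- NULL

# per-group row number and per-group running sum, taken in the by-x order
d_small$rn <- ave(d_small$x, d_small$g, FUN = seq_along)
d_small$cs <- ave(d_small$x, d_small$g, FUN = cumsum)

print(d_small)
```

Within group `a` the rows are numbered 1 to 3 and `cs` accumulates the sorted `x` values; the numbering and the running sum restart for group `b`.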

We translated the same task into several different notations: data.table, dplyr, dtplyr, and data_algebra. The observed task times are given below.

Mean task run times in seconds (smaller is better)

| Method | Interface Language | Data Engine | Mean run time in seconds |
|---|---|---|---|
| rqdatatable | R | data.table | 3.8 |
| data.table | R | data.table | 2.1 |
| dplyr | R | dplyr | 35.1 |
| dtplyr | R | data.table | 5.1 |
| data_algebra | Python | Pandas | 17.1 |

What is missing is a direct Pandas timing, which would show whether the long Python run time comes from data_algebra overhead or from the underlying Pandas engine.

What stands out is how fast data.table, and even the data.table-based methods, are compared to all the other methods: direct data.table is roughly 17 times faster than dplyr on this task.
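For readers who want to see what a direct data.table formulation of this task might look like, here is a hedged sketch. It assumes the data.table package is installed and uses a smaller `n` than the benchmark; the function name `f_data_table` is hypothetical, and this is an illustration rather than the code used to produce the reported timings.

```r
library(data.table)

# Small stand-in data, built the same way as the benchmark data.
set.seed(2020)
n <- 10000  # assumed smaller size so the sketch runs quickly
d <- data.frame(x = rnorm(n))
d$g <- sprintf("level_%09g", sample.int(n, size = n, replace = TRUE))

# Hypothetical helper: sort, then add per-group columns by reference.
f_data_table <- function(d) {
  dt <- as.data.table(d)       # copies, so the caller's d is untouched
  setorderv(dt, c("g", "x"))   # sort by group, then by x within group
  dt[, `:=`(rn = seq_len(.N),  # row number within group
            cs = cumsum(x)),   # running sum of x within group
     by = g]
  dt[]
}

res <- f_data_table(d)
print(head(res))

# Timings vary by machine and data size; this only shows how one
# might measure a single run.
print(system.time(f_data_table(d)))
```

Sorting once up front and then assigning with `:=` grouped by `g` avoids building per-group intermediate tables, which is one plausible reason this style does well when groups are numerous and small.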

Details of the benchmark runs (methods, code, data, versions, and so on) can be found here.
