New Timings for a Grouped In-Place Aggregation Task

January 2, 2020
By

I’d like to share some new timings on a grouped in-place aggregation task. A client of mine was seeing some slow performance, so I decided to time a very simple abstraction of one of the steps of their workflow.

Roughly, the task was to add in some derived per-group aggregation columns to a few million row data set. In the application the groups tend to be small session logs from many users. So the groups are numerous and small.

We can create an abstract version of such data in R as follows.

set.seed(2020)
n <- 1000000

mk_data <- function(n) {
  d <- data.frame(x = rnorm(n))
  d$g <- sprintf("level_%09g", 
                 sample.int(n, size = n, replace = TRUE))
  return(d)
}

d <- mk_data(n)

The sampling with replacement has an expected number of unique IDs in the ballpark of n/log(n) via the coupon collector’s problem. So we expect lots of small groups in such data.

Our task can be specified in rquery/rqdatatable notation as follows.

library(rqdatatable)

ops_rqdatatable <- local_td(d, name = 'd') %.>%
  extend(.,
         rn %:=% row_number(),
         cs %:=% cumsum(x),
         partitionby = 'g',
         orderby = 'x') %.>%
  order_rows(.,
             c('g', 'x'))

The key step is the extend(), which adds the new columns rn and cs in a per-g group manner in a by-x order. We feel the notation is learnable and expressive. (Note: normally we would use := for assignment, but as we are also running direct data.table examples we didn’t load this operator and instead used %:=% to stay out of data.table’s way.)

We translated the same task in to several different notations: data.table, dplyr, dtplyr, and data_algebra. The observed task times are given below.

Mean task run times in second (smaller is better)
Method Interface Language Data Engine Mean run time in seconds
rqdatatable R data.table 3.8
data.table R data.table 2.1
dplyr R dplyr 35.1
dtplyr R data.table 5.1
data_algebra Python Pandas 17.1

What is missing is a direct Pandas timing (to confirm if the length of the Python run-time is from data_algebra overhead or from the underlying Pandas engine).

What stands out is how fast data.table, and even the data.table based methods, are compared to all other methods.

Details of the benchmark runs (methods, code, data, versions, and so on) can be found here.



If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Comments are closed.

Search R-bloggers

Sponsors

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)