# The performance of dplyr blows plyr out of the water

January 22, 2014
(This article was first published on NumberTheory » R stuff, and kindly contributed to R-bloggers)

Together with many other packages written by Hadley Wickham, plyr is a package that I use a lot for data processing. The syntax is clean, and it works great for breaking down larger data.frames into smaller summaries. The greatest disadvantage of plyr is its performance. On StackOverflow, the usual answer is that you want plyr for the syntax, but that for real performance you need data.table.

Recently, Hadley released the successor to plyr: dplyr. dplyr provides the kind of performance you would expect from data.table, but with a syntax that stays closer to plyr. The following example illustrates the performance difference on a data.frame of ten million rows, summarised per combination of two grouping columns:

library(plyr)
library(dplyr)

size = 10e6
no_levels = 25
dat = data.frame(num = runif(size),
                 factor1 = rep(LETTERS[1:no_levels], each = size / no_levels),
                 factor2 = rep(LETTERS[1:no_levels], size / no_levels))

# plyr solution
system.time(summary_ddply <- ddply(dat, .(factor1, factor2), summarise, mn = mean(num)))
#    user  system elapsed
#   2.829   0.900   3.748

# dplyr solution
data_per_factor = group_by(dat, factor1, factor2)
system.time(summary_dplyr <- summarise(data_per_factor, mn = mean(num)))
#    user  system elapsed
#   0.097   0.000   0.098
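
Before comparing timings, it is worth confirming that both approaches return the same numbers. The following is a minimal sketch of such a check, not part of the original benchmark. It assumes a much smaller data set than the one above (the `size` and `no_levels` values here are hypothetical, chosen so it runs quickly), calls every function with an explicit namespace so the result does not depend on the order in which plyr and dplyr are loaded, and passes the grouping columns to ddply as a character vector instead of using `.()`.

```r
set.seed(1)

# Hypothetical, smaller version of the benchmark data
size <- 1e4
no_levels <- 5
dat <- data.frame(num = runif(size),
                  factor1 = rep(LETTERS[1:no_levels], each = size / no_levels),
                  factor2 = rep(LETTERS[1:no_levels], size / no_levels))

# plyr: split by both columns, summarise each piece, recombine
summary_ddply <- plyr::ddply(dat, c("factor1", "factor2"),
                             plyr::summarise, mn = mean(num))

# dplyr: group first, then summarise; convert to a plain data.frame
summary_dplyr <- as.data.frame(
  dplyr::summarise(dplyr::group_by(dat, factor1, factor2), mn = mean(num))
)

# Sort both results the same way so rows line up regardless of package defaults
a <- summary_ddply[order(summary_ddply$factor1, summary_ddply$factor2), ]
b <- summary_dplyr[order(summary_dplyr$factor1, summary_dplyr$factor2), ]

# Compare the group means with a numerical tolerance
same_means <- isTRUE(all.equal(a$mn, b$mn))
same_means
```

If `same_means` is `TRUE`, the two packages agree on every group mean and the timing comparison is like for like.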

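In this case, dplyr is about 38x faster (3.748 versus 0.098 seconds elapsed); note that the dplyr timing covers only the summarise call, not the preceding group_by. In some log file processing I did recently, the speedup was as large as a factor of 1000. dplyr is an exciting new development that promises to be the single most influential new package since ggplot2.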
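
Because the dplyr timing above wraps only the summarise call, readers reproducing the comparison may also want an end-to-end figure. The sketch below is one way to do that under stated assumptions: it uses a smaller, hypothetical `size` so it finishes quickly, times group_by and summarise together, and computes the speedup as a ratio of elapsed times. The numbers it produces will depend on the machine and package versions, so none are quoted here.

```r
set.seed(42)

# Hypothetical, smaller data set with the same layout as the benchmark
size <- 1e5
no_levels <- 25
dat <- data.frame(num = runif(size),
                  factor1 = rep(LETTERS[1:no_levels], each = size / no_levels),
                  factor2 = rep(LETTERS[1:no_levels], size / no_levels))

# plyr: ddply does the splitting and the summarising in one call
t_plyr <- system.time(
  plyr::ddply(dat, c("factor1", "factor2"), plyr::summarise, mn = mean(num))
)

# dplyr: include the grouping step inside the timed expression
t_dplyr <- system.time(
  dplyr::summarise(dplyr::group_by(dat, factor1, factor2), mn = mean(num))
)

# Ratio of wall-clock times; can be Inf if the dplyr call is too fast to measure
speedup <- t_plyr[["elapsed"]] / t_dplyr[["elapsed"]]
speedup
```

For very short runs the elapsed time is dominated by measurement noise, so a larger `size` or several repetitions give a more stable ratio.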