The performance of dplyr blows plyr out of the water

Posted on January 22, 2014 by Paul Hiemstra in R bloggers

[This article was first published on NumberTheory » R stuff, and kindly contributed to R-bloggers].

Together with many other packages written by Hadley Wickham, plyr is a package that I use a lot for data processing. The syntax is clean, and it works great for breaking down larger data.frames into smaller summaries. The greatest disadvantage of plyr is its performance. On StackOverflow, the usual advice is to use plyr for its syntax, but to switch to data.table when performance really matters.

Recently, Hadley released the successor to plyr: dplyr. dplyr provides the kind of performance you would expect from data.table, but with a syntax that stays closer to plyr. The following example illustrates the performance difference:

library(plyr)
library(dplyr)

# 10 million rows; two factors with 25 levels each give 625 groups
size = 10e6
no_levels = 25
dat = data.frame(num = runif(size),
                 factor1 = rep(LETTERS[1:no_levels], each = size / no_levels),
                 factor2 = rep(LETTERS[1:no_levels], size / no_levels))

# plyr solution
system.time(summary_ddply <- ddply(dat, .(factor1, factor2), summarise, mn = mean(num)))
#    user  system elapsed 
#   2.829   0.900   3.748 

# dplyr solution: group first, then time only the summarise call
data_per_factor = group_by(dat, factor1, factor2)
system.time(summary_dplyr <- summarise(data_per_factor, mn = mean(num)))
#    user  system elapsed 
#   0.097   0.000   0.098
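
Before comparing timings, it is worth checking that both approaches return the same group means. The snippet below is a minimal sketch of such a check, not part of the original benchmark: it assumes a smaller data set (100,000 rows) and a fixed seed, and the names dat_small, res_plyr and res_dplyr are made up for illustration. Both results are sorted explicitly so the comparison does not depend on the row order either package returns.

```r
library(plyr)
library(dplyr)

# Smaller data set than the benchmark (assumption: 1e5 rows is enough for a check)
set.seed(1)
size = 1e5
no_levels = 25
dat_small = data.frame(num = runif(size),
                       factor1 = rep(LETTERS[1:no_levels], each = size / no_levels),
                       factor2 = rep(LETTERS[1:no_levels], size / no_levels))

# Same summaries as above, with the package stated explicitly for each summarise
res_plyr = ddply(dat_small, .(factor1, factor2), plyr::summarise, mn = mean(num))
res_dplyr = as.data.frame(ungroup(
  dplyr::summarise(group_by(dat_small, factor1, factor2), mn = mean(num))))

# Sort both results by the grouping columns before comparing
res_plyr = res_plyr[order(res_plyr$factor1, res_plyr$factor2), ]
res_dplyr = res_dplyr[order(res_dplyr$factor1, res_dplyr$factor2), ]

# TRUE if the group means agree up to floating point tolerance
same_means = isTRUE(all.equal(res_plyr$mn, res_dplyr$mn))
same_means
```

Using all.equal rather than identical allows for tiny floating point differences between the two implementations of the mean.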

In this case, dplyr is about 38x faster (3.748 s versus 0.098 s elapsed). Some log file processing I did recently was even sped up by a factor of 1000. dplyr is an exciting new development that promises to be the single most influential new package since ggplot2.
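
The 38x figure is simply the ratio of the two elapsed times reported by system.time above. As a small sketch, with the values copied by hand from that output (the variable names are only illustrative):

```r
# Elapsed times in seconds, copied from the system.time() output above
elapsed_plyr = 3.748
elapsed_dplyr = 0.098

# Speedup of dplyr relative to plyr
speedup = elapsed_plyr / elapsed_dplyr
round(speedup, 1)
# [1] 38.2
```

When rerunning the benchmark, the same ratio can be taken from the "elapsed" element of each system.time result; the exact number will vary by machine and package version.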
