The performance of dplyr blows plyr out of the water

January 22, 2014

(This article was first published on NumberTheory » R stuff, and kindly contributed to R-bloggers)

Together with many other packages written by Hadley Wickham, plyr is a package that I use a lot for data processing. The syntax is clean, and it works great for breaking down larger data.frames into smaller summaries. The greatest disadvantage of plyr is its performance. On StackOverflow, the usual advice is to use plyr for the syntax, but to switch to data.table when you need real performance.

Hadley recently released the successor to plyr: dplyr. dplyr provides the kind of performance you would expect from data.table, but with a syntax that stays closer to plyr. The following example illustrates this performance difference:

library(plyr)
library(dplyr)

# 10 million rows, two grouping columns with 25 levels each (625 groups)
size = 10e6
no_levels = 25
dat = data.frame(num = runif(size),
                 factor1 = rep(LETTERS[1:no_levels], each = size / no_levels),
                 factor2 = rep(LETTERS[1:no_levels], size / no_levels))

# plyr solution
system.time(summary_ddply <- ddply(dat, .(factor1, factor2), summarise, mn = mean(num)))
#    user  system elapsed 
#   2.829   0.900   3.748 

# dplyr solution
data_per_factor = group_by(dat, factor1, factor2)
system.time(summary_dplyr <- summarise(data_per_factor, mn = mean(num)))
#    user  system elapsed 
#   0.097   0.000   0.098

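In this case, dplyr is about 38x faster (3.748 s versus 0.098 s elapsed). Note that the dplyr timing covers only the summarise call; the group_by step runs before the timer starts. The gain can be much larger: some log file processing I did recently was sped up by a factor of 1000. dplyr is an exciting new development that promises to be the single most influential new package since ggplot2.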
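Before trusting a speed comparison like this, it is worth checking that both approaches return the same numbers. The following is a minimal sketch of such a check, not part of the original benchmark. It assumes both plyr and dplyr are installed, uses a smaller data set (1e5 rows, my choice) so it runs quickly, and times group_by and summarise together. Calls are qualified with the package name because both packages export a function called summarise.

```r
# Sketch: check that plyr and dplyr give the same grouped means.
# Assumes plyr and dplyr are installed; the size of 1e5 rows is an
# arbitrary choice to keep the check fast.
set.seed(1)
size = 1e5
no_levels = 25
dat = data.frame(num = runif(size),
                 factor1 = rep(LETTERS[1:no_levels], each = size / no_levels),
                 factor2 = rep(LETTERS[1:no_levels], size / no_levels))

# plyr: split by both factors, take the mean of num in each piece
summary_ddply = plyr::ddply(dat, c("factor1", "factor2"),
                            plyr::summarise, mn = mean(num))

# dplyr: time grouping and summarising together
timing_dplyr = system.time(
  summary_dplyr <- dplyr::summarise(dplyr::group_by(dat, factor1, factor2),
                                    mn = mean(num))
)

# Put both results in the same row order before comparing
summary_ddply = summary_ddply[order(summary_ddply$factor1,
                                    summary_ddply$factor2), ]
summary_dplyr = as.data.frame(summary_dplyr)
summary_dplyr = summary_dplyr[order(summary_dplyr$factor1,
                                    summary_dplyr$factor2), ]

# TRUE if the group means agree up to numerical tolerance
same_means = isTRUE(all.equal(summary_ddply$mn, summary_dplyr$mn))
print(same_means)
print(timing_dplyr)
```

If same_means is TRUE, the two summaries are interchangeable and the timing difference is the only thing separating them. The timings themselves will vary by machine and package version.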
