A speed test comparison of plyr, data.table, and dplyr

Guest post by Jake Russ

For a recent project I needed to make a simple sum calculation on a rather large data frame (0.8 GB, 4+ million rows, and ~80,000 groups). As an avid user of Hadley Wickham’s packages, my first thought was to use plyr. However, the job took plyr roughly 13 hours to complete.

plyr is extremely efficient and user-friendly for most problems, so it was clear to me that I was using it for something it wasn’t meant to do, but I didn’t know of any alternative tools to reach for.

I asked for some help on the manipulator Google group, and the feedback there led me to data.table and to dplyr, a new, still-in-progress package by Hadley.

What follows is a speed comparison of these three packages, incorporating all of the feedback from the folks on that group. They found it informative, so Tal asked me to write it up as a reproducible example.


Let’s start by making a data frame that fits the description above, but in a reproducible way:

set.seed(42)
 
types <- c("A", "B", "C", "D", "E", "F")
 
obs <- 4e+07
 
one <- data.frame(id = as.factor(seq(from = 1, to = 80000, by = 1)),
    percent = round(runif(obs, min = 0, max = 1), digits = 2),
    type = as.factor(sample(types, obs, replace = TRUE)))
 
print(object.size(one), units = "GB")
## 0.6 Gb
summary(one)
##        id              percent     type       
##  1      :     500   Min.   :0.00   A:6672132  
##  2      :     500   1st Qu.:0.25   B:6663570  
##  3      :     500   Median :0.50   C:6668009  
##  4      :     500   Mean   :0.50   D:6668684  
##  5      :     500   3rd Qu.:0.75   E:6660437  
##  6      :     500   Max.   :1.00   F:6667168  
##  (Other):39997000
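
A quick aside (my addition, not from the original post): with obs set to 4e+07 and only 80,000 distinct ids, data.frame() recycles the id vector, so each id appears 500 times, which is exactly what the summary above shows. A cheap way to confirm the structure:

length(levels(one$id))  # 80000 distinct groups
obs/length(levels(one$id))  # 500 rows per id, matching the summary counts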

I’ll start the testing with plyr, using ddply, but I’ll also show the difference between subsetting a data frame from within a ddply call and doing the subset first, outside the call. Then I offer a third approach, using plyr’s count function, to achieve the same result.

library(plyr)
 
## Test 1 (plyr): Use ddply and subset one with [ ] style indexing from
## within the ddply call.
 
typeSubset <- c("A", "C", "E")
 
system.time(test1 <- ddply(one[one$type %in% typeSubset, ], .(id), summarise, 
    percent_total = sum(percent)))
##    user  system elapsed 
##  104.51   21.23  125.81
## Test 2 (plyr): Use ddply but subset one outside of the ddply call
 
two <- subset(one, type %in% typeSubset)
 
system.time(test2 <- ddply(two, .(id), summarise, percent_total = sum(percent)))
##    user  system elapsed 
##  101.20   46.14  147.64
## Test 3 (plyr): For a simple sum, an alternative is to use plyr's count
## function
 
system.time(test3 <- count(two, "id", "percent"))
##    user  system elapsed 
##    5.90    0.22    6.12

Doing the subset outside of the ddply call did speed things up, but not as much as I originally thought it would. For my particular project, doing the subset outside of the ddply call reduced the run time to 12 hours. So this is still largely a “wrong tool” problem, rather than a “when to subset” problem.

Next, I’ll try data.table. For this test, and for the dplyr test below, I’ll operate on the data frame that has already been subset:

library(data.table)
 
## Test 4 (data.table): Speed test for package data.table
 
## Define the data table
three <- data.table(two, key = c("id"))
 
tables()  # check that the key columns are correct
##      NAME        NROW  MB COLS            KEY
## [1,] three 20,000,578 310 id,percent,type id 
## Total: 310MB
## Operate on it
system.time(test4 <- three[, list(percent_total = sum(percent)), by = key(three)])
##    user  system elapsed 
##    0.17    0.01    0.19
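
As an aside (my addition, untimed and not part of the original benchmark), data.table can also do the filtering and the aggregation in a single step on the full data frame, so the pre-subset copy isn’t strictly necessary. A minimal sketch:

## Aside: filter and aggregate in one step on the full table
oneDT <- data.table(one, key = "id")
test4b <- oneDT[type %in% typeSubset, list(percent_total = sum(percent)),
    by = key(oneDT)]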

dplyr is not currently available on CRAN, but you can install it from GitHub with:

devtools::install_github("assertthat")
devtools::install_github("dplyr")
library(dplyr)
 
## Test 5 (dplyr): Speed test for package dplyr
 
fourDf <- group_by(two, id)
 
system.time(test5 <- summarise(fourDf, percent_total = sum(percent)))
##    user  system elapsed 
##    1.49    0.03    1.52
 
sessionInfo()
## R version 3.0.1 (2013-05-16)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## 
## locale:
## [1] LC_COLLATE=English_United States.1252 
## [2] LC_CTYPE=English_United States.1252   
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.1252    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] dplyr_0.01       Rcpp_0.10.4      data.table_1.8.8 plyr_1.8        
## [5] knitr_1.4.1     
## 
## loaded via a namespace (and not attached):
## [1] assertthat_0.1 digest_0.6.3   evaluate_0.4.7 formatR_0.9   
## [5] stringr_0.6.2  tools_3.0.1
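
As a final sanity check (my addition, not in the original post), the three fast approaches should all produce the same per-id totals. Merging on id avoids relying on the row order of each result:

## Sanity check: plyr's count(), data.table, and dplyr should agree
check <- merge(test3, as.data.frame(test4), by = "id")
check <- merge(check, as.data.frame(test5), by = "id")
# After the second merge, data.table's total is percent_total.x and dplyr's is percent_total.y
all.equal(check$freq, check$percent_total.x)  # plyr count vs data.table
all.equal(check$freq, check$percent_total.y)  # plyr count vs dplyr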

Both data.table and dplyr reduced the run time to a couple of seconds or less. If you’re looking for pure speed, data.table is the clear winner. However, it is my understanding that data.table’s syntax can be frustrating, so if you’re already used to the ‘Hadley ecosystem’ of packages, dplyr is a formidable alternative, even if it is still in the early stages of development.

Have a nice day!

