# A speed test comparison of plyr, data.table, and dplyr

September 23, 2013
By

(This article was first published on R-statistics blog » RR-statistics blog, and kindly contributed to R-bloggers)

Guest post by Jake Russ

For a recent project I needed to make a simple sum calculation on a rather large data frame (0.8 GB, 4+ million rows, and ~80,000 groups). As an avid user of Hadley Wickham’s packages, my first thought was to use `plyr`. However, the job took `plyr` roughly 13 hours to complete.

`plyr` is extremely efficient and user friendly for most problems, so it was clear to me that I was using it for something it wasn’t meant to do, but I didn’t know of any alternative screwdrivers to use.

I asked for some help on the manipulator Google group , and their feedback led me to `data.table` and `dplyr`, a new, and still in progress, package project by Hadley.

What follows is a speed comparison of these three packages incorporating all the feedback from the manipulator folks. They found it informative, so Tal asked me to write it up as a reproducible example.

Let’s start by making a data frame which fits my description above, but make it reproducible:

 ```set.seed(42)   types <- c("A", "B", "C", "D", "E", "F")   obs <- 4e+07   one <- data.frame(id = as.factor(seq(from = 1, to = 80000, by = 1)), percent = round(runif(obs, min = 0, max = 1), digits = 2), type = as.factor(sample(types, obs, replace = TRUE)))   print(object.size(one), units = "GB")```
 `## 0.6 Gb`
 `summary(one)`
 ```## id percent type ## 1 : 500 Min. :0.00 A:6672132 ## 2 : 500 1st Qu.:0.25 B:6663570 ## 3 : 500 Median :0.50 C:6668009 ## 4 : 500 Mean :0.50 D:6668684 ## 5 : 500 3rd Qu.:0.75 E:6660437 ## 6 : 500 Max. :1.00 F:6667168 ## (Other):39997000```

I’ll start the testing with `plyr`, using ddply, but I’ll also show the difference between subsetting a data frame from within a ddply call and doing the subset first from outside the call. Then I offer a third way to use `plyr`‘s count function to achieve the same result.

 ```library(plyr)   ## Test 1 (plyr): Use ddply and subset one with [ ] style indexing from ## within the ddply call.   typeSubset <- c("A", "C", "E")   system.time(test1 <- ddply(one[one\$type %in% typeSubset, ], .(id), summarise, percent_total = sum(percent)))```
 ```## user system elapsed ## 104.51 21.23 125.81```
 ```## Test 2 (plyr):, Use ddply but subset one outside of the ddply call   two <- subset(one, type %in% typeSubset)   system.time(test2 <- ddply(two, .(id), summarise, percent_total = sum(percent)))```
 ```## user system elapsed ## 101.20 46.14 147.64```
 ```## Test 3 (plyr): For a simple sum, an alternative is to use plyr's count ## function   system.time(test3 <- count(two, "id", "percent"))```
 ```## user system elapsed ## 5.90 0.22 6.12```

Doing the subset outside of the ddply call did speed things up, but not as much I as orinially thought it would. For my particular project, doing the subset outside of the ddply call reduced the run time to 12 hours. So largely this is still a “wrong tool” problem, rather than a “when to subset” problem.

Next, I’ll try `data.table` and for this test and the `dplyr` one below I’ll operate on the data frame which has been pre-subset:

 ```library(data.table)   ## Test 4 (data.table): Speed test for package data.table   ## Define the data table three <- data.table(two, key = c("id"))   tables() # check that the key columns are correct```
 ```## NAME NROW MB COLS KEY ## [1,] three 20,000,578 310 id,percent,type id ## Total: 310MB```
 ```## Operate on it system.time(test4 <- three[, list(percent_total = sum(percent)), by = key(three)])```
 ```## user system elapsed ## 0.17 0.01 0.19```

`dplyr` is not currently available on CRAN but you can install it from github with:

 ```devtools::install_github("assertthat") devtools::install_github("dplyr")```
 ```library(dplyr)   ## Test 5 (dplyr): Speed test for package dplyr   fourDf <- group_by(two, id)   system.time(test5 <- summarise(fourDf, percent_total = sum(percent)))```
 ```## user system elapsed ## 1.49 0.03 1.52```
 ```  sessionInfo()```
 ```## R version 3.0.1 (2013-05-16) ## Platform: x86_64-w64-mingw32/x64 (64-bit) ## ## locale: ## [1] LC_COLLATE=English_United States.1252 ## [2] LC_CTYPE=English_United States.1252 ## [3] LC_MONETARY=English_United States.1252 ## [4] LC_NUMERIC=C ## [5] LC_TIME=English_United States.1252 ## ## attached base packages: ## [1] stats graphics grDevices utils datasets methods base ## ## other attached packages: ## [1] dplyr_0.01 Rcpp_0.10.4 data.table_1.8.8 plyr_1.8 ## [5] knitr_1.4.1 ## ## loaded via a namespace (and not attached): ## [1] assertthat_0.1 digest_0.6.3 evaluate_0.4.7 formatR_0.9 ## [5] stringr_0.6.2 tools_3.0.1```

Both `data.table` and `dplyr` were able to reduce the problem to less than a few seconds. If you’re looking for pure speed `data.table` is the clear winner. However, it is my understanding that `data.table`‘s syntax can be frustrating, so if you’re already used to the ‘Hadley ecosystem’ of packages, `dplyr` is a formitable alternative, even if it is still in the early stages.

Have a nice day!

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...