data.table or data.frame?
[This article was first published on - R, and kindly contributed to R-bloggers].
I spent a portion of today trying to convince a colleague that there are times when the data.table package is faster than traditional methods in R. It took a few of the tests below to prove the point.
Generate a data.frame of characters and numbers to aggregate.
df <- data.frame(letters = as.character(sample(letters[1:10], 1e+08, replace = TRUE)),
numbers = sample(1:100, 1e+08, replace = TRUE))
head(df)
## letters numbers
## 1 f 69
## 2 j 65
## 3 h 29
## 4 c 69
## 5 j 12
## 6 e 65
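The data frame above has 1e+08 rows, which takes a while to build and needs several gigabytes of memory. For trying the comparison on a laptop, a smaller, reproducible version may be more convenient. This is a minimal sketch: the seed, the row count `n`, and the name `df_small` are assumptions for illustration and were not part of the original test.

```r
# Smaller, reproducible stand-in for the 1e+08-row data frame.
set.seed(42)   # assumption: any fixed seed works, 42 is arbitrary
n <- 1e6       # assumption: 100x fewer rows than the original test

df_small <- data.frame(
  letters = sample(letters[1:10], n, replace = TRUE),
  numbers = sample(1:100, n, replace = TRUE),
  stringsAsFactors = FALSE
)

# Inspect the column types and the first few rows.
str(df_small)
head(df_small)
```

Absolute timings on the smaller data will differ from the ones reported here, so only the relative ordering of the methods is comparable.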
Aggregate using the base R function aggregate.
start <- proc.time()
aggregate(numbers ~ letters, data = df, FUN = sum)
## letters numbers
## 1 a 504884636
## 2 b 504587923
## 3 c 505357057
## 4 d 505106809
## 5 e 504788174
## 6 f 505219078
## 7 g 504796095
## 8 h 504693166
## 9 i 505079861
## 10 j 505044118
aggregate_time <- proc.time() - start
aggregate_time
## user system elapsed
## 120.13 30.51 261.79
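The start/stop pattern with proc.time() can also be written with base R's system.time(), which evaluates an expression and returns the same user/system/elapsed values. A minimal sketch on a much smaller data frame (`df_demo` and its size are assumptions for illustration, not the data used above):

```r
# Small demo data so the example runs quickly.
set.seed(1)
df_demo <- data.frame(
  letters = sample(letters[1:10], 1e5, replace = TRUE),
  numbers = sample(1:100, 1e5, replace = TRUE),
  stringsAsFactors = FALSE
)

# system.time() wraps the expression; assigning inside it keeps the result.
timing <- system.time(
  result <- aggregate(numbers ~ letters, data = df_demo, FUN = sum)
)

result
timing[["elapsed"]]   # wall-clock seconds
timing[["sys.self"]]  # system time, the component compared in this post
```

Either approach measures the same thing; system.time() just avoids the bookkeeping variable.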
Aggregate using ddply from the package plyr.
require("plyr")
## Loading required package: plyr
start <- proc.time()
ddply(df, .(letters), summarize, sums = sum(numbers))
## letters sums
## 1 a 504884636
## 2 b 504587923
## 3 c 505357057
## 4 d 505106809
## 5 e 504788174
## 6 f 505219078
## 7 g 504796095
## 8 h 504693166
## 9 i 505079861
## 10 j 505044118
ddply_time <- proc.time() - start
ddply_time
## user system elapsed
## 22.04 27.38 192.99
Aggregate using the data.table package.
require("data.table")
## Loading required package: data.table
start <- proc.time()
dt <- data.table(df, key = "letters")
dt[, list(sums = sum(numbers)), by = c("letters")]
## letters sums
## 1: a 504884636
## 2: b 504587923
## 3: c 505357057
## 4: d 505106809
## 5: e 504788174
## 6: f 505219078
## 7: g 504796095
## 8: h 504693166
## 9: i 505079861
## 10: j 505044118
dt_time <- proc.time() - start
dt_time
## user system elapsed
## 7.102 7.017 55.957
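The data.table timing above includes the conversion from data.frame and the keying step. To see how much of the total is conversion and how much is the grouped sum, the two steps can be timed separately. A minimal sketch, assuming the data.table package is installed; `df_demo`, `dt_demo`, and the reduced row count are illustrative and not part of the original test:

```r
library(data.table)

# Small demo data so the example runs quickly.
set.seed(1)
df_demo <- data.frame(
  letters = sample(letters[1:10], 1e5, replace = TRUE),
  numbers = sample(1:100, 1e5, replace = TRUE),
  stringsAsFactors = FALSE
)

# Step 1: convert and key (this sorts the table by "letters").
convert_time <- system.time(
  dt_demo <- data.table(df_demo, key = "letters")
)

# Step 2: the grouped sum on the already-keyed table.
group_time <- system.time(
  sums <- dt_demo[, list(sums = sum(numbers)), by = "letters"]
)

sums
convert_time[["elapsed"]]
group_time[["elapsed"]]
```

On data this small both steps are nearly instantaneous, so the split is only informative at sizes closer to the original test.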
Comparison of the system times.
# how many times slower is aggregate
aggregate_time[2]/ddply_time[2]
## sys.self
## 1.114
aggregate_time[2]/dt_time[2]
## sys.self
## 4.347
# how many times slower is ddply
ddply_time[2]/aggregate_time[2]
## sys.self
## 0.8975
ddply_time[2]/dt_time[2]
## sys.self
## 3.902
# how many times slower is data.table
dt_time[2]/aggregate_time[2]
## sys.self
## 0.23
dt_time[2]/ddply_time[2]
## sys.self
## 0.2563
Based on 100 million observations (1e+08 rows), with the time to convert to a data.table included in the elapsed time.
- ddply requires ~0.8975x the system time of aggregate (about 10% less)
- aggregate requires ~4.347x the system time of data.table
- ddply requires ~3.902x the system time of data.table
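The ratios above use only the system component (element 2 of each proc.time() result). The same comparison can be made on elapsed (wall-clock) time, which is usually what the person waiting for the result notices. A minimal sketch, with the timings copied by hand from the output printed earlier; the vector names `elapsed` and `sys_time` are just for illustration:

```r
# Timings in seconds, copied from the outputs above.
elapsed  <- c(aggregate = 261.79, ddply = 192.99, data.table = 55.957)
sys_time <- c(aggregate = 30.51,  ddply = 27.38,  data.table = 7.017)

# Each method's time as a multiple of data.table's time.
elapsed_ratio <- elapsed / elapsed[["data.table"]]
sys_ratio     <- sys_time / sys_time[["data.table"]]

round(elapsed_ratio, 2)
round(sys_ratio, 2)
```

By elapsed time, aggregate comes out at roughly 4.7 times and ddply at roughly 3.4 times the data.table run, so the ordering is the same as for system time.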
Conclusion - data.table for the win.