(This article was first published on **- R**, and kindly contributed to R-bloggers)

I spent a portion of today trying to convince a colleague that there are times when the `data.table` package is faster than traditional methods in R. It took a few of the tests below to prove the point.

Generate a data.frame of `characters` and numbers for easy plotting.

```
df <- data.frame(letters = as.character(sample(letters[1:10], 1e+08, replace = TRUE)),
numbers = sample(1:100, 1e+08, replace = TRUE))
head(df)
## letters numbers
## 1 f 69
## 2 j 65
## 3 h 29
## 4 c 69
## 5 j 12
## 6 e 65
```
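Generating 1e+08 rows takes a while and needs a lot of memory, so a smaller, seeded variant can be handy for trying the comparisons before committing to the full run. The sketch below is an illustration only: the name `df_small`, the seed, and the row count `n` are assumptions and are not part of the original test.

```r
# Smaller, reproducible variant of the data above.
# The seed and n are illustrative choices, not values from the original post.
set.seed(42)
n <- 1e5

df_small <- data.frame(
  letters = sample(letters[1:10], n, replace = TRUE),
  numbers = sample(1:100, n, replace = TRUE),
  stringsAsFactors = FALSE  # keep the letters column as character
)

str(df_small)
```

Timings on a small data set will not scale linearly to the full one, so treat them as a sanity check rather than a benchmark.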

Aggregate using the base R function `aggregate`.

```
start <- proc.time()
aggregate(numbers ~ letters, data = df, FUN = sum)
## letters numbers
## 1 a 504884636
## 2 b 504587923
## 3 c 505357057
## 4 d 505106809
## 5 e 504788174
## 6 f 505219078
## 7 g 504796095
## 8 h 504693166
## 9 i 505079861
## 10 j 505044118
aggregate_time <- proc.time() - start
aggregate_time
## user system elapsed
## 120.13 30.51 261.79
```
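The `proc.time()` differencing used here can also be written with base R's `system.time()`, which wraps an expression and returns the same kind of timing object. A minimal sketch, assuming a small illustrative data set (`df_small`, the seed, and the row count are not from the original test):

```r
# Illustrative small data set; size and seed are assumptions for this sketch.
set.seed(42)
df_small <- data.frame(
  letters = sample(letters[1:10], 1e5, replace = TRUE),
  numbers = sample(1:100, 1e5, replace = TRUE)
)

# system.time() evaluates the expression and returns a "proc_time" object,
# the same class produced by subtracting two proc.time() calls.
timing <- system.time(
  result <- aggregate(numbers ~ letters, data = df_small, FUN = sum)
)

# Named access avoids having to remember which position is which.
timing[c("user.self", "sys.self", "elapsed")]
```

Indexing by name (`"elapsed"`, `"sys.self"`) makes it explicit which of the three timings is being compared.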

Aggregate using `ddply` from the package `plyr`.

```
require("plyr")
## Loading required package: plyr
start <- proc.time()
ddply(df, .(letters), summarize, sums = sum(numbers))
## letters sums
## 1 a 504884636
## 2 b 504587923
## 3 c 505357057
## 4 d 505106809
## 5 e 504788174
## 6 f 505219078
## 7 g 504796095
## 8 h 504693166
## 9 i 505079861
## 10 j 505044118
ddply_time <- proc.time() - start
ddply_time
## user system elapsed
## 22.04 27.38 192.99
```

Aggregate using the `data.table` package.

```
require("data.table")
## Loading required package: data.table
start <- proc.time()
dt <- data.table(df, key = "letters")
dt[, list(sums = sum(numbers)), by = c("letters")]
## letters sums
## 1: a 504884636
## 2: b 504587923
## 3: c 505357057
## 4: d 505106809
## 5: e 504788174
## 6: f 505219078
## 7: g 504796095
## 8: h 504693166
## 9: i 505079861
## 10: j 505044118
dt_time <- proc.time() - start
dt_time
## user system elapsed
## 7.102 7.017 55.957
```
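The timing above covers both the conversion to a keyed `data.table` and the aggregation itself. To see how the cost splits between the two steps, each can be timed separately. This is a sketch on a small illustrative data set (`df_small`, `dt_small`, the seed, and the row count are assumptions), not a rerun of the original test:

```r
library(data.table)

# Illustrative small data set; size and seed are assumptions for this sketch.
set.seed(42)
df_small <- data.frame(
  letters = sample(letters[1:10], 1e5, replace = TRUE),
  numbers = sample(1:100, 1e5, replace = TRUE)
)

# Step 1: conversion to a keyed data.table (this copies and sorts the data).
convert_time <- system.time(
  dt_small <- data.table(df_small, key = "letters")
)

# Step 2: the grouped sum on the already-keyed table.
agg_time <- system.time(
  res <- dt_small[, list(sums = sum(numbers)), by = "letters"]
)

convert_time["elapsed"]
agg_time["elapsed"]
```

If the same table is aggregated more than once, the conversion cost is paid only once, so including it in every comparison is the conservative choice for `data.table`.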

Comparison of the system times.

```
# how many times slower is aggregate
aggregate_time[2]/ddply_time[2]
## sys.self
## 1.114
aggregate_time[2]/dt_time[2]
## sys.self
## 4.347
# how many times slower is ddply
ddply_time[2]/aggregate_time[2]
## sys.self
## 0.8975
ddply_time[2]/dt_time[2]
## sys.self
## 3.902
# how many times slower is data.table
dt_time[2]/aggregate_time[2]
## sys.self
## 0.23
dt_time[2]/ddply_time[2]
## sys.self
## 0.2563
```
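The ratios above use element 2 of each timing, which is system CPU time (`sys.self`). The elapsed (wall-clock) figures from the same runs can be compared the same way. The sketch below simply copies the three elapsed values from the timing output earlier in the post; the vector name `elapsed` is an illustrative choice.

```r
# Elapsed times in seconds, copied from the timing output above.
elapsed <- c(aggregate = 261.79, ddply = 192.99, data.table = 55.957)

# How many times longer each method ran than data.table, in wall-clock terms.
ratios <- elapsed / elapsed[["data.table"]]
round(ratios, 2)
```

On elapsed time, `aggregate` took roughly 4.7 times as long as `data.table` and `ddply` roughly 3.4 times as long, which points the same way as the system-time ratios.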

Based on 100 million observations (`1e+08` rows), with the time to convert to a data.table included in the time elapsed:

- ddply requires ~0.8975x the system time of aggregate (about 10% less)
- aggregate requires ~4.347x the system time of data.table
- ddply requires ~3.902x the system time of data.table

**Conclusion – data.table for the win.**
