# More about Aggregation by Group in R

December 24, 2012

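Motivated by my young friend, HongMing Song, I found four more handy ways to calculate aggregated statistics by group in R. Each requires an additional package (plyr, doBy, Hmisc, or gdata), and all of them are extremely user-friendly. In terms of CPU time, summarize() from Hmisc is about as efficient as the second method, by(), introduced yesterday, while summaryBy() from the doBy package is the slowest of the four: roughly 1.8 seconds of user time for 10 runs, against about 0.24 seconds for summarize(). Because plyr and Hmisc both export a function named summarize(), whichever package is loaded last masks the other; a plyr:: or Hmisc:: prefix removes the ambiguity.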

“Learn as if you were to live forever” – Mahatma Gandhi

```
> # METHOD 5: USING DDPLY()
> library(plyr)
> summ5 <- ddply(df, .(SELFEMPL, OWNRENT), summarize, INCOME = mean(INCOME), BAD = mean(BAD))
> print(summ5)
  SELFEMPL OWNRENT   INCOME        BAD
1        0       0 2133.314 0.08470957
2        0       1 2881.201 0.06293210
3        1       0 2742.247 0.06896552
4        1       1 3487.910 0.05316973
>
> # METHOD 6: USING SUMMARYBY()
> library(doBy)
> summ6 <- summaryBy(INCOME + BAD ~ SELFEMPL + OWNRENT, data = df, fun = c(mean), keep.names = TRUE)
> print(summ6)
  SELFEMPL OWNRENT   INCOME        BAD
1        0       0 2133.314 0.08470957
2        0       1 2881.201 0.06293210
3        1       0 2742.247 0.06896552
4        1       1 3487.910 0.05316973
>
> # METHOD 7: USING SUMMARIZE()
> library(Hmisc)
> summ7 <- summarize(df[c('INCOME', 'BAD', 'SELFEMPL', 'OWNRENT')], df[c('SELFEMPL', 'OWNRENT')], colMeans, stat.name = NULL)
> print(summ7)
  SELFEMPL OWNRENT   INCOME        BAD
1        0       0 2133.314 0.08470957
2        0       1 2881.201 0.06293210
3        1       0 2742.247 0.06896552
4        1       1 3487.910 0.05316973
>
> # METHOD 8: USING FRAMEAPPLY()
> library(gdata)
> summ8 <- frameApply(df, by = c('SELFEMPL', 'OWNRENT'), on = c('INCOME', 'BAD'), fun = colMeans)
> rownames(summ8) <- NULL
> print(summ8)
  SELFEMPL OWNRENT   INCOME        BAD
1        0       0 2133.314 0.08470957
2        0       1 2881.201 0.06293210
3        1       0 2742.247 0.06896552
4        1       1 3487.910 0.05316973
```
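The data frame `df` used above comes from the earlier post and is not shown here. To try the four methods without it, the sketch below builds a synthetic stand-in with the same column names and computes a base-R baseline with `aggregate()`. The generated values, the seed, and the helper name `same_as_base` are assumptions made for illustration, so the numbers will not match the output printed above.

```r
# Hypothetical stand-in for the post's `df`; the real data is not shown here.
set.seed(2012)
n <- 1000
df <- data.frame(
  SELFEMPL = rbinom(n, 1, 0.1),
  OWNRENT  = rbinom(n, 1, 0.5),
  INCOME   = round(rlnorm(n, meanlog = 7.7, sdlog = 0.5), 2),
  BAD      = rbinom(n, 1, 0.07)
)

# Base-R baseline: group means of INCOME and BAD by SELFEMPL and OWNRENT.
base <- aggregate(cbind(INCOME, BAD) ~ SELFEMPL + OWNRENT, data = df, FUN = mean)
base <- base[order(base$SELFEMPL, base$OWNRENT), ]
rownames(base) <- NULL
print(base)

# Hypothetical helper: TRUE if another summary matches the baseline after
# sorting rows and aligning columns by name.
same_as_base <- function(summ) {
  summ <- as.data.frame(summ)
  summ <- summ[order(summ$SELFEMPL, summ$OWNRENT), names(base)]
  rownames(summ) <- NULL
  isTRUE(all.equal(base, summ, check.attributes = FALSE))
}
```

After running any of the methods on this stand-in, a call such as `same_as_base(summ5)` should return `TRUE` if the package-based summary agrees with the baseline. If a method returns its grouping columns as factors, they may need converting to numeric first.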

## Efficiency Comparison

```
> test5 <- function(n){
+   for (i in 1:n){
+     summ5 <- ddply(df, .(SELFEMPL, OWNRENT), plyr::summarize, INCOME = mean(INCOME), BAD = mean(BAD))  # plyr:: avoids the Hmisc::summarize mask
+   }
+ }
> system.time(test5(10))
   user  system elapsed
  0.524   0.068   0.622
>
> test6 <- function(n){
+   for (i in 1:n){
+     summ6 <- summaryBy(INCOME + BAD ~ SELFEMPL + OWNRENT, data = df, fun = c(mean), keep.names = TRUE)
+   }
+ }
> system.time(test6(10))
   user  system elapsed
  1.800   0.060   1.903
>
> test7 <- function(n){
+   for (i in 1:n){
+     summ7 <- summarize(df[c('INCOME', 'BAD', 'SELFEMPL', 'OWNRENT')], df[c('SELFEMPL', 'OWNRENT')], colMeans, stat.name = NULL)
+   }
+ }
> system.time(test7(10))
   user  system elapsed
  0.236   0.020   0.274
>
> test8 <- function(n){
+   for (i in 1:n){
+     summ8 <- frameApply(df, by = c('SELFEMPL', 'OWNRENT'), on = c('INCOME', 'BAD'), fun = colMeans)
+     rownames(summ8) <- NULL
+   }
+ }
> system.time(test8(10))
   user  system elapsed
  0.580   0.008   0.668
```
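The four test functions above repeat the same loop-and-time pattern. As a rough sketch, that pattern can be wrapped in a single helper that takes the call to be timed as a function. The helper name `time_reps`, the synthetic data, and the use of base R's `aggregate()` as the timed call are assumptions for illustration, not part of the original comparison.

```r
# Hypothetical helper: run `expr_fn` n times and report total and per-call time.
time_reps <- function(expr_fn, n = 10) {
  stopifnot(is.function(expr_fn), n >= 1)
  tm <- unclass(system.time(for (i in seq_len(n)) expr_fn()))
  c(tm[c("user.self", "sys.self", "elapsed")],
    per_call = unname(tm["elapsed"]) / n)
}

# Synthetic stand-in data, since the post's `df` is not shown here.
set.seed(2012)
df <- data.frame(
  SELFEMPL = rbinom(5000, 1, 0.1),
  OWNRENT  = rbinom(5000, 1, 0.5),
  INCOME   = rlnorm(5000, meanlog = 7.7, sdlog = 0.5),
  BAD      = rbinom(5000, 1, 0.07)
)

# Example with base R's aggregate(); swap in any of the package calls above.
res <- time_reps(function() {
  aggregate(cbind(INCOME, BAD) ~ SELFEMPL + OWNRENT, data = df, FUN = mean)
}, n = 10)
print(res)
```

With only 10 repetitions the timings are noisy and depend on the machine, so a larger `n` gives a steadier comparison between methods.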
