RStudio's Hadley Wickham has just introduced a new package for filtering, selecting, restructuring and aggregating tabular data in R: the dplyr package. It's similar in concept to Hadley's original plyr package from 2009, but with several key improvements:
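- It focuses exclusively on tabular data (R data frames and their database equivalents, tables), rather than the lists and arrays that plyr also handles;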
- It can process data in remote databases (with the transformations done in-database — only the result is returned to R);
- It introduces a “grammar of data manipulation”, allowing you to string together operations with the %.% operator;
- And it's much, much faster than plyr or standard R operations (most processing is done in parallel in C++).
For example, here's how you'd calculate the top 5 all-time batters by number of runs scored (based on 2012 MLB data, courtesy of the Lahman package):
Batting %.% group_by(playerID) %.% summarise(total = sum(R)) %.% arrange(desc(total)) %.% head(5)
   playerID total
1 henderi01  2295
2  cobbty01  2246
3 bondsba01  2227
4  ruthba01  2174
5 aaronha01  2174
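To make the chaining idea concrete, here is a minimal sketch of the same pipeline on a tiny made-up data frame. It assumes dplyr is installed, and the `scores` data and its player IDs are invented purely for illustration. It is written with `%>%`, the operator that later dplyr releases adopted in place of `%.%`. Either operator simply passes the left-hand side as the first argument of the call on the right, so the chain is equivalent to the nested call shown at the end.

```r
# Assumes dplyr is installed; `scores` is made-up illustration data.
library(dplyr)

scores <- data.frame(
  playerID = c("a01", "b01", "a01", "c01", "b01", "c01"),
  R        = c(10, 7, 5, 3, 9, 1),
  stringsAsFactors = FALSE
)

# Chained form: read top to bottom, one verb per step.
top <- scores %>%
  group_by(playerID) %>%          # one group per player
  summarise(total = sum(R)) %>%   # collapse each group to a single row
  arrange(desc(total)) %>%        # largest total first
  head(2)                         # keep the first two rows

# Equivalent nested form: same verbs, read inside out.
nested <- head(
  arrange(
    summarise(group_by(scores, playerID), total = sum(R)),
    desc(total)
  ),
  2
)

print(top)
```

The chained form lists the steps in the order they run, which is most of what makes it easier to read than the nested call.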
There are plenty of ways you could solve the same problem in standard R, but here's one way:
totals <- aggregate(. ~ playerID, data=Batting[,c("playerID","R")], sum)
ranks <- sort.list(-totals$R)
totals[ranks[1:5],]
The dplyr version took 0.036 seconds on my MacBook Air, compared to 0.266 seconds for the "standard R" version: about a 7x speedup. More importantly, it had been so long since I last used R's aggregate function that it took me about 5 minutes and several debugging attempts to get it right. (To be fair, I copied and pasted the dplyr example from Hadley, but still: it looks a lot cleaner and less error-prone.) There are certainly faster and easier ways to do this in standard R, but that's kind of the point: a standard "grammar of data manipulation" provides consistency, with a speed boost to boot!
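As one illustration of those other base-R routes, here is a rough sketch using tapply, wrapped in system.time so you can run the same kind of timing comparison yourself. The `scores` data frame is made up for the example; this is just one possible alternative, not the approach timed above, and timings on a toy data set say nothing about performance on the full Batting table.

```r
# Base R only; `scores` is made-up illustration data.
scores <- data.frame(
  playerID = c("a01", "b01", "a01", "c01", "b01", "c01"),
  R        = c(10, 7, 5, 3, 9, 1),
  stringsAsFactors = FALSE
)

elapsed <- system.time({
  # Sum R within each playerID; the result is named by player.
  totals <- tapply(scores$R, scores$playerID, sum)
  # Reorder from largest to smallest total, then keep the first two.
  ranked <- totals[order(totals, decreasing = TRUE)]
  top <- ranked[1:2]
})

print(top)
print(elapsed[["elapsed"]])  # wall-clock seconds for the block above
```

To try it on the real data, swap in Batting for `scores` and take the first five entries instead of two.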
The dplyr package is available on CRAN now. You can read more about dplyr in the introductory vignette and in Hadley's blog post linked below.
RStudio blog: Introducing dplyr