Fast and easy data munging, with dplyr

January 22, 2014

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

RStudio's Hadley Wickham has just introduced a new package for filtering, selecting, restructuring and aggregating tabular data in R: the dplyr package. It's similar in concept to Hadley's original plyr package from 2009, but with several key improvements:

  • It works exclusively with tabular data (data frames), whereas plyr also handles lists and arrays;
  • It can process data in remote databases (with the transformations done in-database — only the result is returned to R);
  • It introduces a "grammar of data manipulation", allowing you to string together operations with the %.% operator;
  • And it's much, much faster than plyr or standard R operations (key pieces are written in C++).

For example, here's how you'd calculate the top 5 all-time batters by number of runs scored (using MLB data through 2012, courtesy of the Lahman package):

Batting %.%
  group_by(playerID) %.%
  summarise(total = sum(R)) %.%
  arrange(desc(total)) %.%
  head(5)

   playerID total
1 henderi01  2295
2  cobbty01  2246
3 bondsba01  2227
4  ruthba01  2174
5 aaronha01  2174
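
To make the chaining idea concrete, here is a minimal sketch showing that a chain is just a more readable way of writing nested function calls. It rests on two assumptions: `batting_toy` is a made-up stand-in for the Lahman `Batting` table (the values are invented), and it uses the `%>%` operator that later dplyr releases adopted in place of `%.%`, so it should run on a recent dplyr installation.

```r
library(dplyr)

# Hypothetical toy stand-in for Lahman's Batting table (values are made up)
batting_toy <- data.frame(
  playerID = c("a01", "a01", "b02", "b02", "c03"),
  R        = c(10, 20, 5, 1, 40),
  stringsAsFactors = FALSE
)

# Chained form: each step receives the result of the previous one
chained <- batting_toy %>%
  group_by(playerID) %>%
  summarise(total = sum(R)) %>%
  arrange(desc(total)) %>%
  head(2)

# Equivalent nested form: the same calls, read from the inside out
nested <- head(
  arrange(
    summarise(group_by(batting_toy, playerID), total = sum(R)),
    desc(total)
  ),
  2
)

print(chained)
```

Both forms produce the same two-row table; the chained version simply lists the steps in the order they happen.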

There are plenty of ways you could solve the same problem in standard R, but here's one way:

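# Sum runs (R) per player; the formula aggregates every other column by playerID
totals <- aggregate(. ~ playerID, data=Batting[,c("playerID","R")], sum)
# Row indices that order the totals from most to fewest runs
ranks <- sort.list(-totals$R)
# Keep the five highest-scoring players
totals[ranks[1:5],]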

The dplyr version took 0.036 seconds on my MacBook Air, compared to 0.266 seconds for the "standard R" version: about a 7x speedup. More importantly, it'd been so long since I used R's aggregate function that it took me about 5 minutes and several debugging attempts before I got it right. (To be fair, I copied and pasted the dplyr example from Hadley, but still: it looks a lot cleaner and less error-prone.) There are certainly faster and easier ways to do this in standard R, but that's kind of the point: a standard "grammar of data manipulation" provides consistency, with a speed boost to boot!
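
As one illustration of a shorter standard R route, here is a sketch using `tapply`, together with `system.time` as one way timings like the ones above could be measured. This is not necessarily how the figures in this post were obtained, and `batting_toy` is a made-up stand-in for `Batting` so that the snippet runs without any packages.

```r
# Hypothetical toy stand-in for Lahman's Batting table (values are made up)
batting_toy <- data.frame(
  playerID = c("a01", "a01", "b02", "b02", "c03"),
  R        = c(10, 20, 5, 1, 40)
)

# Sum runs per player; tapply returns totals named by playerID
totals <- tapply(batting_toy$R, batting_toy$playerID, sum)

# Sort from most to fewest runs and keep the top two
top <- head(sort(totals, decreasing = TRUE), 2)
print(top)

# Measure elapsed wall-clock time (in seconds) for the aggregation step
elapsed <- system.time(
  tapply(batting_toy$R, batting_toy$playerID, sum)
)["elapsed"]
print(elapsed)
```

On a table this small the elapsed time will be close to zero; a meaningful comparison needs the full `Batting` data.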

The dplyr package is available on CRAN now. You can read more about dplyr in the introductory vignette and in Hadley's blog post linked below.

RStudio blog: Introducing dplyr
