Fast and easy data munging, with dplyr

[This article was first published on Revolutions, and kindly contributed to R-bloggers.]

RStudio's Hadley Wickham has just introduced a new package for filtering, selecting, restructuring and aggregating tabular data in R: the dplyr package. It's similar in concept to Hadley's original plyr package from 2009, but with several key improvements:

  • It works exclusively with tabular data in R data frames (plyr also handled lists and arrays);
  • It can process data in remote databases (with the transformations done in-database — only the result is returned to R);
  • It introduces a “grammar of data manipulation”, allowing you to string together operations with the %.% operator;
  • And it's much, much faster than plyr or standard R operations (key parts of the processing are done in C++).

For example, here's how you'd calculate the top 5 all-time batters by number of runs scored (using MLB data through 2012, courtesy of the Lahman package):

library(Lahman)  # provides the Batting data frame
library(dplyr)

Batting %.%
  group_by(playerID) %.%
  summarise(total = sum(R)) %.%
  arrange(desc(total)) %.%
  head(5)
   playerID total
1 henderi01  2295
2  cobbty01  2246
3 bondsba01  2227
4  ruthba01  2174
5 aaronha01  2174
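
To see what the chain is doing, it can help to write the same pipeline as nested function calls: the chain operator passes its left-hand side as the first argument of the function on its right. The sketch below assumes dplyr is installed and uses a small made-up data frame (`toy_batting`, a hypothetical stand-in, not the Lahman data) so that it runs on its own.

```r
library(dplyr)

# Hypothetical stand-in for Lahman's Batting table: one row per player-season.
toy_batting <- data.frame(
  playerID = c("a01", "a01", "b01", "b01", "c01"),
  R        = c(10, 20, 5, 50, 40),
  stringsAsFactors = FALSE
)

# Same verbs as the chained example, written inside-out:
# group, then sum runs per player, then sort descending, then keep the top 2.
top <- head(
  arrange(
    summarise(group_by(toy_batting, playerID), total = sum(R)),
    desc(total)
  ),
  2
)
print(top)
```

The nested form has to be read from the innermost call outward, which is the readability cost the chained version avoids.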

There are plenty of ways you could solve the same problem in standard R, but here's one way:

# Sum runs (R) per player
totals <- aggregate(. ~ playerID, data=Batting[,c("playerID","R")], sum)
# Order rows by total runs, descending, and keep the top 5
ranks <- sort.list(-totals$R)
totals[ranks[1:5],]
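
As an illustration of one of those other ways (a sketch, not the method used in this post), `tapply` can do the grouping and summing in one step. The data frame below is made up so that the snippet runs without the Lahman package.

```r
# Hypothetical stand-in for Lahman's Batting table.
toy_batting <- data.frame(
  playerID = c("a01", "a01", "b01", "b01", "c01"),
  R        = c(10, 20, 5, 50, 40)
)

# Sum runs per player; the result is indexed by playerID.
totals <- tapply(toy_batting$R, toy_batting$playerID, sum)

# Sort descending and keep the top two players.
top <- head(sort(totals, decreasing = TRUE), 2)
print(top)
```

On real data you would probably want to pass `na.rm = TRUE` through to `sum` if the runs column contains missing values.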

The dplyr version took 0.036 seconds on my MacBook Air, compared to 0.266 seconds for the "standard R" version: about a 7x speedup. More importantly, it'd been so long since I used R's aggregate function that it took me about 5 minutes and several debugging attempts before I got it right. (To be fair, I copied and pasted the dplyr example from Hadley, but still: it looks a lot cleaner and less error-prone.) There are certainly faster and easier ways to do this in standard R, but that's kind of the point: a standard "grammar of data manipulation" provides consistency, and with a speed boost to boot!
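
The post doesn't say how the timings were taken; one common way to get numbers like these is base R's `system.time`. The sketch below times the "standard R" approach on synthetic data (the `fake` data frame and its sizes are assumptions for illustration), so the result will differ from the figures quoted above.

```r
# Synthetic data: 100,000 rows spread over 1,000 hypothetical player IDs.
set.seed(1)
n <- 100000
fake <- data.frame(
  playerID = sample(sprintf("p%04d", 1:1000), n, replace = TRUE),
  R        = sample(0:150, n, replace = TRUE)
)

# system.time() reports user, system and elapsed seconds for an expression.
timing <- system.time({
  totals <- aggregate(R ~ playerID, data = fake, FUN = sum)
  top5 <- totals[sort.list(-totals$R)[1:5], ]
})
print(timing["elapsed"])
```

Wrapping the dplyr pipeline in the same `system.time({ ... })` call would give a like-for-like comparison on your own machine.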

The dplyr package is available on CRAN now. You can read more about dplyr in the introductory vignette and in Hadley's blog post linked below.

RStudio blog: Introducing dplyr
