The R statistical computing environment is awesome, but weird. How to do database operations in R is a common source of questions. The other day I was looking for an equivalent to SQL group by for R data frames. You need this to compute summary statistics on subsets of a data set divided up by some categorical variable. It turns out there are several ways to get the same effect, some more limited than others.
The best answer seems to be plyr. It automates the The split-apply-combine strategy for data analysis you’d use otherwise. The ddply splits a data frame into subset data frames, performs some function on the subsets, and returns the results as a recombined data frame. Here’s a few links: A Fast Intro to Plyr for R, Block-processing a data frame with plyr and Split, apply, and combine in R using PLYR.
This paper is worth reading. It introduces the library and also gives you a nice framework (split-apply-combine) for thinking about a whole class of data-munging problems. A coworker (thanks, Gustavo) pointed out that this is a lot like Google’s MapReduce.
Some commands that get you part of the way there are: split, by, tapply (nicely explained here), aggregate. The R wiki has an entry on Performing calculations within sub-sets of a data-frame that uses the reshape library. You could always use sqldf or RSQLite. Several options are discussed here. You can cobble up a fully general process using split, some form of sapply, and unsplit. But, that’s what plyr does automatically.
Side notes: While fooling around with this, I noticed that, for some crazy reason, split.data.frame splits matrices into nice subunits, but split has the ugly side-effect of reducing matrices to vectors. Also, Google has a style guide for R.
More R mini-tutorials:
- Mapping R functions to SQL nicely explains the subset, transform and aggregate functions.