If you are just getting started in R, check out my post on good references for beginners.
Hadley Wickham has come out with yet another R package that is destined to improve my workflow and let me concentrate less on getting R to do things, and more on my research questions. The package is dplyr, a reboot of an earlier package called plyr.
Behind both packages is the notion that it should be easy to do split-apply-combine operations on your data. In these operations you group your observations by some categorical variable, perform some operation on each subset, and then recombine the results. The plyr package was already really good at this.
From my perspective, the two most important improvements in dplyr are:
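To make the three steps concrete, here is a minimal sketch of split-apply-combine done by hand in base R. It uses the built-in mtcars data purely as a stand-in (it is not the dataset used in this post), and the variable names are illustrative.

```r
# Split-apply-combine by hand in base R, using the built-in mtcars data
# (an illustrative stand-in, not the data used in this post).

# Split: one vector of mpg values per cylinder count
mpg_by_cyl <- split(mtcars$mpg, mtcars$cyl)

# Apply: compute the mean of each piece
mean_mpg <- sapply(mpg_by_cyl, mean)

# Combine: reassemble the per-group results into a single data frame
result <- data.frame(cyl = as.numeric(names(mean_mpg)),
                     mean_mpg = as.vector(mean_mpg))
print(result)
```

Packages like plyr and dplyr wrap these three steps into a single, more readable call.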
- a MASSIVE increase in speed, making dplyr useful on big data sets
- the ability to chain operations together in a natural order
First, we read in the data from the web. This step takes the longest of anything we will do, because we are reading a 2.5 MB text file into memory over HTTP. It is a large dataset, with 5416 observations of 55 variables.
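URL <- "http://esapubs.org/archive/ecol/e090/184/PanTHERIA_1-0_WR05_Aug2008.txt"
pantheria <- read.table(URL, header = TRUE, sep = "\t", na.strings = c("-999", "-999.00"))  # assumed call: tab-delimited, header row, -999 as the missing-value code; the name pantheria is illustrative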
OK. Now we are ready to show off the magic of dplyr. We will use the %.% operator to chain together commands to manipulate our dataframe. First, we use the mutate() function to create a new column called yearlyOffspring, which is a transformation of two other columns. Then, we pass that result to the filter() function, and keep only the rodents. Next, we add a group_by() clause, and finally, we use summarise() to calculate the average body mass for each group. Type ?manip in the command line to see the full list of dplyr manipulation functions.
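# Assumed names: pantheria (the data frame read from URL) and MSW05_Family (the grouping column)
pantheria %.%
  # yearly offspring = litter size x litters per year; X15.1_LitterSize is an assumption, since the code as posted multiplied X16.1_LittersPerYear by itself
  mutate(yearlyOffspring = X15.1_LitterSize * X16.1_LittersPerYear) %.%
  filter(MSW05_Order == "Rodentia") %.%
  group_by(MSW05_Family) %.%
  summarise(meanBM = mean(X5.1_AdultBodyMass_g, na.rm = TRUE),
            meanYO = mean(yearlyOffspring, na.rm = TRUE))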
This code yields the following, which is exactly what we want!
The beauty of the %.% operator is that it allows you to do things in the order in which you think about them. You start with your data, then mutate it, then filter it, then group and summarise. You could do the same process with plyr, or with base apply-family functions, but dplyr makes it MUCH cleaner and clearer.
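To see why the chained order reads more naturally, here is a small sketch comparing a nested call with the equivalent chain. It uses the built-in mtcars data as a stand-in, and it uses %>%, the operator that replaced %.% in later dplyr releases, so that it runs on a current installation. The variable names are illustrative.

```r
library(dplyr)

# Nested form: has to be read from the inside out
nested <- summarise(
  group_by(filter(mtcars, am == 1), cyl),
  mean_mpg = mean(mpg)
)

# Chained form: reads top to bottom, in the order you think about it
chained <- mtcars %>%
  filter(am == 1) %>%
  group_by(cyl) %>%
  summarise(mean_mpg = mean(mpg))

# Check that both forms produce the same table
print(all.equal(as.data.frame(nested), as.data.frame(chained)))
```

The nested version does the same work, but the first function you read (summarise) is the last step performed, which is exactly the mismatch that chaining removes.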
Now we can visualize this data, and observe that there is a complex relationship between body mass and reproductive output in rodents!
Please share your experiences with dplyr in the comments section.