Using plyr and doMC for quick and easy apply-family functions

[This article was first published on Fellgernon Bit - rstats, and kindly contributed to R-bloggers].

A few weeks back I dedicated a short amount of time to actually read what plyr (Wickham, 2011) is about, and I was surprised. The whole idea behind plyr is very simple: extend the apply() family to make things easy. plyr has many functions whose names end in ply, which is short for apply. The functions are identified by the two letters before ply, which abbreviate the input type (first letter) and the output type (second letter). For instance, ddply takes a data.frame as input and returns a data.frame, while ldply takes a list as input and returns a data.frame.
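To make the naming scheme concrete, here is a minimal sketch using a made-up list (the object `scores` and its contents are invented for this illustration, they are not from the plyr documentation). The same list is fed to three functions that differ only in the second letter, that is, in the type of output they return.

```r
library(plyr)

## Hypothetical input: a named list of numeric vectors
scores <- list(a = c(1, 2, 3), b = c(4, 5, 6))

## l -> d: list in, data.frame out (one row per list element)
res_df <- ldply(scores, function(x) data.frame(mean = mean(x), n = length(x)))

## l -> l: list in, list out
res_list <- llply(scores, mean)

## l -> a: list in, array out (here it simplifies to a plain vector)
res_vec <- laply(scores, mean)

res_df
res_list
res_vec
```

The function applied is the same idea in all three calls; only the container holding the results changes.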

The syntax is pretty straightforward. For example, here are the arguments for ddply:

library(plyr)
args(ddply)
## function (.data, .variables, .fun = NULL, ..., .progress = "none",
##     .inform = FALSE, .drop = TRUE, .parallel = FALSE, .paropts = NULL)
## NULL

What we basically have to specify are

  • .data, which in general is the input data.frame,
  • .variables, which names the variables to split the data by (note the use of the . function, which lets you list them unquoted). ddply is very useful for applying some function to each subset of the data defined by these variables,
  • .fun, which is the actual function we want to run,
  • and ..., which are further arguments passed on to the function we are running.
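As a rough sketch of how these arguments fit together, the example below uses a made-up data.frame and a hypothetical helper, `scaled_mean`, neither of which comes from the plyr help pages. It shows the grouping variable given in three ways (the . function, a character vector, and a formula) and an extra argument, `times`, travelling through ... to the function.

```r
library(plyr)

## Small made-up data.frame for illustration
df <- data.frame(
  group = c("A", "A", "B", "B"),
  value = c(1, 3, 10, 20)
)

## Hypothetical helper: the mean of each piece, multiplied by 'times'
scaled_mean <- function(piece, times = 1) {
  data.frame(scaled = mean(piece$value) * times)
}

## Three ways of specifying .variables; 'times = 2' is passed on via ...
by_dot     <- ddply(df, .(group), scaled_mean, times = 2)
by_string  <- ddply(df, "group", scaled_mean, times = 2)
by_formula <- ddply(df, ~group, scaled_mean, times = 2)

by_dot
```

Each call splits `df` by `group`, hands every piece to `scaled_mean` together with `times = 2`, and binds the results back into one data.frame.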

From the ddply help page we have the following examples:

dfx <- data.frame(
  group = c(rep('A', 8), rep('B', 15), rep('C', 6)),
  sex = sample(c("M", "F"), size = 29, replace = TRUE),
  age = runif(n = 29, min = 18, max = 54)
)

# Note the use of the '.' function to allow
# group and sex to be used without quoting
ddply(dfx, .(group, sex), summarize,
 mean = round(mean(age), 2),
 sd = round(sd(age), 2))
##   group sex  mean    sd
## 1     A   F 40.48 12.72
## 2     A   M 34.48 15.28
## 3     B   F 36.05  9.98
## 4     B   M 38.35  7.97
## 5     C   F 20.04  1.86
## 6     C   M 43.81 10.72

# An example using a formula for .variables
ddply(baseball[1:100, ], ~year, nrow)

##   year V1
## 1 1871  7
## 2 1872 13
## 3 1873 13
## 4 1874 15
## 5 1875 17
## 6 1876 15
## 7 1877 17
## 8 1878  3

# Applying two functions; nrow and ncol
ddply(baseball, .(lg), c("nrow", "ncol"))

##   lg  nrow ncol
## 1       65   22
## 2 AA   171   22
## 3 AL 10007   22
## 4 FL    37   22
## 5 NL 11378   22
## 6 PL    32   22
## 7 UA     9   22

But this is not the end of the story! Something I really liked about plyr is that it can be parallelized via the foreach (Analytics, 2012) package. I don’t know much about foreach, but what I learned is that you need another package, such as doMC (Analytics, 2013), to actually run the code in parallel. It’s like foreach specifies the infrastructure for splitting jobs and communicating in parallel, and packages like doMC tailor it to a specific environment, such as running on multiple cores.
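To see that division of labor outside of plyr, here is a minimal foreach sketch. It assumes doMC is installed, which in practice means a Unix-like system since doMC relies on forking, and the choice of 2 cores is arbitrary. foreach describes the loop, and the backend registered by doMC decides how the iterations are executed.

```r
library(foreach)
library(doMC)

## Register the multi-core backend with 2 workers (arbitrary choice)
registerDoMC(2)

## Name of the backend that foreach will hand the work to
getDoParName()

## foreach describes the loop; %dopar% sends it to the registered backend
## and .combine = c collects the results into a vector
squares <- foreach(i = 1:4, .combine = c) %dopar% {
  i^2
}
squares
```

Swapping `%dopar%` for `%do%` runs the same loop sequentially, which is handy for debugging.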

Running things in parallel can then be very easy. Basically, you load the packages, specify the number of cores, and run your ply function. Here is a short example:

## Load packages
library(plyr)
library(doMC)

## Loading required package: foreach
## Loading required package: iterators
## Loading required package: parallel

## Specify the number of cores
registerDoMC(4)

## Check how many cores we are using
getDoParWorkers()
## [1] 4

## Run your ply function
ddply(dfx, .(group, sex), summarize,
  mean = round(mean(age), 2),
  sd = round(sd(age), 2),
  .parallel = TRUE)

##   group sex  mean    sd
## 1     A   F 40.48 12.72
## 2     A   M 34.48 15.28
## 3     B   F 36.05  9.98
## 4     B   M 38.35  7.97
## 5     C   F 20.04  1.86
## 6     C   M 43.81 10.72

In case you are interested, here is a short shell script for knitting an Rmd file on the cluster while requesting the appropriate number of cores to then use plyr and doMC.

#!/bin/bash
# To run it in the current working directory
#$ -cwd
# To get an email after the job is done
#$ -m e
# To specify that we want 4 cores
#$ -pe local 4
# The name of the job
#$ -N myPlyJob

echo "**** Job starts ****"
date

# Knit your file: assuming it's called FileToKnit.Rmd
Rscript -e "library(knitr); knit2html('FileToKnit.Rmd')"

echo "**** Job ends ****"
date

Let's say that the bash script is named script.sh. Then you can submit it to the cluster queue using

qsub script.sh

This is what I used to reformat a large data.frame in a few minutes on the cluster for the #jhsph753 class homework project.
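One way to keep the R side consistent with the number of cores requested in the shell script is to read it from the environment instead of hard-coding it. The sketch below assumes the scheduler exports the NSLOTS variable for jobs that request a parallel environment, as Sun Grid Engine does; if your cluster uses a different variable, adjust the name. It falls back to a single core when the variable is not set, for example when running interactively.

```r
library(doMC)

## Assumption: the scheduler sets NSLOTS to the number of granted cores
slots <- suppressWarnings(as.integer(Sys.getenv("NSLOTS", unset = NA)))

## Fall back to one core when NSLOTS is missing or not a number
n_cores <- if (is.na(slots)) 1L else slots

## Register that many workers and confirm
registerDoMC(cores = n_cores)
getDoParWorkers()
```

With this in the Rmd file, changing the number after `-pe local` in the script is enough to change how many cores the ply functions use.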

So, thank you again Hadley Wickham for making awesome R packages!

Citations made with knitcitations (Boettiger, 2013).
