Using the {plyr} (1.2) package parallel processing backend with Windows

September 11, 2010

(This article was first published on R-statistics blog, and kindly contributed to R-bloggers)

Hadley Wickham has just announced the release of a new R package, “reshape2”, which is (as Hadley wrote) “a reboot of the reshape package”. Alongside it, Hadley announced the release of plyr 1.2.1 (now faster and with support for parallel computation!). Both releases are exciting because of the significant speed increase they have gained.

Yet in the case of the new plyr package, an even more interesting new feature is the introduction of a parallel processing backend.

    Reminder: what is the `plyr` package all about?

    (as written in Hadley’s announcement)

    plyr is a set of tools for a common set of problems: you need to __split__ up a big data structure into homogeneous pieces, __apply__ a function to each piece and then __combine__ all the results back together. For example, you might want to:

    • fit the same model to each patient subset of a data frame
    • quickly calculate summary statistics for each group
    • perform group-wise transformations like scaling or standardising

    It’s already possible to do this with base R functions (like split and the apply family of functions), but plyr makes it all a bit easier with:

    • totally consistent names, arguments and outputs
    • convenient parallelisation through the foreach package
    • input from and output to data.frames, matrices and lists
    • progress bars to keep track of long running operations
    • built-in error recovery, and informative error messages
    • labels that are maintained across all transformations

    Considerable effort has been put into making plyr fast and memory efficient, and in many cases plyr is as fast as, or faster than, the built-in functions.
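As a small illustration of the split, apply, combine pattern described above, here is a minimal sketch using the built-in `mtcars` data set. The data set and the column names `mean_mpg` and `mpg_centred` are choices made for this example, not part of Hadley's announcement.

```r
library(plyr)  # assumes plyr is installed

# Base R: split mpg by number of cylinders, apply mean(), combine with sapply()
base_means <- sapply(split(mtcars$mpg, mtcars$cyl), mean)

# plyr: the same summary in one call, with a data frame in and a data frame out
plyr_means <- ddply(mtcars, "cyl", summarise, mean_mpg = mean(mpg))

# plyr: a group-wise transformation (centre mpg within each cylinder group)
centred <- ddply(mtcars, "cyl", transform, mpg_centred = mpg - mean(mpg))

print(base_means)
print(plyr_means)
```

The `ddply` calls keep the grouping column as a label in the result, which is the "labels that are maintained" point from the list above.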

    You can find out more on the plyr website, which includes a 20 page introductory guide. You can ask questions about plyr (and data manipulation in general) on the plyr mailing list.

    What’s new in `plyr` (1.2.1)

    The exciting news in the new plyr release is the added support for parallel processing.

    l*ply, d*ply, a*ply and m*ply all gain a .parallel argument that, when TRUE, applies functions in parallel using a parallel backend registered with the foreach package.
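To make the "registered backend" idea concrete, here is a minimal sketch of the general pattern: start some workers, register them with foreach, then pass `.parallel = TRUE`. It uses the doParallel package purely as an assumption for illustration; it is not one of the backends discussed in this post (doMC and doSMP), and any registered foreach backend should play the same role.

```r
library(plyr)
library(doParallel)  # assumption: stand-in foreach backend, not mentioned in the post

cl <- makeCluster(2)     # start two worker processes
registerDoParallel(cl)   # register them as the foreach backend

# With a backend registered, plyr can hand each piece to a worker
squares <- llply(1:4, function(i) i^2, .parallel = TRUE)

stopCluster(cl)          # release the workers when done
print(unlist(squares))
```

The registration step is what connects plyr to the workers; starting workers alone is not enough.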

    The new version also has some minor changes and bug fixes.

    In his original announcement, Hadley gave an example of using the new parallel backend with the doMC package for Unix/Linux. For Windows (the OS I’m using) you should use the doSMP package (as David mentioned in his post earlier today). However, this package is currently only released for “REvolution R” and not yet for R 2.11. But thanks to the kind help of Tao Shi, there is a solution for Windows users who want a parallel processing backend for plyr.

    All you need to do is install the doSMP package, following the instructions in the post “Parallel Multicore Processing with R (on Windows)”, and then use it like this:

    require(plyr) # make sure you have 1.2 or later installed
    x <- seq_len(20)
    wait <- function(i) Sys.sleep(0.1)
    system.time(llply(x, wait))
    #   user  system elapsed 
    #      0       0       2 
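    require(doSMP) # provides startWorkers(), registerDoSMP() and stopWorkers()
    workers <- startWorkers(2) # My computer has 2 cores
    registerDoSMP(workers) # register the workers as the foreach backend that plyr uses
    system.time(llply(x, wait, .parallel = TRUE))
    #   user  system elapsed 
    #   0.09    0.00    1.11
    stopWorkers(workers) # shut the worker processes down when finished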
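Before timing a parallel call, it can help to confirm that a backend is actually registered, since without one the work would, as far as I know, simply run sequentially with a warning. A minimal sketch using the query functions from the foreach package (the messages are just illustrative wording):

```r
library(foreach)  # plyr's .parallel = TRUE hands the work to foreach

if (getDoParRegistered()) {
  # A backend is registered: report its name and how many workers it has
  message("Backend: ", getDoParName(), " with ", getDoParWorkers(), " worker(s)")
} else {
  message("No parallel backend registered; .parallel = TRUE would not run in parallel")
}
```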
