Parallelization using plyr: loading objects and packages into worker nodes

November 14, 2011

(This article was first published on NumberTheory » R stuff, and kindly contributed to R-bloggers)

I really love the plyr package. Apart from having a progress bar and plyr handeling a lot of the overhead, a very interesting feature is being able to run plyr in parallel mode. Essentially, setting .parallel = TRUE runs any plyr function in parallel. This is under the assumption that a parallel backend was registered. In my case, I use the doSNOW package to register a backend that uses the Simple Network of Workstations (SNOW) package for parallel computing.

However, the focus of this post is not on how exactly SNOW works, but on a particular problem I had. My problem will be most effectively explained using the following example (note that I assume that a parallel backend has been registered):

bla = function(x) {
y = 10
dat = data.frame(x = 1:10, category = LETTERS[1:10])
ddply(dat, .(category), bla, .parallel = TRUE)
# Error in do.ply(i) : task 1 failed - "object 'y' not found"

The problem in this case is that the object y is not present in the environment of the worker nodes. Luckily, snow provides a few functions that allow us to load objects, functions and packages into the worker nodes. The clusterExport can be used to load objects and functions, the clusterEvalQ can be used to load packages. To streamline setting up and configuring a cluster that can be used by plyr, I wrote a small function createCluster. It assumes that you have snow and doSNOW installed, installing doSNOW gets you all the required packages. Note that the function also requires an adapted version of the clusterExport function to allow exports from other environments that .GlobalEnv.

clusterExport <- local({
  gets <- function(n, v) { assign(n, v, envir = .GlobalEnv); NULL }
  function(cl, list, envir = .GlobalEnv) {
    ## do this with only one clusterCall--loop on slaves?
    for (name in list) {
      clusterCall(cl, gets, name, get(name, envir = envir))
# Functions
createCluster = function(noCores, logfile = "/dev/null", export = NULL, lib = NULL) {
  cl <- makeCluster(noCores, type = "SOCK", outfile = logfile)
  if(!is.null(export)) clusterExport(cl, export)
  if(!is.null(lib)) {
    l_ply(lib, function(dum) { 
      clusterExport(cl, "dum", envir = environment())
      clusterEvalQ(cl, library(dum, character.only = TRUE))

where noCores is a numeric specifying the number of cores to register, logfile is the location where the output of the workers goes to (set to “” to output to screen), export is a list of objects/functions that needs to be loaded into the workers, and lib is a list of packages that need to be loaded into the workers. The following example show the function in action:

bla = function(arg) {
  dum = ggplot(aes(x = x, y = x), data = arg)
  xi = bla2(arg$x)
bla2 = function(arg) {
  return(arg + 1)
# Constants
y = 10
dat = data.frame(x = 1:10, category = LETTERS[1:10])
# Create a cluster
# Fails
# Error in do.ply(i) : task 1 failed - "could not find function "ggplot""
cl = createCluster(2)
res = ddply(dat, .(category), bla, .parallel = TRUE)
# Fails, pacakge is loaded, function 'bla2' is not
# Error in do.ply(i) : task 1 failed - "could not find function "bla2""
cl = createCluster(2, lib = list("ggplot2"))
res =ddply(dat, .(category), bla, .parallel = TRUE)
# Works! Also export the function 'bla2' and object 'y'
cl = createCluster(2, export = list("bla2","y"), lib = list("ggplot2"))
res = ddply(dat, .(category), bla, .parallel = TRUE)
# Sanity check
all.equal(res, ddply(dat, .(category), bla, .parallel = FALSE))

To leave a comment for the author, please follow the link and comment on their blog: NumberTheory » R stuff. offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Tags: , , , ,

Comments are closed.


Mango solutions

plotly webpage

dominolab webpage

Zero Inflated Models and Generalized Linear Mixed Models with R

Quantide: statistical consulting and training




CRC R books series

Six Sigma Online Training

Contact us if you wish to help support R-bloggers, and place your banner here.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)