doFuture: A universal foreach adaptor ready to be used by 1,000+ packages

March 18, 2017
By

(This article was first published on jottR, and kindly contributed to R-bloggers)

doFuture 0.4.0 is available on CRAN. The doFuture package provides a universal foreach adaptor enabling any future backend to be used with the foreach() %dopar% { ... } construct. As shown below, this will allow foreach() to parallelize on not only multiple cores, multiple background R sessions, and ad-hoc clusters, but also cloud-based clusters and high performance compute (HPC) environments.

1,300+ R packages on CRAN and Bioconductor depend, directly or indirectly, on foreach for their parallel processing. By using doFuture, a user has the option to parallelize those computations on more compute environments than previously supported, especially HPC clusters. Notably, all plyr code with .parallel = TRUE will be able to take advantage of this without need for modifications – this is possible because internally plyr makes use of foreach for its parallelization.

With doFuture, foreach can process your code in more places than ever before. Alright, it may not be able to process this programmer’s 62,500 punched cards.

What is new in doFuture 0.4.0?

  • Load balancing: The doFuture %dopar% backend will now partition all iterations (elements) and distribute them uniformly such that the each backend worker will receive exactly one partition equally sized to those sent to the other workers. This approach speeds up the processing significantly when iterating over a large set of elements that each has a relatively small processing time.
  • Globals: Global variables and packages needed in order for external R workers to evaluate the foreach expression are now identified by the same algorithm as used for regular future constructs and future::future_lapply().

For full details on updates, please see the NEWS file. The doFuture package installs out-of-the-box on all operating systems.

A quick example

Here is a bootstrap example using foreach adapted from help("clusterApply", package = "parallel"). I use this example to illustrate how to perform foreach() iterations in parallel on a variety of backends.

library("boot")

run <- function(...) {
cd4.rg <- function(data, mle) MASS::mvrnorm(nrow(data), mle$m, mle$v)
cd4.mle <- list(m = colMeans(cd4), v = var(cd4))
boot(cd4, corr, R = 10000, sim = "parametric", ran.gen = cd4.rg, mle = cd4.mle)
}

## Attach doFuture (and foreach), and tell foreach to use futures
library("doFuture")
registerDoFuture()

## Sequentially on the local machine
plan(sequential)
system.time(boot <- foreach(i = 1:100, .packages = "boot") %dopar% { run() })
## user system elapsed
## 298.728 0.601 304.242

# In parallel on local machine (with 8 cores)
plan(multiprocess)
system.time(boot <- foreach(i = 1:100, .packages = "boot") %dopar% { run() })
## user system elapsed
## 452.241 1.635 68.740

# In parallel on the ad-hoc cluster machine (5 machines with 4 workers each)
nodes <- rep(c("n1", "n2", "n3", "n4", "n5"), each = 4L)
plan(cluster, workers = nodes)
system.time(boot <- foreach(i = 1:100, .packages = "boot") %dopar% { run() })
## user system elapsed
## 2.046 0.188 22.227

# In parallel on Google Compute Engine (10 r-base Docker containers)
vms <- lapply(paste0("node", 1:10), FUN = googleComputeEngineR::gce_vm, template = "r-base")
vms <- lapply(vms, FUN = gce_ssh_setup)
cl <- as.cluster(vms, docker_image = "henrikbengtsson/r-base-future")
plan(cluster, workers = cl)
system.time(boot <- foreach(i = 1:100, .packages = "boot") %dopar% { run() })
## user system elapsed
## 0.952 0.040 26.269

# In parallel on a HPC cluster with a TORQUE / PBS scheduler
# (Note, the below timing includes waiting time on job queue)
plan(future.BatchJobs::batchjobs_torque, workers = 10)
system.time(boot <- foreach(i = 1:100, .packages = "boot") %dopar% { run() })
## user system elapsed
## 15.568 6.778 52.024

About .export and .packages

When using doFuture::registerDoFuture(), there is no need to manually specify which global variables (argument .export) to export. By default, the doFuture backend automatically identifies and exports all globals needed. This is done using recursive static-code inspection. The same is true for packages that need to be attached; those will also be handled automatically and there is no need to specify them manually via argument .packages. This is in line with how it works for regular future constructs, e.g. y %<-% { a * sum(x) }.

Having said this, you may still want to specify arguments .export and .packages because of the risk that your foreach() statement may not work with other foreach adaptors, e.g. doParallel and doSNOW. Exactly when and where a failure may occur depends on the nestedness of your code and the location of your global variables. Specifying .export and .packages manually skips such automatic identification.

Finally, I recommend that you as a developer always try to write your code in such way the users can choose their own futures: The developer decides what should be parallelized – the user chooses how.

Happy futuring!

Links

See also

To leave a comment for the author, please follow the link and comment on their blog: jottR.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...



If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Comments are closed.

Sponsors

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)