
Use foreach with HPC schedulers thanks to the future package

[This article was first published on Revolutions, and kindly contributed to R-bloggers.]

The future package is a powerful and elegant cross-platform framework for orchestrating asynchronous computations in R. It's ideal for computations that take a long time to complete; that would benefit from distributed, parallel frameworks to finish faster; and that you'd rather not have locking up your interactive R session. You can get a good sense of the future package from its introductory vignette or from this eRum 2018 presentation by its author, Henrik Bengtsson (video embedded below), but at its simplest it allows constructs in R like this:

a %<-% slow_calculation(1:50)
b %<-% slow_calculation(51:100)
a+b

The idea here is that slow_calculation is an R function that takes a long time to run, but with the special %<-% assignment operator the computation begins and the R prompt is ready again immediately. The first two lines of R code above take essentially zero time to execute. The future package farms those computations out to another process or even a remote system (you specify which with a preceding plan call), and R only blocks when a result is actually needed, as in the third line above. This approach pays off in Bengtsson's own work, where he uses the future package to parallelize analyses of DNA sequences for cancer research on high-performance computing (HPC) clusters.
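
As a minimal sketch of that mechanism, assuming slow_calculation() is some long-running function of your own (multisession, which runs futures in background R sessions on the local machine, is just one of the plans future offers):

library(future)
plan(multisession)               # run each future in a background R session

a %<-% slow_calculation(1:50)    # dispatched to a worker; returns immediately
b %<-% slow_calculation(51:100)  # dispatched to another worker
a + b                            # blocks here until both results are available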

The future package supports a wide variety of computation backends, including parallel local R sessions, remote R sessions, and cluster computing frameworks. (If none of these is available, it falls back to evaluating the expressions locally, in sequence.) The future package also works in concert with other parallel programming systems already available in R. For example, the companion future.apply package provides future_lapply() as a futurized analog of lapply(), which uses whatever computation plan you have defined to run the computations in parallel.
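
As a minimal sketch, again assuming a long-running slow_calculation() function of your own:

library(future.apply)
plan(multisession)   # any plan works here, including the HPC backends described below

# one slow_calculation() call per element, run in parallel under the current plan
results <- future_lapply(1:8, slow_calculation)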

The future package also extends the foreach package thanks to the updated doFuture package. By calling registerDoFuture() to register future as the foreach backend, your loops can use any computation plan provided by the future package to run their iterations in parallel. (The same applies to R packages that use foreach internally, notably the caret package.) This means you can now use foreach with any of the HPC schedulers supported by future via the future.batchtools package, which include TORQUE, Slurm, and OpenLava. So if you share a Slurm HPC cluster with colleagues in your department, you can queue up a parallel simulation on the cluster using code like this:

library("doFuture")
registerDoFuture()
library("future.batchtools")
plan(batchjobs_slurm)

mu <- 1.0
sigma <- 2.0
x <- foreach(i = 1:3, .export = c("mu", "sigma")) %dopar% {
  rnorm(i, mean = mu, sd = sigma)
}
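
Note that the loop body never mentions Slurm: the scheduler is chosen entirely by the plan() call, so you could, for example, test the same code locally by switching to plan(multisession) and leaving the foreach loop unchanged.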

The future package is available on CRAN now, and works consistently on Windows, Mac and Linux systems. You can learn more in the video at the end of this post, or in the recent blog update linked below.

JottR: Maintenance Updates of Future Backends and doFuture

