drake’s improved high-performance computing power

Posted on May 17, 2018 by rOpenSci - open tools for open science in R bloggers | 0 Comments

[This article was first published on rOpenSci - open tools for open science, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

The drake R package is not only a reproducible research solution, but also a serious high-performance computing engine. The Get Started page introduces drake, and this technical note draws from the guides on high-performance computing and timing.

You can help!

Some of these features are brand new, and others are newly refactored. The GitHub version has all the advertised functionality, but it needs more testing and development before I can submit it to CRAN in good conscience. New issues such as r-lib/processx#113 and HenrikBengtsson/future#226 seem to affect drake, and more may emerge. If you use drake for your own work, please consider supporting the project by field-testing the claims below and posting feedback here.

Let drake schedule your targets.

A typical workflow is a sequence of interdependent data transformations. Consider the example from the Get Started page.

When you call make() on this project, drake takes care of "raw_data.xlsx", then raw_data, and then data in sequence. Once data completes, fit and hist can launch in parallel, and then "report.md" begins once everything else is done. It is drake’s responsibility to deduce this order of execution, hunt for ways to parallelize your work, and free you up to focus on the substance of your research.

Activate parallel processing.

Simply set the jobs argument to an integer greater than 1. The following make() recruits multiple processes on your local machine.

make(plan, jobs = 2)

For parallel deployment to a computing cluster (SLURM, TORQUE, SGE, etc.) drake calls on packages future, batchtools, and future.batchtools. First, create a batchtools template file to declare your resource requirements and environment modules. There are built-in example files in drake, but you will likely need to tweak your own by hand.

drake_batchtools_tmpl_file("slurm") # Writes batchtools.slurm.tmpl.

Next, tell future.batchtools to talk to the cluster.

library(future.batchtools)
future::plan(batchtools_slurm, template = "batchtools.slurm.tmpl")

Finally, set make()’s parallelism argument equal to "future" or "future_lapply".

make(plan, parallelism = "future", jobs = 8)

Choose a scheduling algorithm.

The parallelism argument of make() controls not only where to deploy the workers, but also how to schedule them. The following table categorizes the 7 options.

	Deploy: local	Deploy: remote
Schedule: persistent	“mclapply”, “parLapply”	“future_lapply”
Schedule: transient		“future”, “Makefile”
Schedule: staged	“mclapply_staged”, “parLapply_staged”

Staged scheduling

drake’s first custom parallel algorithm was staged scheduling. It was easier to implement than the other two, but the workers run in lockstep. In other words, all the workers pick up their targets at the same time, and each worker has to finish its target before any worker can move on. The following animation illustrates the concept.

But despite weak parallel efficiency, staged scheduling remains useful because of its low overhead. Without the bottleneck of a formal master process, staged scheduling blasts through armies of tiny conditionally independent targets (example here). Consider it if the bulk of your work is finely diced and perfectly parallel, maybe if your dependency graph is tall and thin.

Persistent scheduling

Persistent scheduling is brand new to drake. Here, make(jobs = 2) deploys three processes: two workers and one master. Whenever a worker is idle, the master assigns it the next target whose dependencies are fully ready. The workers keep running until no more targets remain. See the animation below.

Transient scheduling

If the time limits of your cluster are too strict for persistent workers, consider transient scheduling, another new arrival. Here, make(jobs = 2) starts a brand new worker for each individual target. See the following video.

How many jobs should you choose?

The predict_runtime() function can help. Let’s revisit the mtcars example.

Let’s also

Plan for non-staged scheduling,
Assume each non-file target (black circle) takes 2 hours to build, and
Rest assured that everything else is super quick.

When we declare the runtime assumptions with the known_times argument and cycle over a reasonable range of jobs, predict_runtime() paints a clear picture.

jobs = 4 is a solid choice. Any fewer would slow us down, and the next 2-hour speedup would take double the jobs and the hardware to back it up. Your choice of jobs for make() ultimately depends on the runtime you can tolerate and the computing resources at your disposal.

Thanks!

When I attended RStudio::conf(2018), drake relied almost exclusively on staged scheduling. Kirill Müller spent hours on site and hours afterwards helping me approach the problem and educating me on priority queues, message queues, and the knapsack problem. His generous help paved the way for drake’s latest enhancements.

Disclaimer

This post is a product of my own personal experiences and opinions and does not necessarily represent the official views of my employer. I created and embedded the Powtoon videos only as explicitly permitted in the Terms and Conditions of Use, and I make no copyright claim to any of the constituent graphics.

To leave a comment for the author, please follow the link and comment on their blog: rOpenSci - open tools for open science.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

R-bloggers

R news and tutorials contributed by hundreds of R bloggers

drake’s improved high-performance computing power

You can help!

Let drake schedule your targets.

Activate parallel processing.

Choose a scheduling algorithm.

Staged scheduling

Persistent scheduling

Transient scheduling

How many jobs should you choose?

Thanks!

Disclaimer

Related

You can help!

Let drake schedule your targets.

Activate parallel processing.

Choose a scheduling algorithm.

Staged scheduling

Persistent scheduling

Transient scheduling

How many jobs should you choose?

Thanks!

Disclaimer

Related

Never miss an update! Subscribe to R-bloggers to receive e-mails with the latest R posts. (You will not see this message again.)

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)