At the EARL conference in San Francisco this week, JS Tan from Microsoft gave an update (PDF slides here) on the doAzureParallel package. As we've noted here before, this package lets you easily distribute parallel R computations to an Azure cluster. The package was recently updated to support automatically scaling Azure Batch clusters with low-priority nodes, which can be used at a discount of up to 80% compared to the price of regular high-availability VMs.
— David Smith (@revodavid) June 7, 2017
Using the doAzureParallel package is simple. First, you define the cluster you're going to use in a JSON file. (You can see an example on the right.) Here, you specify your Azure credentials, the size of the cluster, and the type of nodes (CPUs and memory) to use in the cluster. In the same file you can also list R packages (from CRAN and/or GitHub) to be pre-loaded onto each node, and set the maximum number of simultaneous tasks to run on each node (for within-node parallelism).
New to this update, the poolSize option lets you specify the number of dedicated (standard) VM nodes to use, along with the number of low-priority nodes. Low-priority nodes can be pre-empted by the Azure system at any time, but are much cheaper to use. (Even if a node is pre-empted, your parallel computation will continue; it will just take a little longer with the reduced capacity.) You can even specify a minimum and maximum number of nodes of each class, in which case the cluster will automatically scale up and down according to either the workload or the time of day, whichever you choose (e.g. only expand the low-priority part of the cluster on weekends, when pre-emption is less likely).
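The settings described above might be laid out as in the sketch below, which builds the pool portion of the configuration as an R list and serialises it to JSON. Using the jsonlite package for this is an assumption made for illustration; the file can equally be written by hand. Apart from poolSize, the field names are assumptions modelled on the package's template, the pool name and VM size are placeholders, and the credentials section is left out entirely.

```r
library(jsonlite)

# Pool settings only; Azure credentials are deliberately omitted.
# Field names are assumptions modelled on the package's template.
pool <- list(
  name = "option-pricing-pool",      # hypothetical pool name
  vmSize = "Standard_D2_v2",         # node type (CPUs and memory)
  maxTasksPerNode = 2,               # within-node parallelism
  poolSize = list(
    dedicatedNodes   = list(min = 2, max = 2),  # fixed-size dedicated part
    lowPriorityNodes = list(min = 0, max = 8),  # cheaper, pre-emptible part
    autoscaleFormula = "QUEUE"       # assumed value: scale with the workload
  ),
  rPackages = list(
    cran   = list("caret"),          # packages pre-loaded on every node
    github = list()
  )
)

# Serialise the list and write it to a temporary file.
json <- toJSON(pool, auto_unbox = TRUE, pretty = TRUE)
config_path <- file.path(tempdir(), "cluster-sketch.json")
writeLines(json, config_path)
cat(json, "\n")
```

In practice it is probably safer to start from the template that the package itself generates and edit that, since the exact schema depends on the package version.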
Once you've defined the parameters of your cluster, all you need to do is declare the cluster as a backend for the foreach package. The body of the foreach loop runs just like a for loop, except that multiple iterations run in parallel on the remote cluster. Here are the key parts of the option price simulation example JS presented at the conference.
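A minimal sketch in the spirit of that example follows; it is not the code from the talk. The simulation parameters and the helper name get_closing_price are illustrative assumptions, and the sequential foreach backend is registered so that the loop can be tried locally. The Azure-specific calls appear only in comments, with "cluster.json" as a hypothetical file name.

```r
library(foreach)

# With an Azure cluster, the backend would be registered roughly like this:
#   library(doAzureParallel)
#   cluster <- makeCluster("cluster.json")   # hypothetical file name
#   registerDoAzureParallel(cluster)
# For a local dry run, register the sequential backend instead.
registerDoSEQ()

# Illustrative parameters, not the figures used in the talk.
mean_change   <- 1.001   # average daily multiplicative change
volatility    <- 0.01    # standard deviation of the daily change
opening_price <- 100
days          <- 252     # roughly one trading year

# Simulate one price path and return its final (closing) price.
get_closing_price <- function() {
  movement <- rnorm(days, mean = mean_change, sd = volatility)
  path <- cumprod(c(opening_price, movement))
  path[length(path)]
}

set.seed(42)
# Each iteration is independent, so iterations can run on different nodes.
closing_prices <- foreach(i = 1:10, .combine = "c") %dopar% {
  replicate(100, get_closing_price())
}

print(summary(closing_prices))

# When finished with an Azure cluster: stopCluster(cluster)
```

Only the backend registration changes between the local run and the cluster run; the loop itself stays the same, which is the main appeal of the foreach approach.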
This same approach can be used for any "embarrassingly parallel" iteration in R, and you can use any R function or package within the body of the loop. For example, you could use a cluster to reduce the time required for parameter tuning and cross-validation with the caret package, or speed up data preparation tasks when using the dplyr package.
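As a rough illustration of such an embarrassingly parallel task, the sketch below runs a hand-rolled 5-fold cross-validation in which each fold is one loop iteration. It uses a plain linear model on the built-in mtcars data rather than caret, and registers the sequential backend as a stand-in for a cluster; both choices are assumptions made to keep the example self-contained.

```r
library(foreach)

# Stand-in for a registered cluster backend, so the sketch runs locally.
registerDoSEQ()

set.seed(1)
k <- 5
# Assign each row of mtcars to one of k folds at random.
folds <- sample(rep(1:k, length.out = nrow(mtcars)))

# Each fold is fitted and scored independently of the others.
rmse <- foreach(f = 1:k, .combine = "c") %dopar% {
  train <- mtcars[folds != f, ]
  test  <- mtcars[folds == f, ]
  fit   <- lm(mpg ~ wt + hp, data = train)
  sqrt(mean((test$mpg - predict(fit, newdata = test))^2))
}

# Average out-of-fold error across the k folds.
print(mean(rmse))
```

With only five small fits there is nothing to gain from a cluster here; the pattern pays off when each iteration is expensive, such as one model fit per tuning-parameter combination.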
In addition to support for auto-scaling clusters, this update to doAzureParallel includes a few other new features: utility functions for managing multiple long-running R jobs, functions to read data from and write data to Azure Blob storage, and the ability to pre-load data into the cluster by specifying resource files.
Github (Azure): doAzureParallel