Massively-parallel computations on Azure clusters with R, made easy with doAzureParallel
by JS Tan (Program Manager, Microsoft)
For users of the R language, scaling up their work to take advantage of cloud-based computing has generally been a complex undertaking. We are therefore excited to announce doAzureParallel, a lightweight R package built on Azure Batch that allows you to easily use Azure’s flexible compute resources right from your R session. The doAzureParallel package complements Microsoft R Server and provides the infrastructure you need to run massively parallel simulations on Azure directly from R.
The doAzureParallel package is a parallel backend for the popular foreach package, making it possible to execute multiple processes across a cluster of Azure virtual machines with just a few lines of R code. The package helps you create and manage the cluster in Azure, and register it as a parallel backend to be used with foreach.
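As a rough illustration of the foreach pattern described above, the sketch below registers a backend and then runs a loop through `%dopar%`. It uses foreach's built-in sequential backend (`registerDoSEQ()`) so that it runs on any machine. The assumption is that with doAzureParallel you would register the Azure cluster as the backend instead and leave the loop itself unchanged. The `slow_square` helper is hypothetical and only stands in for real work.

```r
library(foreach)

# Register a backend. registerDoSEQ() is foreach's built-in sequential
# backend, used here only so the sketch runs without a cluster.
# Assumption: with doAzureParallel, the Azure cluster would be registered
# at this point instead, and the loop below would stay the same.
registerDoSEQ()

# Hypothetical stand-in for an expensive, independent unit of work.
slow_square <- function(x) {
  x^2
}

# Each iteration is independent, so the registered backend is free to
# run iterations on different workers. .combine = c flattens the
# per-iteration results into a single numeric vector.
squares <- foreach(i = 1:8, .combine = c) %dopar% {
  slow_square(i)
}

print(squares)
```

Because the loop body never refers to the backend, switching from a local run to a cluster should only require changing the registration step.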
With doAzureParallel, there is no need to manually create, configure and manage a cluster of individual VMs. Running jobs at scale is as easy as running algorithms on your local machine. With Azure Batch’s autoscaling capabilities, you can also increase or decrease your cluster size to fit your workloads, saving you time and money. doAzureParallel also uses the Azure Data Science Virtual Machine (DSVM), which lets Azure Batch configure the appropriate environment quickly.
doAzureParallel is ideal for running embarrassingly parallel work such as parametric sweeps or Monte Carlo simulations, making it a great fit for many financial modelling algorithms (back-testing, portfolio scenario modelling, etc).
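To make "embarrassingly parallel" concrete, here is a minimal Monte Carlo sketch that estimates pi from independent batches of random points. It is an illustration only, not code from the package: `simulate_batch`, the batch counts and the per-batch seeding scheme are all assumptions, and the sequential backend is registered so the example runs locally.

```r
library(foreach)

# Sequential backend so the sketch runs locally; a parallel backend
# would distribute the batches across workers instead.
registerDoSEQ()

# Hypothetical helper: one independent batch of a Monte Carlo estimate.
# Draws n points uniformly in the unit square and returns the fraction
# that falls inside the quarter circle of radius 1.
simulate_batch <- function(n, seed) {
  set.seed(seed)  # per-batch seed keeps each batch reproducible
  x <- runif(n)
  y <- runif(n)
  mean(x^2 + y^2 <= 1)
}

n_batches <- 20
n_per_batch <- 10000

# Batches share no state and need no communication, which is what
# makes this workload embarrassingly parallel.
fractions <- foreach(b = seq_len(n_batches), .combine = c) %dopar% {
  simulate_batch(n_per_batch, seed = b)
}

# The quarter circle covers pi/4 of the unit square.
pi_estimate <- 4 * mean(fractions)
print(pi_estimate)
```

Seeding each batch with its index is a simplification for readability. For serious simulation work, a parallel-safe random number scheme would be a better choice.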
The doAzureParallel package is available for download on GitHub under the open-source MIT license, and there is no additional cost for these capabilities: you only pay for the Azure VMs you use.
Performance Gains with doAzureParallel
To illustrate the kind of speedup you can get with doAzureParallel, we ran a test comparing the time to compute the Mandelbrot set on a single 2-core machine, a 5-node cluster with 10 cores, a 10-node cluster with 20 cores, and finally a 20-node cluster with 40 cores.
From the graph, we can see that using doAzureParallel on Azure gives a speedup of about 2X with a cluster of 5 nodes, about 3X with a cluster of 10 nodes, and about 6X with a cluster of 20 nodes. From a cost perspective, running the 20-node cluster for about 50 seconds at 6X performance cost only $0.03 when using a cluster of Standard_F2 Linux VMs. (This was for the WestUS region. Pricing varies by region and can change over time.)
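The figures quoted above can be turned into two derived quantities: parallel efficiency relative to the 2-core baseline, and the per-VM hourly rate implied by the quoted cost. The short calculation below uses only the rounded numbers from the text, so its results are approximations. The rate calculation also assumes that cost is proportional to run time, which may not match how Azure actually bills.

```r
# Figures quoted in the text above (speedups are approximate).
baseline_cores <- 2
cores   <- c(10, 20, 40)
speedup <- c(2, 3, 6)

# Parallel efficiency: observed speedup divided by the ideal speedup,
# where ideal is the factor by which the core count grew over baseline.
ideal_speedup <- cores / baseline_cores
efficiency <- speedup / ideal_speedup
print(round(efficiency, 2))

# Implied per-VM hourly rate for the 20-node run: total cost divided
# by total VM-hours consumed. Assumes billing proportional to run time.
nodes <- 20
run_seconds <- 50
total_cost_usd <- 0.03
vm_hours <- nodes * run_seconds / 3600
implied_rate <- total_cost_usd / vm_hours
print(round(implied_rate, 3))
```

On these numbers, efficiency is 40% for the 5-node cluster and 30% for the 10-node and 20-node clusters, and the implied rate is roughly $0.108 per VM-hour. Sub-linear scaling like this is common for short jobs, where fixed overheads such as task scheduling and result transfer make up a larger share of the run time.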
The doAzureParallel package is available now on GitHub, where you can also find detailed documentation. For installation steps and demo code, check out the announcement at the Microsoft Azure blog linked below.
Microsoft Azure blog: doAzureParallel: Take advantage of Azure’s flexible compute directly from your R session