My colleagues Max Kaznady, Jason Zhang, Arijit Tarafdar and Miguel Fierro recently posted a really useful guide with lots of tips to speed up prototyping models with Microsoft R Server on Apache Spark. These tips apply when using Spark on Azure HDInsight, where you can spin up a Spark cluster in the cloud with Microsoft R installed on the head node and worker nodes with just a single request on the Azure portal.
The blog post provides detailed instructions, but the three main efficiency tips are:
Install RStudio Server on the head node. R Tools for Visual Studio doesn't yet support remote execution, and while you can interact with the R command line directly via SSH or PuTTY, having an IDE makes things much easier and faster. There's a simple guide to install RStudio Server on the head node, which you can then use remotely to drive both the head node and the worker nodes.
Use just the head node for iterating on your model. While you probably want to use all of the data to train your final model, you can speed up the iterative process of developing your model (selecting variables, creating features and transformations, and comparing performance of different model types) by working with a sample of your data on the head node only. While a model fit to a sample mightn't have the predictive power of one trained on all the data, it's more than sufficient to determine variable importance and compare performance between competing models. Ultimately, the more iterations you have time for when developing the model, the better the final production model (trained on all the data) will be.
Tune the Spark cluster to optimize performance for model training. Now that you've saved time on the model development, save some time for the full-data model training by tuning the Spark cluster. (This is particularly useful if the production model is going to be retrained on a regular basis, say on a weekly or overnight schedule.) Max and Arijit provide some guidelines for setting the on-heap and off-heap memory settings for each executor to maintain the ideal ratio.
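To make the memory-tuning tip a little more concrete, here is a rough sketch of the arithmetic involved in dividing a worker node's memory between on-heap and off-heap space for each executor. It is written in plain Python purely for illustration and is not taken from the original guide: the function name, the 4 GB reserved for the operating system, and the default off-heap fraction are all assumptions, and the ratio the authors actually recommend is in their post. The two Spark property names are the standard on-heap and YARN overhead settings, though the overhead property name varies between Spark versions.

```python
def split_executor_memory(node_memory_mb, executors_per_node,
                          off_heap_fraction=0.5, reserved_mb=4096):
    """Return (on_heap_mb, off_heap_mb) for each executor on one worker node.

    off_heap_fraction and reserved_mb are illustrative assumptions,
    not values recommended by the original guide.
    """
    if executors_per_node < 1:
        raise ValueError("executors_per_node must be at least 1")
    if not 0 < off_heap_fraction < 1:
        raise ValueError("off_heap_fraction must be between 0 and 1")

    # Leave some memory for the OS and other services on the node.
    usable_mb = node_memory_mb - reserved_mb
    if usable_mb <= 0:
        raise ValueError("reserved_mb leaves no memory for executors")

    # Share what is left evenly between the executors on this node.
    per_executor_mb = usable_mb // executors_per_node

    # Split each executor's share into off-heap and on-heap parts.
    off_heap_mb = int(per_executor_mb * off_heap_fraction)
    on_heap_mb = per_executor_mb - off_heap_mb
    return on_heap_mb, off_heap_mb


def as_spark_conf(on_heap_mb, off_heap_mb):
    """Format the split as Spark-on-YARN properties (overhead is given in MB)."""
    return {
        "spark.executor.memory": f"{on_heap_mb}m",
        "spark.yarn.executor.memoryOverhead": str(off_heap_mb),
    }


# Example: a 28 GB worker node running two executors.
on_heap, off_heap = split_executor_memory(28 * 1024, executors_per_node=2)
print(as_spark_conf(on_heap, off_heap))
```

Because the ratio is just a parameter, you can try a few different splits and compare training times on your own cluster before settling on one. In Microsoft R Server, the corresponding values would be passed when creating the Spark compute context rather than set as raw Spark properties.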
You can read the full details at the Azure blog post linked below, and you can try tuning a sample analysis on the Taxi dataset yourself, using the R code provided on GitHub.