If you'd like to manipulate and analyze very large data sets with the R language, one option is to use R and Apache Spark together. R provides the simple, data-oriented language for specifying transformations and models; Spark provides the storage and computation engine to handle data much larger than R alone can handle.
At the KDD 2016 conference last October, a team from Microsoft presented a tutorial on Scalable R on Spark, and made all of the materials available on GitHub. The materials include an 80-slide presentation covering several tutorials (you can download the 13 MB PowerPoint file here).
Slides 1-29 form an introduction which covers:
- Scaling R scripts on a single machine with the bigmemory and ff packages
- Interfacing to Spark from R with SparkR 1.6
- Installing and using the sparklyr package
- Using Microsoft R Server and the “RevoScaleR” package to offload its computations to Spark
- Comparisons and benchmarks of the techniques to scale R described above
Slides 32-44 form a hands-on tutorial that uses the airline arrival data to predict flight delays. In the tutorial, you use SparkR to clean and join the data, R Server's “rxDTree” function to fit a decision tree model to predict delays, and the AzureML package to publish a prediction function to Azure as a cloud-based flight-delay prediction service. The Microsoft R scripts are available here.
Slides 46-50 form another tutorial, this time working with the NYC Taxi dataset. The first tutorial script uses the sparklyr package to visualize the data and create models to predict the tip amount. The second tutorial script goes further with models, fitting Elastic Net, Random Forest and Gradient Boosted Tree models with both SparkR and sparklyr. In addition, this script uses SparkR and SparkSQL to create a map of the trips.
Slides 51-59 demonstrate optimizing the performance of a time series forecasting model by searching over a large parameter space with the hts package. Running the models in parallel to find the parameters that minimize the MAPE (mean absolute percentage error) reduced the total execution time to 1 day, compared with the 40 days needed to complete the computations serially. The parallelization was achieved with the Microsoft R Server “rxExec” function, which you can replicate with the script available here.
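To make the idea of a parallel parameter search concrete, here is a minimal sketch in base R. It is not the tutorial's script: it swaps in the `parallel` package that ships with R in place of “rxExec”, and a toy simple-exponential-smoothing forecast in place of the hts models. The function names (`mape`, `ses_forecast`, `score_alpha`), the simulated series, and the parameter grid are all hypothetical and chosen only for illustration.

```r
library(parallel)

# Mean absolute percentage error; assumes `actual` contains no zeros.
mape <- function(actual, forecast) {
  100 * mean(abs((actual - forecast) / actual))
}

# Toy model (an assumption, not hts): one-step-ahead forecasts from
# simple exponential smoothing with smoothing weight `alpha`.
ses_forecast <- function(y, alpha) {
  level <- y[1]
  forecast <- numeric(length(y) - 1)
  for (t in seq_along(forecast)) {
    forecast[t] <- level                          # forecast for y[t + 1]
    level <- alpha * y[t + 1] + (1 - alpha) * level
  }
  forecast
}

# Score one candidate parameter value against the observed series.
score_alpha <- function(alpha, y) {
  mape(y[-1], ses_forecast(y, alpha))
}

# Hypothetical data: a noisy upward trend that stays well away from zero.
set.seed(42)
y <- 100 + cumsum(rnorm(200, mean = 0.5))

# Hypothetical parameter grid to search over.
alphas <- seq(0.05, 0.95, by = 0.05)

# Each candidate is scored independently, so the work can be spread
# across worker processes (two here; use more on a larger machine).
cl <- makeCluster(2)
clusterExport(cl, c("mape", "ses_forecast"), envir = environment())
scores <- parSapply(cl, alphas, score_alpha, y = y)
stopCluster(cl)

# The best parameter is the one with the lowest MAPE.
best_alpha <- alphas[which.min(scores)]
print(best_alpha)
```

Because every candidate is evaluated independently of the others, the same pattern scales from a few local worker processes to the nodes of a cluster, which is the kind of workload the slides describe handing to “rxExec”.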
To work through the materials from the tutorial, you'll need access to a Spark cluster configured with Microsoft R Server and the necessary scripts and data files. You can easily create an HDInsight Premium cluster including Microsoft R Server on Microsoft Azure: these instructions provide the details. Once the cluster is ready, you can remotely access it from your desktop using ssh as described. The clusters are charged by the hour (according to their size and power), so be sure to shut down the cluster when you're done with the tutorial.
These tutorials should be useful to anyone who is learning to use R with Spark. The full collection of materials and slides is available at the GitHub repository below.