As I mentioned yesterday, Microsoft R Server is now available for HDInsight, which means that you can now run R code (including the big-data algorithms of Microsoft R Server) on a managed, cloud-based Hadoop instance.
Debraj GuhaThakurta, Senior Data Scientist, and Shauheen Zahirazami, Senior Machine Learning Engineer at Microsoft, demonstrate some of these capabilities in their analysis of 170M taxi trips in New York City in 2013 (about 40 GB of data). Their goal was to show the use of Microsoft R Server on an HDInsight Hadoop cluster, and to that end, they created machine learning models using distributed R functions to predict (1) whether a tip was given for a taxi ride (a binary classification problem), and (2) the amount of the tip (a regression problem). The analyses involved building and testing different kinds of predictive models. Debraj and Shauheen uploaded the NYC Taxi data to HDFS on Azure blob storage, provisioned an HDInsight Hadoop cluster with 2 head nodes (D12), 4 worker nodes (D12), and 1 R-server node (D4), and installed RStudio Server on the HDInsight cluster to conveniently communicate with the cluster and drive the computations from R.
To predict the tip amount, Debraj and Shauheen used linear regression on the training set (75% of the full dataset, about 127M rows). Boosted decision trees were used to predict whether or not a tip was paid. On the held-out test data, both models did fairly well. The tip amounts predicted by the linear regression model had a correlation of 0.78 with the actual tip amounts (see figure below). The boosted decision tree model also performed well on the test data, with an AUC of 0.98.
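If the two evaluation metrics mentioned above are unfamiliar, here is a minimal sketch of how each can be computed. This is not the authors' code (their analysis uses Microsoft R Server functions on the cluster); it is plain Python on tiny made-up numbers, and the function names `pearson` and `auc` are my own for illustration. The correlation compares predicted and actual tip amounts, and the AUC is the probability that a randomly chosen tipped ride gets a higher score than a randomly chosen untipped one.

```python
import math


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def auc(labels, scores):
    """Area under the ROC curve, computed by comparing every
    positive/negative pair; a tie counts as half a win."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Made-up toy values, NOT taken from the taxi data.
actual_tip = [0.0, 1.0, 2.0, 3.0, 5.0]
predicted_tip = [0.4, 1.1, 1.6, 3.5, 4.2]
print("correlation:", round(pearson(actual_tip, predicted_tip), 3))

tipped = [0, 0, 1, 1]                 # 1 = a tip was paid
tip_score = [0.1, 0.4, 0.35, 0.8]     # model score for "tipped"
print("AUC:", auc(tipped, tip_score))  # 3 of 4 pairs ranked correctly -> 0.75
```

The pairwise AUC loop is fine for a handful of rows but would be far too slow at the scale of the taxi data, where a sort-based or distributed implementation would be needed.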
The data behind the analysis is public, so you can try it out yourself: the Microsoft R Server code for the analysis is available on GitHub, and you can read more about the analysis in the detailed writeup, linked below. The writeup also contains details about data exploration and modeling, including references to additional distributed machine learning functions in R that may be explored to improve model performance.
Scalable Data Analysis using Microsoft R Server (MRS) on Hadoop MapReduce: Using MRS on Azure HDInsight (Premium) for Exploring and Modeling the 2013 New York City Taxi Trip and Fare Data