by Udayan Kumar, Data Scientist at Microsoft
Beginning in 2016, Microsoft rolled out a preview of Microsoft R Server (MRS) for Azure HDInsight clusters. The service provides a preconfigured instance of R Server with Spark/Hadoop that can be provisioned within minutes. Recent blog posts (by Max Kaznady and David Smith) have highlighted how to use and tune this service for large-scale machine learning tasks. In this post, we push the envelope and show how to build an end-to-end, fully operationalized analytics pipeline using Azure Data Factory (ADF) and MRS on HDInsight (specifically Apache Spark). We provide a walk-through tutorial showing how to use MRS with ADF, along with this Azure Resource Manager (ARM) template for easy deployment.
Machine learning and data analysis tasks (such as training and retraining) often run periodically and depend on other ETL tasks. ADF simplifies running and managing such recurring tasks, making it easy to create production-ready pipelines. Essentially, ADF provides the following:
- the ability to run tasks (aka pipeline activities) periodically,
- the ability to set dependencies between tasks, and
- easy maintainability and problem detection.
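To make this concrete, an ADF (v1) pipeline is defined in JSON as a set of scheduled activities. The fragment below is only an illustrative sketch of that shape; the activity name, linked service, file paths, and schedule are hypothetical placeholders, not values from the tutorial's template:

```json
{
  "name": "RetrainingPipeline",
  "properties": {
    "description": "Periodically retrains the model on an HDInsight cluster",
    "activities": [
      {
        "name": "RunMrsTraining",
        "type": "HDInsightMapReduce",
        "linkedServiceName": "HDInsightLinkedService",
        "typeProperties": {
          "className": "com.example.TaskRunner",
          "jarFilePath": "adflibs/taskrunner.jar",
          "arguments": [ "wasb:///scripts/trainModel.R" ]
        },
        "scheduler": { "frequency": "Day", "interval": 1 },
        "policy": { "retry": 2, "timeout": "01:00:00" }
      }
    ],
    "start": "2016-08-01T00:00:00Z",
    "end": "2016-09-01T00:00:00Z"
  }
}
```

The `scheduler` section gives the periodic execution, and chaining activities via input/output datasets is how ADF expresses dependencies between tasks.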
By integrating Azure Data Factory with Microsoft R Server and Spark, we show how to configure a scalable training and scoring pipeline that operates on large volumes of data. For this, we use the NYC taxi dataset and build a predictive model that estimates the tip amount for each trip. Because the actual tip is known once a trip completes, that data can be used to retrain the model, and this is where ADF comes into the picture: with ADF, we can set up a pipeline that retrains the model and scores incoming trips on a regular cadence. Currently, ADF supports only MapReduce jobs on Linux HDInsight clusters. Since MRS runs only on Linux HDInsight clusters, as a workaround we use a custom task runner, masquerading as a MapReduce job, to launch the MRS script. Detailed instructions on how to do so are available at the link below.
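For context, the core of such an MRS training-and-scoring script might look like the following RevoScaleR sketch. This is an assumption-laden illustration, not the tutorial's actual script: the file paths and column names are hypothetical, and it only runs on an MRS-enabled HDInsight cluster:

```r
# Illustrative sketch only; paths and column names are hypothetical.
library(RevoScaleR)

# Use the Spark compute context so rx* functions run distributed on the cluster
cc <- RxSpark(consoleOutput = TRUE)
rxSetComputeContext(cc)

# Point at the taxi trip data in the cluster's default storage (WASB)
taxiData <- RxTextData("wasb:///data/nyctaxi/trips.csv")

# Train a linear model predicting tip amount from trip features
model <- rxLinMod(tip_amount ~ trip_distance + passenger_count + fare_amount,
                  data = taxiData)

# Score newly arrived trips; rxPredict writes predictions to the output dataset
rxPredict(model,
          data = RxTextData("wasb:///data/nyctaxi/new_trips.csv"),
          outData = RxXdfData("wasb:///data/nyctaxi/scored"))
```

In the pipeline, a script along these lines would be the payload launched by the custom task runner, with ADF triggering it on the configured schedule.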