Building Scalable Data Pipelines with Microsoft R Server and Azure Data Factory

October 4, 2016

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

by Udayan Kumar, Data Scientist at Microsoft

Beginning in 2016, Microsoft rolled out a preview of Microsoft R Server (MRS) for Azure HDInsight clusters. This service provides a preconfigured instance of R Server with Spark/Hadoop that can be provisioned within minutes. Recent blog posts (by Max Kaznady and David Smith) have highlighted how to use and tune this service for large-scale machine learning tasks. In this post, we go a step further and show how to build an end-to-end, fully operationalized analytics pipeline using Azure Data Factory (ADF) and MRS on HDInsight (specifically Apache Spark). We provide a walk-through tutorial that shows how to use MRS with ADF, along with an Azure Resource Manager (ARM) template for easy deployment.


Architecture diagram of Microsoft R Server with Azure Data Factory

Machine learning and data analysis tasks (such as training and retraining) are often run periodically and depend on other ETL tasks. ADF simplifies running and managing such recurring tasks, which makes it easier to build production-ready pipelines. Essentially, ADF provides the following:

  1. ability to run tasks (aka pipeline activities) periodically,
  2. ability to set dependency between tasks, and
  3. easy maintainability and problem detection. 
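The first two capabilities in the list above, periodic runs and dependencies between tasks, can be illustrated with a small sketch. This is plain Python and not ADF's API: the activity names (`etl_load_trips`, `retrain_model`, `score_new_trips`), the daily window, and the helper functions are all hypothetical, chosen only to show how a schedule of time windows combines with a dependency order.

```python
from datetime import datetime, timedelta
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def time_slices(start, end, every):
    """Yield (slice_start, slice_end) windows covering [start, end)."""
    current = start
    while current < end:
        yield current, min(current + every, end)
        current += every


# Hypothetical activities: each name maps to the set of activities it depends on.
dependencies = {
    "etl_load_trips": set(),
    "retrain_model": {"etl_load_trips"},
    "score_new_trips": {"retrain_model"},
}


def run_order(deps):
    """Return the activities in an order that respects every dependency."""
    return list(TopologicalSorter(deps).static_order())


def plan(start, end, every, deps):
    """List (slice_start, activity) pairs in the order they would run."""
    order = run_order(deps)
    return [(s, name) for s, _ in time_slices(start, end, every) for name in order]


if __name__ == "__main__":
    schedule = plan(datetime(2016, 10, 1), datetime(2016, 10, 3),
                    timedelta(days=1), dependencies)
    for slice_start, activity in schedule:
        print(slice_start.date(), activity)
```

In a real deployment ADF itself handles the scheduling and the ordering. The sketch only makes explicit what "periodic" and "dependent" mean for a chain of activities.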

By integrating Azure Data Factory with Microsoft R Server and Spark, we show how to configure a scalable training and testing pipeline that operates on large volumes of data. For this, we use the NYC taxi dataset. A predictive model estimates the tip amount for each trip, and the tip actually received after each completed trip can then be used to retrain the model. This is where ADF comes into the picture. With ADF, it is possible to set up a pipeline that retrains the model and scores incoming inputs at a regular cadence. Currently, ADF supports only MapReduce jobs on Linux HDInsight clusters. Since MRS runs only on Linux HDInsight clusters, we work around this limitation with a custom task runner that masquerades as a MapReduce job and runs the MRS script. Detailed instructions on how to do so are available at the link below.
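The score-then-retrain loop described above can be sketched in a few lines. This is an illustrative stand-in only: it uses plain Python and a one-variable least-squares fit of tip against fare, not Microsoft R Server or the model from the walk-through, and the class and function names (`TipPipeline`, `fit_line`, `score`, `run_slice`) are hypothetical.

```python
def fit_line(fares, tips):
    """Ordinary least squares fit of tip = slope * fare + intercept."""
    n = len(fares)
    mean_fare = sum(fares) / n
    mean_tip = sum(tips) / n
    sxx = sum((f - mean_fare) ** 2 for f in fares)
    sxy = sum((f - mean_fare) * (t - mean_tip) for f, t in zip(fares, tips))
    slope = sxy / sxx
    return slope, mean_tip - slope * mean_fare


def score(model, fares):
    """Predict a tip for each fare with the current (slope, intercept) model."""
    slope, intercept = model
    return [slope * f + intercept for f in fares]


class TipPipeline:
    """Holds the training data seen so far and the model fitted on it."""

    def __init__(self, fares, tips):
        self.fares = list(fares)
        self.tips = list(tips)
        self.model = fit_line(self.fares, self.tips)

    def run_slice(self, new_fares, observed_tips):
        """Score the incoming trips, then retrain once their tips are known."""
        predictions = score(self.model, new_fares)
        self.fares.extend(new_fares)
        self.tips.extend(observed_tips)
        self.model = fit_line(self.fares, self.tips)
        return predictions


if __name__ == "__main__":
    pipeline = TipPipeline(fares=[10.0, 20.0], tips=[2.0, 4.0])
    print(pipeline.run_slice(new_fares=[30.0], observed_tips=[9.0]))
    print(pipeline.model)
```

Each call to `run_slice` corresponds to one scheduled run: new trips are scored with the existing model, and the model is refitted once the observed tips for those trips are available.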

GitHub: Building Scalable Data Pipelines with Microsoft R Server and Azure Data Factory: sample code and walk-through
