Building a machine learning model with the MicrosoftML package

January 24, 2017

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

Microsoft R Server 9 includes a new R package for machine learning: MicrosoftML. (So do the Data Science Virtual Machine and the free Microsoft R Client edition, incidentally.) This package includes a suite of fast predictive modeling functions implemented by Microsoft Research, including:

  • Linear (rxFastLinear) and logistic (rxLogisticRegression) model functions based on the Stochastic Dual Coordinate Ascent method;
  • Classification/regression trees (rxFastTrees) and random forests (rxFastForest) based on FastRank, an efficient implementation of the MART gradient boosting algorithm;
  • A neural network algorithm (rxNeuralNet) with support for custom, multilayer network topologies; and
  • One-class anomaly detection (rxOneClassSvm) based on support vector machines.

As the function names suggest, the implementations are tuned for speed: most use multiple CPUs, and some will even use the GPU (if available). Not all of the implementations scale to unlimited data sizes, however; all but the linear and logistic regression routines are bound by available RAM.
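For a rough idea of what calling these learners looks like, here is a minimal sketch on a synthetic data frame. The data, the column names (x1, x2, label) and the choice of type = "binary" for every learner are assumptions made for illustration, not taken from the walkthrough. The block only fits models if MicrosoftML is installed; otherwise it prints a note and moves on.

```r
# Synthetic binary-classification data (illustrative only).
set.seed(42)
n <- 500
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
toy$label <- (toy$x1 + 0.5 * toy$x2 + rnorm(n)) > 0   # logical response

fits <- NULL
if (requireNamespace("MicrosoftML", quietly = TRUE)) {
  library(MicrosoftML)
  form <- label ~ x1 + x2
  # One fit per learner named in the list above, all as binary classifiers.
  fits <- list(
    fastLinear = rxFastLinear(form, data = toy, type = "binary"),
    logistic   = rxLogisticRegression(form, data = toy, type = "binary"),
    fastTrees  = rxFastTrees(form, data = toy, type = "binary"),
    fastForest = rxFastForest(form, data = toy, type = "binary"),
    neuralNet  = rxNeuralNet(form, data = toy, type = "binary")
  )
  print(names(fits))
} else {
  message("MicrosoftML is not installed; skipping the model fits.")
}
```

rxOneClassSvm is left out here because anomaly detection is not a labeled classification task and needs a different setup.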

If you want to give these routines a try, the Microsoft R Server Tiger Team has prepared a walkthrough analyzing the famous NYC Taxi data set. Once you have access to Microsoft R Server (or Client), this R script walks you through the process of:

  • Loading the MicrosoftML package
  • Importing the NYC Taxi Data from SQL Server (it comes preinstalled on the Data Science Virtual Machine)
  • Splitting the data into a test set and a training set, with the binary value "tipped" (whether or not the driver was tipped) as the response
  • Fitting several predictive models: logistic regression, linear model, fast forest, and neural network
  • Making predictions on the test data
  • Evaluating model performance by comparing AUC (area under the ROC curve)
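If you don't have Microsoft R Server handy, the shape of this workflow can be sketched in base R. This is not the Tiger Team's script: the data below is synthetic, the predictor columns (trip_distance, passenger_count) are hypothetical, and base R's glm() stands in for the MicrosoftML learners. The auc() helper is likewise an illustration, computing AUC from ranks (the Mann-Whitney statistic).

```r
# Synthetic stand-in for the taxi data; only `tipped` is named in the article.
set.seed(1)
n <- 2000
taxi <- data.frame(
  trip_distance   = rexp(n, rate = 1 / 3),
  passenger_count = sample(1:4, n, replace = TRUE)
)
taxi$tipped <- rbinom(n, 1, plogis(-0.5 + 0.4 * taxi$trip_distance))

# Split into a training set (75%) and a test set (25%).
train_idx <- sample(n, size = round(0.75 * n))
train <- taxi[train_idx, ]
test  <- taxi[-train_idx, ]

# Fit a logistic regression with "tipped" as the binary response.
fit <- glm(tipped ~ trip_distance + passenger_count,
           data = train, family = binomial())

# Predict tipping probabilities on the held-out test data.
scores <- predict(fit, newdata = test, type = "response")

# AUC as the probability that a random positive outranks a random negative.
auc <- function(scores, labels) {
  pos   <- labels == 1
  n_pos <- sum(pos)
  n_neg <- sum(!pos)
  r     <- rank(scores)   # ties receive average ranks
  (sum(r[pos]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

test_auc <- auc(scores, test$tipped)
print(test_auc)
```

To compare several models as the walkthrough does, you would repeat the fit and predict steps for each one and line up the resulting AUC values.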

The ROC curves are shown below. As you'd expect, the linear model performs poorly compared to the others, since it's being applied here to a binary variable.

[Figure: ROC curves for the fitted models]
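One way to see why a linear model is an awkward fit for a binary response: a least-squares line is not constrained to the range of a probability. The toy example below uses base R's lm() on made-up data (not the taxi data, and not rxFastLinear) and is only meant to illustrate that one point.

```r
# A 0/1 response that switches from 0 to 1 as x crosses zero.
x <- seq(-3, 3, length.out = 61)
y <- as.numeric(x > 0)

# Ordinary least squares treats y as a continuous quantity.
linear_fit <- lm(y ~ x)
pred <- predict(linear_fit)

# The fitted values spill below 0 and above 1, so they cannot be
# read directly as probabilities of the positive class.
print(range(pred))
```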

To try it out yourself, follow the walkthrough linked below, which also provides instructions for running the logistic regression model in SQL Server Management Studio.

Microsoft R Server Tiger Team: Predicting NYC Taxi Tips using MicrosoftML
