You may have heard that R and the big-data RevoScaleR package have been integrated with with SQL Server 2016 as SQL Server R Services. If you've been wanting to try out R with SQL Server but haven't been sure where to start, a new MSDN tutorial will take you through all the steps of creating a predictive model: from obtaining data for analysis, to building a statistical model, to creating a stored prodedure to make predictions from the model. To work through the tutorial, you'll need a suitable Windows server on which to install the SQL Server 2016 Community Technology Preview, and make sure you have SQL Server R Services installed. You'll also need a separate Windows machine (say a desktop or laptop) where you'll install Revolution R Open and Revolution R Enterprise. Most of the computations will be happening in SQL Server, though, so this “data science client machine” doesn't need to be as powerful.
The tutorial is made up of five lessons, which together should take you about 90 minutes to run though. If you run into problems, each lesson includes troubleshooting tips at the end.
Lesson 1 begins with downloading the New York City taxi data set (which was also used to create these beautiful data visualizations) and loading it into SQL Server. You'll also set up R to include some useful packages such as ggmap and RODBC.
Lesson 2 starts by having you verify the data using SQL queries. Don't miss the “Next Steps” links near the end, where you'll summarize the data using the RevoScaleR package on the data science client machine, and then visualize the data as a map with the ggmaps package (as shown below).
Lesson 3 focuses on using R to augment the data with new features, such as calculating the distance between pickup and dropoff points using a custom R function or using T-SQL.
Lesson 4 is where you'll use the rxLogit function to train a logistic regression model to predict the probability of a driver receiving a tip for a ride, evauate the model using ROC curves, and then deploy the prediction into SQL Server as a T-SQL stored procedure.
Lesson 5 wraps things up by showing how to use the deployed model in a production environment, both by calculating predictions from a stored dataset in batch mode, and by performing transactional predictions one trip at a time.
To save on cutting-and-pasting, you can find all of the code used in the tutorial on Github. Give it a go, and before long you'll have your own R models running live in SQL Server.