R code to accompany Real-World Machine Learning (Chapter 6): Making Predictions

May 13, 2017

Abstract

In the latest update to the rwml-R GitHub repo, R code is provided to complete the analysis of New York City taxi data from Chapter 6 of the book “Real-World Machine Learning” by Henrik Brink, Joseph W. Richards, and Mark Fetherolf. Examples include exploratory data analysis with ggplot2 to identify and remove bad data; scaling numerical data to remove potential bias; development of machine learning models with caret; plotting of receiver operating characteristic (ROC) curves to evaluate model performance; and tabulation and display of feature importance using caret and gridExtra.

Data import and EDA for NYC taxi example

As mentioned in my previous post,
the complete code-through for Chapter 6 includes
examples of getting the data and performing exploratory data analysis (EDA).

Omitting bad data

A plot of the tip_amount data reveals a large number of rides that
result in no tip. Further analysis shows that these “no tip” rides are
associated with cash payers. In order to develop a model to predict
tipping behavior based on other factors, we must remove the data for
which payment_type == "CSH". This is done easily with the filter
function in dplyr. The remaining data is then plotted in a
histogram with ggplot2.

library(dplyr)
library(ggplot2)

# Drop cash rides, which never record a tip, before modeling
dataJoined <- dataJoined %>%
  filter(payment_type != "CSH")

# Histogram of tip amounts for the remaining (non-cash) rides
p9 <- ggplot(dataJoined, aes(tip_amount)) +
  geom_histogram() +
  scale_x_continuous(limits = c(-1, 16.0), name = "Tip amount (USD)") +
  scale_y_continuous(name = "Count")
p9

The distribution of tip amounts when omitting cash payments

Model development and evaluation

After we have explored the data and identified the specific prediction
that we would like to make (Will non-cash rides result in a tip?), we
move on to model development. The effectiveness of the machine learning models
is evaluated by plotting the ROC curves and computing the Area Under the
Curve (AUC) using code developed for Chapter 4 (with slight
modifications for our purposes here).
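
In the repo, the curves are drawn with that Chapter 4 code; purely for
illustration, the same quantities can also be computed with the pROC
package. The labels and scores below are toy values (hypothetical), not
the taxi data:

library(pROC)

# Toy values standing in for true labels and predicted probabilities
labels <- factor(c("tip", "no_tip", "tip", "tip", "no_tip", "tip"),
                 levels = c("no_tip", "tip"))
scores <- c(0.9, 0.3, 0.7, 0.6, 0.4, 0.8)

# Build the ROC object; levels = c(controls, cases)
roc_obj <- roc(response = labels, predictor = scores,
               levels = c("no_tip", "tip"))
auc(roc_obj)   # Area Under the Curve
plot(roc_obj)  # ROC curve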

For each model, we follow our typical caret workflow of splitting the data
into training and test data sets with createDataPartition, setting up
control parameters with the trainControl function, training the model with
train, and making predictions with predict. The variable importance is
calculated with the varImp function of caret. We start with a simple
logistic regression model using only the numerical data. We then systematically
improve prediction results by moving to a random forest model, adding
categorical variables, and engineering date-time features. The final
ROC curve and table of variable importances are shown below.

The ROC curve and feature importance list for the random forest model, including all categorical features and additional date-time features
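
For reference, here is a minimal sketch of that caret workflow, not the
exact code from the repo. It assumes a data frame dataJoined with a
binary factor outcome column tipped with levels "no" and "yes"
(hypothetical names); the random forest model additionally requires the
randomForest package:

library(caret)

set.seed(42)

# Split into training and test sets (80/20), stratified on the outcome
idx      <- createDataPartition(dataJoined$tipped, p = 0.8, list = FALSE)
trainSet <- dataJoined[idx, ]
testSet  <- dataJoined[-idx, ]

# Cross-validation with class probabilities, so ROC can be the metric
ctrl <- trainControl(method = "cv", number = 5,
                     classProbs = TRUE, summaryFunction = twoClassSummary)

# Train a random forest; center and scale the numerical features first
fit <- train(tipped ~ ., data = trainSet,
             method = "rf", metric = "ROC",
             preProcess = c("center", "scale"),
             trControl = ctrl)

# Predicted class probabilities on the held-out test set
probs <- predict(fit, newdata = testSet, type = "prob")

# Variable importance, as tabulated in the figure above
varImp(fit)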

Feedback welcome

If you have any feedback on the rwml-R project, please
leave a comment below or use the Tweet button.
As with any of my projects, feel free to fork the rwml-R repo
and submit a pull request if you wish to contribute.
For convenience, I’ve created a project page for rwml-R with
the generated HTML files from knitr, including a page with
all of the examples from Chapter 6.

