The rwml-R GitHub repo has been updated with R code for the event modeling examples from Chapter 5 of the book “Real-World Machine Learning” by Henrik Brink, Joseph W. Richards, and Mark Fetherolf. The examples include reading large data files with the fread function from data.table, optimizing model parameters with caret, computing and plotting ROC curves with ggplot2, engineering date-time features, and determining the relative importance of features with the varImp function in caret.
The data for the event modeling examples in Chapter 5 of the book comes from the Kaggle Event Recommendation Engine Challenge. The rules of the challenge prohibit redistribution of the data, so readers must download it from Kaggle.
To work through the examples, features from the rather large events.csv file are processed several times. To save time, an alternative to the read.csv function is needed. This is where the fread function from the data.table library comes in. It works much like read.csv but is faster and more convenient. On my MacBook Pro, it took only seven seconds to read lng features from the >1GB events.csv data file.
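As a rough illustration of the idea, the sketch below writes a tiny stand-in CSV and reads only a few of its columns with fread. The file contents and the column names (event_id, city, lat, lng) are assumptions made for the example, not the actual layout of events.csv.

```r
library(data.table)

# Hypothetical stand-in for the large events.csv file; the column names
# and values here are illustrative assumptions only.
tmp <- tempfile(fileext = ".csv")
writeLines(c("event_id,city,lat,lng",
             "1,Austin,30.27,-97.74",
             "2,Boston,42.36,-71.06",
             "3,Denver,39.74,-104.99"), tmp)

# select= names the columns to keep, so the other columns are never
# materialized in memory.
events <- fread(tmp, select = c("event_id", "lat", "lng"))
print(events)
```

Restricting the read to the needed columns with select is one of the main reasons repeated passes over a wide file stay fast.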
Initial cross-validated ROC curve and AUC metric
Once a training data set is built with features from the data files
(including users.csv), the train function from caret is used
to train a random forest model,
evaluated using 10-fold cross-validation with the area under the
receiver operating characteristic (ROC) curve as the metric.
The ROC curve and area under the curve (AUC) for the model (when applied
to held-out test data from the training data set) are shown in the figure
below. An AUC of 0.86 is achieved with the initial set of six features.
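The workflow described above might look roughly like the following sketch. It uses synthetic data rather than the Kaggle features, and the variable names (x1, x2, x3, interested), the 80/20 split, and the mtry grid are all assumptions for illustration. It also assumes the randomForest package is installed, since caret's "rf" method depends on it.

```r
library(caret)    # train(), trainControl(), createDataPartition()
library(ggplot2)

set.seed(42)
# Synthetic two-class data standing in for the real training set
n <- 400
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
d$interested <- factor(ifelse(d$x1 + 0.5 * d$x2 + rnorm(n) > 0, "yes", "no"),
                       levels = c("yes", "no"))

# Hold out 20% of the rows for the final ROC curve
idx <- createDataPartition(d$interested, p = 0.8, list = FALSE)
train_df <- d[idx, ]
test_df  <- d[-idx, ]

# 10-fold CV; classProbs + twoClassSummary expose "ROC" (the AUC) as a metric
ctrl <- trainControl(method = "cv", number = 10,
                     classProbs = TRUE, summaryFunction = twoClassSummary)
fit <- train(interested ~ ., data = train_df, method = "rf",
             metric = "ROC", trControl = ctrl,
             tuneGrid = data.frame(mtry = 1:2))

# Predicted probability of the "yes" class on the held-out rows
p_yes  <- predict(fit, newdata = test_df, type = "prob")[, "yes"]
is_pos <- test_df$interested == "yes"

# ROC points at each distinct score threshold (handles tied scores)
thr <- sort(unique(p_yes), decreasing = TRUE)
roc <- data.frame(
  fpr = c(0, sapply(thr, function(t) mean(p_yes[!is_pos] >= t))),
  tpr = c(0, sapply(thr, function(t) mean(p_yes[is_pos] >= t)))
)

# Trapezoidal area under the ROC curve
auc <- sum(diff(roc$fpr) * (head(roc$tpr, -1) + tail(roc$tpr, -1)) / 2)

p <- ggplot(roc, aes(x = fpr, y = tpr)) +
  geom_path() +
  geom_abline(intercept = 0, slope = 1, linetype = "dashed") +
  labs(x = "False positive rate", y = "True positive rate",
       title = sprintf("ROC curve (AUC = %.2f)", auc))
print(p)
```

The AUC is computed by hand here only to keep the sketch transparent; a dedicated package such as pROC would serve equally well.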
Inclusion of date-time and bag-of-words features leads to over-fitting
Ten date-time features (such as “week of the year” and “day of the week”)
are extracted from the timestamp column.
When these features are added to the model, the AUC actually decreases slightly,
indicating over-fitting. The AUC decreases even more when the available
“bag-of-words” data is included in the model.
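To make the date-time step concrete, here is a minimal base R sketch that derives a handful of such features from a timestamp. The timestamp strings, their ISO 8601 format, and the feature names are assumptions for the example; the actual column format and the exact set of ten features may differ.

```r
# Hypothetical timestamps; the real column format may differ
ts_raw <- c("2012-10-02T15:53:05", "2012-12-25T08:00:00")
ts <- as.POSIXct(ts_raw, format = "%Y-%m-%dT%H:%M:%S", tz = "UTC")

# Each feature is pulled out with a strftime-style format code
dt_features <- data.frame(
  year         = as.integer(format(ts, "%Y")),
  month        = as.integer(format(ts, "%m")),
  day_of_month = as.integer(format(ts, "%d")),
  hour         = as.integer(format(ts, "%H")),
  day_of_week  = as.integer(format(ts, "%u")),  # 1 = Monday ... 7 = Sunday
  week_of_year = as.integer(format(ts, "%U")),  # weeks start on Sunday
  day_of_year  = as.integer(format(ts, "%j"))
)
print(dt_features)
```

Features like these are cheap to generate, which is part of why they can add noise faster than signal when the training set is small.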
The varImp function from the caret package computes the
relative importance of each variable for objects produced by the train
method. A plot of the relative importance for the top seven variables
in the final random forest model is shown below.
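A small sketch of how varImp might be applied to a caret model follows. The data is synthetic and the predictor names (strong, weak, noise) are made up so that the expected ranking is obvious; as before, the randomForest package is assumed to be installed.

```r
library(caret)

set.seed(1)
# Synthetic data: "strong" drives the outcome, "weak" barely, "noise" not at all
n <- 300
d <- data.frame(strong = rnorm(n), weak = rnorm(n), noise = rnorm(n))
d$y <- factor(ifelse(2 * d$strong + 0.3 * d$weak + rnorm(n) > 0, "yes", "no"))

fit <- train(y ~ ., data = d, method = "rf",
             trControl = trainControl(method = "cv", number = 5),
             tuneGrid = data.frame(mtry = 2))

# Importance scores are scaled to the 0-100 range by default
imp <- varImp(fit)
print(imp)

# Dot plot of the most important variables (top = 3 shows all three here)
print(plot(imp, top = 3))
```

For a plot like the one in this post, top = 7 would restrict the display to the seven highest-ranked variables.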
If you have any feedback on the rwml-R project, please
leave a comment below or use the Tweet button.
As with any of my projects, feel free to fork the rwml-R repo
and submit a pull request if you wish to contribute.
For convenience, I’ve created a project page for rwml-R with
the generated HTML files from knitr, including a page with
all of the event-modeling examples from Chapter 5.