This post explores using R’s MLmetrics package to evaluate machine learning models. MLmetrics provides functions for calculating common model metrics, including AUC, precision, recall, and accuracy.
Building an example model
First, we need to build a model to use as an example. For this post, we’ll use a dataset on pulsar stars from Kaggle, saved as “pulsar_stars.csv”. Each record in the file represents a pulsar star candidate. The goal is to predict whether a record is a pulsar star based on the available attributes.
To get started, let’s load the packages we’ll need and read in our dataset.
library(MLmetrics)
library(dplyr)
stars = read.csv("pulsar_stars.csv")
Next, let’s split our data into train vs. test. We’ll do a standard 70/30 split here.
set.seed(0)
train_indexes = sample(1:nrow(stars), .7 * nrow(stars))
train_set <- stars[train_indexes,]
test_set <- stars[-train_indexes,]
Now, let’s build a simple logistic regression model.
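# put target_class first so formula() uses it as the response
train_set <- data.frame(train_set %>% select(target_class), train_set %>% select(-target_class))
# build model
model <- glm(formula(train_set), train_set, family = "binomial")
# get predicted probabilities for the train and test sets
train_pred <- predict(model, train_set, type = "response")
test_pred <- predict(model, test_set, type = "response")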
AUC / precision / recall / accuracy
Let’s calculate a few metrics. One of the most common metrics for classification is AUC, which can be calculated using MLmetrics’ AUC function. Intuitively, AUC is a score between 0 and 1 that measures how well a model rank-orders predictions: a score of 1 means every positive case is ranked above every negative case, while 0.5 is what random ranking would achieve.
# get AUC on test and train set
AUC(test_pred, test_set$target_class) # 0.974172
AUC(train_pred, train_set$target_class) # 0.9773794
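To make the rank-ordering intuition concrete, here’s a minimal sketch in base R that computes AUC as the share of (positive, negative) pairs in which the positive case gets the higher score. The toy data and the rank_auc helper are made up for illustration and are not part of MLmetrics.

```r
# Toy data (hypothetical): predicted probabilities and true 0/1 labels
scores <- c(0.9, 0.8, 0.7, 0.4, 0.3, 0.1)
labels <- c(1, 1, 0, 1, 0, 0)

# rank_auc() is an illustrative helper, not an MLmetrics function.
# It uses the rank-sum (Mann-Whitney) form of AUC.
rank_auc <- function(y_pred, y_true) {
  r <- rank(y_pred)              # average ranks are used for ties
  n_pos <- sum(y_true == 1)
  n_neg <- sum(y_true == 0)
  (sum(r[y_true == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

auc_toy <- rank_auc(scores, labels)
print(auc_toy)
```

Here 8 of the 9 positive/negative pairs are ordered correctly (the positive scored 0.4 sits below the negative scored 0.7), so the result is 8/9.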
As a refresher, here’s a quick overview of precision, recall, and accuracy:
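In terms of the counts of true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN), the standard definitions are:

```latex
\begin{align*}
\text{Precision} &= \frac{TP}{TP + FP} \\
\text{Recall}    &= \frac{TP}{TP + FN} \\
\text{Accuracy}  &= \frac{TP + TN}{TP + TN + FP + FN}
\end{align*}
```

Precision asks how many of the predicted pulsar stars really are pulsar stars, recall asks how many of the actual pulsar stars we found, and accuracy is the overall share of correct predictions.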
Notice that each of these metrics is computed from predicted classes (0 or 1) rather than predicted probabilities. To handle this, we need to set a threshold on our predicted probabilities. One way to do this is to label any prediction of 50% or higher as a predicted pulsar star, and any prediction below 50% as not a pulsar star.
For example, if we pick 0.5 as a threshold, our precision on the test set would be 0.9114219.
Precision(test_set$target_class, ifelse(test_pred >= .5, 1, 0), positive = 1) # 0.9114219
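To see what happens under the hood, here’s a small base R sketch that applies a 0.5 threshold to some made-up probabilities and computes the metrics directly from the confusion-matrix counts. The data and variable names are hypothetical and only meant to illustrate the mechanics.

```r
# Toy data (hypothetical): predicted probabilities and true 0/1 labels
probs  <- c(0.95, 0.80, 0.60, 0.55, 0.45, 0.30, 0.10)
actual <- c(1, 1, 0, 0, 1, 0, 0)

# Convert probabilities to class labels at a 0.5 threshold
predicted <- ifelse(probs >= 0.5, 1, 0)

# Confusion-matrix counts
tp <- sum(predicted == 1 & actual == 1)  # true positives
fp <- sum(predicted == 1 & actual == 0)  # false positives
fn <- sum(predicted == 0 & actual == 1)  # false negatives
tn <- sum(predicted == 0 & actual == 0)  # true negatives

precision_toy <- tp / (tp + fp)
recall_toy    <- tp / (tp + fn)
accuracy_toy  <- (tp + tn) / length(actual)

print(c(precision = precision_toy, recall = recall_toy, accuracy = accuracy_toy))
```

With these toy values, four records are predicted positive and two of them are correct, so precision is 2/4, recall is 2/3, and accuracy is 4/7.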
Rather than just picking 0.5, though, we can try to optimize the cutoff we choose. One way to do this is to choose the threshold that maximizes the F1 Score, which is defined as the harmonic mean of precision and recall.
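Written out, with P for precision and R for recall (and the same TP, FP, FN counts used to define them), the harmonic mean is:

```latex
\[
F_1 = \frac{2 \cdot P \cdot R}{P + R} = \frac{2\,TP}{2\,TP + FP + FN}
\]
```

Because it is a harmonic mean, the F1 Score is pulled toward the smaller of the two values, so a threshold only scores well if precision and recall are both reasonably high.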
Below, we calculate the F1 Score on the training set for each threshold 0.01, 0.02, 0.03, …, 0.99. The threshold with the highest F1 Score is 0.32, or 32%.
f1_scores <- sapply(seq(0.01, 0.99, .01), function(thresh) F1_Score(train_set$target_class, ifelse(train_pred >= thresh, 1, 0), positive = 1))
# which.max returns a position; position 32 corresponds to the threshold 0.32
which.max(f1_scores) # 32
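Since which.max returns a position rather than a threshold, it can be a little safer to keep the thresholds in their own vector and look the winner up by index. Here’s a minimal base R sketch of that idea on made-up data; the f1_at helper and the toy values are hypothetical and not part of MLmetrics.

```r
# Toy data (hypothetical): predicted probabilities and true 0/1 labels
probs  <- c(0.955, 0.805, 0.605, 0.555, 0.455, 0.305, 0.105)
actual <- c(1, 1, 0, 0, 1, 0, 0)

# f1_at() is an illustrative helper: F1 Score at a given threshold,
# computed from confusion-matrix counts
f1_at <- function(thresh, y_pred, y_true) {
  predicted <- ifelse(y_pred >= thresh, 1, 0)
  tp <- sum(predicted == 1 & y_true == 1)
  fp <- sum(predicted == 1 & y_true == 0)
  fn <- sum(predicted == 0 & y_true == 1)
  2 * tp / (2 * tp + fp + fn)
}

thresholds <- seq(0.01, 0.99, 0.01)
f1_toy <- sapply(thresholds, f1_at, y_pred = probs, y_true = actual)

# Look up the threshold itself instead of relying on the index
best_thresh <- thresholds[which.max(f1_toy)]
print(best_thresh)
```

When several thresholds tie for the best score, which.max returns the first one, so this picks the lowest of the tied thresholds.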
Using this cutoff, we can calculate precision, recall, and accuracy.
Precision(test_set$target_class, ifelse(test_pred >= .32, 1, 0), positive = 1)
Recall(test_set$target_class, ifelse(test_pred >= .32, 1, 0), positive = 1)
Accuracy(ifelse(test_pred >= .32, 1, 0), test_set$target_class)
In general, there is a trade-off between precision and recall: raising the threshold tends to increase precision and lower recall, and vice versa. The best threshold therefore depends on which of those metrics matters more. Optimizing the F1 Score is a good way to choose a threshold that balances both.
Another metric that can be used to evaluate classification models is the Gini coefficient. Gini is calculated as 2 * AUC - 1. Thus, for the test set, we get 2 * 0.974172 - 1 = 0.948344.
Gini(test_pred, test_set$target_class) # 0.948344
MLmetrics also has functions for non-classification metrics, such as RMSE (root mean squared error) and RAE (relative absolute error).
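As a rough illustration of what those two metrics measure, here’s a base R sketch using their textbook definitions on made-up regression values. The data is hypothetical. The package’s RMSE and RAE functions should give the same results, but it’s worth checking the argument order in the package documentation before relying on them.

```r
# Toy data (hypothetical): actual and predicted values from a regression model
y_true <- c(3.0, 5.0, 2.0, 7.0)
y_pred <- c(2.5, 5.0, 3.0, 8.0)

# Root mean squared error: square root of the average squared error
rmse_toy <- sqrt(mean((y_true - y_pred)^2))

# Relative absolute error: total absolute error relative to the total
# absolute error of always predicting the mean of y_true
rae_toy <- sum(abs(y_true - y_pred)) / sum(abs(y_true - mean(y_true)))

print(c(rmse = rmse_toy, rae = rae_toy))
```

An RAE below 1 means the predictions beat the naive strategy of always guessing the mean.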