Top 10 Machine Learning Evaluation Metrics for Classification – Implemented in R

[This article was first published on Tag: r - Appsilon | Enterprise R Shiny Dashboards, and kindly contributed to R-bloggers.]

So, you’ve trained a classification machine learning model. Now what? How do you evaluate it? That’s where machine learning evaluation metrics for classification come in. This article brings you the top 10 metrics you must know, implemented primarily for binary classification problems. Multi-class classification datasets might require you to tweak the formulas slightly. We’ll first train a logistic regression model, and then we’ll go over each metric in detail.

After reading, you’ll have no trouble picking out the right set of machine learning evaluation metrics for classification datasets. You’ll know what each one stands for, what ranges you can expect the metric value to be in, and what it all means for your model’s predictive power. So without further ado, let’s get started!

Data going into a machine learning model has to be preprocessed adequately – Make sure you know how to do this step in R.

Let’s Train a Binary Classification Machine Learning Model in R

We’ll start this article by training a binary classification model using logistic regression. The dataset of choice will be Titanic, as it’s built into R and requires only minor data preprocessing operations before modeling. Let’s begin by loading the dataset and inspecting what it looks like.

Dataset Loading

Many of the metrics you’ll see today are built into various R packages, hence, we’ll need many imports at the start of the script. Here’s everything you’ll need – feel free to install any you might not have via the install.packages("<package-name>") command:
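The import list itself isn’t reproduced here. Judging by the functions used later in the article, a set like the following should cover it; treat it as an assumption rather than the author’s exact list:

```r
# Assumed package list, inferred from the functions called later in the article
library(titanic)    # titanic_train dataset
library(dplyr)      # select()
library(caret)      # createDataPartition(), confusionMatrix()
library(MLmetrics)  # AUC(), LogLoss()
library(mltools)    # mcc()
```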


The Titanic dataset is part of the titanic package, so we’re good to go. We’ll only use the training subset and split it later into two parts.

The following code snippet loads the dataset and prints the first couple of rows:

# Load the Titanic dataset
df <- titanic_train

# Show the first few rows
head(df)

Image 1 – Head of the Titanic dataset

It’s a good quality dataset but has some missing values and other formatting issues which a machine learning model won’t like. Let’s handle these next.

Dataset Preprocessing

The data preprocessing part for this dataset could be an extensive article in itself, but we’ll keep things lightweight today since this isn’t the main talking point. In this section, we’ll:

  • Drop unnecessary columns – Columns that carry no meaningful information (e.g., PassengerId), and columns that would take too much time and code to preprocess adequately (e.g., Name, Ticket, and Cabin).
  • Impute missing values – Median imputation for Age, and constant imputation for Embarked. Learn more about missing value imputation in R with our extensive guide.
  • Convert categorical variables to factors – R’s modeling functions then treat them as categories and create the dummy columns automatically, so you don’t have to.

If you prefer code over text, here’s the snippet for you:

# Drop unnecessary columns
df <- select(df, -c(PassengerId, Name, Ticket, Cabin))

# Missing value imputation: median for Age, constant "S" for Embarked
df$Age[is.na(df$Age)] <- median(df$Age, na.rm = TRUE)
# Embarked may store missing values as empty strings rather than NA, so catch both
df$Embarked[is.na(df$Embarked) | df$Embarked == ""] <- "S"

# Convert categorical variables to factors
df$Pclass <- factor(df$Pclass)
df$Sex <- factor(df$Sex)
df$Embarked <- factor(df$Embarked)

# Show the first few rows of the prepared data
head(df)

Image 2 – Head of the Titanic dataset after data preparation

The dataset is now much more condensed and should retain most of its predictive power.

Train/Test Split

The last step before training a machine learning model is to split the dataset into training and testing subsets. We’ll use the caret package for the task, and stick to the traditional 80:20 split:

# Split the data into training and test sets
index <- createDataPartition(df$Survived, p = 0.8, list = FALSE)
train <- df[index, ]
test <- df[-index, ]

Here’s how many rows are in each subset:

Image 3 – Train/test set dimensionality
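One caveat about the split above: it is random, so the rows in each subset, and every metric later on, will change from run to run unless the seed is fixed. The article doesn’t show a seed, so the following base R sketch is only an illustration on a made-up data frame (`toy`, `split_rows()`, and the seed value 42 are all hypothetical). It uses a plain random sample rather than caret’s stratified createDataPartition(); with caret, calling set.seed() right before createDataPartition() has the same effect.

```r
# Hypothetical stand-in for the preprocessed data frame
toy <- data.frame(id = 1:100, Survived = rep(c(0, 1), times = c(60, 40)))

# Draw a reproducible set of training row indices
split_rows <- function(n, p = 0.8, seed = 42) {
  set.seed(seed)  # fixes the random number stream
  sort(sample(seq_len(n), size = floor(p * n)))
}

index_a <- split_rows(nrow(toy))
index_b <- split_rows(nrow(toy))

train_toy <- toy[index_a, ]
test_toy  <- toy[-index_a, ]

identical(index_a, index_b)  # TRUE: same seed, same split
c(train = nrow(train_toy), test = nrow(test_toy))
```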

That’s it! Let’s train the model next.

Training a Classification Machine Learning Model

There are many classification algorithms you can choose from, but logistic regression is the one we’ll use today. It strikes a good balance between being easy to understand and offering good predictive performance.

As always in R, you can train a model by writing the model formula. In short, every dataset feature in the training set will be used to predict the Survived target variable:


# Use every remaining column as a predictor of Survived
model <- glm(Survived ~ ., data = train, family = "binomial")
summary(model)

Image 4 – Summary of a logistic regression model

It looks like passenger class, age, sex, and the number of siblings/spouses on board contribute the most to the model, as indicated by their extremely low p-values. The point of embarkation, on the other hand, shows no statistically significant effect on the target variable, as you could reasonably assume.

Up next, let’s make actual predictions on previously unseen data.

Calculating Prediction Probabilities and Classes

Some classification metrics require predicted classes (e.g., 0 or 1), while others require prediction probabilities (e.g., a 0.7891 chance of belonging to the positive class). For that reason, we’ll calculate both.

The probabilities are first. You can obtain them by calling the predict() function and passing in our model and the test set, along with type = "response":

predict_probs <- predict(model, newdata = test, type = "response")

Image 5 – Prediction probabilities

And now, if the predicted probability is 0.5 or higher, we’ll assign it a class of 1 (survived), or 0 otherwise (not survived):

predict_classes <- ifelse(predict_probs >= 0.5, 1, 0)

Image 6 – Predicted classes

That’s everything we need to start evaluating our classification model.

Machine Learning Evaluation Metrics for Classification – Theory, Math, and Code

We’ve tried our best to keep the previous section short and sweet, and now it’s time to dive into the good part. You’ll learn the best machine learning evaluation metrics for classification. Let’s start with the first one, which is a must-have for any machine learning project.

1. Confusion Matrix

You can think of the confusion matrix as a special type of table used to evaluate the performance of a classification model. In terms of binary classification, a confusion matrix is a 2×2 matrix that shows the number of true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN). These four values are used extensively when calculating other metrics, such as accuracy, precision, and recall.

Down below you’ll see the confusion matrix “formula”. Take this term lightly, since there’s no calculation involved. It’s just a cross-tabulation of actual vs. predicted values:

Image 7 – Confusion matrix “formula”

To implement the confusion matrix in R, refer to the snippet below. It uses predicted classes instead of probabilities:

CONFUSION_MATRIX <- confusionMatrix(factor(predict_classes), factor(test$Survived))

Image 8 – Confusion matrix results

Long story short, you want the numbers on the top-left to bottom-right diagonal to be as large as possible. On the other hand, the elements on a top-right to bottom-left diagonal should be minimal, or close to zero.

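While we’re here, let’s extract the values for TP, FP, TN, and FN. caret puts predictions in rows and actual values in columns, and by default it treats the first factor level (here 0, did not survive) as the positive class, so the indices below follow that convention: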

TP <- CONFUSION_MATRIX$table[1, 1]
FP <- CONFUSION_MATRIX$table[1, 2]
TN <- CONFUSION_MATRIX$table[2, 2]
FN <- CONFUSION_MATRIX$table[2, 1]

We’ll need these for the upcoming classification metrics.
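To see where each count sits in the table, here is a minimal base R sketch on made-up labels (the vectors `actual` and `predicted` are hypothetical and unrelated to the Titanic results). It indexes the table by label instead of position and treats class 1 as the positive class, which is an assumption of this sketch; caret’s confusionMatrix() uses the first factor level as the positive class unless you pass the positive argument.

```r
# Hypothetical toy labels, not the Titanic predictions
actual    <- c(1, 0, 1, 1, 0, 0, 1, 0)
predicted <- c(1, 0, 0, 1, 0, 1, 1, 0)

# Rows are predictions, columns are actual values (same layout as caret)
cm <- table(Predicted = factor(predicted, levels = c(0, 1)),
            Actual    = factor(actual,    levels = c(0, 1)))

# Index by label rather than position, with class "1" as the positive class
tp <- cm["1", "1"]  # predicted 1, actually 1
fp <- cm["1", "0"]  # predicted 1, actually 0
tn <- cm["0", "0"]  # predicted 0, actually 0
fn <- cm["0", "1"]  # predicted 0, actually 1

c(TP = tp, FP = fp, TN = tn, FN = fn)
```

Indexing by label keeps the extraction correct even if the factor levels are reordered.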

2. Accuracy

Accuracy measures the proportion of correct predictions out of the total number of predictions. It’s a widely used metric, as it reports the overall predictive performance of your model. But keep in mind that this metric is only informative when classes are reasonably balanced. For example, if 99% of records belong to one class, a model that always predicts that class reaches 99% accuracy without learning anything.

Anyhow, here’s the formula:

Image 9 – Accuracy formula
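Written out with the four confusion matrix counts, the standard textbook definition (assumed to match the formula image) is:

```latex
\[
\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}
\]
```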

Since we already have the values for TP, TN, FP, and FN, accuracy calculation in R is as easy as it can be:

ACCURACY <- (TP + TN) / (TP + FP + TN + FN)

Image 10 – Accuracy results

76% isn’t too bad for a couple of minutes of work in data preprocessing. But the classes aren’t perfectly balanced, so other classification metrics might be more relevant for our use case.

3. Precision

Precision measures the proportion of true positives (TP) out of all positive predictions made. It’s a useful metric when false positives are more costly than false negatives, for example when a positive medical diagnosis triggers an invasive treatment. If precision is high, the model is making few false positive predictions.

Here’s the formula:

Image 11 – Precision formula
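In the same notation as the confusion matrix counts, precision is commonly defined as follows (a reconstruction of the standard formula, not a copy of the image):

```latex
\[
\text{Precision} = \frac{TP}{TP + FP}
\]
```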

Let’s implement precision in R. Once again, the implementation is trivial since we already have all the values:

PRECISION <- TP / (TP + FP)

Image 12 – Precision results

We’re up to 0.8, which isn’t too bad. Let’s see what recall has to say about it.

4. Recall

Recall is the ratio of the number of true positives (TP) to the sum of true positives (TP) and false negatives (FN). This metric measures the percentage of all positive instances in the dataset that are correctly classified by the model. If the recall is high, it means that the model is making few false negative predictions.

Here’s the recall formula:

Image 13 – Recall formula
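The ratio described above translates directly into the usual formula:

```latex
\[
\text{Recall} = \frac{TP}{TP + FN}
\]
```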

Let’s implement it in R and check the score:

RECALL <- TP / (TP + FN)

Image 14 – Recall results

Recall is higher than precision, which means the model makes fewer false negatives than false positives.

5. F1-Score

Now you might be wondering, is there a way to strike the balance between precision and recall? That’s where F1 score comes in.

F1-score is the harmonic mean of precision and recall. It’s a useful metric when you care about both and don’t want a high value of one to hide a low value of the other. It ranges from 0 to 1, with higher values indicating better performance.

Here’s the formula you can use for the calculation:

Image 15 – F1-score formula
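As a harmonic mean, the F1-score is usually written in one of two equivalent forms; the second follows from substituting the definitions of precision and recall:

```latex
\[
F_1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}
    = \frac{2\,TP}{2\,TP + FP + FN}
\]
```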


Once again, R implementation is fairly straightforward:

# Harmonic mean of precision and recall, written in terms of the raw counts
F1 <- 2 * TP / (2 * TP + FP + FN)

Image 16 – F1-score results

Seems right. The harmonic mean always lands between the precision and recall values, closer to the smaller of the two. That makes F1 a sensible metric to optimize when false positives and false negatives are about equally costly and you don’t need to favor one over the other.
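A quick way to convince yourself of how F1 relates to precision and recall is to run the numbers on a small example. The counts below are hypothetical and chosen only for illustration; they are not the Titanic results.

```r
# Hypothetical counts, chosen only for illustration
tp <- 8
fp <- 2
fn <- 4

precision <- tp / (tp + fp)
recall    <- tp / (tp + fn)

# Harmonic mean of precision and recall
f1 <- 2 * precision * recall / (precision + recall)

# Same value computed directly from the counts
f1_counts <- 2 * tp / (2 * tp + fp + fn)

round(c(precision = precision, recall = recall, f1 = f1), 3)
```

Here F1 sits between precision and recall, and slightly below their arithmetic mean, because the harmonic mean leans toward the smaller value.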

6. AUC Score

AUC, or the Area Under the Receiver Operating Characteristic (ROC) curve, measures how well a binary classifier distinguishes between positive and negative classes. Traditionally, you would plot the ROC curve, which traces the true positive rate against the false positive rate across all classification thresholds, and the AUC measures the area under that curve. Higher AUC means better performance, and vice versa.

The formula includes integrals since we’re calculating the area under the curve:

Image 17 – ROC AUC formula
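One common way to write the area under the ROC curve, treating the true positive rate as a function of the false positive rate, is shown below. The image may use a different but equivalent notation.

```latex
\[
\text{AUC} = \int_{0}^{1} \text{TPR}\left(\text{FPR}\right)\, d\,\text{FPR}
\]
```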

Unlike other metrics, AUC needs prediction probabilities for calculation:

AUC_SCORE <- AUC(predict_probs, test$Survived)

Image 18 – ROC AUC results

AUC ranges from 0 to 1, so a score of 0.834 sounds good. For reference, a score of 1 would mean the model is perfectly capable of distinguishing between classes, which is almost never the case in practice. On the other end, the AUC score of 0.5 means the model is no better than a random guess. Overall, there’s still some room for improvement, but we’re far from an unusable model.
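AUC also has a handy probabilistic reading: it equals the chance that a randomly chosen positive example gets a higher score than a randomly chosen negative one. The base R sketch below computes it that way through ranks (the Mann-Whitney formulation) on made-up scores; `auc_rank()` and the toy vectors are hypothetical names for illustration, not part of MLmetrics.

```r
# Hypothetical toy scores, not the Titanic predictions
actual <- c(0, 0, 1, 0, 1, 1, 0, 1)
probs  <- c(0.10, 0.35, 0.40, 0.45, 0.60, 0.70, 0.80, 0.90)

auc_rank <- function(probs, actual) {
  r     <- rank(probs)          # average ranks handle tied scores
  n_pos <- sum(actual == 1)
  n_neg <- sum(actual == 0)
  # Share of (positive, negative) pairs where the positive is ranked higher
  (sum(r[actual == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

auc_rank(probs, actual)  # 0.75: 12 of the 16 pairs are ordered correctly
```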

7. Specificity

Specificity measures how well a model is able to correctly identify negative samples (TN) out of all negative samples in the dataset. In other words, it measures the proportion of actual negative cases that were correctly classified as negative by the model.

This metric is widely used in areas such as medical diagnosis. In this field, a low specificity indicates that the model is incorrectly identifying negative cases as positive, which leads to false alarms and unnecessary follow-up tests. Missed diagnoses, on the other hand, are captured by recall (sensitivity), not specificity.

The formula is once again as simple as it can be:

Image 19 – Specificity formula
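Expressed with the confusion matrix counts, the proportion of actual negatives classified as negative is:

```latex
\[
\text{Specificity} = \frac{TN}{TN + FP}
\]
```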

And so is the R implementation:

SPECIFICITY <- TN / (TN + FP)

Image 20 – Specificity results

A result of 0.625 isn’t something to brag about, and there’s definitely room for improvement.

8. Balanced Accuracy

Let’s take a step back and discuss accuracy once again. As we said previously, the vanilla accuracy metric isn’t the most representative when classes are imbalanced. That’s where balanced accuracy comes into play.

Balanced accuracy is the average of the true positive rate and the true negative rate, so each class contributes equally to the score no matter how many records it has.

Anyhow, here’s how to calculate it:

Image 21 – Balanced accuracy formula
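For the binary case, balanced accuracy is conventionally the mean of the true positive rate (recall) and the true negative rate (specificity):

```latex
\[
\text{Balanced Accuracy} = \frac{\text{TPR} + \text{TNR}}{2}
  = \frac{1}{2}\left(\frac{TP}{TP + FN} + \frac{TN}{TN + FP}\right)
\]
```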

R implementation requires us to calculate the true positive rate and the true negative rate first, and then average them:

TPR <- TP / (TP + FN)
TNR <- TN / (TN + FP)
BALANCED_ACCURACY <- (TPR + TNR) / 2

Image 22 – Balanced accuracy results

So, taking into account class imbalance, our model is only 73.4% accurate. There’s definitely room for improvement.

9. Matthews Correlation Coefficient (MCC)

Matthews Correlation Coefficient is a metric that takes into account all four counts: TP, TN, FP, and FN. It measures the correlation between the predicted and actual classes, and it only produces a high score when the model does well on both classes. That makes it particularly useful when the classes are imbalanced, which is mildly the case with the Titanic dataset.

MCC ranges from -1 to +1. A value of +1 indicates a perfect classification, 0 indicates a classification no better than random, and -1 indicates an entirely wrong classification.

Here’s the math formula for MCC:

Image 23 – Matthews correlation coefficient formula
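The standard definition of MCC in terms of the four confusion matrix counts is:

```latex
\[
\text{MCC} = \frac{TP \cdot TN - FP \cdot FN}
{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}
\]
```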

We don’t have to calculate it manually since MCC is built into the mltools R package:

MCC <- mltools::mcc(predict_classes, test$Survived)

Image 24 – Matthews correlation coefficient results

A score of 0.478 isn’t something to write home about, but it definitely proves our model is far from a random classification.
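If you’d like to see what happens under the hood, MCC is easy to compute by hand from the four counts. The function below is a minimal sketch, not the mltools implementation; `mcc_manual()` is a hypothetical name, the counts are made up, and returning 0 when the denominator is zero is a common convention that is assumed here.

```r
mcc_manual <- function(tp, fp, tn, fn) {
  # Work in doubles so large counts don't overflow integer arithmetic
  tp <- as.numeric(tp); fp <- as.numeric(fp)
  tn <- as.numeric(tn); fn <- as.numeric(fn)

  denom <- sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
  if (denom == 0) return(0)  # assumed convention for the degenerate case
  (tp * tn - fp * fn) / denom
}

# Hypothetical counts, not the Titanic results
mcc_manual(tp = 3, fp = 1, tn = 3, fn = 1)  # 0.5
```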

10. Logarithmic Loss

And finally, let’s discuss logarithmic loss, or log loss for short. It measures the performance of a probabilistic classifier by penalizing predictions in proportion to how confident and wrong they are. Log loss is commonly used in multiclass classification problems, but there’s no one stopping us from using it on a binary dataset.

Unlike the other metrics, log loss has no upper bound: it starts at 0 for a perfect model and grows without limit as confident predictions turn out wrong. A lower log loss score indicates better performance, but how low is low enough? It’s hard to answer when evaluating a single model, so use this metric to compare multiple models instead.

Here’s the log loss formula:

Image 25 – Logarithmic loss formula
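For binary classification with N records, true labels y_i in {0, 1}, and predicted probabilities p_i of the positive class, log loss is typically defined as shown below (the symbols N, y_i, and p_i are notation chosen here and may differ from the image):

```latex
\[
\text{LogLoss} = -\frac{1}{N} \sum_{i=1}^{N}
  \left[\, y_i \log(p_i) + (1 - y_i) \log(1 - p_i) \,\right]
\]
```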

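The function for calculating log loss in R comes with the MLmetrics package, so we don’t have to implement it manually. Log loss is defined on probabilities, so pass predict_probs rather than the hard class labels:

LOG_LOSS <- LogLoss(predict_probs, test$Survived)

Image 26 – Logarithmic loss results

The 8.15 shown in Image 26 was obtained by passing predict_classes instead. With hard 0/1 labels, every misclassified row receives the maximum clipped penalty of about 34.5, so the score is essentially the error rate rescaled (roughly 24% of rows times 34.5), and the probability-based value should come out much lower. Is the result good or bad? A useful reference point is ln(2) ≈ 0.693, the log loss of a model that always predicts a probability of 0.5 on a binary problem. Beyond that, it’s impossible to tell without training a couple more machine learning models and comparing the results. Do this as a homework assignment and report back which model yielded the lowest log loss value.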
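Log loss is meant to be computed on probabilities, and the difference from hard class labels is easy to demonstrate. The base R sketch below uses made-up values; `log_loss()` is a hypothetical helper that mirrors the usual definition with clipping at 1e-15 (an assumption of this sketch), and the toy vectors are unrelated to the Titanic results.

```r
# Hypothetical toy data, not the Titanic predictions
actual <- c(1, 0, 1, 1, 0)
probs  <- c(0.9, 0.2, 0.6, 0.4, 0.1)

log_loss <- function(probs, actual, eps = 1e-15) {
  # Clip probabilities away from 0 and 1 so log() stays finite
  p <- pmin(pmax(probs, eps), 1 - eps)
  -mean(actual * log(p) + (1 - actual) * log(1 - p))
}

# Log loss on probabilities: rewards calibrated confidence
loss_probs <- log_loss(probs, actual)

# Log loss on hard labels: each mistake gets the maximum clipped penalty
classes      <- ifelse(probs >= 0.5, 1, 0)
loss_classes <- log_loss(classes, actual)

# Baseline: always predicting 0.5 gives log(2)
baseline <- log_loss(rep(0.5, length(actual)), actual)

round(c(probs = loss_probs, classes = loss_classes, baseline = baseline), 3)
```

A single misclassified row out of five is enough to push the label-based score far above the baseline, while the probability-based score stays below it.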

Summing Up Machine Learning Evaluation Metrics for Classification

To recap, these 10 machine learning evaluation metrics for classification should be all you need 99% of the time. You’re likely to use only a few, such as the confusion matrix, and optimize the model for precision, recall, or overall accuracy.

That being said, it doesn’t hurt to know the other evaluation metrics you have at your disposal. We hope this article gave you a clear picture of how easy it is to evaluate machine learning models in R, and that you now understand these metrics on a deeper level.

Do you have a favorite classification evaluation metric? What do you prefer when classes are imbalanced? Make sure to let us know in the comment section below. Or even better – reach out on Twitter – @appsilon. We’d love to hear from you.

Deep Learning in R with… Keras? Train an MNIST digit classifier with TensorFlow’s high-level API.
