Accuracy can be misleading, especially with class-imbalanced datasets. That’s why you should replace it with a more robust metric. Today you’ll learn three of them – and implement them from scratch.
Why accuracy sucks
Let’s say you’ve evaluated your models only with accuracy so far. You know that top-left and bottom-right values in a confusion matrix should be high, and the other two should be low.
But what do these numbers mean? What’s wrong with good ol’ accuracy?
Just imagine you’re trying to identify terrorists from face images. Let’s say that out of 1,000,000 people, 10 are terrorists. If you were to build a dummy model that classifies every image as non-terrorist, you would have a 99.999% accurate model!
Please don’t put “I’ve built SOTA terrorist classification models” on your resume just yet. Accuracy can be misleading.
The goal of your model should be to correctly classify terrorists every time. And it’s managed to do so exactly 0 times.
In other words – the recall value is zero. And your model sucks.
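To see the numbers behind that claim, here is a minimal sketch that recomputes accuracy and recall for the dummy model. The counts come straight from the example above; the variable names are just illustrative.

```python
# Numbers from the example above: 1,000,000 people, 10 of them terrorists.
n_people = 1_000_000
n_terrorists = 10

# The dummy model predicts "non-terrorist" (negative) for every image.
true_positives = 0                        # no terrorist is ever flagged
false_negatives = n_terrorists            # every terrorist is missed
true_negatives = n_people - n_terrorists  # everyone else is correctly cleared
false_positives = 0                       # nobody is wrongly flagged

accuracy = (true_positives + true_negatives) / n_people
recall = true_positives / (true_positives + false_negatives)

print(f"Accuracy: {accuracy:.5f}")  # share of all predictions that are correct
print(f"Recall:   {recall:.5f}")    # share of actual terrorists that were caught
```

Accuracy comes out at 0.99999 while recall is exactly 0, which is the whole problem in two lines of output.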
Confusion matrix crash course
Before diving deep into these metrics, let’s do a quick refresher on the confusion matrix. Here’s what it generally looks like:
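If you want to build one from scratch, the sketch below counts the four cells by hand. It assumes binary labels where 1 means positive and 0 means negative, and it lays the matrix out with actual classes as rows and predicted classes as columns, so the correct predictions sit in the top-left and bottom-right cells. The function name `confusion_counts` and the labels are made up for illustration; this is not the wine data.

```python
def confusion_counts(y_true, y_pred):
    """Count TN, FP, FN, TP. Assumes every label is either 0 or 1."""
    tn = fp = fn = tp = 0
    for actual, predicted in zip(y_true, y_pred):
        if actual == 1 and predicted == 1:
            tp += 1  # positive, classified as positive
        elif actual == 0 and predicted == 0:
            tn += 1  # negative, classified as negative
        elif actual == 0 and predicted == 1:
            fp += 1  # negative, classified as positive
        else:
            fn += 1  # positive, classified as negative
    return tn, fp, fn, tp


# Made-up labels, for illustration only.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

tn, fp, fn, tp = confusion_counts(y_true, y_pred)

# Rows are actual classes, columns are predicted classes.
matrix = [[tn, fp],
          [fn, tp]]
print(matrix)
```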
Let’s make it a bit less abstract. I’ve gone and trained a wine classifier model and obtained a confusion matrix. Here’s what it looks like:
Is this good? Who knows.
Accuracy is around 88%, but that doesn’t necessarily mean anything. That’s where precision, recall, and F-beta metrics come into play.
In the simplest terms, precision is the share of positive predictions that are actually correct. It is calculated as the number of true positives divided by the sum of true positives and false positives:
$$\text{Precision} = \frac{TP}{TP + FP}$$
Still a bit confusing? Continue reading.
You know what a true positive is – an instance that was actually positive, and the model classified it as positive (good wine classified as a good wine). But what are false positives? Put simply, an instance that’s negative but classified as positive (bad wine classified as good).
Here’s a more alarming example of a false positive: a patient doesn’t have cancer, but the doctor says he does.
Back to the wine example. You can calculate the precision score from the formula mentioned above. Here’s a complete walkthrough:
So, around 0.84. Both precision and recall range from 0 to 1 (higher is better), so this value seems to be pretty good.
In other words – your model doesn’t produce a lot of false positives.
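Implementing precision from scratch takes only a few lines. The sketch below assumes you already have the true positive and false positive counts; the function name `precision` and the counts are hypothetical and are not taken from the wine confusion matrix.

```python
def precision(tp, fp):
    """Share of positive predictions that are actually positive."""
    if tp + fp == 0:
        # The model made no positive predictions. Returning 0.0 here is
        # a convention chosen for this sketch, not a universal rule.
        return 0.0
    return tp / (tp + fp)


# Hypothetical counts: 30 good wines flagged correctly, 10 bad wines flagged as good.
example_precision = precision(tp=30, fp=10)
print(f"Precision: {example_precision:.2f}")
```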
You now know what precision is, but what the heck is recall? Let’s demystify that next.
Recall might be the most useful metric for many classification problems. It tells you what share of the actual positive instances the model correctly identified. It is calculated as the number of true positives divided by the sum of true positives and false negatives:
$$\text{Recall} = \frac{TP}{TP + FN}$$
If you’re even remotely like me, chances are you’ll find the above definition a bit abstract.
Here’s how to apply it to classifying wines: Out of all good wines, how many did you classify correctly?
This is where you need to know what false negatives are. A false negative is a positive instance classified as negative. Sure, it’s all fun and games when classifying wines, but what about a more serious scenario?
In our earlier medical example, a false negative means the following: a patient has cancer, but the doctor says he doesn’t.
As you can see, false negatives can sometimes be more costly than false positives. It’s essential to recognize which one is more important for your problem.
Back to the wine example. You can calculate the recall score from the formula mentioned above. Here’s a complete walkthrough:
Just like precision, recall also ranges between 0 and 1 (higher is better). 0.61 isn’t that great.
In other words – your model produces a fair number of false negatives.
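Recall is just as short to implement from scratch. As before, this is a sketch: the function name `recall` and the counts are hypothetical, not the values from the wine confusion matrix.

```python
def recall(tp, fn):
    """Share of actual positives that the model classified as positive."""
    if tp + fn == 0:
        # There are no actual positives in the data. Returning 0.0 here is
        # a convention chosen for this sketch, not a universal rule.
        return 0.0
    return tp / (tp + fn)


# Hypothetical counts: 30 good wines found, 20 good wines missed.
example_recall = recall(tp=30, fn=20)
print(f"Recall: {example_recall:.2f}")
```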
But what if you want both precision and recall to be somewhat decent? Then you’ll fall in love with the F-beta metric.
F-measures provide you with a balance between precision and recall. The default F-measure is the F1, which tries not to favor either of the two previously discussed metrics.
Here’s the formula for calculating the F1 score:
$$F_1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$
As you can see, to calculate F1 you need to know the values for precision and recall beforehand. Here’s the full calculation walkthrough for our example:
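The same calculation in code, using the rounded precision (0.84) and recall (0.61) reported earlier for the wine model. Because the inputs are rounded, treat the result as approximate.

```python
# Rounded values reported earlier in the article for the wine classifier.
wine_precision = 0.84
wine_recall = 0.61

# F1 is the harmonic mean of precision and recall.
f1 = 2 * (wine_precision * wine_recall) / (wine_precision + wine_recall)
print(f"F1: {f1:.2f}")
```

The score lands at roughly 0.71, between the two inputs but closer to the lower one, which is what a harmonic mean does.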
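But what’s the deal with the beta parameter?
During the F-score calculation, you can emphasize recall or precision by altering the beta parameter. Here’s what the more generalized formula for calculating F-scores looks like:
$$F_\beta = (1 + \beta^2) \cdot \frac{\text{Precision} \cdot \text{Recall}}{\beta^2 \cdot \text{Precision} + \text{Recall}}$$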
If beta is 1, then you’re calculating the F1 score and can simplify the formula to the one seen earlier in this section.
Here is the general rule of thumb for selecting the best value for the beta:
- Beta = 0.5 (F0.5-measure): You want a balance between precision and recall, with more weight on precision
- Beta = 1 (F1-measure): You want a pure balance between precision and recall
- Beta = 2 (F2-measure): You want a balance between precision and recall, with more weight on recall
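To simplify, you can calculate F0.5-measure with the following formula:
$$F_{0.5} = \frac{1.25 \cdot \text{Precision} \cdot \text{Recall}}{0.25 \cdot \text{Precision} + \text{Recall}}$$
And F2-measure with this one:
$$F_2 = \frac{5 \cdot \text{Precision} \cdot \text{Recall}}{4 \cdot \text{Precision} + \text{Recall}}$$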
These values for the beta aren’t set in stone, so feel free to experiment, depending on the problem you’re solving.
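To experiment with different betas, a from-scratch F-beta function could look like the sketch below. The function name `f_beta` is made up for illustration, and the inputs are the rounded precision and recall reported earlier for the wine model.

```python
def f_beta(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall.

    beta < 1 puts more weight on precision, beta > 1 on recall.
    """
    denominator = beta ** 2 * precision + recall
    if denominator == 0:
        # Both inputs are zero. Returning 0.0 is a convention for this sketch.
        return 0.0
    return (1 + beta ** 2) * precision * recall / denominator


# Rounded values reported earlier in the article for the wine classifier.
wine_precision, wine_recall = 0.84, 0.61

scores = {beta: f_beta(wine_precision, wine_recall, beta) for beta in (0.5, 1, 2)}
for beta, score in scores.items():
    print(f"F{beta}: {score:.2f}")
```

Since precision is higher than recall here, the score drops as beta grows: the more weight you put on recall, the more the weaker metric drags the result down.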
In a nutshell – accuracy can be misleading. Be careful when using it. If predicting positives and negatives is equally important, and both classes are balanced equally, accuracy can still be useful.
That’s not the case most of the time. Take your time to study the dataset and the problem, and decide what’s more important to you – fewer false positives or fewer false negatives.
Metric selection is a breeze from that point.
Which metric(s) do you use for classification problems? Let me know in the comment section.
The post Top 3 Classification Machine Learning Metrics – Ditch Accuracy Once and For All appeared first on Better Data Science.