Hello, today I’m going to show you the difference between two common performance measures (useful not only for Machine Learning purposes, but in every scientific field). So far, I have come across accuracy values more often than F scores when the performance of a method is reported, for methods that range from metaheuristics (Genetic Algorithm fitness functions) to promoter recognition programs, diagnostic methods and so on.
But I would really recommend avoiding the accuracy measure. The reason is shown below with a nice example in the R programming language (all the functions used in the simulation are included, you can download them by clicking here).
Case study 1:
Imagine that you are working on a Computer Vision project and your task is to “teach” a program to distinguish between electric guitars and acoustic guitars by showing it pictures of different guitars.
Suppose that you’ve already developed that program and now you want to measure the performance of this Boolean classifier (that is, you show the program a picture of an electric guitar, and the program has to decide whether to “classify” it as an electric or as an acoustic guitar).
For the purposes of this post, let’s write down some useful concepts.
Consider the following:
TP: a true positive is when the program classifies an electric guitar as an electric guitar. We will use the letter “E” to denote the electric guitar “class”, which is the positive class here
FP: a false positive is when the program classifies an acoustic guitar as an electric guitar. We will use the letter “A” to denote the acoustic guitar “class”
FN: a false negative is when the program classifies an electric guitar as an acoustic guitar
TN: a true negative is when the program classifies an acoustic guitar as an acoustic guitar
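The four definitions above can be sketched in R as follows. This is only a minimal illustration, not the code used in the simulation: the helper name `confusion_counts` and the six toy labels are made up for this example, and “E” is assumed to be the positive class.

```r
# Hypothetical helper: count TP, FP, FN and TN from two label vectors.
# Assumption: "E" (electric) is the positive class, "A" (acoustic) the negative one.
confusion_counts <- function(truth, predicted, positive = "E") {
  stopifnot(length(truth) == length(predicted))
  c(
    TP = sum(truth == positive & predicted == positive),  # electric labeled electric
    FP = sum(truth != positive & predicted == positive),  # acoustic labeled electric
    FN = sum(truth == positive & predicted != positive),  # electric labeled acoustic
    TN = sum(truth != positive & predicted != positive)   # acoustic labeled acoustic
  )
}

# Toy data: six pictures, their real class and the class the program chose.
truth     <- c("E", "E", "E", "A", "A", "A")
predicted <- c("E", "A", "E", "A", "E", "A")

counts <- confusion_counts(truth, predicted)
print(counts)
```

With these toy labels the helper counts two true positives, one false positive, one false negative and two true negatives.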
Now that we are ready, let’s begin with the calculations.
In R, I have simulated the results of the program for 1,000 electric guitar pictures and 1,000 acoustic guitar pictures.
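The original simulation functions are not reproduced here, so the following is only a sketch of how such a simulation could look. The two rates (`p_detect_electric`, `p_false_alarm`), the seed and the column names `PRED.E` and `PRED.A` are assumptions picked for illustration, so the counts it prints will not match the ones in this post.

```r
# Sketch of a simulated classifier. The per-class rates below are assumptions
# chosen for illustration, not values taken from the original simulation.
set.seed(42)

n_electric <- 1000
n_acoustic <- 1000
p_detect_electric <- 0.49  # assumed chance that an electric guitar is labeled "E"
p_false_alarm     <- 0.01  # assumed chance that an acoustic guitar is labeled "E"

# Draw how many pictures of each class end up labeled "E".
tp <- rbinom(1, n_electric, p_detect_electric)
fp <- rbinom(1, n_acoustic, p_false_alarm)

# Rows are the real classes, columns are the labels given by the program.
confusion <- matrix(
  c(tp, n_electric - tp,
    fp, n_acoustic - fp),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("TRUE.E", "TRUE.A"), c("PRED.E", "PRED.A"))
)
print(confusion)
```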
The program produced the following results (rows are the real classes, columns are the labels given by the program):
         labeled E   labeled A
TRUE.E         485         515
TRUE.A           9         991
Notice that, of the 1,000 electric guitar pictures, only 485 were labeled as electric (TP=485) and the rest were labeled as acoustic (FN=515). I feel bad for the hypothetical programmer of this hypothetical example.
On the other hand, of the 1,000 acoustic guitars, 991 were labeled as acoustic (TN=991) and only 9 of them were labeled as electric (FP=9). Well, not bad! Or is it?
The accuracy of this program, that is, the fraction of all pictures that were labeled correctly, is 0.738
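As a quick check, the accuracy can be recomputed from the four counts of case study 1. This is a small sketch using the usual definition of accuracy, (TP + TN) divided by the total number of pictures; the variable names are just for this example.

```r
# Counts from case study 1.
TP <- 485; FN <- 515
FP <- 9;   TN <- 991

# Accuracy: correctly labeled pictures over all pictures.
accuracy <- (TP + TN) / (TP + TN + FP + FN)
print(accuracy)  # 0.738
```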
To compute the F score, it is necessary to compute the precision and the recall first, where:
precision = 0.9817814 and recall = 0.485
Then, the F score is equal to 0.6492637
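The three values above can be reproduced with a few lines of R. The sketch assumes that the F score in this post is the balanced F score (often called F1), the harmonic mean of precision and recall, which does match the numbers reported here.

```r
# Counts from case study 1 (TN is not needed for precision, recall or F).
TP <- 485; FN <- 515; FP <- 9

# Precision: of the pictures labeled "E", how many really are electric.
precision <- TP / (TP + FP)
# Recall: of the real electric guitars, how many were labeled "E".
recall <- TP / (TP + FN)
# Balanced F score: harmonic mean of precision and recall.
f_score <- 2 * precision * recall / (precision + recall)

print(c(precision = precision, recall = recall, F = f_score))
```

Note that the true negatives never enter these formulas, which is why a large number of correctly labeled acoustic guitars cannot inflate the F score.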
Well, the F score seems to be more “strict”, and in fact it is stricter than the accuracy measure. But this example is not very “cool”. Let’s move on to case study 2.
Case study 2:
Now we have 1,000 electric guitar pictures and 100,000 acoustic guitar pictures. The confusion matrix of the results is (rows are the real classes, columns are the labels given by the program):
         labeled E   labeled A
TRUE.E         493         507
TRUE.A        1017       98983
Notice that, of the 1,000 electric guitar pictures, only 493 were labeled as electric (TP=493) and the rest were labeled as acoustic (FN=507).
On the other hand, of the 100,000 acoustic guitars, 98,983 were labeled as acoustic (TN=98983) and only 1,017 of them were labeled as electric (FP=1017).
Now (cha cha chan!), the performance values are:
Accuracy: 0.9849109
F score: 0.3928287
Now do you see it? How is it possible that, while missing almost 50% of the electric guitars, the program reaches an accuracy of almost 0.99, even though neither its precision nor its recall is greater than 0.50? The 98,983 correctly labeled acoustic guitars dominate the accuracy and hide the errors made on the much smaller electric class. So we have a winner, and it is the F score.
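To see the effect of the class imbalance side by side, here is a small sketch that computes both measures for the two case studies. The helper `performance` is hypothetical, and the third row is an extra illustration that is not part of the original simulation: a “classifier” that simply labels every picture of case study 2 as acoustic.

```r
# Hypothetical helper: accuracy and F score from the four counts.
# F is written as 2TP / (2TP + FP + FN), which equals the harmonic mean of
# precision and recall and stays defined when TP is 0.
performance <- function(TP, FN, FP, TN) {
  c(
    accuracy = (TP + TN) / (TP + TN + FP + FN),
    F        = 2 * TP / (2 * TP + FP + FN)
  )
}

case1 <- performance(TP = 485, FN = 515, FP = 9,    TN = 991)
case2 <- performance(TP = 493, FN = 507, FP = 1017, TN = 98983)

# Illustration only: label all 101,000 pictures of case study 2 as acoustic.
always_acoustic <- performance(TP = 0, FN = 1000, FP = 0, TN = 100000)

results <- round(rbind(case1, case2, always_acoustic), 4)
print(results)
```

The last row makes the point of the post in its most extreme form: a program that never recognizes a single electric guitar would still get an accuracy of about 0.99 on this data set, while its F score would be 0.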
For references, visit the following pages: