Recently Microsoft Data Scientist Bob Horton wrote a very nice article on ROC plots. We expand on this a bit and discuss some of the issues in computing “area under the curve” (AUC).
R has a number of ROC/AUC packages; for example ROCR, pROC, and plotROC. But it is instructive to see how ROC plots are produced and how AUC can be calculated. Bob Horton’s article showed how elegantly the points on the ROC plot are expressed in terms of sorting and cumulative summation.
The next step is computing AUC. Obviously computing area is a solved problem. The issue is how you deal with interpolating between points and the conventions of what to do with data that has identical scores. An elegant interpretation of the usual tie breaking rules is: for every point on the ROC curve we must have either all of the data above a given score threshold or none of the data above a given score threshold. This is the issue alluded to when the original article states:
“This brings up another limitation of this simple approach; by assuming that the rank order of the outcomes embodies predictive information from the model, it does not properly handle sequences of cases that all have the same score.”
This problem is quite easy to explain with an example. Consider the following data.
Using code adapted from the original article we can quickly get an interesting summary.
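As a minimal sketch of such a summary, assume a hypothetical data frame d with four rows, two tied scores at each level, and a single positive example (the column names pred and y are assumptions as well). Sorting by score and taking cumulative sums gives one candidate ROC point per row:

```r
# Hypothetical stand-in data: tied scores at each level, one positive.
d <- data.frame(pred = c(1, 1, 2, 2), y = c(FALSE, FALSE, TRUE, FALSE))

ord <- order(d$pred, decreasing = TRUE)  # sort by score, highest first
labels <- d$y[ord]
roc <- data.frame(
  TPR = cumsum(labels) / sum(labels),    # true positive rate so far
  FPR = cumsum(!labels) / sum(!labels),  # false positive rate so far
  labels = labels,
  pred = d$pred[ord]
)
print(roc)
##   TPR       FPR labels pred
## 1   1 0.0000000   TRUE    2
## 2   1 0.3333333  FALSE    2
## 3   1 0.6666667  FALSE    1
## 4   1 1.0000000  FALSE    1
```

Within each pair of tied scores, the order of the two rows is decided only by how the sort breaks ties.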
The problem is: we need to take all of the points with the same prediction score as an atomic unit (we take all of them or none of them). Notice also that TPR is always 1 (an undesirable effect).
We do not really want rows 1 and 3 in our plot or area calculations. In fact, the values in rows 1 and 3 are not fully determined, as they can vary depending on the details of tie breaking in the sorting (the values recorded in rows 2 and 4 cannot vary this way). Also (especially after deleting rows) we may need to add in the ideal points (FPR,TPR)=(0,0) and (FPR,TPR)=(1,1) to complete our plot and area calculations.
What we want is a plot where ties are handled. Such plots look like the following:
library('WVPlots') # see: https://github.com/WinVector/WVPlots
There is a fairly elegant way to get the necessary adjusted plotting frame: use differencing (the opposite of cumulative sums) to find where the pred column changes, and limit to those rows.
The code is as follows (also found in our sigr library):
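What follows is a sketch of the approach described above, not necessarily the exact sigr implementation. The function name calc_auc_sketch and the example data are hypothetical. It sorts by score, builds cumulative rates, uses diff() to keep only the last row of each run of equal scores, adds the ideal endpoints, and sums trapezoid areas.

```r
# Sketch: AUC with tied scores treated as one atomic unit.
# calc_auc_sketch is a hypothetical name, not the sigr function.
calc_auc_sketch <- function(pred, y) {
  ord <- order(pred, decreasing = TRUE)  # highest scores first
  pred <- pred[ord]
  y <- y[ord]
  fpr <- cumsum(!y) / max(1, sum(!y))    # x-axis
  tpr <- cumsum(y) / max(1, sum(y))      # y-axis
  # diff() is non-zero exactly where the score changes on the next row,
  # so this keeps only the last row of each run of equal scores
  keep <- c(diff(pred) != 0, TRUE)
  # add ideal endpoints; a repeated point contributes zero area
  x <- c(0, fpr[keep], 1)
  h <- c(0, tpr[keep], 1)
  n <- length(x)
  # trapezoid rule over consecutive points
  sum((h[-1] + h[-n]) / 2 * (x[-1] - x[-n]))
}

# Hypothetical data with tied scores (an assumption for illustration).
d <- data.frame(pred = c(1, 1, 2, 2), y = c(FALSE, FALSE, TRUE, FALSE))
print(calc_auc_sketch(d$pred, d$y))
## [1] 0.8333333
```

Collapsing each run of equal scores to its last row is what enforces the all-or-none rule. The repeated (1,1) endpoint only adds a zero-width segment, so it does not change the area.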
This correctly calculates the AUC.
library('sigr') # see: https://github.com/WinVector/sigr
##  0.8333333
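One way to sanity-check a tie-aware AUC is the rank-sum (Mann-Whitney) form. It counts tied positive/negative pairs as one half and agrees with the trapezoidal area under the tie-collapsed ROC curve. The sketch below uses only base R, and the data is a hypothetical four-row example with tied scores, not necessarily the data used in the article.

```r
# Hypothetical data with tied scores (an assumption for illustration).
pred <- c(1, 1, 2, 2)
y <- c(FALSE, FALSE, TRUE, FALSE)

# rank() assigns tied scores their average rank by default
r <- rank(pred)
n_pos <- sum(y)
n_neg <- sum(!y)

# Mann-Whitney form: fraction of (positive, negative) pairs ranked
# correctly, with tied pairs counted as one half
auc_rank <- (sum(r[y]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
print(auc_rank)
## [1] 0.8333333
```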
I think this extension maintains the spirit of the original. We have also shown how complexity increases as you move from code known to work on a particular data set at hand, to library code that may be exposed to data with unanticipated structures or degeneracies (this is why Quicksort, which has an elegant description, often has monstrous implementations; please see here for a rant on that topic).