This week, the Bay Area useR Group (BARUG) held a mini-conference focused on ROC curves. Talks covered the history of the ROC, extensions of ROC analysis to multiclass problems, various ways to think about and interpret ROC curves, and how to translate concrete business goals into the ROC framework and pick the optimal threshold for a given problem.
I introduced the session with a very brief eclectic “history” of the ROC anchored on a few key papers that seem to me to represent inflection points in its development and adoption.
Anecdotal accounts of the early ROC, such as this brief mention in Deranged Physiology, make it clear that “Receiver Operating Characteristic” referred to the ability of a radar technician, sitting at a receiver, to look at a blip on the screen and distinguish an aircraft from background noise. The DoD report written by Peterson and Birdsall in 1953 shows that the underlying mathematical theory, and many of the statistical characteristics of the ROC, had already been worked out by that time. Thereafter (see the references below), the ROC became a popular tool in Psychology, Medicine and many other disciplines seeking to make optimal decisions based on the ability to detect signals.
Jumping to “modern times”: in his 1996 paper, Bradley argues for the ROC to replace overall accuracy as the single best measure of classifier performance. Given the prevalent use of ROC curves today, it is interesting to contemplate a time when that was not so. Finally, the landmark 2009 paper by David Hand indicates that soon after the adoption of the ROC, researchers were already noticing problems with using the area under the curve (AUC) to compare the performance of classifiers whose ROC curves cross. Additionally, Hand observes that:
(The AUC) is fundamentally incoherent in terms of misclassification costs: the AUC uses different misclassification cost distributions for different classifiers. …
Hand goes on to propose the H measure as an alternative to the AUC.
In his talk, ROC Curves extended to multiclass classification, and how they do or do not map to the binary case (slides here), Mario Inchiosa discusses extensions of the ROC curve to multiclass classification and why these extensions don’t all apply to the binary case. He distinguishes between multiclass and multilabel classification and discusses the pros and cons of different averaging techniques in the multiclass One vs. Rest scenario. He also points (see references below) to both R and scikit-learn packages useful in this kind of analysis.
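As a minimal sketch of the One vs. Rest averaging choices Inchiosa discusses, the following uses scikit-learn's `roc_auc_score` on made-up three-class data (the data and model here are illustrative assumptions, not from the talk). Each class's ROC is computed against the union of the other classes, and the per-class AUCs are then combined by either macro (unweighted) or prevalence-weighted averaging.

```python
# Sketch: macro vs. weighted averaging of One-vs-Rest AUC on toy 3-class data.
# The synthetic data and logistic regression model are illustrative assumptions.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=600, n_classes=3, n_informative=6,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
probs = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_te)

# One vs. Rest: per-class AUCs combined by the chosen averaging scheme.
auc_macro = roc_auc_score(y_te, probs, multi_class="ovr", average="macro")
auc_weighted = roc_auc_score(y_te, probs, multi_class="ovr", average="weighted")
print(auc_macro, auc_weighted)
```

When the classes are balanced the two averages coincide; with imbalanced classes, weighted averaging lets the majority classes dominate, which is one of the trade-offs the talk covers.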
Interpreting the ROC
In his highly original talk, Six Ways to Think About ROC Curves (slides here), Robert Horton challenges you to see the ROC curve from multiple perspectives. Even if you have been working with ROC curves for some time, you are likely to learn something new here. The “Turtle’s Eye” view is eye-opening for many.
- The discrete “Turtle’s Eye” view, where labeled cases are sorted by score, and the path of the curve is determined by the order of positive and negative cases.
- The categorical view, where tied scores must be handled, or where scores place cases into sortable buckets.
- The continuous view, where the cumulative distribution function (CDF) for the positive cases is plotted against the CDF for the negative cases.
- The limit view, where the ROC curve is the limit of the cumulative gain curve (or “Total Operating Characteristic” curve) as the prevalence of positive cases goes to zero.
- The probabilistic view, where AUC is the probability that a randomly chosen positive case will have a higher score than a randomly chosen negative case.
- The hypothesis-test view, where the ROC curve emerges from a graphical interpretation of the Mann-Whitney-Wilcoxon U test statistic, illustrating how the AUC relates to this commonly used non-parametric test.
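The probabilistic and U-statistic views above can be checked numerically. This sketch (the normal score distributions are made-up assumptions) computes the fraction of (positive, negative) pairs in which the positive case outscores the negative one — which is the Mann-Whitney U statistic divided by n_pos × n_neg — and confirms it matches the empirical AUC.

```python
# Sketch: AUC as the probability that a random positive case outscores a
# random negative case. The normal score distributions are illustrative.
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
pos = rng.normal(1.0, 1.0, 200)  # scores of positive cases
neg = rng.normal(0.0, 1.0, 300)  # scores of negative cases

# Pairwise comparisons, with ties counting half (as in the U statistic):
pairs = ((pos[:, None] > neg[None, :]).mean()
         + 0.5 * (pos[:, None] == neg[None, :]).mean())

# The area under the empirical ROC curve gives the same number.
y = np.concatenate([np.ones_like(pos), np.zeros_like(neg)])
auc = roc_auc_score(y, np.concatenate([pos, neg]))
print(pairs, auc)
```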
Picking the Optimal Utility Threshold
John Mount closed out the evening with his talk, How to Pick an Optimal Utility Threshold Using the ROC Plot (slides here), presenting original work on how to translate concrete business goals into the ROC framework and then use the ROC plot to pick the optimal classification threshold for a given problem. John emphasizes the advantages of working with parametric representations of ROC curves and the importance of discovering utility requirements through iterated negotiation. All of this flows from John’s original and insightful definition of an ROC plot.
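To make the idea of a utility-maximizing threshold concrete, here is a generic sketch — not Mount's parametric construction, and with entirely made-up payoff numbers and scores — that assigns a business payoff to each outcome (true/false positive/negative) and scans candidate thresholds for the one with the highest expected utility per case.

```python
# Sketch: pick the classification threshold that maximizes expected utility.
# Payoffs and score distributions are illustrative assumptions, and this is a
# generic brute-force scan, not Mount's parametric ROC method.
import numpy as np

rng = np.random.default_rng(1)
scores = np.concatenate([rng.normal(1, 1, 300), rng.normal(0, 1, 700)])
labels = np.concatenate([np.ones(300), np.zeros(700)])

# Business payoffs per case: true positive, false positive, false negative.
u_tp, u_fp, u_tn, u_fn = 100.0, -20.0, 0.0, -50.0

def expected_utility(t):
    pred = scores >= t
    tp = np.sum(pred & (labels == 1))
    fp = np.sum(pred & (labels == 0))
    fn = np.sum(~pred & (labels == 1))
    tn = np.sum(~pred & (labels == 0))
    return (u_tp * tp + u_fp * fp + u_tn * tn + u_fn * fn) / len(scores)

thresholds = np.unique(scores)
utilities = [expected_utility(t) for t in thresholds]
best = thresholds[int(np.argmax(utilities))]
print(best, max(utilities))
```

Each candidate threshold corresponds to a point on the ROC curve, so the scan amounts to walking along the curve and scoring each operating point against the payoff matrix — the negotiation Mount describes is over those payoff numbers.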
Finally, the Zoom video covering the talks by Inchiosa, Horton and Mount is well worth watching.
Horton Talk References
- Fawcett (2006) An Introduction to ROC Analysis
- Berrizbeitia Receiver Operating Characteristic (ROC) Curves – Shiny App
- Kanchanaraksa (2008) Evaluation of Diagnostic and Screening Tests: Validity and Reliability
- Kruchten (2016) ML Meets Economics
- Mount and Zumel The Win-Vector blog
Inchiosa Talk References
- Multiclass Classification
- roc auc score
- roc metrics
- plot roc
- Hand and Till (2001) – reference for the One vs. One approach
- HandTill2001 package for Hand & Till’s “M” measure that extends AUC to multiclass using One vs. One
Rickert Talk References
- Bradley (1996) The Use of the Area Under the ROC Curve in the Evaluation of Machine Learning Algorithms – recommends ROC replace overall accuracy as a single measure of classifier performance
- Deranged Physiology ROC characteristic of radar operator
- Hajian-Tilaki (2013) Receiver Operating Characteristic (ROC) Curve Analysis for Medical Diagnostic Test Evaluation
- Hand (2009) Measuring classifier performance: a coherent alternative to the area under the ROC curve
- McClelland (2011) Use of Signal Detection Theory as a Tool for Enhancing Performance and Evaluating Tradecraft in Intelligence Analysis
- Lusted (1984) Editorial on medical uses of ROC
- Pelli and Farell (1995) Psychophysical Methods
- Peterson and Birdsall (1953) DoD Report on The Theory of Signal Detectability – Early paper referencing ROC
- Woodward (1953) Probability and Information Theory, with Applications to Radar – early book mentioning ROC
- hmeasure The H-Measure and Other Scalar Classification Performance Metrics