# ROC curves and classification

September 30, 2013
(This article was first published on Freakonometrics » R-english, and kindly contributed to R-bloggers)

To get back to a question asked after the last course (still on non-life insurance), I will spend some time discussing the construction and interpretation of ROC curves. Consider the dataset we used last week,

> db = read.table("http://freakonometrics.free.fr/db.txt",header=TRUE,sep=";")
> attach(db)

The first step is to get a model. For instance, a logistic regression, where some factors were merged together,

> X3bis=rep(NA,length(X3))
> X3bis[X3%in%c("A","C","D")]="ACD"
> X3bis[X3%in%c("B","E")]="BE"
> db$X3bis=as.factor(X3bis)
> reg=glm(Y~X1+X2+X3bis,family=binomial,data=db)

From this model, we can predict a probability, not a $\{0,1\}$ variable,

> S=predict(reg,type="response")
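
As a minimal self-contained illustration (on simulated data, so the names `fit`, `x`, `y` are just for the sketch), `type="response"` returns fitted probabilities strictly between 0 and 1, rather than 0/1 labels:

```r
# Toy logistic regression: predict(..., type="response") gives probabilities
set.seed(123)
x <- rnorm(50)
y <- rbinom(50, 1, plogis(x))          # simulated 0/1 outcomes
fit <- glm(y ~ x, family = binomial)
p <- predict(fit, type = "response")   # fitted probabilities, in (0,1)
range(p)
```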

Let $\widehat{S}$ denote this variable (we can use either the score or the predicted probability; it will not change the construction of the ROC curve). What if we really want to predict a $\{0,1\}$ variable, as we usually do in decision theory? The idea is to consider a threshold $s$, so that

• if $\widehat{S} > s$, then $\widehat{Y}$ will be $1$, or “positive” (using a standard terminology)
• if $\widehat{S} \leq s$, then $\widehat{Y}$ will be $0$, or “negative”

Then we derive a contingency table, or a confusion matrix

|                      | observed “positive” | observed “negative” |
|----------------------|---------------------|---------------------|
| predicted “positive” | TP                  | FP                  |
| predicted “negative” | FN                  | TN                  |

where TP are the so-called true positives, TN the true negatives, FP the false positives (type I errors) and FN the false negatives (type II errors). We can get that contingency table for a given threshold $s$,

> roc.curve=function(s,print=FALSE){
+ Ps=(S>s)*1
+ FP=sum((Ps==1)*(Y==0))/sum(Y==0)
+ TP=sum((Ps==1)*(Y==1))/sum(Y==1)
+ if(print==TRUE){
+ print(table(Observed=Y,Predicted=Ps))
+ }
+ vect=c(FP,TP)
+ names(vect)=c("FPR","TPR")
+ return(vect)
+ }
> threshold = 0.5
> roc.curve(threshold,print=TRUE)
        Predicted
Observed   0   1
       0   5 231
       1  19 745
      FPR       TPR 
0.9788136 0.9751309 

Here, we also computed the false positive rate and the true positive rate,

• TPR = TP / P = TP / (TP + FN), also called sensitivity: the probability of being predicted positive, given that someone is positive (true positive rate)
• FPR = FP / N = FP / (FP + TN): the probability of being predicted positive, given that someone is negative (false positive rate)
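As a sanity check, the two rates printed above can be recovered directly from the counts in the confusion matrix:

```r
# Counts read off the confusion matrix above (threshold 0.5)
TN <- 5;  FP <- 231   # observed 0
FN <- 19; TP <- 745   # observed 1
FPR <- FP / (FP + TN)   # 231/236
TPR <- TP / (TP + FN)   # 745/764
round(c(FPR = FPR, TPR = TPR), 7)   # 0.9788136 0.9751309
```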

The ROC curve is then obtained by using several values for the threshold. For convenience, define

> ROC.curve=Vectorize(roc.curve)
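
To see what `Vectorize` buys us, here is a small self-contained sketch on simulated scores (the function `roc.point` mirrors the body of `roc.curve` above; the names are illustrative): the vectorized version maps a whole vector of thresholds to a 2-row matrix of (FPR, TPR) pairs, one column per threshold.

```r
set.seed(1)
Y <- rbinom(100, 1, 0.4)              # simulated 0/1 outcomes
S <- plogis(2 * Y - 1 + rnorm(100))   # simulated scores, higher when Y=1
roc.point <- function(s){
  Ps <- (S > s) * 1
  c(FPR = sum(Ps == 1 & Y == 0) / sum(Y == 0),
    TPR = sum(Ps == 1 & Y == 1) / sum(Y == 1))
}
ROC.points <- Vectorize(roc.point)
M <- ROC.points(seq(0, 1, by = .25))
dim(M)   # 2 rows (FPR, TPR), 5 columns, one per threshold
```

Both rates are non-increasing in the threshold, which is why the ROC curve traces from (1,1) down to (0,0) as $s$ goes from 0 to 1.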

First, we can plot $(\widehat{S}_i,Y_i)$ (a standard predicted versus observed graph), and visualize true and false positive and negative, using simple colors

> I=(((S>threshold)&(Y==0))|((S<=threshold)&(Y==1)))
> plot(S,Y,col=c("red","blue")[I+1],pch=19,cex=.7,xlab="",ylab="")
> abline(v=threshold,col="gray")

And for the ROC curve, simply use

> M.ROC=ROC.curve(seq(0,1,by=.01))
> plot(M.ROC[1,],M.ROC[2,],col="grey",lwd=2,type="l")
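
A natural one-number summary of the curve is the area under it (AUC), which is not computed in the original code. A minimal sketch using the trapezoid rule on (FPR, TPR) points — the helper name `auc_trapezoid` and the toy points are illustrative; on the real curve one would pass `M.ROC[1,]` and `M.ROC[2,]`:

```r
# Trapezoid-rule AUC from (FPR, TPR) points (illustrative helper)
auc_trapezoid <- function(fpr, tpr){
  o <- order(fpr)   # integrate along increasing FPR
  sum(diff(fpr[o]) * (head(tpr[o], -1) + tail(tpr[o], -1)) / 2)
}
auc_trapezoid(c(0, .2, .6, 1), c(0, .7, .9, 1))   # 0.77
```

A model no better than chance gives the diagonal, with AUC 0.5; a perfect classifier gives AUC 1.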

This is the ROC curve. Now, to see why it can be interesting, we need a second model. Consider for instance a classification tree

> library(tree)
> ctr <- tree(Y~X1+X2+X3bis,data=db)
> plot(ctr)
> text(ctr)

To plot the ROC curve, we just need to use the prediction obtained using this second model,

> S=predict(ctr)

All the code described above can be used. Again, we can plot $(\widehat{S}_i,Y_i)$ (observe that there are only 5 possible values for $\widehat{S}_i$, which makes sense since our tree has 5 leaves). Then, we can plot the ROC curve,

An interesting idea is to plot the two ROC curves on the same graph, in order to compare the two models. Since S now contains the tree predictions, we first recompute the curve,

> M.ROC.tree=ROC.curve(seq(0,1,by=.01))
> plot(M.ROC[1,],M.ROC[2,],type="l")
> lines(M.ROC.tree[1,],M.ROC.tree[2,],type="l",col="grey",lwd=2)

The most difficult part is getting a proper interpretation. The tree does not predict well in the lower part of the curve, which concerns people with a very high predicted probability. If our interest is more in those with a probability lower than 90%, then we have to admit that the tree is doing a good job, since its ROC curve is always higher there, compared with the logistic regression.
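
A complementary way to compare the two models numerically is through the AUC, which equals the probability that a randomly chosen positive receives a higher score than a randomly chosen negative. A self-contained sketch on simulated data (the names here are illustrative, not from the dataset above):

```r
# AUC as P(score of random positive > score of random negative), ties count 1/2
set.seed(42)
Y <- rbinom(200, 1, 0.5)
S <- Y + rnorm(200)                 # informative scores
pos <- S[Y == 1]
neg <- S[Y == 0]
auc <- mean(outer(pos, neg, ">") + 0.5 * outer(pos, neg, "=="))
auc   # well above 0.5 for an informative score
```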