ROC curves and classification

[This article was first published on Freakonometrics » R-english, and kindly contributed to R-bloggers.]

To get back to a question asked after the last course (still on non-life insurance), I will spend some time discussing the construction and interpretation of ROC curves. Consider the dataset we used last week,

> db = read.table("http://freakonometrics.free.fr/db.txt",header=TRUE,sep=";")
> attach(db)

The first step is to get a model. For instance, a logistic regression, where some factors were merged together,

> X3bis=rep(NA,length(X3))
> X3bis[X3%in%c("A","C","D")]="ACD"
> X3bis[X3%in%c("B","E")]="BE"
> db$X3bis=as.factor(X3bis)
> reg=glm(Y~X1+X2+X3bis,family=binomial,data=db)
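As an aside, the same merging of levels can be done in one step through the levels attribute; a sketch on a small made-up factor (not the db data):

```r
# hypothetical factor with levels A..E, merged the same way as X3bis above
x <- factor(c("A", "B", "C", "D", "E", "B"))
levels(x) <- list(ACD = c("A", "C", "D"), BE = c("B", "E"))
x
# [1] ACD BE  ACD ACD BE  BE
# Levels: ACD BE
```

Assigning a named list to levels() maps each old level to its new group, which avoids building the vector of NAs by hand.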

From this model, we can predict a probability, not a 0/1 variable,

> S=predict(reg,type="response")

Let $\widehat{S}$ denote this variable (actually, we can use the score, or the predicted probability, it will not change the construction of our ROC curve). What if we really want to predict a 0/1 variable, as we usually do in decision theory? The idea is to consider a threshold $s$, so that

  • if $\widehat{S} > s$, then $\widehat{Y}$ will be $1$, or “positive” (using a standard terminology)
  • if $\widehat{S} \leq s$, then $\widehat{Y}$ will be $0$, or “negative”
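As a small self-contained sketch (with made-up scores, not the db data), this thresholding rule is one line of R:

```r
# made-up scores (not the db data) and a threshold s
S.demo <- c(0.10, 0.45, 0.62, 0.80, 0.95)
s <- 0.5
Yhat <- (S.demo > s) * 1   # 1 = "positive", 0 = "negative"
Yhat
# [1] 0 0 1 1 1
```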

Then we derive a contingency table, or a confusion matrix

                                observed value $Y$
 predicted value $\widehat{Y}$   “positive”   “negative”
 “positive”                          TP            FP
 “negative”                          FN            TN

where TP are the so-called true positives, TN the true negatives, FP the false positives (type I errors) and FN the false negatives (type II errors). We can get that contingency table for a given threshold $s$,

> roc.curve=function(s,print=FALSE){
+ Ps=(S>s)*1
+ FP=sum((Ps==1)*(Y==0))/sum(Y==0)
+ TP=sum((Ps==1)*(Y==1))/sum(Y==1)
+ if(print==TRUE){
+ print(table(Observed=Y,Predicted=Ps))
+ }
+ vect=c(FP,TP)
+ names(vect)=c("FPR","TPR")
+ return(vect)
+ }
> threshold = 0.5
> roc.curve(threshold,print=TRUE)
        Predicted
Observed   0   1
       0   5 231
       1  19 745
      FPR       TPR 
0.9788136 0.9751309

Here, we also compute the false positive rates, and the true positive rates,

  • TPR = TP / P = TP / (TP + FN), also called sensitivity, is the true positive rate: the probability of being predicted positive, given that someone is positive
  • FPR = FP / N = FP / (FP + TN) is the false positive rate: the probability of being predicted positive, given that someone is negative
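Plugging the counts from the contingency table above (TP = 745, FN = 19, FP = 231, TN = 5, at threshold 0.5) into these formulas recovers the printed rates:

```r
# counts read off the confusion matrix at threshold 0.5
TP <- 745; FN <- 19; FP <- 231; TN <- 5
TPR <- TP / (TP + FN)   # sensitivity (true positive rate)
FPR <- FP / (FP + TN)   # false positive rate
round(c(FPR = FPR, TPR = TPR), 7)
#       FPR       TPR
# 0.9788136 0.9751309
```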

The ROC curve is then obtained using several values for the threshold. For convenience, define

> ROC.curve=Vectorize(roc.curve)
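As an aside, roc.curve reads the score S and the response Y from the global environment; a variant that takes them as explicit arguments is a bit safer to reuse. A sketch (roc.point is a hypothetical name, shown on made-up data rather than db):

```r
# same computation as roc.curve above, without global variables
roc.point <- function(s, score, y){
  Ps <- (score > s) * 1
  c(FPR = sum(Ps == 1 & y == 0) / sum(y == 0),
    TPR = sum(Ps == 1 & y == 1) / sum(y == 1))
}
# tiny synthetic check: one of the two negatives is above the threshold
roc.point(0.5, score = c(0.2, 0.6, 0.7, 0.9), y = c(0, 0, 1, 1))
# FPR TPR
# 0.5 1.0
```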

First, we can plot $(\widehat{S}_i,Y_i)$ (a standard predicted versus observed graph), and visualize true and false positives and negatives, using simple colors,

> I=(((S>threshold)&(Y==0))|((S<=threshold)&(Y==1)))
> plot(S,Y,col=c("red","blue")[I+1],pch=19,cex=.7,xlab="",ylab="")
> abline(v=threshold,col="gray")

And for the ROC curve, simply use

> M.ROC=ROC.curve(seq(0,1,by=.01))
> plot(M.ROC[1,],M.ROC[2,],col="grey",lwd=2,type="l")

This is the ROC curve. Now, to see why it can be interesting, we need a second model. Consider for instance a classification tree

> library(tree)
> ctr <- tree(Y~X1+X2+X3bis,data=db)
> plot(ctr)
> text(ctr)

To plot the ROC curve, we just need to use the prediction obtained using this second model,

> S=predict(ctr)

All the code described above can be used. Again, we can plot $(\widehat{S}_i,Y_i)$ (observe that we have 5 possible values for $\widehat{S}_i$, which makes sense since we do have 5 leaves on our tree). Then, we can plot the ROC curve,

An interesting idea can be to plot the two ROC curves on the same graph, in order to compare the two models

> M.ROC.tree=ROC.curve(seq(0,1,by=.01))
> plot(M.ROC[1,],M.ROC[2,],type="l")
> lines(M.ROC.tree[1,],M.ROC.tree[2,],type="l",col="grey",lwd=2)

The most difficult part is to get a proper interpretation. The tree does not predict well in the lower part of the curve, which concerns people with a very high predicted probability. If our interest is more in those with a probability lower than 90%, then we have to admit that the tree is doing a good job, since its ROC curve is always higher there, compared with the logistic regression.
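A natural one-number summary for comparing the two curves is the area under the ROC curve (AUC). Neither function above computes it, but it can be approximated with the trapezoidal rule; a minimal sketch (auc.trapezoid is a hypothetical helper, applicable to a matrix of rates such as M.ROC or M.ROC.tree):

```r
# trapezoidal approximation of the area under an ROC curve,
# given vectors of false and true positive rates
auc.trapezoid <- function(fpr, tpr){
  o <- order(fpr)                     # sort the points along the x-axis
  fpr <- fpr[o]; tpr <- tpr[o]
  sum(diff(fpr) * (head(tpr, -1) + tail(tpr, -1)) / 2)
}
# sanity checks on hypothetical curves: a perfect classifier
# has AUC 1, and the diagonal (random guessing) has AUC 0.5
auc.trapezoid(c(0, 0, 1), c(0, 1, 1))       # [1] 1
auc.trapezoid(c(0, 0.5, 1), c(0, 0.5, 1))   # [1] 0.5
```

On the curves above, something like auc.trapezoid(M.ROC[1,], M.ROC[2,]) would give the logistic regression's AUC.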
