ROC curves and classification

September 30, 2013

(This article was first published on Freakonometrics » R-english, and kindly contributed to R-bloggers)

To get back to a question asked after the last course (still on non-life insurance), I will spend some time discussing the construction and interpretation of ROC curves. Consider the dataset we used last week,

> db = read.table("http://freakonometrics.free.fr/db.txt",header=TRUE,sep=";")
> attach(db)

The first step is to get a model. For instance, a logistic regression in which some levels of the factor X3 are merged together,

> X3bis=rep(NA,length(X3))
> X3bis[X3%in%c("A","C","D")]="ACD"
> X3bis[X3%in%c("B","E")]="BE"
> db$X3bis=as.factor(X3bis)
> reg=glm(Y~X1+X2+X3bis,family=binomial,data=db)
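As a side note on the merging step, the same grouping can be sketched with the list form of `levels<-` from base R. The factor `X3_toy` below is a made-up example (it is not the X3 column of the dataset), so this is only an illustration of the idea.

```r
# Hypothetical toy factor, standing in for X3
X3_toy <- factor(c("A", "B", "C", "D", "E", "A"))

# Merge levels A, C, D into "ACD" and B, E into "BE"
X3bis_toy <- X3_toy
levels(X3bis_toy) <- list(ACD = c("A", "C", "D"), BE = c("B", "E"))

# Cross-tabulate old and new levels to check the mapping
table(X3_toy, X3bis_toy)
```

Any level left out of the list would become NA, which matches the behaviour of the `rep(NA,...)` initialisation used above.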

From this model, we can predict a probability, not a binary $\{0,1\}$ variable,

> S=predict(reg,type="response")

Let $\widehat{S}$ denote this variable (we can use either the score or the predicted probability; it will not change the construction of our ROC curve). What if we really want to predict a binary variable, as we usually do in decision theory? The idea is to consider a threshold $s$, so that

  • if $\widehat{S}>s$, then $\widehat{Y}$ will be $1$, or “positive” (using standard terminology)
  • if $\widehat{S}\leq s$, then $\widehat{Y}$ will be $0$, or “negative”
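To make the rule concrete, here is a minimal sketch on made-up numbers (the vector `score` and the threshold are assumptions, not values from the dataset). It also illustrates why using the score or the probability leads to the same classification: the logistic transform is increasing, so thresholding the probability at s is the same as thresholding the score at `qlogis(s)`.

```r
# Hypothetical linear predictors (scores) for four individuals
score <- c(-1.7, -0.4, 0.5, 2.3)
prob  <- plogis(score)            # corresponding predicted probabilities

s <- 0.5                          # threshold on the probability scale

# Predicted class: 1 ("positive") if above the threshold, 0 otherwise
Y_hat_prob  <- as.integer(prob > s)
Y_hat_score <- as.integer(score > qlogis(s))  # same rule on the score scale

Y_hat_prob                          # 0 0 1 1
identical(Y_hat_prob, Y_hat_score)  # TRUE
```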

Then we derive a contingency table, or a confusion matrix,

                                   observed value $Y$
                                   “positive”   “negative”
predicted value    “positive”      TP           FP
$\widehat{Y}$      “negative”      FN           TN

where TP is the number of true positives, TN the number of true negatives, FP the number of false positives (type I errors) and FN the number of false negatives (type II errors). We can get that contingency table for a given threshold $s$,

> roc.curve=function(s,print=FALSE){
+ Ps=(S>s)*1
+ FP=sum((Ps==1)*(Y==0))/sum(Y==0)
+ TP=sum((Ps==1)*(Y==1))/sum(Y==1)
+ if(print==TRUE){
+ print(table(Observed=Y,Predicted=Ps))
+ }
+ vect=c(FP,TP)
+ names(vect)=c("FPR","TPR")
+ return(vect)
+ }
> threshold = 0.5
> roc.curve(threshold,print=TRUE)
        Predicted
Observed   0   1
       0   5 231
       1  19 745
      FPR       TPR 
0.9788136 0.9751309

Here, we also compute the false positive rate and the true positive rate,

  • TPR = TP / P = TP / (TP + FN), also called sensitivity: the probability of being predicted positive, given that someone is positive (true positive rate)
  • FPR = FP / N = FP / (FP + TN): the probability of being predicted positive, given that someone is negative (false positive rate)
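As a quick check, the two rates can be recomputed by hand from the counts in the table printed above for the threshold 0.5. The variable names are only for illustration.

```r
# Counts read off the table printed above (threshold 0.5)
TN <- 5     # observed 0, predicted 0
FP <- 231   # observed 0, predicted 1
FN <- 19    # observed 1, predicted 0
TP <- 745   # observed 1, predicted 1

TPR <- TP / (TP + FN)   # true positive rate
FPR <- FP / (FP + TN)   # false positive rate

round(c(FPR = FPR, TPR = TPR), 7)   # 0.9788136 0.9751309
```

These are the same values as the ones returned by the function above.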

The ROC curve is then obtained using several values for the threshold. For convenience, define

> ROC.curve=Vectorize(roc.curve)

First, we can plot $(\widehat{S}_i,Y_i)$ (a standard predicted versus observed graph), and visualize true and false positives and negatives, using simple colors

> # I is TRUE for misclassified observations (false positives or false negatives)
> I=(((S>threshold)&(Y==0))|((S<=threshold)&(Y==1)))
> plot(S,Y,col=c("red","blue")[I+1],pch=19,cex=.7,xlab="",ylab="")
> abline(v=threshold,col="gray")

And for the ROC curve, simply use

> M.ROC=ROC.curve(seq(0,1,by=.01))
> plot(M.ROC[1,],M.ROC[2,],col="grey",lwd=2,type="l")
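For readers who want to run the construction without the dataset, here is a self-contained sketch on simulated data. Everything in it (the simulated `x`, `Y_sim`, `S_sim`, and the helper `roc_point`) is an assumption made for illustration; it is not the code used above, only a variant that takes the score and the response as arguments instead of reading them from the workspace.

```r
set.seed(1)
n <- 500
x <- rnorm(n)
Y_sim <- rbinom(n, size = 1, prob = plogis(0.5 + 1.5 * x))  # simulated 0/1 response
S_sim <- predict(glm(Y_sim ~ x, family = binomial), type = "response")

# One point of the ROC curve for a given threshold s
roc_point <- function(s, score, y) {
  pred <- score > s
  c(FPR = sum(pred & y == 0) / sum(y == 0),
    TPR = sum(pred & y == 1) / sum(y == 1))
}

thresholds <- seq(0, 1, by = 0.01)
roc_sim <- sapply(thresholds, roc_point, score = S_sim, y = Y_sim)

roc_sim[, 1]                   # threshold 0: everyone is predicted positive, point (1, 1)
roc_sim[, length(thresholds)]  # threshold 1: no one is predicted positive, point (0, 0)

plot(roc_sim["FPR", ], roc_sim["TPR", ], type = "l", xlab = "FPR", ylab = "TPR")
abline(0, 1, lty = 2)          # reference line of a purely random classifier
```

Both rates can only decrease when the threshold increases, so the curve runs from (1, 1) for low thresholds down to (0, 0) for high thresholds.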

This is the ROC curve. Now, to see why it can be interesting, we need a second model. Consider for instance a classification tree

> library(tree)
> ctr <- tree(Y~X1+X2+X3bis,data=db)
> plot(ctr)
> text(ctr)

To plot the ROC curve, we just need to use the prediction obtained using this second model,

> S=predict(ctr)

All the code described above can be used. Again, we can plot $(\widehat{S}_i,Y_i)$ (observe that we have 5 possible values for $\widehat{S}_i$, which makes sense since our tree has 5 leaves). Then we can plot the ROC curve in the same way.
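The remark about the leaves has a visible consequence for the ROC curve: a score with only a few distinct values can only produce a few distinct points. Here is a sketch on simulated data that mimics a tree with 5 leaves, without using the tree package; the leaf assignment and the leaf-level rates are invented for the example.

```r
set.seed(2)
n <- 400
leaf <- sample(1:5, n, replace = TRUE)         # hypothetical leaf membership
leaf_prob <- c(0.10, 0.30, 0.50, 0.70, 0.90)   # hypothetical rate in each leaf
Y_sim <- rbinom(n, size = 1, prob = leaf_prob[leaf])

# Score = mean of Y within each leaf, as a regression tree would predict
S_leaf <- ave(Y_sim, leaf)
length(unique(S_leaf))                         # at most 5 distinct scores

thresholds <- seq(0, 1, by = 0.01)
roc_leaf <- sapply(thresholds, function(s) {
  pred <- S_leaf > s
  c(FPR = sum(pred & Y_sim == 0) / sum(Y_sim == 0),
    TPR = sum(pred & Y_sim == 1) / sum(Y_sim == 1))
})

# Distinct (FPR, TPR) points: at most one more than the number of distinct scores
nrow(unique(t(roc_leaf)))
```

So the ROC curve of such a score is piecewise linear with only a handful of vertices, whatever the number of thresholds on the grid.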

It can be interesting to plot the two ROC curves on the same graph, in order to compare the two models,

> M.ROC.tree=ROC.curve(seq(0,1,by=.01))  # S now holds the tree predictions
> plot(M.ROC[1,],M.ROC[2,],type="l")
> lines(M.ROC.tree[1,],M.ROC.tree[2,],type="l",col="grey",lwd=2)

The most difficult part is to get a proper interpretation. The tree is not predicting well in the lower part of the curve, which corresponds to high thresholds and therefore concerns people with a very high predicted probability. If our interest is more in those with a probability lower than 90%, then we have to admit that the tree is doing a good job, since its ROC curve is then always above that of the logistic regression.
