Want to share your content on R-bloggers? click here if you have a blog, or here if you don't. I was recently asked to summarise an analysis using a ROC (Receiver-operator characteristics) plot. R has a number of particularly good tools to produce ROC plots – ROCR, pROC and the Bioconductor package ROC to name a few. However I thought it would be a useful exercise to build such a tool from first principles – partly so I could customise the output to my needs but mainly to understand better the methods behind computing such a plot.

If you’re not sure what a ROC plot is a good summary is given here. So now that we know what we will be building, just how can we accomplish this in R? Let’s start by building a function that computes the co-ordinates for the ROC curve and gives us some summary statistics.

rocdata <- function(grp, pred){ # Produces x and y co-ordinates for ROC curve plot # Arguments: grp - labels classifying subject status # pred - values of each observation # Output: List with 2 components: # roc = data.frame with x and y co-ordinates of plot # stats = data.frame containing: area under ROC curve, p value, upper and lower 95% confidence interval   grp <- as.factor(grp) if (length(pred) != length(grp)) { stop("The number of classifiers must match the number of data points") }   if (length(levels(grp)) != 2) { stop("There must only be 2 values for the classifier") }   cut <- unique(pred) tp <- sapply(cut, function(x) length(which(pred > x & grp == levels(grp)))) fn <- sapply(cut, function(x) length(which(pred < x & grp == levels(grp)))) fp <- sapply(cut, function(x) length(which(pred > x & grp == levels(grp)))) tn <- sapply(cut, function(x) length(which(pred < x & grp == levels(grp)))) tpr <- tp / (tp + fn) fpr <- fp / (fp + tn) roc = data.frame(x = fpr, y = tpr) roc <- roc[order(roc$x, roc$y),]   i <- 2:nrow(roc) auc <- (roc$x[i] - roc$x[i - 1]) %*% (roc$y[i] + roc$y[i - 1])/2   pos <- pred[grp == levels(grp)] neg <- pred[grp == levels(grp)] q1 <- auc/(2-auc) q2 <- (2*auc^2)/(1+auc) se.auc <- sqrt(((auc * (1 - auc)) + ((length(pos) -1)*(q1 - auc^2)) + ((length(neg) -1)*(q2 - auc^2)))/(length(pos)*length(neg))) ci.upper <- auc + (se.auc * 0.96) ci.lower <- auc - (se.auc * 0.96)   se.auc.null <- sqrt((1 + length(pos) + length(neg))/(12*length(pos)*length(neg))) z <- (auc - 0.5)/se.auc.null p <- 2*pnorm(-abs(z))   stats <- data.frame (auc = auc, p.value = p, ci.upper = ci.upper, ci.lower = ci.lower )   return (list(roc = roc, stats = stats)) }

So let’s go through how this works. The function requires two arguments: Firstly a numeric vector giving the results of the test being applied, and secondly a factor vector (or a vector that can be coerced to a factor with two levels) containing the status of the instance being tested eg. disease or healthy, case or control etc. We firstly ensure this grouping vector is a factor

grp <- as.factor(grp)

and then run some checks to ensure there is one classifier for each test result and that there are only two levels of the classifier.

if (length(pred) != length(grp)) { stop("The number of classifiers must match the number of data points") }   if (length(levels(grp)) != 2) { stop("There must only be 2 values for the classifier") }

The ROC plot should show the proportion of true positives against the proportion of false positives for any threshold value. However using the dataset we have we can define a set of thresholds at all the unique values of the test result.

cut <- unique(pred)

Then at each cutoff point we want to calculate the number of true positives, false positives, true negatives and false negatives. We can do this using the sapply function in R. Looking at the true positives as an example, for every cutoff value we count the number of entries where the predictor vector is greater than the cutoff AND the group vector is the higher factor level (hence the reason for ensuring it is an ordered factor).

tp <- sapply(cut, function(x) length(which(pred > x & grp == levels(grp)))) fn <- sapply(cut, function(x) length(which(pred < x & grp == levels(grp)))) fp <- sapply(cut, function(x) length(which(pred > x & grp == levels(grp)))) tn <- sapply(cut, function(x) length(which(pred < x & grp == levels(grp))))

Each of the above produces a vector with the same length as cut. We can then calculate the true positive rate as the vector of true positives divided by the sum of the true positives and false negatives vector

tpr <- tp / (tp + fn)

and the false positive rate as the false positive vector divided by the sum of the false positive vector and the true negative vector.

fpr <- fp / (fp + tn)

these two vectors give our y and x co-ordinates for our ROC plot. It’s then a simple matter of writing them into a data frame and ordering them for plotting to produce our basic ROC plot data.

roc = data.frame(x = fpr, y = tpr) roc <- roc[order(roc$x, roc$y),]

Summary Statistics

Probably the commonest way of summarising the ROC plot is to compute the area under the curve (AUC). This reduces the ROC plot to a single scalar value representing the performance of the test. A perfect test has an AUC of 1 and a test that is equivalent to random guessing has an AUC of 0.5. There are a number of ways of estimating the AUC of a plot; the trapezoid rule is employed here. The function used here is the same as the trapz function from the caTools package.

i <- 2:nrow(roc) auc <- (roc$x[i] - roc$x[i - 1]) %*% (roc$y[i] + roc$y[i - 1])/2

An interesting statistical property of the AUC is that it represents the probability that the test value of a positive case will rank higher than a negative case. This is essentially equivalent to the Wilcoxon-Mann-Whitney statistic. Hence an alternative way of computing the AUC would be

pos <- pred[grp == levels(grp)] neg <- pred[grp == levels(grp)] auc.wilcox <- wilcox.test(pos,neg,exact=0)\$statistic /(length(pos)*length(neg))

The standard error of the ROC curve is given by: (Hanley and McNeil, 1982)

(1) where is the AUC, is the number of positive cases, is the number of negative cases, and . Using the standard error we can calculate the upper and lower 95% confidence intervals for the AUC

pos <- pred[grp == levels(grp)] neg <- pred[grp == levels(grp)] q1 <- auc/(2-auc) q2 <- (2*auc^2)/(1+auc) se.auc <- sqrt(((auc * (1 - auc)) + ((length(pos) -1)*(q1 - auc^2)) + ((length(neg) -1)*(q2 - auc^2)))/(length(pos)*length(neg))) ci.upper <- auc + (se.auc * 0.96) ci.lower <- auc - (se.auc * 0.96)

We can also use equation (1) to calculate a p value for the AUC – ie. the probabality that the AUC = 0.5 (no discrimination). Substituting 0.5 into (1) and simplifying gives us the standard error for our null hypothesis.

(2) We can then calculate a Z score and from that the p value using the normal distribution

pos <- pred[grp == levels(grp)] neg <- pred[grp == levels(grp)] se.auc.null <- sqrt((1 + length(pos) + length(neg))/(12*length(pos)*length(neg))) z <- (auc - 0.5)/se.auc.null p <- 2*pnorm(-abs(z))

To finish off our rocdata function we just need to output our roc curve co-ordinates, auc and p value. I’ve chosen to put the stats measures into one data frame and the roc curve data into another and then put the two dataframes into a list.

stats <- data.frame (auc = auc, p.value = p, ci.upper = ci.upper, ci.lower = ci.lower )   return (list(roc = roc, stats = stats))

References
Hanley JA, McNeil BJ. The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology 1982;143:29-36.