Visualizing ROC Curves in R using Plotly
[This article was first published on R – Modern Data, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
In this post we’ll create some simple functions to generate and chart a Receiver Operator (ROC) curve and visualize it using Plotly. See Carson’s plotly book for more details around changes in syntax.
We’ll do this from a credit risk perspective i.e. validating a bank’s internal rating model (we’ll create a sample dataset keeping this in mind)
We’ll replicate computations highlighted in this paper.
library(plotly) library(dplyr) library(flux)
Sample data
set.seed(123)
n <- 100000
lowest.rating <- 10
# Sample internal ratings
# Say we have a rating scale of 1 to 10̥
ratings <- sample(1:lowest.rating, size = n, replace = T)
# Defaults
# We'll randomly assign defaults concentrating more defaults
# in the lower rating ranges. We'll do this by creating exponentially
# increasing PDs across the rating range
power <- 5
PD <- log(1:lowest.rating)
PD <- PD ^ power
#PD <- exp((1:lowest.rating))
PD <- PD/(max(PD) * 1.2) # increased denominator to make the PDs more realistic
Now given PD for eac rating category sample from a binomial distribution
# to assign actual defaults
defaults <- rep(0, n)
k <- 1
for(i in ratings){
defaults[k] <- rbinom(1, 1, PD[i])
k <- k + 1
}
dataset <- data.frame(Rating = ratings,
Default = defaults)
# Check if dataset looks realistic̥
# df <- dataset %>%
# group_by(Rating) %>%
# summarize(Def = sum(Default == 1), nDef = sum(Default == 0))
ROC Curve Computation
Now that we have a sample dataset to work with we can start to create the ROC curve
ROCFunc <- function(cutoff, df){
# Function counts the number of defaults hap̥pening in all the rating
# buckets less than or equal to the cutoff
# Number of hits = number of defaults with rating < cutoff / total defaults
# Number of false alarms = number ofnon defaults with rating < cutoff / total non defaults
nDefault <- sum(df$Default == 1)
notDefault <- sum(df$Default == 0)
temp <- df %>% filter(Rating >= cutoff)
hits <- sum(temp$Default == 1)/nDefault
falsealarm <- sum(temp$Default == 0)/notDefault
ret <- matrix(c(hits, falsealarm), nrow = 1)
colnames(ret) <- c("Hits", "Falsealarm")
return(ret)
}
# Arrange ratings in decreasing order
# A lower rating is better than a higher rating
vec <- sort(unique(ratings), decreasing = T)
ROC.df <- data.frame()
for(i in vec){
ROC.df <- rbind(ROC.df, ROCFunc(i, dataset))
}
# Last row to complete polygon
labels <- data.frame(x = ROC.df$Falsealarm,
y = ROC.df$Hits,
text = vec)
ROC.df <- rbind(c(0,0), ROC.df)
# Area under curve
AUC <- round(auc(ROC.df$Falsealarm, ROC.df$Hits),3)
Plot
plot_ly(ROC.df, y = ~Hits, x = ~Falsealarm, hoverinfo = "none") %>%
add_lines(name = "Model",
line = list(shape = "spline", color = "#737373", width = 7),
fill = "tozeroy", fillcolor = "#2A3356") %>%
add_annotations(y = labels$y, x = labels$x, text = labels$text,
ax = 20, ay = 20,
arrowcolor = "white",
arrowhead = 3,
font = list(color = "white")) %>%
add_segments(x = 0, y = 0, xend = 1, yend = 1,
line = list(dash = "7px", color = "#F35B25", width = 4),
name = "Random") %>%
add_segments(x = 0, y = 0, xend = 0, yend = 1,
line = list(dash = "10px", color = "black", width = 4),
showlegend = F) %>%
add_segments(x = 0, y = 1, xend = 1, yend = 1,
line = list(dash = "10px", color = "black", width = 4),
showlegend = F) %>%
add_annotations(x = 0.8, y = 0.2, showarrow = F,
text = paste0("Area Under Curve: ", AUC),
font = list(family = "serif", size = 18, color = "#E8E2E2")) %>%
add_annotations(x = 0, y = 1, showarrow = F, xanchor = "left",
xref = "paper", yref = "paper",
text = paste0("Receiver Operator Curve"),
font = list(family = "arial", size = 30, color = "#595959")) %>%
add_annotations(x = 0, y = 0.95, showarrow = F, xanchor = "left",
xref = "paper", yref = "paper",
text = paste0("Charts the percentage of correctly identified defaults (hits) against the percentage of non defaults incorrectly identifed as defaults (false alarms)"),
font = list(family = "serif", size = 14, color = "#999999")) %>%
layout(xaxis = list(range = c(0,1), zeroline = F, showgrid = F,
title = "Number of False Alarms"),
yaxis = list(range = c(0,1), zeroline = F, showgrid = F,
domain = c(0, 0.9),
title = "Number of Hits"),
plot_bgcolor = "#E8E2E2",
height = 800, width = 1024)
To leave a comment for the author, please follow the link and comment on their blog: R – Modern Data.
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.