
Credit Scoring in R 101


In this post we’ll fit some predictive models on well-known datasets and evaluate the performance of each model. Disclaimer: for simplicity, the predictor variables are used as-is, without any transformation to improve performance or stability. We’ll use two datasets to evaluate the models. Both have categorical and continuous variables, and each is split 50-50 into train and test samples. The response distribution by sample is shown below (first for the German Credit data, then for the Bankloan data); a sketch of how such a split can be produced follows the tables.

##         sample
## response test  train
##     bad  14.9% 15.1%
##     good 38.1% 31.9%
##         sample
## response test  train
##     bad  12.7% 12.4%
##     good 37.3% 37.6%
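
The tables above can be obtained with a simple random 50-50 split. Here is a minimal sketch, assuming the raw data live in a data frame called german with a binary response column (both names are illustrative, not taken from the original code):

set.seed(1234)

# Assign half of the rows to the training sample, the rest to the test sample
split_half <- function(df) {
  idx <- sample(nrow(df), size = floor(nrow(df) / 2))
  df$sample <- "test"
  df$sample[idx] <- "train"
  df
}

german <- split_half(german)

# Response distribution by sample, as a percentage of the whole data set
round(100 * prop.table(table(response = german$response, sample = german$sample)), 1)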

The models to compare are logistic regression, conditional inference trees (party package), a single-hidden-layer neural network (nnet package) and linear discriminant analysis. To evaluate the performance there are indicators like the KS statistic and the area under the ROC curve, among others. If you are not familiar with these terms check this link. Now, let’s go with the results.
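
As a rough sketch of how these four models might be fitted: the data1, SAMPLE, GB and SCORE_* names are taken from the code further down, while the formula, the nnet settings and the overall structure here are illustrative assumptions, not the author's exact code.

library(party)  # conditional inference trees
library(nnet)   # single-hidden-layer neural network
library(MASS)   # linear discriminant analysis

# Fit on the training sample only; drop SAMPLE so it is not used as a predictor
train <- subset(data1, SAMPLE == "train", select = -SAMPLE)
train$GB <- factor(train$GB)

fit_logit <- glm(GB ~ ., data = train, family = binomial)
fit_ctree <- ctree(GB ~ ., data = train)
fit_nnet  <- nnet(GB ~ ., data = train, size = 5, maxit = 200, trace = FALSE)
fit_lda   <- lda(GB ~ ., data = train)

# Attach the predicted probabilities (scores) to the full data set.
# Each of these is the probability of the second level of GB; adjust if the
# level of interest is coded the other way around.
data1$SCORE_Logistic <- predict(fit_logit, data1, type = "response")
data1$SCORE_CTree    <- sapply(treeresponse(fit_ctree, newdata = data1), "[", 2)
data1$SCORE_SLNNET   <- as.numeric(predict(fit_nnet, data1, type = "raw"))
data1$SCORE_LDA      <- predict(fit_lda, data1)$posterior[, 2]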

German Credit Data

##            SCORE SAMPLE   BR   KS AUCROC Gain10 Gain20 Gain30 Gain40 Gain50
## 2 SCORE_Logistic  train 0.32 0.54   0.83   0.25   0.48   0.64   0.74   0.82
## 4    SCORE_CTree  train 0.32 0.40   0.76   0.53   0.53   0.61   0.87   0.87
## 6   SCORE_SLNNET  train 0.32 0.64   0.90   0.30   0.54   0.73   0.82   0.89
## 8      SCORE_LDA  train 0.32 0.54   0.83   0.24   0.47   0.63   0.76   0.82
## 1 SCORE_Logistic   test 0.28 0.46   0.78   0.23   0.43   0.58   0.72   0.82
## 3    SCORE_CTree   test 0.28 0.37   0.74   0.50   0.50   0.57   0.87   0.87
## 5   SCORE_SLNNET   test 0.28 0.42   0.77   0.21   0.40   0.58   0.68   0.79
## 7      SCORE_LDA   test 0.28 0.47   0.79   0.23   0.46   0.57   0.73   0.82
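
The KS statistic and the area under the ROC curve in the table above can be computed from the scores with the ROCR package; the Gain columns are the cumulative share of bad cases captured in the first deciles when the records are sorted from riskiest to safest score. A hedged sketch, reusing the data1 / GB / SCORE_* names from the code below:

library(ROCR)  # prediction(), performance()
library(plyr)  # ldply()

daux <- subset(data1, SAMPLE == "test")

indicators <- ldply(grep("SCORE", names(daux), value = TRUE), function(score) {
  pred <- prediction(daux[[score]], daux$GB)
  roc  <- performance(pred, "tpr", "fpr")

  # KS: largest gap between the cumulative distributions of goods and bads,
  # i.e. the maximum vertical distance between the ROC curve and the diagonal
  ks  <- max(unlist(roc@y.values) - unlist(roc@x.values))
  auc <- unlist(performance(pred, "auc")@y.values)

  data.frame(SCORE = score, KS = round(ks, 2), AUCROC = round(auc, 2))
})
indicators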

library(ggplot2)
library(scales)        # percent_format()
library(RColorBrewer)  # brewer.pal()

daux <- subset(data1, SAMPLE == "test")

# One ROC curve (tpr vs fpr) per SCORE_* column, stacked into a single data frame
daux_roc <- ldply(grep("SCORE", names(daux), value = TRUE), function(score){
  perf <- performance(prediction(daux[[score]], daux$GB), "tpr", "fpr")
  df <- data.frame(x = unlist(perf@x.values), y = unlist(perf@y.values))
  df$score <- score
  df
})

ggplot(daux_roc) + geom_line(aes(x,y, color = score), size = 1.2) + 
  scale_color_manual('',values=brewer.pal(length(unique(daux_roc$score)), "RdBu")) +
  geom_path(data=data.frame(x = c(0,1), y = c(0,1)),
            aes(x,y), colour = "gray", size = 1) +
  scale_x_continuous("False Positive Rate (1 - Specificity)",
                     labels = percent_format(), limits = c(0, 1)) +
  scale_y_continuous("True Positive Rate (Sensivity or Recall)",
                     labels = percent_format(), limits = c(0, 1)) +
  theme(legend.position = "top") +
  ggtitle("ROC Curves for German Credit Data (validation)")

Now we can plot the score distributions of goods and bads for each model. We’ll reshape the data with the melt function and then plot, faceting by score.

library(reshape2)  # melt()
library(grid)      # unit(), used for plot.margin below

daux <- subset(data1, SAMPLE == "test",
               select = c("GB", "SCORE_Logistic",
                          "SCORE_CTree", "SCORE_SLNNET", "SCORE_LDA"))
daux <- melt(daux, id = "GB")   # long format: one row per score type and value

ggplot(daux, aes(x=value, fill = factor(GB))) +
  geom_density(alpha = 0.6, size = .75) +
  facet_wrap(~variable, ncol=2) +
  scale_fill_manual(values = brewer.pal(3, "Dark2")) +
  theme(legend.position = "none",
        axis.ticks = element_blank(),
        axis.text = element_blank(),
        axis.title = element_blank(),
        plot.margin = unit(rep(0.5, 4), "lines"),
        title = element_text(size = 9))

Bankloan Binning Data

##            SCORE SAMPLE   BR   KS AUCROC Gain10 Gain20 Gain30 Gain40 Gain50
## 2 SCORE_Logistic  train 0.25 0.54   0.84   0.31   0.51   0.67   0.79   0.89
## 4    SCORE_CTree  train 0.25 0.52   0.84   0.34   0.50   0.82   0.82   0.90
## 6   SCORE_SLNNET  train 0.25 0.56   0.86   0.32   0.52   0.68   0.82   0.89
## 8      SCORE_LDA  train 0.25 0.52   0.84   0.31   0.51   0.66   0.78   0.88
## 1 SCORE_Logistic   test 0.25 0.53   0.84   0.31   0.51   0.68   0.78   0.87
## 3    SCORE_CTree   test 0.25 0.45   0.79   0.30   0.51   0.76   0.76   0.84
## 5   SCORE_SLNNET   test 0.25 0.51   0.83   0.30   0.51   0.66   0.77   0.85
## 7      SCORE_LDA   test 0.25 0.50   0.83   0.31   0.51   0.67   0.77   0.85

Do you want to comment on the results? If you are interested in this topic, try reproducing this example. And if you have questions, suggestions for improvement, or want to know more details about the code, please leave a comment.

References

  1. ggplot2
  2. RStudio
  3. knitr
  4. Guide to credit scoring in R
