Credit Scoring in R 101

(This article was first published on jkunst.com: Entries for category R, and kindly contributed to R-bloggers)

In this post we'll fit some predicitve models in (well know) data bases, and evalute the performance of each model. Disclaimer1: for simplicity the predictive variables are treating without apply any transformation to get a better performance or stability, etc. We'll use two datas to evaluate the performances of the models. Both data have categorical and continous variables and we'll use 50-50 split to have a train and test data. The datas are:

  • German Credit: The German Credit data frame has 1000 rows and 8 columns. This are data for clients of a south german bank, 700 good payers and 300 bad payers. They are used to construct a credit scoring method. This data have 20 predictive variables and 1000 observations and have a bad rate of 30%. So, after the select subsets to fit the models the distributions in the data are:
##         sample
## response test  train
##     bad  14.9% 15.1%
##     good 38.1% 31.9%
  • Bankloan Binning: This is a hypothetical data file containing financial and demographic information on past customers. Here is the source. This data have 8 predictive variables and 5000 observations and have a bad rate of 25.1%.
##         sample
## response test  train
##     bad  12.7% 12.4%
##     good 37.3% 37.6%

The models to compare are logistic, conditional inference trees (party package), single-hidden-layer neural network (nnet pakage) and linear discriminant analysis. To evalute the performance there are some indicators like KS statistic, Area under ROC curve among others. If you are not familiar with this terms check this link. Now, let's go with the results.

German Credit Data

##            SCORE SAMPLE   BR   KS AUCROC Gain10 Gain20 Gain30 Gain40 Gain50
## 2 SCORE_Logistic  train 0.32 0.54   0.83   0.25   0.48   0.64   0.74   0.82
## 4    SCORE_CTree  train 0.32 0.40   0.76   0.53   0.53   0.61   0.87   0.87
## 6   SCORE_SLNNET  train 0.32 0.64   0.90   0.30   0.54   0.73   0.82   0.89
## 8      SCORE_LDA  train 0.32 0.54   0.83   0.24   0.47   0.63   0.76   0.82
## 1 SCORE_Logistic   test 0.28 0.46   0.78   0.23   0.43   0.58   0.72   0.82
## 3    SCORE_CTree   test 0.28 0.37   0.74   0.50   0.50   0.57   0.87   0.87
## 5   SCORE_SLNNET   test 0.28 0.42   0.77   0.21   0.40   0.58   0.68   0.79
## 7      SCORE_LDA   test 0.28 0.47   0.79   0.23   0.46   0.57   0.73   0.82
daux <- subset(data1, SAMPLE == "test")
daux_roc <- ldply(str_pattern(names(daux),"SCORE"), function(score){
  perf <- performance(prediction(daux[[score]], daux$GB), "tpr","fpr")
  df <- data.frame(x = unlist(perf@"x.values") , y = unlist(perf@"y.values"))
  df$score <- score
  df
})

ggplot(daux_roc) + geom_line(aes(x,y, color = score), size = 1.2) + 
  scale_color_manual('',values=brewer.pal(length(unique(daux_roc$score)), "RdBu")) +
  geom_path(data=data.frame(x = c(0,1), y = c(0,1)),
            aes(x,y), colour = "gray", size = 1) +
  scale_x_continuous("False Positive Rate (1 - Specificity)",
                     labels = percent_format(), limits = c(0, 1)) +
  scale_y_continuous("True Positive Rate (Sensivity or Recall)",
                     labels = percent_format(), limits = c(0, 1)) +
  theme(legend.position = "top") +
  ggtitle("ROC Curves for German Credit Data (validation)")

plot of chunk roc-curve-1

Now we can plot the distributions of good/bads in each model. We'll transform the data whith melt function and then plot faceting by score.

daux <- subset(data1, SAMPLE == "test",
               select = c("GB", "SCORE_Logistic",
                          "SCORE_CTree","SCORE_SLNNET","SCORE_LDA"))
daux <- melt(daux, id = "GB")

ggplot(daux, aes(x=value, fill = factor(GB))) +
  geom_density(alpha = 0.6, size = .75) +
  facet_wrap(~variable, ncol=2) +
  scale_fill_manual(values = brewer.pal(3, "Dark2")) +
  theme(legend.position = "none",
        axis.ticks = element_blank(),
        axis.text = element_blank(),
        axis.title = element_blank(),
        plot.margin = unit(rep(0.5, 4), "lines"),
        title = element_text(size = 9))

plot of chunk distributions-1

Bankloan Binning data

##            SCORE SAMPLE   BR   KS AUCROC Gain10 Gain20 Gain30 Gain40 Gain50
## 2 SCORE_Logistic  train 0.25 0.54   0.84   0.31   0.51   0.67   0.79   0.89
## 4    SCORE_CTree  train 0.25 0.52   0.84   0.34   0.50   0.82   0.82   0.90
## 6   SCORE_SLNNET  train 0.25 0.56   0.86   0.32   0.52   0.68   0.82   0.89
## 8      SCORE_LDA  train 0.25 0.52   0.84   0.31   0.51   0.66   0.78   0.88
## 1 SCORE_Logistic   test 0.25 0.53   0.84   0.31   0.51   0.68   0.78   0.87
## 3    SCORE_CTree   test 0.25 0.45   0.79   0.30   0.51   0.76   0.76   0.84
## 5   SCORE_SLNNET   test 0.25 0.51   0.83   0.30   0.51   0.66   0.77   0.85
## 7      SCORE_LDA   test 0.25 0.50   0.83   0.31   0.51   0.67   0.77   0.85

plot of chunk roc-curve-2

plot of chunk distributions-2

Do you want to comment about the results? If you are interesting in this topic reproduce this example. And if you have questions and/or improvements or want to know more details for the code please comment.

References

  1. Ggplot2
  2. RStudio
  3. Knitr
  4. Guide to credit scoring in R

To leave a comment for the author, please follow the link and comment on his blog: jkunst.com: Entries for category R.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...



If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Comments are closed.