Predicting creditworthiness: part-2


Refining the credit model(s)

To continue with the creditworthiness case, I want to explore it a bit further by adding meta-algorithms such as boosting, winnowing and cross-validation. Additionally, I’ll use randomForest as a classifier.

I’m still using the same German credit data as in the previous post, as well as the same train/test split. Each model is stored in a single list object, models.

# object that will store all the models in a list
models <- list()

I start with three different models, all generated with the C5.0 algorithm. The first model is a default model with no extra features. The second model adds boosting: instead of generating just one classifier, C5.0 generates several, and after each iteration it focuses more on the misclassified examples to reduce bias. The third model has the winnow parameter set to TRUE. Basically, it searches over the 20 attributes of the dataset and pre-selects a subset of attributes that will be used to construct the decision tree or ruleset. Read more in the C5.0 tutorial.

# C5.0 package
library(C50)
set.seed(2)
# train model
baseMod <- 
  C5.0(
    training[,-1],
    training$Creditability)
# store basemodel into the models object
models$baseMod <- baseMod

# Using boosting with C5.0 model
set.seed(2)
# train model
BoostMod <- 
  C5.0(
    training[,-1],
    training$Creditability,
    trials = 100)
# store boostmod into the models object
models$BoostMod <- BoostMod

# Using winnow and boosting
set.seed(2)
# train model
WinnowMod <- 
  C5.0(
    training[,-1],
    training$Creditability,
    control = C5.0Control(winnow = TRUE),
    trials = 100)
# store winnowmod into the models object
models$WinnowMod <- WinnowMod

The models created thus far:

names(models)
## [1] "baseMod"   "BoostMod"  "WinnowMod"

Performance measures

After training, let’s gather the performance of these models on new examples. With the ROCR package we can compute lots of performance measures, such as the area under the curve, sensitivity, specificity and accuracy. I wrote a few functions, accuracyTester, getPerformance, getSensSpec and getVarImportance, so I can run them for each model in the models object.

# ROCR provides prediction()/performance(); caret provides postResample()
library(ROCR)
library(caret)

# function for returning accuracy on test dataset
accuracyTester <- 
  function(predictModel) {
    temp <- predict(predictModel, testing)
    postResample(temp, testing$Creditability)
  }
# function for calculating performance
# input for the ROC curve
getPerformance <- 
  function(modelName) {
    score <- predict(modelName, type= "prob", testing)
    pred <-  prediction(score[,1], testing$Creditability)
    perf <- performance(pred, "tpr", "fpr")
    return(perf)
  }
# function for calculating specificity and sensitivity
getSensSpec <- 
  function(modelName) {
    score <- predict(modelName, type= "prob", testing)
    pred <-  prediction(score[,1], testing$Creditability)
    perf <- performance(pred, "sens", "spec")
    return(perf)
  }

ROC and Accuracy plot

One way to evaluate the models is to calculate the overall accuracy of each model on the test set. The BoostMod has the highest accuracy.
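
One way to gather those numbers, reusing the accuracyTester helper defined above (a quick sketch, not output from the original post):

# collect test-set accuracy (and kappa) for every model trained so far
sapply(models, accuracyTester)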

A more reliable way to evaluate a model’s performance is the Receiver Operating Characteristic (ROC) curve, a widely used visualization technique for binary classifiers; predicting good or bad creditworthiness is indeed a binary classification. A ROC curve is created by plotting the true positive rate (TPR, or sensitivity) against the false positive rate (FPR), so it shows the TPR as a function of the FPR.

Across the whole range of false positive rates, the BoostMod has the highest true positive rate.
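
A sketch of how such an ROC comparison can be drawn with the getPerformance helper and ROCR’s plot method (perfList and the colours are my own choices, not from the original post):

# overlay the ROC curves of the three C5.0 models on the test set
perfList <- lapply(models, getPerformance)
plot(perfList$baseMod,   col = "black", main = "ROC curves on the test set")
plot(perfList$BoostMod,  col = "red",   add = TRUE)
plot(perfList$WinnowMod, col = "blue",  add = TRUE)
abline(a = 0, b = 1, lty = 2)   # diagonal = random guessing
legend("bottomright", legend = names(perfList),
       col = c("black", "red", "blue"), lty = 1)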

Caret package

Another great package I found is the caret package. It has a uniform interface to a lot of predictive algorithms and provides a generic approach to visualization, pre-processing, data splitting, variable importance, model performance and parallel processing. This is handy, since different modeling functions have different syntax for model training, prediction and parameter tuning.

Caret has bindings to the C5.0 algorithm, so it will also tune the boosting and winnowing parameters. Another way to get a more reliable estimate of accuracy is K-fold cross-validation. For illustration I will use repeated 10-fold cross-validation, but only on the training set, so the test set remains available for the other performance measures.

This image illustrates the mechanics of 10-fold cross-validation:

# control object that defines how train() resamples
ctrl <- 
  trainControl(method = 'repeatedcv', # repeated k-fold cross-validation
               number = 10,           # 10 folds
               repeats = 5)           # 5 repeats
# train model
set.seed(2)
cvMod <- 
  train(form = Creditability ~.,
        data = training,
        method     = "C5.0",
        trControl  = ctrl,
        tuneGrid = expand.grid(trials = 15, model = c("tree", "rules"), winnow = c(T,F)))
# store cvMod into the models object
models$cvMod <- cvMod
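
The fitted train object stores the resampling results, so we can check which of the candidate combinations (tree vs. rules, winnow on or off) repeated cross-validation preferred. A quick sketch; the exact numbers depend on the resampling:

# resampled performance for each candidate parameter combination
cvMod$results
# the winning combination of trials, model and winnow
cvMod$bestTune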

So far I started by exploring the construction of a single classification tree with the C5.0 package. I tried to improve the performance by adding an ensemble learner (boosting); looking at the ROC and accuracy plots, this seems to be the best performing model so far. Another ensemble learner is available in the randomForest package. Instead of boosting or cross-validation it uses a technique called bagging (bootstrap aggregating).

Here, I train the forest only on the training set, so I can evaluate its performance on the test set just like the other models.

library(randomForest)
# RANDOMFOREST
set.seed(2)
rfModel <- 
  randomForest(Creditability ~ ., data = training,
               ntree = 500,
               importance = TRUE,
               proximity = TRUE,
               keep.forest = TRUE
  )
# store rfmodel into the models object
models$rfModel <- rfModel
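
Since the forest was grown with importance = TRUE, the stored importance measures can be inspected directly. A small sketch using the randomForest package’s own helpers (not shown in the original post):

# permutation- and Gini-based importance measures stored in the forest
head(importance(rfModel))
# dotchart of the most important attributes
varImpPlot(rfModel, main = "Variable importance (rfModel)")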

A slight improvement can be seen on the ROC curve: the cross-validated model and the random forest sit slightly higher on the curve.

Another way to evaluate ROC performance is to calculate the area under the curve (AUC). Again, the two previous best models share the highest AUC.

baseMod BoostMod WinnowMod cvMod rfModel
0.66 0.77 0.73 0.79 0.79
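
These AUC values can be computed with ROCR’s "auc" measure; getAUC below is my own illustrative helper, mirroring the scoring logic of getPerformance:

# area under the ROC curve per model (sketch)
getAUC <- function(modelName) {
  score <- predict(modelName, type = "prob", testing)
  pred  <- prediction(score[, 1], testing$Creditability)
  round(performance(pred, "auc")@y.values[[1]], 2)
}
sapply(models, getAUC)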

The rfModel has the highest accuracy, but at what cost?

I guess a bank will take a more conservative approach and follow a strategy with a more precise prediction of bad creditworthiness. That is, a bank will prefer to avoid false positives (predicted good, actually bad) over false negatives (predicted bad, actually good). So I assume banks will likely choose a model with a good specificity.

Given this notion a bank can evaluate its options by looking at both:

  • sensitivity = \(\frac{\text{number of true positives}}{\text{number of true positives} + \text{number of false negatives}}\)
  • specificity = \(\frac{\text{number of true negatives}}{\text{number of true negatives} + \text{number of false positives}}\)

For each given model:

cut sens spec
baseMod 0.78 0.81 0.50
BoostMod 0.66 0.66 0.78
WinnowMod 0.62 0.78 0.60
cvMod 0.64 0.70 0.76
rfModel 0.68 0.66 0.80
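
One way such a cut/sens/spec summary could be derived from the getSensSpec objects is to pick, per model, the cutoff that maximises sensitivity + specificity. This is only a sketch: bestCut is an illustrative helper, not from the original post, so its numbers need not match the table exactly.

# per model: the cutoff maximising sensitivity + specificity, with both rates
bestCut <- function(modelName) {
  perf <- getSensSpec(modelName)
  cut  <- perf@alpha.values[[1]]   # probability cutoffs
  sens <- perf@y.values[[1]]       # sensitivity (the y measure)
  spec <- perf@x.values[[1]]       # specificity (the x measure)
  i <- which.max(sens + spec)
  round(c(cut = cut[i], sens = sens[i], spec = spec[i]), 2)
}
t(sapply(models, bestCut))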
