# Predicting creditworthiness: part-2

This article was first published on **R on Know Your Data**, and kindly contributed to R-bloggers.


## Refining the credit model(s)

To continue with the creditworthiness case, I want to explore it a little further by adding meta-algorithms and techniques such as boosting, winnowing and cross-validation. Additionally, I’ll use `randomForest` as a classifier algorithm.

I’m still using the same German credit data as in the previous post, and the same train/test split. Each model is stored in a single object, `models`.

    # object that will store all the models in a list
    models <- list()

I start with three different models, all generated with the `C5.0` algorithm. The **first model** is a default model with no extra features. The **second model** adds boosting: instead of generating just one classifier, it generates several. After each iteration it focuses more on the misclassified examples, in order to reduce bias. The **third model** has the `winnow` parameter set to `TRUE`, in addition to boosting. Basically, it searches over the 20 attributes of the dataset and pre-selects a subset of attributes that will be used to construct the decision tree or ruleset. Read more in the C5.0 tutorial.

    # C5.0 package
    library(C50)

    # default model
    set.seed(2)
    baseMod <- C5.0(training[, -1], training$Creditability)
    # store baseMod in the models object
    models$baseMod <- baseMod

    # using boosting with the C5.0 model
    set.seed(2)
    BoostMod <- C5.0(training[, -1], training$Creditability, trials = 100)
    # store BoostMod in the models object
    models$BoostMod <- BoostMod

    # using winnowing and boosting
    set.seed(2)
    WinnowMod <- C5.0(training[, -1], training$Creditability,
                      control = C5.0Control(winnow = TRUE),
                      trials = 100)
    # store WinnowMod in the models object
    models$WinnowMod <- WinnowMod
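To make the idea of "focusing more on misclassified examples" a bit more concrete, here is a minimal sketch of a single AdaBoost-style reweighting round in base R. It is only an illustration under my own assumptions: the toy `misclassified` vector is made up, and the exact weighting scheme inside `C5.0` differs in its details.

```r
# One AdaBoost-style reweighting round on toy data (illustrative only;
# this is not the exact scheme C5.0 uses internally).
n <- 10
weights <- rep(1 / n, n)                        # start with uniform case weights
misclassified <- c(rep(FALSE, 8), TRUE, TRUE)   # hypothetical errors of classifier 1

err   <- sum(weights[misclassified])            # weighted error rate
alpha <- 0.5 * log((1 - err) / err)             # vote weight of this classifier

# up-weight the misclassified cases, down-weight the rest, then renormalise
weights <- weights * exp(ifelse(misclassified, alpha, -alpha))
weights <- weights / sum(weights)

round(weights, 4)
```

After the update the two misclassified cases together carry half of the total weight, so the next classifier in the sequence is pushed to get them right.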

So the models created thus far:

    names(models)
    ## [1] "baseMod" "BoostMod" "WinnowMod"

## Performance measures

After training, let’s gather the performance of those models on new examples. With the `ROCR` package we can compute many performance measures, such as the area under the curve, sensitivity, specificity and accuracy. I wrote a few functions, `accuracyTester`, `getPerformance`, `getSensSpec` and `getVarImportance`, so I can run them for each model in the `models` object.

    library(caret)  # provides postResample()
    library(ROCR)   # provides prediction() and performance()

    # function for returning accuracy on the test dataset
    accuracyTester <- function(predictModel) {
      temp <- predict(predictModel, testing)
      postResample(temp, testing$Creditability)
    }

    # function for calculating performance
    # (input for the ROC curve)
    getPerformance <- function(modelName) {
      score <- predict(modelName, type = "prob", testing)
      pred <- prediction(score[, 1], testing$Creditability)
      perf <- performance(pred, "tpr", "fpr")
      return(perf)
    }

    # function for calculating specificity and sensitivity
    getSensSpec <- function(modelName) {
      score <- predict(modelName, type = "prob", testing)
      pred <- prediction(score[, 1], testing$Creditability)
      perf <- performance(pred, "sens", "spec")
      return(perf)
    }

### ROC and Accuracy plot

One method to evaluate the models is to calculate the overall accuracy of each model. `BoostMod` has the highest accuracy.

A more reliable method to evaluate a model’s performance is the Receiver Operating Characteristic (ROC) curve. It’s a widely used visualization technique for evaluating binary classifiers, and predicting good or bad creditworthiness is indeed a binary classification. A ROC curve is created by plotting the true positive rate (TPR), or sensitivity, against the false positive rate (FPR); it thus shows the TPR as a function of the FPR.

At each FPR, `BoostMod` has the highest TPR.
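As a rough illustration of how the points on a ROC curve come about, the sketch below computes FPR/TPR pairs in base R by sweeping a threshold over the scores. The `labels` and `scores` vectors are made-up toy values, not output from any of the models above; in this post the actual curves come from `ROCR`.

```r
# Toy scores and true labels (1 = positive class); the values are made up.
labels <- c(1, 1, 0, 1, 0, 1, 0, 0)
scores <- c(0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.1)

# Sweep the threshold over every observed score, from strict to lenient.
thresholds <- c(Inf, sort(unique(scores), decreasing = TRUE))

roc <- t(sapply(thresholds, function(th) {
  predicted <- scores >= th                    # predicted positive at this threshold
  c(fpr = sum(predicted & labels == 0) / sum(labels == 0),
    tpr = sum(predicted & labels == 1) / sum(labels == 1))
}))

roc   # one row per threshold: the ROC curve is these points joined up
```

Passing the two columns to `plot(roc[, "fpr"], roc[, "tpr"], type = "s")` draws the curve; a model whose curve lies higher at a given FPR has the higher TPR there.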

## Caret package

Another great package I found is the `caret` package. It has a uniform interface to a lot of predictive algorithms. It also provides a generic approach for visualization, pre-processing, data-splitting, variable importance, model performance and parallel processing. This is handy, since different modeling functions have different syntax for model training, predicting and parameter tuning.

`caret` has bindings to the `C5.0` algorithm, so it can also tune the boosting and winnowing parameters. Another way to get a more reliable estimate of accuracy is **K-fold cross-validation**. Just for illustration I will use 10-fold cross-validation, repeated 5 times, but only on the training set, so that I can still use the test set for the other performance measures.

This image illustrates the mechanics of cross-validation:

    # a list of values that define how train() acts
    ctrl <- trainControl(method = "repeatedcv", # repeated k-fold cross-validation
                         number = 10,           # 10 folds
                         repeats = 5)           # 5 repeats

    # train model
    set.seed(2)
    cvMod <- train(Creditability ~ ., data = training, method = "C5.0",
                   trControl = ctrl,
                   tuneGrid = expand.grid(trials = 15,
                                          model = c("tree", "rules"),
                                          winnow = c(TRUE, FALSE)))
    # store cvMod in the models object
    models$cvMod <- cvMod
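The mechanics of a single round of K-fold cross-validation can be sketched in base R as below. This is only an illustration under my own assumptions: the `toy` data is simulated, and a logistic regression (`glm`) stands in for `C5.0` so that the example needs no extra packages. `repeatedcv` with `repeats = 5` would repeat the whole procedure five times with different random fold assignments.

```r
set.seed(2)

# Simulated stand-in data (not the German credit data).
n <- 100
toy <- data.frame(x = rnorm(n))
toy$y <- factor(ifelse(toy$x + rnorm(n) > 0, "good", "bad"))

k <- 10
fold <- sample(rep(seq_len(k), length.out = n))   # random fold label per row

fold_acc <- sapply(seq_len(k), function(i) {
  train_rows <- toy[fold != i, ]                  # k - 1 folds for fitting
  held_out   <- toy[fold == i, ]                  # 1 fold for validation
  fit  <- glm(y ~ x, data = train_rows, family = binomial)
  prob <- predict(fit, newdata = held_out, type = "response")  # P(y == "good")
  pred <- ifelse(prob > 0.5, "good", "bad")
  mean(pred == held_out$y)                        # accuracy on the held-out fold
})

mean(fold_acc)   # cross-validated accuracy estimate
```

Every row is used for validation exactly once, which is why the averaged accuracy is a more stable estimate than the accuracy of a single split.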

So far I have explored the construction of a single classification tree with the `C5.0` package, and tried to improve its performance by adding an ensemble learner (boosting). Looking at the ROC and accuracy plot, this seems to be the best performing model so far. Another ensemble learner is available in, for example, the `randomForest` package. Instead of boosting, it uses a different ensemble technique called **bagging** (**b**ootstrap **agg**regat**ing**).

Here, I’m only learning a forest on the training set, so I can evaluate its performance just like the other models.

    library(randomForest)

    # RANDOMFOREST
    set.seed(2)
    rfModel <- randomForest(Creditability ~ ., data = training,
                            ntree = 500, importance = TRUE,
                            proximity = TRUE, keep.forest = TRUE)
    # store rfModel in the models object
    models$rfModel <- rfModel
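The bagging idea itself can be sketched in a few lines of base R. This is a simplified illustration under my own assumptions: the `toy` data is simulated, and logistic regressions stand in for the decision trees. A real random forest additionally samples a random subset of predictors at each split, which is not shown here.

```r
set.seed(2)

# Simulated stand-in data (not the German credit data).
n <- 200
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
toy$y <- factor(ifelse(toy$x1 - toy$x2 + rnorm(n) > 0, "good", "bad"))
train_rows <- toy[1:150, ]
test_rows  <- toy[151:200, ]

n_models <- 25   # odd, so a majority vote cannot tie
votes <- sapply(seq_len(n_models), function(b) {
  # bootstrap: resample the training rows with replacement
  boot <- train_rows[sample(nrow(train_rows), replace = TRUE), ]
  fit  <- glm(y ~ x1 + x2, data = boot, family = binomial)
  predict(fit, newdata = test_rows, type = "response") > 0.5   # TRUE = "good"
})

# aggregate: majority vote across the bootstrap models
bagged <- ifelse(rowMeans(votes) > 0.5, "good", "bad")
mean(bagged == test_rows$y)   # accuracy of the bagged ensemble
```

Each column of `votes` holds the predictions of one bootstrap model, and the row-wise majority gives the ensemble prediction.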

A slight improvement can be seen on the ROC curve: the cross-validated model and the random forest both sit slightly higher on the curve.

Another way to evaluate the ROC performance is to calculate the **Area under the Curve** (AUC). Again, the two best models so far, `cvMod` and `rfModel`, share the highest AUC.

baseMod | BoostMod | WinnowMod | cvMod | rfModel |
---|---|---|---|---|
0.66 | 0.77 | 0.73 | 0.79 | 0.79 |
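One way to read an AUC value is as the probability that a randomly chosen positive case gets a higher score than a randomly chosen negative one. The base-R sketch below computes it that way on made-up toy values (the `labels` and `scores` vectors are assumptions for illustration, not model output).

```r
# Toy scores and true labels (1 = positive class); the values are made up.
labels <- c(1, 1, 0, 1, 0, 1, 0, 0)
scores <- c(0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.1)

pos <- scores[labels == 1]
neg <- scores[labels == 0]

# Compare every positive score with every negative score; ties count half.
wins <- outer(pos, neg, function(p, q) (p > q) + 0.5 * (p == q))
auc  <- mean(wins)
auc   # 13 of the 16 positive/negative pairs are ordered correctly: 0.8125
```

An AUC of 0.5 corresponds to random guessing and 1 to a perfect ranking, which puts the values in the table above into perspective.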

`rfModel` has the highest accuracy, but at what cost?

I guess a bank will choose a more conservative approach and follow a strategy that predicts bad creditworthiness more precisely. Thus, a bank would rather avoid false positives (predicted good, actual bad) than false negatives (predicted bad, actual good). So I would assume banks will likely choose a model with good specificity.

Given this notion a bank can evaluate its options by looking at both:

- sensitivity = \(\frac{\text{number of true positives}}{\text{number of true positives} + \text{number of false negatives}}\)
- specificity = \(\frac{\text{number of true negatives}}{\text{number of true negatives} + \text{number of false positives}}\)

For each given model:

model | cutoff | sensitivity | specificity |
---|---|---|---|
baseMod | 0.78 | 0.81 | 0.50 |
BoostMod | 0.66 | 0.66 | 0.78 |
WinnowMod | 0.62 | 0.78 | 0.60 |
cvMod | 0.64 | 0.70 | 0.76 |
rfModel | 0.68 | 0.66 | 0.80 |
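How sensitivity and specificity depend on the chosen cutoff can be sketched in base R as follows. The helper `sens_spec()` and the toy vectors are hypothetical names and values of my own, used only for illustration, with "good" treated as the positive class as in the definitions above.

```r
# Made-up predicted probabilities of "good" and the true classes.
actual <- factor(c("good", "good", "bad", "good", "bad", "good", "bad", "bad"),
                 levels = c("bad", "good"))
prob_good <- c(0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.1)

# Hypothetical helper: sensitivity and specificity at a given cutoff.
sens_spec <- function(prob, actual, cutoff, positive = "good") {
  predicted_pos <- prob >= cutoff
  actual_pos    <- actual == positive
  tp <- sum(predicted_pos & actual_pos)     # predicted good, actual good
  fn <- sum(!predicted_pos & actual_pos)    # predicted bad,  actual good
  tn <- sum(!predicted_pos & !actual_pos)   # predicted bad,  actual bad
  fp <- sum(predicted_pos & !actual_pos)    # predicted good, actual bad
  c(cutoff = cutoff, sens = tp / (tp + fn), spec = tn / (tn + fp))
}

low_cut  <- sens_spec(prob_good, actual, cutoff = 0.5)
high_cut <- sens_spec(prob_good, actual, cutoff = 0.75)
rbind(low_cut, high_cut)
```

Raising the cutoff trades sensitivity for specificity: fewer applicants are predicted good, so fewer bad ones slip through, which is the trade-off a conservative bank would care about.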
