(This article was first published on **R on Know Your Data**, and kindly contributed to R-bloggers)

## Refining the credit model(s)

To continue with the creditworthiness case, I want to explore it a little further by adding techniques such as boosting, winnowing and cross-validation. Additionally, I’ll use `randomForest` as a classifier algorithm.

I’m still using the same German credit data as in the previous post, with the same train/test split. All models are stored in a single list object, `models`.

```
# object that will store all the models in a list
models <- list()
```

I start with three different models, all generated with the `C5.0` algorithm. The **first model** is a default model with no extra features. The **second model** adds boosting: instead of generating just one classifier, it generates several. Each iteration focuses more on the examples misclassified so far, which reduces bias. The **third model** keeps boosting and also sets the `winnow` parameter to `TRUE`. Basically, it searches over the 20 attributes of the dataset and pre-selects a subset of attributes that will be used to construct the decision tree or ruleset. Read more in the C5.0 tutorial.

```
# load the C5.0 package
library(C50)
# base model with default settings
set.seed(2)
baseMod <-
  C5.0(
    training[, -1],
    training$Creditability)
# store baseMod into the models object
models$baseMod <- baseMod
# boosting with the C5.0 model (100 boosting iterations)
set.seed(2)
BoostMod <-
  C5.0(
    training[, -1],
    training$Creditability,
    trials = 100)
# store BoostMod into the models object
models$BoostMod <- BoostMod
# winnowing combined with boosting
set.seed(2)
WinnowMod <-
  C5.0(
    training[, -1],
    training$Creditability,
    control = C5.0Control(winnow = TRUE),
    trials = 100)
# store WinnowMod into the models object
models$WinnowMod <- WinnowMod
```

So the models created thus far:

`names(models)`

`## [1] "baseMod" "BoostMod" "WinnowMod"`

## Performance measures

After training, let’s measure how these models perform on new examples. With the `ROCR` package we can compute many performance measures, such as area under the curve, sensitivity, specificity and accuracy. I wrote a few functions, `accuracyTester`, `getPerformance`, `getSensSpec` and `getVarImportance`, so I can run them on each model in the `models` object.

```
library(caret) # postResample()
library(ROCR)  # prediction(), performance()
# function for returning accuracy on test dataset
accuracyTester <-
  function(predictModel) {
    temp <- predict(predictModel, testing)
    postResample(temp, testing$Creditability)
  }
# function for calculating performance
# input for the ROC curve
getPerformance <-
  function(modelName) {
    # class probabilities; column 1 belongs to the first class level
    score <- predict(modelName, newdata = testing, type = "prob")
    pred <- prediction(score[, 1], testing$Creditability)
    perf <- performance(pred, "tpr", "fpr")
    return(perf)
  }
# function for calculating specificity and sensitivity
getSensSpec <-
  function(modelName) {
    score <- predict(modelName, newdata = testing, type = "prob")
    pred <- prediction(score[, 1], testing$Creditability)
    perf <- performance(pred, "sens", "spec")
    return(perf)
  }
```

### ROC and Accuracy plot

One way to evaluate the models is to calculate the overall accuracy of each model. `BoostMod` has the highest accuracy.

A more reliable way to evaluate a model’s performance is the receiver operating characteristic (ROC) curve. It’s a widely used visualization technique for evaluating binary classifiers, and predicting good or bad creditworthiness is indeed a binary classification. A ROC curve is created by plotting the true positive rate (TPR), or sensitivity, against the false positive rate (FPR), so it shows the TPR as a function of the FPR.

For each FPR, `BoostMod` has the highest TPR.
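To make the construction of a ROC curve concrete, here is a minimal base R sketch. The scores and labels are made up for illustration, and `roc_points` is a hypothetical helper, not part of `ROCR`; it simply recomputes TPR and FPR at every distinct cutoff.

```r
# Toy data (made up for illustration): 1 = positive class, 0 = negative class
labels <- c(1, 1, 0, 1, 0, 1, 0, 0)
scores <- c(0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.1)

# roc_points() is a hypothetical helper, not an ROCR function
roc_points <- function(scores, labels) {
  # use every distinct score as a cutoff, from strictest to most lenient
  cutoffs <- c(Inf, sort(unique(scores), decreasing = TRUE))
  # a case is predicted positive when its score reaches the cutoff
  tpr <- sapply(cutoffs, function(cut) sum(scores >= cut & labels == 1) / sum(labels == 1))
  fpr <- sapply(cutoffs, function(cut) sum(scores >= cut & labels == 0) / sum(labels == 0))
  data.frame(cutoff = cutoffs, fpr = fpr, tpr = tpr)
}

roc <- roc_points(scores, labels)
print(roc)

# plot the TPR as a function of the FPR; the dashed diagonal is random guessing
plot(roc$fpr, roc$tpr, type = "s",
     xlab = "False positive rate", ylab = "True positive rate")
abline(0, 1, lty = 2)
```

A curve that stays closer to the top-left corner means a higher TPR for the same FPR, which is how the models are compared here.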

## Caret package

Another great package I found is the `caret` package. It has a uniform interface to a lot of predictive algorithms. It also provides a generic approach to visualization, pre-processing, data splitting, variable importance, model performance and parallel processing. This is handy, since different modeling functions have different syntax for model training, prediction and parameter tuning.

`caret` has bindings to the `C5.0` algorithm, so it can also tune the boosting and winnowing parameters. Another way to get a more reliable estimate of accuracy is **k-fold cross-validation**. Just for illustration I will use 10-fold cross-validation, repeated five times, but only on the training set, so that I can still use the test set for the other performance measures.

This image illustrates the mechanics of cross-validation:

```
library(caret)
# settings that control how train() resamples
ctrl <-
  trainControl(method = "repeatedcv", # repeated k-fold cross-validation
               number = 10,           # 10 folds
               repeats = 5)           # repeated 5 times
# train model
set.seed(2)
cvMod <-
  train(form = Creditability ~ .,
        data = training,
        method = "C5.0",
        trControl = ctrl,
        tuneGrid = expand.grid(trials = 15,
                               model = c("tree", "rules"),
                               winnow = c(TRUE, FALSE)))
# store cvMod into the models object
models$cvMod <- cvMod
```
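What `trainControl` arranges behind the scenes can be sketched by hand in base R. This is only an illustration under assumptions: the data are simulated, and a logistic regression (`glm`) stands in for `C5.0` so the example needs no extra packages. It runs a single round of 10-fold cross-validation rather than five repeats.

```r
set.seed(2)

# simulated two-class data (an assumption, not the German credit data)
n <- 200
toy <- data.frame(x = rnorm(n))
toy$y <- rbinom(n, size = 1, prob = plogis(1.5 * toy$x))

# assign every row to one of k folds at random
k <- 10
folds <- sample(rep(seq_len(k), length.out = n))

# hold each fold out once, train on the other k - 1 folds
fold_acc <- sapply(seq_len(k), function(i) {
  train_part <- toy[folds != i, ]
  held_out <- toy[folds == i, ]
  fit <- glm(y ~ x, data = train_part, family = binomial)
  pred <- as.integer(predict(fit, newdata = held_out, type = "response") > 0.5)
  mean(pred == held_out$y) # accuracy on the held-out fold
})

print(round(fold_acc, 2))
cat("mean CV accuracy:", round(mean(fold_acc), 3), "\n")
```

Averaging over the folds gives an accuracy estimate that depends less on one lucky or unlucky split, while the separate test set stays untouched.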

So far I have explored the construction of a single classification tree with the `C5.0` package and tried to improve its performance by adding an ensemble learner (boosting). Looking at the ROC and accuracy plots, boosting gives the best-performing model so far. Another ensemble learner is available in the `randomForest` package. Instead of boosting, it uses another technique called **bagging** (**b**ootstrap **agg**regat**ing**).
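The idea behind bagging can be sketched in a few lines of base R. The sketch makes several assumptions: the data are simulated, logistic regression (`glm`) stands in for decision trees, and 25 models are used so that a majority vote can never tie. A real random forest additionally samples a random subset of predictors at each split, which this sketch leaves out.

```r
set.seed(2)

# simulated two-class data (an assumption, not the German credit data)
n <- 300
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
toy$y <- rbinom(n, size = 1, prob = plogis(1.2 * toy$x1 - 0.8 * toy$x2))

train_idx <- sample(n, 200)
train_set <- toy[train_idx, ]
test_set <- toy[-train_idx, ]

# fit one model per bootstrap sample and collect its test-set predictions
n_models <- 25
votes <- sapply(seq_len(n_models), function(b) {
  boot <- train_set[sample(nrow(train_set), replace = TRUE), ]
  fit <- glm(y ~ x1 + x2, data = boot, family = binomial)
  as.integer(predict(fit, newdata = test_set, type = "response") > 0.5)
})

# votes has one row per test case and one column per model;
# aggregate by majority vote
bagged_pred <- as.integer(rowMeans(votes) > 0.5)
cat("bagged accuracy:", mean(bagged_pred == test_set$y), "\n")
```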

Here, I’m only learning a forest on the training set, so I can evaluate its performance just like the other models.

```
library(randomForest)
# random forest
set.seed(2)
rfModel <-
  randomForest(Creditability ~ ., data = training,
               ntree = 500,
               importance = TRUE,
               proximity = TRUE,
               keep.forest = TRUE
  )
# store rfModel into the models object
models$rfModel <- rfModel
```

A slight improvement can be seen on the ROC curve: the cross-validated model and the random forest are slightly higher on the curve.

Another way to evaluate ROC performance is to calculate the **area under the curve** (AUC). Again, the two best models, `cvMod` and `rfModel`, share the highest AUC.

baseMod | BoostMod | WinnowMod | cvMod | rfModel |
---|---|---|---|---|
0.66 | 0.77 | 0.73 | 0.79 | 0.79 |
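For intuition about what an AUC value means, it can be computed without drawing the curve at all: it equals the probability that a randomly chosen positive case gets a higher score than a randomly chosen negative one. The base R sketch below uses made-up scores and labels, and `auc_rank` is a hypothetical helper. With `ROCR`, the same quantity is available through `performance(pred, "auc")`.

```r
# Toy data (made up for illustration): 1 = positive class, 0 = negative class
labels <- c(1, 1, 0, 1, 0, 1, 0, 0)
scores <- c(0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.1)

# auc_rank() is a hypothetical helper based on the rank-sum (Mann-Whitney) identity
auc_rank <- function(scores, labels) {
  r <- rank(scores) # ties receive average ranks
  n_pos <- sum(labels == 1)
  n_neg <- sum(labels == 0)
  # rank sum of the positives, minus its smallest possible value,
  # divided by the number of positive/negative pairs
  (sum(r[labels == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

auc <- auc_rank(scores, labels)
print(auc) # 13 of the 16 positive/negative pairs are ordered correctly: 0.8125
```

An AUC of 0.5 corresponds to random guessing and 1 to a perfect ranking.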

`rfModel` has the highest accuracy, but at what cost?

I guess a bank will choose a more conservative approach and follow a strategy with a more precise prediction of bad creditworthiness. Thus, a bank would rather avoid false positives (predicted good, actual bad) than false negatives (predicted bad, actual good). So I would assume banks will likely choose a model with good specificity.

Given this notion a bank can evaluate its options by looking at both:

- sensitivity = \(\frac{\text{true positives}}{\text{true positives} + \text{false negatives}}\)
- specificity = \(\frac{\text{true negatives}}{\text{true negatives} + \text{false positives}}\)

For each given model:

model | cut | sens | spec |
---|---|---|---|
baseMod | 0.78 | 0.81 | 0.50 |
BoostMod | 0.66 | 0.66 | 0.78 |
WinnowMod | 0.62 | 0.78 | 0.60 |
cvMod | 0.64 | 0.70 | 0.76 |
rfModel | 0.68 | 0.66 | 0.80 |
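How a cutoff turns scores into sensitivity and specificity can be sketched in base R. The scores and labels are made up for illustration, and `sens_spec` is a hypothetical helper; it assumes that a score at or above the cutoff is predicted as the positive class.

```r
# Toy data (made up for illustration): 1 = positive class, 0 = negative class
labels <- c(1, 1, 0, 1, 0, 1, 0, 0)
scores <- c(0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.1)

# sens_spec() is a hypothetical helper, not an ROCR function
sens_spec <- function(scores, labels, cutoff) {
  predicted <- as.integer(scores >= cutoff)
  tp <- sum(predicted == 1 & labels == 1) # true positives
  fn <- sum(predicted == 0 & labels == 1) # false negatives
  tn <- sum(predicted == 0 & labels == 0) # true negatives
  fp <- sum(predicted == 1 & labels == 0) # false positives
  c(cut = cutoff, sens = tp / (tp + fn), spec = tn / (tn + fp))
}

# compare a lenient and a strict cutoff
res <- t(sapply(c(0.5, 0.75), function(cut) sens_spec(scores, labels, cut)))
print(res)
# at cut = 0.50: sens = 0.75, spec = 0.5
# at cut = 0.75: sens = 0.50, spec = 1.0
```

Raising the cutoff trades sensitivity for specificity, which is the trade-off a conservative bank would weigh.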
