As I said in the previous post, this summer I’ve been learning some of the most popular machine learning algorithms and trying to apply what I’ve learned to real-world scenarios. The German Credit dataset provided by the UCI Machine Learning Repository is another great example to practise on.
The German Credit dataset contains 1000 samples of applicants asking for some kind of loan, each labelled with its creditability (either good or bad), along with 20 features that are believed to be relevant in predicting creditability. Some examples are the duration of the loan, the amount, the age of the applicant, the sex, and so on. Note that the dataset contains both categorical and quantitative features.
This is a classic example of a binary classification task, where our goal is essentially to build a model that can improve the selection process of the applicants.
As a first step, note that this is an unbalanced dataset: 70% of the samples are labelled 1 (creditworthy) while 30% are not. Usually, the more unbalanced the dataset is, the harder it is to fit a model, since we are trying to predict a rare event. Furthermore, our model should be at least 70% accurate; otherwise we might as well not use any model, because simply labelling every applicant as creditworthy already yields 70% accuracy (a coin flip would give about 50%, and anything lower would be worse than guessing). It would also be better to fit a model that is more accurate at predicting bad applicants, since when a client defaults the loss might be significant. Taking ‘creditworthy’ as the positive class, a false positive (a bad applicant approved) is therefore worse than a false negative (a good applicant rejected). A confusion matrix will help here. This last requirement shows that machine learning techniques, when applied, should be assessed from different perspectives, the business purpose being one of them. If the aim of the model is to protect the bank from giving out loans to bad applicants, then (as you will soon see) this simple classifier might not be a good solution.
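The 70% baseline can be checked with a small Python sketch. The labels below are toy data that only mirror the 70/30 split (they are not the real dataset), and the helper names `majority_baseline_accuracy` and `confusion_counts` are made up for illustration.

```python
from collections import Counter


def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most common label."""
    counts = Counter(labels)
    return counts.most_common(1)[0][1] / len(labels)


def confusion_counts(actual, predicted, positive=1):
    """Return (tp, fp, fn, tn), treating `positive` as the positive class."""
    tp = fp = fn = tn = 0
    for a, p in zip(actual, predicted):
        if p == positive:
            if a == positive:
                tp += 1  # good applicant approved
            else:
                fp += 1  # bad applicant approved (the costly error)
        else:
            if a == positive:
                fn += 1  # good applicant rejected
            else:
                tn += 1  # bad applicant rejected
    return tp, fp, fn, tn


# Toy labels mirroring the 70/30 split: 1 = creditworthy, 0 = not creditworthy.
labels = [1] * 700 + [0] * 300
approve_all = [1] * len(labels)  # the "no model" strategy

print(majority_baseline_accuracy(labels))
print(confusion_counts(labels, approve_all))
```

Approving everyone reaches the baseline accuracy, but every bad applicant ends up as a false positive, which is why accuracy alone is not enough here.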
After running some tests (which I will not show here), such as checking the correlation between features using cor() and cov(), I decided to use fewer features (namely 14), because some of the existing features add too little information or are highly correlated with other variables.
I am going to fit a logistic regression model with half of the entire dataset (500 samples) and test the model on the other half. Below you can find the model fitting:
Analysis of the summary:
In the summary, we see that even though the variables used are uncorrelated, some of them are not statistically significant. Oddly enough, Guarantors, age and having valuable assets seem not to be statistically significant. Perhaps this should be analysed deeper.
On the other hand, the model tells us that the account balance, payment status of previous credit (credit history), purpose and credit amount are statistically significant and, as one might expect, the coefficients seem to agree with common sense. For example:
– A higher credit amount has a negative effect on the probability of the borrower being solvent. More precisely, a unit increase in the variable credit amount is associated with a decrease of 0.000447 units in the log odds of being solvent. On the other hand, a unit increase in the variable account balance is associated with a 0.0684 unit increase in the log odds of being able to repay the debt, meaning that clients with a higher balance are more likely to repay the debt.
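Log odds are not very intuitive, so here is a minimal Python sketch that converts the two coefficients quoted above into odds multipliers and probability shifts. The starting probability of 0.7 and the increase of 1000 units in credit amount are hypothetical values chosen only for illustration, and the function names are my own.

```python
import math

# Coefficients quoted in the text (log-odds scale).
COEF_CREDIT_AMOUNT = -0.000447
COEF_ACCOUNT_BALANCE = 0.0684


def odds_multiplier(coef, delta=1.0):
    """Factor by which the odds of being solvent are multiplied
    when the feature increases by `delta` units."""
    return math.exp(coef * delta)


def shifted_probability(p, coef, delta=1.0):
    """Probability of being solvent after the feature increases by
    `delta` units, starting from probability `p`, all else held fixed."""
    log_odds = math.log(p / (1 - p)) + coef * delta
    return 1 / (1 + math.exp(-log_odds))


# Hypothetical scenario: an applicant at p = 0.7 asks for 1000 more units.
print(odds_multiplier(COEF_CREDIT_AMOUNT, 1000))
print(shifted_probability(0.7, COEF_CREDIT_AMOUNT, 1000))

# One-unit increase in account balance.
print(odds_multiplier(COEF_ACCOUNT_BALANCE))
```

A multiplier below 1 lowers the odds of repayment and a multiplier above 1 raises them, which matches the signs discussed above.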
To improve the model further, we could drop the features with high p-values (that is, those that are not statistically significant) if no other relevant information is available.
How good is our model?
To answer this question thoroughly we should do a proper cross validation. However, for the sake of simplicity, we will try to predict the other half of the dataset out of the box, without shuffling the data too much, in order to have a quick look at how the model performs on brand new data.
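For reference, the splitting step of a k-fold cross validation can be sketched in a few lines of plain Python. This only produces the train/test index sets; fitting and scoring the model on each fold is left out. The function name `k_fold_indices`, the choice of 5 folds and the seed are assumptions made for this example.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross validation.

    Indices are shuffled once, then dealt into k folds; each fold is
    used exactly once as the test set.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# 1000 samples as in the German Credit dataset, split into 5 folds.
for train, test in k_fold_indices(1000, 5):
    print(len(train), len(test))
```

Averaging the accuracy over the folds gives a more stable estimate than a single 50/50 split.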
The 0.75 accuracy score seems fine for a simple logistic regression, although we should remember that it comes from a single train/test split, so the estimate has high variance and might not be that reliable. Again, cross validation should help solve the problem here.
The confusion matrix shows that the model is misclassifying 50% of the ‘bad’ clients, while it is doing a decent job at classifying good ones. This shows that further analysis should be done to try to reduce the first kind of misclassification (if we ignore the opportunity costs, or assume they are negligible compared to the losses from bad clients). In any case, the classifier’s accuracy should stay above 0.7. Misclassification rates among bad applicants like the one above are likely to be unacceptable.
Can our model improve the business?
If we assume that the bank loses everything with bad applicants and earns 30% with good applicants (say on 7-8 year loans), then by just saying yes to all applicants the bank would incur, on average, a loss of 0.09 per unit lent (0.7*0.3 - 0.3 = -0.09), while using the model it would obtain an average profit of 0.0174 per unit (0.598*0.3 - 0.162 = 0.0174).
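The arithmetic can be reproduced with a short Python sketch. It assumes, as the formulas above suggest, that 0.598 and 0.162 are the shares of all test applicants that the model approves and that turn out good and bad respectively; the function name and parameter names are hypothetical.

```python
def expected_profit_per_unit(approved_good_share, approved_bad_share,
                             gain_on_good=0.30, loss_on_bad=1.0):
    """Average profit per unit of credit across all applicants.

    Assumes the bank earns `gain_on_good` on each approved good applicant,
    loses `loss_on_bad` on each approved bad applicant, and neither earns
    nor loses anything on rejected applicants.
    """
    return approved_good_share * gain_on_good - approved_bad_share * loss_on_bad


# Saying yes to everyone: 70% good and 30% bad applicants are all approved.
approve_all = expected_profit_per_unit(0.70, 0.30)

# Using the model: shares of approved good / approved bad applicants.
with_model = expected_profit_per_unit(0.598, 0.162)

print(round(approve_all, 4), round(with_model, 4))
```

Under these assumptions any strategy breaks even only when the share of approved bad applicants is below 30% of the share of approved good ones, which is why the false positives dominate the result.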
Can we do better? Sure, in the future I’m going to post more about this.
Next, cross validation.
The dataset has been downloaded from the following source:
Lichman, M. (2013). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.