Modelling with R: part 3

October 5, 2011
By

(This article was first published on We think therefore we R, and kindly contributed to R-bloggers)

The previous posts, part 1 and part 2, detailed the procedure to successfully import the data and transform the data so that we can extract some useful information from them. Now it's time to get our hands dirty with some predictive modelling.

The dependent variable here is a binary variable taking values "0" and "1", indicating whether the customer defaulted or not. There are specific, but many, modelling techniques that help us with binary dependent variables. Let's start with the one of the simplest, the Logistic Regression model. And this time, let's just dive straight into it and try to make sense as we go along.


##########-------------2.1: Logistic regression------------###############


m1.logit <- glm(default ~ 
                          amt.fac + 
                          age.fac + 
                          duration +
                          chk_acct +
                          history +
                          purpose +
                          sav_acct +
                          employment +
                          install_rate + 
                          pstatus +
                          other_debtor +
                          time_resid +
                          property +
                          other_install + 
                          housing +
                          other_credits +
                          job +
                          num_depend + 
                          telephone + 
                          foreign
                                  , family = binomial(link = "logit"), data = dev)
# The glm command is for "Generalized Linear Models"
# "~" is the separator between the dependent (or target variable) and the independent variables
# Independent variables are separated by the "+" sign
# "data" requires the data frame in the which the variables are stored
# "family" is used to specify the assumed distribution.


The above code is pretty much self-explanatory. We are running a logistic model on our transformed data. We have specified the "logit" link which means that the estimates (or coefficient or weights) will be for the log of odds ratio and not the direct probabilities. Additionally, note that in case we had to regress the dependent variable, "default" on all the independent variables, we could have simply written
# m1.logit <- glm(default ~ ., family = binomial(link = "logit"), data = dev)
Here the "." indicates all the variables in the data set except for the dependent variable. We are not using the command here is because we have the original "amount" and "age" variables as well as the transformed "amt.fac" and "age.fac" variables as well and we should use only one form, either the continuous form or the factor form but not both.

 We can check the output of the model as
summary(m1.logit)


You will see an output that looks like this
Call:
glm(formula = default ~ amt.fac + age.fac + duration + chk_acct + 
history + purpose + sav_acct + employment + install_rate +
pstatus + other_debtor + time_resid + property + other_install +
housing + other_credits + job + num_depend + telephone +
foreign, family = binomial(link = "logit"), data = dev)

Deviance Residuals:
Min 1Q Median 3Q Max
-1.9970 -0.7118 -0.3821 0.6864 2.5393

Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.36717 1.28298 0.286 0.77474
amt.fac2600-5000 -0.39145 0.28742 -1.362 0.17322
amt.fac5000+ 0.25237 0.40440 0.624 0.53258
age.fac30-40 -0.12406 0.26915 -0.461 0.64486
age.fac40+ -0.40640 0.30975 -1.312 0.18950
duration 0.03454 0.01152 2.998 0.00272 **
chk_acctA12 -0.26207 0.26204 -1.000 0.31725
chk_acctA13 -0.71081 0.43527 -1.633 0.10246
chk_acctA14 -1.66798 0.27841 -5.991 2.08e-09 ***
historyA31 0.51820 0.68122 0.761 0.44684
historyA32 -0.44583 0.54524 -0.818 0.41354
historyA33 -0.36003 0.58949 -0.611 0.54137
historyA34 -0.89606 0.54235 -1.652 0.09850 .
purposeA41 -1.30443 0.44483 -2.932 0.00336 **
purposeA410 -0.68765 0.79700 -0.863 0.38825
purposeA42 -0.50703 0.31109 -1.630 0.10314
purposeA43 -0.95899 0.29629 -3.237 0.00121 **
purposeA44 -1.07335 0.95318 -1.126 0.26013
purposeA45 -0.40166 0.64852 -0.619 0.53569
purposeA46 0.07051 0.48321 0.146 0.88399
purposeA48 -0.79836 1.32004 -0.605 0.54531
purposeA49 -0.63881 0.40523 -1.576 0.11493
sav_acctA62 -0.20567 0.35346 -0.582 0.56064
sav_acctA63 -0.29441 0.47750 -0.617 0.53752
sav_acctA64 -2.03887 0.74864 -2.723 0.00646 **
sav_acctA65 -0.95937 0.32147 -2.984 0.00284 **
employmentA72 0.24498 0.51552 0.475 0.63463
employmentA73 0.01948 0.50334 0.039 0.96913
employmentA74 -0.41741 0.54075 -0.772 0.44016
employmentA75 0.01317 0.50928 0.026 0.97936
install_rate 0.22383 0.10671 2.098 0.03594 *
pstatusA92 -0.71859 0.53450 -1.344 0.17882
pstatusA93 -1.19509 0.52355 -2.283 0.02245 *
pstatusA94 -0.95261 0.60277 -1.580 0.11402
other_debtorA102 0.71012 0.48442 1.466 0.14267
other_debtorA103 -0.84087 0.51928 -1.619 0.10538
time_resid 0.06092 0.10684 0.570 0.56851
propertyA122 0.43538 0.31755 1.371 0.17035
propertyA123 0.45504 0.29062 1.566 0.11740
propertyA124 0.95268 0.52675 1.809 0.07051 .
other_installA142 0.54490 0.47279 1.153 0.24911
other_installA143 -0.45903 0.28664 -1.601 0.10929
housingA152 -0.58427 0.29480 -1.982 0.04749 *
housingA153 -0.95369 0.59035 -1.615 0.10621
other_credits -0.08412 0.24145 -0.348 0.72753
jobA172 0.51201 0.80390 0.637 0.52419
jobA173 0.56316 0.77175 0.730 0.46556
jobA174 0.09850 0.77019 0.128 0.89823
num_depend 0.38802 0.29445 1.318 0.18758
telephoneA192 -0.19916 0.24766 -0.804 0.42131
foreignA202 -1.35004 0.85752 -1.574 0.11541
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 853.51 on 699 degrees of freedom
Residual deviance: 628.93 on 649 degrees of freedom
AIC: 730.93

Number of Fisher Scoring iterations: 5


We can see that we have five variables which are significant at 1 percent level of significance and one variable significant at 0.1 percent level of significance. Great! Or not so great! How do we judge how good/bad our model is? Have no fear, R is here.

One of the most popular methods to check the performance of a model, which has a binary independent variable, is by plotting the Receiver Operating Characteristic (ROC) curve which is a plot of the sensitivity, or the true positive rate vs. the false positive rate. In simpler terms, it plots the no. of 1s in the model correctly identified as 1 vs. the no. of 0s in the model which are incorrectly identified as 1.

Also, it makes more sense to plot the ROC curve on the validation sample. For this, we would need to score that validation data set. By "scoring" we mean that with the values of the independent variables given in the validation sample, we will run the model equation on each row of the data set and get the probability of the customer defaulting. Simply, use all the Xs and the Y-hats or the predicted Ys.
val$m1.yhat <- predict(m1.logit, val, type = "response")
# The predict command runs the regression model on the "val" dataset and stores the estimated  y-values, i.e, the yhat.

library(ROCR)
m1.scores <- prediction(val$m1.yhat, val$default)
# The prediction function of the ROCR library basically creates a structure to validate our predictions, "val$yhat" with respect to the actual y-values "val$default"

We can then plot the ROC curve by plotting the performance fuction which checks how our model is performing.
plot(performance(m1.scores, "tpr", "fpr"), col = "red")
abline(0,1, lty = 8, col = "grey")
# "tpr" and "fpr" are arguments of the "performance" function indicating that the plot is between the true positive rate and the false positive rate.
# "col" is an argument to the plot function and indicates the colour of the line
# "abline" plots the diagonal, "lty" is the line type which is used to create the dashed line

The ROC curve's performance is judged by the area under the curve formed by the ROC curve and the 45' line. The idea is that if we randomly try to classify the customers who will default, we will get the 45' line. And if we use a predictive approach, then the approach should try to classify all the 1s and 1, i.e, correctly identify all the people who defaulted. Such a curve would then be a right triangle with vertices at (0,0), (0,1), (1,1). But since predictive modelling is usually accompanied with some error, we get a curve like we see in the graph. And the rule of thumb is, more the area under the curve, better the model.

In addition to the ROC curve, the KS statistic is also a popular metric used to judge model performance, the Kolmogorov-Smirnov statistic. It is the maximum difference between the cumulative true positive rate and the cumulative false positive rate

For this let's store the output of the "performance" function above to an R object
m1.perf <- performance(m1.scores, "tpr", "fpr")
ks1.logit <- max(attr(m1.perf, "y.values")[[1]] - (attr(m1.perf, "x.values")[[1]]))
ks1.logit


Well, the ROC shows that the model is not so great but not so bad either. So what do we do to improve our model? Do some more transformations, tweak it a bit, or just try a different technique?


I'll leave this as an open question to make it interactive. Let's try all the "feasible" approaches that people suggest that will help us improve the model.                 
               

To leave a comment for the author, please follow the link and comment on his blog: We think therefore we R.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...



If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Comments are closed.