Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

While developing risk models with hundreds of potential variables, we often run into the situation that risk characteristics or macro-economic indicators are highly correlated, namely multicollinearity. In such cases, we might have to drop variables with high VIFs or employ “variable shrinkage” methods, e.g. lasso or ridge, to suppress variables with colinearity.

Feature extraction approaches based on PCA and PLS have been widely discussed but are rarely used in real-world applications due to concerns around model interpretability and implementation. In the example below, it is shown that there shouldn’t any hurdle in the model implementation, e.g. score, given that coefficients can be extracted from a GPLS model in the similar way from a GLM model. In addition, compared with GLM with 8 variables, GPLS with only 5 components is able to provide a comparable performance in the hold-out testing data.

R Code

library(gpls)
library(pROC)

df2 <- df1[df1$CARDHLDR == 1, -c(1, 10, 11, 12, 13)] set.seed(2016) n <- nrow(df2) sample <- sample(seq(n), size = n / 2, replace = FALSE) train <- df2[sample, ] test <- df2[-sample, ] m1 <- glm(DEFAULT ~ ., data = train, family = "binomial") cat("\n### ROC OF GLM PREDICTION WITH TRAINING DATA ###\n") print(roc(train$DEFAULT, predict(m1, newdata = train, type = "response")))
cat("\n### ROC OF GLM PREDICTION WITH TESTING DATA ###\n")
print(roc(test$DEFAULT, predict(m1, newdata = test, type = "response"))) m2 <- gpls(DEFAULT ~ ., data = train, family = "binomial", K.prov = 5) cat("\n### ROC OF GPLS PREDICTION WITH TRAINING DATA ###\n") print(roc(train$DEFAULT, predict(m2, newdata = train)$predicted[, 1])) cat("\n### ROC OF GPLS PREDICTION WITH TESTING DATA ###\n") print(roc(test$DEFAULT, predict(m2, newdata = test)$predicted[, 1])) cat("\n### COEFFICIENTS COMPARISON BETWEEN GLM AND GPLS ###\n") print(data.frame(glm = m1$coefficients, gpls = m2$coefficients)) Output ### ROC OF GLM PREDICTION WITH TRAINING DATA ### Call: roc.default(response = train$DEFAULT, predictor = predict(m1,     newdata = train, type = "response"))

Data: predict(m1, newdata = train, type = "response") in 4753 controls (train$DEFAULT 0) < 496 cases (train$DEFAULT 1).
Area under the curve: 0.6641

### ROC OF GLM PREDICTION WITH TESTING DATA ###

Call:
roc.default(response = test$DEFAULT, predictor = predict(m1, newdata = test, type = "response")) Data: predict(m1, newdata = test, type = "response") in 4750 controls (test$DEFAULT 0) < 500 cases (test$DEFAULT 1). Area under the curve: 0.6537 ### ROC OF GPLS PREDICTION WITH TRAINING DATA ### Call: roc.default(response = train$DEFAULT, predictor = predict(m2,     newdata = train)$predicted[, 1]) Data: predict(m2, newdata = train)$predicted[, 1] in 4753 controls (train$DEFAULT 0) < 496 cases (train$DEFAULT 1).
Area under the curve: 0.6627

### ROC OF GPLS PREDICTION WITH TESTING DATA ###

Call:
roc.default(response = test$DEFAULT, predictor = predict(m2, newdata = test)$predicted[, 1])

Data: predict(m2, newdata = test)$predicted[, 1] in 4750 controls (test$DEFAULT 0) < 500 cases (test\$DEFAULT 1).
Area under the curve: 0.6542

### COEFFICIENTS COMPARISON BETWEEN GLM AND GPLS ###
glm          gpls
(Intercept) -0.1940785071 -0.1954618828
AGE         -0.0122709412 -0.0147883358
MAJORDRG     0.0757313171  0.0813835741
MINORDRG     0.2621574192  0.2547176301
OWNRENT     -0.2803919685 -0.1032119571
INCOME      -0.0004222914 -0.0004531543
LOGSPEND    -0.1688395555 -0.1525963363