July 1, 2012

(This article was first published on Win-Vector Blog » R, and kindly contributed to R-bloggers)

    library(ggplot2)

    # load the charge-off rate data
    d <- read.table('http://www.win-vector.com/dfiles/maskVars/FRB_CHGDEL.csv',
                    sep=',', header=TRUE)
    # one-variable linear model: mortgage charge-offs from credit card charge-offs
    model1 <- lm(Charge.off.rate.on.single.family.residential.mortgages ~
                   Charge.off.rate.on.credit.card.loans, data=d)
    d$model1 <- predict(model1, newdata=d)
    summary(model1)
    plot1 <- ggplot(d) +
      geom_point(aes(x=model1,
                     y=Charge.off.rate.on.single.family.residential.mortgages)) +
      xlim(-1,3) + ylim(-1,3)
    #ggsave('plot1.png',plot1)
    cor(d$model1, d$Charge.off.rate.on.single.family.residential.mortgages,
        use='complete.obs')
    # 0.7706394

The plot below shows the performance of this trivial model (which ignores auto-correlation, inventory, dates, regional factors, macroeconomic factors and regulations). What we see is that the model incorrectly predicts continuous variation between zero and one percent, when actual mortgage charge-offs are more of a step function (the rate stays near zero until it jumps above one percent). Even so, the correlation of this model to actuals is 0.77, which is fair.

Any one-variable linear model is really just a shift and rescaling (an affine transform) of the single input variable. So we get the exact same shape and correlation if we skip the linear modeling step and directly plot the relation between the two variables. We show this in the R code and graph below.

    plotXY <- ggplot(d) +
      geom_point(aes(x=Charge.off.rate.on.credit.card.loans,
                     y=Charge.off.rate.on.single.family.residential.mortgages))
    ggsave('plotXY.png',plotXY)
    cor(d$Charge.off.rate.on.credit.card.loans,
        d$Charge.off.rate.on.single.family.residential.mortgages,
        use='complete.obs')
    # 0.7706394

Now we get to the meat of the masked variable technique. We want to build a piecewise function that better fits the relation. To do this, the analyst (either by hand or through automation) could note that in our last graph residential mortgage charge-off rates do not seem to be very sensitive to credit card charge-off rates until the credit card charge-off rate exceeds 5%.
To encode this domain knowledge we build three new synthetic variables. The first is an indicator that tells us whether the credit card charge-off rate is over 5% or not. We call this variable HL (high/low indicator). We then multiply this new variable by our original variable to get a new variable that only varies when the charge-off rate is above 5% (we call this variable H; it is an interaction between the new indicator variable and the original variable). Finally, we create a third variable that varies only when the credit card charge-off rate is no more than 5%. This variable is equal to (1-HL) times the original variable and we call it L. We call HL the mask and H and L masked variables. The R-code to form these three new synthetic variables is given below:

    # HL: the mask, 1 when the credit card charge-off rate is above 5%
    d$Charge.off.rate.on.credit.card.loans.HL <-
      ifelse(d$Charge.off.rate.on.credit.card.loans > 5, 1, 0)
    # H: the original variable, zeroed out unless the rate is above 5%
    d$Charge.off.rate.on.credit.card.loans.H <-
      with(d, Charge.off.rate.on.credit.card.loans.HL *
                Charge.off.rate.on.credit.card.loans)
    # L: the original variable, zeroed out unless the rate is at most 5%
    d$Charge.off.rate.on.credit.card.loans.L <-
      with(d, (1 - Charge.off.rate.on.credit.card.loans.HL) *
                Charge.off.rate.on.credit.card.loans)

We can now use these new variables to build a slightly better model. We do this by exposing all three synthetic variables to the fitter. Thus the fitter now has available in its concept space all piecewise linear functions with a single break at 5% (including discontinuous functions). This is related to kernel tricks: make the unknown function you want a linear combination of functions you have, and a standard linear fitter can find it for you.
The R-code and graph are given below:

    modelSplit <- lm(Charge.off.rate.on.single.family.residential.mortgages ~
                       Charge.off.rate.on.credit.card.loans.HL +
                       Charge.off.rate.on.credit.card.loans.H +
                       Charge.off.rate.on.credit.card.loans.L, data=d)
    d$modelSplit <- predict(modelSplit, newdata=d)
    summary(modelSplit)
    plotSplit <- ggplot(d) +
      geom_point(aes(x=modelSplit,
                     y=Charge.off.rate.on.single.family.residential.mortgages)) +
      xlim(-1,3) + ylim(-1,3)
    #ggsave('plotSplit.png',plotSplit)
    cor(d$modelSplit, d$Charge.off.rate.on.single.family.residential.mortgages,
        use='complete.obs')
    # 0.8133998

Notice we now get a better correlation of 0.81, and the graph shows that the model is more accurate in the sense that its predictions are also clustered near zero (without the horizontal stripe that represented mis-predicted variation).
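
To make concrete what the fitter gains from the mask, here is a minimal sketch on made-up data (not the charge-off series; the data frame d2, the noise level and the "true" shape are all assumptions chosen for illustration). It builds HL, H and L as described above and checks that the masked fit is the same as fitting two separate lines, one on each side of the cut.

```r
# Synthetic illustration only: made-up data, not the FRB charge-off series.
set.seed(2012)
n <- 200
x <- runif(n, min = 2, max = 10)   # stand-in for the credit card rate
# Assumed truth: flat near zero up to 5, then a jump and a slope above 5.
y <- ifelse(x > 5, 1 + 0.4 * (x - 5), 0.1) + rnorm(n, sd = 0.1)
d2 <- data.frame(x = x, y = y)

# Mask and masked variables, built the same way as in the article.
d2$HL <- ifelse(d2$x > 5, 1, 0)
d2$H  <- d2$HL * d2$x
d2$L  <- (1 - d2$HL) * d2$x

plain  <- lm(y ~ x, data = d2)
masked <- lm(y ~ HL + H + L, data = d2)

# The masked model is two lines sharing one fit:
#   x <= 5:  y = b0 + bL * x
#   x >  5:  y = (b0 + bHL) + bH * x
low  <- lm(y ~ x, data = d2[d2$x <= 5, ])
high <- lm(y ~ x, data = d2[d2$x >  5, ])
b <- coef(masked)
cmp <- rbind(
  masked   = c(b[["(Intercept)"]], b[["L"]],
               b[["(Intercept)"]] + b[["HL"]], b[["H"]]),
  separate = unname(c(coef(low), coef(high)))
)
colnames(cmp) <- c("low.intercept", "low.slope",
                   "high.intercept", "high.slope")
print(cmp)

# Correlation of fitted values with actuals, as in the article.
cor_plain  <- cor(fitted(plain),  d2$y)
cor_masked <- cor(fitted(masked), d2$y)
print(c(plain = cor_plain, masked = cor_masked))
```

Because x = H + L, the plain one-variable model is a special case of the masked model, so on the data used for fitting the masked model can never fit worse.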

Now we could call this modeling technique a “poor man’s GAM.” What a GAM (generalized additive model) does is try to learn the optimal re-shaping of a variable for a given modeling problem. That is, instead of the analyst picking a cut-point and asking the modeling system to find slopes (which is what we did when we introduced separate masked variables), we ask the modeling system to learn a best re-shaping. The R-code and graph for a GAM fit are given below. Notice the s() wrapper, which tells the GAM to think about reshaping a given variable.

    library(gam)
    modelGAM <- gam(Charge.off.rate.on.single.family.residential.mortgages ~
                      s(Charge.off.rate.on.credit.card.loans), data=d)
    summary(modelGAM)
    d$modelGAM <- predict(modelGAM, newdata=d)
    plotGAM <- ggplot(d) +
      geom_point(aes(x=modelGAM,
                     y=Charge.off.rate.on.single.family.residential.mortgages)) +
      xlim(-1,3) + ylim(-1,3)
    #ggsave('plotGAM.png',plotGAM)
    cor(d$modelGAM, d$Charge.off.rate.on.single.family.residential.mortgages,
        use='complete.obs')
    # 0.8160738

The GAM correlation of 0.82 is slightly better than the 0.81 of our masked model. And we can ask the GAM to show us how it reshaped the input variable. Notice the shape the GAM splines picked is essentially a hockey stick (a continuous piecewise linear curve) with the bend near 5%.

    #png(filename='gamShape.png')
    plot(modelGAM)
    #dev.off()

For completeness we include a neural net fit, but we haven’t tuned its controls or hyper-parameters, so it is not a fully fair comparison. We just want to emphasize that properly using a neural net takes some work (it isn’t completely free). And we feel that if you are going to work on variables you are better off using techniques like variable transforms, treatments or masks.

    library(nnet)
    # untuned; note nnet's output unit is logistic unless linout=TRUE
    modelNN <- nnet(Charge.off.rate.on.single.family.residential.mortgages ~
                      Charge.off.rate.on.credit.card.loans, data=d, size=3)
    d$modelNN <- predict(modelNN, newdata=d)
    plotNN <- ggplot(d) +
      geom_point(aes(x=modelNN,
                     y=Charge.off.rate.on.single.family.residential.mortgages)) +
      xlim(-1,3) + ylim(-1,3)
    #ggsave('plotNN.png',plotNN)
    cor(d$modelNN, d$Charge.off.rate.on.single.family.residential.mortgages,
        use='complete.obs')
    # 0.7961966

The point of the masked variable technique is that it represents a good compromise between using analyst/data-scientist reasoning and sophisticated packages. The masking cuts can be generated once by an analyst and supported by providing the documenting graphs, as we have shown here. Then an already in-place standard fitting system can pick the coefficients for the new synthetic variables (causing the fitter itself to compute the shape of the optimal piecewise curve, saving the analyst this chore). This technique can be used in any data analysis environment that supports graphing, user-defined transformations and regression fitting (linear or otherwise).

The technique doesn’t require the analyst to pick the actual transform or slopes (again, the fitter does this). Also, this methodology is good for supporting audit and maintenance. The construction of synthetic variables can be documented and validated and standard explainable methods can be used for the remainder of the fitting process. We feel the masked variable trick represents a good practical compromise in terms of power, rigor and clarity.
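
One way to make the construction easy to document and validate is to wrap it in a small function. The sketch below is only an illustration: make_masked is a hypothetical helper name (it is not from the original article or any package), and the input values are made up.

```r
# Hypothetical helper (not from the original article): build the mask and the
# two masked variables for a numeric vector and an analyst-chosen cut point.
make_masked <- function(x, cut) {
  stopifnot(is.numeric(x), is.numeric(cut), length(cut) == 1)
  HL <- ifelse(x > cut, 1, 0)   # mask: 1 above the cut, 0 otherwise
  data.frame(
    HL = HL,
    H  = HL * x,                # varies only above the cut
    L  = (1 - HL) * x           # varies only at or below the cut
  )
}

# Small made-up input, including a missing value.
rates <- c(3.2, 4.9, 5.0, 5.1, 8.7, NA)
mv <- make_masked(rates, cut = 5)
print(cbind(rates = rates, mv))

# Checks an auditor could re-run: the two pieces always sum back to the
# original variable, and at most one of H and L is nonzero in each row.
ok <- !is.na(rates)
stopifnot(all(mv$H[ok] + mv$L[ok] == rates[ok]))
stopifnot(all(mv$H[ok] == 0 | mv$L[ok] == 0))
```

With the article's data one could, for example, call make_masked(d$Charge.off.rate.on.credit.card.loans, cut = 5) and bind the result onto d before fitting. Missing inputs simply propagate as NA, which matches the ifelse construction used earlier.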
