Visualizing Bootrapped Stepwise Regression in R using Plotly

[This article was first published on R – Modern Data, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

We all have used stepwise regression at some point. Stepwise regression is known to be sensitive to initial inputs. One way to mitigate this sensitivity is to repeatedly run stepwise regression on bootstrap samples.

R has a nice package called bootStepAIC() which (from its description) “Implements a Bootstrap procedure to investigate the variability of model selection under the stepAIC() stepwise algorithm of package MASS.

It provides a lot of information as an output and sometimes it can get challenging to keep track of all of this information especially if there are a lot of covariates. In this post we’ll try to come up with a simple visualization aimed at summarizing the output from the function boot.stepAIC().

Running boot.stepAIC()

Using the boot.stepAIC() is fairly simple. Just input an already fitted lm/glm model and th associated dataset.

We’ll use the BostonHousing dataset from the mlbench package. More details here

library(bootStepAIC)
library(plotly)
library(mlbench)

# Load Boston housing dataset
data("BostonHousing")

# Fit Linear regression model
fit <- lm(crim ~ ., data = BostonHousing)

# Run bootstrapped stepwise regression
fit.boot <- boot.stepAIC(fit, data = BostonHousing, B = 100) # That's it !

Collecting required information

The output from boot.stepAIC() contains the following. Note that each output is shown as a percentage (based on the total number of bootstrapped samples)

  • No of times a covariate was featured in the final model from stepAIC()
  • No of times a covariate’s coefficient sign was positive / negative
  • No of times a covariate was statistically significant (default at alpha = 5%)

We’ll collect all of this information first and create data frames so as to make charting easier later on.

Note that in this particualr example there is a variable by the name chas which is a factor with levels 0 and 1. R renames the variable as chas1 by default.

# Extract data
nBoot <- summary(fit.boot)[8,1]
origModel <- paste(names(coef(fit.boot$OrigModel)), collapse = " + ")
stepModel <- paste(names(coef(fit.boot$OrigStepAIC)), collapse = " + ")

# Names of covariates
covariates <- rownames(fit.boot$Covariates)
nCovariates <- length(covariates)

# Matrix of number of times each covariate was picked
coef.pick <- fit.boot$Covariates

# Matrix for the consistency of sign on each covariate
coef.sign <- fit.boot$Sign

# Change name for "chas" since it is a factor
rownames(coef.sign)[7] <- "chas"
coef.sign <- coef.sign[match(rownames(coef.pick), rownames(coef.sign)),]

# Matrix for statistical significance
coef.stat <- fit.boot$Significance

# Change name for "chas" since it is a factor
rownames(coef.stat)[11] <- "chas"
coef.stat <- coef.stat[match(rownames(coef.pick), rownames(coef.stat)),]

# Make into long form for charting later
coef.stat.long <- data.frame()

for(i in 1:length(coef.stat)){
  n <- round(coef.stat[i],0)
  vec <- seq(0, n, by = 2)
  mat <- data.frame(rep(names(coef.stat)[i], length(vec)), vec, paste("% Sig", n))
  names(mat) <- c("variable", "sig", "text")
  
  # We'll use mode = "line". NA helps separate line segments
  coef.stat.long <- rbind(coef.stat.long, mat, c(NA, NA))
}

# Convert to dataframes
coef.pick <- as.data.frame(coef.pick)
coef.stat <- as.data.frame(coef.stat)
coef.sign <- as.data.frame(coef.sign)

names(coef.pick) <- "pick"
names(coef.sign) <- c("pos", "neg")
names(coef.stat) <- "stat"

Plot

Now that we have all the information we need, we just need to plot. The plot is arranged as such:

  • One layer for the number of times a variable was picked up by stepAIC() (barplot)
  • One layer for the positive and negative coefficients (scatter plot using triangles)
  • One layer for the number of times a variable was significant (vertical line chart)
  • Annotation for some other information

# Base plot for number of times a variable was picked by stepAIC
plot_ly(coef.pick, x = rownames(coef.pick), y = pick,
        type = "bar", opacity = 0.75, name = "Times picked (%)",
        hoverinfo = "text", text = pick.text,
        marker = list(color = "#00994d", line = list(width = 2))) %>% 
  
  # Layer for number of times a variable was statistically significant at 5%
  add_trace(data = coef.stat.long, x = variable, y = sig, 
            type = "scatter", mode = "markers + line", name = "Stat. Sig (%)",
            line = list(color = "#ffdb4d", width = 15),
            hoverinfo = "text", text = text) %>% 
  
  # Layer for number of times a variable's coefficient was positive
  add_trace(data = coef.sign, x = rownames(coef.pick), y = rep(-5, nCovariates), 
            type = "scatter", mode = "markers", name = "Coef Sign(% pos)",
            marker = list(symbol = "triangle-up", size = pos/scale, color = "#4da6ff",
                          line = list(color = "black", width = 2)),
            hoverinfo = "text", text = sign.text.up) %>% 
  
  # Layer for number of times a variable's coefficient was negative
  add_trace(data = coef.sign, x = rownames(coef.pick), y = rep(-10, nCovariates), 
            type = "scatter", mode = "markers", name = "Coef Sign(% neg)",
            marker = list(symbol = "triangle-down", size = neg/scale, color = "#ff704d",
                          line = list(color = "black", width = 2)),
            hoverinfo = "text", text = sign.text.down) %>% 
  
  # Layout, annotations, axis options etc
  layout(xaxis = list(title = "<b>Covariates</b>"),
         yaxis = list(title = "<b>Percentage(%)</b>",
                      tickmode = "array", 
                      tickvals = round(seq(0, 100, length.out = 10), 0),
                      domain = c(0.2, 1)),
         plot_bgcolor = "#e1efc3",
         paper_bgcolor = "#e1efc3",
         
         annotations = list(
           list(x = 0.1, y = 1, 
                xref = "paper", yref = "paper", 
                xanchor = "left", yanchor = "top",
                ax = 0, ay = 0,
                text = "Visualizing <em>boot.stepAIC()</em>",
                font = list(family = "serif", size = 30)),
           
           list(x = 0.3, y = 0.1, 
                xref = "paper", yref = "paper", 
                xanchor = "left", yanchor = "top",
                ax = 0, ay = 0,
                text = paste("<em>Original Model:</em>", origModel),
                font = list(family = "PT Sans Narrow", size = 15)),
           
           list(x = 0.21, y = 0.05, 
                xref = "paper", yref = "paper", 
                xanchor = "left", yanchor = "top", align = "left",
                ax = 0, ay = 0,
                text = paste("<em>Stepwise Model:</em>", stepModel),
                font = list(family = "PT Sans Narrow", size = 15)),
           
           list(x = 0.8, y = 0.90, 
                xref = "paper", yref = "paper", 
                xanchor = "left", yanchor = "top", align = "left",
                ax = 0, ay = 0,
                text = paste0("<em>No. of Covariates:</em>", nCovariates, "<br>",
                              "<em>No. of bootstrap samples:</em>", nBoot, "<br>"),
                font = list(family = "PT Sans Narrow", size = 15))
         ))

To leave a comment for the author, please follow the link and comment on their blog: R – Modern Data.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)