# Visualizing Bootstrapped Stepwise Regression in R using Plotly

May 29, 2016

(This article was first published on R – Modern Data, and kindly contributed to R-bloggers)

Most of us have used stepwise regression at some point. It is known to be sensitive to its initial inputs: small changes in the data can lead to a different selected model. One way to mitigate this sensitivity is to run stepwise regression repeatedly on bootstrap samples.

R has a nice package called `bootStepAIC` which (from its description) “Implements a Bootstrap procedure to investigate the variability of model selection under the stepAIC() stepwise algorithm of package MASS.”

It returns a lot of information, which can be hard to keep track of, especially when there are many covariates. In this post we’ll build a simple visualization that summarizes the output of the function `boot.stepAIC()`.
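To make the idea of bootstrapped stepwise selection concrete, here is a minimal sketch in base R. It is not how `boot.stepAIC()` is implemented; it uses `step()` from the `stats` package, the built-in `mtcars` data, and an arbitrary choice of four predictors and 20 resamples, all of which are assumptions made for illustration.

```r
# Sketch: backward stepwise selection repeated on bootstrap resamples (base R only)
set.seed(1)
B <- 20  # number of bootstrap samples, kept small for speed

full <- lm(mpg ~ wt + hp + qsec + drat, data = mtcars)
terms_full <- attr(terms(full), "term.labels")

# One row per resample, one column per candidate covariate
picked <- matrix(FALSE, nrow = B, ncol = length(terms_full),
                 dimnames = list(NULL, terms_full))

for (b in seq_len(B)) {
  idx <- sample(nrow(mtcars), replace = TRUE)   # resample rows with replacement
  boot_data <- mtcars[idx, ]
  boot_fit <- lm(mpg ~ wt + hp + qsec + drat, data = boot_data)
  boot_step <- step(boot_fit, direction = "backward", trace = 0)
  kept <- attr(terms(boot_step), "term.labels")
  picked[b, ] <- terms_full %in% kept
}

# Percentage of resamples in which each covariate survived selection
pick_pct <- 100 * colMeans(picked)
print(round(pick_pct, 1))
```

A covariate that is kept in nearly every resample is a more stable choice than one that appears only occasionally.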

### Running `boot.stepAIC()`

Using `boot.stepAIC()` is fairly simple: pass it an already fitted `lm`/`glm` model and the associated dataset.

We’ll use the `BostonHousing` dataset from the `mlbench` package.

```r
library(bootStepAIC)
library(plotly)
library(mlbench)

data("BostonHousing")

# Fit linear regression model
fit <- lm(crim ~ ., data = BostonHousing)

# Run bootstrapped stepwise regression
fit.boot <- boot.stepAIC(fit, data = BostonHousing, B = 100) # That's it!
```

### Collecting required information

The output from `boot.stepAIC()` contains the following. Note that each quantity is reported as a percentage (based on the total number of bootstrapped samples).

• How often a covariate was featured in the final model from `stepAIC()`
• How often a covariate’s coefficient sign was positive / negative
• How often a covariate was statistically significant (default at alpha = 5%)
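To see how such percentages arise, here is a toy calculation on made-up numbers. The matrices `coefs` and `pvals` are hypothetical stand-ins for per-sample results, and dividing by the total number of samples follows the description above; the package may use a different denominator for some of its tables, so treat this as a sketch only.

```r
# Toy bootstrap results: rows = samples, columns = covariates.
# NA means the covariate was dropped by stepwise selection in that sample.
coefs <- matrix(c(0.5,   NA, -1.2,
                  0.7,  0.1, -0.9,
                  0.4,   NA, -1.1,
                  0.6, -0.2,   NA),
                ncol = 3, byrow = TRUE,
                dimnames = list(NULL, c("x1", "x2", "x3")))
pvals <- matrix(c(0.01,   NA, 0.03,
                  0.02, 0.40, 0.04,
                  0.20,   NA, 0.01,
                  0.03, 0.60,   NA),
                ncol = 3, byrow = TRUE, dimnames = dimnames(coefs))

B <- nrow(coefs)  # total number of bootstrap samples

pick_pct <- 100 * colSums(!is.na(coefs)) / B               # x1 = 100, x2 = 50, x3 = 75
pos_pct  <- 100 * colSums(coefs > 0, na.rm = TRUE) / B     # x1 = 100, x2 = 25, x3 = 0
neg_pct  <- 100 * colSums(coefs < 0, na.rm = TRUE) / B     # x1 = 0,   x2 = 25, x3 = 75
sig_pct  <- 100 * colSums(pvals < 0.05, na.rm = TRUE) / B  # x1 = 75,  x2 = 0,  x3 = 75
```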

We’ll collect all of this information first and create data frames so as to make charting easier later on.

Note that in this particular example there is a variable named `chas`, which is a factor with levels 0 and 1. R names its coefficient `chas1` by default.
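The renaming can be reproduced on a small example. The data frame `toy` below is made up for illustration; it only shows that a model's term labels keep the factor's name while its coefficient names have the factor level appended, and that the two can be reconciled by name.

```r
# A factor term appears as "chas" among the term labels
# but as "chas1" among the coefficient names.
toy <- data.frame(
  y = c(1.2, 3.1, 2.0, 5.3, 4.1, 6.2),
  x = c(1, 2, 3, 4, 5, 6),
  chas = factor(c(0, 1, 0, 1, 0, 1))
)
toy_fit <- lm(y ~ x + chas, data = toy)

term_names <- attr(terms(toy_fit), "term.labels")  # "x" "chas"
coef_names <- names(coef(toy_fit))                 # "(Intercept)" "x" "chas1"

# Reconcile the two by name rather than by position
coef_names[coef_names == "chas1"] <- "chas"
match(term_names, coef_names)                      # 2 3
```

Matching by name avoids hard-coding a row position, which can change when the set or order of covariates changes.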

```r
# Extract data
# Number of bootstrap samples: length of the 8th component of the fitted object
nBoot <- summary(fit.boot)[8, 1]
origModel <- paste(names(coef(fit.boot$OrigModel)), collapse = " + ")
stepModel <- paste(names(coef(fit.boot$OrigStepAIC)), collapse = " + ")

# Names of covariates
covariates <- rownames(fit.boot$Covariates)
nCovariates <- length(covariates)

# Matrix of number of times each covariate was picked
coef.pick <- fit.boot$Covariates

# Matrix for the consistency of sign on each covariate
coef.sign <- fit.boot$Sign

# Change name for "chas" since it is a factor (matched by name, not by position)
rownames(coef.sign)[rownames(coef.sign) == "chas1"] <- "chas"
coef.sign <- coef.sign[match(rownames(coef.pick), rownames(coef.sign)), ]

# Matrix for statistical significance
coef.stat <- fit.boot$Significance

# Change name for "chas" since it is a factor (matched by name, not by position)
rownames(coef.stat)[rownames(coef.stat) == "chas1"] <- "chas"
coef.stat <- coef.stat[match(rownames(coef.pick), rownames(coef.stat)), ]

# Make into long form for charting later
coef.stat.long <- data.frame()

for (i in seq_along(coef.stat)) {
  n <- round(coef.stat[i], 0)
  vec <- seq(0, n, by = 2)
  mat <- data.frame(rep(names(coef.stat)[i], length(vec)), vec, paste("% Sig", n))
  names(mat) <- c("variable", "sig", "text")

  # We'll draw these as lines. A row of NAs separates the line segments
  coef.stat.long <- rbind(coef.stat.long, mat, c(NA, NA, NA))
}

# Convert to data frames
coef.pick <- as.data.frame(coef.pick)
coef.stat <- as.data.frame(coef.stat)
coef.sign <- as.data.frame(coef.sign)

names(coef.pick) <- "pick"
names(coef.sign) <- c("pos", "neg")
names(coef.stat) <- "stat"
```
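The long-format step can also be written without growing a data frame inside a loop. The sketch below is one possible alternative, shown on made-up percentages; the helper `make_segment()` and the vector `sig` are hypothetical names introduced here, not part of the package or of the code above.

```r
# Toy significance percentages for three covariates
sig <- c(zn = 10, indus = 5, rad = 8)

# Build one vertical segment (0, 2, 4, ... up to the percentage)
# followed by a row of NAs that breaks the line before the next segment
make_segment <- function(name, pct) {
  n <- round(pct)
  heights <- seq(0, n, by = 2)
  rbind(
    data.frame(variable = name, sig = heights, text = paste("% Sig", n),
               stringsAsFactors = FALSE),
    data.frame(variable = NA_character_, sig = NA_real_, text = NA_character_,
               stringsAsFactors = FALSE)
  )
}

segments <- Map(make_segment, names(sig), sig)
sig_long <- do.call(rbind, unname(segments))
rownames(sig_long) <- NULL
```

Each covariate contributes one run of points sharing the same x value, so a line trace drawn through them renders as a vertical bar whose height is the percentage, and the NA rows keep neighbouring bars from being joined.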

### Plot

Now that we have all the information we need, we just need to plot. The plot is arranged as follows:

• One layer for the number of times a variable was picked up by `stepAIC()` (barplot)
• One layer for the positive and negative coefficients (scatter plot using triangles)
• One layer for the number of times a variable was significant (vertical line chart)
• Annotation for some other information

```r
# Hover text and marker scale used by the traces below. These definitions are
# assumptions added so that the listing runs on its own.
pick.text <- paste0(rownames(coef.pick), "<br>Picked: ", round(coef.pick$pick, 1), "%")
sign.text.up <- paste0(rownames(coef.pick), "<br>Positive: ", round(coef.sign$pos, 1), "%")
sign.text.down <- paste0(rownames(coef.pick), "<br>Negative: ", round(coef.sign$neg, 1), "%")
scale <- 4  # divisor turning percentages into marker sizes

# Base plot for number of times a variable was picked by stepAIC
# (formula notation such as ~pick refers to columns of `data`; needs plotly >= 4.0)
plot_ly(coef.pick, x = rownames(coef.pick), y = ~pick,
        type = "bar", opacity = 0.75, name = "Times picked (%)",
        hoverinfo = "text", text = pick.text,
        marker = list(color = "#00994d", line = list(width = 2))) %>%

  # Layer for number of times a variable was statistically significant at 5%
  add_trace(data = coef.stat.long, x = ~variable, y = ~sig,
            type = "scatter", mode = "lines", name = "Stat. Sig (%)",
            line = list(color = "#ffdb4d", width = 15),
            hoverinfo = "text", text = ~text, inherit = FALSE) %>%

  # Layer for number of times a variable's coefficient was positive
  add_trace(data = coef.sign, x = rownames(coef.pick), y = rep(-5, nCovariates),
            type = "scatter", mode = "markers", name = "Coef Sign(% pos)",
            marker = list(symbol = "triangle-up", size = ~pos / scale, color = "#4da6ff",
                          line = list(color = "black", width = 2)),
            hoverinfo = "text", text = sign.text.up, inherit = FALSE) %>%

  # Layer for number of times a variable's coefficient was negative
  add_trace(data = coef.sign, x = rownames(coef.pick), y = rep(-10, nCovariates),
            type = "scatter", mode = "markers", name = "Coef Sign(% neg)",
            marker = list(symbol = "triangle-down", size = ~neg / scale, color = "#ff704d",
                          line = list(color = "black", width = 2)),
            hoverinfo = "text", text = sign.text.down, inherit = FALSE) %>%

  # Layout, annotations, axis options etc
  layout(xaxis = list(title = "Covariates"),
         yaxis = list(title = "Percentage(%)",
                      tickmode = "array",
                      tickvals = round(seq(0, 100, length.out = 10), 0),
                      domain = c(0.2, 1)),
         plot_bgcolor = "#e1efc3",
         paper_bgcolor = "#e1efc3",

         annotations = list(
           list(x = 0.1, y = 1,
                xref = "paper", yref = "paper",
                xanchor = "left", yanchor = "top",
                ax = 0, ay = 0,
                text = "Visualizing boot.stepAIC()",
                font = list(family = "serif", size = 30)),

           list(x = 0.3, y = 0.1,
                xref = "paper", yref = "paper",
                xanchor = "left", yanchor = "top",
                ax = 0, ay = 0,
                text = paste("Original Model:", origModel),
                font = list(family = "PT Sans Narrow", size = 15)),

           list(x = 0.21, y = 0.05,
                xref = "paper", yref = "paper",
                xanchor = "left", yanchor = "top", align = "left",
                ax = 0, ay = 0,
                text = paste("Stepwise Model:", stepModel),
                font = list(family = "PT Sans Narrow", size = 15)),

           # "<br>" starts a new line inside the annotation text
           list(x = 0.8, y = 0.90,
                xref = "paper", yref = "paper",
                xanchor = "left", yanchor = "top", align = "left",
                ax = 0, ay = 0,
                text = paste0("No. of Covariates: ", nCovariates, "<br>",
                              "No. of bootstrap samples: ", nBoot),
                font = list(family = "PT Sans Narrow", size = 15))
         ))
```
