By Ben Ogorek

# Introduction

Regression is a tool that can be used to address causal questions in an observational study, though no one said it would be easy. While this article won’t close the vexing gap between correlation and causation, it will offer specific advice when you’re after a causal truth – keep an eye out for variables called “colliders,” and keep them out of your regression!

By the end of this article, we will have explored a situation where adding a variable to a regression will simultaneously:

1. improve the predictive power,
2. ruin the coefficient estimates.

Thus, the mistake is a tempting one to make. In the sections below, we’ll first review how adding additional variables to a regression can defeat confounding and lead us closer to a causal truth. Then we’ll see that truth evaporate when a variable we thought was a confounder was actually something called a “collider.” As is typical with Anything but R-bitrary articles, there will be lightweight simulations in R to drive the point home.

# Defeating confounders – a causal power of regression

Sentences that begin with “Controlling for [factors X, Y, and Z], …” are reassuring amidst controversial subject matter. But less reassuring is the implicit assumption that X, Y, and Z are indeed the things we call “confounders.” We review the definition of a confounder via the following causal graph:

In the above diagram, where arrows run from w to both x and y and from x to y, w is a confounder and will distort the perceived causal relationship between x and y if unaccounted for. An example from Counterfactuals and Causal Inference: Methods and Principles for Social Research is the effect of educational attainment (x) on earnings (y) where mental ability (w) is a confounder. The authors remark that the amount of “ability bias” in estimates of educational impact “has remained one of the largest causal controversies in the social sciences since the 1970s.”

We now review adjustment for confounding via linear models. Open R. Define a sample size that affords us the luxury of ignoring standard errors (without guilt!):
N <- 100000

Now we generate data consistent with the above diagram:
w <- rnorm(N)
x <- .5 * w + rnorm(N)
y <- .3 * w + .4 * x + rnorm(N)

Note that our confounder w is the only variable that is determined by factors outside the system. It is "exogenous," and the only variable that we can set in any way we want. Both x and y depend on w, so they are "endogenous" to the system. This is very different from an experimental design, where an artificial grid is created for x and those levels are prescribed. In that case, x would be exogenous too.
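To make the contrast with an experiment concrete, here is a minimal sketch in which x is assigned at random instead of depending on w. Random assignment stands in for the "artificial grid" mentioned above and is an assumption of this sketch, as are the seed and the `_exp` variable names (chosen so that nothing defined earlier is overwritten).

```r
set.seed(2016)                       # arbitrary seed, only for reproducibility
n_exp <- 100000                      # hypothetical name; same size as N above

w_exp <- rnorm(n_exp)                # w still varies across units
x_exp <- rnorm(n_exp)                # x is assigned at random, so it ignores w
y_exp <- .3 * w_exp + .4 * x_exp + rnorm(n_exp)

# w is left out of the regression on purpose
coef_exp <- coef(lm(y_exp ~ x_exp))["x_exp"]
print(round(coef_exp, 3))            # should land close to the true 0.4
```

Because x no longer shares a cause with y, leaving w out costs some precision but should not bias x's coefficient. The observational data generated earlier offer no such guarantee.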

Now run the following two regressions in R, noting the very different coefficient estimates for x:
summary(lm(y ~ x))
summary(lm(y ~ x + w))

The first regression (that ignores w) incurs upward bias in x's coefficient due to the confounder's positive effects on both x and y. The second regression (with w included) recovers x's true coefficient while increasing the R-squared by a few percentage points. That's a win-win.
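The size of the bias can be checked against a back-of-the-envelope calculation. The sketch below regenerates the same data with an arbitrary seed (an addition for reproducibility) and puts the two fits side by side; the names `fit_short`, `fit_long`, `slope_theory`, and `results` are illustrative only.

```r
set.seed(2016)                       # arbitrary seed, only for reproducibility
N <- 100000
w <- rnorm(N)
x <- .5 * w + rnorm(N)
y <- .3 * w + .4 * x + rnorm(N)

fit_short <- lm(y ~ x)               # ignores the confounder
fit_long  <- lm(y ~ x + w)           # adjusts for the confounder

# Large-sample slope of y ~ x is cov(x, y) / var(x):
#   var(x)    = .5^2 + 1          = 1.25
#   cov(x, y) = .3 * .5 + .4 * 1.25 = .65
slope_theory <- .65 / 1.25           # 0.52, versus the true 0.4

results <- data.frame(
  model     = c("y ~ x", "y ~ x + w"),
  x_coef    = unname(c(coef(fit_short)["x"], coef(fit_long)["x"])),
  r_squared = c(summary(fit_short)$r.squared, summary(fit_long)$r.squared)
)
print(results)
print(slope_theory)
```

With this sample size, the first row's coefficient should sit near 0.52 and the second near 0.4, with the larger R-squared in the second row.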

# Falling for colliders – a causal trap for regression

Suppose that x is still the amount of education and y is still earned income, but instead of being mental ability, w is now annual dollars spent on decorative artwork. In this scenario, education and income probably cause art purchases rather than vice versa. Supposing this is truth, below is what the new causal graph looks like.

Unlike in the previous section where the causal arrows emanate from w, they now point towards w. If you sent marbles moving in the direction of the arrows above, two of them might collide at w, earning it the label "collider." Causal diagram theory says that if you condition on a collider, you create an artificial situation that appears as if the directions of the arrows pointing toward the collider have flipped. Take a moment to think about that. Do you see the problem? If we condition on w, we will have created a confounder. Let's see it happen in R.
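As a warm-up that is not part of the simulation in this article, the sketch below strips the idea to its core: two variables with no relationship at all, plus a collider they both cause. The coefficients, the seed, the band width of 0.25, and the `0`-suffixed names are all assumptions made for illustration.

```r
set.seed(2016)                       # arbitrary seed, only for reproducibility
n0 <- 100000
x0 <- rnorm(n0)
y0 <- rnorm(n0)                      # y0 does not depend on x0 at all
w0 <- x0 + y0 + rnorm(n0)            # collider: caused by both x0 and y0

# Unconditionally, x0 and y0 are uncorrelated
cor_all <- cor(x0, y0)

# Crude conditioning: keep only observations where w0 is close to 0
in_band  <- abs(w0) < 0.25
cor_band <- cor(x0[in_band], y0[in_band])

# Conditioning by regression adjustment
coef_adj <- coef(lm(y0 ~ x0 + w0))["x0"]

print(round(c(cor_all = cor_all, cor_band = cor_band), 3))
print(round(coef_adj, 3))
```

Holding w0 nearly fixed means a high x0 must be offset by a low y0, so a negative association appears where none exists. Under these particular coefficients, the adjusted coefficient on x0 should come out near -0.5 even though the true effect is zero. The education and income example below adds a real effect of x on y to the same mechanism.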

First, we generate the data, again using a simple linear data generating mechanism. Notice that x is now exogenous while w is endogenous.
x <- rnorm(N)
y <- .7 * x + rnorm(N)
w <- 1.2 * x + .6 * y + rnorm(N)

Now run the same two regressions as before, examining the coefficient estimates for x:
summary(lm(y ~ x))
summary(lm(y ~ x + w))

Unlike in the previous section, the simpler regression without w recovers the true coefficient of x, while the regression with w has a horribly biased estimate. But the second model is not unequivocally inferior; it has an R-squared that's roughly 20 percentage points higher than the first! The collider w might have ruined our regression coefficients, but it still helps us predict and is an "important" part of the conditional expectation function E(y|x, w). Unfortunately, you can't in general rely on that function for understanding how the world works.
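How bad is "horribly biased"? A short calculation, sketched in the comments below, suggests the adjusted coefficient on x should be close to zero rather than 0.7. The seed and the names `fit_short`, `fit_long`, `b_w_theory`, `b_x_theory`, and `results` are additions for illustration.

```r
set.seed(2016)                       # arbitrary seed, only for reproducibility
N <- 100000
x <- rnorm(N)
y <- .7 * x + rnorm(N)
w <- 1.2 * x + .6 * y + rnorm(N)

fit_short <- lm(y ~ x)               # leaves the collider out
fit_long  <- lm(y ~ x + w)           # conditions on the collider

# Substituting y into w gives w = 1.62 * x + .6 * e + u, where e and u are
# the noise terms of y and w. After partialling out x:
#   coefficient on w = cov(e, .6e + u) / var(.6e + u) = .6 / 1.36
#   coefficient on x = .7 - (.6 / 1.36) * 1.62
b_w_theory <- .6 / 1.36
b_x_theory <- .7 - b_w_theory * 1.62 # about -0.015

results <- data.frame(
  model     = c("y ~ x", "y ~ x + w"),
  x_coef    = unname(c(coef(fit_short)["x"], coef(fit_long)["x"])),
  r_squared = c(summary(fit_short)$r.squared, summary(fit_long)$r.squared)
)
print(results)
print(round(c(b_w_theory = b_w_theory, b_x_theory = b_x_theory), 4))
```

In other words, under this data generating mechanism the model with the better fit would suggest that education has essentially no effect on income.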

# Summary

We investigated a situation where adding a certain type of variable to a regression, called a "collider," will bias coefficients while still increasing predictive power. Whether this is good or bad depends on the research objective. If the goal is to obtain a predictive model that makes accurate predictions of the response, it's good. If the goal is to create a model of reality that is useful in making decisions, then the collider bias is almost certainly a bad thing.

Of course, discarding a variable that adds to the predictive power of a model is easier said than done. Models in any organization are evaluated by some metric, typically a predictive one, and trying to convince your peers (and boss) that your model is better because it "doesn't have colliders" may be a tough sell.

Determining if a variable is a collider involves thinking critically about the way the world works. If an "explanatory variable" could actually be caused by the "response" as well as another predictor, then you have a candidate that is perhaps better left out of the regression. It's a complex world out there with many difficult decisions, not all of which can be based on data. This author wishes you a 2016 with more difficult decisions than in 2015, in regression analysis at least!