[This article was first published on Economics and R - R posts, and kindly contributed to R-bloggers.]

Econometrics does not cease to surprise me. I just now realized an interesting feature of the omitted variable bias. Consider the following model:

$x = \alpha z + \varepsilon_x$

$y = \beta x + \gamma z + \varepsilon_y$

Assume we want to estimate the causal effect beta of x on y. However, we have an unobserved confounder z that affects both x and y. If we don’t add the confounder z as a control variable in the regression of y on x, the OLS estimator of beta will be biased. That is the so-called omitted variable bias.

Let’s simulate a data set and illustrate the omitted variable bias:

n = 10000
alpha = beta = gamma = 1

z = rnorm(n,0,1)
eps.x = rnorm(n,0,1)
eps.y = rnorm(n,0,1)

x = alpha*z + eps.x
y = beta*x + gamma*z + eps.y

# Estimate short regression with z omitted
coef(lm(y~x))[2]

##        x
## 1.486573


While the true causal effect beta is equal to 1, our OLS estimate from the regression that omits z is around 1.5. This means it has a positive bias of roughly 0.5.


Let’s see what happens if we increase the impact of the confounder z on x, say to alpha=1000.

alpha = 1000
x = alpha*z + eps.x
y = beta*x + gamma*z + eps.y
coef(lm(y~x))[2]

##        x
## 1.000983


The bias is almost gone!

This result surprised me at first. I previously had the following intuition: An omitted variable is only a problem if it affects both y and x. Thus the omitted variable bias probably becomes worse if the confounder z affects y or x more strongly. While this intuition is correct for small alpha, it is wrong once alpha is sufficiently large.

For our simulation, we can derive the following analytic formula for the (asymptotic) bias of the OLS estimator $\hat \beta$ in the short regression:

$asy. \; Bias(\hat \beta) = \gamma\alpha\frac{Var(z)}{\alpha^{2}Var(z)+Var(\varepsilon_x)}$
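As a quick plausibility check, the following sketch compares the simulated bias of the short regression with this formula for a few values of alpha. The helper names `asy.bias` and `sim.bias`, the seed, and the larger sample size are assumptions made for this illustration; they are not part of the original simulation.

```r
# Sketch: compare the simulated bias of the short regression
# with the analytic asymptotic bias formula.
set.seed(123)  # arbitrary seed, only for reproducibility

# Analytic asymptotic bias (hypothetical helper name)
asy.bias = function(alpha, gamma = 1, Var.z = 1, Var.eps.x = 1) {
  gamma * alpha * Var.z / (alpha^2 * Var.z + Var.eps.x)
}

# Simulated bias of the short regression y ~ x (hypothetical helper name)
sim.bias = function(alpha, beta = 1, gamma = 1, n = 100000) {
  z = rnorm(n, 0, 1)
  x = alpha * z + rnorm(n, 0, 1)
  y = beta * x + gamma * z + rnorm(n, 0, 1)
  unname(coef(lm(y ~ x))[2]) - beta
}

alphas = c(0.5, 1, 2, 10, 1000)
res = data.frame(
  alpha     = alphas,
  simulated = sapply(alphas, sim.bias),
  analytic  = asy.bias(alphas)
)
print(res)
```

Up to sampling noise, the two columns should agree for every alpha.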


Let’s plot the bias for different values of $\alpha$:

Var.z = Var.eps.x = 1
alpha = seq(0,10,by=0.1)
asym.bias = gamma*alpha * Var.z /
(alpha^2*Var.z+Var.eps.x)
plot(alpha,asym.bias)


For small $\alpha$, the bias of $\hat \beta$ first increases quickly in $\alpha$. But once $\alpha$ is larger than 1, the bias decreases in $\alpha$ and slowly converges back to 0.
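The turning point can be located directly. Setting the derivative of the bias formula with respect to $\alpha$ to zero gives $\alpha^{*} = \sqrt{Var(\varepsilon_x)/Var(z)}$ with a maximal bias of $\frac{\gamma}{2}\sqrt{Var(z)/Var(\varepsilon_x)}$. The sketch below checks this numerically for the values used in the plot; the helper name `asy.bias` is an assumption made for this illustration.

```r
# Sketch: locate the alpha that maximizes the asymptotic bias.
asy.bias = function(alpha, gamma = 1, Var.z = 1, Var.eps.x = 1) {
  gamma * alpha * Var.z / (alpha^2 * Var.z + Var.eps.x)
}

# Numerical maximizer on the plotted range [0, 10]
opt = optimize(asy.bias, interval = c(0, 10), maximum = TRUE)

# Closed-form candidate from setting the derivative to zero
Var.z = 1
Var.eps.x = 1
alpha.star = sqrt(Var.eps.x / Var.z)
bias.star  = asy.bias(alpha.star)

c(numeric.argmax = opt$maximum, closed.form.argmax = alpha.star,
  numeric.max = opt$objective, closed.form.max = bias.star)
```

With both variances equal to 1, the bias peaks at $\alpha = 1$, which matches the plot.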

Intuitively, if $\alpha$ is large, the explanatory variable $x$ has a lot of variation and the confounder mainly affects $y$ through $x$. The larger $\alpha$ is, the less important the direct effect of $z$ on $y$ becomes in relative terms. The direct effect of $z$ on $y$ will thus bias the OLS estimator $\hat \beta$ of the short regression less and less.

## Typical presentation of the omitted variable bias formula

Note that the omitted variable bias formula is usually presented as follows:

$Bias(\hat \beta) = \gamma \hat \delta$

where $\hat \delta$ is the OLS estimate of the linear regression

$z = const + \delta x + u$

(This bias formula is derived under the assumption that $x$ and $z$ are fixed. This allows us to compute the exact bias, not only the asymptotic bias.) If we solve the equation above for $x$, we can write it as

$x=\tilde{const} + \frac 1 \delta z + \tilde u$

suggesting $\alpha \approx \frac 1 \delta$ and thus an approximate bias of $\frac \gamma \alpha$. (This argument is only suggestive and not fully correct. The effects of swapping y and x in a simple linear regression can be a bit surprising, see my previous post.)
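The textbook formula has an exact in-sample counterpart: the short-regression slope equals the long-regression slope plus the long-regression coefficient on $z$ times $\hat \delta$. The following sketch verifies this identity and shows that $\hat \delta$ need not be close to $\frac 1 \alpha$. The seed and variable names are assumptions made for this illustration, and the sample identity uses the estimated coefficient on $z$ rather than the true $\gamma$.

```r
# Sketch: in-sample omitted variable bias decomposition.
set.seed(123)  # arbitrary seed, only for reproducibility
n = 10000
alpha = beta = gamma = 1

z = rnorm(n, 0, 1)
x = alpha*z + rnorm(n, 0, 1)
y = beta*x + gamma*z + rnorm(n, 0, 1)

short = coef(lm(y ~ x))          # z omitted
long  = coef(lm(y ~ x + z))      # z included
delta.hat = unname(coef(lm(z ~ x))[2])  # auxiliary regression of z on x

lhs = unname(short["x"])
rhs = unname(long["x"] + long["z"] * delta.hat)

c(short = lhs, long.plus.gamma.delta = rhs,
  delta.hat = delta.hat, one.over.alpha = 1/alpha)
```

The first two entries coincide up to floating point error, while `delta.hat` is clearly different from `1/alpha` in this setting with exogenous variation in x.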

If we look at our previous formula for the asymptotic bias and consider the limit of no exogenous variation in $x$, i.e. $Var(\varepsilon_x) \rightarrow 0$, we indeed get

$\lim_{Var(\varepsilon_x)\rightarrow 0 } asy. \; Bias(\hat \beta) = \frac \gamma\alpha$

However, the presence of exogenous variation in $x$ makes the bias formula more complicated. In particular, it has the effect that as long as $\alpha$ is still small, the bias increases in $\alpha$.
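The two presentations can be connected explicitly. As a sketch, assuming $z$ and $\varepsilon_x$ are uncorrelated as in the simulation, the OLS slope of the regression of $z$ on $x$ converges to

$plim \; \hat \delta = \frac{Cov(x,z)}{Var(x)} = \frac{\alpha Var(z)}{\alpha^{2}Var(z)+Var(\varepsilon_x)}$

so that $\gamma$ times this limit is exactly the asymptotic bias formula from above. For $\alpha \neq 0$, the limit equals $\frac 1 \alpha$ only if $Var(\varepsilon_x) = 0$.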

## Appendix: Derivation of the asymptotic bias formula

Here is just a short derivation of the first asymptotic bias formula. We estimate a simple regression (just one explanatory variable):

$y=const+\beta x+\eta$

For example, the introductory textbook by Wooldridge shows in the chapter on OLS asymptotics that, under relatively weak assumptions, the asymptotic bias of the OLS estimator $\hat{\beta}$ in such a simple regression is given by

$asy.\; Bias(\hat{\beta})=\frac{Cov(x,\eta)}{Var(x)}$

In our simulation, the error term of the short regression is given by

$\eta=\gamma z+\varepsilon_{y}$

and $x$ is given by

$x=\alpha z+\varepsilon_{x}$

where $\varepsilon_{y}$ and $\varepsilon_{x}$ are iid errors that are independent of $z$ and of each other. We thus have

$Cov(x,\eta)=\alpha\gamma Var(z)$

and

$Var(x)=\alpha^{2}Var(z)+Var(\varepsilon_{x})$
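Both expressions follow from expanding the moments term by term, assuming that $z$, $\varepsilon_{x}$ and $\varepsilon_{y}$ are mutually uncorrelated as in the simulation, so that all cross terms vanish:

$Cov(x,\eta)=Cov(\alpha z+\varepsilon_{x},\gamma z+\varepsilon_{y})=\alpha\gamma Var(z)+\alpha Cov(z,\varepsilon_{y})+\gamma Cov(\varepsilon_{x},z)+Cov(\varepsilon_{x},\varepsilon_{y})=\alpha\gamma Var(z)$

$Var(x)=Var(\alpha z+\varepsilon_{x})=\alpha^{2}Var(z)+2\alpha Cov(z,\varepsilon_{x})+Var(\varepsilon_{x})=\alpha^{2}Var(z)+Var(\varepsilon_{x})$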

Hence we get the asymptotic bias formula

$asy.\; Bias(\hat{\beta})=\alpha\gamma\frac{Var(z)}{\alpha^{2}Var(z)+Var(\varepsilon_{x})}$
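The derivation can also be checked on simulated data, because the sample analogue holds exactly: the OLS slope minus beta equals the sample covariance of x and the short-regression error divided by the sample variance of x. The sketch below uses alpha = 2 and an arbitrary seed, both assumptions made for this illustration.

```r
# Sketch: sample analogue of asy. Bias = Cov(x, eta) / Var(x).
set.seed(42)  # arbitrary seed, only for reproducibility
n = 10000
alpha = 2
beta = gamma = 1

z = rnorm(n, 0, 1)
eps.x = rnorm(n, 0, 1)
eps.y = rnorm(n, 0, 1)

x = alpha*z + eps.x
eta = gamma*z + eps.y   # error term of the short regression
y = beta*x + eta

slope = unname(coef(lm(y ~ x))[2])
sample.bias = cov(x, eta) / var(x)

# Analytic value with Var(z) = Var(eps.x) = 1
analytic = alpha*gamma*1 / (alpha^2*1 + 1)

c(slope.minus.beta = slope - beta, sample.bias = sample.bias,
  analytic = analytic)
```

The first two entries agree up to floating point error, and both are close to the analytic value up to sampling noise.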