# An Interesting Aspect of the Omitted Variable Bias

**Economics and R - R posts**, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Econometrics does not cease to surprise me. I just now realized an interesting feature of the omitted variable bias. Consider the following model:

Assume we want to estimate the causal effect `beta`

of `x`

on `y`

. However, we have an unobserved confounder `z`

that affects both `x`

and `y`

. If we don’t add the confounder `z`

as control variable in the regression of `y`

on `x`

, the OLS estimator of `beta`

will be biased. That is the so called omitted variable bias.

Let’s simulate a data set and illustrate the omitted variable bias:

n = 10000 alpha = beta = gamma = 1 z = rnorm(n,0,1) eps.x = rnorm(n,0,1) eps.y = rnorm(n,0,1) x = alpha*z + eps.x y = beta*x + gamma*z + eps.y # Estimate short regression with z omitted coef(lm(y~x))[2] ## x ## 1.486573

While the true causal effect `beta`

is equal to 1, our OLS estimator where we omit `z`

is around `1.5`

. This means it has a positive bias of roughly `0.5`

.

Before we continue, let’s have a quiz (click here if the Google form quiz is not correctly shown.):

Let’s see what happens if we increase the impact of the confounder `z`

on `x`

, say to `alpha=1000`

.

alpha = 1000 x = alpha*z + eps.x y = beta*x + gamma*z + eps.y coef(lm(y~x))[2] ## x ## 1.000983

The bias is almost gone!

This result surprised me at first. I previously had the following intuition: An omitted variable is only a problem if it affects both `y`

and `x`

. Thus the omitted variable bias probably becomes worse if the confounder `z`

affects `y`

or `x`

more strongly. While this intuition is correct for small `alpha`

, it is wrong once `alpha`

is sufficiently large.

For our simulation, we can derive the following analytic formula for the (asymptotic) bias of the OLS estimator $\hat \beta$ in the short regression:

\[asy. \; Bias(\hat \beta) = \gamma\alpha\frac{Var(z)}{\alpha^{2}Var(z)+Var(\varepsilon_x)}\](From now on, I use Mathjax. If you read on a blog aggregator where Mathjax is not well rendered click here.)

Let’s plot the bias for different values of $\alpha$:

Var.z = Var.eps.x = 1 alpha = seq(0,10,by=0.1) asym.bias = gamma*alpha * Var.z / (alpha^2*Var.z+Var.eps.x) plot(alpha,asym.bias)

For small $\alpha$ the bias of $\hat \beta$ first quickly increases in $\alpha$. But it decreases in $\alpha$ once $\alpha$ is larger than 1. Indeed the bias then slowly converges back to 0.

Intuitively, if $\alpha$ is large, the explanatory variable $x$ has a lot of variation and the confounder mainly affects $y$ through $x$. The larger is $\alpha$, the relatively less important is therefore the direct effect of $z$ on $y$. The direct effect from $z$ on $y$ will thus bias the OLS estimator $\hat \beta$ of the short regression less and less.

## Typical presentation of the omitted variable bias formula

Note that the omitted variable bias formula is usually presented as follows:

\[Bias(\hat \beta) = \gamma \hat \delta\]where $\hat \delta$ is the OLS estimate of the linear regression

\[z = const + \delta x + u\](This bias formula is derived under the assumption that $x$ and $z$ are fixed. This allows to compute the bias, not only the asymptotic bias.) If we solve the equation above for $x$, we can write it as

\[x=\tilde{const} + \frac 1 \delta z + \tilde u\]suggesting $\alpha \approx \frac 1 \delta$ and thus an approximate bias of $\frac \gamma \alpha$. (This argumentation is just suggestive but not fully correct. The effects of swapping the `y`

and `x`

in a simple linear regression can be a bit surprising, see my previous post.)

If we look at our previous formula for the asymptotic bias and consider in the limit of no exogenous variation of $x$, i.e. $Var(\varepsilon_x) = 0$, we indeed get

\[\lim_{Var(\varepsilon_x)\rightarrow 0 } asy. \; Bias(\hat \beta) = \frac \gamma\alpha\]However, the presence of exogenous variation in $x$ makes the bias formula more complicated. In particular, it has the effect that as long as $\alpha$ is still small, the bias increases in $\alpha$.

## Appendix: Derivation of the asymptotic bias formula

Here is just a short derivation of the first asymptotic bias formula. We estimate a simple regression (just one explanatory variable):

\[y=const+\beta x+\eta\]For example, the introductionary textbook by Wooldridge shows in the chapter on the OLS asymptotics that under relatively weak assumptions the asymptotic bias of the OLS estimator $\hat{\beta}$ in such a simple regression is given by

\[asy.\; Bias(\hat{\beta})=\frac{Cov(x,\eta)}{Var(x)}\]In our simulation, the error term of the short regression is given by

\[\eta=\gamma z+\varepsilon_{y}\]and $x$ is given by

\[x=\alpha z+\varepsilon_{x}\]where and $\varepsilon_{y}$ and $\varepsilon_{x}$ are iid errors. We thus have

\[Cov(x,\eta)=\alpha\gamma Var(z)\]and

\[Var(x)=\alpha^{2}Var(z)+Var(\varepsilon_{x})\]Hence we get the asymptotic bias formula

\[asy.\; Bias(\hat{\beta})=\alpha\gamma\frac{Var(z)}{\alpha^{2}Var(z)+Var(\varepsilon_{x})}\]**leave a comment**for the author, please follow the link and comment on their blog:

**Economics and R - R posts**.

R-bloggers.com offers

**daily e-mail updates**about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.