# An Interesting Aspect of the Omitted Variable Bias

*Originally published on **Economics and R - R posts** and kindly contributed to R-bloggers.*

Econometrics never ceases to surprise me. I just realized an interesting feature of the omitted variable bias. Consider the following model: `x = alpha*z + eps.x` and `y = beta*x + gamma*z + eps.y`.

Assume we want to estimate the causal effect `beta` of `x` on `y`. However, there is an unobserved confounder `z` that affects both `x` and `y`. If we don't add the confounder `z` as a control variable in the regression of `y` on `x`, the OLS estimator of `beta` will be biased. That is the so-called omitted variable bias.

Let’s simulate a data set and illustrate the omitted variable bias:

```r
n = 10000
alpha = beta = gamma = 1
z = rnorm(n,0,1)
eps.x = rnorm(n,0,1)
eps.y = rnorm(n,0,1)
x = alpha*z + eps.x
y = beta*x + gamma*z + eps.y
# Estimate short regression with z omitted
coef(lm(y~x))[2]
```

```
## x
## 1.486573
```

While the true causal effect `beta` is equal to 1, our OLS estimator that omits `z` is around `1.5`. This means it has a positive bias of roughly `0.5`.


Let's see what happens if we increase the impact of the confounder `z` on `x`, say to `alpha = 1000`.

```r
alpha = 1000
x = alpha*z + eps.x
y = beta*x + gamma*z + eps.y
coef(lm(y~x))[2]
```

```
## x
## 1.000983
```

The bias is almost gone!

This result surprised me at first. I previously had the following intuition: an omitted variable is only a problem if it affects both `y` and `x`. Thus the omitted variable bias probably becomes worse if the confounder `z` affects `y` or `x` more strongly. While this intuition is correct for small `alpha`, it is wrong once `alpha` is sufficiently large.

For our simulation, we can derive the following analytic formula for the (asymptotic) bias of the OLS estimator $\hat \beta$ in the short regression:

$$asy.\; Bias(\hat \beta) = \frac{\gamma \alpha \, Var(z)}{\alpha^{2} Var(z) + Var(\varepsilon_x)}$$
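As a quick sanity check (my addition, not part of the original post), we can compare the analytic formula with a fresh simulation. With $\gamma = \alpha = Var(z) = Var(\varepsilon_x) = 1$ the formula yields exactly $0.5$, and the realized bias of the short regression should land close to it:

```r
set.seed(1)  # for reproducibility
n = 10000
alpha = beta = gamma = 1
z = rnorm(n); eps.x = rnorm(n); eps.y = rnorm(n)
x = alpha*z + eps.x
y = beta*x + gamma*z + eps.y

# Analytic asymptotic bias: gamma*alpha*Var(z) / (alpha^2*Var(z) + Var(eps.x))
analytic.bias = gamma*alpha*1 / (alpha^2*1 + 1)

# Realized bias of the short regression in this simulated sample
sim.bias = unname(coef(lm(y ~ x))[2] - beta)

analytic.bias  # 0.5
sim.bias       # close to 0.5
```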


Let’s plot the bias for different values of $\alpha$:

```r
Var.z = Var.eps.x = 1
alpha = seq(0,10,by=0.1)
asym.bias = gamma*alpha * Var.z /
  (alpha^2*Var.z + Var.eps.x)
plot(alpha, asym.bias)
```

For small $\alpha$, the bias of $\hat \beta$ first increases quickly in $\alpha$. But once $\alpha$ is larger than 1, it decreases in $\alpha$ and slowly converges back to 0.
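A small numeric check (my addition, not from the original post): setting the derivative of the asymptotic bias formula in $\alpha$ to zero suggests the bias peaks at $\alpha = \sqrt{Var(\varepsilon_x)/Var(z)}$, which is 1 in our parameterization. R's `optimize()` confirms this:

```r
# Numerically locate the alpha that maximizes the asymptotic bias
gamma = Var.z = Var.eps.x = 1
bias.fun = function(alpha) gamma*alpha*Var.z / (alpha^2*Var.z + Var.eps.x)
opt = optimize(bias.fun, interval = c(0, 10), maximum = TRUE)
opt$maximum    # close to 1 = sqrt(Var.eps.x / Var.z)
opt$objective  # close to 0.5, the maximal bias here
```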

Intuitively, if $\alpha$ is large, the explanatory variable $x$ has a lot of variation and the confounder affects $y$ mainly through $x$. The larger $\alpha$ is, the less important the direct effect of $z$ on $y$ becomes relative to this channel. The direct effect of $z$ on $y$ thus biases the OLS estimator $\hat \beta$ of the short regression less and less.

## Typical presentation of the omitted variable bias formula

Note that the omitted variable bias formula is usually presented as follows:

$$Bias(\hat \beta) = \gamma \hat \delta$$

where $\hat \delta$ is the OLS estimate of $\delta$ in the linear regression

$$z = const + \delta x + u$$

(This bias formula is derived under the assumption that $x$ and $z$ are fixed. This allows one to compute the actual bias, not only the asymptotic bias.) If we solve the equation above for $x$, we can write it as

$$x = \tilde{const} + \frac{1}{\delta} z + \tilde u$$

suggesting $\alpha \approx \frac 1 \delta$ and thus an approximate bias of $\frac \gamma \alpha$. (This argument is only suggestive, not fully correct. The effects of swapping the `y` and `x` in a simple linear regression can be a bit surprising; see my previous post.)

If we look at our previous formula for the asymptotic bias and consider the limit of no exogenous variation in $x$, i.e. $Var(\varepsilon_x) = 0$, we indeed get

$$\lim_{Var(\varepsilon_x)\rightarrow 0} asy.\; Bias(\hat \beta) = \frac \gamma \alpha$$

However, the presence of exogenous variation in $x$ makes the bias formula more complicated. In particular, it has the effect that as long as $\alpha$ is still small, the bias increases in $\alpha$.
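To illustrate with numbers (my own check, not from the original post): for `alpha = 1000` the exact asymptotic bias and the $\frac \gamma \alpha$ approximation nearly coincide, consistent with the simulated estimate of about `1.001` above:

```r
gamma = 1; alpha = 1000; Var.z = Var.eps.x = 1
exact.bias  = gamma*alpha*Var.z / (alpha^2*Var.z + Var.eps.x)
approx.bias = gamma/alpha
exact.bias   # about 0.000999999
approx.bias  # 0.001
```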
