**jacobsimmering.com**, and kindly contributed to R-bloggers)

# Instrumental Variables

Instrumental variables are an incredibly powerful for dealing with unobserved

heterogenity within the context of regression but the language used to define

them is mind bending. Typically, you hear something along the lines of “an

instrumental variable is a variable that is correlated with x but uncorrelated

with the outcome except through x.” At this point examples are listed — taxes

on smoking likely effect health only through their actions on smoking — or the

author drops right into the math stats.

I like math stats (when I am not getting a grade for it at least!) and will work

through it. But at some point, I want to play with a simulation of the process.

However, books like Mostly Harmless Econometerics

just throw a bunch of nonsense in terms of emperical studies at you. Sure, the

OLS and IV estimates are different… but I don't know which is right for sure

in an empirical setting. Simulations are great for this (the lack of simulations

is one of my biggest issues with that book. Right after the use of Stata.).

I turned to Google and did several searches and the only simple simulation

that I could find was done using Stata.

At least it was a simulation… but I wanted more. Since I couldn't find any

examples using R, I play around until I got it and then write up the simulation

to see if it helps anyone.

## Overview

Suppose that you have a continuous variable \( y \) with the known mean

response function

\[ E(y) = \beta_0 + \beta_1 x + \beta_2 c \]

and further that \( x \) and \( c \) are correlated with each other. If you knew \( c \),

estimating \( \beta_1 \) would be easy by simply running `lm(y ~ x + c)`

. You

would complete the regression, throw together a few diagnostics and call it a

day.

But often we don't observe \( c \) in our data. \( c \) could be any number of things,

such as treatment practices at a hospital or unmeasured differences between

patients, but it is in the direct casual path of \( y \) and you don't know it. If

you instead do the regression `lm(y ~ x)`

, you will not get an accurate

estimate of \( \beta_0 \). If you think about the model fit in that call it looks

like

\[ y_i = \beta_0 + \beta_1 x_i + \epsilon_i \]

but in this case

\[ \epsilon_i = f(c_i, \eta_i) \]

where \( \eta_i \) is white noise centered on zero. As long as that function is

linear, the expectation of \( \epsilon \) will be some multiple of \( c \). The result

is that \( x \) and \( \epsilon_i \) are correlated. The estimates of \( \beta_1 \) will

nessecarly be wrong. Interested parties can read more here

and see just how wrong it will be.

Suppose there there is a third variable \( z \), \( x \) can be expressed as

some function of \( z \), \( z \) has no effect on \( y \) except through \( x \) (therefore \( z \)

is independent of any other variable that effects \( y \)). Then we have the

equations

\[ E(x) = \alpha_0 + \alpha_1 x^* + \alpha_2 z \]

and

\[ E(y) = \beta_0 + beta_1 x + f(c) \]

where \( x^* \) is some latent part of \( x \) and \( c \) is still unobserved. Looking at

the first equation and realizing we don't know what \( x^* \) is the resulting model

will have two regression coefficients. The intercept will be

\( \alpha_0 + \alpha_1 E(x^*) \) which is not correlated with \( f(c) \) as

and \( z \) is independent of \( f(c) \) by assumption.

If we take the fitted values of \( x \) from the first equation and plug it into the

second equation, the \( x \) term is no independent of \( f(c) \). This independence

means that we can produce consistent estimates of \( \beta_1 \) when we replace \( x \)

with the fitted values of \( x \) from equation one. When the estimator used is OLS,

this method is called two-stage least squares (2SLS).

## Simulations

We need to start by generating values of \( x \) and \( c \) that are correlated. The

easiest way to do this is to use the function `mvrnorm()`

from `MASS`

.

Lets start by generating a 1000x2 random matrix with \( \rho = 0.5 \).

```
library(MASS)
# we are really generating x* and c and using a common variance
xStarAndC <- mvrnorm(1000, c(20, 15), matrix(c(1, 0.5, 0.5, 1), 2, 2))
xStar <- xStarAndC[, 1]
c <- xStarAndC[, 2]
```

and lets generate \( z \) and make the observed \( x \)

```
z <- rnorm(1000)
x <- xStar + z
```

And now generate the response variable

```
# using 1 makes it easy to estimate how 'wrong' an estimator is and toss
# some noise on y
y <- 1 + x + c + rnorm(1000, 0, 0.5)
```

Lets check to make sure that everything worked so far:

```
cor(x, c)
```

```
## [1] 0.3027
```

```
cor(z, c)
```

```
## [1] -0.02993
```

We have a moderate correlation between \( x \) and \( c \) and it appears that \( z \) and

\( c \) are uncorrelated. So far, so good.

Lets estimate the full model (the model that we would estimate if we knew \( c \))

```
lm(y ~ x + c)
```

```
##
## Call:
## lm(formula = y ~ x + c)
##
## Coefficients:
## (Intercept) x c
## 1.716 0.987 0.970
```

Not bad 0.9868 is basically 1. Now what would happen if

we ignored the fact that \( c \) was unknown and estimated anyways?

```
lm(y ~ x)
```

```
##
## Call:
## lm(formula = y ~ x)
##
## Coefficients:
## (Intercept) x
## 12.0 1.2
```

Yikes! The complete model definitely works but when we ignore the heterogeneity

caused by \( c \) we get terrible estimates. The estimate was

120% of the true effect of \( x \). That isn't

good.

What happens when we break the correlation between the “x part” and the

unobserved \( c \) by using the instrument and 2SLS?

```
xHat <- lm(x ~ z)$fitted.values
lm(y ~ xHat)
```

```
##
## Call:
## lm(formula = y ~ xHat)
##
## Coefficients:
## (Intercept) xHat
## 16.895 0.954
```

That is a pretty good estimate, only off by

4.6% of the true value of 1. Pretty

nifty that we are able to accurately estimate the effect of \( x \) using the

instrument despite the unobserved value of \( c \).

Its worth noting now that if you want to do this in R, you probably do not want

to do 2SLS as I have done it here. The parameter estimates are correct but the

errors are not. Using the method I have done above, you get standard errors

from the second equation residuals whereas you want them from the first. The

package `AER`

includes a function `ivreg()`

which will take care of

this for you.

If you look at the equations, two things probably jump out at you. First,

if \( x \) does a poor job predicting \( x \) (i.e., \( z \) is a weak instrument), then

it is unlikely that IV and 2SLS will provide you with meaningful results. In

fact, it is possible that the IV estimates will be more inaccurate than the

naive OLS estimator. A rule of thumb is the first stage (i.e., the regression of

x on the instrument and any covariates) should have an F statistic of at least

10.

Second, if \( x \) and \( c \) are only weakly correlated, the OLS and IV estimates

should converge. As \( x \) and \( c \) become more correlated, however, the estimates

move apart quickly and with even mild correlation (\( \rho \sim 0.2 \)) there can

be significant differences in the estimates.

The dynamic graph below allows you to explore the effects of altering these

parameters on the resulting OLS and IV estimates. The solid black line is the

true value, the green line is the estimate using the full model (if you knew \( c \)), red is the IV estimate and blue is using the OLS estimator.

**leave a comment**for the author, please follow the link and comment on their blog:

**jacobsimmering.com**.

R-bloggers.com offers

**daily e-mail updates**about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...