Instrumental Variables Simulation

January 9, 2014


Instrumental Variables

Instrumental variables are an incredibly powerful tool for dealing with unobserved
heterogeneity within the context of regression, but the language used to define
them is mind bending. Typically, you hear something along the lines of “an
instrumental variable is a variable that is correlated with x but uncorrelated
with the outcome except through x.” At this point examples are listed (taxes
on smoking likely affect health only through their effect on smoking) or the
author drops right into the math stats.

I like math stats (when I am not getting a grade for it, at least!) and will work
through it. But at some point, I want to play with a simulation of the process.
However, books like Mostly Harmless Econometrics
just throw a bunch of empirical studies at you. Sure, the
OLS and IV estimates are different, but in an empirical setting I don't know
for sure which one is right. Simulations are great for this (the lack of
simulations is one of my biggest issues with that book, right after the use of Stata).

I turned to Google and did several searches, and the only simple simulation
that I could find was done using Stata.
At least it was a simulation, but I wanted more. Since I couldn't find any
examples using R, I played around until I got it working and then wrote up the
simulation to see if it helps anyone.


Suppose that you have a continuous variable \( y \) with the known mean
response function

\[ E(y) = \beta_0 + \beta_1 x + \beta_2 c \]

and further that \( x \) and \( c \) are correlated with each other. If you knew \( c \),
estimating \( \beta_1 \) would be easy by simply running lm(y ~ x + c). You
would complete the regression, throw together a few diagnostics, and call it a day.

But often we don't observe \( c \) in our data. \( c \) could be any number of things,
such as treatment practices at a hospital or unmeasured differences between
patients, but it is in the direct causal path of \( y \) and you don't observe it. If
you instead run the regression lm(y ~ x), you will not get an accurate
estimate of \( \beta_1 \). If you think about the model fit in that call, it looks like

\[ y_i = \beta_0 + \beta_1 x_i + \epsilon_i \]
but in this case
\[ \epsilon_i = f(c_i, \eta_i) \]
where \( \eta_i \) is white noise centered on zero. As long as that function is
linear, the expectation of \( \epsilon \) will be some multiple of \( c \). The result
is that \( x \) and \( \epsilon_i \) are correlated, and the estimates of \( \beta_1 \) will
necessarily be wrong. Interested parties can read more here
and see just how wrong it will be.
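
For concreteness, the standard omitted variable bias result quantifies just how wrong:

\[ \text{plim} \; \hat{\beta}_1^{OLS} = \beta_1 + \beta_2 \frac{\mathrm{Cov}(x, c)}{\mathrm{Var}(x)} \]

so the bias grows with both the effect of \( c \) on \( y \) and the covariance between
\( x \) and \( c \).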

Suppose there is a third variable \( z \) such that \( x \) can be expressed as
some function of \( z \), and \( z \) has no effect on \( y \) except through \( x \) (therefore \( z \)
is independent of any other variable that affects \( y \)). Then we have the system

\[ E(x) = \alpha_0 + \alpha_1 x^* + \alpha_2 z \]
\[ E(y) = \beta_0 + \beta_1 x + f(c) \]

where \( x^* \) is some latent part of \( x \) and \( c \) is still unobserved. Looking at
the first equation, and realizing that we don't know what \( x^* \) is, the resulting model
will have two regression coefficients. The intercept will be
\( \alpha_0 + \alpha_1 E(x^*) \), which is not correlated with \( f(c) \),
and \( z \) is independent of \( f(c) \) by assumption.

If we take the fitted values of \( x \) from the first equation and plug them into the
second equation, the \( x \) term is now independent of \( f(c) \). This independence
means that we can produce consistent estimates of \( \beta_1 \) when we replace \( x \)
with the fitted values of \( x \) from equation one. When the estimator used is OLS,
this method is called two-stage least squares (2SLS).
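
For reference, in this just-identified case (one instrument for one endogenous
regressor) the 2SLS estimator has the familiar closed form

\[ \hat{\beta}_{IV} = (Z^T X)^{-1} Z^T y \]

where \( Z \) holds a column of ones and \( z \), and \( X \) holds a column of ones and
\( x \); the two-step lm() approach used below reproduces exactly this estimate.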


We need to start by generating values of \( x \) and \( c \) that are correlated. The
easiest way to do this is to use the function mvrnorm() from the MASS package.
Let's start by generating a 1000x2 random matrix with \( \rho = 0.5 \).

# load MASS for mvrnorm()
library(MASS)

# we are really generating x* and c and using a common variance
xStarAndC <- mvrnorm(1000, c(20, 15), matrix(c(1, 0.5, 0.5, 1), 2, 2))
xStar <- xStarAndC[, 1]
c <- xStarAndC[, 2]

and let's generate \( z \) and make the observed \( x \)

z <- rnorm(1000)
x <- xStar + z

And now generate the response variable

# using coefficients of 1 makes it easy to see how 'wrong' an estimator is;
# toss some noise on y as well
y <- 1 + x + c + rnorm(1000, 0, 0.5)

Let's check to make sure that everything worked so far:

cor(x, c)
## [1] 0.3027
cor(z, c)
## [1] -0.02993

We have a moderate correlation between \( x \) and \( c \) and it appears that \( z \) and
\( c \) are uncorrelated. So far, so good.

Let's estimate the full model (the model that we would estimate if we knew \( c \))

lm(y ~ x + c)
## Call:
## lm(formula = y ~ x + c)
## Coefficients:
## (Intercept)            x            c  
##       1.716        0.987        0.970

Not bad: 0.9868 is basically 1. Now what would happen if
we ignored the fact that \( c \) is unobserved and estimated the model anyway?

lm(y ~ x)
## Call:
## lm(formula = y ~ x)
## Coefficients:
## (Intercept)            x  
##        12.0          1.2

Yikes! The complete model definitely works, but when we ignore the heterogeneity
caused by \( c \) we get terrible estimates. The estimate was
120% of the true effect of \( x \). That isn't good.

What happens when we break the correlation between the “x part” and the
unobserved \( c \) by using the instrument and 2SLS?

xHat <- lm(x ~ z)$fitted.values
lm(y ~ xHat)
## Call:
## lm(formula = y ~ xHat)
## Coefficients:
## (Intercept)         xHat  
##      16.895        0.954

That is a pretty good estimate, only off by
4.6% of the true value of 1. Pretty
nifty that we are able to accurately estimate the effect of \( x \) using the
instrument despite the unobserved value of \( c \).

It's worth noting now that if you want to do this in R, you probably do not want
to do 2SLS by hand as I have done it here. The parameter estimates are correct but the
standard errors are not. With the method above, the standard errors come from the
residuals of the second-stage regression (which uses the fitted values of \( x \))
rather than from residuals based on the observed \( x \). The
package AER includes a function ivreg() which will take care of
this for you.
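
A minimal sketch using ivreg() on the simulated data, assuming the AER package is
installed; the formula puts the regressors before the | and the instruments after it:

# install.packages("AER")  # if needed
library(AER)

# same 2SLS point estimate as above, but with correct standard errors
ivFit <- ivreg(y ~ x | z)
summary(ivFit)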

If you look at the equations, two things probably jump out at you. First,
if \( z \) does a poor job predicting \( x \) (i.e., \( z \) is a weak instrument), then
it is unlikely that IV and 2SLS will provide you with meaningful results. In
fact, it is possible that the IV estimates will be more inaccurate than the
naive OLS estimator. A rule of thumb is that the first stage (i.e., the regression of
x on the instrument and any covariates) should have an F statistic of at least 10.
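
A quick way to check that in the simulated data is to pull the F statistic out of
the first-stage summary():

# first-stage regression; summary() reports the F statistic and its df
firstStage <- lm(x ~ z)
summary(firstStage)$fstatistic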

Second, if \( x \) and \( c \) are only weakly correlated, the OLS and IV estimates
should converge. As \( x \) and \( c \) become more correlated, however, the estimates
move apart quickly and with even mild correlation (\( \rho \sim 0.2 \)) there can
be significant differences in the estimates.
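
To see this in the simulation itself, here is a small helper function (a sketch that
reuses the data-generating steps from above) which repeats the draw for a given
\( \rho \) and returns the naive OLS and IV estimates of \( \beta_1 \):

# repeat the simulation for a given correlation between x* and c and return
# the naive OLS and the 2SLS estimates of the coefficient on x
simulateOnce <- function(rho, n = 1000) {
  xStarAndC <- mvrnorm(n, c(20, 15), matrix(c(1, rho, rho, 1), 2, 2))
  xStar <- xStarAndC[, 1]
  c <- xStarAndC[, 2]
  z <- rnorm(n)
  x <- xStar + z
  y <- 1 + x + c + rnorm(n, 0, 0.5)
  xHat <- lm(x ~ z)$fitted.values
  c(ols = unname(coef(lm(y ~ x))[2]), iv = unname(coef(lm(y ~ xHat))[2]))
}

# columns correspond to rho = 0, 0.2, 0.5, 0.8
sapply(c(0, 0.2, 0.5, 0.8), simulateOnce)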

The dynamic graph below allows you to explore the effects of altering these
parameters on the resulting OLS and IV estimates. The solid black line is the
true value, the green line is the estimate using the full model (if you knew \( c \)), the red line is the IV estimate, and the blue line is the naive OLS estimate.
