(This article was first published on **Back Side Smack » R Stuff**, and kindly contributed to R-bloggers)

When working with labor economics, we often run into issues with selection on variables of interest. Regressing earnings on years of education to estimate the human capital earnings function makes sense at first blush until we remember that education is not randomly assigned: the equation captures both the selection on education and the effect of education in a single parameter. I won’t belabor this point, but Angrist and Pischke’s *Mostly Harmless Econometrics* covers the subject of selection in labor econometrics quite masterfully. One path around the selection problem is to find a variable which mimics random assignment of the problematic explanatory variable. If education is giving us trouble, then we find some third variable which predicts education but does not affect earnings except through education: an instrument. Other flavors of solutions to this problem have cropped up over the years (regression discontinuity, difference-in-differences, etc.) and IVs are a bit passé, but a simple implementation of IVs can be instructive.

R has two (three, if you count systemfit) “canned” instrumental variables functions: `tsls()` in **sem** and `ivreg()` in **AER**. Both have similar input formats, both use two-stage least squares to estimate their final equation and both adjust the standard errors of the final parameter estimates in a consistent fashion. However, for the purposes of learning how to apply an instrumental variable strategy, neither is very helpful. If you attempt to do something wrong (e.g. supply an underdetermined instrument), `tsls()` will throw an error. That is helpful if you want to do production work where someone may depend on your estimates, but not too instructive if you simply want to test out a simple model. Thankfully the basic two stage least squares estimator for instrumental variables is pretty easy to implement. If we have some model:

$$y = X\beta + \epsilon$$

where $y$ is our dependent variable of interest and $X$ is the problematic explanatory variable, we can get our estimate of $\beta$ with:

$$\hat{\beta}_{OLS} = (X'X)^{-1}X'y$$

Because of the selection problem our estimate $\hat{\beta}_{OLS}$ will be biased (with the direction dependent on the effect of the selection), but we can use our instrument to estimate $\beta$. Imagine our instrument is $Z$ and a model of interaction between the instrument and the explanatory variable can be written as

$$X = Z\gamma + \nu$$

Our predicted values of $X$, or $\hat{X} = Z\hat{\gamma}$, can be computed in a first stage equation by regressing $X$ on $Z$ and inserted into the second stage to get:

$$\hat{\beta}_{IV} = (\hat{X}'\hat{X})^{-1}\hat{X}'y$$
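To make the two stages concrete before touching real data, here is a minimal base R sketch on simulated data. Everything in it is an assumption made for illustration: the sample size, the coefficients, the unobserved confounder `u` that plays the role of selection, and the variable names. It is not taken from the survey data discussed in this post.

```r
# Simulated example: all numbers and names here are illustrative assumptions.
set.seed(42)
n <- 5000
z <- rnorm(n)                      # instrument
u <- rnorm(n)                      # unobserved confounder (the "selection")
x <- 1 + 0.8 * z + u + rnorm(n)    # problematic explanatory variable
y <- 2 + 0.5 * x - u + rnorm(n)    # outcome; true effect of x is 0.5

X <- cbind(1, x)                   # regressors with an intercept column
Z <- cbind(1, z)                   # instruments with an intercept column

# OLS: (X'X)^{-1} X'y, biased here because x is correlated with u
beta_ols <- solve(crossprod(X), crossprod(X, y))

# First stage: regress X on Z and keep the fitted values
gamma_hat <- solve(crossprod(Z), crossprod(Z, X))
X_hat <- Z %*% gamma_hat

# Second stage: regress y on the fitted values
beta_iv <- solve(crossprod(X_hat), crossprod(X_hat, y))

cbind(ols = beta_ols[, 1], iv = beta_iv[, 1])
```

Because the simulation builds the confounder in with a negative sign, the OLS slope should land below the true value of 0.5 while the two-stage slope should land close to it. The sketch only computes point estimates; it makes no attempt at correct standard errors.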

The good news is doing this sort of thing in R is easy! I use a dataset from Applied Econometrics with R available in the **AER** package. The dataset is a survey of high school graduates with variables coded for wages, education, average tuition and a number of demographic variables. The dataset also includes distance from a college while the survey participants were in high school (hence the name `CollegeDistance`). Loosely following David Card’s paper on college distance, we can use that measure of distance as an instrument for education. The logic goes something like this: distance from a college will strongly predict a decision to pursue a college degree but may not predict wages apart from increased education. We are asserting the strength and validity of the instrument in question (more on that in a bit). We can imagine some problems with an instrument like college distance: families who value education may move into neighborhoods close to colleges, or neighborhoods near colleges may have stronger job markets. Both of those features may invalidate the instrument by introducing unobserved variables which influence lifetime earnings but cannot be captured in our measure of schooling. However, for our purposes it may work.

[Figure from the “Simple regression” album]

The nature of variation in the instrument may generate some problems in our model, but we can ignore them for now and return in a later post. Computing `lm(education ~ distance, data=cd.d)` shows a significant negative relationship between education (nominally a continuous variable, but coded as integer values between 12 and 18 in the survey) and college distance. A t-test of the parameter estimate isn’t quite enough to prove the instrument is strong; see Bound, Jaeger and Baker for an econometric explanation of weak instrument peril. In the code below you can see results from the F-test with the `encomptest()` function from **lmtest**.
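As a rough sketch of the first stage, the snippet below loads `CollegeDistance` from **AER**, stores it as `cd.d` to match the formula above, and fits the regression of education on distance. Note the substitution: instead of `encomptest()`, it pulls the overall F statistic from `summary()`, which with a single instrument and no other regressors is the first-stage F. The object name `first.stage` is my own, not the author's.

```r
library(AER)                       # provides the CollegeDistance data

data("CollegeDistance")
cd.d <- CollegeDistance

# First stage: does distance predict years of education?
first.stage <- lm(education ~ distance, data = cd.d)
summary(first.stage)$coefficients  # estimate, std. error, t value, p value

# Overall F statistic of the first stage (value, numerator df, denominator df)
first.f <- summary(first.stage)$fstatistic
first.f
```

A common rule of thumb treats a first-stage F well above 10 as some evidence against a weak instrument, though that threshold is a convention rather than a guarantee.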

Our simple OLS estimate measuring the impact of education on earnings cannot reject the null hypothesis of no effect. The plot below was generated with the great `coefplot()` from Andrew Gelman’s **arm** package.

[Figure: coefficient plot of the OLS estimates, from the “Simple regression” album]

Inserting the predicted values of the first stage as a proxy for education in the second stage gives us a significant effect of education (instrumented by distance) on earnings, though caveat emptor about causality!

[Figure: second-stage (IV) estimates, from the “Simple regression” album]

The error on the computed parameter for education in our instrumental variable estimate is much larger than the simple OLS estimate. Part of this comes from OLS underestimating the variance but most of it comes from the added noise in generating a two stage estimate. We’ll get back to that later, but for now we can sit back and enjoy our very simple IV estimator. Code for the estimator is below:
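What follows is a minimal sketch of such an estimator rather than the original listing. It assumes the simplest possible specification (wage on education, with distance as the only instrument and no demographic controls), and the object names `ols`, `first.stage`, `second.stage`, `iv` and the column `ed.pred` are my own choices. The `ivreg()` call is included only as a cross-check on the hand-rolled coefficients.

```r
library(AER)                       # CollegeDistance data and ivreg()

data("CollegeDistance")
cd.d <- CollegeDistance

# Naive OLS of wages on education
ols <- lm(wage ~ education, data = cd.d)

# Stage 1: predict education from the instrument
first.stage <- lm(education ~ distance, data = cd.d)
cd.d$ed.pred <- fitted(first.stage)

# Stage 2: replace education with its first-stage prediction
second.stage <- lm(wage ~ ed.pred, data = cd.d)

# Canned 2SLS for comparison: regressors | instruments
iv <- ivreg(wage ~ education | distance, data = cd.d)

# Point estimates from the manual second stage and from ivreg() side by side
cbind(manual = coef(second.stage), ivreg = coef(iv))

# Standard errors differ: the manual second stage uses residuals computed
# from ed.pred, while ivreg() uses residuals computed from actual education
cbind(manual = sqrt(diag(vcov(second.stage))), ivreg = sqrt(diag(vcov(iv))))
```

The manual coefficients should match `ivreg()` up to numerical precision, but the standard errors reported by the second `lm()` call are not the adjusted ones, which is one reason to switch to the canned functions once the mechanics are clear. If you add exogenous controls, they need to appear in both stages.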

I also want to thank Tal Galili and R-Bloggers for syndicating this blog into their feed. R-bloggers is a great resource and being a minuscule part of such a fun project is a source of great pride. Please subscribe to their RSS feed!
