(This article was first published on **Stable Markets » R**, and kindly contributed to R-bloggers)

**Introduction**

This is the first post in a series devoted to explaining basic econometric concepts using R simulations.

The topic in this post is endogeneity, which can severely bias regression estimates. I will specifically simulate endogeneity caused by an omitted variable. In future posts in this series, I’ll simulate other specification issues such as heteroskedasticity, multicollinearity, and collider bias.

**The True Data-Generating Process**

Consider the data-generating process (DGP) of some outcome variable $y$:

$$y = a + bx + cz + e$$

For the simulation, I set parameter values for $a$, $b$, and $c$ and simulate positively correlated independent variables, $x$ and $z$ (N=500).

    # simulation parameters
    set.seed(144)
    ss=500; trials=5000
    # c is the effect of the omitted variable; c*h = .09 is the resulting bias
    a=50; b=.5; c=.1; d=25; h=.9

    # generate two correlated independent variables
    x=rnorm(n=ss,mean=1000,sd=50)
    z=d+h*x+rnorm(ss,0,10)
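A quick way to confirm that the two regressors really are positively correlated is to look at their sample correlation and at the slope from regressing $z$ on $x$. The sketch below regenerates the data with the same settings; the object names `r_xz` and `slope_zx` are introduced here for illustration and are not part of the original code.

```r
# Regenerate the regressors with the same seed and parameters as above.
set.seed(144)
ss <- 500; d <- 25; h <- 0.9
x <- rnorm(n = ss, mean = 1000, sd = 50)
z <- d + h * x + rnorm(ss, 0, 10)

# Sample correlation between the two regressors.
r_xz <- cor(x, z)

# Slope from regressing z on x; it should land close to h.
slope_zx <- unname(coef(lm(z ~ x))["x"])

print(c(correlation = r_xz, slope = slope_zx))
```

By construction the population correlation is about 0.98, so the sample value should be strongly positive, and the slope should sit near `h`.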

**The Simulation**

The simulation will estimate the two models below.

$$\text{Model 1: } y = a + bx + cz + e$$

$$\text{Model 2: } y = a + bx + u, \quad u = cz + e$$

The first model is correct in the sense that it includes all terms in the actual DGP. The second model, however, omits a variable that is present in the DGP, $z$. Instead, that variable is absorbed into the error term $u$.

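This second model will yield a biased estimator of $b$. The variance estimate will also be biased. This is because $x$ is endogenous, which is a fancy way of saying it is correlated with the error term, $u = cz + e$. Since $x$ and $z$ are positively correlated and $c > 0$, $x$ is positively correlated with $u$.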
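To see the endogeneity directly, one can build the error term of the short model by hand and check its correlation with $x$. This is a minimal sketch: the names `u`, `cor_xe`, and `cor_xu` are illustrative, and `c = 0.1` is an assumption chosen so that `c * h` is in line with the bias of roughly .09 reported in this post.

```r
set.seed(144)
ss <- 500
c <- 0.1   # assumed effect of z on y (see note above)
d <- 25; h <- 0.9

x <- rnorm(ss, 1000, 50)
z <- d + h * x + rnorm(ss, 0, 10)
e <- rnorm(ss, 0, 10)   # error term of the full model

# Error term of the model that omits z: it absorbs c * z.
u <- c * z + e

cor_xe <- cor(x, e)   # should be near zero: x is exogenous in the full model
cor_xu <- cor(x, u)   # should be clearly positive: x is endogenous once z is omitted

print(c(cor_x_e = cor_xe, cor_x_u = cor_xu))
```

The first correlation is what OLS assumes to be zero; the second shows that the assumption fails as soon as $z$ moves into the error term.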

    sim=function(endog){
      # assume normal error with constant variance to start
      e=rnorm(n=ss,mean=0,sd=10)
      y=a+b*x+c*z+e
      # select the model to fit: omit z (endogeneity) or include it
      if(endog==TRUE){
        fit=lm(y~x)
      }else{
        fit=lm(y~x+z)
      }
      return(fit$coefficients)
    }

    # run simulation - with and without endogeneity
    sim_results=t(replicate(trials,sim(endog=FALSE)))
    sim_results_endog=t(replicate(trials,sim(endog=TRUE)))

**Simulation Results**

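This simulation yields two different sampling distributions for $\hat{b}$, the estimated coefficient on $x$: one from the model that includes $z$ and one from the model that omits it.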
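The two sampling distributions can be summarized by their means and standard deviations. The sketch below is a compact, self-contained variant of the simulation rather than the exact code used for the post: it uses fewer trials to run quickly, the helper `slope_x` and the objects `b_full` and `b_short` are hypothetical names, and `c = 0.1` is an assumption chosen so that `c * h` is close to the bias reported in this post.

```r
set.seed(144)
ss <- 500
trials <- 1000   # fewer trials than the post, to keep this quick
a <- 50; b <- 0.5; d <- 25; h <- 0.9
c <- 0.1         # assumed effect of z on y (see note above)

x <- rnorm(ss, 1000, 50)
z <- d + h * x + rnorm(ss, 0, 10)

# One draw of the estimated slope on x, with z either omitted or included.
slope_x <- function(omit_z) {
  y <- a + b * x + c * z + rnorm(ss, 0, 10)
  fit <- if (omit_z) lm(y ~ x) else lm(y ~ x + z)
  unname(coef(fit)["x"])
}

b_full  <- replicate(trials, slope_x(omit_z = FALSE))
b_short <- replicate(trials, slope_x(omit_z = TRUE))

# Center and spread of each sampling distribution.
summary_tab <- rbind(
  full  = c(mean = mean(b_full),  sd = sd(b_full)),
  short = c(mean = mean(b_short), sd = sd(b_short))
)
print(round(summary_tab, 4))

# Optional plot when run interactively; the dotted line marks the true b.
if (interactive()) {
  plot(density(b_short), main = "Sampling distributions of the slope on x",
       xlab = "estimate", xlim = range(c(b_full, b_short)))
  lines(density(b_full), lty = 2)
  abline(v = b, lty = 3)
}
```

The full model should be centered on the true value of `b`, while the short model should be shifted upward and, because it ignores the collinearity with $z$, noticeably tighter.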

**Bias Analysis**

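The bias in $\hat{b}$ follows from the expected value of the OLS slope in the model that omits $z$:

$$E[\hat{b}] = b + c\,\frac{Cov(x,z)}{Var(x)}$$

Substituting $z = d + hx + v$, where $v$ is the noise in $z$ and is uncorrelated with $x$, gives

$$\frac{Cov(x,z)}{Var(x)} = \frac{h\,Var(x)}{Var(x)} = h$$

When omitting variable $z$, the bias in $\hat{b}$ is therefore $E[\hat{b}] - b = ch$.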

Here is the distribution of the bias. It is centered around .0895, very close to the true bias value.
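The omitted-variable bias expression, $c \cdot Cov(x,z)/Var(x)$, can also be checked on a single sample. In any one sample, the slope from the short regression equals the slope on $x$ from the long regression plus the long-regression coefficient on $z$ times the slope of $z$ on $x$. The sketch below verifies that identity; the object names are illustrative, and `c = 0.1` is an assumption chosen so that `c * h` is close to the bias reported above.

```r
set.seed(144)
ss <- 500
a <- 50; b <- 0.5; d <- 25; h <- 0.9
c <- 0.1   # assumed effect of z on y (see note above)

x <- rnorm(ss, 1000, 50)
z <- d + h * x + rnorm(ss, 0, 10)
y <- a + b * x + c * z + rnorm(ss, 0, 10)

long  <- coef(lm(y ~ x + z))    # model that includes z
short <- coef(lm(y ~ x))        # model that omits z
delta <- coef(lm(z ~ x))["x"]   # slope of z on x, the sample analogue of h

# In-sample identity: short slope = long slope on x + (long slope on z) * delta
implied_short <- unname(long["x"] + long["z"] * delta)
actual_short  <- unname(short["x"])

# Expected bias of the short regression for these particular x and z.
expected_bias <- unname(c * delta)

print(c(actual = actual_short, implied = implied_short,
        expected_bias = expected_bias))
```

The first two numbers should agree up to floating-point error, and the expected bias should be close to `c * h`.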

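The derivation above also lets us determine the direction of bias from knowing the correlation of $x$ and $z$ and the sign of $c$, the true partial effect of $z$ on $y$. If the two have the same sign, the estimate of $b$ is biased upward; if the signs differ, it is biased downward.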
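The sign rule can be illustrated by simulating all four sign combinations. This is a rough sketch under assumed values: `mean_bias` is a hypothetical helper, and the magnitudes 0.1 and 0.9 are chosen only for illustration.

```r
# Hypothetical helper: average bias in the slope on x when z is omitted,
# for a given effect of z on y (c_true) and slope of z on x (h_true).
mean_bias <- function(c_true, h_true, trials = 200, ss = 500, b_true = 0.5) {
  x <- rnorm(ss, 1000, 50)
  z <- 25 + h_true * x + rnorm(ss, 0, 10)
  est <- replicate(trials, {
    y <- 50 + b_true * x + c_true * z + rnorm(ss, 0, 10)
    unname(coef(lm(y ~ x))["x"])
  })
  mean(est) - b_true
}

set.seed(144)
signs <- expand.grid(c_true = c(-0.1, 0.1), h_true = c(-0.9, 0.9))
signs$bias <- mapply(mean_bias, signs$c_true, signs$h_true)
print(signs)
```

In each row the sign of the simulated bias should match the sign of `c_true * h_true`: upward when the two signs agree, downward when they differ.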

**Conclusion**

The case above was pretty general, but it has particular applications. For example, if we believe that an individual’s income is a function of years of education and years of work experience, then omitting one variable will bias the slope estimate of the other whenever the two are correlated.
