[This article was first published on R – insightR, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

By Gabriel Vasconcelos

## Inference after model selection

In many cases, when we want to estimate some causal relationship between two variables we have to solve the problem of selecting the right control variables. If we fail, our results will be very fragile and the estimator potentially biased because we left some important control variables out. This problem is known as omitted variables bias. It happens because some variables correlated with our variable of interest were left out and went to the errors term, making the errors correlated with the variable of interest.

Suppose we want to test the causal effect of education on wages. Many potential controls come to my mind such as parents education, experience, age, etc. There are several methods to select relevant variables, the most naive one is to run a regression with all variables and then select those which are significant. Then we run the regression again only with these variables.

No mater how we select the variables, we may always fall on a type II error situation, when some variable is significant but we wrongly drop it from our model. Additionally, some variables may have a very small, but significant effect which makes it more difficult for variable selection methods to catch them.

## The traditional single selection approach

Traditionally, this is how the procedure is done. Suppose our response variable is $y_i$, and we are interested on the effects of $d_i$ on $y_i$ using the controls $x_i$. First we estimate the following model:

$y_i = \alpha_0 d_i + \beta' x_i + \varepsilon_i$,

Then we select the significant variables from $x_i$ and re-estimate the equation using $d_i$ and $x_i^s$, which is the group of the selected variables. This is not good and there is a high probability that your inference will be wrong!!

## The double selection approach

Beloni and Chernozhukov (2012) showed that the most adequate way to do inference after model selection is by a procedure called double selection. Consider the following equations:

$y_i = \alpha_0 d_i + \beta' x_i + \varepsilon_i$,
$d_i = \gamma'x_i + \epsilon_i$,

we can substitute the second equation on the first and have the following:

$y_i = \alpha_0(\gamma'+\beta')x_i + (\alpha_0\epsilon_i +\varepsilon_i)$,
$d_i = \gamma'x_i + \epsilon_i$,

Note that the $y_i$ now does not depend on $d_i$. The double selection procedure is as follows:

• Estimate the equation of $x_i$ on $y_i$ and select the relevant variables,
• Estimate the equation of $x_i$ on $d_i$ and select the relevant variables,
• Estimate the equation of $x_i$ and $d_i$ on $y_i$ using the elements of $x_i$ that were selected on any of the two steps.

This procedure gives a new chance to the variables that were dropped on the fist step and are related to $d_i$ to be included in the final model.

## Application

The R code presented here is for a simulation on a simplified version of the data generation process from Beloni Chernozhukov (2012). You are going to need to install the mvtnorm package to generate data from multivariate normal distributions. The code below declares the parameters for the simulation.

library(mvtnorm)
## Declare parameters ##
p=50 # = Number of potential controls = #
N=100 # = Number of observations = #
alpha0=0.5 # = Parameter of interest = #
b=1/((1:p)^2) # = Control parameters (some are big and some small) = #

# = Controls covariance matrix = #
sigma=matrix(NA,p,p)
for(k in 1:p){
for(j in 1:p){
sigma[k,j]=0.5^abs(j-k)
}
}
M=5000 # = Number of simulations = #
set.seed(123) # = Seed for replication = #
store=matrix(NA,M,2) # = Matrix to store the parameters in each simulation = #
colnames(store)=c("Single Selection", "Double Selection")


We are going to perform 5000 simulations. This may take a couple minutes. The objective is to compare the single selection and the double selection in terms of bias and variance. The code below performs the simulation loop.

# = Simulation Loop = #
for(i in 1:M){
X=rmvnorm(N,rep(0,p),sigma) # = Generate controls = #

ed=rnorm(N) # = Errors from di equation = #
ey=rnorm(N) # = Errors from yi equation = #

cd=1/sqrt(var(X%*%b)) # = This will keep the R2 close to 0.5 = #
d=X%*%(as.vector(cd)*b)+ed # = generate di = #

cy=1/sqrt(var(cbind(d,X)%*%c(alpha0,b))) # = R2 control = #
y=cbind(d,X)%*%(c(alpha0,as.vector(cy)*b))+ey # = Generate yi = #

# = Single Selection Approach = #
model0=lm(y~d+X) # = Run first step = #
selected=which(summary(model0)$coefficients[-c(1:2),4]<0.05) # = Variable selection using significance = # # = Just a control for the cases the model selects 0 variables = # if(length(selected)==0){ model1=lm(y~d) }else{ model1=lm(y~d+X[,selected]) } store[i,1]=coef(model1)[2] # = save results = # # = Double Selection = # modeld0=lm(y~X) # = Regression of xi on yi = # selected0=which(summary(modeld0)$coefficients[-1,4]<0.05) # = Variable selection = #
modeld1=lm(d~X) # = Regression of xi on di = #
selected1=which(summary(modeld1)\$coefficients[-1,4]<0.05) # = Variable  selection = #

selected=sort(union(selected0,selected1)) # = Selected variables = #

# = Just a control for the cases the model selects 0 variables = #
if(length(selected)==0){
final_model=lm(y~d)
}else{
final_model=lm(y~d+X[,selected])
}

# = save results = #
store[i,2]=coef(final_model)[2]
}

colMeans(store) # = Average alpha0 (real value is 0.5) = #

## Single Selection Double Selection
##        0.5895864        0.5075527

apply(store,2,sd) # = alpha0 standard deviation = #

## Single Selection Double Selection
##        0.1609284        0.1117405

# = Plot densities = #
plot(density(store[,2]))
lines(density(store[,1]),col=2)
abline(v=alpha0,lty=2,col="yellow")
legend("topleft",legend=c("Double","Single"),col=c(1,2),lty=1,seg.len = 0.8,cex=0.8,bty="n")


The results show the damage that can be caused if we ignore the existence of the variable selection to do inference. The single selection procedure is biased and the distribution is bimodal. There is a high probability that the effects of $d_i$ on $y_i$ will be overestimated. The omitted variable bias is behind the problem. When the fist step of the single selection fails to recover important variables they are left to the error term. The double selection makes traditional inference suitable again. Note that the single selection fails even if we use sophisticated variable selection techniques such as LASSO.

## References

Belloni, A., V. Chernozhukov, and C. Hansen. “Inference on treatment effects after selection amongst high-dimensional controls.”