The package hdm for double selection inference with a simple example

insightr

5 years ago

[This article was first published on R – insightR, and kindly contributed to R-bloggers.]


By Gabriel Vasconcelos

In an earlier post I discussed Double Selection (DS), a procedure for inference after selecting controls. I showed an example of the consequences of ignoring the variable selection step, as discussed in an article by Belloni, Chernozhukov and Hansen.

Some of the authors of that article created the hdm package, which implements double selection using the Rigorous LASSO (RLASSO) to select the controls. The RLASSO uses the theory they developed (instead of cross-validation or an information criterion) to select the regularization parameter, normally referred to as $\lambda$.
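To make the idea of a theory-driven penalty concrete, the sketch below computes a penalty level from the sample size and the number of regressors alone, without looking at the outcome. The formula $2c\sqrt{n}\,\Phi^{-1}(1 - \gamma/(2p))$ and the constants `c = 1.1` and `gamma = 0.1/log(n)` are my reading of the defaults described in the hdm documentation, so treat them as assumptions. The values of `n` and `p` are only illustrative.

```r
# Illustrative sizes (assumed): 90 observations, 61 candidate regressors
n <- 90
p <- 61

# Constants assumed to match the documented hdm defaults
c_const <- 1.1
gamma   <- 0.1 / log(n)

# Penalty level that depends only on n and p, not on the data
lambda0 <- 2 * c_const * sqrt(n) * qnorm(1 - gamma / (2 * p))
lambda0
```

Cross-validation would instead search over a grid of candidate values and refit the model many times. Here the penalty is available in closed form before any model is estimated.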

Application

I am going to show an application based on the package's vignette, which is in turn based on an article by Barro and Lee (1994). The hypothesis we want to test is whether less developed countries, with lower GDP per capita, grow faster than developed countries. In other words, is there a catch-up effect? The model equation is as follows:

$$y_i = \alpha_0 d_i + \sum_{j=1}^{p} \beta_j x_{i,j} + \varepsilon_i$$

where $y_i$ is the GDP growth rate over a specific decade in country $i$, $d_i$ is the log of the GDP at the beginning of the decade, and $x_{i,j}$ are controls that may affect the GDP. We want to know the effect of $d_i$ on $y_i$, which is measured by $\alpha_0$. If our catch-up hypothesis is true, $\alpha_0$ must be negative and hopefully significant: countries that start with a lower GDP should grow faster.

The dataset is available in the package. It has 62 variables and 90 observations. Each observation is a country, but the same country may have more than one observation if analysed in two different decades. The large number of variables will require some variable selection, and I will show what happens if we use a single LASSO selection and the Double Selection. The hdm package does all the DS steps in a single line of code, so we do not need to estimate the two selection models and the Post-OLS individually. I will also run a naive OLS with all variables just for illustration.
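For readers who want to see what those steps look like when written out, here is a minimal sketch of the procedure done by hand: one RLASSO for the outcome, one for the initial GDP, and an OLS on the union of the selected controls. The object names (`sel_y`, `sel_d`, `keep`, `ds_manual`) are mine, and this is not the package's internal code, so its numbers need not match the package output exactly.

```r
library(hdm)
data("GrowthData")

y <- GrowthData$Outcome                     # growth rate
d <- GrowthData$gdpsh465                    # initial log GDP
X <- as.matrix(GrowthData[, -c(1, 2, 3)])   # controls: drop Outcome, intercept, gdpsh465

# Step 1: controls that help predict the outcome
sel_y <- rlasso(X, y)$index
# Step 2: controls that help predict the variable of interest
sel_d <- rlasso(X, d)$index

# Step 3: OLS of y on d and the union of both selections
keep <- sel_y | sel_d
dat  <- data.frame(y = y, d = d, X[, keep, drop = FALSE])
ds_manual <- summary(lm(y ~ ., data = dat))$coefficients["d", ]
ds_manual
```

The second selection step is what distinguishes this from a single selection: a control that matters for initial GDP is kept even if the outcome LASSO dropped it.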

library(hdm)
data("GrowthData") # = use ?GrowthData for more information = #
dataset=GrowthData[,-2] # = The second column is just a vector of ones = #

# = Naive OLS with all variables = #
# = I will select only the summary line that contains the initial log GDP = #
# = The row is selected by name because row 1 of the table is the intercept = #
OLS = summary(lm(Outcome ~., data = dataset))$coefficients["gdpsh465", ]

# = Single step selection LASSO and Post-OLS = #
# = I will select only the summary line that contains the initial log GDP = #
lasso = rlasso(Outcome~., data = dataset, post = FALSE) # = Run the Rigorous LASSO = #
selected = which(coef(lasso)[-c(1:2)] !=0) # = Select relevant variables = #
formula = paste(c("Outcome ~ gdpsh465", names(selected)), collapse = "+")
SS = summary(lm(formula, data = dataset))$coefficients["gdpsh465", ]

# = Double Selection = #
DS=rlassoEffects(Outcome~. , I=~gdpsh465, data=dataset)
DS=summary(DS)$coefficients[1,]
(results=rbind(OLS,SS,DS))

##        Estimate Std. Error    t value    Pr(>|t|)
## OLS  0.24716089 0.78450163  0.3150547 0.755056170
## SS   0.31168793 0.09832465  3.1699876 0.002169693
## DS  -0.04432403 0.01531925 -2.8933558 0.003811493

The OLS estimate is positive, but its standard error is very large because we have only 90 observations for more than 60 variables. The Single Selection estimate is also positive and, in this case, significant. The Double Selection, however, gives a negative and significant coefficient. A negative coefficient on initial log GDP means that countries that start out poorer grow faster, so if the DS is correct, the data support our catch-up hypothesis, while the SS would have led us to the opposite conclusion. We can't say for sure that the DS is correct, but it is backed by strong theory and by many simulations showing that the SS is problematic. It is very unlikely that the SS results are more accurate than the DS. It is surprising how much the results can change from one case to the other. You should at least be skeptical when you see this type of modelling and the selection of controls is not clear.
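One way to read "negative and significant" is through a confidence interval. The sketch below is a minimal illustration, assuming a normal approximation and taking the DS estimate and standard error directly from the table above. The names `est`, `se` and `ci` are only for this example.

```r
# DS estimate and standard error as reported in the results table
est <- -0.04432403
se  <-  0.01531925

# Two-sided 95% interval under a normal approximation
z  <- qnorm(0.975)
ci <- c(lower = est - z * se, upper = est + z * se)
ci
```

Both ends of the interval are below zero, which is consistent with the small p-value in the DS row.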

The hdm package has several other implementations in this framework, such as instrumental variables and logit models, and there are more examples in the package vignette.

References

Belloni, A., V. Chernozhukov, and C. Hansen. “Inference on treatment effects after selection amongst high-dimensional controls.” https://arxiv.org/abs/1201.0224

Barro, Robert J., and Jong-Wha Lee. “Sources of economic growth.” Carnegie-Rochester conference series on public policy. Vol. 40. North-Holland, 1994. http://www.sciencedirect.com/science/article/pii/0167223194900027
