Using Machine Learning for Causal Inference

[This article was first published on r-bloggers – STATWORX, and kindly contributed to R-bloggers].

Machine Learning (ML) is still an underdog in the field of economics, although it has gained more and more recognition in recent years. One reason for its underdog status is that in economics and other social sciences one is interested not only in prediction but also in causal inference. Many “off-the-shelf” ML algorithms therefore solve a fundamentally different problem. We here at STATWORX also face a variety of such problems, e.g. dynamic pricing optimization.

“Prediction by itself is only occasionally sufficient. The post office is happy with any method that predicts correct addresses from hand-written scrawls…[But] most statistical surveys have the identification of causal factors as their ultimate goal.” – Bradley Efron


However, the literature combining ML and causal inference is growing by the day. One common problem in causal inference is the estimation of heterogeneous treatment effects. We will therefore take a look at three interesting and different approaches to it, with a focus on a very recent paper by Athey et al. that is forthcoming in “The Annals of Statistics” [1].

Model-based Recursive Partitioning

One of the earlier papers about causal trees is by Zeileis et al. (2008) [2]. They describe an algorithm for Model-based Recursive Partitioning (MOB), which extends recursive partitioning to more complex models. First, a parametric model is fitted to the data set using maximum likelihood. Then the parameters are tested for instability with respect to a set of predefined variables. Lastly, the model is split on the variable that shows the highest parameter instability. These steps are repeated in each of the daughter nodes until a stopping criterion is reached. However, the authors do not provide statistical properties for MOB, and the partitions are still quite large.
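The fit, test and split loop can be sketched in a few lines of Python. This is a toy illustration under my own assumptions, not the algorithm from the paper: the node model is a simple least-squares line, and the formal parameter instability tests are replaced by a cruder stand-in, namely an exhaustive search for the split that most reduces the squared error of the node model. All names (`fit_line`, `best_split`, `mob`) and the simulated data are hypothetical.

```python
import random


def fit_line(rows):
    """Least-squares fit of y = a + b * x; returns (a, b, sse)."""
    xs = [r["x"] for r in rows]
    ys = [r["y"] for r in rows]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx if sxx > 0 else 0.0
    a = my - b * mx
    sse = sum((y - a - b * x) ** 2 for x, y in zip(xs, ys))
    return a, b, sse


def best_split(rows, part_vars, min_size):
    """Stand-in for the instability test: largest SSE reduction over all cuts."""
    parent_sse = fit_line(rows)[2]
    best = None
    for var in part_vars:
        values = sorted({r[var] for r in rows})
        for lo, hi in zip(values, values[1:]):
            cut = (lo + hi) / 2
            left = [r for r in rows if r[var] <= cut]
            right = [r for r in rows if r[var] > cut]
            if len(left) < min_size or len(right) < min_size:
                continue
            gain = parent_sse - fit_line(left)[2] - fit_line(right)[2]
            if best is None or gain > best[0]:
                best = (gain, var, cut)
    return best


def mob(rows, part_vars, min_size=20, min_gain=1.0, max_depth=1, depth=0):
    """Fit the node model, split on the most unstable variable, recurse."""
    a, b, _ = fit_line(rows)
    split = best_split(rows, part_vars, min_size) if depth < max_depth else None
    if split is None or split[0] < min_gain:
        return {"intercept": a, "slope": b, "n": len(rows)}
    _, var, cut = split
    left = [r for r in rows if r[var] <= cut]
    right = [r for r in rows if r[var] > cut]
    return {
        "var": var,
        "cut": cut,
        "left": mob(left, part_vars, min_size, min_gain, max_depth, depth + 1),
        "right": mob(right, part_vars, min_size, min_gain, max_depth, depth + 1),
    }


# Simulated data: the slope of y on x flips at z1 = 0.5, z2 is irrelevant.
rng = random.Random(1)
rows = []
for _ in range(300):
    x, z1, z2 = rng.uniform(-2, 2), rng.random(), rng.random()
    slope = 2.0 if z1 <= 0.5 else -1.0
    rows.append({"x": x, "z1": z1, "z2": z2, "y": 1.0 + slope * x + rng.gauss(0, 0.5)})

tree = mob(rows, ["z1", "z2"])
print(tree)
```

On this simulated data the root should split on `z1` close to 0.5 and recover a different slope in each daughter node. A real MOB implementation would decide whether and where to split with a statistical test instead of a fixed `min_gain` threshold.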

Bayesian Additive Regression Tree

Another paper uses Bayesian Additive Regression Trees (BART) for the estimation of heterogeneous treatment effects [3]. One advantage of this approach is that BART can detect and handle interactions and non-linearities in the response surface. It uses a sum-of-trees model: a first weak-learning tree is grown, its residuals are calculated, and the next tree is fitted to these residuals. Similar to boosting algorithms, BART aims to avoid overfitting. This is achieved with a regularization prior, which restricts the contribution of each tree to the final result.
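To make the sum-of-trees idea concrete, here is a minimal Python sketch. It is not BART: BART places priors on the tree structure and leaf values and samples the trees by MCMC, whereas this sketch is a deterministic analogue in which each tree is a single-split stump, repeatedly refitted to the residuals left by all other trees, and a fixed shrinkage factor (`shrink`, my own assumption) plays the role of the regularization prior. All function names are hypothetical.

```python
import math
import random


def fit_stump(xs, res, shrink):
    """Best single split on sorted xs for residuals res; leaf values are shrunk means."""
    n = len(xs)
    total = sum(res)
    left_sum, best = 0.0, None
    for i in range(1, n):
        left_sum += res[i - 1]
        if xs[i] == xs[i - 1]:
            continue
        right_sum = total - left_sum
        # Maximising this score is equivalent to minimising the squared error.
        score = left_sum ** 2 / i + right_sum ** 2 / (n - i)
        if best is None or score > best[0]:
            best = (score, (xs[i - 1] + xs[i]) / 2, left_sum / i, right_sum / (n - i))
    _, cut, left_mean, right_mean = best
    return cut, shrink * left_mean, shrink * right_mean


def predict_stump(stump, x):
    cut, left_val, right_val = stump
    return left_val if x <= cut else right_val


def fit_sum_of_trees(xs, ys, n_trees=20, sweeps=5, shrink=0.5):
    """Backfitting: every tree is refitted to what the other trees leave unexplained."""
    offset = sum(ys) / len(ys)
    contrib = [[0.0] * len(xs) for _ in range(n_trees)]
    stumps = [None] * n_trees
    fitted = [offset] * len(xs)
    for _ in range(sweeps):
        for j in range(n_trees):
            # Partial residuals: remove tree j's own contribution from the fit.
            res = [y - f + c for y, f, c in zip(ys, fitted, contrib[j])]
            stumps[j] = fit_stump(xs, res, shrink)
            new = [predict_stump(stumps[j], x) for x in xs]
            fitted = [f - c + t for f, c, t in zip(fitted, contrib[j], new)]
            contrib[j] = new
    return offset, stumps


def rmse(y_true, y_pred):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true))


# Simulated non-linear response; xs must be sorted for fit_stump.
rng = random.Random(7)
xs = sorted(rng.uniform(0, 6) for _ in range(200))
ys = [math.sin(x) + rng.gauss(0, 0.2) for x in xs]

offset, stumps = fit_sum_of_trees(xs, ys)
preds = [offset + sum(predict_stump(s, x) for s in stumps) for x in xs]
rmse_fit = rmse(ys, preds)
rmse_base = rmse(ys, [offset] * len(ys))
print(rmse_fit, rmse_base)
```

Because every leaf value is shrunk, no single tree can explain the response on its own, so the fit has to be spread across many small trees.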

Generalized Random Forest

However, this and the next blog post will mainly focus on the Generalized Random Forest (GRF) by Athey et al., who have explored the possibilities of ML in economics before. It is a method for non-parametric statistical estimation that builds on the basic ideas of the random forest. It therefore keeps recursive partitioning, subsampling and random split selection. However, the final estimate is not obtained by simply averaging over the trees. Instead, the forest is used to estimate an adaptive weighting function: we grow a set of trees, and each observation is weighted by how often it falls into the same leaf as the target observation. These weights are then used to solve a “local GMM” model.
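The weighting step can be illustrated with a small Python sketch. It rests on several assumptions of mine: there is only one feature, the "trees" are grown with purely random cut points instead of a data-driven split rule, and each tree's contribution is normalised by the leaf size, which is how I read the weight definition in the paper. The moment condition solved at the end is the simplest possible one, a weighted mean. The names `grow_tree` and `forest_weights` are hypothetical.

```python
import bisect
import random


def grow_tree(xs, idx, n_cuts, rng):
    """Toy tree on one feature: random cut points turn the line into leaves."""
    cuts = sorted(xs[rng.choice(idx)] for _ in range(n_cuts))
    leaves = {}
    for i in idx:
        leaves.setdefault(bisect.bisect_left(cuts, xs[i]), []).append(i)
    return cuts, leaves


def forest_weights(xs, x0, n_trees=200, frac=0.5, n_cuts=8, seed=0):
    """alpha_i(x0): average over trees of 1{i shares x0's leaf} / leaf size."""
    rng = random.Random(seed)
    n = len(xs)
    alpha = [0.0] * n
    used = 0
    for _ in range(n_trees):
        idx = rng.sample(range(n), int(frac * n))  # subsampling
        cuts, leaves = grow_tree(xs, idx, n_cuts, rng)
        leaf = leaves.get(bisect.bisect_left(cuts, x0), [])
        if not leaf:
            continue  # x0 landed in an empty leaf of this tree
        used += 1
        for i in leaf:
            alpha[i] += 1.0 / len(leaf)
    return [a / used for a in alpha]


# Simulated data with E[y | x] = 2 * x, so the target value at x0 = 0.5 is 1.
rng = random.Random(5)
xs = [rng.random() for _ in range(500)]
ys = [2.0 * x + rng.gauss(0, 0.1) for x in xs]

x0 = 0.5
alpha = forest_weights(xs, x0)

# Simplest local moment condition: sum_i alpha_i * (y_i - theta) = 0.
theta = sum(a * y for a, y in zip(alpha, ys))
print(theta)
```

The weights sum to one and concentrate on observations close to `x0`. Swapping the last step for a different weighted estimating equation, for example a weighted quantile, changes what is estimated without changing how the weights are built.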

Another important piece of the GRF is the split selection algorithm, which emphasizes maximizing heterogeneity. This framework allows a wide variety of applications, such as quantile regression but also the estimation of heterogeneous treatment effects. The split selection must therefore be suitable for many different purposes. As in Breiman's random forest, splits are selected greedily. However, in the case of general moment estimation, we don't have a direct loss criterion to minimize. Instead, we want to maximize a criterion ∆, which favors splits that increase the heterogeneity of the in-sample estimates. Maximizing ∆ directly, on the other hand, would be computationally costly, so Athey et al. use a gradient-based approximation for it. This results in computational performance similar to standard CART approaches.
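For intuition, it helps to look at the one case where a direct loss criterion does exist: plain regression, where the parameter in each node is just the mean of y. The Python sketch below assumes that ∆ takes the form n_L · n_R / n² · (θ_L − θ_R)², which is my reading of the paper and should be checked against it. Under that assumption, the reduction in squared error from a split equals n · ∆ exactly, so both criteria pick the same split. The gradient-based approximation is not shown.

```python
import random


def mean(values):
    return sum(values) / len(values)


def sse(values):
    m = mean(values)
    return sum((y - m) ** 2 for y in values)


def delta(left, right):
    """Heterogeneity criterion (assumed form) with node means as the estimates."""
    n_l, n_r = len(left), len(right)
    n = n_l + n_r
    return n_l * n_r / n ** 2 * (mean(left) - mean(right)) ** 2


# Responses ordered by a single feature, with a jump in the mean after 30 points.
rng = random.Random(3)
ys = [rng.gauss(0, 1) + (1.5 if i >= 30 else 0.0) for i in range(50)]
n = len(ys)

splits = range(1, n)  # split i sends ys[:i] left and ys[i:] right
sse_gain = [sse(ys) - sse(ys[:i]) - sse(ys[i:]) for i in splits]
deltas = [delta(ys[:i], ys[i:]) for i in splits]

best_by_gain = max(splits, key=lambda i: sse_gain[i - 1])
best_by_delta = max(splits, key=lambda i: deltas[i - 1])
print(best_by_gain, best_by_delta)
```

This equivalence is one way to see why a regression forest built on the heterogeneity criterion should behave like a standard CART-based forest.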

Comparing the regression forest of GRF to standard random forest

Athey et al. claim in their paper that, in the special case of a regression forest, the GRF gets the same results as the standard random forest by Breiman (2001). One estimation method already implemented in the grf package [4] is a regression forest. I will therefore compare its results with the random forest implementations of the randomForest package and the ranger package. For tuning purposes, I will use a random search with 50 iterations for randomForest and ranger, and the built-in tune_regression_forest() function for grf. The algorithms are benchmarked on three data sets that were already used in another blog post, with the RMSE as the comparison metric. For easy handling, I implemented regression_forest() in the caret framework; the code can be found on my GitHub.
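The tuning procedure itself is simple enough to sketch. The following Python example is not the benchmark code: the model is a toy nearest-neighbour regressor on simulated data, and the search space, the hold-out split and all function names are assumptions made for illustration. It only shows the mechanics of a random search with 50 iterations that keeps the configuration with the lowest validation RMSE.

```python
import math
import random


def rmse(y_true, y_pred):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true))


def knn_predict(train, x, k):
    """Toy model to tune: average y over the k nearest training points."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    return sum(y for _, y in nearest) / len(nearest)


def random_search(space, score, n_iter=50, seed=0):
    """Draw n_iter random configurations and keep the one with the lowest score."""
    rng = random.Random(seed)
    history = []
    for _ in range(n_iter):
        params = {name: rng.choice(values) for name, values in space.items()}
        history.append((score(params), params))
    return min(history, key=lambda h: h[0]), history


def simulate(n, rng):
    data = []
    for _ in range(n):
        x = rng.uniform(0, 6)
        data.append((x, math.sin(x) + rng.gauss(0, 0.2)))
    return data


rng = random.Random(11)
train, valid = simulate(150, rng), simulate(50, rng)
valid_y = [y for _, y in valid]


def score(params):
    """Validation RMSE of the toy model for one hyperparameter configuration."""
    preds = [knn_predict(train, x, params["k"]) for x, _ in valid]
    return rmse(valid_y, preds)


space = {"k": list(range(1, 41))}
best, history = random_search(space, score, n_iter=50)

train_mean = sum(y for _, y in train) / len(train)
baseline = rmse(valid_y, [train_mean] * len(valid))
print(best, baseline)
```

With more than one hyperparameter, `space` simply gets more keys; each iteration then draws one value per key independently.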

Data Set   Metric   grf     ranger   randomForest
air        RMSE     0.25    0.24     0.24
bike       RMSE     2.90    2.41     2.67
gas        RMSE     36.0    32.6     34.4

The GRF performs slightly worse than the other implementations. However, this could also be due to the tuning of the parameters, because there are more parameters to tune. According to their GitHub, the authors are planning to improve the tune_regression_forest() function.

One advantage of the GRF is that it produces asymptotically valid confidence intervals for each estimation point. To achieve this, the authors use honest tree splitting, which was first described in their paper about causal trees [5]. With honest tree splitting, one sample is used to make the splits and another, distinct sample is used to estimate the coefficients.
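Why the second sample matters can be seen in a small simulation. The Python sketch below is my own illustration, not code from the grf package: the response is pure noise, so the true difference between any two leaves is zero. A single split is chosen on one half of the data, and the gap between the two leaf means is then measured once on the same half (adaptive) and once on the other half (honest). The function names and the fallback for an empty leaf are assumptions.

```python
import random


def mean(values):
    return sum(values) / len(values)


def choose_cut(points, min_leaf):
    """CART-style split on one feature: maximise n_l * n_r / n * (mean_l - mean_r)^2."""
    pts = sorted(points)
    n = len(pts)
    total = sum(y for _, y in pts)
    left_sum, best = 0.0, None
    for i in range(1, n):
        left_sum += pts[i - 1][1]
        if i < min_leaf or n - i < min_leaf or pts[i][0] == pts[i - 1][0]:
            continue
        gap = left_sum / i - (total - left_sum) / (n - i)
        score = i * (n - i) / n * gap ** 2
        if best is None or score > best[0]:
            best = (score, (pts[i - 1][0] + pts[i][0]) / 2)
    return best[1]


def leaf_means(points, cut):
    """Leaf means for a given cut; an empty leaf falls back to the overall mean."""
    left = [y for x, y in points if x <= cut]
    right = [y for x, y in points if x > cut]
    overall = mean([y for _, y in points])
    return (mean(left) if left else overall, mean(right) if right else overall)


def adaptive_and_honest_gap(n, rng, min_leaf=10):
    """Choose the split on half A, then measure the leaf gap on A and on B."""
    data = [(rng.random(), rng.gauss(0, 1)) for _ in range(n)]  # y is pure noise
    half_a, half_b = data[: n // 2], data[n // 2:]
    cut = choose_cut(half_a, min_leaf)
    a_left, a_right = leaf_means(half_a, cut)
    b_left, b_right = leaf_means(half_b, cut)
    return abs(a_left - a_right), abs(b_left - b_right)


rng = random.Random(42)
gaps = [adaptive_and_honest_gap(200, rng) for _ in range(200)]
adaptive_gap = mean([g[0] for g in gaps])
honest_gap = mean([g[1] for g in gaps])
print(adaptive_gap, honest_gap)
```

The adaptive gap is inflated because the split was picked precisely to make the two leaves look different on that sample, while the honest gap only reflects sampling noise. Removing this selection effect is what makes inference on the leaf estimates possible.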

However, standard regression is not the exciting part of the Generalized Random Forest. Therefore, I will take a look at how the GRF performs in estimating heterogeneous treatment effects with simulated data and compare it to the estimation results of the MOB and the BART in my next blog post.


  1. Athey, Tibshirani, Wager. Forthcoming. “Generalized Random Forests”
  2. Zeileis, Hothorn, Hornik. 2008. “Model-based Recursive Partitioning”
  3. Hill. 2011. “Bayesian Nonparametric Modeling for Causal Inference”
  5. Athey and Imbens. 2016. “Recursive partitioning for heterogeneous causal effects”

About the author
Markus Berroth

Markus is a member of our Data Science team and a Python expert. In his free time he likes to go on weekend city trips.

The post Using Machine Learning for Causal Inference first appeared on STATWORX.
