
Cross-Fitting Double Machine Learning estimator

By Gabriel Vasconcelos

Motivation

In a recent post I talked about inference after model selection, showing that a simple double selection procedure is enough to solve the problem. In this post I am going to talk about a generalization of the double selection to any Machine Learning (ML) method, described by Chernozhukov et al. (2016).

Suppose you are in the following framework:


$$y_i = \theta d_i + g(z_i) + u_i$$
$$d_i = m(z_i) + v_i$$

where $\theta$ is the parameter of interest, $z_i$ is a set of control variables and $u_i$ and $v_i$ are error terms. The functions $g$ and $m$ are very generic and possibly non-linear.

How can we estimate $\theta$? The most naive way (and obviously wrong) is to simply estimate a regression using only $d$ to explain $y$. Another way is to try something similar to the double selection, which is referred to by Chernozhukov et al. (2016) as Double Machine Learning (DML):

1. Estimate $\hat{g}(z_i)$ with an ML model regressing $y$ on $z$;
2. Estimate $\hat{m}(z_i)$ with an ML model regressing $d$ on $z$ and compute the residuals $\hat{v}_i = d_i - \hat{m}(z_i)$;
3. Estimate $\hat{\theta} = \dfrac{\frac{1}{n}\sum_i \hat{v}_i (y_i - \hat{g}(z_i))}{\frac{1}{n}\sum_i \hat{v}_i d_i}$.

However, in this case the procedure above will leave a term in the asymptotic distribution of $\hat{\theta}$ that causes bias. This problem may be solved with sample splitting. Suppose we randomly split our observations into two samples of size $n/2$. The first sample will be indexed by $I$ and the auxiliary second sample will be indexed by $I^c$. We are going to estimate the first two steps above in the auxiliary sample and the third step will be done in sample $I$. Therefore:

$$\hat{\theta}(I, I^c) = \frac{\frac{1}{n/2}\sum_{i \in I} \hat{v}_i (y_i - \hat{g}(z_i))}{\frac{1}{n/2}\sum_{i \in I} \hat{v}_i d_i},$$

where $\hat{g}$ and $\hat{m}$ (and hence $\hat{v}_i = d_i - \hat{m}(z_i)$) were estimated using only the observations in $I^c$.

The estimator above is unbiased. However, you are probably wondering about the efficiency loss, because $\hat{\theta}$ was estimated using only half of the sample. To solve this problem we now do the opposite: first we estimate steps 1 and 2 in the sample $I$ and then we estimate $\hat{\theta}$ in the sample $I^c$. As a result, we will have $\hat{\theta}(I, I^c)$ and $\hat{\theta}(I^c, I)$. The Cross-Fitting estimator will be:

$$\hat{\theta}_{CF} = \frac{1}{2}\left[\hat{\theta}(I, I^c) + \hat{\theta}(I^c, I)\right],$$

which is a simple average of the two $\hat{\theta}$s.

Example

The best way to illustrate this topic is using simulation. I am going to generate data from the following model:

$$y_i = \theta d_i + \cos^2(z_i'b) + u_i$$
$$d_i = \sin(z_i'b) + \cos(z_i'b) + v_i$$

where $\theta = 0.5$, $b_j = 1/j$, $z_i$ is a vector of $k = 10$ correlated standard normal variables and $u_i$ and $v_i$ are standard normal errors. The sample size is $N = 500$.

library(clusterGeneration)
library(mvtnorm)
library(randomForest)

set.seed(123) # = Seed for Replication = #
N=500 # = Number of observations = #
k=10 # = Number of variables in z = #
theta=0.5
b=1/(1:k)

# = Generate covariance matrix of z = #
sigma=genPositiveDefMat(k,"unifcorrmat")$Sigma
sigma=cov2cor(sigma)

The ML model we are going to use to estimate steps 1 and 2 is the Random Forest. The simulation will estimate the simple OLS using only $d$ to explain $y$, the naive DML without sample splitting and the Cross-Fitting DML. The 500 simulations may take a few minutes.

set.seed(123)
M=500 # = Number of Simulations = #

# = Matrix to store results = #
thetahat=matrix(NA,M,3)
colnames(thetahat)=c("OLS","Naive DML","Cross-fitting DML")

for(i in 1:M){
  z=rmvnorm(N,sigma=sigma) # = Generate z = #
  g=as.vector(cos(z%*%b)^2) # = Generate the function g = #
  m=as.vector(sin(z%*%b)+cos(z%*%b)) # = Generate the function m = #
  d=m+rnorm(N) # = Generate d = #
  y=theta*d+g+rnorm(N) # = Generate y = #

  # = OLS estimate = #
  OLS=coef(lm(y~d))[2]
  thetahat[i,1]=OLS

  # = Naive DML = #
  # = Compute ghat with a random forest of y on z = #
  model=randomForest(z,y,maxnodes = 20)
  G=predict(model,z)
  # = Compute mhat with a random forest of d on z = #
  # = (named mhat to avoid overwriting M, the number of simulations) = #
  modeld=randomForest(z,d,maxnodes = 20)
  mhat=predict(modeld,z)
  # = Compute vhat as the residuals of the second model = #
  V=d-mhat
  # = Compute the naive DML theta = #
  theta_nv=mean(V*(y-G))/mean(V*d)
  thetahat[i,2]=theta_nv

  # = Cross-fitting DML = #
  # = Split sample = #
  I=sort(sample(1:N,N/2))
  IC=setdiff(1:N,I)
  # = Compute ghat on both samples: fit on one half, predict on the other = #
  model1=randomForest(z[IC,],y[IC],maxnodes = 10)
  model2=randomForest(z[I,],y[I], maxnodes = 10)
  G1=predict(model1,z[I,])
  G2=predict(model2,z[IC,])

  # = Compute mhat and vhat on both samples = #
  modeld1=randomForest(z[IC,],d[IC],maxnodes = 10)
  modeld2=randomForest(z[I,],d[I],maxnodes = 10)
  M1=predict(modeld1,z[I,])
  M2=predict(modeld2,z[IC,])
  V1=d[I]-M1
  V2=d[IC]-M2

  # = Compute the Cross-Fitting DML theta = #
  theta1=mean(V1*(y[I]-G1))/mean(V1*d[I])
  theta2=mean(V2*(y[IC]-G2))/mean(V2*d[IC])
  theta_cf=mean(c(theta1,theta2))
  thetahat[i,3]=theta_cf

}

colMeans(thetahat) # = check the average theta for all models = #

##               OLS         Naive DML Cross-fitting DML
##         0.5465718         0.4155583         0.5065751

# = plot distributions = #
plot(density(thetahat[,1]),xlim=c(0.3,0.7),ylim=c(0,14))
lines(density(thetahat[,2]),col=2)
lines(density(thetahat[,3]),col=4)
abline(v=0.5,lty=2,col=3)
legend("topleft",legend=c("OLS","Naive DML","Cross-fitting DML"),col=c(1,2,4),lty=1,cex=0.7,seg.len = 0.7,bty="n")

As you can see, the only distribution centered on the true value $\theta = 0.5$ is the Cross-Fitting DML. However, the model used to estimate the functions $g$ and $m$ must be able to capture the relevant information in the data. In the linear case you may use the LASSO and obtain a result similar to the double selection. Finally, what we did here was a 2-fold Cross-Fitting, but you can also do a K-fold Cross-Fitting just like you do a K-fold cross-validation (a sketch is given below). The choice of K does not matter asymptotically, but in small samples the results may change.
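To make the last point concrete, here is a minimal sketch of a K-fold version of the Cross-Fitting estimator, in the spirit of the 2-fold code above. It is not part of the original simulation: the fold assignment, the choice K = 5 and the random forest settings are illustrative assumptions. For each fold, g and m are estimated on the other K-1 folds and theta is estimated on the held-out fold; the final estimate averages the K fold-specific estimates.

library(clusterGeneration)
library(mvtnorm)
library(randomForest)

set.seed(123)
N=500; k=10; theta=0.5; b=1/(1:k)
sigma=cov2cor(genPositiveDefMat(k,"unifcorrmat")$Sigma)
z=rmvnorm(N,sigma=sigma)
d=as.vector(sin(z%*%b)+cos(z%*%b))+rnorm(N) # = Generate d = #
y=theta*d+as.vector(cos(z%*%b)^2)+rnorm(N)  # = Generate y = #

K=5 # = Number of folds (illustrative choice) = #
folds=sample(rep(1:K,length.out=N)) # = Random fold assignment = #
theta_k=rep(NA,K)
for(j in 1:K){
  In=which(folds==j)  # = Fold used to estimate theta = #
  Out=which(folds!=j) # = Remaining folds used to estimate g and m = #
  ghat=predict(randomForest(z[Out,],y[Out],maxnodes=10),z[In,])
  mhat=predict(randomForest(z[Out,],d[Out],maxnodes=10),z[In,])
  vhat=d[In]-mhat
  theta_k[j]=mean(vhat*(y[In]-ghat))/mean(vhat*d[In])
}
mean(theta_k) # = K-fold Cross-Fitting estimate of theta = #

In a linear specification the two randomForest calls could be replaced by a LASSO fit (for instance with the glmnet package), which brings the procedure back to something close to the double selection of the previous post.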

References

Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E., & Hansen, C. (2016). Double machine learning for treatment and causal parameters. arXiv preprint arXiv:1608.00060.

