[This article was first published on r on Everyday Is A School Day, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Front-door adjustment: a superhero method for handling unobserved confounding by using mediators (if present) to estimate causal effects accurately.

## Let’s DAG an estimand!

library(dagitty)

dag <- dagitty('dag {
bb="0,0,1,1"
U [pos="0.402,0.320"]
X [exposure,pos="0.176,0.539"]
Y [outcome,pos="0.614,0.539"]
Z [pos="0.409,0.539"]
U -> X
U -> Y
X -> Z
Z -> Y
}
')

plot(dag)


You can easily draw a DAG on dagitty and then copy and paste the code and slot it in the dag { code } and voila!

X is our treatment variable.
Y is our outcome variable.
U is confounder that is not measured.
Z is our mediator.

## Assume We Know The Truth

library(tidyverse)
library(DT)

# pseudorandomization
set.seed(1)

# Set sample
n <- 1000

# Set noise of each nodes
ux <- rnorm(n)
uy <- rnorm(n)
uz <- rnorm(n)
uu <- rnorm(n)

# Set each nodes' equation
u <- uu
x <- 0.4*u + ux
z <- 0.6*x + uz
y <- -0.5*z + uy + 0.6*u

# Create a table
df <- tibble(x=x,y=y,z=z,u=u)

# For easier viewing on blog
datatable(df)


The reason to assign uu, uy, uz, uu is basically to introduce some randomness to the equation. Then, the magic is when we write out the equations for u, x, z, y. This is what we meant by “knowing the truth”. We know exactly what coefficients for equations. For example, How much z would increase if there is an increase of 1 unit of x? The answer would be 0.6, as that is the coefficient we had set. We also know the exact u variable values are as well, but we’ll assume that we don’t and we’ll check if front-door adjustment can help us.

## The Right Model ✅

In order for us to know what a wrong model is, we have to know what the right model is first. Since we know u, let’s model it and find the correct coefficient of x, holding u constant.

As you can see the screenshot above, if we have the luxury to adjust for u, then we have essentially d-seperated x and y and we can estimate their coefficient. Let’s put in into linear regression model.

correct_model <- lm(y~x+u,df)
summary(correct_model)

##
## Call:
## lm(formula = y ~ x + u, data = df)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -3.4698 -0.7571 -0.0529  0.7869  3.8100
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.02431    0.03641  -0.668    0.505
## x           -0.31841    0.03520  -9.045   <2e-16 ***
## u            0.61805    0.03810  16.220   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.151 on 997 degrees of freedom
## Multiple R-squared:  0.2142,	Adjusted R-squared:  0.2126
## F-statistic: 135.8 on 2 and 997 DF,  p-value: < 2.2e-16


Looking at the coefficient for x, the correct estimate is -0.3184144

## The Wrong Model ❌

Now, let’s look at the wrong model. Let’s say naively, we want to fit just y ~ x. What would happen?

model_no_front <- lm(y~x,df)
summary(model_no_front)

##
## Call:
## lm(formula = y ~ x, data = df)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -4.1379 -0.8724 -0.0593  0.9068  4.0849
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.01287    0.04091  -0.315  0.75319
## x           -0.09505    0.03641  -2.611  0.00917 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.294 on 998 degrees of freedom
## Multiple R-squared:  0.006784,	Adjusted R-squared:  0.005789
## F-statistic: 6.817 on 1 and 998 DF,  p-value: 0.009167


Did you notice that the x coefficient is -0.0950511 ?

Let’s look at another wrong model, let’s control for Z and see what that shows?

model <- lm(y~z+x,df)
summary(model)

##
## Call:
## lm(formula = y ~ z + x, data = df)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -4.0450 -0.8006  0.0104  0.8316  3.9043
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.005363   0.037782  -0.142    0.887
## z           -0.483048   0.036701 -13.162  < 2e-16 ***
## x            0.216663   0.041125   5.268 1.69e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.195 on 997 degrees of freedom
## Multiple R-squared:  0.1538,	Adjusted R-squared:  0.1521
## F-statistic: 90.61 on 2 and 997 DF,  p-value: < 2.2e-16


Wait, what? Now x is 0.2166626 ??? The correct coefficient is -0.3184144 as we previously had calculated. What is going on here? 🤪

Here is a screenshot of wrongly adjusting for z

## What do we do? 🤷‍♂️

### In Comes The Front-door Adjustment! If we’re lucky… 🍀

Let’s look at the equation to calculate for front-door adjustment. It is divided into 2 steps

#### 1. Estimate Z and do(X)

The game of doCalculus is to how to get rid of the do

$$p(Z|do(X)) = \sum p(Z|X)$$.
Why is the above estimate true? Let’s go back to the DAG and change outcome to z

Did you notice that there is no need to adjust anything at all because there is a collider on y ? Can’t see it? Here you go: z -> y <- u -> x. Sometimes writing the equation out really helps.

#  1. Estimate Z and do(X)
model2 <- lm(z~x,df)
summary(model2)

##
## Call:
## lm(formula = z ~ x, data = df)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -3.5756 -0.6448 -0.0196  0.7013  2.7987
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  0.01553    0.03258   0.477    0.634
## x            0.64531    0.02900  22.254   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.03 on 998 degrees of freedom
## Multiple R-squared:  0.3317,	Adjusted R-squared:  0.331
## F-statistic: 495.2 on 1 and 998 DF,  p-value: < 2.2e-16


x coefficient is 0.6453056.

#### 2. Estimate Y and do(Z)

$$p(Y|do(Z)) = \sum p(Y|Z,X).p(X)$$. Why is the above estimate true? Let’s go back to the DAG and change outcome to y and exposure to z

Did you notice that we need to adjust for one node that we actually have data on to achieve d-separation? Can’t see it? Here you go: y <- u -> x -> z. What is it? x !!! The treatment itself, how uncanny !!!

Let’s check out this model

model <- lm(y~z+x,df)
summary(model)

##
## Call:
## lm(formula = y ~ z + x, data = df)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -4.0450 -0.8006  0.0104  0.8316  3.9043
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.005363   0.037782  -0.142    0.887
## z           -0.483048   0.036701 -13.162  < 2e-16 ***
## x            0.216663   0.041125   5.268 1.69e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.195 on 997 degrees of freedom
## Multiple R-squared:  0.1538,	Adjusted R-squared:  0.1521
## F-statistic: 90.61 on 2 and 997 DF,  p-value: < 2.2e-16


z coefficient is -0.4830482.

#### 3. Put it together - x coefficient * z coefficient

The true x coefficient in relation to y is essentially the multiplication of the above estimates!

$$p(Y|do(X)) = p(Y|do(Z)).p(Z|do(X)) = \sum p(Y|Z,X').p(X').\sum p(Z|X)$$

Let’s see if this is true with our codes and data

#correct estimated coefficient
model$coefficients[["z"]]*model2$coefficients[["x"]]

## [1] -0.3117137


Our estimate x coefficient is -0.3117137. Whereas our true x coefficient is -0.3184144. About 0.0067008 difference. Not too shabby at all !!! Notice that we did not need u data at all to estimate this. Awesomeness!!!

Yes, it’s not going to be exactly what the true estimate is but it’s close enough! You can mess around the n by increasing it and you will then get closer to the true estimate. Also depends on your set.seed, the true coefficient will be different too, give seed 123 a try.

### Lessons learnt

• Front-door analysis is another great tool when unobserved confounder is a problem
• Learnt how to estimate front-door formula from scratch is very helpful in understanding how the whole thing works
• To estimate whether a method is beneficial we can simulate equations and data and test the theory
• there are other packages such as doWhy which will calculate all these for you without a hassle in deriving your own formula