[This article was first published on r on Everyday Is A School Day, and kindly contributed to R-bloggers].

Beware of what we adjust. As we have demonstrated, adjusting for a collider variable can lead to a false estimate in your analysis. If a collider is included in your model, relying solely on AIC/BIC for model selection may provide misleading results and give you a false sense of achievement.

### Let’s DAG Out the True Causal Model

```r
# Loading libraries
library(ipw)
library(tidyverse)
library(dagitty)
library(DT)

# Creating the actual/known DAG
dag2 <- dagitty('dag {
bb="0,0,1,1"
X [exposure,pos="0.234,0.502"]
Y [outcome,pos="0.486,0.498"]
Z [pos="0.232,0.264"]
collider [pos="0.181,0.767"]
W -> X
W -> Y
X -> Y
X -> collider
Y -> collider
Z -> W
Z -> X
}
')

plot(dag2)
```

• X: Treatment/exposure variable/node.
• Y: Outcome variable/node.
• Z: Something… not sure what this is called 🤣.
• W: Confounder.
• collider: Collider (V-structure).

Let’s find the minimal adjustment set for estimating the total effect of X on Y:

```r
adjustmentSets(dag2)

## { W }
```

Perfect! We only need to adjust for W. Now that we know which node to adjust for, let’s simulate the data!

```r
set.seed(1)

# set number of observations
n <- 100
z <- rnorm(n)
w <- 0.6*z + rnorm(n)
x <- 0.5*z + 0.2*w + rnorm(n)
y <- 0.5*x + 0.4*w
collider <- -0.4*x + -0.4*y

# turning y and x into binary/categorical
x1 <- ifelse(x > mean(x), 1, 0)
y1 <- ifelse(y > mean(y), 1, 0)

# create a dataframe for the simulated data
df <- tibble(z=z,w=w,y=y1,x=x1, collider=collider)

datatable(df)
```

The reason we converted `x` and `y` into binary variables was to replicate a question that is often encountered in our research. In many cases, both the treatment and outcome are binary. Hence, the transformation. However, it is possible to keep the variables as they are and use linear regression instead of logistic regression, which I’ll be using shortly. Additionally, I used the `mean` to binarize both variables into `1` or `0` because they were approximately normally distributed, and I wanted to balance them. You can experiment with different threshold values if needed.
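As a quick illustration (hypothetical, not part of the original analysis), here is how the choice of threshold changes the balance of the resulting binary variable:

```r
# Hypothetical illustration: binarizing a roughly normal variable at
# different thresholds and checking how balanced the result is.
set.seed(1)
v <- rnorm(100)

v_mean   <- ifelse(v > mean(v), 1, 0)            # split at the mean
v_median <- ifelse(v > median(v), 1, 0)          # exact 50/50 split
v_q75    <- ifelse(v > quantile(v, 0.75), 1, 0)  # rarer "exposure"

# For approximately normal data the mean and median splits are similar;
# a quantile split deliberately unbalances the groups.
c(mean_split = sum(v_mean), median_split = sum(v_median), q75_split = sum(v_q75))
```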

### Simple Model (y ~ x)

```r
model_c <- glm(y ~ x, data = df)
summary(model_c)

##
## Call:
## glm(formula = y ~ x, data = df)
##
## Deviance Residuals:
##     Min       1Q   Median       3Q      Max
## -0.8148  -0.1739   0.1852   0.1852   0.8261
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  0.17391    0.05721   3.040  0.00304 **
## x            0.64090    0.07786   8.232 8.11e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 0.15058)
##
##     Null deviance: 24.960  on 99  degrees of freedom
## Residual deviance: 14.757  on 98  degrees of freedom
## AIC: 98.441
##
## Number of Fisher Scoring iterations: 2
```

Note that the true `x` coefficient is `0.5`, while our naive model estimates 0.6409018.

### Correct Model (y ~ x + w) ✅

```r
model_cz <- glm(y ~ x + w, data = df)
summary(model_cz)

##
## Call:
## glm(formula = y ~ x + w, data = df)
##
## Deviance Residuals:
##     Min       1Q   Median       3Q      Max
## -0.6073  -0.2124  -0.0082   0.2211   0.5647
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  0.25054    0.04420   5.668 1.48e-07 ***
## x            0.48667    0.06158   7.903 4.32e-12 ***
## w            0.24174    0.02808   8.609 1.34e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 0.08623743)
##
##     Null deviance: 24.960  on 99  degrees of freedom
## Residual deviance:  8.365  on 97  degrees of freedom
## AIC: 43.677
##
## Number of Fisher Scoring iterations: 2
```

Wow, the correct model recovers an `x` coefficient of 0.4866734, close to the true 0.5. Not bad with an `n` of 100!
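As a supplementary check (not in the original post), we can regenerate the same simulated data and confirm that the correct model's 95% Wald confidence interval covers the true value of 0.5:

```r
# Quick check: does the correct model's 95% Wald CI cover the true 0.5?
# Regenerates the same simulated data as above (same seed, same order).
set.seed(1)
n <- 100
z <- rnorm(n)
w <- 0.6*z + rnorm(n)
x <- 0.5*z + 0.2*w + rnorm(n)
y <- 0.5*x + 0.4*w
df <- data.frame(w = w,
                 x = ifelse(x > mean(x), 1, 0),
                 y = ifelse(y > mean(y), 1, 0))

fit <- glm(y ~ x + w, data = df)
ci <- confint.default(fit)["x", ]  # Wald interval for the x coefficient
ci
ci[1] < 0.5 && 0.5 < ci[2]
```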

### Alright, What About Everything Including Collider (y ~ x + z + w + collider) ❌

```r
model_czwall <- glm(y ~ x + z + w + collider, data = df)
summary(model_czwall)

##
## Call:
## glm(formula = y ~ x + z + w + collider, data = df)
##
## Deviance Residuals:
##      Min        1Q    Median        3Q       Max
## -0.53681  -0.23608  -0.00639   0.20049   0.50698
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept)  0.354932   0.054370   6.528 3.25e-09 ***
## x            0.273625   0.090835   3.012  0.00332 **
## z            0.004099   0.039375   0.104  0.91731
## w            0.192690   0.032518   5.926 4.97e-08 ***
## collider    -0.198703   0.066765  -2.976  0.00370 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 0.07997983)
##
##     Null deviance: 24.9600  on 99  degrees of freedom
## Residual deviance:  7.5981  on 95  degrees of freedom
## AIC: 38.06
##
## Number of Fisher Scoring iterations: 2
```

😱 `x` is now 0.2736252. Not good!

### Let’s Visualize All Models

```r
load("df_model.rda")

df_model |>
  mutate(color = case_when(
    str_detect(formula, "collider") ~ 1,
    TRUE ~ 0
  )) |>
  ggplot(aes(x = formula, y = estimate, ymin = lower, ymax = upper)) +
  geom_point(aes(color = as.factor(color))) +
  geom_linerange(aes(color = as.factor(color))) +
  scale_color_manual(values = c("grey", "red")) +
  coord_flip() +
  geom_hline(yintercept = 0.5) +
  theme_minimal() +
  theme(legend.position = "none")
```

The red lines represent models that adjust for the collider. Note that none of these models contain the true value within their 95% confidence intervals. Adjusting for colliders can lead to biased estimates, particularly when both the treatment and the outcome directly affect the collider. Careful consideration should be given before including such variables in an analysis.
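The `df_model` object loaded above was prepared offline; here is a minimal sketch of how such a table could be assembled (object and column names here are assumptions, and Wald intervals via `confint.default` stand in for whatever interval method the original used):

```r
# Sketch: assembling estimates and 95% CIs for several candidate models.
# Regenerates the simulated data, then fits each formula in turn.
set.seed(1)
n <- 100
z <- rnorm(n)
w <- 0.6*z + rnorm(n)
x <- 0.5*z + 0.2*w + rnorm(n)
y <- 0.5*x + 0.4*w
collider <- -0.4*x + -0.4*y
df <- data.frame(z = z, w = w,
                 y = ifelse(y > mean(y), 1, 0),
                 x = ifelse(x > mean(x), 1, 0),
                 collider = collider)

formulas <- c("y ~ x", "y ~ x + w", "y ~ x + w + collider")
df_model <- do.call(rbind, lapply(formulas, function(f) {
  fit <- glm(as.formula(f), data = df)
  ci  <- confint.default(fit)["x", ]   # Wald 95% CI for the x coefficient
  data.frame(formula = f, estimate = coef(fit)[["x"]],
             lower = ci[[1]], upper = ci[[2]],
             aic = AIC(fit), bic = BIC(fit))
}))
df_model
```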

### Let’s check IPW

```r
ipw <- ipwpoint(
  exposure = x,
  family = "binomial",
  denominator = ~ w,
  data = as.data.frame(df))

model_cipw <- glm(y ~ x, data = df |> mutate(ipw = ipw$ipw.weights), weights = ipw)
summary(model_cipw)
summary(model_cipw)

##
## Call:
## glm(formula = y ~ x, data = mutate(df, ipw = ipw$ipw.weights),
##     weights = ipw)
##
## Deviance Residuals:
##     Min       1Q   Median       3Q      Max
## -1.3954  -0.3649   0.3050   0.3581   1.5342
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  0.25974    0.06371   4.077 9.29e-05 ***
## x            0.46462    0.08946   5.194 1.12e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 0.3980486)
##
##     Null deviance: 49.746  on 99  degrees of freedom
## Residual deviance: 39.009  on 98  degrees of freedom
## AIC: 131.05
##
## Number of Fisher Scoring iterations: 2
```

`x` is now 0.4646218, quite similar to the correct model's estimate. What if we use the collider instead?

```r
ipw <- ipwpoint(
  exposure = x,
  family = "binomial",
  denominator = ~ collider,
  data = as.data.frame(df))

model_cipw2 <- glm(y ~ x, data = df |> mutate(ipw = ipw$ipw.weights), weights = ipw)
summary(model_cipw2)
summary(model_cipw2)

##
## Call:
## glm(formula = y ~ x, data = mutate(df, ipw = ipw$ipw.weights),
##     weights = ipw)
##
## Deviance Residuals:
##     Min       1Q   Median       3Q      Max
## -1.9717  -0.3421   0.3829   0.3835   1.9563
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  0.34073    0.07161   4.758 6.73e-06 ***
## x            0.27633    0.09741   2.837  0.00554 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 0.315229)
##
##     Null deviance: 33.429  on 99  degrees of freedom
## Residual deviance: 30.892  on 98  degrees of freedom
## AIC: 156.91
##
## Number of Fisher Scoring iterations: 2
```

Youza! `x` is now 0.2763335 with `collider`. NOT GOOD!!!
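Under the hood, `ipwpoint` with a binary exposure fits a propensity model and inverts the fitted probabilities. A hand-rolled sketch (my approximation of the idea, not the package's exact internals) reproduces the collider-weighted estimate:

```r
# Sketch: unstabilized inverse-probability weights computed by hand,
# mirroring ipwpoint(exposure = x, denominator = ~ collider).
set.seed(1)
n <- 100
z <- rnorm(n)
w <- 0.6*z + rnorm(n)
x <- 0.5*z + 0.2*w + rnorm(n)
y <- 0.5*x + 0.4*w
collider <- -0.4*x + -0.4*y
df <- data.frame(collider = collider,
                 x = ifelse(x > mean(x), 1, 0),
                 y = ifelse(y > mean(y), 1, 0))

# "Propensity" model: probability of exposure given the collider
ps  <- glm(x ~ collider, family = binomial, data = df)$fitted.values
wts <- ifelse(df$x == 1, 1 / ps, 1 / (1 - ps))  # unstabilized weights

fit <- glm(y ~ x, data = df, weights = wts)
coef(fit)[["x"]]
```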

A clinical example of collider adjustment leading to d-connection: suppose the outcome is mortality and the collider is recurrence of a medical condition. Both the treatment and mortality can directly affect recurrence (recurrence would likely decrease when mortality occurs), making recurrence a common effect of the two. Adjusting for it opens a biasing path between treatment and outcome.

### What If We Just Use Stepwise Regression?

```r
datatable(df_model |> select(formula, aic, bic),
          options = list(order = list(list(2, "asc"))))
```

Wow, when sorted in ascending order, the two lowest AIC values belong to models that include the collider! This is surprising! It highlights the importance of caution when using stepwise regression or automated model selection techniques, especially if you are uncertain about the underlying causal model. Blindly relying on these methods without understanding the causal relationships can lead to misleading results.
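To make that concrete (a supplementary sketch using base R's `step`, not the original post's code), backward AIC selection on the full model happily keeps the collider and returns the biased estimate:

```r
# Sketch: backward stepwise AIC selection on the simulated data.
set.seed(1)
n <- 100
z <- rnorm(n)
w <- 0.6*z + rnorm(n)
x <- 0.5*z + 0.2*w + rnorm(n)
y <- 0.5*x + 0.4*w
collider <- -0.4*x + -0.4*y
df <- data.frame(z = z, w = w,
                 x = ifelse(x > mean(x), 1, 0),
                 y = ifelse(y > mean(y), 1, 0),
                 collider = collider)

full <- glm(y ~ x + z + w + collider, data = df)
best <- step(full, direction = "backward", trace = 0)

formula(best)        # the collider survives selection
coef(best)[["x"]]    # biased well below the true 0.5
```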

### Lessons learnt

• Having a well-defined Causal Estimand is crucial! It requires careful consideration of the clinical context and the specific question you want to address.
• Blindly adjusting for all available variables can be dangerous, as it may lead to spurious correlations. Selecting variables to adjust for should be guided by domain knowledge and causal reasoning.
• If you’re interested in the code for all of the models, click here