Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

# CIS Primer Question 3.2.1

Here are my solutions to question 3.2.1 of Causal Inference in Statistics: a Primer (CISP).

Here are the parameters we’ll use. Note that they are taken from the Simpson’s revesal example of question 1.5.2.

r   <- 0.28   # fraction with syndrome

q0 <- 0.07  # P(X = 1 | Z = 0)
q1 <- 0.85  # P(X = 1 | Z = 1)

p00 <- 0.84 # P(Y = 1 | X = 0, Z = 0)
p10 <- 0.88 # P(Y = 1 | X = 1, Z = 0)
p01 <- 0.53 # P(Y = 1 | X = 0, Z = 1)
p11 <- 0.58 # P(Y = 1 | X = 1, Z = 1)

## Part a

We can simulate the intervention by generating values for $$X$$ independently of $$Z$$.

N <- 10000  # number of individuals

set.seed(53201)

part_a <- tibble(z = rbinom(N, 1, r)) %>%
mutate(
x = rbinom(n(), 1, 0.5), # no Z-dependence
p_y_given_x_z = case_when(
x == 0 & z == 0 ~ p00,
x == 0 & z == 1 ~ p01,
x == 1 & z == 0 ~ p10,
x == 1 & z == 1 ~ p11
),
y = rbinom(n(), 1, p_y_given_x_z)
) %>%
group_by(x, y) %>%
summarise(n = n()) %>%
group_by(x) %>%
mutate(p_y_given_do_x = n / sum(n))
x y n p_y_given_do_x
0 0 1238 0.2470565
0 1 3773 0.7529435
1 0 964 0.1932251
1 1 4025 0.8067749

## Part b

To simulate observational data, we need to include the dependence of $$X$$ on $$Z$$.

N <- 100000  # number of individuals

set.seed(95400)

p_x_y_z <- tibble(
id = 1:N,

z = rbinom(N, 1, r),
x = rbinom(N, 1, if_else(z == 0, q0, q1)),

p_y_given_x_z = case_when(
x == 0 & z == 0 ~ p00,
x == 0 & z == 1 ~ p01,
x == 1 & z == 0 ~ p10,
x == 1 & z == 1 ~ p11
),

y = rbinom(N, 1, p_y_given_x_z)
) %>%
group_by(x, y, z) %>%
count() %>%
ungroup() %>%
mutate(p = n / sum(n))
x y z n p
0 0 0 10723 0.10723
0 0 1 2020 0.02020
0 1 0 56015 0.56015
0 1 1 2205 0.02205
1 0 0 624 0.00624
1 0 1 10067 0.10067
1 1 0 4470 0.04470
1 1 1 13876 0.13876

In order to apply the causal effect rule, we’ll need $$\mathbb P(x \mid z)$$.

p_x_given_z <- p_x_y_z %>%
group_by(x, z) %>%
summarise(n = sum(n)) %>%
group_by(z) %>%
mutate(p = n / sum(n)) %>%
ungroup()
x z n p
0 0 66738 0.9290845
0 1 4225 0.1499929
1 0 5094 0.0709155
1 1 23943 0.8500071

We can then add the conditional probabilities to the joint distribution table, then sum overal all the $$Z$$ variables.

p_y_given_do_x <- p_x_y_z %>%
inner_join(
p_x_given_z,
by = c('x', 'z'),
suffix = c('_num', '_denom')
) %>%
mutate(p = p_num / p_denom) %>%
group_by(x, y) %>%
summarise(p = sum(p)) 
x y p
0 0 0.2500877
0 1 0.7499123
1 0 0.2064264
1 1 0.7935736

The causal effect estimates are very close to the simulated intervention.

## Part c

We can calculate ACE simply by taking the difference of the causal effect estimates.

ace <- p_y_given_do_x %>%
filter(y == 1) %>%
mutate(ace = 1 - 0) %>%
pull(ace)

ace
 0.04366134

This is different from the overall probability differences.

p_y_given_x <- p_x_y_z %>%
group_by(x, y) %>%
summarise(n = sum(n)) %>%
group_by(x) %>%
mutate(p = n / sum(n)) %>%
select(-n)

risk_difference <- p_y_given_x %>%
filter(y == 1) %>%
mutate(rd = 1 - 0) %>%
pull(rd)

risk_difference
 -0.188613

Making $$X$$ independent of $$Z$$ would minimise the disrepancy between ACE and RD, which would turn the adjustment formula into the formulat for $$\mathbb P(y \mid x$$. In other words, setting $$q_0 = q_1 = \mathbb P(X = 1)$$ would do the trick.

## Part d

Note that the desegregated causal effects

• $$p_{1, 0} - p_{0, 0}$$ is 0.04; and
• $$p_{1, 1} - p_{0, 1}$$ is 0.05,

are both consisent with our calculation for the overall causal effect, ACE = 4.37%. The generated data are an illustration of Simpson’s reversal because the risk difference, -18.9%, has the opposite sign.