# Correlation is not transitive, in general at least: A simulation approach

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Let $$\rho_{XY}$$ be the correlation between the stochastic variables $$X$$ and $$Y$$ and similarly for $$\rho_{XZ}$$ and $$\rho_{YZ}$$.
If we know two of these, can we say anything about the third?

In a recent blog post I dealt with the
problem mathematically and I used the concept of a partial correlation coefficient.
Here I will take a simulation approach.

First z is simulated. Then x and y is simulated based on z
in a regression context with a slope between $$-1$$ and $$1$$.
The sign of the slope together with the variance of the error term
determine the correlation.

library(tidyverse)
library(patchwork)
theme_set(theme_bw())

sample_size <- 100

set.seed(1)

# First z is sampled
z <- rnorm(n = sample_size, mean = 0, sd = 1)

# Then x
slope_x <- runif(n = 1, min = -1, max = 1)
sd_error_x <- runif(n = 1, min = 0.001, max = 2)
x <- slope_x*z + rnorm(n = sample_size, mean = 0, sd = sd_error_x)

# And the y:
slope_y <- runif(n = 1, min = -1, max = 1)
sd_error_y <- runif(n = 1, min = 0.001, max = 2)
y <- slope_y*z + rnorm(n = sample_size, mean = 0, sd = sd_error_y)

This gives us:

p_xz <- ggplot() + geom_point(aes(x, z))
p_yz <- ggplot() + geom_point(aes(y, z))
p_xz + p_yz Now we know the correlation between $$x$$ and $$z$$ as well as between $$y$$ and $$z$$.
But we do not know it between $$x$$ and $$y$$, but we can calculate it.

cor(x, z)
##  -0.6728867
cor(y, z)
##  0.3551803
cor(x, y)
##  -0.1818432

Let us perform the simulation many, many times:

set.seed(1)
num_sims <- 100000
sim_res <- replicate(n = num_sims, expr = {
# First z is sampled
z <- rnorm(n = sample_size, mean = 0, sd = 1)

# Then x
slope_x <- runif(n = 1, min = -1, max = 1)
sd_error_x <- runif(n = 1, min = 0.001, max = 2)
x <- slope_x*z + rnorm(n = sample_size, mean = 0, sd = sd_error_x)

# And the y:
slope_y <- runif(n = 1, min = -1, max = 1)
sd_error_y <- runif(n = 1, min = 0.001, max = 2)
y <- slope_y*z + rnorm(n = sample_size, mean = 0, sd = sd_error_y)

return(c("rXZ" = cor(x, z), "rYZ" = cor(y, z), "rXY" = cor(x, y)))
})

We then make the illustration:

d <- t(sim_res) %>% as_tibble() %>%
mutate(rXY_sign = as.character(sign(rXY)))

p <- ggplot(d, aes(rXZ, rYZ, color = rXY_sign)) +
geom_point(alpha = 0.1) +
scale_color_manual(expression(paste("Sign of ", rho[XY])),
values = c("-1" = "red", "1" = "blue")) +
labs(x = expression(rho[XZ]),
y = expression(rho[YZ])) +
coord_equal()
p It can be difficult to see because of possible overplotting so we divide points into small bins and compute the different signs in each bin:

n_bins <- 50
d_binned <- d %>%
mutate(
rXZ_grp = cut(rXZ, breaks = n_bins),
rYZ_grp = cut(rYZ, breaks = n_bins)
) %>%
group_by(rXZ_grp, rYZ_grp) %>%
summarise(
rXZ = mean(rXZ),
rYZ = mean(rYZ),
n = n(),
rXY_sign = case_when(
all(sign(rXY) == -1) ~ "-1",
all(sign(rXY) == 1) ~ "1",
TRUE ~ "?"
))
d_binned
## # A tibble: 2,500 x 6
## # Groups:   rXZ_grp 
##    rXZ_grp    rYZ_grp          rXZ    rYZ     n rXY_sign
##
##  1 (-1,-0.96] (-1,-0.96]    -0.987 -0.986   128 1
##  2 (-1,-0.96] (-0.96,-0.92] -0.988 -0.943    60 1
##  3 (-1,-0.96] (-0.92,-0.88] -0.987 -0.901    56 1
##  4 (-1,-0.96] (-0.88,-0.84] -0.988 -0.862    54 1
##  5 (-1,-0.96] (-0.84,-0.8]  -0.983 -0.819    54 1
##  6 (-1,-0.96] (-0.8,-0.76]  -0.987 -0.780    52 1
##  7 (-1,-0.96] (-0.76,-0.72] -0.986 -0.742    42 1
##  8 (-1,-0.96] (-0.72,-0.68] -0.989 -0.699    58 1
##  9 (-1,-0.96] (-0.68,-0.64] -0.986 -0.659    61 1
## 10 (-1,-0.96] (-0.64,-0.6]  -0.988 -0.620    71 1
## # … with 2,490 more rows
p <- ggplot(d_binned, aes(rXZ, rYZ, fill = rXY_sign)) +
geom_tile(width = 0.05, height = 0.05) +
scale_fill_manual(expression(paste("Sign of ", rho[XY])),
values = c("-1" = "red", "1" = "blue", "?" = "grey")) +
labs(x = expression(rho[XZ]),
y = expression(rho[YZ])) +
coord_equal()
p We know from the previous blog post that we expect uncertainty about the sign inside the unit circle and certainty outside. Let us draw the unit circle on this, too:

p + annotate("path",
x = cos(seq(0, 2*pi, length.out = 100)),
y = sin(seq(0, 2*pi, length.out = 100)),
color = "green", size = 2) Somehow we have difficulty in sampling varying signs in the corners
just inside the circle.
(Maybe we need another simulation model.)
This could potentially lead us to believe that instead of a circle, we would have a
square:

p + annotate("path",
x = cos(seq(0, 2*pi, length.out = 100)),
y = sin(seq(0, 2*pi, length.out = 100)),
color = "green", size = 2, alpha = 0.5) +
annotate("segment", x = -1, xend = 0, y = 0, yend = 1, colour = "green", size = 2) +
annotate("segment", x = 0, xend = 1, y = 1, yend = 0, colour = "green", size = 2) +
annotate("segment", x = 1, xend = 0, y = 0, yend = -1, colour = "green", size = 2) +
annotate("segment", x = 0, xend = -1, y = -1, yend = 0, colour = "green", size = 2) This illustrates that one must be a bit cautious with blindly
relying on simulation results unless the
simulation mechanism and data-generating process is well-understood.

Let us try to fix that:
Instead of simlating $$x$$ and $$y$$ separately (depending only on $$z$$), we now allow for $$y$$ to also depend on $$x$$:

set.seed(1)
num_sims <- 100000
sim_res <- replicate(n = num_sims, expr = {
# First z is sampled
z <- rnorm(n = sample_size, mean = 0, sd = 1)

# Then x
slope_x <- runif(n = 1, min = -1, max = 1)
sd_error_x <- runif(n = 1, min = 0.001, max = 2)
x <- slope_x*z + rnorm(n = sample_size, mean = 0, sd = sd_error_x)

# And the y:
slope_x <- runif(n = 1, min = -1, max = 1) # new line
slope_y <- runif(n = 1, min = -1, max = 1)
sd_error_y <- runif(n = 1, min = 0.001, max = 2)
# Extra term: "slope_x*x"
y <- slope_x*x + slope_y*z + rnorm(n = sample_size, mean = 0, sd = sd_error_y)

return(c("rXZ" = cor(x, z), "rYZ" = cor(y, z), "rXY" = cor(x, y)))
})
n_bins <- 50
d_binned <- t(sim_res) %>%
as_tibble() %>%
mutate(
rXZ_grp = cut(rXZ, breaks = n_bins),
rYZ_grp = cut(rYZ, breaks = n_bins)
) %>%
group_by(rXZ_grp, rYZ_grp) %>%
summarise(
rXZ = mean(rXZ),
rYZ = mean(rYZ),
n = n(),
rXY_sign = case_when(
all(sign(rXY) == -1) ~ "-1",
all(sign(rXY) == 1) ~ "1",
TRUE ~ "?"),
rXY = mean(rXY),
)
p <- ggplot(d_binned, aes(rXZ, rYZ, fill = rXY_sign)) +
geom_tile(width = 0.05, height = 0.05) +
scale_fill_manual(expression(paste("Sign of ", rho[XY])),
values = c("-1" = "red", "1" = "blue", "?" = "grey")) +
labs(x = expression(rho[XZ]),
y = expression(rho[YZ])) +
coord_equal()
p + annotate("path",
x = cos(seq(0, 2*pi, length.out = 100)),
y = sin(seq(0, 2*pi, length.out = 100)),
color = "green", size = 2) Instead of coloring according to sign only, we can color it according to the mean
correlation in each small area:

library(wesanderson) # for wes_palette
p <- ggplot(d_binned, aes(rXZ, rYZ, fill = rXY)) +
geom_tile(width = 0.05, height = 0.05) +
colours = wes_palette("Zissou1", 100, type = "continuous"))
labs(x = expression(rho[XZ]),
y = expression(rho[YZ])) +
coord_equal()
## NULL
p + annotate("path",
x = cos(seq(0, 2*pi, length.out = 100)),
y = sin(seq(0, 2*pi, length.out = 100)),
color = "green", size = 2) R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.