
Bob Horton
Sr Data Scientist, Microsoft

Wikipedia describes Simpson’s paradox as “a trend that appears in different groups of data but disappears or reverses when these groups are combined.” Here is the figure from the top of that article. (You can click on the image in Wikipedia, then follow the “more details” link to find the R code used to generate it; there is a lot of R in Wikipedia.) I rearranged it a bit to put the values in a data frame, which makes it easier to think of the “color” column as a confounding variable:

x y color
1 6 1
2 7 1
3 8 1
4 9 1
8 1 2
9 2 2
10 3 2
11 4 2
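For readers following along, a data frame matching the table can be built directly; the name simpson_data matches the model calls below, and the values are taken straight from the table:

```r
# Construct the example data from the Wikipedia figure as a data frame,
# with "color" playing the role of the confounding variable
simpson_data <- data.frame(
  x     = c(1, 2, 3, 4, 8, 9, 10, 11),
  y     = c(6, 7, 8, 9, 1, 2, 3, 4),
  color = c(1, 1, 1, 1, 2, 2, 2, 2)
)
```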

If we do not consider this confounder, we find that the coefficient of x is negative (the dashed line in the figure above):

coefficients(lm(y ~ x, data=simpson_data))
## (Intercept)           x
##   8.3333333  -0.5555556

If we do take the confounder into account, we see that the coefficient of x is positive:

coefficients(lm(y ~ x + color, data=simpson_data))
## (Intercept)           x       color
##          17           1         -12

In his book Causality, Judea Pearl makes a more sweeping statement regarding Simpson’s paradox: “Any statistical relationship between two variables may be reversed by including additional factors in the analysis.” [Pearl2009]

That sounds fun; let’s try it.

First we’ll make variables x and y with a simple linear relationship. I’ll use the same slopes and intercepts as in the Wikipedia figure, both to show the parallel and to demonstrate the incredible cosmic power I have to bend coefficients to my will.

set.seed(1)
N <- 3000
x <- rnorm(N)
m <- -0.5555556
b <- 8.3333333
y <- m * x + b + rnorm(length(x))
plot(x, y, col="gray", pch=20, asp=1)
fit <- lm(y ~ x)
abline(fit, lty=2, lwd=2)

When we look at the slope of the regression line determined by fitting the model, it is almost exactly equal to the constant m that we used to determine y.

coefficients(fit)
## (Intercept)           x
##   8.3284021  -0.5358175

We get out what we put in; the coefficient of x is essentially the slope we originally gave y when we generated it (-0.5555556). This is the ‘effect’ of x, in that a one unit increase in x apparently increases y by this amount.

Now think about how to concoct a confounding variable to reverse the coefficient of x. This figure shows one way to approach the problem – group the points into a set of parallel stripes, with the stripes sloping in a different direction from the overall dataset:

m_new <- 1 # the new coefficient we want x to have
cdf <- confounded_data_frame(x, y, m_new, num_grp=10) # see function below
striped_scatterplot(y ~ x, cdf) # also see below

The stripes were made by specifying a reference line with a slope equal to the x-coefficient we want to achieve, and calculating the distance to that line for each point. Putting these distances into categories (by rounding off some multiple of the distance) then groups the points into stripes (shown as colors in the figure). A regression line was then fitted separately to the set of points within each stripe. The regression lines for the stripes on the very ends can be a bit wild, since these groups are very small and scattered, but the ones near the center, representing the majority of the data points, have a quite consistent slope.

The equation for determining the distance from a point to a line is (of course) right there in Wikipedia.

With a little rearranging to express the line in terms of y-intercept (b) and slope (m), and leaving off the absolute value so that points below the line have negative distances (and thus end up in a different group from the stripe with a positive distance of the same magnitude), we get this function:

point_line_distance <- function(b, m, x, y)
  (y - (m*x + b))/sqrt(m^2 + 1)
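As a quick sanity check, the signed distance behaves as expected on a couple of hand-computed points (these example values are mine, not from the original figure):

```r
point_line_distance <- function(b, m, x, y)
  (y - (m*x + b))/sqrt(m^2 + 1)

# point (0, 1) sits above the line y = x, at perpendicular distance 1/sqrt(2)
point_line_distance(b = 0, m = 1, x = 0, y = 1)  # 0.7071068

# point (1, 0) sits below the same line, so its signed distance is negative
point_line_distance(b = 0, m = 1, x = 1, y = 0)  # -0.7071068
```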

Here are functions for putting the points into stripewise groups, determining the regression coefficients for each group, and putting it all together into a figure:

confounded_data_frame <- function(x, y, m, num_grp){
  b <- 0 # intercept doesn't matter
  d <- point_line_distance(b, m, x, y)
  d_scaled <- 0.0005 + 0.999 * (d - min(d))/(max(d) - min(d)) # avoid 0 and 1
  data.frame(x=x, y=y,
             group=as.factor(sprintf("grp%02d", ceiling(num_grp*d_scaled))))
}

find_group_coefficients <- function(data){
  coef <- t(sapply(levels(data$group),
                   function(grp) coefficients(lm(y ~ x, data=data[data$group==grp,]))))
  coef[!is.na(coef[,1]) & !is.na(coef[,2]),]
}

striped_scatterplot <- function(formula, grouped_data){
  # blue on top and red on bottom, to match the Wikipedia figure
  colors <- rev(rainbow(length(levels(grouped_data$group)), end=2/3))
  plot(formula, grouped_data, bg=colors[grouped_data$group], pch=21, asp=1)
  grp_coef <- find_group_coefficients(grouped_data)
  # if some coefficients get dropped, colors won't match exactly
  for (r in 1:nrow(grp_coef))
    abline(grp_coef[r,1], grp_coef[r,2], col=colors[r], lwd=2)
}

Note that the regression lines for each group are not exactly parallel to the stripes. This is because linear regression is about minimizing the squared error on the y-axis, not the distance of points from the line. However, the thinner the stripes are, the closer the group regression lines are to our target slope. If we make a large number of thin stripes, the coefficient of x when the groups are taken into account is essentially the same as the slope of the reference line we used to orient the stripes:

cdf100 <- confounded_data_frame(x, y, m_new, num_grp=100)
# without confounder
coefficients(lm(y ~ x, cdf100))['x']
##          x
## -0.5358175
# with confounder
coefficients(lm(y ~ x + group, cdf100))['x']
##         x
## 0.9961566

This approach gives us the power to synthesize simulated confounders that can change the coefficient of x to pretty much any value we choose when a model is fitted with the confounder taken into account. Plus, it makes pretty rainbows.
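To illustrate Pearl’s “any value” claim more directly, we can sweep a few arbitrary target slopes through the same machinery. This sketch re-defines the helpers from above so it runs on its own; the particular target values are my own choices:

```r
set.seed(1)
x <- rnorm(3000)
y <- -0.5555556 * x + 8.3333333 + rnorm(length(x))

point_line_distance <- function(b, m, x, y)
  (y - (m*x + b))/sqrt(m^2 + 1)

confounded_data_frame <- function(x, y, m, num_grp){
  d <- point_line_distance(0, m, x, y)
  d_scaled <- 0.0005 + 0.999 * (d - min(d))/(max(d) - min(d))
  data.frame(x=x, y=y,
             group=as.factor(sprintf("grp%02d", ceiling(num_grp*d_scaled))))
}

targets <- c(-3, 0, 2)  # arbitrary slopes we want x to appear to have
reversed_slopes <- sapply(targets, function(m_new) {
  cdf_t <- confounded_data_frame(x, y, m_new, num_grp=100)
  coefficients(lm(y ~ x + group, cdf_t))[['x']]
})
reversed_slopes  # each fitted coefficient lands close to its target
```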

While Simpson’s paradox is typically described in terms of categorical confounders, the same reversal principle applies to continuous confounders. But that’s a topic for another post.