Spurious Correlations in R – Correlation is not Causation

[This article was first published on pacha.dev/blog, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

If this post is useful to you I kindly ask a minimal donation on Buy Me a Coffee. It shall be used to continue my Open Source efforts. The full explanation is here: A Personal Message from an Open Source Contributor.

You can send me questions for the blog using this form and subscribe to receive an email when there is a new post.

I have created a new R package called spuriouscorrelations that aims to help educators to explain why correlation does not imply causation. What I had in mind were AP Statistics courses and college-level introductory statistics courses.

The package includes a dataset with 15 spurious correlations. You can install it from CRAN with:

# install.packages("spuriouscorrelations", repos = "https://cran.rstudio.com")
library(spuriouscorrelations)

Let’s plot one of the spurious correlations, for example, the correlation between the number of people who drowned by falling into a pool and the number of films Nicolas Cage appeared in:

library(dplyr)
library(ggplot2)

spurious_correlations %>%
  distinct(var1_short, var2_short)
# A tibble: 15 × 2
   var1_short                       var2_short                 
   <fct>                            <fct>                      
 1 US spending on science           Suicides                   
 2 Falling into a pool drownings    Nicholas Cage              
 3 Cheese consumed                  Bedsheet tanglings         
 4 Divorce rate in Maine            Margarine consumed         
 5 Age of Miss America              Murders by steam           
 6 Arcade revenue                   Computer science doctorates
 7 Space launches                   Sociology doctorates       
 8 Mozzarella cheese consumption    Engineering doctorates     
 9 Fishing boat deaths              Kentucky marriages         
10 US crude oil imports from Norway Railway train collisions   
11 Chicken consumption              US crude oil imports       
12 Swimming pool drownings          Nuclear power plants       
13 Japanese cars sold               Suicides by crashing       
14 Spelling bee letters             Spider deaths              
15 Math doctorates                  Uranium stored             
nic_cage <- filter(spurious_correlations, var2_short == "Nicholas Cage")

cor(nic_cage$var1_value, nic_cage$var2_value)
[1] 0.6660043
ggplot(nic_cage, aes(x = var1_value, y = var2_value)) +
  geom_point(size = 3) +
  geom_smooth(method = "lm", se = FALSE, color = "blue") +
  labs(
    title = "Spurious Correlation: Drownings vs. Nicolas Cage Films",
    x = "Number of Drownings by Falling into a Pool",
    y = "Number of Films Nicolas Cage Appeared In"
  ) +
  theme_minimal()

With a correlation of 67%, we can even fit a linear model:

lm_model <- lm(var2_value ~ var1_value, data = nic_cage)
summary(lm_model)
Call:
lm(formula = var2_value ~ var1_value, data = nic_cage)

Residuals:
    Min      1Q  Median      3Q     Max 
-0.9308 -0.5926 -0.1020  0.4836  1.6026 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)  
(Intercept) -5.37515    2.86726  -1.875   0.0936 .
var1_value   0.07620    0.02845   2.678   0.0253 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.8678 on 9 degrees of freedom
Multiple R-squared:  0.4436,    Adjusted R-squared:  0.3817 
F-statistic: 7.174 on 1 and 9 DF,  p-value: 0.02527

With a explained variance of 38% (adjusted R-squared), we can say that the number of drownings is a statistically significant predictor of the number of films Nicolas Cage appeared in. However, this is a spurious correlation, and there is no causal relationship between these two variables, even with a p-value of 0.025.

Now let’s compare with a double y-axis plot:

# install.packages("tintin", repos = "https://cran.rstudio.com")
library(tidyr)
library(tintin)

cor_val <- cor(nic_cage$var1_value, nic_cage$var2_value)

# Align the two series visually
v1 <- nic_cage$var1_value
v2 <- nic_cage$var2_value

fun_adjust <- function(v1, v2) {
  s1 <- sd(v1, na.rm = TRUE)
  s2 <- sd(v2, na.rm = TRUE)
  a <- ifelse(s2 == 0, 1, s1 / s2)
  b <- mean(v1, na.rm = TRUE) - a * mean(v2, na.rm = TRUE)
  c(a = as.numeric(a), b = as.numeric(b))
}

adjust <- fun_adjust(v1, v2)
scale_a <- adjust["a"]
scale_b <- adjust["b"]

y1_title <- as.character(unique(nic_cage$var1))
y2_title <- as.character(unique(nic_cage$var2))

nic_cage_long <- nic_cage %>%
  select(year, var1_value, var2_value) %>%
  pivot_longer(
    cols = c(var1_value, var2_value),
    names_to = "variable",
    values_to = "value"
  ) %>%
  mutate(
    variable_label = case_when(
      variable == "var1_value" ~ y1_title,
      variable == "var2_value" ~ y2_title
    ),
  # apply transform to var2 for plotting: plot_value = a * var2 + b
  plot_value = ifelse(variable == "var2_value", value * scale_a + scale_b, value)
  )

# make a double y axis plot with year on the x axis
ggplot(nic_cage_long, aes(x = year)) +
  geom_line(aes(y = plot_value, color = variable_label, group = variable_label), linewidth = 1.5) +
  geom_point(aes(y = plot_value, color = variable_label), size = 3) +
  labs(
    x = "Year",
    y = y1_title,
    title = sprintf("%s\nvs\n%s\n", y1_title, y2_title),
    subtitle = sprintf("Correlation: %.2f", cor_val),
    color = ""
  ) +
  # display all years on the x axis
  scale_x_continuous(breaks = nic_cage$year) +
  # primary y axis is the var1 scale
  # secondary shows var2 original scale by inverse-transforming
  scale_y_continuous(
    sec.axis = sec_axis(~ (. - scale_b) / scale_a, name = y2_title)
  ) +
  theme_minimal(base_size = 13) +
  theme(legend.position = "top") +
  # use tintin color palette
  scale_colour_manual(
    values = tintin_pal(option = "the black island")(2), 
    name = ""
  ) +
  # center title and subtitle
  theme(
    plot.title = element_text(hjust = 0.5, size = 16, face = "bold"),
    plot.subtitle = element_text(hjust = 0.5)
  )

How about other spurious correlations? Here is one:

engineering_doctorates <- filter(spurious_correlations, var2_short == "Engineering doctorates")

cor_val <- cor(engineering_doctorates$var1_value, engineering_doctorates$var2_value)

v1 <- engineering_doctorates$var1_value
v2 <- engineering_doctorates$var2_value

adjust <- fun_adjust(v1, v2)
scale_a <- adjust["a"]
scale_b <- adjust["b"]

y1_title <- as.character(unique(engineering_doctorates$var1))
y2_title <- as.character(unique(engineering_doctorates$var2))

engineering_doctorates_long <- engineering_doctorates %>%
  select(year, var1_value, var2_value) %>%
  pivot_longer(
    cols = c(var1_value, var2_value),
    names_to = "variable",
    values_to = "value"
  ) %>%
  mutate(
    variable_label = case_when(
      variable == "var1_value" ~ y1_title,
      variable == "var2_value" ~ y2_title
    ),
  # apply transform to var2 for plotting: plot_value = a * var2 + b
  plot_value = ifelse(variable == "var2_value", value * scale_a + scale_b, value)
  )

# make a double y axis plot with year on the x axis
ggplot(engineering_doctorates_long, aes(x = year)) +
  geom_line(aes(y = plot_value, color = variable_label, group = variable_label), linewidth = 1.5) +
  geom_point(aes(y = plot_value, color = variable_label), size = 3) +
  labs(
    x = "Year",
    y = y1_title,
    title = sprintf("%s\nvs\n%s\n", y1_title, y2_title),
    subtitle = sprintf("Correlation: %.2f", cor_val),
    color = ""
  ) +
  # display all years on the x axis
  scale_x_continuous(breaks = nic_cage$year) +
  # primary y axis is the var1 scale
  # secondary shows var2 original scale by inverse-transforming
  scale_y_continuous(
    sec.axis = sec_axis(~ (. - scale_b) / scale_a, name = y2_title)
  ) +
  theme_minimal(base_size = 13) +
  theme(legend.position = "top") +
  # use tintin color palette
  scale_colour_manual(
    values = tintin_pal(option = "the black island")(2), 
    name = ""
  ) +
  # center title and subtitle
  theme(
    plot.title = element_text(hjust = 0.5, size = 16, face = "bold"),
    plot.subtitle = element_text(hjust = 0.5)
  )

I can go ad nauseam with these spurious correlations. The point is that correlation does not imply causation, and we should be careful when interpreting correlations.

To leave a comment for the author, please follow the link and comment on their blog: pacha.dev/blog.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)