Site icon R-bloggers

Spurious Correlations in R – Correlation is not Causation

[This article was first published on https://pacha.dev/blog, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
< !DOCTYPE html> < charset="utf-8"> < http-equiv="X-UA-Compatible" content="IE=edge"> < name="viewport" content="width=device-width, initial-scale=1.0"> pacha.dev/blog < !-- MathJax Configuration --> < !-- Smart header: libraries detected based on content --> < !-- File: /tmp/tmp.ldE4wqvhIc/index.html -->
  • < !-- DEBUG: Found sourceCode --> < !-- Load custom CSS after any library CSS to ensure proper precedence -->
  • < header class="site-top">

    Mauricio “Pachá” Vargas Sepúlveda

    Blog with notes about R, Shiny, SQL, Python, Linux and C++. This blog is listed on R-Bloggers.

    HOME 🏠
    < !-- categories are printed below this--> < nav class="sidebar-nav">

    Categories

    < header id="title-block-header" class="quarto-title-block default">

    Spurious Correlations in R – Correlation is not Causation

    The spuriouscorrelations package aims to help educators to explain why correlation does not imply causation.
    Author

    Mauricio “Pachá” Vargas S.

    Published

    September 27, 2025

    If this post is useful to you I kindly ask a minimal donation on Buy Me a Coffee. It shall be used to continue my Open Source efforts. The full explanation is here: A Personal Message from an Open Source Contributor.

    You can send me questions for the blog using this form and subscribe to receive an email when there is a new post.

    I have created a new R package called spuriouscorrelations that aims to help educators to explain why correlation does not imply causation. What I had in mind were AP Statistics courses and college-level introductory statistics courses.

    The package includes a dataset with 15 spurious correlations. You can install it from CRAN with:

    # install.packages("spuriouscorrelations", repos = "https://cran.rstudio.com")
    library(spuriouscorrelations)

    Let’s plot one of the spurious correlations, for example, the correlation between the number of people who drowned by falling into a pool and the number of films Nicolas Cage appeared in:

    library(dplyr)
    library(ggplot2)
    
    spurious_correlations %>%
      distinct(var1_short, var2_short)
    # A tibble: 15 × 2
       var1_short                       var2_short                 
       <fct>                            <fct>                      
     1 US spending on science           Suicides                   
     2 Falling into a pool drownings    Nicholas Cage              
     3 Cheese consumed                  Bedsheet tanglings         
     4 Divorce rate in Maine            Margarine consumed         
     5 Age of Miss America              Murders by steam           
     6 Arcade revenue                   Computer science doctorates
     7 Space launches                   Sociology doctorates       
     8 Mozzarella cheese consumption    Engineering doctorates     
     9 Fishing boat deaths              Kentucky marriages         
    10 US crude oil imports from Norway Railway train collisions   
    11 Chicken consumption              US crude oil imports       
    12 Swimming pool drownings          Nuclear power plants       
    13 Japanese cars sold               Suicides by crashing       
    14 Spelling bee letters             Spider deaths              
    15 Math doctorates                  Uranium stored             
    nic_cage <- filter(spurious_correlations, var2_short == "Nicholas Cage")
    
    cor(nic_cage$var1_value, nic_cage$var2_value)
    [1] 0.6660043
    ggplot(nic_cage, aes(x = var1_value, y = var2_value)) +
      geom_point(size = 3) +
      geom_smooth(method = "lm", se = FALSE, color = "blue") +
      labs(
        title = "Spurious Correlation: Drownings vs. Nicolas Cage Films",
        x = "Number of Drownings by Falling into a Pool",
        y = "Number of Films Nicolas Cage Appeared In"
      ) +
      theme_minimal()

    With a correlation of 67%, we can even fit a linear model:

    lm_model <- lm(var2_value ~ var1_value, data = nic_cage)
    summary(lm_model)
    Call:
    lm(formula = var2_value ~ var1_value, data = nic_cage)
    
    Residuals:
        Min      1Q  Median      3Q     Max 
    -0.9308 -0.5926 -0.1020  0.4836  1.6026 
    
    Coefficients:
                Estimate Std. Error t value Pr(>|t|)  
    (Intercept) -5.37515    2.86726  -1.875   0.0936 .
    var1_value   0.07620    0.02845   2.678   0.0253 *
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    
    Residual standard error: 0.8678 on 9 degrees of freedom
    Multiple R-squared:  0.4436,    Adjusted R-squared:  0.3817 
    F-statistic: 7.174 on 1 and 9 DF,  p-value: 0.02527

    With a explained variance of 38% (adjusted R-squared), we can say that the number of drownings is a statistically significant predictor of the number of films Nicolas Cage appeared in. However, this is a spurious correlation, and there is no causal relationship between these two variables, even with a p-value of 0.025.

    Now let’s compare with a double y-axis plot:

    # install.packages("tintin", repos = "https://cran.rstudio.com")
    library(tidyr)
    library(tintin)
    
    cor_val <- cor(nic_cage$var1_value, nic_cage$var2_value)
    
    # Align the two series visually
    v1 <- nic_cage$var1_value
    v2 <- nic_cage$var2_value
    
    fun_adjust <- function(v1, v2) {
      s1 <- sd(v1, na.rm = TRUE)
      s2 <- sd(v2, na.rm = TRUE)
      a <- ifelse(s2 == 0, 1, s1 / s2)
      b <- mean(v1, na.rm = TRUE) - a * mean(v2, na.rm = TRUE)
      c(a = as.numeric(a), b = as.numeric(b))
    }
    
    adjust <- fun_adjust(v1, v2)
    scale_a <- adjust["a"]
    scale_b <- adjust["b"]
    
    y1_title <- as.character(unique(nic_cage$var1))
    y2_title <- as.character(unique(nic_cage$var2))
    
    nic_cage_long <- nic_cage %>%
      select(year, var1_value, var2_value) %>%
      pivot_longer(
        cols = c(var1_value, var2_value),
        names_to = "variable",
        values_to = "value"
      ) %>%
      mutate(
        variable_label = case_when(
          variable == "var1_value" ~ y1_title,
          variable == "var2_value" ~ y2_title
        ),
      # apply transform to var2 for plotting: plot_value = a * var2 + b
      plot_value = ifelse(variable == "var2_value", value * scale_a + scale_b, value)
      )
    
    # make a double y axis plot with year on the x axis
    ggplot(nic_cage_long, aes(x = year)) +
      geom_line(aes(y = plot_value, color = variable_label, group = variable_label), linewidth = 1.5) +
      geom_point(aes(y = plot_value, color = variable_label), size = 3) +
      labs(
        x = "Year",
        y = y1_title,
        title = sprintf("%s\nvs\n%s\n", y1_title, y2_title),
        subtitle = sprintf("Correlation: %.2f", cor_val),
        color = ""
      ) +
      # display all years on the x axis
      scale_x_continuous(breaks = nic_cage$year) +
      # primary y axis is the var1 scale
      # secondary shows var2 original scale by inverse-transforming
      scale_y_continuous(
        sec.axis = sec_axis(~ (. - scale_b) / scale_a, name = y2_title)
      ) +
      theme_minimal(base_size = 13) +
      theme(legend.position = "top") +
      # use tintin color palette
      scale_colour_manual(
        values = tintin_pal(option = "the black island")(2), 
        name = ""
      ) +
      # center title and subtitle
      theme(
        plot.title = element_text(hjust = 0.5, size = 16, face = "bold"),
        plot.subtitle = element_text(hjust = 0.5)
      )

    How about other spurious correlations? Here is one:

    engineering_doctorates <- filter(spurious_correlations, var2_short == "Engineering doctorates")
    
    cor_val <- cor(engineering_doctorates$var1_value, engineering_doctorates$var2_value)
    
    v1 <- engineering_doctorates$var1_value
    v2 <- engineering_doctorates$var2_value
    
    adjust <- fun_adjust(v1, v2)
    scale_a <- adjust["a"]
    scale_b <- adjust["b"]
    
    y1_title <- as.character(unique(engineering_doctorates$var1))
    y2_title <- as.character(unique(engineering_doctorates$var2))
    
    engineering_doctorates_long <- engineering_doctorates %>%
      select(year, var1_value, var2_value) %>%
      pivot_longer(
        cols = c(var1_value, var2_value),
        names_to = "variable",
        values_to = "value"
      ) %>%
      mutate(
        variable_label = case_when(
          variable == "var1_value" ~ y1_title,
          variable == "var2_value" ~ y2_title
        ),
      # apply transform to var2 for plotting: plot_value = a * var2 + b
      plot_value = ifelse(variable == "var2_value", value * scale_a + scale_b, value)
      )
    
    # make a double y axis plot with year on the x axis
    ggplot(engineering_doctorates_long, aes(x = year)) +
      geom_line(aes(y = plot_value, color = variable_label, group = variable_label), linewidth = 1.5) +
      geom_point(aes(y = plot_value, color = variable_label), size = 3) +
      labs(
        x = "Year",
        y = y1_title,
        title = sprintf("%s\nvs\n%s\n", y1_title, y2_title),
        subtitle = sprintf("Correlation: %.2f", cor_val),
        color = ""
      ) +
      # display all years on the x axis
      scale_x_continuous(breaks = nic_cage$year) +
      # primary y axis is the var1 scale
      # secondary shows var2 original scale by inverse-transforming
      scale_y_continuous(
        sec.axis = sec_axis(~ (. - scale_b) / scale_a, name = y2_title)
      ) +
      theme_minimal(base_size = 13) +
      theme(legend.position = "top") +
      # use tintin color palette
      scale_colour_manual(
        values = tintin_pal(option = "the black island")(2), 
        name = ""
      ) +
      # center title and subtitle
      theme(
        plot.title = element_text(hjust = 0.5, size = 16, face = "bold"),
        plot.subtitle = element_text(hjust = 0.5)
      )

    I can go ad nauseam with these spurious correlations. The point is that correlation does not imply causation, and we should be careful when interpreting correlations.

    < footer>

    Loading…

  • < !-- Load shared sidebar -->
    To leave a comment for the author, please follow the link and comment on their blog: https://pacha.dev/blog.

    R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
    Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
    Exit mobile version