spuriouscorrelations: An R package to show examples about spurious correlations

[This article was first published on https://pacha.dev/blog, and kindly contributed to R-bloggers].

    Mauricio “Pachá” Vargas Sepúlveda


    Correlation is not causation.
    Author: Mauricio “Pachá” Vargas S.

    Published: May 17, 2025

    I’ve been busy with the field exams, so I haven’t had much time to work on the blog.

    The spuriouscorrelations package started as a fun project for one of my tutorials.

    Here is an interesting correlation: the number of people who drowned by falling into a pool and the number of films Nicolas Cage appeared in.

    if (!require(spuriouscorrelations)) install.packages("spuriouscorrelations")
    Loading required package: spuriouscorrelations
    Warning in library(package, lib.loc = lib.loc, character.only = TRUE,
    logical.return = TRUE, : there is no package called 'spuriouscorrelations'
    Installing package into '/home/pacha/R/x86_64-pc-linux-gnu-library/4.5'
    (as 'lib' is unspecified)
    if (!require(dplyr)) install.packages("dplyr")
    Loading required package: dplyr
    Attaching package: 'dplyr'
    The following objects are masked from 'package:stats':
    
        filter, lag
    The following objects are masked from 'package:base':
    
        intersect, setdiff, setequal, union
    if (!require(ggplot2)) install.packages("ggplot2")
    Loading required package: ggplot2
    library(spuriouscorrelations)
    library(dplyr)
    library(ggplot2)
    
    unique(spurious_correlations$var1)
     [1] US spending on science, space, and technology                   
     [2] Number of people who drowned by falling into a pool             
     [3] Per capita cheese consumption                                   
     [4] Divorce rate in Maine                                           
     [5] Age of Miss America                                             
     [6] Total revenue generated by arcades                              
     [7] Worldwide non-commercial space launches                         
     [8] Per capita consumption of mozzarella cheese                     
     [9] People who drowned after falling out of a fishing boat          
    [10] US crude oil imports from Norway                                
    [11] Per capita consumption of chicken                               
    [12] Number of people who drowned while in a swimming-pool           
    [13] Japanese cars sold in the US                                    
    [14] Letters in the winning word of the Scripps National Spelling Bee
    [15] Mathematics doctorates awarded                                  
    15 Levels: Age of Miss America ... Worldwide non-commercial space launches
    drownings <- spurious_correlations %>%
      filter(
        var1 == "Number of people who drowned by falling into a pool"
      ) %>%
      select(year, var1, var2, var1_value, var2_value)
    
    cor(drownings$var1_value, drownings$var2_value)
    [1] 0.6660043
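
A correlation of about 0.67 on a short yearly series is weaker evidence than it looks. As a rough sketch in base R (not part of the package; the series length of 11, the seed, the number of simulations, and the 0.66 cutoff are all assumptions chosen for illustration), the simulation below draws pairs of independent random walks and computes the share of pairs whose correlation is at least that large in absolute value.

```r
set.seed(42)      # arbitrary seed, only for reproducibility
n_years <- 11     # assumed series length, not taken from the package data
n_sims <- 2000    # assumed number of simulated pairs

# Each replicate builds two random walks that are independent by construction
sim_cor <- replicate(n_sims, {
  x <- cumsum(rnorm(n_years))
  y <- cumsum(rnorm(n_years))
  cor(x, y)
})

# Share of unrelated pairs with |r| at or above the assumed 0.66 cutoff
share_large <- mean(abs(sim_cor) >= 0.66)
share_large
```

Because the two walks are generated independently, whatever share this prints comes from chance alone.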

    Now let’s plot the data.

    # compute a scale factor so that max(var2_value * factor) ≈ max(var1_value)
    max1 <- max(drownings$var1_value)
    max2 <- max(drownings$var2_value)
    ratio <- max1 / max2
    
    ggplot(drownings, aes(x = year)) +
      geom_line(aes(y = var1_value, color = "Drownings")) +
      geom_line(aes(y = var2_value * ratio, color = "Films")) +
      scale_y_continuous(
        name = "Number of drownings",
        sec.axis = sec_axis(~ . / ratio,
          name = "Number of films"
        ),
        limits = c(0, NA)
      ) +
      scale_color_manual(
        name = "",
        values = c(
          "Drownings" = "blue",
          "Films" = "red"
        )
      ) +
      theme_minimal() +
      labs(
        title = "Number of people who drowned by falling into a pool vs.\nNumber of films Nicolas Cage appeared in",
        caption = "Source: Spurious Correlations (Vigen 2015)"
      )
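
The dual-axis trick in the plot relies on two matching steps: the second series is multiplied by `ratio` before drawing, and `sec_axis(~ . / ratio)` divides by the same factor so the right-hand axis shows the original units. The sketch below checks that round trip in base R, using made-up numbers (the `_demo` vectors are hypothetical and not taken from the package data).

```r
# Made-up illustrative values, not taken from spurious_correlations
drownings_demo <- c(100, 120, 90, 110)
films_demo <- c(2, 4, 1, 3)

# Same scale factor as in the plot: ratio of the two maxima
ratio_demo <- max(drownings_demo) / max(films_demo)

# Forward step: the values geom_line() would draw for the second series
films_scaled <- films_demo * ratio_demo

# Inverse step: the transformation sec_axis(~ . / ratio) applies to the axis
films_back <- films_scaled / ratio_demo

# The rescaled series peaks at the same height as the first series
isTRUE(all.equal(max(films_scaled), max(drownings_demo)))

# Dividing by the factor recovers the original film counts
isTRUE(all.equal(films_back, films_demo))
```

Matching the maxima is only one possible choice of scale factor. A different factor changes how closely the two lines appear to track each other, which is part of why dual-axis plots can make a spurious correlation look convincing.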

    Interested? You can install the package from GitHub:

    pak::pkg_install("pachadotdev/spuriouscorrelations")