spuriouscorrelations: An R package to show examples about spurious correlations


[This article was first published on https://pacha.dev/blog, and kindly contributed to R-bloggers.]





Statistics

Correlation is not causation.

Author

Mauricio “Pachá” Vargas S.

Published

May 17, 2025

I’ve been busy with the field exams, so I haven’t had much time to work on the blog.

The spuriouscorrelations package started as a fun project for one of my tutorials.

Here is a case of an interesting correlation: the number of people who drowned by falling into a pool and the number of films Nicolas Cage appeared in.

if (!require(spuriouscorrelations)) install.packages("spuriouscorrelations")

Loading required package: spuriouscorrelations

Warning in library(package, lib.loc = lib.loc, character.only = TRUE,
logical.return = TRUE, : there is no package called 'spuriouscorrelations'

Installing package into '/home/pacha/R/x86_64-pc-linux-gnu-library/4.5'
(as 'lib' is unspecified)

if (!require(dplyr)) install.packages("dplyr")

Loading required package: dplyr

Attaching package: 'dplyr'

The following objects are masked from 'package:stats':

    filter, lag

The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union

if (!require(ggplot2)) install.packages("ggplot2")

Loading required package: ggplot2

library(spuriouscorrelations)
library(dplyr)
library(ggplot2)

unique(spurious_correlations$var1)

 [1] US spending on science, space, and technology                   
 [2] Number of people who drowned by falling into a pool             
 [3] Per capita cheese consumption                                   
 [4] Divorce rate in Maine                                           
 [5] Age of Miss America                                             
 [6] Total revenue generated by arcades                              
 [7] Worldwide non-commercial space launches                         
 [8] Per capita consumption of mozzarella cheese                     
 [9] People who drowned after falling out of a fishing boat          
[10] US crude oil imports from Norway                                
[11] Per capita consumption of chicken                               
[12] Number of people who drowned while in a swimming-pool           
[13] Japanese cars sold in the US                                    
[14] Letters in the winning word of the Scripps National Spelling Bee
[15] Mathematics doctorates awarded                                  
15 Levels: Age of Miss America ... Worldwide non-commercial space launches

drownings <- spurious_correlations %>%
  filter(
    var1 == "Number of people who drowned by falling into a pool"
  ) %>%
  select(year, var1, var2, var1_value, var2_value)

cor(drownings$var1_value, drownings$var2_value)

[1] 0.6660043
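To put a correlation of about 0.67 in context, here is a minimal simulation sketch in base R. It draws many pairs of completely unrelated short series and counts how often their correlation is at least as strong as the one above. The series length (`n_years <- 11`) is an assumption made for illustration and is not read from the package data; `n_pairs` and `threshold` are also illustrative names.

```r
set.seed(1)

n_years   <- 11      # assumed number of yearly observations (hypothetical)
n_pairs   <- 10000   # number of unrelated pairs to simulate
threshold <- 0.666   # roughly the correlation reported above

# Correlation between two independent random series, repeated many times
r <- replicate(n_pairs, cor(rnorm(n_years), rnorm(n_years)))

# Share of unrelated pairs whose correlation is at least as strong (either sign)
share_strong <- mean(abs(r) >= threshold)
share_strong
```

Under these assumptions a small but non-zero share of unrelated pairs clears the threshold, so anyone who compares enough short series against each other should expect to find a few strong correlations by chance alone.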

Now let’s plot the data.

# compute a scale factor so that max(var2_value * factor) ≈ max(var1_value)
max1 <- max(drownings$var1_value)
max2 <- max(drownings$var2_value)
ratio <- max1 / max2

ggplot(drownings, aes(x = year)) +
  geom_line(aes(y = var1_value, color = "Drownings")) +
  geom_line(aes(y = var2_value * ratio, color = "Films")) +
  scale_y_continuous(
    name = "Number of drownings",
    sec.axis = sec_axis(~ . / ratio,
      name = "Number of films"
    ),
    limits = c(0, NA)
  ) +
  scale_color_manual(
    name = "",
    values = c(
      "Drownings" = "blue",
      "Films" = "red"
    )
  ) +
  theme_minimal() +
  labs(
    title = "Number of people who drowned by falling into a pool vs.\nNumber of films Nicolas Cage appeared in",
    caption = "Source: Spurious Correlations (Vigen 2015)"
  )
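The rescaling behind the secondary axis can be checked on a few toy numbers. The values below are made up for illustration and do not come from the package; only the names `max1`, `max2` and `ratio` mirror the plotting code.

```r
# Toy values (hypothetical, not from spurious_correlations)
var1_value <- c(120, 95, 110, 102)  # plays the role of the drownings series
var2_value <- c(2, 1, 4, 3)         # plays the role of the films series

max1  <- max(var1_value)
max2  <- max(var2_value)
ratio <- max1 / max2

# Scale the second series so that its maximum lines up with the first series
scaled <- var2_value * ratio

# The secondary axis applies the inverse transformation (~ . / ratio),
# so its labels read in the original units of var2_value
recovered <- scaled / ratio

isTRUE(all.equal(max(scaled), max1))
isTRUE(all.equal(recovered, var2_value))
```

Multiplying by a positive constant changes only the scale of the second line, so the correlation between the two series is the same before and after the transformation.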

Interested? You can install the package from GitHub:

pak::pkg_install("pachadotdev/spuriouscorrelations")
