Why are Racing Drivers Born on March 23

[This article was first published on You Know, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Wikipedia has pages for each day of the year (e.g., January 1, April 25). Each page contains a list of names of famous people with that birthday along with a short description of each person. I wrote an R script to scrape these lists from each wikipedia page and then analyze the data to find which days have unusually high occurrences of certain words (based on the tf-idf statistic).

Three days stood out:

  • May 2: columnists
  • March 16: bicycle racers
  • March 23: racing drivers
So, why are Racing Drivers so often born on March 23?
R script:

1:    
2:  # Prepare -----------------------------------------------------------------  
3:  rm(list = ls())  
4:  gc()  
5:  pkg <- c("tidyverse", "rvest", "tidytext", "ggplot2", "beepr")  
6:  inst <- pkg %in% installed.packages()  
7:  if(length(pkg[!inst]) > 0) install.packages(pkg[!inst])  
8:  lapply(pkg, library, character.only = TRUE)  
9:  rm(list = c("inst", "pkg"))  
10:  setwd("/Users/danieldunn/Dropbox/DD Cloud/R/birthday_types")  
11:  set.seed(4444)  
12:    
13:    
14:  # Download wiki birthday data ---------------------------------------------  
15:  births <- seq(as.POSIXct("2004-01-01"), as.POSIXct("2004-12-31"), by = "days") %>%  
16:   format(format = "%B_%e") %>%  
17:   gsub(pattern = "_ ", replacement = "_") %>%  
18:   as.list()  
19:  #births <- births[1:5] # remove when ready for whole year  
20:  urls <- paste0("https://en.wikipedia.org/wiki/", births)  
21:  ndays <- length(urls)  
22:    
23:  for(i in 1:ndays) {  
24:   cat("Trying", i, "of", length(births), "urls.\n")  
25:   section <- ifelse(urls[i] == "https://en.wikipedia.org/wiki/February_29",   
26:            2,   
27:            read_html(urls[i]) %>%  
28:             html_nodes(xpath = '//*[@id="toc"]/ul') %>%  
29:             html_text() %>%  
30:             gsub(pattern = " Births.*$", replacement = "") %>%  
31:             gsub(pattern = "^.*\n", replacement = ""))  
32:   births[[i]][3] <- read_html(urls[i]) %>%  
33:    html_nodes(xpath = paste0('//*[@id="mw-content-text"]/div/ul[', section, ']')) %>%  
34:    html_text()  
35:   births[[i]][2] <- gregexpr(pattern = "\\n", text = births[[i]][3]) %>%  
36:    regmatches(x = births[[i]][3]) %>%  
37:    lengths()  
38:   Sys.sleep(time = 0.1)  
39:  }  
40:    
41:    
42:  # Find outlier word frequencies -------------------------------------------  
43:  data(stop_words)  
44:  termset <- data.frame(word = as.character(), n = as.integer(),   
45:             day = as.character(), tf = as.numeric(),   
46:             stringsAsFactors = FALSE)  
47:  for(i in 1:ndays) {  
48:   terms <- data_frame(text = births[[i]][3]) %>%  
49:    unnest_tokens(word, text, to_lower = FALSE) %>%  
50:    anti_join(y = stop_words, by = "word") %>%  
51:    count(word, sort = TRUE) %>%  
52:    filter(!grepl(pattern = "[[:digit:]].*$", x = word)) %>%  
53:    filter(!grepl(pattern = "[[:upper:]].*$", x = word)) %>%  
54:    mutate(day = births[[i]][1], tf = n / sum(n))  
55:   termset <- rbind(termset, terms)  
56:  }  
57:    
58:  full <- termset %>%  
59:   count(word, sort = TRUE) %>%  
60:   mutate(idf = log(x = ndays / nn), base = 10)  
61:    
62:  termset <- termset %>%  
63:   left_join(y = full, by = "word") %>%  
64:   mutate(tfidf = round(tf * idf, 3)) %>%  
65:   arrange(desc(tfidf)) %>%  
66:   mutate(dayN = as.Date(paste0(gsub(pattern = "_", replacement = " ", x = day), ", 2004"), format = "%B %e, %Y"))  
67:    
68:    
69:  # Roll up to word level ---------------------------------------------------  
70:  wordset <- termset %>%  
71:   group_by(word) %>%  
72:   summarize(frequency = sum(n)) %>%  
73:   ungroup() %>%  
74:   arrange(desc(frequency))  
75:    
76:    
77:  # Display extremes --------------------------------------------------------  
78:  topT <- termset %>%  
79:   group_by(day) %>%  
80:   top_n(n = 1, wt = tfidf) %>%   
81:   ungroup() %>%  
82:   filter(n >= 2) %>%  
83:   arrange(dayN) %>%  
84:   print  
85:    
86:  topW <- wordset %>%  
87:   top_n(n = 10, wt = frequency) %>%   
88:   filter(frequency >= 1) %>%  
89:   arrange(desc(frequency)) %>%  
90:   print  
91:    
92:    
93:  # Finish ------------------------------------------------------------------  
94:  beep()  
95:    
96:    
97:    

To leave a comment for the author, please follow the link and comment on their blog: You Know.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)