Why are Racing Drivers Born on March 23

August 31, 2017
By

[This article was first published on You Know, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Wikipedia has pages for each day of the year (e.g., January 1, April 25). Each page contains a list of names of famous people with that birthday along with a short description of each person. I wrote an R script to scrape these lists from each wikipedia page and then analyze the data to find which days have unusually high occurrences of certain words (based on the tf-idf statistic).

Three days stood out:

  • May 2: columnists
  • March 16: bicycle racers
  • March 23: racing drivers
So, why are Racing Drivers so often born on March 23?
R script:

1:    
2: # Prepare -----------------------------------------------------------------
3: rm(list = ls())
4: gc()
5: pkg <- c("tidyverse", "rvest", "tidytext", "ggplot2", "beepr")
6: inst <- pkg %in% installed.packages()
7: if(length(pkg[!inst]) > 0) install.packages(pkg[!inst])
8: lapply(pkg, library, character.only = TRUE)
9: rm(list = c("inst", "pkg"))
10: setwd("/Users/danieldunn/Dropbox/DD Cloud/R/birthday_types")
11: set.seed(4444)
12:
13:
14: # Download wiki birthday data ---------------------------------------------
15: births <- seq(as.POSIXct("2004-01-01"), as.POSIXct("2004-12-31"), by = "days") %>%
16: format(format = "%B_%e") %>%
17: gsub(pattern = "_ ", replacement = "_") %>%
18: as.list()
19: #births <- births[1:5] # remove when ready for whole year
20: urls <- paste0("https://en.wikipedia.org/wiki/", births)
21: ndays <- length(urls)
22:
23: for(i in 1:ndays) {
24: cat("Trying", i, "of", length(births), "urls.\n")
25: section <- ifelse(urls[i] == "https://en.wikipedia.org/wiki/February_29",
26: 2,
27: read_html(urls[i]) %>%
28: html_nodes(xpath = '//*[@id="toc"]/ul') %>%
29: html_text() %>%
30: gsub(pattern = " Births.*$", replacement = "") %>%
31: gsub(pattern = "^.*\n", replacement = ""))
32: births[[i]][3] <- read_html(urls[i]) %>%
33: html_nodes(xpath = paste0('//*[@id="mw-content-text"]/div/ul[', section, ']')) %>%
34: html_text()
35: births[[i]][2] <- gregexpr(pattern = "\\n", text = births[[i]][3]) %>%
36: regmatches(x = births[[i]][3]) %>%
37: lengths()
38: Sys.sleep(time = 0.1)
39: }
40:
41:
42: # Find outlier word frequencies -------------------------------------------
43: data(stop_words)
44: termset <- data.frame(word = as.character(), n = as.integer(),
45: day = as.character(), tf = as.numeric(),
46: stringsAsFactors = FALSE)
47: for(i in 1:ndays) {
48: terms <- data_frame(text = births[[i]][3]) %>%
49: unnest_tokens(word, text, to_lower = FALSE) %>%
50: anti_join(y = stop_words, by = "word") %>%
51: count(word, sort = TRUE) %>%
52: filter(!grepl(pattern = "[[:digit:]].*$", x = word)) %>%
53: filter(!grepl(pattern = "[[:upper:]].*$", x = word)) %>%
54: mutate(day = births[[i]][1], tf = n / sum(n))
55: termset <- rbind(termset, terms)
56: }
57:
58: full <- termset %>%
59: count(word, sort = TRUE) %>%
60: mutate(idf = log(x = ndays / nn), base = 10)
61:
62: termset <- termset %>%
63: left_join(y = full, by = "word") %>%
64: mutate(tfidf = round(tf * idf, 3)) %>%
65: arrange(desc(tfidf)) %>%
66: mutate(dayN = as.Date(paste0(gsub(pattern = "_", replacement = " ", x = day), ", 2004"), format = "%B %e, %Y"))
67:
68:
69: # Roll up to word level ---------------------------------------------------
70: wordset <- termset %>%
71: group_by(word) %>%
72: summarize(frequency = sum(n)) %>%
73: ungroup() %>%
74: arrange(desc(frequency))
75:
76:
77: # Display extremes --------------------------------------------------------
78: topT <- termset %>%
79: group_by(day) %>%
80: top_n(n = 1, wt = tfidf) %>%
81: ungroup() %>%
82: filter(n >= 2) %>%
83: arrange(dayN) %>%
84: print
85:
86: topW <- wordset %>%
87: top_n(n = 10, wt = frequency) %>%
88: filter(frequency >= 1) %>%
89: arrange(desc(frequency)) %>%
90: print
91:
92:
93: # Finish ------------------------------------------------------------------
94: beep()
95:
96:
97:

To leave a comment for the author, please follow the link and comment on their blog: You Know.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.



If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Comments are closed.

Search R-bloggers

Sponsors

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)