Why Are Racing Drivers Born on March 23?

August 31, 2017

(This article was first published on You Know, and kindly contributed to R-bloggers)

Wikipedia has a page for each day of the year (e.g., January 1, April 25). Each page lists famous people born on that day, along with a short description of each person. I wrote an R script to scrape these lists from every Wikipedia day page and then find which days have unusually high occurrences of certain words, as measured by the tf-idf statistic (term frequency times inverse document frequency, treating each day's list as one document).
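To make the statistic concrete, here is a minimal base-R sketch of tf-idf on a toy data set. The three "days" and their words are invented for illustration and are not taken from Wikipedia; the base-10 logarithm is an assumption that mirrors what the script below appears to intend.

```r
# Toy data: three "days", each with a bag of already-tokenized words.
# These words and counts are invented for illustration only.
docs <- list(
  day_a = c("driver", "driver", "actor", "singer"),
  day_b = c("actor", "singer", "singer", "politician"),
  day_c = c("actor", "politician", "footballer", "singer")
)

# One row per (day, word) pair with its raw count n
counts <- do.call(rbind, lapply(names(docs), function(d) {
  tab <- table(docs[[d]])
  data.frame(day = d, word = names(tab), n = as.integer(tab),
             stringsAsFactors = FALSE)
}))

# Term frequency: share of the day's words taken up by this word
counts$tf <- counts$n / ave(counts$n, counts$day, FUN = sum)

# Inverse document frequency: log10(number of days / days containing the word)
n_days <- length(docs)
days_with_word <- table(counts$word)
counts$idf <- log(n_days / as.integer(days_with_word[counts$word]), base = 10)

# A word scores high when it is frequent on one day but rare across days
counts$tfidf <- counts$tf * counts$idf
counts[order(-counts$tfidf), ]
```

In this toy set, "driver" on day_a ranks first because it makes up half of that day's words and appears on no other day, while "actor" scores zero because it appears on every day.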

Three days stood out:

  • May 2: columnists
  • March 16: bicycle racers
  • March 23: racing drivers

So, why are racing drivers so often born on March 23?

R script:

# Prepare -----------------------------------------------------------------
rm(list = ls())
gc()
pkg <- c("tidyverse", "rvest", "tidytext", "ggplot2", "beepr")
inst <- pkg %in% rownames(installed.packages())
if (length(pkg[!inst]) > 0) install.packages(pkg[!inst])
lapply(pkg, library, character.only = TRUE)
rm(list = c("inst", "pkg"))
setwd("/Users/danieldunn/Dropbox/DD Cloud/R/birthday_types")
set.seed(4444)


# Download wiki birthday data ---------------------------------------------
# 2004 is a leap year, so the sequence includes February 29.
births <- seq(as.POSIXct("2004-01-01"), as.POSIXct("2004-12-31"), by = "days") %>%
  format(format = "%B_%e") %>%
  gsub(pattern = "_ ", replacement = "_") %>%
  as.list()
#births <- births[1:5] # remove when ready for whole year
urls <- paste0("https://en.wikipedia.org/wiki/", births)
ndays <- length(urls)

# Each births[[i]] ends up as a character vector:
#   [1] page name, [2] number of entries, [3] raw text of the Births list.
# The XPath expressions rely on the Wikipedia page layout at the time of writing.
for (i in 1:ndays) {
  cat("Trying", i, "of", length(births), "urls.\n")
  page <- read_html(urls[i])  # fetch the page once per day
  # Number of the "Births" section in the table of contents, used as the
  # index of the matching <ul> (hard-coded for February 29)
  section <- ifelse(urls[i] == "https://en.wikipedia.org/wiki/February_29",
                    2,
                    page %>%
                      html_nodes(xpath = '//*[@id="toc"]/ul') %>%
                      html_text() %>%
                      gsub(pattern = " Births.*$", replacement = "") %>%
                      gsub(pattern = "^.*\n", replacement = ""))
  births[[i]][3] <- page %>%
    html_nodes(xpath = paste0('//*[@id="mw-content-text"]/div/ul[', section, ']')) %>%
    html_text()
  # Count entries as the number of newlines in the list text
  births[[i]][2] <- gregexpr(pattern = "\\n", text = births[[i]][3]) %>%
    regmatches(x = births[[i]][3]) %>%
    lengths()
  Sys.sleep(time = 0.1)
}


# Find outlier word frequencies -------------------------------------------
data(stop_words)
termset <- data.frame(word = character(), n = integer(),
                      day = character(), tf = numeric(),
                      stringsAsFactors = FALSE)
for (i in 1:ndays) {
  terms <- tibble(text = births[[i]][3]) %>%
    unnest_tokens(word, text, to_lower = FALSE) %>%
    anti_join(y = stop_words, by = "word") %>%
    count(word, sort = TRUE) %>%
    # Drop tokens containing digits (years) or uppercase letters (mostly names)
    filter(!grepl(pattern = "[[:digit:]].*$", x = word)) %>%
    filter(!grepl(pattern = "[[:upper:]].*$", x = word)) %>%
    mutate(day = births[[i]][1], tf = n / sum(n))
  termset <- rbind(termset, terms)
}

# Inverse document frequency: log10(days / days on which the word appears)
full <- termset %>%
  count(word, sort = TRUE, name = "ndocs") %>%
  mutate(idf = log(x = ndays / ndocs, base = 10))

termset <- termset %>%
  left_join(y = full, by = "word") %>%
  mutate(tfidf = round(tf * idf, 3)) %>%
  arrange(desc(tfidf)) %>%
  mutate(dayN = as.Date(paste0(gsub(pattern = "_", replacement = " ", x = day), ", 2004"),
                        format = "%B %e, %Y"))


# Roll up to word level ---------------------------------------------------
wordset <- termset %>%
  group_by(word) %>%
  summarize(frequency = sum(n)) %>%
  ungroup() %>%
  arrange(desc(frequency))


# Display extremes --------------------------------------------------------
# Highest-scoring word for each day, keeping only words seen at least twice
topT <- termset %>%
  group_by(day) %>%
  top_n(n = 1, wt = tfidf) %>%
  ungroup() %>%
  filter(n >= 2) %>%
  arrange(dayN) %>%
  print()

# Ten most frequent words over the whole year
topW <- wordset %>%
  top_n(n = 10, wt = frequency) %>%
  filter(frequency >= 1) %>%
  arrange(desc(frequency)) %>%
  print()


# Finish ------------------------------------------------------------------
beep()
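The script computes tf and idf by hand. As a possible cross-check, tidytext also provides `bind_tf_idf()`, which adds `tf`, `idf`, and `tf_idf` columns to a one-row-per-term-per-document table. The sketch below runs it on an invented table shaped like `termset` (columns `day`, `word`, `n`); the data are hypothetical. Note that `bind_tf_idf()` uses the natural logarithm for idf, so its scores differ from base-10 scores by a constant factor while the ranking stays the same.

```r
library(dplyr)
library(tidytext)

# Invented per-day word counts in the same shape as `termset` (day, word, n)
toy <- tibble(
  day  = c("day_a", "day_a", "day_a",
           "day_b", "day_b",
           "day_c", "day_c", "day_c"),
  word = c("driver", "actor", "singer",
           "actor", "singer",
           "actor", "politician", "singer"),
  n    = c(2L, 1L, 1L,
           1L, 2L,
           1L, 1L, 1L)
)

# bind_tf_idf() treats each `day` as a document and each `word` as a term
scored <- toy %>%
  bind_tf_idf(term = word, document = day, n = n) %>%
  arrange(desc(tf_idf))

print(scored)
```

Words that occur on every toy day ("actor", "singer") get an idf of zero, so only "driver" and "politician" receive non-zero scores here.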
