Who were the notable dead of Wikipedia?

[This article was first published on Maëlle, and kindly contributed to R-bloggers].

As described in my last post, I extracted all notable deaths from Wikipedia over the 2004-2016 period. In this post I want to explore this study population. Who were the notable dead?

How old were the notable dead?

Let me assume here that most entries of the table are humans. I won’t make the effort to remove dogs or horses from the list yet, which introduces a small error.

library("ggplot2")
library("viridis")
library("broom")
library("dplyr")
library("lubridate")
library("tidytext")
library("rcorpora")
deaths <- readr::read_csv("data/deaths_with_demonyms.csv")

As a reminder, in case you didn’t learn the figures from my last post (shame on you), the table contains information about 56303 notable deaths. I could extract the age of 97% of them.

ggplot(deaths) +
  geom_histogram(aes(age)) +
  ggtitle("Age at death of Wikipedia notable dead")

[Figure: histogram of age at death of Wikipedia notable dead]

Let’s be honest, I expected a bimodal distribution with a first peak at 27.

tidy(summary(deaths$age))
##   minimum q1 median  mean q3 maximum   na
## 1       1 68     80 75.94 88     176 1677
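As a quick sanity check, the `na` column of this summary ties back to the 97% figure quoted earlier. A minimal sketch that uses only the two numbers printed in this post (the variable names are made up for the illustration):

```r
# Numbers taken from the post: total entries and entries without an age
n_total <- 56303
n_missing_age <- 1677

# Share of entries for which an age could be extracted, in percent
share_with_age <- 100 * (1 - n_missing_age / n_total)
round(share_with_age)
```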

Wow, this is a really high maximum age.

arrange(deaths, desc(age)) %>%
  head(n = 10) %>%
  knitr::kable()
|wiki_link                        |name                        | age|country_role                            |date       |country       | adj_length|adjectivals |occupation                       |
|:--------------------------------|:---------------------------|---:|:---------------------------------------|:----------|:-------------|----------:|:-----------|:--------------------------------|
|Harriet_(tortoise)               |Harriet (tortoise)          | 176|NA                                      |2006-06-23 |NA            |         NA|NA          |NA                               |
|Eisenhower_Tree                  |Eisenhower Tree             | 125|American                                |2014-02-16 |United States |          1|American    |NA American                      |
|J%C3%B3zef_Piotrowski_(organist) |Józef Piotrowski (organist) | 118|Polish organist and longevity claimant. |2005-09-08 |Poland        |          1|Polish .*   |organist and longevity claimant  |
|Misao_Okawa                      |Misao Okawa                 | 117|Japanese supercentenarian               |2015-04-01 |Japan         |          1|Japanese .* |supercentenarian                 |
|Pawe%C5%82_Parniak               |Pawel Parniak               | 116|Polish supercentenarian                 |2006-03-27 |Poland        |          1|Polish .*   |supercentenarian                 |
|Gertrude_Weaver                  |Gertrude Weaver             | 116|American                                |2015-04-06 |United States |          1|American    |NA American                      |
|Jai_Gurudev                      |Jai Gurudev                 | 116|Indian religious leader.                |2012-05-18 |India         |          1|Indian .*   |religious leader                 |
|Susannah_Mushatt_Jones           |Susannah Mushatt Jones      | 116|American                                |2016-05-12 |United States |          1|American    |NA American                      |
|Jiroemon_Kimura                  |Jiroemon Kimura             | 116|Japanese                                |2013-06-12 |Japan         |          1|Japanese .* |NA Japanese                      |
|Jeralean_Talley                  |Jeralean Talley             | 116|American supercentenarian               |2015-06-17 |United States |          1|American    |supercentenarian                 |

Ok, so the oldest beings in this table were a tortoise and a tree, which we might want to remove from the rest of the analysis.

deaths <- filter(deaths, age < 125)
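One side effect of this step is worth keeping in mind: `filter()` keeps only the rows where the condition is `TRUE`, and `NA < 125` evaluates to `NA`, so entries without an age leave the table together with the tortoise and the tree. A minimal sketch on a hypothetical toy table (the names and ages below are invented, not taken from the data):

```r
library("dplyr")

# Hypothetical toy table, not the real data
toy <- data.frame(
  name = c("tortoise", "tree", "human", "unknown age"),
  age = c(176, 125, 80, NA),
  stringsAsFactors = FALSE
)

# Same condition as in the post: the row with a missing age is dropped too
kept <- filter(toy, age < 125)
kept$name

# Possible variant that keeps the rows with a missing age
kept_with_na <- filter(toy, is.na(age) | age < 125)
kept_with_na$name
```

Whether dropping the entries without an age is desirable depends on the later steps: it does not matter for the age analyses, but it slightly lowers the counts by country and by occupation.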

What about the deaths at the youngest ages?

arrange(deaths, age) %>%
  head(n = 10) %>%
  knitr::kable()
|wiki_link                                           |name                           | age|country_role                                                 |date       |country       | adj_length|adjectivals    |occupation                                        |
|:---------------------------------------------------|:------------------------------|---:|:------------------------------------------------------------|:----------|:-------------|----------:|:--------------|:-------------------------------------------------|
|Manar_Maged class=mw-redirect                       |Manar Maged                    |   1|Egyptian girl born with two heads                            |2006-03-26 |Egypt         |          1|Egyptian .*    |girl born with two heads                          |
|Ayelet_Galena                                       |Ayelet Galena                  |   2|American child                                               |2012-01-31 |United States |          1|American       |child                                             |
|Colonel_Meow                                        |Colonel Meow                   |   2|American Himalayan-Persian cat                               |2014-01-29 |United States |          1|American       |Himalayan-Persian cat                             |
|Ben_Bowen                                           |Ben Bowen                      |   2|American child cancer victim                                 |2005-02-25 |United States |          1|American       |child cancer victim                               |
|Marius_(giraffe)                                    |Marius (giraffe)               |   2|Danish giraffe                                               |2014-02-09 |Denmark       |          1|Danish .*      |giraffe                                           |
|Disappearance_of_Aisling_Symes class=mw-redirect    |Disappearance of Aisling Symes |   2|New Zealand child whose disappearance initiated major search |2009-10-05 |New Zealand   |          2|New Zealand .* |child whose disappearance initiated major search  |
|Paul_the_Octopus                                    |Paul the Octopus               |   2|British-born                                                 |2010-10-26 |NA            |         NA|NA             |British-born                                      |
|Chriselliam                                         |Chriselliam                    |   3|Irish-bred British-trained Thoroughbred racehorse            |2014-02-07 |NA            |         NA|NA             |Irish-bred British-trained Thoroughbred racehorse |
|Eight_Belles                                        |Eight Belles                   |   3|American racehorse                                           |2008-05-03 |United States |          1|American       |racehorse                                         |
|Sybil_(cat)                                         |Sybil (cat)                    |   3|British Downing Street cat                                   |2009-07-27 |England       |          1|British .*     |Downing Street cat                                |

As one could have expected, the deaths at the youngest ages are sad stories, about humans but also about animals.

Did the age distribution change over time?

deaths <- mutate(deaths, death_year = as.factor(year(date)))
ggplot(deaths) +
  geom_boxplot(aes(death_year, age, fill = death_year)) +
  scale_fill_viridis(discrete = TRUE) +
  theme(legend.position = "none")

[Figure: boxplots of age at death by year of death]

Well, maybe there is an increasing trend? I wouldn’t be surprised if it were the case, since life expectancy tends to increase. I first wrote that I wouldn’t take the time to test the trend, and then I had a very interesting discussion with Miles McBain and Nick Tierney. I had first thought of a linear model, then of a survival analysis, but I only have positive events. With a linear model or a GLM the residuals were never normally distributed. Then Miles mentioned non-parametric tests, which is something I never think about. Googling around a bit, I found the Mann-Kendall test!

I’m quite lucky that I want to see whether age at death monotonically increases over time, because that seems to be the usual use case for it. I chose to use the time series of weekly median age, which I’m not too sure is the best choice. I could have chosen monthly average age, etc.

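library("trend")
library("lubridate")
weekly_median_age <- deaths %>%
  filter(!is.na(age)) %>%
  # label each death with its year and week number, e.g. "2004 1"
  mutate(week = paste(year(date), week(date))) %>%
  group_by(week) %>%
  # the labels are character strings, so the summary is ordered alphabetically
  # ("2004 10" comes before "2004 2"), not strictly chronologically
  summarize(age = median(age)) %>% .$age
# turn the vector of weekly medians into a time series
weekly_median_age <- as.ts(weekly_median_age)
plot(weekly_median_age)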

[Figure: time series of weekly median age at death]
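A detail that may matter for a trend test is the order of the weeks. Labels built with `paste(year, week)` are character strings, and grouped summaries come back sorted by label, so weeks within a year would not be in calendar order. The sketch below uses made-up week numbers and base R only; zero-padding is one possible fix, not necessarily the one the analysis should use:

```r
# Week labels as built with paste(): character strings
labels <- paste(2004, c(1, 2, 10, 11))
# method = "radix" sorts in the C locale, so the result does not depend on the system
sorted_labels <- sort(labels, method = "radix")
sorted_labels

# Zero-padding the week number makes alphabetical and calendar order agree
padded <- sprintf("%d-%02d", 2004L, c(1L, 2L, 10L, 11L))
sorted_padded <- sort(padded, method = "radix")
sorted_padded
```

Years still sort correctly with the unpadded labels, so the long-term picture is probably not affected much, but the exact test statistics could change.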

res <- mk.test(weekly_median_age)
summary.trend.test(res)
## Mann-Kendall Test
##  
## two-sided homogeinity test
## H0: S = 0 (no trend)
## HA: S != 0 (monotonic trend)
##  
## Statistics for total series
##       S     varS    Z   tau     pvalue
## 1 79925 36111378 13.3 0.337 < 2.22e-16
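To make the `S` and `tau` columns less mysterious, here is a minimal base R sketch of the Mann-Kendall statistic: `S` counts the pairs of observations that increase over time minus the pairs that decrease. The function name `mk_s` and the toy series are made up for the illustration, and the sketch ignores the correction for ties that a real implementation applies to the variance.

```r
# Mann-Kendall S: sum of sign(x[j] - x[i]) over all pairs with i < j
mk_s <- function(x) {
  n <- length(x)
  s <- 0
  for (i in seq_len(n - 1)) {
    s <- s + sum(sign(x[(i + 1):n] - x[i]))
  }
  s
}

# Hypothetical toy series without ties, not the real weekly medians
toy <- c(78, 79, 77, 80, 81, 83, 82)
S <- mk_s(toy)
n <- length(toy)

# Without ties, Kendall's tau is S divided by the number of pairs
tau <- S / (n * (n - 1) / 2)

# Same ratio with the numbers printed above (689 observations)
tau_post <- 79925 / (689 * 688 / 2)
round(tau_post, 3)
```

A positive `S` means that increasing pairs outnumber decreasing ones, which is how the sign of `S` and `tau` relates to the direction of the trend.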

Using this test I now have more support for the existence of a monotonic trend, and the positive S and tau suggest that it is an increasing one, but the test says nothing about its size. The same package has an implementation of Sen’s method to compute the slope.

sens <- sens.slope(weekly_median_age, level = 0.95)
sens
##  
## Sen's slope and intercept
##  
##  
## slope:  0.0065
## 95 percent confidence intervall for slope
## 0.0074 0.0056
##  
## intercept: 77.1831
## nr. of observations: 689
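For intuition, Sen's slope is the median of the slopes between all pairs of points, which makes it robust to a few extreme values. A minimal base R sketch, assuming equally spaced observations; `sen_slope` and the toy series are hypothetical and this is not the code of the trend package:

```r
# Sen's slope: median of (x[j] - x[i]) / (j - i) over all pairs with i < j
sen_slope <- function(x) {
  idx <- combn(length(x), 2)  # one pair (i, j) per column, with i < j
  slopes <- (x[idx[2, ]] - x[idx[1, ]]) / (idx[2, ] - idx[1, ])
  median(slopes)
}

# Hypothetical series rising by 0.5 per step
toy <- c(77, 77.5, 78, 78.5, 79)
slope_clean <- sen_slope(toy)

# Same series with one outlier: the median of the pairwise slopes does not move
toy_outlier <- c(77, 77.5, 90, 78.5, 79)
slope_outlier <- sen_slope(toy_outlier)
```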

The slope is expressed per week, so over one year (52 weeks) the median age at death gains about 0.34 years. Will we soon have humans as old as Harriet the tortoise?
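The conversion from the weekly slope to a yearly gain is a simple multiplication. A small sketch using the numbers printed by `sens.slope()`, and assuming 52 weekly steps per year (the variable names are made up):

```r
# Values printed above: slope and 95% confidence bounds, per weekly step
slope_per_week <- 0.0065
ci_per_week <- c(lower = 0.0056, upper = 0.0074)

# Assumption: 52 weekly steps per year
weeks_per_year <- 52

slope_per_year <- slope_per_week * weeks_per_year
ci_per_year <- ci_per_week * weeks_per_year
round(slope_per_year, 2)
round(ci_per_year, 2)
```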

Where did the notable dead come from?

deaths %>% group_by(country) %>%
  summarize(n = n()) %>%
  arrange(desc(n)) %>%
  head(n = 10) %>%
  knitr::kable()
|country       |     n|
|:-------------|-----:|
|United States | 20220|
|England       |  6163|
|Canada        |  2183|
|Australia     |  1783|
|India         |  1604|
|France        |  1277|
|Germany       |  1277|
|Italy         |  1141|
|Russia[15]    |   884|
|NA            |   857|

This is unsurprising given what I imagine to be the countries of contributors to the English Wikipedia: mostly developed countries, plus India, which is a huge English-speaking country. It’d probably be interesting to repeat the same data extraction for all languages and see how we tend to know celebrities who speak our own language or share our culture.

What were the reasons for notability?

I first played with the idea of using my monkeylearn package to associate an industry with each occupation/reason for being notable, but I soon realized the description was too short for the extractor. I also soon saw I wouldn’t be able to find a good list of jobs, so I resorted to simply looking for the most frequent terms using tidytext. For removing the stop-words I used rcorpora.

stopwords <- corpora("words/stopwords/en")$stopWords

deaths_words <- deaths %>%
  unnest_tokens(word, occupation) %>%
  count(word, sort = TRUE) %>%
  filter(!word %in% stopwords)


head(deaths_words, n = 10) %>%
  knitr::kable()
|word       |    n|
|:----------|----:|
|politician | 6285|
|player     | 4251|
|actor      | 2807|
|footballer | 2022|
|football   | 1899|
|actress    | 1693|
|singer     | 1593|
|writer     | 1526|
|american   | 1476|
|olympic    | 1396|
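The counting step above can be approximated in base R to see what happens to each description: the text is lowercased, split into words, stripped of stop-words and tabulated. This is only a rough sketch on invented occupation strings with a tiny made-up stop-word list; the tokenizer used by tidytext is more careful than a regular-expression split.

```r
# Hypothetical occupation strings, not the real data
occupation <- c("politician", "football player", "actor and singer",
                "Olympic football player", NA)

# Tiny illustrative stop-word list (the post uses the one from rcorpora)
stopwords <- c("and", "the", "of")

# Lowercase, split on anything that is not a letter, drop empty strings
words <- unlist(strsplit(tolower(occupation[!is.na(occupation)]), "[^a-z]+"))
words <- words[words != "" & !words %in% stopwords]

# Count the words and sort by decreasing frequency
word_counts <- sort(table(words), decreasing = TRUE)
head(word_counts, n = 3)
```

The lowercasing explains why terms such as "american" and "olympic" appear without capitals in the table.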

From these 10 most prevalent terms we could assume being a politician, some sort of athlete (player could also be a football player) or artist can make you notable. It’s interesting to see there are far more actors than actresses. In case you didn’t get the message, in the table there are 756 businessmen, 44 businesswomen, 4 business persons.

I also noticed that there are 147 murderers and 41 serial killers vs. 232 chemists and 46 statisticians. Since the term “data scientist” is quite young, there is none in my table, and I sure hope you’ll all stay healthy, my friends! In the next post I’ll present the analysis of the time series of monthly count of deaths.

If you liked learning more about notable dead, you can have a look at the analysis Hazel Kavili started doing of celebrity deaths in 2016.

I’d like to end this post with a note from my husband, who thinks having a blog makes me an influencer. If you too like Wikipedia, consider donating to the Wikimedia Foundation.
