Extracting notable deaths from Wikipedia

[This article was first published on Maëlle, and kindly contributed to R-bloggers].

I like Wikipedia. My husband likes it even more: he included it in his PhD thesis acknowledgements! I appreciate the effort that goes into sharing knowledge, and also the apparently random stuff you can find on the website. In particular, I’ve been intrigued by the monthly lists of notable deaths such as this one. Who are the people (or dogs, yes, dogs) whose lives were deemed notable enough to be listed there? Also, using the numbers of such deaths, can I judge whether 2016 was really worse than previous years? The first step in answering these questions was to scrape the data, and I’ll describe that process in this post. In another post I’ll have a look at my study population, and in a third post I’ll analyse the time series of death counts.

I extracted all deaths listed on Wikipedia’s monthly pages from 2004 to 2016. For most deaths, I extracted the Wikipedia link, the name of the person, their age and the first part of their description, which most often includes a nationality and a reason for being famous, e.g. “Italian astrophysicist”. I chose not to keep the rest of the line when it was longer, because the length of the reason for being famous was quite variable and the cause of death was not consistently indicated.
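
To make the target fields concrete, here is a minimal sketch of how one entry breaks down into a name, an age and a short description. The entry is hypothetical (made up for illustration, not taken from Wikipedia), it is plain text rather than the HTML the real pages contain, and the sketch uses base R only, so it is a simplification and not the parsing code shown later in this post.

```r
# Hypothetical entry, written in the plain-text shape of a typical list line
entry <- "Jane Doe, 80, Italian astrophysicist, heart failure."

# Split on commas and trim the surrounding whitespace of each piece
pieces <- trimws(strsplit(entry, ",", fixed = TRUE)[[1]])

# Keep only the first three pieces; the rest of the line is dropped
death <- list(name = pieces[1],
              age = as.numeric(pieces[2]),
              country_role = pieces[3])
str(death)
```

Anything after the third comma, such as the cause of death in this made-up line, is simply ignored, which mirrors the choice described above.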

Note that there are lists of notable deaths for deaths that occurred before 2004, but they are in a different format, so let’s say it’s a challenge for another day (or another person, like you, dear reader?).

Downloading the lists of deaths

I started by downloading the content of all monthly lists of deaths from 2004 to 2016.

<span class="n">library</span><span class="p">(</span><span class="s2">"rvest"</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="s2">"dplyr"</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="s2">"purrr"</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="s2">"tidyr"</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="s2">"lazyeval"</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="s2">"tibble"</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="s2">"fuzzyjoin"</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="s2">"stringr"</span><span class="p">)</span><span class="w">
</span><span class="n">months</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s2">"January"</span><span class="p">,</span><span class="w"> </span><span class="s2">"February"</span><span class="p">,</span><span class="w">
            </span><span class="s2">"March"</span><span class="p">,</span><span class="w"> </span><span class="s2">"April"</span><span class="p">,</span><span class="w">
            </span><span class="s2">"May"</span><span class="p">,</span><span class="w"> </span><span class="s2">"June"</span><span class="p">,</span><span class="w">
            </span><span class="s2">"July"</span><span class="p">,</span><span class="w"> </span><span class="s2">"August"</span><span class="p">,</span><span class="w">
            </span><span class="s2">"September"</span><span class="p">,</span><span class="w"> </span><span class="s2">"October"</span><span class="p">,</span><span class="w">
            </span><span class="s2">"November"</span><span class="p">,</span><span class="w"> </span><span class="s2">"December"</span><span class="p">)</span><span class="w">
</span><span class="n">years</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="m">2004</span><span class="o">:</span><span class="m">2016</span><span class="w">
</span><span class="n">pages_content</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">map</span><span class="p">(</span><span class="n">months</span><span class="p">,</span><span class="w"> </span><span class="n">paste</span><span class="p">,</span><span class="w"> </span><span class="n">years</span><span class="p">,</span><span class="w"> </span><span class="n">sep</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"_"</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
  </span><span class="n">unlist</span><span class="p">()</span><span class="w"> </span><span class="o">%>%</span><span class="w">
  </span><span class="c1"># read page with monthly deaths
</span><span class="w">  </span><span class="n">map</span><span class="p">(</span><span class="k">function</span><span class="p">(</span><span class="n">x</span><span class="p">){</span><span class="w">
    </span><span class="n">read_html</span><span class="p">(</span><span class="n">paste0</span><span class="p">(</span><span class="s2">"https://en.wikipedia.org/wiki/Deaths_in_"</span><span class="p">,</span><span class="w"> </span><span class="n">x</span><span class="p">))})</span><span class="w">
</span>
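
If you want to check which pages will be requested before downloading anything, the page names can be built on their own. The following is a small sketch in base R that mirrors the `map(months, paste, years, sep = "_")` step; it uses the built-in `month.name` constant in place of the hand-written vector of months, and it makes no network calls.

```r
# Base R equivalent of map(months, paste, years, sep = "_") %>% unlist()
months <- month.name   # built-in constant: "January", ..., "December"
years <- 2004:2016

# For each month, paste it with every year, e.g. "January_2004"
page_names <- unlist(lapply(months, paste, years, sep = "_"))

# Prefix each page name with the address of the monthly lists
urls <- paste0("https://en.wikipedia.org/wiki/Deaths_in_", page_names)

length(urls)
head(urls, 3)
```

There is one URL per month and year, so 12 × 13 = 156 pages in total, ordered by month first and then by year.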

Transforming the content of the lists into a table

I actually did all this months ago. At that time I was apparently patient enough to figure out how to extract information from the list. I was surprised to see my code still worked: to get all of 2016 I just needed to change 2015 into 2016. Note that I say list deliberately: on the webpage it’s a list, not a table, which means I couldn’t just rely on rvest and take a nap. My main goal was to get something nice for typical entries and just forget about the non-typical entries, hoping there wouldn’t be many. Here are the two functions I defined. I use stringr, not stringi, because this is old code for which stringr was enough, but since reading Bob Rudis’ post I am more curious about stringi.

<span class="n">transform_day</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">day_deaths</span><span class="p">,</span><span class="w"> </span><span class="n">day_in_month</span><span class="p">){</span><span class="w">
  </span><span class="c1"># filter only those that have the format of lines presenting deaths
</span><span class="w">  </span><span class="n">day_deaths</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">str_replace_all</span><span class="p">(</span><span class="n">day_deaths</span><span class="o">$</span><span class="n">days</span><span class="p">,</span><span class="w"> </span><span class="s2">"<ul><li>"</span><span class="p">,</span><span class="w"> </span><span class="s2">""</span><span class="p">)</span><span class="w">
  </span><span class="n">day_deaths</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">str_split</span><span class="p">(</span><span class="n">day_deaths</span><span class="p">,</span><span class="w"> </span><span class="s2">"<li>"</span><span class="p">,</span><span class="w"> </span><span class="n">simplify</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)</span><span class="w">
  </span><span class="n">day_deaths</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">day_deaths</span><span class="p">[</span><span class="o">!</span><span class="n">str_detect</span><span class="p">(</span><span class="n">day_deaths</span><span class="p">,</span><span class="w"> </span><span class="s2">"Death"</span><span class="p">)]</span><span class="w">
  </span><span class="n">day_deaths</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">day_deaths</span><span class="p">[</span><span class="o">!</span><span class="n">str_detect</span><span class="p">(</span><span class="n">day_deaths</span><span class="p">,</span><span class="w"> </span><span class="s2">"Category"</span><span class="p">)]</span><span class="w">
  
  </span><span class="c1"># Erases the end of each line
</span><span class="w">  </span><span class="n">day_deaths</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">str_replace_all</span><span class="p">(</span><span class="n">day_deaths</span><span class="p">,</span><span class="w"> </span><span class="s2">"\\(<i>.*"</span><span class="p">,</span><span class="w"> </span><span class="s2">""</span><span class="p">)</span><span class="w">
  
  </span><span class="c1"># Create a table
</span><span class="w">  </span><span class="n">day_deaths</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">tibble_</span><span class="p">(</span><span class="nf">list</span><span class="p">(</span><span class="n">line</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="o">~</span><span class="n">day_deaths</span><span class="p">))</span><span class="w">
  
  </span><span class="c1"># Variable for grouping by row 
</span><span class="w">  </span><span class="n">day_deaths</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">mutate_</span><span class="p">(</span><span class="n">day_deaths</span><span class="p">,</span><span class="w"> </span><span class="n">row</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">interp</span><span class="p">(</span><span class="o">~</span><span class="m">1</span><span class="o">:</span><span class="n">nrow</span><span class="p">(</span><span class="n">day_deaths</span><span class="p">)))</span><span class="w">
  </span><span class="n">day_deaths</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">group_by_</span><span class="p">(</span><span class="n">day_deaths</span><span class="p">,</span><span class="w"> </span><span class="s2">"row"</span><span class="p">)</span><span class="w">
  
  </span><span class="c1"># separate by the word "title", first part is a link, second part description
</span><span class="w">  </span><span class="n">day_deaths</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">mutate_</span><span class="p">(</span><span class="n">day_deaths</span><span class="p">,</span><span class="w"> </span><span class="n">line</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">interp</span><span class="p">(</span><span class="o">~</span><span class="n">str_split</span><span class="p">(</span><span class="n">line</span><span class="p">,</span><span class="w"> </span><span class="s2">"title"</span><span class="p">)))</span><span class="w">
  </span><span class="n">day_deaths</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">mutate_</span><span class="p">(</span><span class="n">day_deaths</span><span class="p">,</span><span class="w"> </span><span class="n">wiki_link</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">interp</span><span class="p">(</span><span class="o">~</span><span class="n">str_replace_all</span><span class="p">(</span><span class="n">line</span><span class="p">[[</span><span class="m">1</span><span class="p">]][</span><span class="m">1</span><span class="p">],</span><span class="w"> </span><span class="s2">"<a href=\\\"/wiki/"</span><span class="p">,</span><span class="w"> </span><span class="s2">""</span><span class="p">)))</span><span class="w">
  
  </span><span class="c1"># get Wikipedia link or better said the thing to paste to Wikipedia address
</span><span class="w">  </span><span class="n">day_deaths</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">mutate_</span><span class="p">(</span><span class="n">day_deaths</span><span class="p">,</span><span class="w"> 
                        </span><span class="n">wiki_link</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">interp</span><span class="p">(</span><span class="o">~</span><span class="n">str_replace_all</span><span class="p">(</span><span class="n">wiki_link</span><span class="p">,</span><span class="w"> </span><span class="s2">"\\\\"</span><span class="p">,</span><span class="w"> </span><span class="s2">""</span><span class="p">)))</span><span class="w">
  </span><span class="n">day_deaths</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">mutate_</span><span class="p">(</span><span class="n">day_deaths</span><span class="p">,</span><span class="w"> 
                        </span><span class="n">wiki_link</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">interp</span><span class="p">(</span><span class="o">~</span><span class="n">str_replace_all</span><span class="p">(</span><span class="n">wiki_link</span><span class="p">,</span><span class="w"> </span><span class="s2">"\""</span><span class="p">,</span><span class="w"> </span><span class="s2">""</span><span class="p">)))</span><span class="w">
  
  
  </span><span class="c1"># now transform the description into several columns
</span><span class="w">  </span><span class="c1"># the format of the end of the description is variable so
</span><span class="w">  </span><span class="c1"># I only keep the beginning of the reason for notoriety i.e. country of origin
</span><span class="w">  </span><span class="c1"># and a role
</span><span class="w">  </span><span class="c1"># anyway cause of death is not written for all and I don't want to use many details
</span><span class="w">  </span><span class="n">day_deaths</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">mutate_</span><span class="p">(</span><span class="n">day_deaths</span><span class="p">,</span><span class="w"> </span><span class="n">content</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">interp</span><span class="p">(</span><span class="o">~</span><span class="n">line</span><span class="p">[[</span><span class="m">1</span><span class="p">]][</span><span class="m">2</span><span class="p">]))</span><span class="w">
  </span><span class="n">day_deaths</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">mutate_</span><span class="p">(</span><span class="n">day_deaths</span><span class="p">,</span><span class="w">
                        </span><span class="n">content</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">interp</span><span class="p">(</span><span class="o">~</span><span class="n">str_replace_all</span><span class="p">(</span><span class="n">content</span><span class="p">,</span><span class="w"> </span><span class="s1">'<a.*a>'</span><span class="p">,</span><span class="w"> </span><span class="s2">""</span><span class="p">)))</span><span class="w">
  </span><span class="n">day_deaths</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">separate_</span><span class="p">(</span><span class="n">day_deaths</span><span class="p">,</span><span class="w"> </span><span class="s2">"content"</span><span class="p">,</span><span class="w"> 
                          </span><span class="n">into</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s2">"name"</span><span class="p">,</span><span class="w"> </span><span class="s2">"age"</span><span class="p">,</span><span class="w"> </span><span class="s2">"country_role"</span><span class="p">),</span><span class="w">
                          </span><span class="n">sep</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">","</span><span class="p">)</span><span class="w">
  
  </span><span class="c1"># when no age
</span><span class="w">  </span><span class="n">day_deaths</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">mutate_</span><span class="p">(</span><span class="n">day_deaths</span><span class="p">,</span><span class="w"> </span><span class="n">country_role</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">interp</span><span class="p">(</span><span class="o">~</span><span class="n">ifelse</span><span class="p">(</span><span class="nf">is.na</span><span class="p">(</span><span class="n">country_role</span><span class="p">),</span><span class="w">
                                                                  </span><span class="n">age</span><span class="p">,</span><span class="w">
                                                                  </span><span class="n">country_role</span><span class="p">)))</span><span class="w">
  </span><span class="n">day_deaths</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">mutate_</span><span class="p">(</span><span class="n">day_deaths</span><span class="p">,</span><span class="w"> </span><span class="n">age</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">interp</span><span class="p">(</span><span class="o">~</span><span class="w"> </span><span class="nf">as.numeric</span><span class="p">(</span><span class="n">age</span><span class="p">)))</span><span class="w">
  
  </span><span class="c1"># improves the name
</span><span class="w">  </span><span class="n">day_deaths</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">mutate_</span><span class="p">(</span><span class="n">day_deaths</span><span class="p">,</span><span class="w"> 
                        </span><span class="n">name</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">interp</span><span class="p">(</span><span class="o">~</span><span class="w"> </span><span class="n">str_replace_all</span><span class="p">(</span><span class="n">name</span><span class="p">,</span><span class="w"> </span><span class="s2">"\">.*"</span><span class="p">,</span><span class="w"> </span><span class="s2">""</span><span class="p">)))</span><span class="w">
  </span><span class="n">day_deaths</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">mutate_</span><span class="p">(</span><span class="n">day_deaths</span><span class="p">,</span><span class="w"> 
                        </span><span class="n">name</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">interp</span><span class="p">(</span><span class="o">~</span><span class="w"> </span><span class="n">str_replace_all</span><span class="p">(</span><span class="n">name</span><span class="p">,</span><span class="w"> </span><span class="s2">"=\""</span><span class="p">,</span><span class="w"> </span><span class="s2">""</span><span class="p">)))</span><span class="w">
  
  </span><span class="c1"># improves the country_role
</span><span class="w">  </span><span class="n">day_deaths</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">mutate_</span><span class="p">(</span><span class="n">day_deaths</span><span class="p">,</span><span class="w"> 
                        </span><span class="n">country_role</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">interp</span><span class="p">(</span><span class="o">~</span><span class="w"> </span><span class="n">str_replace_all</span><span class="p">(</span><span class="n">country_role</span><span class="p">,</span><span class="w"> </span><span class="s2">"</li>"</span><span class="p">,</span><span class="w"> </span><span class="s2">""</span><span class="p">)))</span><span class="w">
  </span><span class="n">day_deaths</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">mutate_</span><span class="p">(</span><span class="n">day_deaths</span><span class="p">,</span><span class="w"> 
                        </span><span class="n">country_role</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">interp</span><span class="p">(</span><span class="o">~</span><span class="w"> </span><span class="n">str_replace_all</span><span class="w">
                                              </span><span class="p">(</span><span class="n">country_role</span><span class="p">,</span><span class="w"> </span><span class="s2">"</ul>"</span><span class="p">,</span><span class="w"> </span><span class="s2">""</span><span class="p">)))</span><span class="w">
  
  </span><span class="c1"># get rid of original line
</span><span class="w">  </span><span class="n">day_deaths</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">select_</span><span class="p">(</span><span class="n">day_deaths</span><span class="p">,</span><span class="w"> </span><span class="nf">quote</span><span class="p">(</span><span class="o">-</span><span class="w"> </span><span class="n">line</span><span class="p">))</span><span class="w">
  
  </span><span class="c1"># get rid of grouping
</span><span class="w">  </span><span class="n">day_deaths</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">ungroup</span><span class="p">(</span><span class="n">day_deaths</span><span class="p">)</span><span class="w">
  </span><span class="n">day_deaths</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">select_</span><span class="p">(</span><span class="n">day_deaths</span><span class="p">,</span><span class="w"> </span><span class="nf">quote</span><span class="p">(</span><span class="o">-</span><span class="w"> </span><span class="n">row</span><span class="p">))</span><span class="w">
  </span><span class="nf">return</span><span class="p">(</span><span class="n">day_deaths</span><span class="p">)</span><span class="w">
</span><span class="p">}</span><span class="w">

</span><span class="n">parse_month_deaths</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">month_deaths</span><span class="p">){</span><span class="w">
  
  </span><span class="c1"># remember month and year
</span><span class="w">  </span><span class="n">title</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">stringr</span><span class="o">::</span><span class="n">str_extract</span><span class="p">(</span><span class="n">toString</span><span class="p">(</span><span class="n">html_nodes</span><span class="p">(</span><span class="n">month_deaths</span><span class="p">,</span><span class="w"> </span><span class="s2">"title"</span><span class="p">)),</span><span class="w"> 
                                </span><span class="s2">"Deaths in .* -"</span><span class="p">)</span><span class="w">
  
  </span><span class="n">title</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">str_replace_all</span><span class="p">(</span><span class="n">title</span><span class="p">,</span><span class="w"> </span><span class="s2">"Deaths in "</span><span class="p">,</span><span class="w"> </span><span class="s2">""</span><span class="p">)</span><span class="w">
  </span><span class="n">title</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">str_replace_all</span><span class="p">(</span><span class="n">title</span><span class="p">,</span><span class="w"> </span><span class="s2">" -"</span><span class="p">,</span><span class="w"> </span><span class="s2">""</span><span class="p">)</span><span class="w">
  </span><span class="n">title</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">str_replace_all</span><span class="p">(</span><span class="n">title</span><span class="p">,</span><span class="w"> </span><span class="s2">"January"</span><span class="p">,</span><span class="w"> </span><span class="s2">"01"</span><span class="p">)</span><span class="w">
  </span><span class="n">title</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">str_replace_all</span><span class="p">(</span><span class="n">title</span><span class="p">,</span><span class="w"> </span><span class="s2">"February"</span><span class="p">,</span><span class="w"> </span><span class="s2">"02"</span><span class="p">)</span><span class="w">
  </span><span class="n">title</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">str_replace_all</span><span class="p">(</span><span class="n">title</span><span class="p">,</span><span class="w"> </span><span class="s2">"March"</span><span class="p">,</span><span class="w"> </span><span class="s2">"03"</span><span class="p">)</span><span class="w">
  </span><span class="n">title</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">str_replace_all</span><span class="p">(</span><span class="n">title</span><span class="p">,</span><span class="w"> </span><span class="s2">"April"</span><span class="p">,</span><span class="w"> </span><span class="s2">"04"</span><span class="p">)</span><span class="w">
  </span><span class="n">title</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">str_replace_all</span><span class="p">(</span><span class="n">title</span><span class="p">,</span><span class="w"> </span><span class="s2">"May"</span><span class="p">,</span><span class="w"> </span><span class="s2">"05"</span><span class="p">)</span><span class="w">
  </span><span class="n">title</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">str_replace_all</span><span class="p">(</span><span class="n">title</span><span class="p">,</span><span class="w"> </span><span class="s2">"June"</span><span class="p">,</span><span class="w"> </span><span class="s2">"06"</span><span class="p">)</span><span class="w">
  </span><span class="n">title</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">str_replace_all</span><span class="p">(</span><span class="n">title</span><span class="p">,</span><span class="w"> </span><span class="s2">"July"</span><span class="p">,</span><span class="w"> </span><span class="s2">"07"</span><span class="p">)</span><span class="w">
  </span><span class="n">title</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">str_replace_all</span><span class="p">(</span><span class="n">title</span><span class="p">,</span><span class="w"> </span><span class="s2">"August"</span><span class="p">,</span><span class="w"> </span><span class="s2">"08"</span><span class="p">)</span><span class="w">
  </span><span class="n">title</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">str_replace_all</span><span class="p">(</span><span class="n">title</span><span class="p">,</span><span class="w"> </span><span class="s2">"September"</span><span class="p">,</span><span class="w"> </span><span class="s2">"09"</span><span class="p">)</span><span class="w">
  </span><span class="n">title</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">str_replace_all</span><span class="p">(</span><span class="n">title</span><span class="p">,</span><span class="w"> </span><span class="s2">"October"</span><span class="p">,</span><span class="w"> </span><span class="s2">"10"</span><span class="p">)</span><span class="w">
  </span><span class="n">title</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">str_replace_all</span><span class="p">(</span><span class="n">title</span><span class="p">,</span><span class="w"> </span><span class="s2">"November"</span><span class="p">,</span><span class="w"> </span><span class="s2">"11"</span><span class="p">)</span><span class="w">
  </span><span class="n">title</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">str_replace_all</span><span class="p">(</span><span class="n">title</span><span class="p">,</span><span class="w"> </span><span class="s2">"December"</span><span class="p">,</span><span class="w"> </span><span class="s2">"12"</span><span class="p">)</span><span class="w">
  
  </span><span class="c1"># find days with deaths
</span><span class="w">  </span><span class="n">content</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">toString</span><span class="p">(</span><span class="n">month_deaths</span><span class="p">)</span><span class="w">
  </span><span class="n">content</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">str_split</span><span class="p">(</span><span class="n">content</span><span class="p">,</span><span class="w"> </span><span class="s2">"\n"</span><span class="p">,</span><span class="w"> </span><span class="n">simplify</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)</span><span class="w">
  </span><span class="n">days</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">which</span><span class="p">(</span><span class="n">str_detect</span><span class="p">(</span><span class="n">content</span><span class="p">,</span><span class="w"> </span><span class="s2">"h3"</span><span class="p">))</span><span class="w">
  
  </span><span class="n">paragraphs</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">which</span><span class="p">(</span><span class="n">str_detect</span><span class="p">(</span><span class="n">content</span><span class="p">,</span><span class="w"> </span><span class="s2">"<ul>"</span><span class="p">))</span><span class="w">
  
  </span><span class="n">last_good</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">max</span><span class="p">(</span><span class="n">which</span><span class="p">(</span><span class="n">str_detect</span><span class="p">(</span><span class="n">unlist</span><span class="p">(</span><span class="n">lapply</span><span class="p">(</span><span class="n">html_nodes</span><span class="p">(</span><span class="n">month_deaths</span><span class="p">,</span><span class="w"> </span><span class="s2">"h3"</span><span class="p">),</span><span class="w"> 
                                             </span><span class="n">toString</span><span class="p">)),</span><span class="w">
                               </span><span class="n">pattern</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"title=Deaths_in_.*_.."</span><span class="p">)))</span><span class="w">
  
  </span><span class="n">possible_days</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="m">1</span><span class="o">:</span><span class="n">last_good</span><span class="w">
  
  </span><span class="n">possible_days</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">possible_days</span><span class="p">[</span><span class="n">diff</span><span class="p">(</span><span class="nf">c</span><span class="p">(</span><span class="n">days</span><span class="p">[</span><span class="n">possible_days</span><span class="p">],</span><span class="w"> </span><span class="m">999999</span><span class="p">))</span><span class="w"> </span><span class="o">></span><span class="w"> </span><span class="m">1</span><span class="p">]</span><span class="w">
  </span><span class="c1"># read only lines
</span><span class="w">  </span><span class="n">month_deaths</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">html_nodes</span><span class="p">(</span><span class="n">month_deaths</span><span class="p">,</span><span class="w"> </span><span class="s2">"ul"</span><span class="p">)</span><span class="w">
  </span><span class="n">first_not_good</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">max</span><span class="p">(</span><span class="n">which</span><span class="p">(</span><span class="n">str_detect</span><span class="p">(</span><span class="n">month_deaths</span><span class="p">,</span><span class="w"> </span><span class="s2">"Template"</span><span class="p">)))</span><span class="w">
  </span><span class="n">first_good</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">min</span><span class="p">(</span><span class="n">which</span><span class="p">(</span><span class="n">str_detect</span><span class="p">(</span><span class="n">month_deaths</span><span class="p">,</span><span class="w"> </span><span class="s2">"wiki"</span><span class="p">)))</span><span class="w">
  </span><span class="n">paragraphs</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">paragraphs</span><span class="p">[</span><span class="n">first_good</span><span class="o">:</span><span class="p">(</span><span class="n">first_not_good</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="m">1</span><span class="p">)]</span><span class="w">
  
  </span><span class="k">if</span><span class="p">(</span><span class="nf">length</span><span class="p">(</span><span class="n">paragraphs</span><span class="p">)</span><span class="w"> </span><span class="o">></span><span class="w"> </span><span class="nf">length</span><span class="p">(</span><span class="n">possible_days</span><span class="p">)){</span><span class="w">
    </span><span class="n">jours</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="kc">NULL</span><span class="w">
    </span><span class="k">for</span><span class="p">(</span><span class="n">paragraph</span><span class="w"> </span><span class="k">in</span><span class="w"> </span><span class="n">paragraphs</span><span class="p">){</span><span class="w">
      </span><span class="n">x</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">days</span><span class="p">[</span><span class="n">days</span><span class="w"> </span><span class="o"><</span><span class="w"> </span><span class="n">paragraph</span><span class="p">]</span><span class="w">
      </span><span class="n">jours</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="n">jours</span><span class="p">,</span><span class="w">
                 </span><span class="n">possible_days</span><span class="p">[</span><span class="n">which</span><span class="p">(</span><span class="nf">abs</span><span class="p">(</span><span class="n">x</span><span class="o">-</span><span class="n">paragraph</span><span class="p">)</span><span class="o">==</span><span class="nf">min</span><span class="p">(</span><span class="nf">abs</span><span class="p">(</span><span class="n">x</span><span class="o">-</span><span class="n">paragraph</span><span class="p">)))])</span><span class="w">
    </span><span class="p">}</span><span class="w">
    </span><span class="n">possible_days</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">jours</span><span class="w">
  </span><span class="p">}</span><span class="w">
  
  </span><span class="n">month_deaths</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">month_deaths</span><span class="p">[</span><span class="n">first_good</span><span class="o">:</span><span class="p">(</span><span class="n">first_not_good</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="m">1</span><span class="p">)]</span><span class="w">
  </span><span class="n">month_deaths</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">unlist</span><span class="p">(</span><span class="n">map</span><span class="p">(</span><span class="n">month_deaths</span><span class="p">,</span><span class="w"> </span><span class="n">toString</span><span class="p">))</span><span class="w">
  
  
  
  </span><span class="c1"># transform for getting the different columns
</span><span class="w">  </span><span class="n">month_deaths_table</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">data.frame</span><span class="p">(</span><span class="n">days</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">month_deaths</span><span class="p">,</span><span class="w">
                                   </span><span class="n">day_in_month</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">possible_days</span><span class="p">,</span><span class="w">
                                   </span><span class="n">title</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">title</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
    </span><span class="n">by_row</span><span class="p">(</span><span class="n">transform_day</span><span class="p">,</span><span class="w"> </span><span class="n">.to</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"all_deaths"</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
    </span><span class="n">unnest_</span><span class="p">(</span><span class="s2">"all_deaths"</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
    </span><span class="n">group_by_</span><span class="p">(</span><span class="s2">"wiki_link"</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
    </span><span class="n">mutate_</span><span class="p">(</span><span class="n">index</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="m">1</span><span class="o">:</span><span class="n">n</span><span class="p">())</span><span class="w"> </span><span class="o">%>%</span><span class="w">
    </span><span class="n">filter_</span><span class="p">(</span><span class="o">~</span><span class="n">index</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="m">1</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
    </span><span class="n">select_</span><span class="p">(</span><span class="nf">quote</span><span class="p">(</span><span class="o">-</span><span class="n">index</span><span class="p">))</span><span class="w"> </span><span class="o">%>%</span><span class="w">
    </span><span class="n">group_by_</span><span class="p">(</span><span class="s2">"days"</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
    </span><span class="n">mutate_</span><span class="p">(</span><span class="n">date</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">interp</span><span class="p">(</span><span class="o">~</span><span class="w"> </span><span class="n">lubridate</span><span class="o">::</span><span class="n">dmy</span><span class="p">(</span><span class="n">paste</span><span class="p">(</span><span class="n">day_in_month</span><span class="p">,</span><span class="w"> </span><span class="n">title</span><span class="p">))))</span><span class="w"> </span><span class="o">%>%</span><span class="w">
    </span><span class="n">select_</span><span class="p">(</span><span class="nf">quote</span><span class="p">(</span><span class="o">-</span><span class="w"> </span><span class="n">day_in_month</span><span class="p">))</span><span class="w"> </span><span class="o">%>%</span><span class="w">
    </span><span class="n">select_</span><span class="p">(</span><span class="nf">quote</span><span class="p">(</span><span class="o">-</span><span class="w"> </span><span class="n">title</span><span class="p">))</span><span class="w"> </span><span class="o">%>%</span><span class="w">
    </span><span class="n">ungroup</span><span class="p">()</span><span class="w"> </span><span class="o">%>%</span><span class="w">
    </span><span class="n">select_</span><span class="p">(</span><span class="nf">quote</span><span class="p">(</span><span class="o">-</span><span class="w"> </span><span class="n">days</span><span class="p">))</span><span class="w"> 
  
  </span><span class="n">month_deaths_table</span><span class="w">
  
</span><span class="p">}</span><span class="w">
</span>

Once the functions were written (and tested on a few pages), I simply mapped them over all the pages.

<span class="n">deaths_2004_2016</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">pages_content</span><span class="w"> </span><span class="o">%>%</span><span class="w">  
  </span><span class="n">map</span><span class="p">(</span><span class="n">parse_month_deaths</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
  </span><span class="n">dplyr</span><span class="o">::</span><span class="n">bind_rows</span><span class="p">()</span><span class="w">
</span><span class="n">knitr</span><span class="o">::</span><span class="n">kable</span><span class="p">(</span><span class="n">head</span><span class="p">(</span><span class="n">deaths_2004_2016</span><span class="p">))</span><span class="w">
</span>
| wiki_link | name | age | country_role | date |
|---|---|---|---|---|
| Harold_Henning | Harold Henning | 69 | South African golfer. | 2004-01-01 |
| Elma_Lewis | Elma Lewis | 82 | American arts leader. | 2004-01-01 |
| Manuel_F%C3%A9lix_L%C3%B3pez | Manuel Félix López | 66 | Ecuadorian politician. | 2004-01-01 |
| Frederick_Redlich | Frederick Redlich | 93 | Austrian-born American dean of the | 2004-01-01 |
| Etta_Moten_Barnett | Etta Moten Barnett | 102 | American actress. | 2004-01-02 |
| Lynn_Cartwright | Lynn Cartwright | 76 | U.S. actress. | 2004-01-02 |

I was already quite happy with this table, but I wanted to add a country to most rows and to separate each person's role from the adjectival.

Get the demonyms table from Wikipedia

I discovered Wikipedia has a table (a table! a table!) of adjectivals for many countries and nations. The only change I made was to get one row per adjectival when a country or nation had several. I also counted the number of words in each adjectival so that I could easily remove it from the "country_role" column and get the role on its own.

<span class="n">demonyms_page</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">read_html</span><span class="p">(</span><span class="s2">"https://en.m.wikipedia.org/wiki/List_of_adjectival_and_demonymic_forms_for_countries_and_nations"</span><span class="p">)</span><span class="w">

</span><span class="n">demonyms_table</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">html_nodes</span><span class="p">(</span><span class="n">demonyms_page</span><span class="p">,</span><span class="w"> </span><span class="s2">"table"</span><span class="p">)[</span><span class="m">1</span><span class="p">]</span><span class="w">
</span><span class="n">demonyms_table</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">html_table</span><span class="p">(</span><span class="n">demonyms_table</span><span class="p">,</span><span class="w"> </span><span class="n">fill</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)[[</span><span class="m">1</span><span class="p">]]</span><span class="w">
</span><span class="nf">names</span><span class="p">(</span><span class="n">demonyms_table</span><span class="p">)</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s2">"country"</span><span class="p">,</span><span class="w"> </span><span class="s2">"adjectivals"</span><span class="p">,</span><span class="w"> </span><span class="s2">"demonyms"</span><span class="p">,</span><span class="w"> </span><span class="s2">"colloquial_demonyms"</span><span class="p">)</span><span class="w">
</span><span class="n">demonyms_table</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">tibble</span><span class="o">::</span><span class="n">as_tibble</span><span class="p">(</span><span class="n">demonyms_table</span><span class="p">)</span><span class="w">
</span><span class="n">demonyms_table</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">by_row</span><span class="p">(</span><span class="n">demonyms_table</span><span class="p">,</span><span class="w">
                         </span><span class="k">function</span><span class="p">(</span><span class="n">df</span><span class="p">){</span><span class="w">
                           </span><span class="n">adjectivals</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">tibble_</span><span class="p">(</span><span class="nf">list</span><span class="p">(</span><span class="n">goodadjectivals</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">interp</span><span class="p">(</span><span class="o">~</span><span class="n">str_split</span><span class="p">(</span><span class="n">df</span><span class="o">$</span><span class="n">adjectivals</span><span class="p">,</span><span class="w"> </span><span class="s2">" or "</span><span class="p">))))</span><span class="w">
                           </span><span class="p">},</span><span class="w"> </span><span class="n">.collate</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"list"</span><span class="p">,</span><span class="w"> </span><span class="n">.to</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"good_adjectivals"</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
  </span><span class="n">unnest_</span><span class="p">(</span><span class="s2">"good_adjectivals"</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
  </span><span class="n">unnest_</span><span class="p">(</span><span class="s2">"goodadjectivals"</span><span class="p">)</span><span class="w">
</span><span class="n">demonyms_table</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">select_</span><span class="p">(</span><span class="n">demonyms_table</span><span class="p">,</span><span class="w"> </span><span class="nf">quote</span><span class="p">(</span><span class="o">-</span><span class="w"> </span><span class="n">adjectivals</span><span class="p">))</span><span class="w">
</span><span class="n">demonyms_table</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">rename_</span><span class="p">(</span><span class="n">demonyms_table</span><span class="p">,</span><span class="w"> </span><span class="s2">"adjectivals"</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"goodadjectivals"</span><span class="w"> </span><span class="p">)</span><span class="w">
</span><span class="n">demonyms_table</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">by_row</span><span class="p">(</span><span class="n">demonyms_table</span><span class="p">,</span><span class="w">
                         </span><span class="k">function</span><span class="p">(</span><span class="n">df</span><span class="p">){</span><span class="w">
                           </span><span class="n">adjectivals</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">tibble_</span><span class="p">(</span><span class="nf">list</span><span class="p">(</span><span class="n">goodadjectivals</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">interp</span><span class="p">(</span><span class="o">~</span><span class="n">str_split</span><span class="p">(</span><span class="n">df</span><span class="o">$</span><span class="n">adjectivals</span><span class="p">,</span><span class="w"> </span><span class="s2">","</span><span class="p">))))</span><span class="w">
                         </span><span class="p">},</span><span class="w"> </span><span class="n">.collate</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"list"</span><span class="p">,</span><span class="w"> </span><span class="n">.to</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"good_adjectivals"</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
  </span><span class="n">unnest_</span><span class="p">(</span><span class="s2">"good_adjectivals"</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
  </span><span class="n">unnest_</span><span class="p">(</span><span class="s2">"goodadjectivals"</span><span class="p">)</span><span class="w">

</span><span class="n">demonyms_table</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">select_</span><span class="p">(</span><span class="n">demonyms_table</span><span class="p">,</span><span class="w"> </span><span class="nf">quote</span><span class="p">(</span><span class="o">-</span><span class="w"> </span><span class="n">adjectivals</span><span class="p">))</span><span class="w">
</span><span class="n">demonyms_table</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">rename_</span><span class="p">(</span><span class="n">demonyms_table</span><span class="p">,</span><span class="w"> </span><span class="s2">"adjectivals"</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"goodadjectivals"</span><span class="w"> </span><span class="p">)</span><span class="w">

</span><span class="n">demonyms_table</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">mutate</span><span class="p">(</span><span class="n">demonyms_table</span><span class="p">,</span><span class="w"> </span><span class="n">adjectivals</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">trimws</span><span class="p">(</span><span class="n">adjectivals</span><span class="w"> </span><span class="p">))</span><span class="w">
</span><span class="n">demonyms_table</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">by_row</span><span class="p">(</span><span class="n">demonyms_table</span><span class="p">,</span><span class="w">
                         </span><span class="k">function</span><span class="p">(</span><span class="n">df</span><span class="p">){</span><span class="w">
                           </span><span class="nf">length</span><span class="p">(</span><span class="n">str_split</span><span class="p">(</span><span class="n">df</span><span class="o">$</span><span class="n">adjectivals</span><span class="p">,</span><span class="w"> </span><span class="s2">" "</span><span class="p">)[[</span><span class="m">1</span><span class="p">]])},</span><span class="w">
                         </span><span class="n">.to</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"adj_length"</span><span class="p">,</span><span class="w"> </span><span class="n">.collate</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"cols"</span><span class="p">)</span><span class="w">
</span>
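
As a rough illustration of the two ideas above (one row per adjectival, then a word count used to strip the adjectival from a description), here is a minimal base R sketch. It is not the code used for this post: the `demonyms_demo` data and the `drop_first_words()` helper are hypothetical and only meant to show the logic on a toy input.

```r
# Hypothetical miniature of the demonyms table (values chosen for illustration)
demonyms_demo <- data.frame(
  country = c("New Zealand", "Ivory Coast", "France"),
  adjectivals = c("New Zealand", "Ivorian or Ivory Coast", "French"),
  stringsAsFactors = FALSE
)

# Split each cell on " or " and on commas: one or more adjectivals per country
pieces <- strsplit(demonyms_demo$adjectivals, " or |,")

# Build one row per adjectival, repeating the country as often as needed
one_per_row <- data.frame(
  country = rep(demonyms_demo$country, lengths(pieces)),
  adjectivals = trimws(unlist(pieces)),
  stringsAsFactors = FALSE
)

# Number of words in each adjectival
one_per_row$adj_length <- lengths(strsplit(one_per_row$adjectivals, " "))

# Hypothetical helper: drop the first n words of a description to keep the role
drop_first_words <- function(text, n) {
  words <- strsplit(trimws(text), " ")[[1]]
  paste(utils::tail(words, -n), collapse = " ")
}

print(one_per_row)
print(drop_first_words("New Zealand rugby player", 2))
```

Using `tail(words, -n)` rather than an index range means a description that contains nothing but the adjectival simply yields an empty string.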

Using adjectivals to split deaths’ country and role

To find which country/nation to add to each line I used fuzzyjoin::regex_left_join(), which worked well but was a bit slow given the number of lines.

<span class="n">deaths</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">deaths_2004_2016</span><span class="w">
</span><span class="n">demonyms</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">demonyms_table</span><span class="w">
</span><span class="n">deaths</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">mutate</span><span class="p">(</span><span class="n">deaths</span><span class="p">,</span><span class="w"> </span><span class="n">country_role</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">str_replace</span><span class="p">(</span><span class="n">country_role</span><span class="p">,</span><span class="w">
                                                    </span><span class="s2">"<.*"</span><span class="p">,</span><span class="w"> </span><span class="s2">""</span><span class="p">))</span><span class="w">

</span><span class="n">demonyms</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">mutate</span><span class="p">(</span><span class="n">demonyms</span><span class="p">,</span><span class="w"> </span><span class="n">adjectivals</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">strsplit</span><span class="p">(</span><span class="n">adjectivals</span><span class="p">,</span><span class="w"> </span><span class="s2">","</span><span class="p">))</span><span class="w">
</span><span class="n">demonyms</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">unnest</span><span class="p">(</span><span class="n">demonyms</span><span class="p">,</span><span class="w"> </span><span class="n">adjectivals</span><span class="p">)</span><span class="w">
</span><span class="n">demonyms</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">mutate</span><span class="p">(</span><span class="n">demonyms</span><span class="p">,</span><span class="w"> 
                   </span><span class="n">adjectivals</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">paste0</span><span class="p">(</span><span class="n">adjectivals</span><span class="p">,</span><span class="w"> </span><span class="s2">" .*"</span><span class="p">))</span><span class="w">
</span><span class="n">deaths</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">regex_left_join</span><span class="p">(</span><span class="n">deaths</span><span class="p">,</span><span class="w"> 
                          </span><span class="n">demonyms</span><span class="p">,</span><span class="w"> </span><span class="n">by</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s2">"country_role"</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"adjectivals"</span><span class="p">))</span><span class="w">

</span><span class="n">deaths</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">mutate</span><span class="p">(</span><span class="n">deaths</span><span class="p">,</span><span class="w">
                 </span><span class="n">country</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">ifelse</span><span class="p">(</span><span class="n">str_detect</span><span class="p">(</span><span class="n">country_role</span><span class="p">,</span><span class="w"> </span><span class="s2">"American"</span><span class="p">),</span><span class="w">
                                  </span><span class="s2">"United States"</span><span class="p">,</span><span class="w">
                                  </span><span class="n">country</span><span class="p">))</span><span class="w">

</span><span class="n">deaths</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">mutate</span><span class="p">(</span><span class="n">deaths</span><span class="p">,</span><span class="w">
                 </span><span class="n">adjectivals</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">ifelse</span><span class="p">(</span><span class="n">country</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="s2">"United States"</span><span class="p">,</span><span class="w">
                                       </span><span class="s2">"American"</span><span class="p">,</span><span class="w"> </span><span class="n">adjectivals</span><span class="p">))</span><span class="w">
</span><span class="n">deaths</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">mutate</span><span class="p">(</span><span class="n">deaths</span><span class="p">,</span><span class="w">
                 </span><span class="n">adj_length</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">ifelse</span><span class="p">(</span><span class="n">country</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="s2">"United States"</span><span class="p">,</span><span class="w">
                                      </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">adj_length</span><span class="p">))</span><span class="w">
</span><span class="c1"># keep one country only
</span><span class="n">deaths</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">group_by</span><span class="p">(</span><span class="n">deaths</span><span class="p">,</span><span class="w"> </span><span class="n">wiki_link</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
  </span><span class="n">mutate</span><span class="p">(</span><span class="n">index</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="o">:</span><span class="n">n</span><span class="p">())</span><span class="w"> </span><span class="o">%>%</span><span class="w">
  </span><span class="n">filter</span><span class="p">(</span><span class="n">index</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="m">1</span><span class="p">)</span><span class="w">
</span>
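
To make the idea of a regex join more concrete, here is a small base R sketch of what matching descriptions against adjectival patterns amounts to. It is only an approximation under stated assumptions: the toy data frames `deaths_demo` and `demonyms_demo` are hypothetical, and the sketch keeps just the first matching pattern per row, whereas a regex left join returns one row per match (the post then keeps one row per link).

```r
# Hypothetical toy inputs, not taken from the scraped data
deaths_demo <- data.frame(
  name = c("Person A", "Person B", "Person C"),
  country_role = c("Italian astrophysicist", "South African golfer", "racehorse"),
  stringsAsFactors = FALSE
)
demonyms_demo <- data.frame(
  country = c("Italy", "South Africa"),
  adjectivals = c("Italian", "South African"),
  stringsAsFactors = FALSE
)

# Same pattern construction as in the post: adjectival, a space, then anything
patterns <- paste0(demonyms_demo$adjectivals, " .*")

# For each description, index of the first pattern that matches (NA if none)
first_match <- vapply(deaths_demo$country_role, function(text) {
  hits <- which(vapply(patterns, grepl, logical(1), x = text))
  if (length(hits) == 0) NA_integer_ else hits[[1]]
}, integer(1), USE.NAMES = FALSE)

# Rows without any match keep NA, as a left join would
deaths_demo$country <- demonyms_demo$country[first_match]
print(deaths_demo)
```

Every description is tested against every pattern, so the work grows with the product of the two table sizes, which fits with the join feeling slow on many lines.
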
<span class="n">deaths</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">by_row</span><span class="p">(</span><span class="n">deaths</span><span class="p">,</span><span class="w">
                 </span><span class="k">function</span><span class="p">(</span><span class="n">df</span><span class="p">){</span><span class="w">
                   </span><span class="k">if</span><span class="p">(</span><span class="o">!</span><span class="nf">is.na</span><span class="p">(</span><span class="n">df</span><span class="o">$</span><span class="n">adj_length</span><span class="p">))</span><span class="w"> </span><span class="p">{</span><span class="w">
                     </span><span class="n">country_role</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">trimws</span><span class="p">(</span><span class="n">df</span><span class="o">$</span><span class="n">country_role</span><span class="p">)</span><span class="w">
                     </span><span class="n">splitted</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">str_split</span><span class="p">(</span><span class="n">country_role</span><span class="p">,</span><span class="w"> </span><span class="s2">" "</span><span class="p">)[[</span><span class="m">1</span><span class="p">]]</span><span class="w">
                     </span><span class="n">role</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">toString</span><span class="p">(</span><span class="n">splitted</span><span class="p">[(</span><span class="n">df</span><span class="o">$</span><span class="n">adj_length</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="m">1</span><span class="p">)</span><span class="o">:</span><span class="nf">length</span><span class="p">(</span><span class="n">splitted</span><span class="p">)])</span><span class="w">
                     </span><span class="n">str_replace_all</span><span class="p">(</span><span class="n">role</span><span class="p">,</span><span class="w"> </span><span class="s2">","</span><span class="p">,</span><span class="w"> </span><span class="s2">""</span><span class="p">)</span><span class="w">
                     </span><span class="p">}</span><span class="k">else</span><span class="p">{</span><span class="w">
                     </span><span class="n">df</span><span class="o">$</span><span class="n">country_role</span><span class="w">
                   </span><span class="p">}</span><span class="w">
                   
                 </span><span class="p">},</span><span class="w"> </span><span class="n">.to</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"occupation"</span><span class="p">,</span><span class="w"> </span><span class="n">.collate</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"cols"</span><span class="p">)</span><span class="w">

</span><span class="n">deaths</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">mutate</span><span class="p">(</span><span class="n">deaths</span><span class="p">,</span><span class="w"> 
                 </span><span class="n">occupation</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">str_replace_all</span><span class="p">(</span><span class="n">occupation</span><span class="p">,</span><span class="w"> </span><span class="s2">"\\r"</span><span class="p">,</span><span class="w"> </span><span class="s2">""</span><span class="p">))</span><span class="w">
</span><span class="n">deaths</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">mutate</span><span class="p">(</span><span class="n">deaths</span><span class="p">,</span><span class="w"> 
                 </span><span class="n">occupation</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">str_replace_all</span><span class="p">(</span><span class="n">occupation</span><span class="p">,</span><span class="w"> </span><span class="s2">"\\n"</span><span class="p">,</span><span class="w"> </span><span class="s2">""</span><span class="p">))</span><span class="w">
</span><span class="n">deaths</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">mutate</span><span class="p">(</span><span class="n">deaths</span><span class="p">,</span><span class="w"> 
                 </span><span class="n">occupation</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">str_replace_all</span><span class="p">(</span><span class="n">occupation</span><span class="p">,</span><span class="w"> </span><span class="s2">"\\."</span><span class="p">,</span><span class="w"> </span><span class="s2">""</span><span class="p">))</span><span class="w">

</span><span class="n">deaths</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">select</span><span class="p">(</span><span class="n">deaths</span><span class="p">,</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">demonyms</span><span class="p">,</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">colloquial_demonyms</span><span class="p">,</span><span class="w">
                   </span><span class="o">-</span><span class="w"> </span><span class="n">index</span><span class="p">)</span><span class="w">
</span><span class="n">readr</span><span class="o">::</span><span class="n">write_csv</span><span class="p">(</span><span class="n">deaths</span><span class="p">,</span><span class="w"> </span><span class="n">path</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"data/deaths_with_demonyms.csv"</span><span class="p">)</span><span class="w">
</span><span class="n">knitr</span><span class="o">::</span><span class="n">kable</span><span class="p">(</span><span class="n">head</span><span class="p">(</span><span class="n">deaths</span><span class="p">))</span><span class="w">
</span>
wiki_link name age country_role date country adj_length adjectivals occupation
Harold_Henning Harold Henning 69 South African golfer. 2004-01-01 South Africa 2 South African .* golfer
Elma_Lewis Elma Lewis 82 American arts leader. 2004-01-01 United States 1 American arts leader
Manuel_F%C3%A9lix_L%C3%B3pez Manuel Félix López 66 Ecuadorian politician. 2004-01-01 Ecuador 1 Ecuadorian .* politician
Frederick_Redlich Frederick Redlich 93 Austrian-born American dean of the 2004-01-01 United States 1 American American dean of the
Etta_Moten_Barnett Etta Moten Barnett 102 American actress. 2004-01-02 United States 1 American actress
Lynn_Cartwright Lynn Cartwright 76 U.S. actress. 2004-01-02 United States[20] 1 U.S. .* actress
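As a side note on the cleaning step that precedes this table: the three successive `str_replace_all()` calls could probably be collapsed into a single call with a character class. The snippet below is only a minimal sketch of that idea. The `occupation` vector is a made-up sample, not the scraped data, and it assumes the stringr package is installed.

```r
library("stringr")

# Hypothetical sample strings, not taken from the scraped data
occupation <- c("golfer.\n", "arts leader.\r\n", "dean of the")

# A single pattern removes carriage returns, newlines and periods in one pass;
# inside a character class the period is a literal dot
cleaned <- str_replace_all(occupation, "[\\r\\n.]", "")
cleaned
```

Inside `mutate()`, the same pattern would replace the three separate assignments to `occupation`. The result should be the same, since each of the original calls removes one of these characters independently of the others.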

The table holds information about 56303 notable deaths. I know the age of 97% of them and a country or nation for 96.2% of them. Not too bad, I think! It was then time to stop scraping webpages and to start digging into the data… Who were these people? And how bad was 2016?
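Coverage figures like these can be computed as the share of non-missing values in a column. Here is a minimal sketch on a made-up miniature table. The `deaths_example` data frame and the `share_known()` helper are hypothetical names, and the sketch assumes that unknown ages and countries are stored as `NA`, which may not match how the real `deaths` table encodes them.

```r
# Hypothetical miniature of the deaths table; all values are made up
deaths_example <- data.frame(
  name    = c("A", "B", "C", "D"),
  age     = c(69, NA, 82, 102),
  country = c("South Africa", "United States", NA, NA),
  stringsAsFactors = FALSE
)

# Number of rows, i.e. number of listed deaths
nrow(deaths_example)

# Percentage of entries with a known (non-NA) value, rounded to one decimal
share_known <- function(x) round(100 * mean(!is.na(x)), 1)

share_known(deaths_example$age)
share_known(deaths_example$country)
```

On the real data one would call `share_known(deaths$age)` and `share_known(deaths$country)`, provided missing values are indeed coded as `NA`.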

I’d like to end this post with a note from my husband, who thinks having a blog makes me an influencer. If you too like Wikipedia, consider donating to the Wikimedia Foundation.

To leave a comment for the author, please follow the link and comment on their blog: Maëlle.
