
I’ve recently begun reading through the book Modern Data Science with R, by Benjamin S. Baumer, Daniel T. Kaplan, and Nicholas J. Horton. It’s quite clear and informative. One of the things I especially appreciate about it is that the math isn’t too cumbersome. Even though my primary background isn’t in math or statistics, I’m able to follow along with the book quite easily.

As I’m reading through the book, I’m doing the exercises at the back of the chapters, and I recently worked through chapter 3, which covers ggplot2 basics. One of the exercises at the end of this chapter asks us to recreate this graphic from FiveThirtyEight. The goal of the exercise is to use ggplot2 to make production-quality graphics.

The project makes use of the babynames package, which contains public data on baby names from the Social Security Administration. We then use the make_babynames_dist() function from the mdsr package, which the authors developed, to add variables relevant to the goals of the exercise. Basically, it takes the data from the lifetables table in the babynames package, adds variables, and filters the result down to just the data relevant to 2014.

Truth be told, I was stumped by this exercise when I first read it. So I reached out to Nicholas Horton at Amherst, and he helped me with the basic scripting. From there I was able to tweak it to recreate what I was looking for.

So first we load the necessary libraries and inspect the dataset.

library(mdsr)
library(dplyr)
library(tidyr)
library(ggplot2)
library(ggthemes)
library(babynames)

babynames_dist <- make_babynames_dist()
head(babynames_dist)

# A tibble: 6 x 9
   year sex   name       n   prop alive_prob count_thousands age_today est_alive_today
1  1900 F     Mary   16707 0.0526          0           16.7        114               0
2  1900 F     Helen   6343 0.0200          0            6.34       114               0
3  1900 F     Anna    6114 0.0192          0            6.11       114               0
4  1900 F     Marga…  5304 0.0167          0            5.30       114               0
5  1900 F     Ruth    4765 0.0150          0            4.76       114               0
6  1900 F     Eliza…  4096 0.0129          0            4.10       114               0
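The derived columns in this output look like simple arithmetic on n, year, and alive_prob. Here is a minimal sketch of that arithmetic on a toy data frame. The formulas are my inference from the printed values, not a copy of what make_babynames_dist() does internally, and the second row is entirely made up for illustration.

```r
# Hypothetical toy rows: the first mirrors the Mary row above,
# the second (year, n, alive_prob) is invented for illustration
toy <- data.frame(
  year = c(1900, 1950),
  name = c("Mary", "Mary"),
  n = c(16707, 1000),
  alive_prob = c(0, 0.75)
)

# Assumed formulas, inferred from the head() output above
toy$count_thousands <- toy$n / 1000            # births in thousands
toy$age_today <- 2014 - toy$year               # age as of 2014
toy$est_alive_today <- toy$n * toy$alive_prob  # expected survivors

toy
```

If the inference is right, the first row reproduces the printed values for Mary: 16.7 thousand births, an age of 114, and no estimated survivors.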

What we need to do next is create new variables that give the estimated number of people with a given name who are likely still alive and the total number born since 1900; from those we can calculate the share who are likely dead. We keep only names given to more than 50,000 babies. Then (and this is what initially had me stumped) we need to select the top 10 male names and the top 10 female names by that share.

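deadest <- babynames_dist %>%
  filter(year >= 1900) %>%
  group_by(name, sex) %>%
  # One row per name and sex: years observed, estimated survivors, total births
  summarise(N = n(),
            total_est_alive_today = sum(est_alive_today),
            total = sum(n)) %>%
  mutate(percent_dead = 1 - (total_est_alive_today / total)) %>%
  # Drop rare names, then keep the 10 highest percent_dead values per sex
  filter(total > 50000) %>%
  group_by(sex) %>%
  top_n(10, percent_dead) %>%
  # Sort so the printed rows run from most to least dead within each sex
  arrange(sex, desc(percent_dead))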
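To make the "top names within each sex" step easier to see, here is a minimal sketch of the same idea on a made-up dataset. The names and counts are hypothetical. I use slice_max() here because recent versions of dplyr mark top_n() as superseded in favor of it; this is a swapped-in alternative, not what the pipeline above uses.

```r
library(dplyr)

# Hypothetical toy data: one row per name, sex, and birth year
toy <- tibble(
  name = c("Ada", "Ada", "Bea", "Bea", "Cal", "Cal", "Dan", "Dan"),
  sex = c("F", "F", "F", "F", "M", "M", "M", "M"),
  n = c(100, 100, 200, 200, 100, 300, 50, 50),
  est_alive_today = c(10, 30, 100, 140, 50, 150, 10, 20)
)

toy_deadest <- toy %>%
  group_by(name, sex) %>%
  # Collapse the yearly rows into one row per name and sex
  summarise(total_est_alive_today = sum(est_alive_today),
            total = sum(n),
            .groups = "drop") %>%
  mutate(percent_dead = 1 - total_est_alive_today / total) %>%
  # Regroup by sex so the ranking happens separately for F and M
  group_by(sex) %>%
  slice_max(percent_dead, n = 1) %>%
  ungroup()

toy_deadest
```

With these numbers, Ada (0.8) is the deadest female name and Dan (0.7) the deadest male name. The key point is the second group_by(sex): without it, the ranking would run across both sexes at once.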

The above gives us the following dataset, which we can then use to create the graphic:

# A tibble: 20 x 6
# Groups:   sex [2]
   name     sex       N total_est_alive_today  total percent_dead
 1 Mabel    F       111                20233.  96037        0.789
 2 Gertrude F       111                31360. 145693        0.785
 3 Myrtle   F        99                25491. 108941        0.766
 4 Blanche  F       111                16509.  69524        0.763
 5 Beulah   F       110                15642.  63361        0.753
 6 Opal     F       111                17471.  65821        0.735
 7 Florence F       111                77679. 284945        0.727
 8 Agnes    F       111                37593. 134940        0.721
 9 Viola    F       111                32957. 116666        0.718
10 Bessie   F       111                36824. 130155        0.717
11 Elmer    M       111                35548. 116830        0.696
12 Wilbur   M       111                17881.  54423        0.671
13 Homer    M       111                18809.  55639        0.662
14 Willard  M       111                28576.  74821        0.618
15 Hubert   M       111                21417.  55340        0.613
16 Chester  M       111                44995. 114370        0.607
17 Clarence M       111               113641. 280518        0.595
18 Herbert  M       111                88652. 217291        0.592
19 Harry    M       111               153501. 374524        0.590
20 Horace   M       111                20723.  50340        0.588

So then here is the code to create the actual data viz:

ggplot(deadest, aes(reorder(name, percent_dead), percent_dead, fill = sex)) +
  # geom_col() is shorthand for geom_bar(stat = "identity")
  geom_col() +
  coord_flip() +
  ggtitle("Deadest Names",
          subtitle = "Estimated % of Americans with a given name born since 1900\nwho were dead as of Jan. 1, 2014") +
  scale_x_discrete(NULL) +
  scale_y_continuous(NULL) +
  # Yellow for female names, blue for male names
  scale_fill_manual(values = c(F = "#f6b900", M = "#008fd5")) +
  theme_fivethirtyeight() +
  theme(axis.text.x = element_blank(),
        panel.grid = element_blank(),
        legend.position = "none")

I’ve used the fivethirtyeight theme from the ggthemes package, and apart from the footer that FiveThirtyEight uses, it looks pretty close.
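If you want to approximate the footer as well, one option is ggplot2's caption. Here is a minimal sketch, assuming a plain text caption is close enough. The caption wording is my own placeholder rather than FiveThirtyEight's actual footer, and the four-row data frame just copies values from the table above so the example runs on its own.

```r
library(ggplot2)

# Four rows copied from the deadest table, enough to illustrate the idea
deadest_small <- data.frame(
  name = c("Mabel", "Gertrude", "Elmer", "Wilbur"),
  sex = c("F", "F", "M", "M"),
  percent_dead = c(0.789, 0.785, 0.696, 0.671)
)

p <- ggplot(deadest_small,
            aes(reorder(name, percent_dead), percent_dead, fill = sex)) +
  geom_col() +
  coord_flip() +
  # Placeholder footer text (an assumption); swap in your own attribution
  labs(x = NULL, y = NULL, caption = "Source: Social Security Administration") +
  scale_fill_manual(values = c(F = "#f6b900", M = "#008fd5")) +
  # Left-align the caption so it reads like a footer
  theme(plot.caption = element_text(hjust = 0),
        legend.position = "none")

p
```

The same labs(caption = ...) and theme(plot.caption = ...) layers can be added to the full plot after theme_fivethirtyeight().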