Recreating (more) data visualizations from the book “Knowledge is Beautiful”: Part IV

Matt.0

4 years ago

[This article was first published on Stories by Matt.0 on Medium, and kindly contributed to R-bloggers].

Welcome to the last part of the series where I recreate data visualizations in R from the book Knowledge is Beautiful by David McCandless.

Links to parts I, II and III of the series can be found here.

Plane Crashes

This dataset will be used for a couple of visualizations.

The first visualization is a stacked bar plot showing the causes of every plane crash from 1993 to January 2017, excluding military, medical and private chartered flights.

library(dplyr)
library(ggplot2)
library(tidyr)
library(extrafont)  # makes system fonts such as Georgia available to ggplot2
df <- read.csv("worst_plane.csv")
# Drop the year each plane model entered service, then
# gather the wide data frame into a tidy (long) format
mini_df <- df %>% 
  select(-year_service) %>% 
  gather(key = cause, value = proportion, -plane)
# Fix the stacking order of the causes
mini_df$cause <- factor(mini_df$cause, levels = c("human_error", "weather", "mechanical", "unknown", "criminal"), ordered = TRUE)
# Keep the planes in the order they appear in the file
# (assumed to be sorted by the year they entered service)
plane_order <- as.vector(unique(mini_df$plane))
mini_df$plane <- factor(mini_df$plane, levels = plane_order)
ggplot(mini_df, aes(x = plane, y = proportion, fill = cause)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  # Reverse the order of a categorical axis
  scale_x_discrete(limits = rev(levels(mini_df$plane))) +
  # Select manual colors that McCandless used
  scale_fill_manual(values = c("#8E5A7E", "#A3BEC7", "#E1BD81", "#E9E4E0", "#74756F"),
                    labels = c("Human Error", "Weather", "Mechanical", "Unknown", "Criminal")) +
  labs(title = "Worst Planes", caption = "Source: bit.ly/KIB_PlaneCrashes") +
  scale_y_reverse() +
  theme(legend.position = "right",
        panel.background = element_blank(),
        plot.title = element_text(size = 13,
                                  family = "Georgia",
                                  face = "bold", lineheight = 1.2),
        plot.caption = element_text(size = 5,
                                    hjust = 0.99, family = "Georgia"),
        axis.text = element_text(family = "Georgia"),
        # Get rid of the x axis text/title
        axis.text.x = element_blank(),
        axis.title.x = element_blank(),
        # and y axis title
        axis.title.y = element_blank(),
        # and legend title
        legend.title = element_blank(),
        legend.text = element_text(family = "Georgia"),
        axis.ticks = element_blank())
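
If the gather step above feels opaque, here is a minimal sketch of what it does. The data frame and its numbers are made up for illustration (they are not taken from worst_plane.csv), and I keep only three of the five causes to stay short.

```r
library(tidyr)

# Hypothetical toy data: one row per plane, one column per cause (made-up values)
wide <- data.frame(
  plane = c("Plane A", "Plane B"),
  human_error = c(0.5, 0.2),
  weather = c(0.3, 0.1),
  mechanical = c(0.2, 0.7)
)

# gather() stacks every column except `plane` into key/value pairs,
# giving one row per plane/cause combination
tidy <- gather(wide, key = cause, value = proportion, -plane)

# An ordered factor fixes the order in which the bar segments are stacked
tidy$cause <- factor(tidy$cause,
                     levels = c("human_error", "weather", "mechanical"),
                     ordered = TRUE)

print(tidy)
```

In current versions of tidyr, pivot_longer(wide, -plane, names_to = "cause", values_to = "proportion") is the recommended replacement for gather(); it returns the same columns, only with the rows in a different order.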

The second visualization is an alluvial diagram, which we can build with the ggalluvial package. The original visualization by McCandless is much fancier than what this code produces, but it displays the same basic information.

library(ggalluvial)
crash <- read.csv("crashes_alluvial.csv")
# Wide format: each axis is a column and freq sets the width of each ribbon
# (recent ggalluvial releases use the y aesthetic for this instead of weight)
ggplot(crash, aes(y = freq,
                  axis1 = phase,
                  axis2 = cause,
                  axis3 = total_crashes)) +
  geom_alluvium(aes(fill = cause),
                width = 0, knot.pos = 0, reverse = FALSE) +
  # Hide the fill legend
  guides(fill = "none") +
  geom_stratum(width = 1/8, reverse = FALSE) +
  # Label each stratum (replaces the older label.strata = TRUE)
  geom_text(stat = "stratum", aes(label = after_stat(stratum)), reverse = FALSE, size = 2.5) +
  scale_x_discrete(limits = c("phase", "causes", "total crashes")) +
  coord_flip() +
  labs(title = "Crash Cause", caption = "Source: bit.ly/KIB_PlaneCrashes") +
  theme(panel.background = element_blank(),
        plot.title = element_text(size = 13,
                                  family = "Georgia",
                                  face = "bold",
                                  lineheight = 1.2,
                                  vjust = -3,
                                  hjust = 0.05),
        plot.caption = element_text(size = 5,
                                    hjust = 0.99, family = "Georgia"),
        axis.text = element_text(family = "Georgia"),
        axis.text.x = element_blank(),
        axis.ticks.x = element_blank(),
        axis.ticks.y = element_blank())
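
The plotting code assumes the CSV already holds one row per combination of axis values plus a freq column. If you were starting from one row per crash instead, something like the sketch below could build that table. The raw records and the name crash_toy are invented for illustration, and the total_crashes axis is left out, so treat this as a guess at the preparation step rather than how crashes_alluvial.csv was actually made.

```r
library(dplyr)

# Hypothetical raw records: one row per crash (values invented for illustration)
raw <- data.frame(
  phase = c("landing", "landing", "take-off", "en route", "landing"),
  cause = c("human_error", "weather", "mechanical", "human_error", "human_error")
)

# count() collapses identical phase/cause pairs and stores the tally in `freq`
crash_toy <- count(raw, phase, cause, name = "freq")

print(crash_toy)
```

Each row of crash_toy then becomes one ribbon in the diagram, with freq controlling its thickness.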

Gender Gap

This visualization depicts the salary gap between males and females by industry in the UK, along with the mean salary across the positions within each category. We can use group_by() and summarize_at() to compute the mean salary for each category and then use facet_wrap() to draw one panel per category. Since each position belongs to only one category, you need to set scales = "free_x" so that each panel shows only its own positions instead of empty slots for every other one.

gendergap <- read.csv("gendergap.csv")
# Gather the salary columns into sex/salary pairs (one row per title and sex)
tidy_gap <- gendergap %>% 
  gather(key = sex, value = salary, -title, -category)
# Mean salary per category, drawn below as a horizontal line in each panel
category_means <- tidy_gap %>% 
  group_by(category) %>%
  summarize_at(vars(salary), mean)
tidy_gap %>% ggplot(aes(x = title, y = salary, color = sex)) +
  # One panel per category, each showing only its own titles
  facet_wrap(~ category, nrow = 1, scales = "free_x") +
  # White line joining the two points for each title
  geom_line(color = "white") +
  geom_point() +
  scale_color_manual(values = c("#F49171", "#81C19C")) +
  geom_hline(data = category_means, aes(yintercept = salary), color = "white", alpha = 0.6, size = 1) +
  theme(legend.position = "none",
      panel.background = element_rect(color = "#242B47", fill = "#242B47"),
      plot.background = element_rect(color = "#242B47", fill = "#242B47"),
      axis.line = element_line(color = "grey48", size = 0.05, linetype = "dotted"),
      axis.text = element_text(family = "Georgia", color = "white"),
      axis.text.x = element_text(angle = 90),
      # Get rid of the y- and x-axis titles
      axis.title.y=element_blank(),
      axis.title.x=element_blank(),
      panel.grid.major.y = element_line(color = "grey48", size = 0.05),
      panel.grid.minor.y = element_blank(),
      panel.grid.major.x = element_blank(),
      strip.background = element_rect(color = "#242B47", fill = "#242B47"),
      strip.text = element_text(color = "white", family = "Georgia"))

One thing that I’m not sure how to handle is the spacing between the positions on the x-axis. Since each facet has a different number of positions, it would be nice if equal spacing along the x-axis could be requested as an option in facet_wrap(); however, I don’t think it’s possible (if you know a workaround please leave a comment!).
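
One possible workaround, sketched here on made-up data rather than gendergap.csv: facet_grid() (unlike facet_wrap()) accepts a space argument, and space = "free_x" makes each panel as wide as the number of x values it holds, which should give evenly spaced positions across panels. The data frame toy_gap and its values are hypothetical.

```r
library(ggplot2)

# Hypothetical toy data: two categories with different numbers of job titles
toy_gap <- data.frame(
  category = c("A", "A", "A", "A", "A", "A", "B", "B"),
  title    = c("t1", "t1", "t2", "t2", "t3", "t3", "t4", "t4"),
  sex      = rep(c("female", "male"), times = 4),
  salary   = c(30, 35, 40, 48, 50, 61, 28, 33)
)

p <- ggplot(toy_gap, aes(x = title, y = salary, color = sex)) +
  geom_point() +
  # scales = "free_x" keeps only each panel's own titles;
  # space = "free_x" sizes each panel by the number of titles it shows
  facet_grid(. ~ category, scales = "free_x", space = "free_x")

print(p)
```

Here panel A (three titles) comes out roughly three times as wide as panel B (one title). The trade-off is that facet_grid() lays panels out in a single row or column and cannot wrap them onto several rows the way facet_wrap() can.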

That’s all from me. It’s been fun doing this series and I hope you’ve enjoyed it!

Recreating (more) data visualizations from the book “Knowledge is Beautiful”: Part IV was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story.
