Recreating (more) data visualizations from the book “Knowledge is Beautiful”: Part IV

July 16, 2018

(This article was first published on Stories by Matt.0 on Medium, and kindly contributed to R-bloggers)

Welcome to the last part of the series where I recreate data visualizations in R from the book Knowledge is Beautiful by David McCandless.

Links to parts I, II and III of the series can be found here.

Plane Crashes

This dataset will be used for a couple of visualizations.

The first visualization is a stacked bar plot showing the causes of every plane crash from 1993 to January 2017, excluding military, medical and private charter flights.

library(dplyr)
library(ggplot2)
library(tidyr)
library(extrafont)

df <- read.csv("worst_plane.csv")

mini_df <- df %>%
  # Drop the year the plane model entered service
  select(-year_service) %>%
  # Gather the wide data frame into a tidy format
  gather(key = cause, value = proportion, -plane)

# Order by cause
mini_df$cause <- factor(mini_df$cause,
                        levels = c("human_error", "weather", "mechanical", "unknown", "criminal"),
                        ordered = TRUE)

# Create vector of plane names according to year they entered service
names <- unique(mini_df$plane)
names <- as.vector(names)

# Sort by factor
mini_df$plane <- factor(mini_df$plane, levels = names)

ggplot(mini_df, aes(x = plane, y = proportion, fill = cause)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  # Reverse the order of a categorical axis
  scale_x_discrete(limits = rev(levels(mini_df$plane))) +
  # Select manual colors that McCandless used
  scale_fill_manual(values = c("#8E5A7E", "#A3BEC7", "#E1BD81", "#E9E4E0", "#74756F"),
                    labels = c("Human Error", "Weather", "Mechanical", "Unknown", "Criminal")) +
  labs(title = "Worst Planes", caption = "Source: bit.ly/KIB_PlaneCrashes") +
  scale_y_reverse() +
  theme(legend.position = "right",
        panel.background = element_blank(),
        plot.title = element_text(size = 13,
                                  family = "Georgia",
                                  face = "bold", lineheight = 1.2),
        plot.caption = element_text(size = 5,
                                    hjust = 0.99, family = "Georgia"),
        axis.text = element_text(family = "Georgia"),
        # Get rid of the x axis text/title
        axis.text.x = element_blank(),
        axis.title.x = element_blank(),
        # and y axis title
        axis.title.y = element_blank(),
        # and legend title
        legend.title = element_blank(),
        legend.text = element_text(family = "Georgia"),
        axis.ticks = element_blank())
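To see what the reshaping and factor-ordering steps do, here is a minimal sketch on a toy data frame. The plane names and proportions are invented for illustration and are not taken from worst_plane.csv; only the column layout (one row per plane, one column per cause) is assumed to match.

```r
library(tidyr)

# Invented wide data: one row per plane, one column per cause
toy <- data.frame(
  plane       = c("Plane A", "Plane B"),
  human_error = c(0.6, 0.5),
  weather     = c(0.4, 0.5),
  stringsAsFactors = FALSE
)

# Wide -> long: each cause column becomes a (cause, proportion) pair per plane
toy_long <- gather(toy, key = cause, value = proportion, -plane)

# Set the stacking order explicitly instead of relying on alphabetical order
toy_long$cause <- factor(toy_long$cause, levels = c("human_error", "weather"))

toy_long
```

The long format is what lets ggplot2 map a single cause column to fill, and the factor levels control the order in which the segments are stacked.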

The second visualization is an alluvial diagram, which we can build with the ggalluvial package. I should mention that the original visualization by McCandless is much fancier than the one produced here, but this version displays the same basic information.

library(alluvial)
library(ggalluvial)

crash <- read.csv("crashes_alluvial.csv")

# stratum = cause, alluvium = freq
# y = freq sets the height of each flow (older ggalluvial releases used weight = freq)
ggplot(crash, aes(y = freq,
                  axis1 = phase,
                  axis2 = cause,
                  axis3 = total_crashes)) +
  geom_alluvium(aes(fill = cause),
                width = 0, knot.pos = 0, reverse = FALSE) +
  # Hide the fill legend
  guides(fill = "none") +
  geom_stratum(width = 1/8, reverse = FALSE) +
  # Label each stratum with its name (older releases used label.strata = TRUE)
  geom_text(stat = "stratum", aes(label = after_stat(stratum)), reverse = FALSE, size = 2.5) +
  scale_x_continuous(breaks = 1:3, labels = c("phase", "causes", "total crashes")) +
  coord_flip() +
  labs(title = "Crash Cause", caption = "Source: bit.ly/KIB_PlaneCrashes") +
  theme(panel.background = element_blank(),
        plot.title = element_text(size = 13,
                                  family = "Georgia",
                                  face = "bold",
                                  lineheight = 1.2,
                                  vjust = -3,
                                  hjust = 0.05),
        plot.caption = element_text(size = 5,
                                    hjust = 0.99, family = "Georgia"),
        axis.text = element_text(family = "Georgia"),
        axis.text.x = element_blank(),
        axis.ticks.x = element_blank(),
        axis.ticks.y = element_blank())
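The layout of crashes_alluvial.csv isn't shown in the post. Assuming it holds one row per combination of axis values with a count in freq, a table of that shape could be tallied from one-row-per-crash records roughly as follows. The records are invented for illustration, and the total_crashes column is left out of this sketch.

```r
library(dplyr)

# Hypothetical raw records, one row per crash (invented for illustration)
raw <- data.frame(
  phase = c("landing", "landing", "take-off", "en route"),
  cause = c("weather", "weather", "mechanical", "human_error"),
  stringsAsFactors = FALSE
)

# One row per phase/cause combination, with the number of crashes in `freq`
toy_crash <- count(raw, phase, cause, name = "freq")

toy_crash
```

Each row of a table like this becomes one flow in the alluvial diagram, with freq setting how thick the flow is.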

Gender Gap

This visualization depicts the salary gap between men and women by industry in the UK, along with the mean salary across the positions in each category. We can use group_by() and summarize_at() to compute the mean for each category and then split the plot by category with facet_wrap(). Because each position belongs to only one category, set scales = "free_x" so that each facet shows only its own positions instead of empty slots for all the others.

gendergap <- read.csv("gendergap.csv")

# Gather the dataset
tidy_gap <- gendergap %>%
  gather(key = sex, value = salary, -title, -category)

# Mean salary per category
category_means <- tidy_gap %>%
  group_by(category) %>%
  summarize_at(vars(salary), mean)

tidy_gap %>%
  ggplot(aes(x = title, y = salary, color = sex)) +
  facet_wrap(~ category, nrow = 1, scales = "free_x") +
  geom_line(color = "white") +
  geom_point() +
  scale_color_manual(values = c("#F49171", "#81C19C")) +
  geom_hline(data = category_means, aes(yintercept = salary), color = "white", alpha = 0.6, size = 1) +
  theme(legend.position = "none",
        panel.background = element_rect(color = "#242B47", fill = "#242B47"),
        plot.background = element_rect(color = "#242B47", fill = "#242B47"),
        axis.line = element_line(color = "grey48", size = 0.05, linetype = "dotted"),
        axis.text = element_text(family = "Georgia", color = "white"),
        axis.text.x = element_text(angle = 90),
        # Get rid of the y- and x-axis titles
        axis.title.y = element_blank(),
        axis.title.x = element_blank(),
        panel.grid.major.y = element_line(color = "grey48", size = 0.05),
        panel.grid.minor.y = element_blank(),
        panel.grid.major.x = element_blank(),
        strip.background = element_rect(color = "#242B47", fill = "#242B47"),
        strip.text = element_text(color = "white", family = "Georgia"))
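As a quick check of what the category means contain, here is a small sketch on invented salaries (the job titles, categories and numbers are made up, not taken from gendergap.csv). Note that the mean pools both sexes and every position in the category.

```r
library(dplyr)

# Invented long-format data: two positions in "Cat 1", one in "Cat 2"
toy_gap <- data.frame(
  title    = c("Job A", "Job A", "Job B", "Job B", "Job C", "Job C"),
  category = c("Cat 1", "Cat 1", "Cat 1", "Cat 1", "Cat 2", "Cat 2"),
  sex      = rep(c("female", "male"), times = 3),
  salary   = c(20000, 24000, 30000, 34000, 40000, 50000),
  stringsAsFactors = FALSE
)

# One row per category: the mean of all salaries in that category
toy_means <- toy_gap %>%
  group_by(category) %>%
  summarize_at(vars(salary), mean)

toy_means
```

Because the summary keeps the category column, passing it as data to geom_hline() draws each mean line only in its own facet.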

One thing that I’m not sure how to handle is the spacing between the positions on the x-axis. Because each facet has a different number of positions, it would be nice to be able to request equal spacing along the x-axis as an option in facet_wrap(); however, I don’t think it’s possible (if you know a workaround, please leave a comment!).
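One possible workaround, which is not from the original post, is to switch from facet_wrap() to facet_grid(). facet_grid() has a space argument, and space = "free_x" makes each panel's width proportional to the length of its x scale, so positions end up evenly spaced across panels. The sketch below uses invented data to show the idea.

```r
library(ggplot2)

# Invented data: the first category has three positions, the second only one
toy <- data.frame(
  title    = c("Job A", "Job B", "Job C", "Job D"),
  category = c("Cat 1", "Cat 1", "Cat 1", "Cat 2"),
  salary   = c(20000, 30000, 25000, 40000),
  stringsAsFactors = FALSE
)

p <- ggplot(toy, aes(x = title, y = salary)) +
  geom_point() +
  # scales = "free_x" drops unused positions from each panel;
  # space = "free_x" sizes each panel by the number of positions it holds
  facet_grid(. ~ category, scales = "free_x", space = "free_x")

# print(p) to draw the plot
```

With `. ~ category` the panels sit in a single row, much like nrow = 1 in facet_wrap(), so the rest of the plot code should carry over largely unchanged.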

That’s all from me. It’s been fun doing this series, and I hope you’ve enjoyed it!


Recreating (more) data visualizations from the book “Knowledge is Beautiful”: Part IV was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story.
