Recreating data visualizations from the book “Knowledge is Beautiful”

June 6, 2018
By

(This article was first published on Stories by Matt.0 on Medium, and kindly contributed to R-bloggers)

n this series of posts I will set out to recreate some of the visualizations from the book “Knowledge is Beautiful” by David McCandless in R.

David McCandless is author of two bestselling infographics books and gave a TED talk about data visualization. I bought his second book “Knowledge is Beautiful”, in 2015, which contains 196 beautiful infographics.

At that time, I was really into The Walking Dead, and his book inspired me to make my own infographic:

Recently, I was trying to think of some fun data visualization projects and decided to choose a couple from the book that could be recreated as close to possible in R.

The book is an excellent resource for those who like these sorts of exercises as every single visualization in the book is paired with an online dataset to explore at your interest!!!. I never knew how rich the datasets were until I tried to recreate my first visualization, “Best in Show”. The dataset for Best in Show alone, is an excel file with eight sheets!

McCandles says the whole book took him 15,832 hours over two-years and I don’t doubt it. Whipping up a quick EDA plotcan be fast and simple for a meeting if you only care about actionable results. However, if your publishing something it takes some time to create a stunning visual.

Best in Show

Best in Show is a scatter-plot of dog silhouettes, color-coded based on the category of dog, sized accordingly, and pointing either left-or-right, depending on their intelligence.

Let’s load the environment; I like the needs() package which makes it simple to install & load packages into R.

knitr::opts_chunk$set(echo = TRUE)
if (!require("needs")) {
install.packages("needs", dependencies = TRUE)
library(needs)
}

As mentioned above, the data is an excel file with eight sheets. We can read in the Excel file with read_excel() and specify the sheet we would like to access with the sheet = argument.

# install&Load readxl package
needs(here,
readxl,
stringr,
dplyr, # to use select()
magrittr) # to use %<>% operators

path = here()

# Specify the worksheet by name
dog_excel <- read_excel("~/bestinshow.xlsx", sheet = "Best in show full sheet", range = cell_rows(3:91))

# Pick the desired variables
dog_excel %<>%
select(1,3,5:6, "intelligence category", "size category")

# Rename the columns
colnames(dog_excel) <- c("breed", "category", "score", "popularity", "intelligence", "size")

# Remove first row (non-descript column names)
dog_excel <- dog_excel[-1,]

Create a HTML table (supports filtering, pagination, and sorting. This only works in RMarkdown not on Medium so you need to check out the github.io https://moldach.github.io/dataRbeautiful/index.html

needs(DT)
datatable(dog_excel, options = list(pageLength = 5))

Take a look at the intelligence levels in the dataset.

# What are the intelligence levels?
unique(dog_excel$intelligence)
## [1] "Brightest" "Above average" "Excellent" "Average"
## [5] "Fair" "Lowest"

McCandless’ classified dogs intelligence as either “dumb” or “clever”. But here we see there are categories in this clean dataset. So let’s assign the top three factors as clever and the other three as dumb

dog_excel$intelligence %<>% 
str_replace(pattern = "Brightest", replace = "clever", .) %>%
str_replace(pattern = "Above average", replace = "clever", .) %>%
str_replace(pattern = "Excellent", replace = "clever", .) %>%
str_replace(pattern = "Average", replace = "dumb", .) %>%
str_replace(pattern = "Fair", replace = "dumb", .) %>%
str_replace(pattern = "Lowest", replace = "dumb", .)

There are 87 dogs in the original visualization. I found 24 silhouettes under an Creative Commons Attribution-Share Alike 4.0 License on SuperColoring.com. This means I can freely copy and redistribute in any medium or format as long as I give a link to the webpage and indicate the author’s name and the license. I’ve included that information in a .csv. and we will use this to subset McCandless’s data (both datasets can be found in my Github repo).

needs(readr)
dog_silhouettes <- read_csv("~/dog_silhouettes.csv")

dog_df <- dog_excel %>%
inner_join(dog_silhouettes, by = "breed")

# change popularity from string into numeric values
dog_df$popularity <- as.numeric(dog_df$popularity)

We need to use the magick package to make the white background surrounding the .svg silhouettes transparent, scale images to a common size (by width), and save them as .png’s.

needs(magick)

file.names <- dir(path, pattern = ".svg")

for (file in file.names){
# read in the file
img <- image_read_svg(file)
# scale all images to a common scale
img_scaled <- image_scale(img, 700)
# make the background transparent
img_trans <- image_transparent(img_scaled, 'white')
# get rid of .svg ending
file_name <- str_replace(file, pattern = ".svg", replace = "")
# write out the file as a .png
image_write(img_trans, paste0(file_name, ".png"))
}

Some of the dog silhouette’s are pointing their heads in the opposite direction. We need to use image_flop()from the magick package so that all are facing the same direction. Later we can subset the data frame by intelligence so clever and dumb dogs face opposite directions.

path = here()

flop.images <- c("~/labrador-retriever-black-silhouette.png", "~/border-terrier-black-silhouette.png", "~/boxer-black-silhouette.png", "~/french-bulldog-black-silhouette.png", "~/german-shepherd-black-silhouette.png", "~/golden-retriever-black-silhouette.png", "~/greyhound-black-silhouette.png", "~/rottweiler-black-silhouette.png")

for(i in flop.images){
i <- str_replace(i, pattern = "~/", replace = "")
img <- image_read(i)
img_flop <- image_flop(img)
image_write(img_flop, i)
}

The next step is to color the dogs according to category.

Let’s write a function that can be given three arguments: 1) df a data frame, 2) category the type of dog it is, 3) color a specific color for each category.

The first step is to select only those dogs which belong to the category of interest. This can be done with filter() from the dplyr package. However, if you want to use dplyr functions, like filter(), you need to follow the instruction at this website; set with enquo.

img_color <- function(df, category, color){
# Make filter a quosure
category = enquo(category)
# subset df on category
new_df <- df %>%
filter(category == !!category)
# get directory paths of images for the for loop
category_names <- new_df$breed %>%
tolower() %>%
str_replace_all(" ", "-") %>%
paste0("-black-silhouette.png")
for(name in category_names){
img <- image_read(name)
img_color <- image_fill(img, color, "+250+250")
image_write(img_color, name)
}
}

I wanted to replicate the figures as close as possible within R so to replicate the colors of the visualizations I scanned the book, saved the images, and then used this tool to get the html color codes.

Now let’s Use the created function above to color dog silhouettes according to their category.

# Herding color "#D59E7B"
img_color(dog_df, "herding", "#D59E7B")

# Hound color "#5E4D6C"
img_color(dog_df, "hound", "#5E4D6C")

# Non-sporting color "#6FA86C"
img_color(dog_df, "non-sporting", "#6FA86C")

# Sporting color "#B04946"
img_color(dog_df, "sporting", "#B04946")

# Terrier color "#A98B2D"
img_color(dog_df, "terrier", "#A98B2D")

# Toy color "#330000"
img_color(dog_df, "toy", "#330000")

# Working color "#415C55"
img_color(dog_df, "working", "#415C55")

Okay it’s finally time to make the visualization.

Usually one plots points with geom_point() from ggplot2 but in this case I want images for each of the breed’s instead. We can use the ggimage package, and with a little tweaking we can flop images based on intelligence. ggimage does not support color as an aesthetic like ggplot2 which is why I manually assigned colors & sizes earlier.

Since popularity scores range from 1 to 140 with 1 being the most popular we will need to reverse the y-axis with scale_y_reverse().

needs(ggplot2,
ggimage)
# add "~/" to get filenames of images for plotting
dog_df$name <- paste0("~/", dog_df$name)

# create a ggplot/ggimage object
p <- ggplot(subset(dog_df, intelligence == "clever"), aes(x = score, y = popularity, image = name), alpha = 0.5) + geom_image(image_fun = image_flop) + geom_image(data=subset(dog_df, intelligence == "dumb")) +
labs(title = "Best in Show", subtitle = "The ultimate datadog", caption = "Source: bit.ly/KIB_BestDogs") +
labs(x = NULL, y = NULL) +
theme(panel.background = element_blank(),
legend.position = "top",
legend.box = "horizontal",
plot.title = element_text(size = 13,
# I'm not sure what font he chose so I'll pick something I think looks similar
family = "AvantGarde",
face = "bold",
lineheight = 1.2),
plot.subtitle = element_text(size = 10,
family = "AvantGarde"),
plot.caption = element_text(size = 5,
hjust = 0.99),
axis.text = element_blank(),
axis.ticks = element_blank()) +
scale_y_reverse()

The final step is to add text annotations underneath the dog breeds. Since I subset dog_df by intelligence if I try to annotate with geom_text() it will only annotate part of the data. We will need to the annotate() function instea since geome are not mapped from variables of a data frame, but are instead passed in as vectors.

# Add annotations
p + annotate("text", x=dog_df$score[1:24], y=((dog_df$popularity[1:24])+6), label = dog_df$breed[1:24], size = 2.0)

The visualization looks similar to the original and highlights most of the aesthetics that were included. One could always add the additional details in Adobe Illustrator or Inkscape to make it look more like the final visualization

Stay tuned for Part II!


Recreating data visualizations from the book “Knowledge is Beautiful” was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story.

To leave a comment for the author, please follow the link and comment on their blog: Stories by Matt.0 on Medium.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...



If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Comments are closed.

Search R-bloggers

Sponsors

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)