Digging up embedded plots

May 7, 2015

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

by Joseph Rickert

The following multi-panel graph, which graces the cover of the most recent issue of the Journal of Computational and Graphical Statistics (JCGS; Vol. 24, No. 1, March 2015), is from the paper by Grolemund and Wickham entitled Visualizing Complex Data With Embedded Plots. The four plots are noteworthy for a couple of reasons:

  1. They present a superb example of how an embedded plot, with its additional set of axes, can pack more information into the same area required for a traditional scatter plot or heatmap.
  2. They provide clear and prominent testimony to the dreadful toll of civilian casualties from the war in Afghanistan.

[Figure: the four plots of Afghan War casualty data from the JCGS cover]

Each plot provides a different view of casualty data collected by the U.S. military between 2004 and 2010 and made available by the WikiLeaks organization. Among other variables, the dataset contains longitude and latitude coordinates and casualty statistics for more than 76,000 events. Casualty counts are recorded for four groups: civilian, enemy, Afghan police and coalition forces. The first plot is a simple scatter plot that suffers from severe overplotting, which obscures the patterns in the data. The second plot, a heatmap, does show how the number of casualties varies by geography but provides no information as to how casualties are distributed among the various groups. The third and fourth plots are embedded plots which show, respectively, marginal and conditional distribution summaries of the data for different locations.

The enlarged version below of the plot in the lower right corner makes it clear that there are locations where civilian casualties dominate. For example, consider the bar plot in the box on the seventh row from the top and the second column from the left, which covers the region around Herat, a city of approximately 435,000 residents: there, civilian casualties appear to exceed the sum of all others.

[Figure: enlarged view of the lower right plot, casualty type by location (conditional distribution)]

 

Regarding the second point above: I think it was courageous of both the authors and the editors of the journal to call attention to the human tragedy of the Afghan War at a time (2013) when the United States was still heavily invested with "boots on the ground" and the war was generating considerable controversy.

One more mundane reason that I enjoy reading the JCGS is that it is apparently their policy to encourage authors to provide "supplementary materials" including code and data sets where feasible. Many times the supplementary material includes R code, and as you might expect, this was the case with the paper by Grolemund and Wickham. They provide R code for all of their examples as well as the data sets.

I was surprised, however, that my attempt to recreate the cover plot from the code provided (see below) turned out to be a small exercise in reproducible research. Running the code with a recent version of R will most likely generate the error:

Error in layout_base(data, vars, drop = drop) : 
At least one layer must contain all variables used for facetting

Things change, including R. In the sixteen months or so it took for the paper to be published, the code provided with the paper became incompatible with more recent releases of R. See the discussion on GitHub.

Fortunately, R is fairly robust when it comes to reproducing past research. To generate the cover graph I downloaded the Windows binaries for R 3.0.2 and used the checkpoint function, checkpoint("2014-09-18"), to download an internally consistent set of the packages required by the R scripts. (Note that the MRAN archive used by checkpoint goes back to 2014-09-17.) The final step was to use some clever code from Cookbook for R to get all of the plots in a single graph.
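
For readers who want to try something similar, the two steps might look roughly like the sketch below. It is not the script used for this post: it assumes the ggplot2 package is installed, the checkpoint call is left commented out because it downloads packages, the helper name print_in_grid is hypothetical, and the layout code follows the general grid-viewport approach rather than reproducing the Cookbook for R function verbatim.

```r
library(grid)
library(ggplot2)

# Assumption: run under R 3.0.2 inside the project directory to pin package versions.
# Commented out here because it downloads packages from the snapshot archive.
# library(checkpoint)
# checkpoint("2014-09-18")

# Hypothetical helper: print a list of ggplot objects on an nrow x ncol grid,
# filling the grid row by row
print_in_grid <- function(plots, nrow, ncol) {
  stopifnot(length(plots) <= nrow * ncol)
  grid.newpage()
  pushViewport(viewport(layout = grid.layout(nrow, ncol)))
  for (i in seq_along(plots)) {
    row <- (i - 1) %/% ncol + 1
    col <- (i - 1) %% ncol + 1
    print(plots[[i]], vp = viewport(layout.pos.row = row, layout.pos.col = col))
  }
  popViewport()
  invisible(NULL)
}

# Usage with four placeholder plots built from a built-in data set
p <- ggplot(mtcars, aes(wt, mpg)) + geom_point()
print_in_grid(list(p, p, p, p), nrow = 2, ncol = 2)
```

In practice the four placeholder plots would be replaced by the four ggplot objects built in the listing at the end of this post.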

My take is that even if it involves a bit of digital archaeology, it is well worth the effort to explore embedded subplots. This form of visualization has been percolating for some time. (Grolemund and Wickham trace it back to the 1862 work of Charles Minard.) As the authors point out, embedded subplots are not always appropriate. There is a danger that they could easily lead to a visual complexity that would make them completely uninterpretable. Nevertheless, when they do work, embedded subplots can be spectacularly informative.

The following code, abstracted from the supplementary materials at the link above, will produce the plots in the cover graphic.

# load and clean data that appears in the figures
 
library(reshape2)
library(plyr)
library(maps)
library(ggplot2)
library(ggsubplot)
library(RColorBrewer)  # provides brewer.pal(), used for the colour scales below
 
# getbox by Heike Hofmann, trims map polygons for figure backgrounds
# https://github.com/ggobi/paper-climate/blob/master/code/maps.r
getbox <- function (map, xlim, ylim) {
  # identify all regions involved
  small <- subset(map, (long > xlim[1]) & (long < xlim[2]) & (lat > ylim[1]) & (lat < ylim[2]))
  regions <- unique(small$region)
  small <- subset(map, region %in% regions)  
 
  # now shrink all nodes back to the bounding box
  small$long <- pmax(small$long, xlim[1])
  small$long <- pmin(small$long, xlim[2])
  small$lat <- pmax(small$lat, ylim[1])
  small$lat <- pmin(small$lat, ylim[2])
 
  # Remove slivers (polygons collapsed to near-zero width or height)
  small <- ddply(small, "group", function(df) {
    if (diff(range(df$long)) < 1e-6) return(NULL)
    if (diff(range(df$lat)) < 1e-6) return(NULL)
    df
  })
 
  small
}
 
 
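## Afghanistan for Figures 2 and 3
# Assumption: `world` is the world map data frame; this excerpt omits its definition
world <- map_data("world")
afghanistan <- getbox(world, c(60,75), c(28, 39))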
map_afghan <- list(
  geom_polygon(aes(long, lat, group = group), data = afghanistan, 
    fill = "grey80", colour = "white", inherit.aes = FALSE, 
    show_guide = FALSE),
  scale_x_continuous("", breaks = NULL, expand = c(0.02, 0)),
  scale_y_continuous("", breaks = NULL, expand = c(0.02, 0)))
 
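## Mexico and lower US for Figure 4
# Note: `both` is not defined in this excerpt; it must be created
# (see the supplementary materials) before running this block
north_america <- getbox(both, xlim = c(-107.5, -80), ylim = c(11, 37.5))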
map_north <- list(
  geom_polygon(aes(long, lat, group = group), data = north_america, fill = "grey80", 
    colour = "grey70", inherit.aes = FALSE, show_guide = FALSE),
  scale_x_continuous("", breaks = NULL, expand = c(0.02, 0)),
  scale_y_continuous("", breaks = NULL, expand = c(0.02, 0))) 
 
###############################################################
###                wikileaks Afghan War Diary               ###
###############################################################
 
# casualties data set loaded with ggsubplot and used as is in figure 2
# regional casualty data included as a supplemental file to paper
# how about casualties over time in different parts of the country?
load("casualties-by-region.RData")
 
###############################################################
###                       Figure 2                          ###
###############################################################
 
# Figure 2.a. raw Afghanistan casualty data
ggplot(casualties) + 
  map_afghan +
  geom_point(aes(lon, lat, color = victim), size = 1.75) +
  ggtitle("location of casualties by type") + 
  coord_map() +
  scale_colour_manual(values = rev(brewer.pal(5,"Blues"))[1:4])
ggsave("afgpoints.pdf", width = 7, height = 7)
 
 
 
# Figure 2.b. Afghanistan casualty heat map
ggplot(casualties) + 
  map_afghan +
  geom_bin2d(aes(lon, lat), bins = 15) +
  ggtitle("number of casualties by location") +
  scale_fill_continuous(guide = guide_legend()) +
  coord_map()
ggsave("afgtile.pdf", width = 7, height = 7)
 
 
 
# Figure 2.c. Afghanistan casualty embedded bar graphs (marginal distributions)
ggplot(casualties) + 
  map_afghan +
  geom_subplot2d(aes(lon, lat, 
    subplot = geom_bar(aes(victim, ..count.., fill = victim), 
      color = rev(brewer.pal(5,"Blues"))[1], size = 1/4)), bins = c(15,12), 
      ref = NULL, width = rel(0.8), height = rel(1)) + 
  ggtitle("casualty type by location\n(Marginal distribution)") + 
  coord_map() +
  scale_fill_manual(values = rev(brewer.pal(5,"Blues"))[c(1,4,2,3)]) 
ggsave("casualties.pdf", width = 7, height = 7)
 
 
 
# Figure 2.d. Afghanistan casualty embedded bar graphs (conditional distributions)
ggplot(casualties) + 
  map_afghan +
  geom_subplot2d(aes(lon, lat,
    subplot = geom_bar(aes(victim, ..count.., fill = victim), 
      color = rev(brewer.pal(5,"Blues"))[1], size = 1/4)), bins = c(15,12), 
      ref = ref_box(fill = NA, color = rev(brewer.pal(5,"Blues"))[1]), width = rel(0.7), height = rel(0.6), y_scale = free) + 
  ggtitle("casualty type by location\n(Conditional distribution)") +
  coord_map() +
  scale_fill_manual(values = rev(brewer.pal(5,"Blues"))[c(1,4,2,3)]) 
 
ggsave("casualties2.pdf", width = 7, height = 7)
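
The bounding-box step inside getbox may be easier to follow in isolation. The sketch below uses made-up coordinates (not values from the casualty data) together with the Afghanistan limits from the listing, and shows how pmax and pmin pull points that fall outside the box back onto its edges.

```r
# Toy longitude/latitude vectors (illustrative values only), some outside the box
long <- c(58, 62, 70, 77)
lat  <- c(27, 30, 38, 40)
xlim <- c(60, 75)
ylim <- c(28, 39)

# pmax() raises values below the lower limit; pmin() lowers values above the upper limit
long_clamped <- pmin(pmax(long, xlim[1]), xlim[2])
lat_clamped  <- pmin(pmax(lat, ylim[1]), ylim[2])

print(long_clamped)  # 60 62 70 75
print(lat_clamped)   # 28 30 38 39
```

Points already inside the box are left unchanged, which is why getbox only needs a separate step to drop polygons that have been flattened into slivers.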

 
