Today's guest post comes from Garrett Grolemund, a software developer at RStudio — ed.
I think of graphs as a type of visual summary for data. Yet I rarely see graphs used this way within visualizations. Consider tile plots. They group data into 2d bins and then summarize each group with a number. This approach is a go-to tool for understanding overplotted data, but it discards a lot of information. Since we’re already using graphs, why not summarize the data in each bin visually? In the same space that we devote to a single colored tile, we can draw a subplot that retains enough information to display interesting patterns. Take, for example, this visualization of the WikiLeaks Afghanistan War Diary. It replaces each tile with a bar graph that shows the number of casualties by type for the specified region. We still get a sense of where the highest frequencies of casualties occur, but we can also see trends. For example, civilian casualties outnumber combatant casualties in the capital city of Kabul.
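To make the contrast concrete, here is a minimal sketch in base R. It uses simulated data (the variables `x`, `y`, and `type` are invented for illustration and are not taken from the War Diary). It bins points into a 2d grid, then computes both the single number a tile plot would show and the per-bin breakdown by category that a bar graph subplot would retain.

```r
# Simulated data: positions plus a categorical variable (all hypothetical).
set.seed(1)
n <- 1000
x <- runif(n)
y <- runif(n)
type <- sample(c("a", "b", "c"), n, replace = TRUE)

# Assign each point to a 2d bin on a 4 x 3 grid.
x_bin <- cut(x, breaks = seq(0, 1, length.out = 5), include.lowest = TRUE)
y_bin <- cut(y, breaks = seq(0, 1, length.out = 4), include.lowest = TRUE)

# Tile plot summary: one number (a count) per bin.
tile_counts <- table(x_bin, y_bin)

# Embedded bar graph summary: a count per category within each bin.
subplot_counts <- table(x_bin, y_bin, type)

# Summing over the category dimension recovers the tile counts,
# so the subplot summary contains the tile summary and more.
collapsed <- apply(subplot_counts, c(1, 2), sum)
```

The tile summary is a 4 x 3 table, while the subplot summary is a 4 x 3 x 3 array: the extra dimension is the information a colored tile would discard.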
I refer to this type of chart as an embedded plot, because it embeds subplots into a larger plot. Embedded subplots have been around for a long time. Charles Minard was embedding pie graphs into maps of France as early as 1862. Glyph plots, faceted plots, and other exotic graphs also rely on principles of embedded plots. However, embedded subplots seem to be under-utilized when we consider how useful they can be.
First, embedded plots can make patterns clear in the presence of overplotting. The diamonds data set in ggplot2 contains 53,940 observations. If we try to explore the data with a scatterplot, points occlude each other and hide patterns. Binning with embedded plots makes patterns visible, and this would be true even if the data contained 100,000, a million, or even a trillion points.
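One way to see why binning scales is that the number of subplots drawn is bounded by the grid, not by the number of observations. The sketch below illustrates this in base R with simulated points. The helper `occupied_bins()` and the 10 x 10 grid are assumptions chosen for illustration; they are not part of ggplot2 or ggsubplot.

```r
# Count how many cells of an nx-by-ny grid contain at least one point.
# occupied_bins() is a hypothetical helper written for this example.
occupied_bins <- function(x, y, nx = 10, ny = 10) {
  x_bin <- cut(x, breaks = nx)  # nx equal-width intervals over range(x)
  y_bin <- cut(y, breaks = ny)  # ny equal-width intervals over range(y)
  sum(table(x_bin, y_bin) > 0)
}

set.seed(2)
sizes <- c(1e3, 1e5, 1e6)
drawn <- sapply(sizes, function(n) occupied_bins(rnorm(n), rnorm(n)))

# A scatterplot would draw `sizes` marks; a binned plot never draws more
# than nx * ny = 100 subplots, however large n becomes.
data.frame(points = sizes, subplots = drawn)
```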
Embedded plots are also useful for displaying spatio-temporal data, as in this illustration of daily temperatures in the western hemisphere from 1995-201. Because embedded plots provide additional axes, we can plot longitude (x), latitude (y), and time (theta) and still have graphical power left over for variables of interest. Here daily temperature is mapped to r and the mean temperature for each region is mapped to the fill color.
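The mapping described above can be sketched as coordinate arithmetic: each observation's position on the page is the subplot's center (longitude, latitude) plus a small polar offset determined by time (theta) and temperature (r). This is only an illustration in base R. The function name `embed_polar()`, the yearly cycle, the `size` constant, and the sample temperatures are all assumptions, not the code or data behind the figure.

```r
# Place observations inside a polar subplot centred at (lon, lat).
# time_frac: fraction of the cycle elapsed, in [0, 1) -> angle theta
# temp_frac: temperature rescaled to [0, 1]           -> radius r
# size:      subplot radius in degrees of lon/lat (hypothetical constant)
embed_polar <- function(lon, lat, time_frac, temp_frac, size = 2) {
  theta <- 2 * pi * time_frac
  r <- size * temp_frac
  data.frame(x = lon + r * cos(theta), y = lat + r * sin(theta))
}

# One year of made-up daily temperatures for a single grid cell.
day <- 0:364
temp <- 15 + 10 * sin(2 * pi * (day - 100) / 365)
temp_frac <- (temp - min(temp)) / (max(temp) - min(temp))

# Each row of `glyph` is one day's point within the subplot.
glyph <- embed_polar(lon = -100, lat = 40,
                     time_frac = day / 365, temp_frac = temp_frac)
```

Because x and y are used up only for the subplot's center, theta and r remain free to carry time and temperature, which is the "graphical power left over" the paragraph refers to.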
Embedded plots can also show multidimensional relationships and interaction effects. The same subplots created above can be reorganized on new axes to show the relationship of seasonality to maximum and minimum temperatures. Surprisingly, the hottest places in the western hemisphere are not those near the equator.
Embedded plots may be under-used because they are difficult to make. Some programs, like Gauguin or the lattice and ggplot2 packages in R, can make one or two specific types of embedded plots, but this doesn’t leave much flexibility when exploring a complicated data set.
Things do not have to be this way. Embedded plots fit into the grammar of graphics quite nicely if we recognize that geoms are (very simple) subplots and subplots are (somewhat sophisticated) geoms. This realization creates some tantalizing insights about graphics. For example, graphs are hierarchical, or recursive. Also, facets are a type of geom (subplots) plotted against two categorical variables.
The ggsubplot package extends ggplot2 to allow subplots to be used as a geom. Each of the graphs above was made with ggsubplot, and the ggsubplot syntax closely follows that of ggplot2. For example, the graph of Afghanistan above is made with the following code (plus regular ggplot2 methods for tweaking color palettes and appearance):
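    # Load ggplot2 and the ggsubplot extension described above.
    library(ggplot2)
    library(ggsubplot)
    # Bin casualties on a 15 x 12 grid and draw a bar graph in each bin.
    ggplot(casualties, aes(lon, lat)) +
      map_afghanistan +
      geom_subplot2d(
        aes(subplot = geom_bar(aes(victim, ..count.., fill = victim))),
        bins = c(15, 12), ref = NULL, width = rel(0.8), height = rel(1)
      )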
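For comparison, the idea that facets are subplots plotted against two categorical variables can be seen in ggplot2's built-in faceting. The sketch below uses the `mpg` data set that ships with ggplot2 rather than any data from this post, and the choice of variables is arbitrary.

```r
library(ggplot2)

# Each panel is a scatterplot (the "subplot") positioned by two
# categorical variables: drive type on rows, cylinder count on columns.
p <- ggplot(mpg, aes(displ, hwy)) +
  geom_point() +
  facet_grid(drv ~ cyl)

p
```

Faceting fixes the outer axes to categorical variables and a regular grid. Treating the subplot as a geom, as in the ggsubplot call above, lets it be positioned by continuous variables such as longitude and latitude instead.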
Now for a word of caution: I would not recommend using embedded plots when a simpler graph would suffice. They can be hard to interpret. But when embedded plots are necessary, use them with confidence. They do not violate any data-to-ink ratio; embedded plots increase data in proportion to ink. And they organize multiple levels of information in an admirably intuitive way. Embedded plots take a little longer to comprehend than simpler graphs, but they also contain more data to be comprehended. Once a viewer has processed all of the relevant information, embedded plots display patterns with the same "interocular impact" that Tukey prized in simpler graphs. In return for a little patience, embedded plots make it easy to see relationships that would be difficult or impossible to perceive otherwise.
Garrett Grolemund has recently left academia to develop software and course content for RStudio. With his dissertation adviser Hadley Wickham, he has worked to refine and promote R, an open-source computer language used for statistical computing and graphics. Grolemund’s research focuses on data analysis, statistical computation, statistics education and visualization. With Wickham, he co-authored the lubridate R package which provides methods to parse, manipulate, and do arithmetic with date-times. Grolemund earned a B.A. in psychology and a master’s degree in statistics, both in 2003, from Harvard University and a PhD in statistics from Rice University in 2012. He spent a year as a teaching fellow at Harvard University, another year as a clinical trials coordinator at Massachusetts General Hospital and, before coming to Rice, a year as a researcher at the UCLA School of Law Library. At Rice he has taught such classes as Statistics 405: “Introduction to Data Analysis,” and “Visualization in R with ggplot2.”