Today's guest post comes to us from Andrew Winterman, Data Designer at data visualization company Periscopic. He shares with us the process of using the R language and other tools to create an interactive data application for a client. (ed.)
The Hewlett Foundation contacted us a few months ago because they were interested in exploring ways to visualize the distribution and impact of their grantmaking efforts over the last ten years. They hoped to make a tool with three functions: It would provide insight into where the Foundation has made the largest impact; provide grant seekers context for their applications; and help the Foundation’s officers make decisions about new grantmaking efforts, based on their existing portfolio. They had one request: No maps.
The data arrived, as it so often does, in the rough: An Excel document compiled quickly, by hand, with the primary goal of providing an overview, rather than complete accuracy. At this point in the process, we paint with broad brushes. We learn the data’s characteristics, determine which facets are interesting, and prototype visualization ideas.
At the beginning of a project, I always explore a few simple visualization techniques to get a feel for the data. For example, simple bar charts (as shown in Figure 1), scatter plots, and choropleths are great ways to get a visual sense of what the data is saying.
My main tools for this process are d3.js, R — ggplot2 in particular — and Tableau. For this project I used ggplot2 (version 0.9 came out halfway through) and the CRAN package 'beanplot'.
Once we have a feel for the data, we start brainstorming and trying out ideas. For example, an early idea led us to explore using concentric circles to represent the tree of geographic categories (Hemisphere, Continent, Country, Region, County, City), and then filling an arc of a circle with a scatter plot to show individual grants. You can see this idea sketched, with mostly fake data, in Figure 2. We ultimately decided the technique didn’t use space effectively enough for what we needed to convey.
Our next idea was to use modified beanplots [Figure 3] to succinctly describe the distribution of various quantities at the same time. These were made with the beanplot package available on CRAN. We denormalized them, meaning we hacked the beanplot function to make the total area of each bean proportional to volume. With traditional beanplots, the total area of each bean is always the same, since the beans represent probability distributions rather than counts. This is counter-intuitive if the viewer is unfamiliar with statistics. We actually went as far as developing a working tool using these modified beanplots.
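To make the difference between a normalized and a denormalized bean concrete, here is a minimal sketch in base R. It does not reproduce our modified beanplot function; it only uses `density()` from the stats package to show the scaling idea. The grant amounts are simulated, the program names are made up, "volume" is taken here to mean the number of grants, and the log10 transform is an assumption made for skewed dollar amounts.

```r
# Simulated grant amounts in dollars for two hypothetical program areas.
# These are NOT the Foundation's data.
set.seed(42)
grants <- list(
  education   = rlnorm(200, meanlog = 11, sdlog = 1),
  environment = rlnorm(50,  meanlog = 12, sdlog = 0.8)
)

# Shared evaluation grid on the log10 scale, padded by one unit on each
# side so the tails of both density estimates fall inside the grid.
grid_range <- range(log10(unlist(grants))) + c(-1, 1)

# Kernel density estimate for one group. With scale_by_count = TRUE the
# curve is multiplied by the number of grants, so its area is roughly n
# rather than roughly 1.
bean <- function(amounts, scale_by_count = FALSE) {
  d <- density(log10(amounts), from = grid_range[1], to = grid_range[2], n = 512)
  if (scale_by_count) d$y <- d$y * length(amounts)
  d
}

# Trapezoid-rule area under an estimated density curve.
area <- function(d) {
  y <- d$y
  sum(diff(d$x) * (y[-1] + y[-length(y)]) / 2)
}

normalized   <- lapply(grants, bean)
denormalized <- lapply(grants, bean, scale_by_count = TRUE)

# One row per scaling choice, one column per program area.
areas <- rbind(
  normalized   = sapply(normalized, area),
  denormalized = sapply(denormalized, area)
)
print(round(areas, 2))
```

In the normalized row every bean has about the same area regardless of how many grants it summarizes, while in the denormalized row the area tracks the number of grants in each group. Scaling by total dollars instead of by count would be a different, equally plausible reading of "volume".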
The width of the bean at a given dollar amount shows the probability that the next dollar falls at that amount. Extensive user testing showed that this interpretation was too high a cognitive hurdle for the casual viewer. Users liked the visual presentation but were confused about its meaning, even with a detailed page showing how to interpret the beanplot.
We decided to consider alternatives to the beanplot that would still accomplish the same goals. We also wanted a very simple technique that could be explained in a phrase. After a few iterations, we agreed that interactive heatmaps [Figure 4] would be a good solution. You will be able to see them in action at Periscopic.com when the final product launches later this year.
R provides an ideal toolkit for exploring methods to visualize data distributions. Between specialized packages and comprehensive toolkits like ggplot2, a wide range of techniques is available to the analyst. In particular, the transparent structure of most R functions makes them easy to pull apart and put back together again, lending great flexibility to the patient programmer.
Andrew Winterman does Data Design for Periscopic. An inquisitive humanist, he is motivated by the promise of making ours a more rational society. He applies his skills to the problem of converting data into information, a process requiring scripting and research into the relevant fields of study. He holds a B.A. in Mathematics from Reed College, and patiently pursues a Master of Science in Biostatistics at Oregon Health & Science University. He greatly enjoys his daily bicycle commute, Portland’s artisanal culture, searing vegetables in cast iron, and thinking about epidemiology.