Tonight the NYC R Meetup will be discussing data visualization in R using ggplot2. As part of tonight’s meeting I will be providing a very brief show and tell, which includes mostly code examples and external resources. This exercise has had me thinking quite a bit about data visualization. In addition, a few days ago the Security Crank (great new blog) pinged me on the apparent uselessness of network analysis visualizations in the defense and intelligence communities. As I say in my comment at SC, I agree; however, only in that the method is abused by those that view it as only a means to generate “pretty pictures.” All of this has touched off a very important point about data analysis; possibly the most important, which is how best to convey an analysis visually.
Consumers of data analytics are very rarely analysts themselves, so those in the business of generating plots, figures, chats, graphs, etc. most not only be expert in the analytical process, but also in choosing the best format and medium for relaying that knowledge to an audience. Admittedly, I am not Edward Tufte, Ben Fry, or David McCandless, but I have been around long enough to know what does and does not work, and as such here (in no particular order) are my five rules for data visualization.
- The viz must be able to stand alone
This I learned early, after being dressed down multiple times while giving briefings to senior intelligence officers. Since then it has been reinforced while sitting in on failed job talks and conference presentations. The important thing to keep n mind is that when an audience sees a visualization it should be providing answers, not generating more questions.
This, to me, is the most difficult aspect of creating high quality data visualizations. As the creators we are often intimately familiar with the data, and thus take its subtleties for granted. Some people recommend asking yourself “would my Grandmother understand this,” but why insult Grandma’s intelligence? Here’s the bottom line: you have to decide the most efficient means of plotting the data (we’ll get to this), then you have a chart title, legend, possibly some axis labels, and if you are bold a short (140 characters is a good limit) footnote to get your point across. The best visualizations only require a subset of these to be effective, but once you have added the appropriate data accoutrements the chart better be self-explanatory. Very simple and imperfect example: restaurant tipping trends between men and women.
Why is the chart on the right better? First, it has more explanatory value. By splitting the data into two parts we are able to see the x-axis shift for men, i.e., in general they are tipping on higher bills. Also, we are able to use color in a more valuable way; rather than using it to distinguish between sex we can use it to highlight outliers and note general trends. Next, by reducing the amount of data in each plot the information is conveyed more efficiently. Finally, it achieves our ultimate goal, which is always to provide more answers than questions.
- Have a diverse tool set
Learning the quirks and syntax of various data visualizations tools is time consuming and often frustrating, but if you want to create impressive charts you have to do it. I am very sorry to report, but Microsoft Excel + PowerPoint do not generate the best data visualizations. In fact, they often generate visualizations in the 10-20th percentile of quality. The question; therefore is: how do you find the best tools for your task?
Most of us will not have the resources to use professional data visualizations suites, but even so these tools are often limited by the scope and vision of their creators. Explore the open-source and general purpose data visualization options out there, learn the three best that fit your needs, and always be open to learning the new stuff—it will pay off.
- People are terrible at distinguishing small differences
This could also be described as the “pie chart trap,” but clearly goes beyond that particular chart design. In fact, network visualizations are notorious for blurring subtle differences. For example, visualizations of massive amounts of social network data can be beautiful, but in nearly all cases they are much more art than science. If we are interested in telling a story with our data, and our data is large and complex, then we need to be creative about how to parse that complexity in order to enhance the clarity of our story. Example using networks: the structure of venture capital co-investments
The visualizations above examine the same data, and even use a similar technique to visualize it, but clearly the example on the right is conveying a more informative story. Admittedly, this visualization, which I generated, in many ways violates my first rule; however, it is still telling a story (e.g., there is a strong underlying structure among four notable communities of VC firms). The visualization on the left, taken from an initial attempt at analyzing this data, tells almost no story; save that the network is highly complex and there exist some disconnected firms.
- Color selection matters
This would seem to be a self-evident point, but it may be the most often violated rule of quality visualization. It seems the primary reason for this problem is laziness, as the default color schemes in many visualization packages were not designed to convey information (again, see the left panel of the figure above). I recently violated this rule while putting together the slides for tonight’s R meetup. Using a single line of R code I generated this chart:
data(whiskey,package="flexmix") library(ggplot2) ggplot(subset(whiskey_brands,Brand!="Other brands") ,aes(x=Type, fill=Brand))+geom_bar(position="fill")
In my defense, I was first excited that there was a built-in Scotch whiskey dataset in R, but I also wanted to show what could be done with a single line of code. Clearly, however, the color scheme I used is taking away from the story. The default color scheme in ggplot2 wants to use a gradient, which may be useful in some cases, but not here. To improve the above example I should override this default and construct a more informative color scheme; such as setting a base color for each Scotch type (e.g., blue for blends and green for single malts).
- Reduce, reuse, recycle
When developing statistical models we are often striving to specify the most “parsimonious” model, that is, the model that has the highest explanatory value-to-required variables ratio. We do this to reduce waste in our models, enhance our degrees of freedom, and provide a model that is most relevant to the data. The exact same rules apply to visualizations. Not all observations are created equally; therefore, they may not all belong in a visualization. Those who are analyzing large datasets take data reduction (or “munging”) as given, but in any visualization if something is not adding any value take it out. Developing new and meaningful methods for reducing data is a serious challenge, but one that should be considered before any attempt at visualization is done
On the other hand, if a reduction and/or visualization method has be successful in the past then it will likely b e successful in the future, so do not be afraid to reuse and recycle. Many of the most successful data visualizers have distinguished themselves by creating a method for visualization and sticking with it (think Gapminder). Not only will it possibly make you famous, but putting in the effort to create a useful method for combining, reducing and visualizing data will mean your efforts are more streamlined in the long term.
So that’s it. Nothing too profound there, but I wanted to post this in order to start a conversation. In that vein, what did I miss and where do you disagree? As always, I welcome your comments.