Color: The Cinderella of dataviz

[This article was first published on Dataspora » R, and kindly contributed to R-bloggers.]

“Avoiding catastrophe becomes the first principle in bringing color to information: Above all, do no harm.”  — Envisioning Information, Edward Tufte, Graphics Press, 1990   

Color is one of the most abused and neglected tools in data visualization. It is abused when we make poor color choices; it is neglected when we rely on poor software defaults. Yet despite its historically poor treatment at the hands of engineers and end-users alike, if used wisely, color is unrivaled as a visualization tool.

Most of us think twice before walking outside in fluorescent red underoos. If only we were as cautious in choosing colors for infographics. The difference is that few of us design our own clothes. But until good palettes (like ColorBrewer) are commonplace, to get colors that fit our purposes, we must be our own tailors.

While obsessing over how to implement color in Dataspora Labs’ PitchFX viewer, I began with a basic motivating question:

Why use color in data graphics?

If our data are simple, a single color is sufficient, even preferable. For example, below is a scatter plot of 287 pitches thrown by the major league pitcher Oscar Villarreal in 2008. With just two dimensions of data to describe — the x and y location in the strike zone — black and white is sufficient. In fact, this scatter plot is a perfectly lossless representation of the data set (assuming no data points perfectly overlap).

Fig 1. Location of Pitches (Villarreal, HOU, 2008)

Simple black and white scatter plot

But what if we’d like to know more: for instance, what kinds of pitches (curveballs, fastballs) landed where? Or their speed?  Visualizations live in two dimensions, but the world they describe is rarely so confined.

The defining challenge of data visualization is projecting high-dimensional data onto a low-dimensional canvas. (As a rule, one should never do the reverse: visualize more dimensions than already exist in the data.)

Getting back to our pitching example, if we want to layer another dimension of data — pitch type — into our plot, we have several methods at our disposal:

  1. plotting symbols – vary the glyphs that we use (circles, triangles, etc.)
  2. small multiples – vary extra dimensions in space, creating a series of smaller plots
  3. color – we can color our data, encoding extra dimensions inside a color space

Which technique you employ depends on the nature of the data and the medium of your canvas. I will describe all three by way of example.

Multivariate Method I: Vary Your Plotting Symbols

Fig 2. Location and Pitch Type (Villarreal, HOU, 2008)

Scatterplot with varied plotting symbols.

Here I’ve layered the categorical dimension of pitch type into our plot by using four different plotting symbols.

I consider this visualization an abject failure. In fact, the prize for my most despised graphs in graduate school goes to bacterial growth curves rendered this way. These graphs make our heads hurt for two reasons: (i) distinguishing glyphs demands extra attention (versus what academics call ‘pre-attentively processed’ cues like color), and (ii) even after we visually decode the symbols, we face yet another step: mapping symbols to their semantic categories. (Admittedly, this can be improved with Chernoff faces or other iconic symbols, where the categorical mapping is self-evident.)

Multivariate Method II: Small Multiples on a Canvas

Folding additional dimensions into a partitioned canvas has a distinguished pedigree in information graphics. It has been employed everywhere from Galileo’s sunspot illustrations to William Cleveland’s trellis plots. And as Scott McCloud’s unexpected tour de force on comics makes clear, panels of pictures possess a narrative power that a single, undivided canvas lacks.

In the plot below, the four types of pitches that Oscar throws are split horizontally into separate panels. By reducing our plot sizes, we’ve given up some resolution in positional information. But in return, patterns that were invisible in our first plot, and obscured by varied symbols in our second, are now clear: Oscar throws his fastballs low, but his sliders high.

Fig 3. Location and Pitch Type (Villarreal, HOU, 2008)

black and white strip plot

Multiplying plots in space works especially well on printed media, which can hold more than ten times as many dots per square inch as a screen. Both columns and rows can be used to lattice over additional dimensions, the result being a matrix of scatter plots (in R, see the ‘splom’ function).
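As a rough illustration of the bookkeeping behind small multiples, here is a minimal sketch in Python (the figures in this article were drawn with R and lattice, so this is not the code behind them). It splits records into one panel per category and computes axis limits shared by all panels, so that positions stay comparable from panel to panel. The sample records, the function names, and the `pad` parameter are all hypothetical.

```python
from collections import defaultdict

# Hypothetical sample records: (pitch_type, x, y). Values are made up for illustration.
pitches = [
    ("fastball", -0.4, 1.9),
    ("slider", 0.6, 3.1),
    ("fastball", 0.1, 2.2),
    ("curveball", -0.2, 2.7),
    ("slider", 0.3, 3.4),
]


def split_into_panels(records):
    """Group (category, x, y) records into one list of (x, y) points per category."""
    panels = defaultdict(list)
    for category, x, y in records:
        panels[category].append((x, y))
    return dict(panels)


def shared_limits(records, pad=0.1):
    """Axis limits common to every panel, so positions compare across panels."""
    xs = [x for _, x, _ in records]
    ys = [y for _, _, y in records]
    return (min(xs) - pad, max(xs) + pad), (min(ys) - pad, max(ys) + pad)


panels = split_into_panels(pitches)
xlim, ylim = shared_limits(pitches)
for name in sorted(panels):
    print(name, len(panels[name]), "pitches")
print("shared x limits:", xlim, "shared y limits:", ylim)
```

Computing the limits once from the full data set, rather than per panel, is the design choice that matters: if each panel scaled itself, "low" in one panel would not mean the same thing as "low" in the next.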

Multivariate Method III: Color Your Data

So why bother with color?

First, as compared to most print media, computer displays have fewer units of space, but a broader color gamut. So color is a compensatory strength.

For multi-dimensional data, color can convey additional dimensions inside a unit of space — and can do so instantly. Color differences can be detected within 200 ms, before you’re even conscious of paying attention (the ‘pre-attentive’ concept I mentioned earlier).

But the most important reason to use color in multivariate graphics is that color is itself multidimensional. Our perceptual color space, however you slice it, is three-dimensional.

In the example below, I’ve used color as a means of encoding a fourth dimension of our pitching data: the speed of pitches thrown. The palette I’ve chosen is a diverging palette that moves along one dimension (think of it as the ‘redness-blueness’ dimension) in the CIELUV color space, while maintaining a constant level of luminosity.

Fig 4. Location, Pitch Type, and Velocity (Villarreal, HOU, 2008)

isoluminant, diverging color ramp

color strip plot

Holding luminosity constant is important, because luminosity (similar to brightness) determines a color’s visual impact. Bright colors pop, and dark colors recede. A color ramp that varies luminosity along with hue will highlight data points as an artifact of color choice.
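To make the idea of an isoluminant diverging ramp concrete, here is a minimal sketch in Python using only the standard library. It is not the code behind Fig 4, which was produced in R. It walks from a blue hue through neutral gray to a red hue in polar CIELUV coordinates while holding lightness fixed, converts each color to sRGB, and then checks the relative luminance of the result. The hue angles (260 and 10 degrees), the lightness (65), the maximum chroma (40), and all function names are assumptions chosen for illustration; the D65 white point and the sRGB matrix coefficients are the standard published values.

```python
import math

# D65 reference white (Y normalised to 1) and its u', v' chromaticity.
XN, YN, ZN = 0.95047, 1.0, 1.08883
_DEN = XN + 15 * YN + 3 * ZN
UN, VN = 4 * XN / _DEN, 9 * YN / _DEN


def lch_to_linear_rgb(lightness, chroma, hue_deg):
    """Polar CIELUV (L, C, h) -> linear-light sRGB. Requires lightness > 0."""
    u = chroma * math.cos(math.radians(hue_deg))
    v = chroma * math.sin(math.radians(hue_deg))
    # Invert the CIE L* formula to recover luminance Y.
    if lightness > 8:
        y = YN * ((lightness + 16) / 116) ** 3
    else:
        y = YN * lightness / 903.3
    up = u / (13 * lightness) + UN
    vp = v / (13 * lightness) + VN
    x = y * 9 * up / (4 * vp)
    z = y * (12 - 3 * up - 20 * vp) / (4 * vp)
    # XYZ (D65) -> linear sRGB.
    r = 3.2406 * x - 1.5372 * y - 0.4986 * z
    g = -0.9689 * x + 1.8758 * y + 0.0415 * z
    b = 0.0557 * x - 0.2040 * y + 1.0570 * z
    return r, g, b


def to_hex(linear_rgb):
    """Apply the sRGB transfer curve and format as #RRGGBB."""
    def encode(c):
        c = min(max(c, 0.0), 1.0)  # clip tiny out-of-range values
        c = 12.92 * c if c <= 0.0031308 else 1.055 * c ** (1 / 2.4) - 0.055
        return round(c * 255)
    return "#{:02X}{:02X}{:02X}".format(*(encode(c) for c in linear_rgb))


def diverging_isoluminant(n=7, lightness=65, max_chroma=40,
                          hue_low=260, hue_high=10):
    """n colors from hue_low through neutral gray to hue_high at fixed lightness."""
    colors = []
    for i in range(n):
        t = 2 * i / (n - 1) - 1  # runs from -1 to +1 across the ramp
        hue = hue_low if t < 0 else hue_high
        colors.append(lch_to_linear_rgb(lightness, max_chroma * abs(t), hue))
    return colors


def relative_luminance(linear_rgb):
    """Relative luminance (Y) of a linear-light sRGB color."""
    r, g, b = linear_rgb
    return 0.2126 * r + 0.7152 * g + 0.0722 * b


ramp = diverging_isoluminant()
for rgb in ramp:
    print(to_hex(rgb), round(relative_luminance(rgb), 3))
```

Every swatch should report a luminance of about 0.34, which is the point: only the 'redness-blueness' of the color changes, so no speed class pops or recedes merely because of the palette. The modest chroma keeps both ends of the ramp inside the sRGB gamut at this lightness; a more saturated ramp would need a gamut check.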

I chose only seven gradations of color, so I’m downsampling (in a lossy way) our speed data – but further segmentation of our color ramp is not likely to be perceptible.
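The lossy downsampling described above amounts to binning a continuous variable into a handful of classes. A minimal Python sketch, assuming equal-width bins between the slowest and fastest pitch (the article does not say how the seven classes were actually cut), with made-up speeds and a hypothetical `bin_index` helper:

```python
def bin_index(value, low, high, n_bins=7):
    """Index (0 .. n_bins - 1) of the equal-width bin that contains value."""
    if high <= low:
        raise ValueError("high must be greater than low")
    fraction = (value - low) / (high - low)
    # Clamp so that value == high lands in the last bin rather than past it.
    return min(max(int(fraction * n_bins), 0), n_bins - 1)


# Hypothetical pitch speeds in mph, made up for illustration.
speeds = [74.5, 81.0, 86.2, 90.3, 93.8]
low, high = min(speeds), max(speeds)
bins = [bin_index(s, low, high) for s in speeds]
print(bins)  # [0, 2, 4, 5, 6]
```

Each index then selects one of the seven palette colors. Speeds that fall in the same bin become indistinguishable, which is exactly the information given up.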

I’ve also chosen to use filled circles as my plotting symbol, as opposed to the open circles in all my previous plots. This is done to improve the perception of each pitch’s speed via its color: small patches of color are less perceptible. But a consequence of this choice — compounded by our choice to work with a series of smaller plots — is that more points overlap. We’ve further degraded some of our positional information. However, in our last step, we attempt to recover some of this.

Now I’ve finally brought color to bear on this visualization, but I’ve only encoded a single dimension — speed. Which leads to another question:

If color is three-dimensional, can I encode three dimensions with it?

In theory, yes. Colin Ware researched this exact question. In practice, it’s difficult. It turns out that asking observers to assess the amount of ‘redness’, ‘blueness’, and ‘greenness’ of points is possible, but not intuitive (I suspect it’s somewhat like parsing symbols).

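Another complicating factor is that a nontrivial fraction of the population has some form of color blindness (commonly cited estimates put red-green deficiency at roughly 8% of men and 0.5% of women of Northern European ancestry). For those with dichromacy, this effectively reduces color perception to two dimensions.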

And finally, the truth is that our sensation of color is not equal along all dimensions; it’s thought that the closely related ‘red’ and ‘green’ receptors emerged via duplication of a single long-wavelength receptor (useful for distinguishing ripe from unripe fruit, according to one just-so story).

Because of the prevalence of color blindness in the population, and because of the challenge of encoding three dimensions in color, I feel color is best used to encode no more than two dimensions of data.

So, for my last example of our pitching plot data, I will introduce luminosity as a means of encoding the local density of points (using a kernel density estimator). This allows us to recover some of the data lost by increasing the sizes of our plotting symbols.
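A minimal sketch of this step in Python, standard library only. It estimates the local density at each point with a Gaussian kernel and maps that density linearly onto a lightness value, with denser regions darker. The isotropic kernel, the fixed bandwidth of 0.3, the lightness range of 85 down to 45, the sample locations, and the function names are all assumptions for illustration; the estimator and settings behind Fig 5 are not given in the article.

```python
import math


def gaussian_kde_2d(points, bandwidth):
    """Return a function that estimates point density at (x, y).

    Uses an isotropic Gaussian kernel with a single fixed bandwidth.
    """
    norm = 1.0 / (len(points) * 2 * math.pi * bandwidth ** 2)

    def density(x, y):
        total = 0.0
        for px, py in points:
            d2 = (x - px) ** 2 + (y - py) ** 2
            total += math.exp(-d2 / (2 * bandwidth ** 2))
        return norm * total

    return density


def density_to_lightness(d, d_min, d_max, light=85.0, dark=45.0):
    """Map density linearly onto lightness: sparse -> light, dense -> dark."""
    if d_max == d_min:
        return light
    t = (d - d_min) / (d_max - d_min)
    return light + t * (dark - light)


# Hypothetical pitch locations (made up): a tight cluster plus one stray pitch.
points = [(0.0, 2.0), (0.1, 2.1), (-0.1, 1.9), (0.05, 2.05), (1.5, 3.5)]
kde = gaussian_kde_2d(points, bandwidth=0.3)
densities = [kde(x, y) for x, y in points]
d_lo, d_hi = min(densities), max(densities)
lightness = [density_to_lightness(d, d_lo, d_hi) for d in densities]
for point, value in zip(points, lightness):
    print(point, round(value, 1))
```

The stray pitch gets the lightest value and the pitches in the cluster get the darkest, so overlapping filled circles read as a darker patch instead of simply hiding one another. The bandwidth controls how local "local density" is and would need tuning against real data.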

Fig 5. Location, Pitch Type, Velocity, and Density (Villarreal, HOU, 2008)

two-dimensional color palette

multivariate color strip plot

Here we have effectively employed a two-dimensional color palette, with blueness-redness varying along one axis for speed, and luminosity varying along the other to denote local density.

One final point about using luminosity. Observing colors in a data visualization involves overloading, in the programming sense. We rely on cognitive functions that were developed for one purpose (perceiving lions) and use them for another (perceiving lines).

Since we can overload color any way we want, whenever possible, we should choose mappings that are natural. Mapping pitch density to luminosity feels right because the darker shadows in our pitch plots imply depth. Likewise, when sampling from the color space, we might as well choose colors found in nature. These are the palettes our eyes were gazing at for the millions of years before #FF0000 showed up.

Color, used thoughtfully and responsibly, can be an incredibly valuable tool in visualizing high dimensional data.

FutureMan Asks: What about Animation?

This discussion has focused on using static graphics in general, and color in particular, as a means of visualizing multivariate data. I’ve purposely neglected one very powerful tool: motion. The ability to animate graphics multiplies by several orders of magnitude the amount of information that can be packed into a visualization. But packing information into a time-varying data structure has to be done by someone (you or me), and in my view this remains a significant challenge. Canonical forms of animated visualizations (equivalent to the histograms, box plots, and scatterplots of the static world) are still a ways off, but frameworks like Processing and Prefuse are a promising start toward their development.

Methods

The final product of these five-dimensional pitch plots, for all available data from the 2008 season, can be explored via the PitchFX Django-driven web tool at Dataspora Labs.

All of the visualizations here were developed using R and the lattice graphics package. (Of note, Hadley Wickham is developing ggplot2, a bold rewrite of the R graphics system based on a grammar of graphics.)
