Over the years of my graduate studies I made a lot of plots. I mean tonnes. To get an extremely conservative estimate I grep’ed for every instance of “plot(” in all of the many R scripts I wrote over the past five years.
find . -iname "*.R" -print0 | xargs -L1 -0 egrep -r "plot(" | wc -l 2922
The actual number is very likely orders of magnitude larger as 1) many of these plot statements are in loops, 2) it doesn’t capture how many times I may have ran a given script, 3) it doesn’t look at previous versions, 4) plot is not the only command to generate figures in R (eg hist), and 5) early in my graduate career I mainly used gnuplot and near the end I was using more and more matplotlib. But even at this lower bound, that’s nearly 3,000 plots. A quick look at the TOC of my thesis reveals a grand total of 33 figures. Were all the rest a waste? (Hint: No.)
The overwhelming majority of the plots that I created served a very different function than these final, publication-ready figures. Generally, visualizations are either:
- A) Communication between you and data, or
- B) Communication between you and someone else, though data.
These two modes serve very different purposes and can require taking different approaches in their creation. Visualizations in the first mode need only be quick and dirty. You can often forget about all that nice axis labeling, optimal color contrast, and whiz-bang interactivity. As per my estimates above, this made up at the very least 10:1 of visuals created. The important thing is that, in this mode, you already have all of the context. You know what the variables are, you know what the colors, shapes, sizes, and layouts mean – after all, you just coded it. The beauty of this is that you can iterate on these plots very quickly. The conversation between you and the data can dialog back and forth as you intrepidly explore and shine your light into all of it’s dark little corners.
In the second mode, you are telling a story to someone else. Much more thought and care needs to be placed on ensuring that the whole story is being told with the visualization. It is all too easy to produce something that makes sense to you, but is completely unintelligible to your intended audience. I’ve learned the hard way that this kind of visual should always be test-driven by someone who, ideally, is a member of your intended audience. When you are as steeped in the data as you most likely are, your mind will fill in any missing pieces of the story – something your audience won’t do.
In my new role as part of the Data Science team at Penn Medicine, I’ll be making more and more data visualizations in the second mode. A little less talking to myself with data, and a little more communicating with others through data. I’ll be sharing some of my experiences, tools, wins, and disasters here. Stay tuned!