Coloring (and Drawing) Outside the Lines in ggplot

[This article was first published on rud.is » R, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Time for another Twitter-inspired blog post this week, this time from a tweet by @JonKalodimos:

I had seen and appreciated Ann’s post on her makeover of the main graphic in NPR’s story and did a quick mental check of how I’d do the same in ggplot2 as I was reading it. Jon’s question was a good prompt to dump physical memory to internet memory.

Here’s the NPR graphic:

When_Women_Stopped_Coding___Planet_Money___NPR

It is actually pretty darn good on it’s own, but I also agree with Ann that direct labeling could have made it better. Here’s her makeover:

Let’s see how to do this in ggplot2. We’ll use the actual data from NPR’s story since the graphic was built with D3 and, hence, the data is part of the graphic. Let’s get the library stuff out of the way:

library(ggplot2)
library(dplyr)
library(tidyr)
library(stringr)
library(scales)
library(gridExtra)
library(grid)

Now, we’ll grab the CSV that the NPR folks used for the graphic and take a look at it. I found it via Developer Tools in Chrome:

# use the NPR story data file ---------------------------------------------
# and be kind to NPR's bandwidth budget
url <- "http://apps.npr.org/dailygraphics/graphics/women-cs/data.csv"
fil <- "gender.csv"
if (!file.exists(fil)) download.file(url, fil)
 
gender <- read.csv(fil, stringsAsFactors=FALSE)
 
# take a look at the CSV structure ----------------------------------------
 
glimpse(gender)
 
## Observations: 48
## Variables:
## $ date              (int) 1966, 1967, 1968, 1969, 1970, 1971, 1972, 1973, 1974, 1975, 1976, 1977, 1978, 1979, 1980, 1981, 1982, ...
## $ Medical.School    (dbl) 0.09, 0.10, 0.10, 0.09, 0.09, 0.11, 0.14, 0.17, 0.20, 0.22, 0.24, 0.25, 0.25, 0.25, 0.28, 0.29, 0.31, ...
## $ Law.School        (chr) "0.04", "0.04", "0.05", "0.07", "0.07", "0.1", "0.12", "0.16", "0.2", "0.24", "0.27", "0.28", "0.3", "...
## $ Physical.Sciences (chr) "0.14", "0.14", "0.14", "0.14", "0.14", "0.15", "0.16", "0.16", "0.17", "0.19", "0.2", "0.2", "0.22", ...
## $ Computer.science  (dbl) 0.146, 0.108, 0.120, 0.130, 0.129, 0.136, 0.136, 0.149, 0.164, 0.190, 0.198, 0.239, 0.258, 0.281, 0.30...
 
tail(gender)
 
##    date Medical School Law School Physical Sciences Computer science
## 43 2008           0.48       0.47              0.41            0.177
## 44 2009           0.48       0.47              0.42            0.179
## 45 2010           0.48       0.47              0.41            0.182
## 46 2011           0.47         tk                tk            0.177
## 47 2012           0.47         tk                tk            0.182
## 48 2013           0.46         tk                              0.179

Those tk values are referred to in the code that makes the NPR graphic so we’ll replace them with NAs and make all the columns numeric:

gender <- mutate_each(gender, funs(as.numeric))

We should also clean up the column names since we’ll be using them for the legend and the direct labels:

colnames(gender) <- str_replace(colnames(gender), "\.", " ")
 
gender_long <- mutate(gather(gender, area, value, -date),
                      area=factor(area, levels=colnames(gender)[2:5],
                                  ordered=TRUE))</
 
That that code link also has the colors NPR used for the graphic, so let's define those now since we bothered to look at it:
 
<pre lang="rsplus">gender_colors <- c('#11605E', '#17807E', '#8BC0BF','#D8472B')
names(gender_colors) <- colnames(gender)[2:5]

We’ll be needing those names later on, hence why I named the values in the vector.

With the data, labels and colors defined, we can make a “standard” ggplot:

chart_title <- expression(atop("What Happened To Women In Computer Science?",
                               atop(italic("% Of Women Majors, By Field"))))
 
gg <- ggplot(gender_long)
gg <- gg + geom_line(aes(x=date, y=value, group=area, color=area))
gg <- gg + scale_color_manual(name="", values=gender_colors)
gg <- gg + scale_y_continuous(label=percent)
gg <- gg + labs(x=NULL, y=NULL, title=chart_title)
gg <- gg + theme_bw(base_family="Helvetica")
gg <- gg + theme(axis.ticks.y=element_blank())
gg <- gg + theme(panel.border=element_blank())
gg <- gg + theme(legend.key=element_blank())
gg

Rplot01

That’s also pretty good on it’s own. It’s possible to make it look a bit more like the NPR chart, but it’s hard to format a title & subtitle in a ggplot title and have it left-justified, so I opted for font style. It’s also possible to make the legend look like NPR’s but that’s not the point of the post.

So, how do we make this look more like Ann’s makeover?

First we need to get the last values for each of the variables so we know what point on the y axis we need to place the labels. That’s made a bit trickier with the NAs:

last_vals <- sapply(colnames(gender)[2:5], function(x) last(na.exclude(gender[,x])))
last_date <- tail(gender$date)+1 # doing this ^ wld have made it a double

Next, we need to turn off the legend and increase the plot margin on the right-hand side:

gg <- gg + theme(legend.position="none")
gg <- gg + theme(plot.margin = unit(c(1, 7, 2, 1), "lines"))

I figured out those #’s by interactive trial-and-error, though I initially guessed 6 for the right-hand margin increase. Also, this should demonstrate one reason for the gg <- gg + madness you see in my code/posts since, when you start doing more in ggplot, you end up with that idiom more oft than not.

Now, we add the labels. We do it with with custom annotations that are placed “one year” after the latest x value and at the same y value as the last reading of each area. We also color the label the same as the line, which is why we needed a named vector.

for (i in 1:length(last_vals)) {
  gg <- gg + annotation_custom(grob=textGrob(names(last_vals)[i], hjust=0,
                                             gp=gpar(fontsize=8, 
                                                     col=gender_colors[names(last_vals)[i]])),
                               xmin=2014, xmax=2014,
                               ymin=last_vals[i], ymax=last_vals[i])
}

Finally, we have to do some of the remaining work by hand since we have to turn off panel clipping and the only way I know how to do that is at the grob/gtable level, but it’s not that scary or complex of a task. Also, since we are manipulating the built ggplot object, we have to use grid.draw to present our chart:

gb <- ggplot_build(gg)
gt <- ggplot_gtable(gb)
 
gt$layout$clip[gt$layout$name=="panel"] <- "off"
 
grid.draw(gt)

Here’s the result:

Rplot02

I’ve deliberately left the fonts a bit small and not-changed their positions on the y-axis to give readers a bit of homework. They both should be changed and the plot margins could also be tweaked a tad. You can find the complete code on github so tweak away!

If you have another way to accomplish the same task or want to show off your tweaked version, drop a note in the comments or at that gist link.

To leave a comment for the author, please follow the link and comment on their blog: rud.is » R.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)