[This article was first published on R – What You're Doing Is Rather Desperate, and kindly contributed to R-bloggers.]

I don’t “do politics” at this blog, but I’m always happy to do charts. Here’s one that’s been doing the rounds on Twitter recently:

What’s the first thing that comes into your mind on seeing that chart?

It seems that there are two main responses to the chart:

1. Wow, what happened to all those Democrat voters between 2008 and 2016?
2. Wow, that’s misleading, it makes it look like Democrat support almost halved between 2008 and 2016

The question then is: when (if ever) is it acceptable to start a y-axis at a non-zero value?

1. What would ggplot2 say?

Let’s get some data into ggplot2 and find out. There is plenty of publicly available election data; I’m using Wikipedia pages such as this one for the 2008 US election.

library(tidyr)
library(ggplot2)
library(scales)

# popular vote, electoral college votes and turnout 1980-2016
elections <- data.frame(
  year    = c(1980, 1984, 1988, 1992, 1996, 2000, 2004, 2008, 2012, 2016),
  Rep.pop = c(43903230, 54455472, 48886097, 39104550, 39197469, 50456002, 62040610, 59948323, 60933504, 61201031),
  Dem.pop = c(35480115, 37577352, 41809074, 44909806, 47401185, 50999897, 59028444, 69498516, 65915794, 62523126),
  Rep.ec  = c(489, 525, 426, 168, 159, 271, 286, 173, 206, 306),
  Dem.ec  = c(49, 13, 111, 370, 379, 266, 251, 365, 332, 232),
  turnout = c(52.6, 53.3, 50.2, 55.2, 49.0, 51.2, 56.7, 58.2, 54.9, 53.7)
)


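Let’s tidy that up a little so that the data are in “long format”, with one value per row identified by year, party (the “variable” column) and vote type. Note that turnout has no party prefix, so separate() fills its “vote” column with NA and issues a warning; those rows drop out when we subset on vote later.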

elections.1 <- gather(elections, key, value, -year)
elections.2 <- separate(elections.1, key, into = c("variable", "vote"), sep = "\\.")
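
In current versions of tidyr, gather() is superseded by pivot_longer(), which can do the reshape and the split on the dot in a single call. The following is only a sketch of that alternative, not the code used for the charts in this post; the toy data frame and the name long are made up for illustration, with values copied from the table above.

```r
library(tidyr)

# Toy subset of the elections data (two years only; hypothetical name "toy")
toy <- data.frame(
  year    = c(2012, 2016),
  Rep.pop = c(60933504, 61201031),
  Dem.pop = c(65915794, 62523126),
  Rep.ec  = c(206, 306),
  Dem.ec  = c(332, 232)
)

# Reshape to long format and split "Rep.pop" into variable = "Rep", vote = "pop"
long <- pivot_longer(
  toy,
  cols      = -year,
  names_to  = c("variable", "vote"),
  names_sep = "\\.",
  values_to = "value"
)
```

The result has the same columns as elections.2 (year, variable, vote, value). The turnout column is left out of the toy data because it has no dot to split on.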


Now we try to replicate the chart seen on Twitter:

# each "+" must end a line, not start one, for R to continue the expression
ggplot(subset(elections.2, vote == "pop" & year > 2004), aes(year, value)) +
  geom_bar(aes(fill = variable), stat = "identity", position = "dodge") +
  scale_x_continuous(breaks = seq(1980, 2016, 4)) +
  scale_fill_manual(values = c("blue", "red")) +
  scale_y_continuous(labels = comma) +
  theme_bw() +
  labs(title = "US Election Popular vote 2008 - 2016")


ggplot2 by default starts the y-axis at zero when the chart is a bar chart. To replicate the Twitter chart, we add some extra options to scale_y_continuous.

ggplot(subset(elections.2, vote == "pop" & year > 2004), aes(year, value)) +
  geom_bar(aes(fill = variable), stat = "identity", position = "dodge") +
  scale_x_continuous(breaks = seq(1980, 2016, 4)) +
  scale_fill_manual(values = c("blue", "red")) +
  # rescale_none keeps out-of-range values so the bars are not dropped
  scale_y_continuous(labels = comma, limits = c(52000000, 72000000), oob = rescale_none) +
  theme_bw() +
  labs(title = "US Election Popular vote 2008 - 2016")
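
A note on why oob = rescale_none is needed: as far as I understand it, a continuous scale by default censors values outside its limits, and since every bar has its base at zero, the bars would otherwise be dropped. Another way to get the same truncated view is to leave the scale alone and zoom with coord_cartesian(). Here is a minimal sketch of that approach; the small pop data frame and the name p are introduced just for this example, with values copied from the data above.

```r
library(ggplot2)
library(scales)

# Popular vote 2008-2016, copied from the elections data frame above
pop <- data.frame(
  year  = rep(c(2008, 2012, 2016), times = 2),
  party = rep(c("Dem", "Rep"), each = 3),
  votes = c(69498516, 65915794, 62523126,
            59948323, 60933504, 61201031)
)

p <- ggplot(pop, aes(year, votes)) +
  geom_col(aes(fill = party), position = "dodge") +
  scale_x_continuous(breaks = seq(2008, 2016, 4)) +
  scale_fill_manual(values = c("blue", "red")) +
  scale_y_continuous(labels = comma) +
  # zoom the view only; the data, including each bar's base at 0, stay intact
  coord_cartesian(ylim = c(52000000, 72000000)) +
  theme_bw()
```

Calling print(p) draws the chart. Either route produces a truncated bar chart, so the question of whether it misleads applies equally to both.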


There seem to be two reactions to this chart. One is that it’s effective in showing the decline in the Democrat popular vote since 2008, whilst the Republican vote has stayed relatively stable. The other is that by truncating the y-axis, the chart misleads people into thinking that the Democrat vote in 2016 was around 60% of what it was in 2008. To be honest, I can see both points of view. Personally, my eye is drawn to the absolute values on the y-axis, but perhaps that is just me (and others like me).
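
To put a number on the “around 60%” impression, we can compare the ratio of the actual vote counts with the ratio of the bar heights as drawn from the truncated baseline. This is just a back-of-the-envelope check using the figures from the data frame above; the variable names are mine.

```r
dem_2008 <- 69498516
dem_2016 <- 62523126
baseline <- 52000000  # lower y-axis limit used in the truncated chart

# ratio of the actual vote counts
true_ratio <- dem_2016 / dem_2008

# ratio of the bar heights as drawn above the truncated baseline
visual_ratio <- (dem_2016 - baseline) / (dem_2008 - baseline)

round(c(true = true_ratio, visual = visual_ratio), 2)
```

With these figures the 2016 vote is about 90% of the 2008 vote, while the 2016 bar is only about 60% as tall as the 2008 bar.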

This article tells us that “it’s OK not to start your y-axis at zero”, but then states that “column and bar charts should always have zeroed axes”. They use a chart from the Twitter IPO as an example.

If you were waiting for the obligatory bad-mouthing of Excel, look no further than a follow-up Tweet by the chart author.

Onwards. What if we use a line chart instead?

ggplot(subset(elections.2, vote == "pop" & year > 2004), aes(year, value)) +
  geom_line(aes(color = variable)) +
  geom_point() +
  scale_x_continuous(breaks = seq(1980, 2016, 4)) +
  scale_color_manual(values = c("blue", "red")) +
  scale_y_continuous(labels = comma) +
  theme_bw() +
  labs(title = "US Election Popular vote 2008 - 2016")


Now ggplot2 thinks that it’s fine to use a non-zero y-axis. The eye no longer compares absolute heights.

How does the line chart look if we force the y-axis back to starting at zero?

ggplot(subset(elections.2, vote == "pop" & year > 2004), aes(year, value)) +
  geom_line(aes(color = variable)) +
  geom_point() +
  scale_x_continuous(breaks = seq(1980, 2016, 4)) +
  scale_color_manual(values = c("blue", "red")) +
  # force the y-axis to start at zero
  scale_y_continuous(labels = comma, limits = c(0, 72000000)) +
  theme_bw() +
  labs(title = "US Election Popular vote 2008 - 2016")


I think the blue decline is still apparent. The main issue with this one for me is not any attempt to mislead, just a lot of wasted white space.

2. What would Tufte say?

A common response is to ask what Tufte would say. You can read what he says here. In that particular quote he says the y-axis should reflect the range of the data, and he has nothing specific to say regarding bar charts. His last sentence is telling:

# same line chart, but with every election from 1980 onwards
ggplot(subset(elections.2, vote == "pop"), aes(year, value)) +
  geom_line(aes(color = variable)) +
  geom_point() +
  scale_x_continuous(breaks = seq(1980, 2016, 4)) +
  scale_color_manual(values = c("blue", "red")) +
  scale_y_continuous(labels = comma) +
  theme_bw() +
  labs(title = "US Election Popular vote 1980 - 2016")


And indeed, it is interesting to add more elections starting from the year 1980.

3. In summary

It is interesting to see people react differently to the same chart. A cynic might say “often in a manner that reflects their beliefs.” However, the current collective wisdom seems to be:

• it’s OK to start your y-axis at a non-zero value
• unless it’s a bar/column chart
• listen to Tufte
