On Whether Y-axis Labels Are Always Necessary

[This article was first published on R – rud.is, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

The infamous @albertocairo blogged about a nice interactive piece on German company tax avoidance by @ProPublica. Here’s a snapshot of their interactive chart:

Dr. Cairo (his PhD is in the bag as far as I’m concerned 🙂 posited:

Isn’t it weird that the chart doesn’t have a scale on the Y-axis? It’s not the first time I see this, and it makes me feel uneasy.

I jumped over to the interactive piece to see if the authors used interactive tooltips since viewers can get a good idea for the scale limits if they do that and it kinda sorta makes not having Y-axis label mostly OK if they compensate with said interactive notations. The interactive had no tooltips and the Y-axis was completely unlabeled.

Now, they used D3, so there are built-in ways to create and add a Y-axis, so I don’t think this was an “oops…we forgot” moment. The Y values are “Short Interest Quantity” which is the quantity of stock shares that investors have sold short but not yet covered or closed out. It’s definitely a “1%-er” term and the authors already took time to explain some technical financial details and probably would have had to add even more text to explain this term properly (since that short definition is really not enough for most of us 99%-ers). It seems that they felt the the arrowed-annotations on the right hand side of the plot made up for the lack of actual Y-axis detail.

Should we always have labels on a given axis? Would knowing that the Y-axis on this chart went from 0 to 800 million have aided in the decoding or groking the overall message? Here’s another example to help frame that question. This is the seminal ggplot2::geom_density() demo chart:

RStudio 2

Given that folks outside the realm of statistics/datasci really don’t grok what that Y-axis is saying, would it be horribad to just leave it with a “density” Y-label (sans unit marks) and then explain it in text (or talk to/around it in text but not go into detail)? Or should we keep the full annotations and spend a precious paragraph of text talking about measuring the area under a curve? (Another argument is to choose the right vis for the right audience but that’s another post entirely).

To further illustrate the posit, I recently made a series of what I call a “rank ordered segment plot” for a report that we did at @Rapid7:

ssh_rank_chart-1

There are text annotations for countries at either end of the spectrum on the X-axis but they aren’t individually labeled cuz…ewwww that’d be messy. The interactive version (coming this week over at community.rapid7.com) has the full table and light hover popup-annotations. But the point wasn’t to really focus on the countries as it was to depict the sad state of the ratio of unencrypted vs encrypted for a given service type within a country.

So, should the ProPublica authors have tried to be more discrete w/r/t their Y-axis or is it fine the way it is? Does there always need to be discrete axes annotations or is there some wiggle room? Opines are welcome in the comments since I honestly don’t think there is “one answer to rule them all” for this.

And for those that really want to see more discrete info on the ProPublica Y-axis labels, here’s a static, faceted chart (you may need to click/select/tap the chart to make it big enough to view):

Plot_Zoom

Don’tTry This At Home!

ProPublica made that data available via two CSV files and the crosswalk org translation table via their main D3 javascript file (use Developer Tools “Inspect Element” to see such things). I ended up having to use Sys.setlocale('LC_ALL','C') and expand the translation table a bit due to some of the mixed encodings in the data sets. Code to make the chart is below.

library(ggplot2)
library(dplyr)
library(stringi)
library(hrbrmisc)
library(scales)
library(ggalt)
library(sitools)

# mixed encodings ftw!
Sys.setlocale('LC_ALL','C') 

# different names in different data sets; sigh
org_crosswalk <- read.table(text='company,trans
"Adidas AG","Adidas AG"
"Allianz SE","Allianz SE"
"BASF SE","BASF SE"
"Bayer AG","Bayer AG"
"Bayerische Motoren Werke AG","BMW AG"
"BMW AG","BMW AG",
"Beiersdorf AG","Beiersdorf AG"
"Commerzbank AG","Commerzbank AG"
"Continental AG","Continental AG"
"Daimler AG","Daimler AG"
"Deutsche Bank AG","Deutsche Bank AG"
"Deutsche Boerse AG","Deutsche Boerse AG"
"Deutsche Lufthansa AG","Deutsche Lufthansa AG"
"Deutsche Post AG","Deutsche Post AG"
"Deutsche Telekom AG","Deutsche Telekom AG"
"E.ON","E.ON"
"Fresenius Medical Care AG & Co. KGaA","Fresenius Medical Care AG"
"Fresenius Medical Care AG","Fresenius Medical Care AG"
"Fresenius SE & Co KGaA","Fresenius SE & Co KGaA"
"HeidelbergCement AG","HeidelbergCement AG"
"Henkel AG & Co. KGaA","Henkel AG & Co. KGaA"
"Infineon Technologies AG","Infineon Technologies AG"
"K+S AG","K+S AG"
"Lanxess AG","Lanxess AG"
"Linde AG","Linde AG"
"Merck KGaA","Merck KGaA"
"MŸnchener RŸckversicherungs-Gesellschaft AG","Munich RE AG"
"M�nchener R�ckversicherungs-Gesellschaft AG","Munich RE AG"
"M\x9fnchener R\x9fckversicherungs-Gesellschaft AG","Munich RE AG"
"M?nchener R?ckversicherungs-Gesellschaft AG","Munich RE AG"
"Munich RE AG","Munich RE AG"
"RWE AG","RWE AG"
"SAP SE","SAP SE"
"Siemens AG","Siemens AG"
"ThyssenKrupp AG","ThyssenKrupp AG"
"Volkswagen AG","Volkswagen AG"', stringsAsFactors=FALSE, sep=",", quote='"', header=TRUE)

# quicker/less verbose than left_join()
org_trans <- setNames(org_crosswalk$trans, org_crosswalk$company)

# get and clean both data sets, being kind to the propublica bandwidth $
rec_url <- "https://projects.propublica.org/graphics/javascripts/dividend/record_dates.csv"
rec_fil <- basename(rec_url)
if (!file.exists(rec_fil)) download.file(rec_url, rec_fil)

records <- read.csv(rec_fil, stringsAsFactors=FALSE)
records %>%
  select(company=1, year=2, record_date=3) %>%
  mutate(record_date=as.Date(stri_replace_all_regex(record_date,
                                                    "([[:digit:]]+)/([[:digit:]]+)+/([[:digit:]]+)$",
                                                    "20$3-$1-$2"))) %>%
  mutate(company=ifelse(grepl("Gesellschaft", company), "Munich RE AG", company)) %>% 
  mutate(company=org_trans[company]) -> records

div_url <- "https://projects.propublica.org/graphics/javascripts/dividend/dividend.csv"
div_fil <- basename(div_url)
if (!file.exists(div_fil)) download.file(div_url, div_fil)

dividends <- read.csv(div_fil, stringsAsFactors=FALSE)

dividends %>%
  select(company=1, pricing_date=2, short_int_qty=3) %>%
  mutate(pricing_date=as.Date(stri_replace_all_regex(pricing_date,
                                                     "([[:digit:]]+)/([[:digit:]]+)+/([[:digit:]]+)$",
                                                     "20$3-$1-$2"))) %>%
  mutate(company=ifelse(grepl("Gesellschaft", company), "Munich RE AG", company)) %>% 
  mutate(company=org_trans[company]) -> dividends

# sitools::f2si() doesn't work so well for this for some reason, so mk a small helper function
m_fmt <- function (x) { sprintf("%d M", as.integer(x/1000000)) }

# gotta wrap'em all
subt <- wrap_format(160)("German companies typically pay shareholders one big dividend a year. With the help of U.S. banks, international investors briefly lend their shares to German funds that don’t have to pay a dividend tax. The avoided tax – usually 15 percent of the dividend – is split by the investors and other participants in the deal. These transactions cost the German treasury about $1 billion a year. [Y-axis == short interest quantity]")

gg <- ggplot()

# draw the markers for the dividends
gg <- gg + geom_vline(data=records,
                      aes(xintercept=as.numeric(record_date)),
                      color="#b2182b", size=0.25, linetype="dotted")

# draw the time series
gg <- gg + geom_line(data=dividends,
                     aes(pricing_date, short_int_qty, group=company),
                     size=0.15)

gg <- gg + scale_x_date(expand=c(0,0))
gg <- gg + scale_y_continuous(expand=c(0,0), labels=m_fmt,
                              limits=c(0,800000000))

gg <- gg + facet_wrap(~company, scales="free_x")

gg <- gg + labs(x="Red, dotted line == Dividend date", y=NULL,
                title="Tax Avoidance Has a Heartbeat",
                subtitle=subt,
                caption="Source: https://projects.propublica.org/graphics/dividend")

# devtools::install_github("hrbrmstr/hrbrmisc") or roll your own
gg <- gg + theme_hrbrmstr_an(grid="XY", axis="", strip_text_size=8.5,
                             subtitle_size=10)
gg <- gg + theme(axis.text=element_text(size=6))
gg <- gg + theme(panel.grid.major=element_line(size=0.05))
gg <- gg + theme(panel.background=element_rect(fill="#e2e2e233",
                                               color="#e2e2e233"))
gg <- gg + theme(panel.margin=margin(10,10,20,10))
gg <- gg + theme(plot.margin=margin(20,20,20,20))
gg <- gg + theme(axis.title.x=element_text(color="#b2182bee", size=9, hjust=1))
gg <- gg + theme(plot.caption=element_text(margin=margin(t=5)))
gg

To leave a comment for the author, please follow the link and comment on their blog: R – rud.is.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)