Dual axes time series plots with various more awkward data

[This article was first published on Peter's stats stuff - R, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

In my most recent blog post I introduced the dualplot() R function, which allows you to create time series plots with two different scales on the vertical axes in a way that minimises the potential problems of misinterpretation. See that earlier post for a discussion of the pros and cons of the whole approach, which I won’t repeat here.

I’ve now made some minor enhancements:

  • improved handling of series that start at different times
  • fixed some minor problems with series that have different frequencies (eg monthly versus quarterly)
  • added a fairly sensible default choice of axes for when one or both of the series dip into negative territory
  • improved default legend and axis titles for situations when the data are columns in a data frame
  • improved x axis labels for when x is a an object with date or time characteristics

The main idea behind my first version was that the two series would be rendered on the screen as though they had first been converted to indexes (like the Consumer Price Index), and then have the original scales restored so the original absolute values can still be seen. The choice to be made then becomes which point to choose as the point to make the series use as their reference (eg the period at which they are defined as 1, 100 or 1000, if we were to leave them as indices).

When time series start at different times

Default to use first available “cross over point”

For the (common) situation that the two series you wish to plot start at different times, I decided it rarely made sense to have both series start at the same vertical position on the page at their respective earliest point. So I’ve set the default to make the two series appear as though their reference period is the first time period when both series have data. In effect, this makes the later series appear to grow out of the first series, and you can see its relative growth from that period onwards.

This is illustrated nicely in the first plot below, which shows how the share price of Fonterra Cooperative Group has performed performed relative to BHP Billiton. For non-Australasian readers, Fonterra (who’s historical origin is as a dairy farming cooperative) is one of the most important economic enterprises in New Zealand; and BHP is an Anglo-Australian multinational mining giant.

starts2

As well as seeing the relative growth since data on both series has become available, we can still see the approximate absolute values, eg just under $6 for a Fonterra Cooperative Group share.

Readers who have been paying attention will notice that in drawing this plot I appear to have broken one of my own rules on when to use a dual axes graphic, which was to not use two axes when the unit of both series is fundamentally the same. In this case, both series are showing dollars per share. I had a hard think about this and think that this is (marginally) a case where it is acceptable to use two different axes. While the units are the same (dollars per share), the magnitudes are sufficiently different that it is worthwhile, if we wish to compare growth paths (typically the issue of most interest with shares), to examine them on different scales.

Here’s the code to load that data from Yahoo Finance and draw the chart with my dualplot() function:

# Load up functionality
library(dplyr)
library(tidyr)
library(quantmod) # for getSymbols for share price data
library(ggseas)   # for nzbop data

# The dualplot() function:
source("https://gist.githubusercontent.com/ellisp/4002241def4e2b360189e58c3f461b4a/raw/9ab547bff18f73e783aaf30a7e4851c9a2f95b80/dualplot.R")     

#=================Different starting points==================
# Load some example data from the stock exchange
# Fonterra data only available from Yahoo for the past few years
fonterra <- getSymbols('FCG.NZ', src='yahoo', auto.assign = FALSE) 
bhp <- getSymbols('BHP.AX', src='yahoo', auto.assign = FALSE) 

# default - a cross over point happen at the earliest point of the shorter series
dualplot(x1 = time(bhp), y1 = bhp$BHP.AX.Close,
         x2 = time(fonterra), y2 = fonterra$FCG.NZ.Close, 
         ylab1 = "BHP", ylab2 = "Fonterra", 
         legx = "topright", main = "Price per share of two key firms")

Note that this is a common case where both series would have been converted to an index somehow in many graphics. The absolute value of share prices is not meaningful to most people so there is little lost in converting them to indices. However, they do mean something for people who are actively concerned with them, so there is value in my dual axis approach which retains the absolute value of the two series while preserving the non-misleading characteristics of the index-based approach.

Starting at the same vertical position

Sometime it might make sense, depending on the data and the purpose of the graphic, to start the later series at the same vertical height as the first series. This is equivalent visually to plots of indexes where the reference point for each series is the first time period of the series. The tweaked version of dualplot() lets the user specify this (or more complex) combinations of reference points.

starts3

# or can override eg - each one begins at the same vertical height on its earliest point
dualplot(x1 = time(bhp), y1 = bhp$BHP.AX.Close,
         x2 = time(fonterra), y2 = fonterra$FCG.NZ.Close, 
         ylab1 = "BHP", ylab2 = "Fonterra", 
         legx = "topright", 
         main = "Price per share of two key firms\n(starting at same vertical position)",
         ylim.ref = c(1, 1), silent = TRUE)

Converging to a common vertical position

… or you might want to adopt a teleological view of history and and see how growth in the past has led to the latest point:

starts4

In this case, the last two charts are almost identical, but that’s not always going to be the result. In this particular case, these two charts being similar just means the decline in BHP Billiton share price from the beginning of the chart to the end of the chart is roughly the same (a bit over 10%) as the decline in Fonterra Cooperative Group share price from the end of 2012 to the end of the chart.

dualplot(x1 = time(bhp), y1 = bhp$BHP.AX.Close,
         x2 = time(fonterra), y2 = fonterra$FCG.NZ.Close, 
         ylab1 = "BHP", ylab2 = "Fonterra", 
         legx = "topright", 
         main = "Price per share of two key firms\n(finishing at same vertical position)",
         ylim.ref = c(nrow(bhp), nrow(fonterra)))

Time series that have different frequencies

I’ve tested the dualplot() function more thoroughly against data where the two series have different frequencies. Here’s an example comparing a longer series of Air New Zealand share prices against New Zealand’s national exports of services (which includes Air New Zealand sales to foreigners).

frequency

We can see that both series took a hit from the Global Financial crisis with a dip in the years after 2008; but the Air New Zealand share price was impacted much more severely than were New Zealand service exports as a whole (which of course comprise multiple industries and firms).

To do this I used an old copy of New Zealand balance of payments data that is one of the example datasets in the ggseas R package for seasonal adjustment on the fly. The data are not meant to be up to date, they just illustrate superimposing quarterly data on daily data.

There’s nothing special about the call to dualplot() to do this so I used this as a chance to show some of the optional features of the function; such as drawing gridlines (white ones in this case) and passing arguments through to legend() to colour the space behind the legend white.

airnz <- getSymbols('AIR.NZ', src='yahoo', auto.assign = FALSE, from = "2000-01-01") 

services <- nzbop %>%
   filter(Category == "Services; Exports total") %>%
   filter(TimePeriod > as.Date("1997-12-30"))

dualplot(x1 = services$TimePeriod, y1 = services$Value, x2 = time(airnz), y2 = airnz$AIR.NZ.Close,
	 ylab1 = "New Zealand service exports ($m)\n", ylab2 = "Air New Zealand share price",
	 yleg1 = "All service exports ($m) (left axis)", lwd = c(4,2), col = c("rosybrown", "steelblue"),
	 main ="New Zealand service exports and Air New Zealand share price over time",
	 colgrid = "white", bty = "o", bg = "white", box.col = "white")

For a second example, let’s go back to Fonterra, and this time compare their daily share price to weekly movements in the the world whole milk dairy powder price which I sourced from Quandl.

frequency

Having the scales carefully chosen to mimic indices means that we can see immediately the lower level of volatility in the share price than the dairy price; something that would be hidden by other visualisation methods such as showing each series on its own facet with a free y axis chosen to use the full plotting area.

dairy <- Quandl("GDT/WMP_PI")

par(family = "myfont")
dualplot(x1 = dairy[ , 1], y1 = dairy[, 8], 
	 x2 = time(fonterra), y2 = fonterra$FCG.NZ.Close,
	 ylab1 = "Whole milk powder price index\n",
	 ylab2 = "Fonterra Cooperative Group\nshare price ($)",
	 col = colorspace::rainbow_hcl(2, c = 80, l = 50, start = 45, end = 270),
	 colgrid = "grey90",
	 main = "Fonterra share prices are less volatile than global milk prices")

Before I move on, a note on colour. Usually I wouldn’t use so many palettes in a single document for basically the same purpose (identification between two qualitative variables). The effect is (to me) somewhat jarring. But as I’m demonstrating use of a new graphics function some of the normal requirements of good aesthetics are suspended.

Negative data

When data goes negative, graphing an index rarely makes sense. Most indexes are of data that is strictly positive, such as share prices or the consumer price index. This creates a new problem of what scale to use for a dual axis plot, and there is no clear answer, which makes me inclined to doubt that such a graphic is often justified. However, to get at least a reasonable default, I’ve set it so that if one of the series is negative the graphic will choose two vertical scales such that the plotting area includes the mean of each series plus three standard deviations in each direction.

negatives

Here’s the code for that example usage with negative data; I struggled to think of a good (non-misleading) use case with negative data so had to simulate two series:

set.seed(123)
data1 <- data.frame(x = 1:100, y = arima.sim(list(ar = c(0.9)), 100))
data2 <- data.frame(x = 1:100, y = arima.sim(list(ar = c(0.9)), 100) * 50 - 3)

dualplot(x1 = data1$x, y1 = data1$y, x2= data1$x, y2 = data2$y)

But the dual axis approach has limitations

Finally, as an offshoot of preparing this blog post, here’s a use case that reminded me that dual axes plots are very often not going to be a good graphic. They are cluttered, and can be hard to interpret.

A graphic that does not superimpose two data series in the same plotting area is best in this case. After various attempts to relate the volume of Air New Zealand share sales with prices, I ended up with this as my best graphic:

airnz

The vertical lines connecting the bursts of high sales volumes in the lower plot to the price changes in the upper plot work far better than superimposing the series (cluttered and hard to see connections); un-annotated facets; or a connected scatter plot (of either the original or transformed data or growth rates).

Here’s the code behind that graphic:

airnz3 <- airnz %>%
   as.data.frame() %>%
   mutate(TimePeriod = time(airnz),
          PeakVol = AIR.NZ.Volume > 2 * 10 ^ 7) 

peaks <- filter(airnz3, PeakVol)$TimePeriod

airnz3 %>%
   select(TimePeriod, AIR.NZ.Close, AIR.NZ.Volume) %>%
   mutate(SquareRootOfVolume = sqrt(AIR.NZ.Volume)) %>%
   rename(ClosingPrice = AIR.NZ.Close) %>%
   gather(variable, value, -TimePeriod, -AIR.NZ.Volume) %>%
   ggplot(aes(x = TimePeriod, y = value)) +
   facet_wrap(~variable, scales = "free_y", ncol = 1) +
   geom_line() +
   geom_vline(xintercept = as.numeric(peaks), colour = "steelblue", size = 2.5, alpha = 0.1) +
   geom_line(colour = "brown") +
   labs(x = "", y = "", title = "Four big trading events for Air New Zealand shares since 2000")

dualplot()

Here’s the code for the latest version of dualplot; it’s a Gist on GitHub. The code above includes an example line to source it directly into an R program.

To leave a comment for the author, please follow the link and comment on their blog: Peter's stats stuff - R.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)