[This article was first published on From the Bottom of the Heap - R, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

I recently coauthored a couple of papers on trends in environmental data (Curtis & Simpson in press; Monteith et al. in press), which we estimated using GAMs. Both papers included plots like the one shown below wherein we show the estimated trend and associated point-wise 95% confidence interval, plus some other markings. The coloured sections show where the estimated trend is changing in a statistically significantly manner, i.e. where a 95% confidence interval on the first derivative (rate of change) of the trend does not include 0. That particular figure and the others in the papers were drawn using the lattice package (Sarkar 2008), but I could just have easily used ggplot2 (Wickham 2009) instead. I was recently asked via email how I produced the figures in the paper. Rather than just reply to that email, I thought I’d knock up a quick post for my blog to show how it was done. Figure 1: Nitrate concentrations in rainfall at upland UK deposition monitoring sites showing estimated trend and point-wise 95% confidence interval. Periods of significant increase (blue) or decrease (red) are indicated by the coloured sections of the trend

For the purposes of this post, I’m not going to show how we fitted the time series models. Instead I’m just going to show some dummy data (two random walks) that illustrate how the data need to be arranged for the plotting code I’m going to use. To start then, create the dummy data we’ll use to draw some plots

set.seed(1)
tdat <- data.frame(Site = rep(paste0("Site", c("A","B")),
each = 100),
Date = rep(seq(Sys.Date(), by = "1 day", length = 100), 2),
Fitted = c(cumsum(rnorm(100)), cumsum(rnorm(100))),
Signif = rep(NA, 200))
tdat <- transform(tdat, Upper = Fitted + 1.5, Lower = Fitted - 1.5)
## select 1 region per Site as signif
take <- sample(10:70, 2)
take <- take + 100
tdat$Signif[take:(take+25)] <- tdat$Fitted[take:(take+25)]
tdat$Signif[take:(take+25)] <- tdat$Fitted[take:(take+25)]


This results in the following data frame

R> head(tdat)
Site       Date  Fitted Signif  Upper   Lower
1 SiteA 2013-10-23 -0.6265     NA 0.8735 -2.1265
2 SiteA 2013-10-24 -0.4428     NA 1.0572 -1.9428
3 SiteA 2013-10-25 -1.2784     NA 0.2216 -2.7784
4 SiteA 2013-10-26  0.3168     NA 1.8168 -1.1832
5 SiteA 2013-10-27  0.6463     NA 2.1463 -0.8537
6 SiteA 2013-10-28 -0.1741     NA 1.3259 -1.6741


The first data.frame() call created the first four columns of tdat, where we have

• Site, a factor variable indicating the two time series in the data,
• Date, a “Date” class vector which starts from today’s date and increase daily for the next 100 days, which we replicate twice, once per Site,
• Fitted, a numeric vector holding the trend estimates from the model.

Here I just use two separate random walks, but for the papers we used the output from predict() applied to the “gamm” classed model objects

• Signif, another numeric vector that will contain the same values as Fitted, but only for regions that are important or significant in some way. At first this is initialised with NAs.

In the papers we had two variables, Increasing and Decreasing, which contained the values of the estimated trend (i.e. duplicated Fitted) where the trend was either increasing or decreasing significantly. The general principle is the same, however; the non-NA locations will be indicated by a thicker line width and hence we duplicate the Fitted values only for the sections that are interesting.

The transform() line just adds some dummy confidence intervals to data frame, creating variables Upper and Lower. In the papers these were approximate, point-wise 95% confidence intervals computing using the standard errors of the realizations from the estimated trend, as returned by predict() with argument se.fit = TRUE.

The last section in the code block just selects two random points within the interior of the each time series, which we then use to mark the start of the “interesting” period. This and the next 25 values in each time series are used as indices to copy into Signif the corresponding values from Fitted.

With that done, we can start plotting. I’ll show the lattice version first and then the ggplot one.

### lattice version

library("lattice")


The key to creating the sort of plot shown in Figure 1 is to recognise that each of the lines we want to draw can be viewed as a separate y-axis variable. lattice allows for this by specifying multiple values on the left-hand-side of the formula used to describe the plot. We also need to facet the plot on Site. To draw the figure we use xyplot()

1 xyplot(Fitted + Upper + Lower + Signif ~ Date | Site,
2        data = tdat,
3        type = "l",
4        lty = c(1, 2, 2, 1),
5        lwd = c(1, 1, 1, 3),
6        col.line = c(rep("black",3), "red"))


The formula used describes the plot: Fitted + Upper + Lower + Signif ~ Date | Site. The variables" we want to plot are all passed to the left-hand-side of the formula, with Date used to the right of ~, indicating the x-axis variable to be used. The last part of the formula indicates conditioning on Site and is what instructs xyplot() to facet the resulting plot into separate panels for each Site. The parameters lty, lwd, and col.line all control the aesthetics of the plot, and are specified in the order that the variables appear in the formula. Hence we use solid lines for Fitted and Signif and dashed (type 2) for the confidence intervals (Upper and Lower). In a departure from base graphics, it is the col.line argument that is used to specify the colours used for lines drawn in the panels.

The resulting figure is shown below

### ggplot2

library("ggplot2")


With ggplot2 the key is to notice that each of the lines we want to draw on each panel can be drawn using different geom_line() layers, added sequentially to the plot. With each additional layer, we can override the default mapping by changing the y data in each layer using aes() within the geom_line() call. The code to create the plot is shown below.

1 ggplot(tdat, aes(x = Date, y = Fitted, group = Site)) +
2     geom_line() +
3     geom_line(mapping = aes(y = Upper), lty = "dashed") +
4     geom_line(mapping = aes(y = Lower), lty = "dashed") +
5     geom_line(mapping = aes(y = Signif), lwd = 1.3, colour = "red") +
6     facet_wrap( ~ Site)


The first line sets up the basic ggplot object with a mapping and a data object, to which we add a geom_line() layer (line 2). Note that here we don’t specify any arguments to geom_line(), so it picks up defaults from the base object created in line 1. In lines 3 to 5 we add additional geom_line() layers, but now we need to override the mapping of variables to axes on the plot, which we do by updating the mapping. We only need to change the y data used for each layer; the x data are taken from the base object created in line 1. Notice how we specify attributes for these lines outside the aes() calls? This controls how each line is drawn. The final line in the code chunk uses facet_wrap() to split the data up by Site and draw a separate panel for each of Site.

The resulting figure is shown below

I don’t think any of this is particularly revelatory, but, as someone did ask me how it was done, hopefully some readers will find this useful. Happy plotting!

Curtis C.J. & Simpson G.L. et al. (in press) Trends in bulk deposition of acidity in the UK, 1988–2007, assessed using additive models. Ecological Indicators.

Monteith D.T., Evans C.D., Henrys P.A., Simpson G.L. & Malcolm I.A.et al. (in press) Trends in the hydrochemistry of acid-sensitive surface waters in the UK 1988–2008. Ecological Indicators.

Sarkar D. et al. (2008) Lattice: Multivariate Data Visualization with R. Springer, New York.

Wickham H. et al. (2009) ggplot2: elegant graphics for data analysis. Springer New York.