Functional ANOVA using INLA

Posted on January 13, 2012 by simonbarthelme in R bloggers | 0 Comments

[This article was first published on dahtah » R, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

[Update alert: INLA author Håvard Rue found a problem with the code below. See here]

Ramsay and Silverman’s Functional Data Analysis is a tremendously useful book that deserves to be more widely known. It’s full of ideas of neat things one can do when part of a dataset can be viewed as a set of curves – which is quite often. One of the methods they’ve developed is called Functional ANOVA. It can be understood I think as a way of formulation a hierarchical prior, but the way they introduce it is more as a way of finding and visualising patterns of variation in a bunch of curves. Ramsay and Silverman use classical penalised likelihood estimation techniques, but I thought it’d be useful to explore functional ANOVA in a Bayesian framework, so here’s a quick explanation of how you can do that armed with R and INLA (Rue et al., 2009).

The quickest way to understand what fANOVA is about is to go through Ramsay and Silverman’s analysis of Canadian temperature profiles (a summary of their analysis is available here). The dataset is the daily temperature average in 35 Canadian locations, for the period 1960-1994. Here’s a summary plot:

Each curve represents the smoothed temperature profile at a location, and the fuzzy points around the curve are the raw averages (not much noise here). Note to potential American readers who can’t figure out the y-axis: those temperatures range from “Cool” down to “Not appreciably above absolute zero”.

Importantly, the 35 locations are grouped into 4 regions: Atlantic, Continental, Pacific and Arctic. The profiles seem to differ across regions, but it’s hard to immediately say how from this graph, especially since, in a not totally unexpected fashion, there’s a big Summer-Winter difference in there (spot it?)

Enter fANOVA: we will attempt to describe the variability in these curves in terms of the combined effects of variation due to Season and Region and ascribe whatever variability remains to individual Locations (and noise). The model used by Ramsay & Silverman is

$\displaystyle y_{ij}(t)=s(t)+r_{j}\left(t\right)+\epsilon_{ij}\left(t\right)$

${y_{ij}\left(t\right)}$ is the temperature at location ${i}$ in region ${j}$ at time ${t}$ . Every location shares the same basic profile ${s(t)}$ , then there’s some extra variation ${r_{j}\left(t\right)}$ that depends on the region, and everything else gets lumped into a term ${\epsilon_{ij}\left(t\right)}$ that is specific to a particular location.

Ramsay & Silverman fit this model and obtain the following regional effects ${r_{j}\left(t\right)}$ :

(Figure nicked from their website).

What’s really nice is that this is immediately interpretable: for example, compared to everybody else, the Pacific region gets a large bonus in the winter, but a smaller one in the summer, while the Atlantic region gets the same bonus year-round. The Artic sucks, but significantly less so in the summer.

To fit the model Ramsay and Silverman use a penalty technique (and some orthogonality constraints I won’t get into), essentially, to estimate ${s(t),r_{j}\left(t\right),\ldots}$ they maximise the sum of the log-likelihood plus a penalty on the wiggliness of these functions, to get smooth curves as a result.

In a Bayesian framework one could do something in the same spirit by putting priors on the functions ${s(t),r_{j}\left(t\right),\ldots}$ . Inference would be done as usual by looking for example at the posterior ${p\left(s(t)|\mathbf{y}\right)}$ to draw conclusions about ${s(t)}$ . We want to set up the priors so that:

The constant null function ${f(t)=0}$ is a priori the most likely, and everything that’s not the null function is less likely than the null.
Smooth functions are more likely than non-smooth functions (the smoother the better).

(NB: if the prior is proper than this rids us of the need for pesky orthogonality constraints)

Why insist that the prior impose conditions (1) and (2)? Condition (2) makes sense among other things because we don’t want our functions to fit noise in the data, i.e. we don’t want to interpolate exactly the signal. More importantly, conditions (1) and (2) are needed because we want variation to be attributed to a global effect or inter-regional differences whenever it makes sense to do so. Going back to the model equation:

$\displaystyle y_{ij}(t)=s(t)+r_{j}\left(t\right)+\epsilon_{ij}\left(t\right)$

it’s obvious that this is as such unidentifiable – we could stick all seasonal variation in the regional terms ${r_{1},\ldots,r_{4}}$ and the predictions would be the same. Explicitly if we change the variables so that ${s'(t)=0}$ , ${r_{j}'\left(t\right)=s(t)+r_{j}\left(t\right)}$ , than the likelihood has not changed but all the global seasonal variability has now moved to the intra-regional level.

This is of course something we’d like to avoid. Imposing conditions (1) and (2) lets us do that. It’s useful at this point to think of the prior as imposing a cost: the further from 0 the function, and the less smooth, the higher the cost. When we shift variability away from the global level ${s(t)}$ to the regional level ${r_{j}\left(t\right)}$ , with the change ${s'(t)=0,\, r_{j}'\left(t\right)=s(t)+r_{j}\left(t\right)}$ then:

The cost for ${s(t)}$ has now been reduced.
However we have now four functions ${r_{j}'\left(t\right)=s(t)+r_{j}\left(t\right)}$ that are now in all likelihood wigglier and further from 0 than they used to be.

In total the cost will have risen – the prior puts a price on coincidence (four regions having the same global pattern by chance). There’s admittedly a lot of hand-waving here, but this is the rough shape of it.

One way to impose conditions (1) and (2) is to put Gaussian Process priors on our latent functions and use MCMC, but we’d have a posterior distribution with ${365\times\left(35+4+1\right)=14,600}$ dimensions, and sampling from it would be rather slow. The INLA package in R lets us do the same thing, much, much more efficiently.

INLA is very fast for two reasons. One, it prefers Gauss-Markov Processes to Gaussian processes – Gauss-Markov Processes more or less approximate GPs, but have sparse inverse covariance matrices, which speeds up inference a lot. Second, the philosophy of INLA is to not even attempt to capture the joint posterior distribution, but only approximate uni-dimensional marginals – for example ${p(s(23)|\mathbf{y})}$ , the posterior distribution of the seasonal effect at day 23. It does so using a variant of the Laplace approximation of Tierney & Kadane (note to ML folks: it’s related to, but not what you probably think is a Laplace approximation). This means that MCMC is not used in INLA, but fast optimisation algorithms instead.

The R INLA package has an interface that’s not completely unlike that of MGCV (itself similar to lm and glm), although they’re very different behind the scenes. You specify a model using the formula interface, e.g.:


inla(y ~ x,data=dat,family=''gaussian'')

will perform inference on the linear regression model:

$\displaystyle \begin{array}{rcl} y_{i} & = & \beta x_{i}+\beta_{0}+\epsilon\\ \epsilon & \sim & \mathcal{N}\left(0,\tau^{-1}\right) \end{array}$

This means INLA will return marginal distributions for ${\beta}$ , ${\beta_{0}}$ , integrated over the hyperparameter ${\tau}$ (unknown noise precision).

A nonlinear regression

$\displaystyle y_{i}=f(x_{i})+\beta_{0}+\epsilon$

can be done using:

res <- inla(y ~ f(x,model="rw2",diagonal=1e-5),dat=df,family="gaussian")

model=”rw2” specifies a 2nd-order random walk model, which is for all instance and purposes similar to a spline penalty (i.e., the estimated ${f(x)}$ will be smooth).
diagonal = 1e-5 makes the prior proper by adding a small diagonal component to the precision matrix of the Gaussian prior.

This is not always necessary but stabilises INLA, which sometimes won’t work when two values of ${x}$ are too close together. This is due to an ill-conditioned prior precision matrix (INLA is great but still has some rough edges, and it helps to know the theory when trying to figure out why something is not working).

Every term that comes enclosed in a ${f()}$ in the formula, INLA calls a random effect. Everything else is called a fixed effect. The different priors are called models. This clashes a bit with traditional terminology, but one gets used to it.

In the case of the Canadian temperature data, I found after some tweaking that the formula below works well:


temp~f(day,model="rw2",cyclic=T,diagonal=.0001) +f(day.region,model="rw2",replicate=region.ind,cyclic=T,diagonal=.01) +f(day.place,model="rw2",cyclic=T,diagonal=.01,replicate=place.ind) +region

temp is the temperature
day is the time index (1 to 365). Several functions will depend on the time index, which INLA doesn’t like, so we duplicate it artificially (day.region, day.place).
compared to Ramsay & Silverman’s specification, ${\epsilon_{ij}\left(t\right)}$ is decomposed into a smooth component (the third in the formula), and measurement noise (because we use Gaussian likelihood)
the regional and place effects are “replicated”, which means in the case of the regional effects that INLA will consider that ${r_{1}\left(t\right),\ldots,r_{4}\left(t\right)}$ have the same hyperparameters (here this means the same level of smoothness).
the “region” factor comes in as a linear effect – effectively this decomposes the regional effect into a constant shift and a smooth part.
We set cyclic=T because the functions we are trying to infer are periodic (over the year).
Roughly speaking, the diagonal component imposes a penalty on ${\int\left|f(t)\right|^{2}dt}$ . We want the global component to be larger than the regional ones, so it gets a smaller diagonal penalty.

The data and formula can now be fed into INLA. As a first check, we can plot the “fitted” model:

The smooth lines are the posterior expected value of the linear predictor. We do not seem to be doing anything terribly wrong so far.

We can also plot the estimated regional effects (again, we plot posterior expected values):

Compared to the effects estimated by Ramsay&Silverstein, we have a lot more wiggly high-frequency stuff going on. Closer examination shows that the wiggly bits actually seem to really be in the data, and not just made up by the procedure. Since I’m no meteorologist I have no idea what causes them. Smoothing our curves (using MGCV) recovers essentially the curves R&S had originally inferred (INLA first, R&S second):

What about the global, seasonal component? Here’s another plot modelled on one by R&S, showing the global component (dashed gray line), and the sum of the global component and each seasonal effect (coloured lines)

Here’s R&S’s original plot:

The code to reproduce the examples is given below. You’ll need to download INLA from r-inla.org, it’s not on CRAN.

References

Ramsay, J. and Silverman, B. W. (2005). Functional Data Analysis (Springer Series in Statistics). Springer, 2nd edition.

Rue, H., Martino, S., and Chopin, N. (2009). Approximate bayesian inference for latent gaussian models by using integrated nested laplace approximations. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 71(2):319-392.

</div>
<div>#INLA for functional ANOVA on the Canadian temperature dataset

#Usage:
#Load the data
#cw = load.data()
#Run INLA
#res <- inla.fanova.temperature(cw)
#
#Plot "fitted" model
#plot.fitted(cw,res)
#
#etc.

require(fda)
require(INLA)
require(ggplot2)
require(stringr)
require(mgcv)
require(directlabels)

#Use the same colour scheme Ramsay&Silverman did
col.scale <- c('red','blue','darkgreen','black')

load.data <- function()
{
cw <- with(CanadianWeather,ldply(1:length(place),function(ind) data.frame(temp=dailyAv[,ind,1],place=place[ind],region=region[ind],day=1:365)))
#We need to create multiple copies of the time index because we need multiple functions of time
cw <- mutate(cw,day.region=day,day.place=day,
region.ind=as.numeric(region),
place.ind=as.numeric(place))

#INLA apparently doesn't like the original factor levels, we modify them
levels(cw$place) <- str_replace(levels(cw$place),'. ','_')
cw
}

inla.fanova.temperature <- function(cw)
{
#The formula for the model
formula <- temp ~ f(day.region,model="rw2",replicate=region.ind,cyclic=T,diagonal=.01)+f(day,model="rw2",cyclic=T,diagonal=.0001)+f(day.place,model="rw2",cyclic=T,diagonal=.01,replicate=place.ind) + region
#Note that 'region' comes in as a factor, and inla treats factors
#in the same way as the 'lm' function, i.e., using contrasts
#This is not strictly necessary in a Bayesian analysis and here complicates things a bit
#The default "treatment" contrasts used here mean that the "place" factor has the Pacific region as the default level to which other regions are compared

#Call inla
#We use control.fixed to impose a proper gaussian prior on the fixed effects
#and control.predictor to make INLA compute marginal distributions for each value of the linear predictor
#The call takes 160 sec. on my machine
inla(formula,data=cw,family="gaussian",
control.predictor=list(compute=T),
control.fixed=list(prec=list(default=0.01,prec.intercept = 0.001)),verbose=T)
}

plot.raw.data <- function(cw)
{
p <- ggplot(cw,aes(day,temp,group=place,colour=region))+geom_point(alpha=.1)+geom_smooth(method="gam",form = y ~ s(x))+labs(x='Day of the year',y='Temp. (° C)')+scale_colour_manual(values=col.scale)+facet_wrap(~ region)
p + theme_bw() + opts(legend.position="none")
}

plot.fitted <- function(cw,res)
{
cw$fitted <- (res$summary.fitted.values$mean)
p <- ggplot(cw,aes(day,temp,group=place,colour=region))+geom_point(alpha=.1)+facet_wrap(~ region)+scale_colour_manual(values=col.scale)+geom_path(aes(y=fitted))
p + theme_bw() + opts(legend.position="none")
}

extract.regional.effects <- function(cw,res)
{
reg.effect <- reff(res)$day.reg
names(reg.effect)[1] <- 'day'
reg.effect$region <- gl(4,365,lab=levels(cw$region))

#Total regional effect is equal to smooth component + intercept + regional effect
feff.region <- feff(res)[str_detect(attributes(feff(res))$row.names,"region"),]
feff.inter <- feff(res)[str_detect(attributes(feff(res))$row.names,"Inter"),]
offsets <- c(feff.inter$mean,feff.inter$mean+feff.region$mean)
reg.effect$total.effect <- reg.effect$mean+rep(offsets,each=365)
reg.effect
}

plot.regional.effects <- function(cw,res,smooth=F)
{
reg.effect <- extract.regional.effects(cw,res)
p <- ggplot(reg.effect,aes(day,total.effect,colour=region))+geom_line()+scale_colour_manual(values=c('red','blue','darkgreen','black'))+scale_y_continuous(lim=c(-18,15))
if (smooth) p <- p + geom_smooth(method="gam",form = y ~ s(x),lty=2)
p <- p + geom_abline(slope=0,intercept=0,lty=3,col="lightblue")
p <- p + geom_dl(aes(label=region),method="top.qp")
p + theme_bw() +opts(panel.grid.minor = theme_blank(),panel.grid.major = theme_blank(),legend.position="none") + labs(x='\n Day of the year',y='Temp. (° C)')
}

plot.regional.avg <- function(cw,res)
{
reg.effect <- extract.regional.effects(cw,res)
regionavg <- data.frame(day=1:365,
glob.effect=reff(res)$day$mean,
reg.avg=reff(res)$day$mean+reg.effect$total.effect,
region=reg.effect$region )
p <- ggplot(regionavg,aes(day,reg.avg,colour=region))+geom_line()+facet_wrap(~ region)+scale_colour_manual(values=col.scale)+geom_line(aes(y=glob.effect),col="darkgrey",lty=2)
p <- p + geom_abline(slope=0,intercept=0,lty=3,col="lightblue")
p+ theme_bw() +opts(panel.grid.minor = theme_blank(),panel.grid.major = theme_blank(),legend.position="none") + labs(x='\n Day of the year',y='Temp. (° C)')

}

#Extract "random effects" summary from an inla object
reff <- function(res.inla)
  {
    res.inla$summary.random
  }

#Extract "fixed effects"
feff <- function(res.inla)
  {
    as.data.frame(res.inla$summary.fixed)
  }