[Update alert: INLA author Håvard Rue found a problem with the code below. See here]
Ramsay and Silverman’s Functional Data Analysis is a tremendously useful book that deserves to be more widely known. It’s full of ideas of neat things one can do when part of a dataset can be viewed as a set of curves, which is quite often. One of the methods they’ve developed is called Functional ANOVA. It can be understood, I think, as a way of formulating a hierarchical prior, but the way they introduce it is more as a way of finding and visualising patterns of variation in a bunch of curves. Ramsay and Silverman use classical penalised likelihood estimation techniques, but I thought it’d be useful to explore functional ANOVA in a Bayesian framework, so here’s a quick explanation of how you can do that armed with R and INLA (Rue et al., 2009).
The quickest way to understand what fANOVA is about is to go through Ramsay and Silverman’s analysis of Canadian temperature profiles (a summary of their analysis is available here). The dataset is the daily temperature average in 35 Canadian locations, for the period 1960-1994. Here’s a summary plot:
Each curve represents the smoothed temperature profile at a location, and the fuzzy points around the curve are the raw averages (not much noise here). Note to potential American readers who can’t figure out the y-axis: those temperatures range from “Cool” down to “Not appreciably above absolute zero”.
Importantly, the 35 locations are grouped into 4 regions: Atlantic, Continental, Pacific and Arctic. The profiles seem to differ across regions, but it’s hard to immediately say how from this graph, especially since, in a not totally unexpected fashion, there’s a big Summer-Winter difference in there (spot it?)
Enter fANOVA: we will attempt to describe the variability in these curves in terms of the combined effects of variation due to Season and Region, and ascribe whatever variability remains to individual Locations (and noise). The model used by Ramsay & Silverman is
$$y_{ij}(t) = \mu(t) + \alpha_j(t) + \epsilon_{ij}(t)$$
where $y_{ij}(t)$ is the temperature at location $i$ in region $j$ at time $t$. Every location shares the same basic profile $\mu(t)$, then there’s some extra variation $\alpha_j(t)$ that depends on the region, and everything else gets lumped into a term $\epsilon_{ij}(t)$ that is specific to a particular location.
Ramsay & Silverman fit this model and obtain the following regional effects $\alpha_j(t)$:
(Figure nicked from their website).
What’s really nice is that this is immediately interpretable: for example, compared to everybody else, the Pacific region gets a large bonus in the winter, but a smaller one in the summer, while the Atlantic region gets the same bonus year-round. The Arctic sucks, but significantly less so in the summer.
To fit the model, Ramsay and Silverman use a penalty technique (and some orthogonality constraints I won’t get into): essentially, to estimate the global profile $\mu$ and the regional effects $\alpha_j$ they maximise the log-likelihood minus a penalty on the wiggliness of these functions, so as to get smooth curves as a result.
In a Bayesian framework one could do something in the same spirit by putting priors on the latent functions (the global profile $\mu$, the regional effects $\alpha_j$ and the location-level terms $\epsilon_{ij}$). Inference would be done as usual, by looking for example at the posterior $p(\alpha_j \mid y)$ to draw conclusions about the regional effects $\alpha_j$. We want to set up the priors so that:
1. The constant null function is a priori the most likely, and everything that’s not the null function is less likely than the null.
2. Smooth functions are more likely than non-smooth functions (the smoother the better).
(NB: if the prior is proper then this rids us of the need for pesky orthogonality constraints)
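One concrete family of priors with both properties, written for a function $f$ evaluated on a daily grid $t = 1, \dots, T$, is sketched below. The symbols $\tau$ (weight on roughness) and $d$ (weight on distance from zero) are notation assumed here for illustration; they are not taken from Ramsay & Silverman.

```latex
p(f) \;\propto\; \exp\!\left( -\tfrac{1}{2} \Big[ \,
      \tau \sum_{t} \big( f(t+1) - 2 f(t) + f(t-1) \big)^2
      \;+\; d \sum_{t} f(t)^2 \, \Big] \right),
\qquad \tau > 0,\; d > 0 .
```

The exponent is zero only for $f \equiv 0$, so the null function is the mode (condition 1), and large second differences are penalised, so wiggly functions get lower density (condition 2). With $d > 0$ the quadratic form is strictly positive definite, which is what makes the prior proper.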
Why insist that the prior impose conditions (1) and (2)? Condition (2) makes sense, among other things, because we don’t want our functions to fit noise in the data, i.e. we don’t want to interpolate the signal exactly. More importantly, conditions (1) and (2) are needed because we want variation to be attributed to a global effect or inter-regional differences whenever it makes sense to do so. Going back to the model equation:
$$y_{ij}(t) = \mu(t) + \alpha_j(t) + \epsilon_{ij}(t)$$
it’s obvious that this is, as such, unidentifiable: we could stick all seasonal variation in the regional terms and the predictions would be the same. Explicitly, if we change the variables so that $\mu'(t) = 0$ and $\alpha'_j(t) = \alpha_j(t) + \mu(t)$, then the likelihood has not changed but all the global seasonal variability has now moved to the regional level.
This is of course something we’d like to avoid. Imposing conditions (1) and (2) lets us do that. It’s useful at this point to think of the prior as imposing a cost: the further from 0 the function, and the less smooth, the higher the cost. When we shift variability away from the global level $\mu$ to the regional level $\alpha_j$, with the change $\mu'(t) = 0$, $\alpha'_j(t) = \alpha_j(t) + \mu(t)$, then:
- The cost for $\mu$ has now been reduced.
- However, we now have four functions $\alpha'_j$ that are in all likelihood wigglier and further from 0 than they used to be.
In total the cost will have risen – the prior puts a price on coincidence (four regions having the same global pattern by chance). There’s admittedly a lot of hand-waving here, but this is the rough shape of it.
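The hand-waving can be made a little more concrete with a toy calculation in base R. Everything here is an assumption made for illustration: the quadratic cost function, its weights, and the made-up global and regional curves (chosen so that the regional curves sum to zero on every day). For simplicity the same weights are used for every function, which is not the case in the model fitted later.

```r
# Quadratic "cost" of a curve: roughness (squared second differences)
# plus a penalty on distance from zero. Weights are illustrative only.
cost <- function(f, smooth_weight = 100, diag_weight = 0.01) {
  smooth_weight * sum(diff(f, differences = 2)^2) + diag_weight * sum(f^2)
}

day <- 1:365
mu <- -15 * cos(2 * pi * day / 365)   # hypothetical global seasonal profile

# Four hypothetical regional effects (one per column), summing to zero each day
alpha <- cbind(
   3 * sin(2 * pi * day / 365),
  -3 * sin(2 * pi * day / 365),
   2 * cos(4 * pi * day / 365),
  -2 * cos(4 * pi * day / 365)
)

# Original parametrisation: one global curve plus four regional curves
cost_before <- cost(mu) + sum(apply(alpha, 2, cost))

# Shifted parametrisation: mu' = 0 and alpha_j' = alpha_j + mu
# (mu is recycled down each column of alpha)
cost_after <- cost(0 * mu) + sum(apply(alpha + mu, 2, cost))

cat("cost before shift:", cost_before, "\n")
cat("cost after shift: ", cost_after, "\n")
```

In this toy setting the shifted parametrisation pays for the seasonal pattern four times instead of once, so its total cost is higher by three times the cost of the global curve.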
One way to impose conditions (1) and (2) is to put Gaussian Process priors on our latent functions and use MCMC, but we’d have a posterior distribution with on the order of $365 \times (1 + 4 + 35)$ dimensions (one value per day for the global curve, the 4 regional curves and the 35 location curves), and sampling from it would be rather slow. The INLA package in R lets us do the same thing much, much more efficiently.
INLA is very fast for two reasons. First, it prefers Gauss-Markov Processes to Gaussian processes: Gauss-Markov Processes more or less approximate GPs, but have sparse inverse covariance matrices, which speeds up inference a lot. Second, the philosophy of INLA is to not even attempt to capture the joint posterior distribution, but only approximate uni-dimensional marginals, for example $p(\mu(23) \mid y)$, the posterior distribution of the seasonal effect at day 23. It does so using a variant of the Laplace approximation of Tierney & Kadane (note to ML folks: it’s related to, but not the same as, what you probably think of as a Laplace approximation). This means that INLA does not use MCMC, but fast optimisation algorithms instead.
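To see what “sparse inverse covariance” means in practice, here is a small base-R sketch. It builds the precision matrix of a second-order random walk from a second-difference operator and compares it with its inverse. The grid size and the small diagonal term (added only so that the matrix can be inverted) are arbitrary choices for illustration, and this is not how INLA stores things internally.

```r
n <- 50
D <- diff(diag(n), differences = 2)   # (n - 2) x n second-difference operator
Q <- crossprod(D) + 1e-5 * diag(n)    # RW2-type precision plus a small diagonal

# The precision matrix is banded: entries more than 2 off the diagonal are zero
band <- abs(row(Q) - col(Q)) <= 2
frac_nonzero_Q <- mean(Q != 0)

# The covariance matrix (the inverse of Q) is dense
Sigma <- solve(Q)
frac_nonzero_Sigma <- mean(abs(Sigma) > 1e-8)

cat("non-zero fraction of precision: ", frac_nonzero_Q, "\n")
cat("non-zero fraction of covariance:", frac_nonzero_Sigma, "\n")
```

The number of non-zero entries in the precision matrix grows linearly with the number of grid points, while a dense covariance matrix grows quadratically, which is where much of the speed-up comes from.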
The R INLA package has an interface that’s not completely unlike that of MGCV (itself similar to lm and glm), although they’re very different behind the scenes. You specify a model using the formula interface, e.g.:
inla(y ~ x, data = dat, family = "gaussian")
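will perform inference on the linear regression model:
$$y_i = \beta_0 + \beta_1 x_i + \epsilon_i, \quad \epsilon_i \sim N(0, \tau^{-1})$$
This means INLA will return marginal distributions for $\beta_0$ and $\beta_1$, integrated over the hyperparameter $\tau$ (the unknown noise precision).
A nonlinear regression
$$y_i = f(x_i) + \epsilon_i$$
can be done using: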
res <- inla(y ~ f(x, model = "rw2", diagonal = 1e-5), data = df, family = "gaussian")
- model="rw2" specifies a 2nd-order random walk model, which is to all intents and purposes similar to a spline penalty (i.e., the estimated $f$ will be smooth).
- diagonal=1e-5 makes the prior proper by adding a small diagonal component to the precision matrix of the Gaussian prior.
The diagonal term is not always necessary but stabilises INLA, which sometimes won’t work when two values of $x$ are too close together. This is due to an ill-conditioned prior precision matrix (INLA is great but still has some rough edges, and it helps to know the theory when trying to figure out why something is not working).
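Why the diagonal term makes the prior proper can be checked numerically in base R. The sketch below assumes a plain second-difference construction of the random-walk precision matrix on a regular grid (a simplification of what INLA does) and looks at its eigenvalues with and without the added diagonal.

```r
n <- 20
D <- diff(diag(n), differences = 2)   # second-difference operator
Q_rw2 <- crossprod(D)                 # improper RW2-type precision matrix

# Constants and straight lines cost nothing, so two eigenvalues are zero
eig_improper <- eigen(Q_rw2, symmetric = TRUE, only.values = TRUE)$values
n_zero <- sum(abs(eig_improper) < 1e-8)

# Adding a small diagonal shifts every eigenvalue up by the same amount
Q_proper <- Q_rw2 + 1e-5 * diag(n)
eig_proper <- eigen(Q_proper, symmetric = TRUE, only.values = TRUE)$values

cat("zero eigenvalues without diagonal:", n_zero, "\n")
cat("smallest eigenvalue with diagonal:", min(eig_proper), "\n")
```

With the diagonal term every eigenvalue is strictly positive, so the Gaussian prior has a well-defined density, although the matrix remains poorly conditioned when the added term is tiny.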
Every term that comes enclosed in an f() in the formula, INLA calls a random effect. Everything else is called a fixed effect. The different priors are called models. This clashes a bit with traditional terminology, but one gets used to it.
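To see where the “fixed” and “random” effects end up in the output, here is a minimal sketch on simulated data. The sine-wave data are made up for illustration, and the INLA call is guarded so that the script still runs when the package is not installed. The output fields used (summary.fixed, summary.random, summary.hyperpar) are the standard summaries of an inla result, but check them against your installed version.

```r
set.seed(42)
n <- 100
df <- data.frame(x = 1:n)   # regular grid of covariate values
df$y <- sin(2 * pi * df$x / n) + rnorm(n, sd = 0.3)   # made-up noisy curve

if (requireNamespace("INLA", quietly = TRUE)) {
  library(INLA)
  res <- inla(y ~ f(x, model = "rw2", diagonal = 1e-5),
              data = df, family = "gaussian")
  print(res$summary.fixed)           # the intercept, a "fixed effect"
  print(head(res$summary.random$x))  # posterior summaries of f(x), a "random effect"
  print(res$summary.hyperpar)        # precisions of the noise and of the rw2 model
}
```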
In the case of the Canadian temperature data, I found after some tweaking that the formula below works well:
temp~f(day,model="rw2",cyclic=T,diagonal=.0001) +f(day.region,model="rw2",replicate=region.ind,cyclic=T,diagonal=.01) +f(day.place,model="rw2",cyclic=T,diagonal=.01,replicate=place.ind) +region
- temp is the temperature
- day is the time index (1 to 365). Several functions will depend on the time index, which INLA doesn’t like, so we duplicate it artificially (day.region, day.place).
- compared to Ramsay & Silverman’s specification, the location-specific term $\epsilon_{ij}(t)$ is decomposed into a smooth component (the third term in the formula) and measurement noise (because we use a Gaussian likelihood)
- the regional and place effects are “replicated”, which means, in the case of the regional effects, that INLA will consider that the four regional functions $\alpha_j(t)$ have the same hyperparameters (here this means the same level of smoothness).
- the “region” factor comes in as a linear effect – effectively this decomposes the regional effect into a constant shift and a smooth part.
- We set cyclic=T because the functions we are trying to infer are periodic (over the year).
- Roughly speaking, the diagonal component imposes a penalty on the squared norm $\|f\|^2$ of each function (how far it is from 0). We want the global component to be larger than the regional ones, so it gets a smaller diagonal penalty.
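The variables used in the formula can be prepared along the following lines. This is only a sketch: the long-format layout, the place names and the random “temperatures” are placeholders rather than the Canadian data, and the column names are simply those the formula expects.

```r
# Placeholder data in long format: one row per (place, day).
# Place names, region assignments and temperatures are made up.
set.seed(1)
places <- data.frame(
  place  = c("place_1", "place_2", "place_3", "place_4", "place_5"),
  region = c("Atlantic", "Atlantic", "Pacific", "Continental", "Arctic"),
  stringsAsFactors = FALSE
)
dat <- merge(
  expand.grid(day = 1:365, place = places$place, stringsAsFactors = FALSE),
  places, by = "place"
)
dat$temp <- rnorm(nrow(dat))   # stand-in for the observed temperatures

dat$region <- factor(dat$region)
dat$day.region <- dat$day      # copies of the time index, one per f() term
dat$day.place  <- dat$day
dat$region.ind <- as.integer(dat$region)          # which replicate (region)
dat$place.ind  <- as.integer(factor(dat$place))   # which replicate (place)

fanova_formula <- temp ~
  f(day, model = "rw2", cyclic = TRUE, diagonal = 1e-4) +
  f(day.region, model = "rw2", replicate = region.ind, cyclic = TRUE, diagonal = 0.01) +
  f(day.place, model = "rw2", replicate = place.ind, cyclic = TRUE, diagonal = 0.01) +
  region

# With INLA installed, the fit would then be something like:
# res <- INLA::inla(fanova_formula, data = dat, family = "gaussian")
```

Building the formula does not evaluate the f() terms, so the data-preparation part can be checked without INLA.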
The data and formula can now be fed into INLA. As a first check, we can plot the “fitted” model:
The smooth lines are the posterior expected value of the linear predictor. We do not seem to be doing anything terribly wrong so far.
We can also plot the estimated regional effects (again, we plot posterior expected values):
Compared to the effects estimated by Ramsay & Silverman, we have a lot more wiggly high-frequency stuff going on. Closer examination shows that the wiggly bits really do seem to be in the data, and are not just made up by the procedure. Since I’m no meteorologist I have no idea what causes them. Smoothing our curves (using MGCV) recovers essentially the curves R&S had originally inferred (INLA first, R&S second):
What about the global, seasonal component? Here’s another plot modelled on one by R&S, showing the global component (dashed gray line), and the sum of the global component and each regional effect (coloured lines).
Here’s R&S’s original plot:
The code to reproduce the examples is given below. You’ll need to download INLA from r-inla.org, it’s not on CRAN.