Spatial predictions with GAMs and rasters

[This article was first published on Bluecology blog, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Spatial predictions with GAMs and rasters

One powerful use of GAMs is for interpolating to unsampled locations. We
can combine GAMs with raster package to conveniently predict a GAM model
to places we have not got data.

Simulate some spatial data

We’ll simulate some spatial data based on rasters. There are two spatial
covariates, x1 and x2. We use these to simulate ‘true’ values of the
response, based on linear and polynomial relationships with x2 and x1
respectively.

library(raster)
library(dplyr)
library(ggplot2)
library(patchwork)

rbase <- raster(extent(c(0,1,0,1)), nrow = 50, ncol = 50)
rx1 <- rx2 <- rbase
rx1[] <- xFromCell(rbase, 1:ncell(rbase))
rx2[] <- yFromCell(rbase, 1:ncell(rbase))
rtrue <- 6*rx1  - 7*rx1^2- 4*rx2
par(mfrow = c(2,2))
plot(rx1, col = RColorBrewer::brewer.pal(11, "RdBu"), main = "x1")
plot(rx2, col = RColorBrewer::brewer.pal(11, "RdBu"), main = "x2")
plot(rtrue, col = RColorBrewer::brewer.pal(11, "RdBu"), main = "True values")

Extract data at study sites

Here we create some random study site locations, then extract the values
of the covariates from the raster at the study site locations.

set.seed(42)
site_means <- data.frame(x = runif(50), y = runif(50)) %>%
  mutate(site = 1:50,
         x1 = extract(rx1, cbind(x,y)),
         x2 = extract(rx2, cbind(x,y)),
         eta = extract(rtrue, cbind(x,y)),
         z = rnorm(50, sd = 1),
         yhat = eta + z)

Now we will assume 3 transects at each site with a a group random effect
(so transects at each site share a common deviation from the ‘true
effect’). There is also individual transect level variation.

dat <- inner_join(site_means,
                  expand.grid(site = 1:50, transect = 1:3)) %>%
  mutate(b = rnorm(150, mean = yhat, sd = 0.5))

## Joining, by = "site"

g1 <- ggplot(dat) +
aes(x = eta, y = b) +
geom_point()

g2 <- ggplot(dat) +
  aes(x = yhat, y = b) +
  geom_point()

g3 <- ggplot(site_means) +
aes(x = x, y = y, color = yhat) +
geom_point()

g3 | g1 /g2


The plot shows site means (colours) for location, then the relationship
between the ‘true’ values (eta) and the response (‘b’) and the site
means (yhat) and the response.

Fit a GAM

Now we fit a GAM, with smoothers on covariates x1 and x2 and a random
effect (bs = "re") for the sites.

library(visreg)
library(mgcv)

dat$sitef <- factor(dat$site)
m1 <- gam(b ~ s(x1) + s(x2) + s(sitef, bs = 're'),
          data = dat)
visreg(m1)

The GAM detects some non-linearity with x1 and a linear relationship
with x2.

Check the summary:

summary(m1)

##
## Family: gaussian
## Link function: identity
##
## Formula:
## b ~ s(x1) + s(x2) + s(sitef, bs = "re")
##
## Parametric coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -1.2716     0.1322  -9.619 4.83e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
##             edf Ref.df      F  p-value    
## s(x1)     3.424  3.494  7.982 5.76e-05 ***
## s(x2)     1.000  1.000 49.915 1.01e-10 ***
## s(sitef) 40.718 47.000  9.500  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) =  0.901   Deviance explained = 93.1%
## GCV = 0.32768  Scale est. = 0.22688   n = 150

The edf on s(x2) is exactly 1, confirming the GAM has just fit a linear
relationship (correctly) for that covariate. The edf on s(x1) is 3.4,
indicating it has probably overfit that covariate a little (for a 2nd
order polynomial, we’d expect an edf of 2).

Don’t get too excited about the high deviance explained, because that
calculation includes the deviance explained by the random effect
smoother.

We can see how much variation ended up in the random effect with
gam.vcomp (which counter to its name gives variance components as
standard deviations):

gam.vcomp(m1)

##        s(x1)        s(x2)     s(sitef)
## 1.335173e+01 1.685585e-04 8.934146e-01

So the SD for sites was about 0.89, which is close to the true value of
1.

Predict everywhere

We can use our GAM to predict across the entire raster. We need to stack
the rasters together and name each layer of the stack so that it
corresponds to the covariate names in the call to gam().

One trick here is that we want to predict to the ‘average’ site, not to
individual sites, so we need to take out the random site effect. We can
do this by creating a raster of zeros and calling it ‘sitef’. Then the
predictions will return values for the average site (given of course x1
and x2 values).

rzeros <- rbase
rzeros[] <- 0
rstack <- stack(rx1, rx2, rzeros)
names(rstack) <- c("x1", "x2", "sitef")
rpred <- predict(rstack, m1)

## Warning in predict.gam(model, blockvals, ...): factor levels 0 not in original
## fit

This returns a warning, but don’t worry, it was designed to work this
way!

We used a raster stack to form predictions, so the predictions will come
out as a raster, so we can just plot them as rasters:

par(mfrow = c(2,2))
plot(rpred, col = RColorBrewer::brewer.pal(11, "RdBu"), main = "Predictions")
plot(rtrue, col = RColorBrewer::brewer.pal(11, "RdBu"), main = "True values")
plot(rpred - rtrue, col = RColorBrewer::brewer.pal(11, "RdBu"),
     main = "Prediction - true")

Overall our model has captured the main spatial gradients, but is
tending to under-predict the region in the middle (in red) and southern
portion of the space and slightly over-predict in the top corners.

Posterior predictions

This is just bonus content, because I find it interesting that you can
get Bayesian posterior distributions from a fitted GAM.

This example is straight out of `?predict.gam. These will be
conditional (not marginalized) on the site random effect.

newdat <- with(dat,
               data.frame(x1 = seq(min(x1), max(x1),
                                   by = 0.01),
               x2 = mean(x2),
               sitef = 0)) #again set sitef to zero
            # to ignore site deviations
m1p <- predict(m1, newdata = newdat, type = "lpmatrix")

## Warning in predict.gam(m1, newdata = newdat, type = "lpmatrix"): factor levels 0
## not in original fit

Now we have a predictions matrix, sample from it:

rmvn <- function(n,mu,sig) { ## MVN random deviates
  L <- mroot(sig);m <- ncol(L);
  t(mu + L%*%matrix(rnorm(m*n),m,n))
}

br <- rmvn(1000,coef(m1),m1$Vp)
m1post <- matrix(NA, nrow = nrow(newdat),
                 ncol = 1000)
for (i in 1:1000){
  m1post[,i] <- m1p %*% br[i,]
}
CIdat <- t(apply(m1post, 1, quantile, probs = c(0.025, 0.5, 0.975))) %>%
  data.frame(., newdat)

g1 <- ggplot(CIdat) +
  aes(x = x1, y = X50.) +
  geom_point() +
  geom_ribbon(aes(ymin = X2.5., ymax = X97.5.),
              color = NA, alpha = 0.5) +
  ylab("Predictions") +
  ggtitle("Predcitions with 95% C.I.")

g2 <- ggplot(data.frame(b = m1post[1,])) +
  aes(x = b) +
  geom_density(fill = "Purple") +
  ggtitle("Posterior density at x = 0.01")

g1 / g2

To leave a comment for the author, please follow the link and comment on their blog: Bluecology blog.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)