
Finding prediction intervals (for future observations) is different from finding confidence intervals (for unknown population parameters). Here, I demonstrate one simulation-based approach to doing so for GAMs fitted with mgcv.
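To make the distinction concrete, recall that for an ordinary linear model, predict() offers both kinds of intervals directly (a toy illustration on the built-in cars data; gam() has no such interval argument, which is what motivates the simulation approach below):

toy_fit <- lm(dist ~ speed, data = cars)
toy_new <- data.frame(speed = 15)
predict(toy_fit, newdata = toy_new, interval = "confidence") # interval for the mean
predict(toy_fit, newdata = toy_new, interval = "prediction") # interval for a new observation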

We now load the mgcv library and simulate some data:

library(mgcv)
set.seed(1)
dat <- gamSim(eg = 1, n = 400, dist = "normal", scale = 2)
## Gu & Wahba 4 term additive model

The simulated data in dat contains the “truth” in the f variables:

str(dat)
## 'data.frame':    400 obs. of  10 variables:
##  $ y : num  3.3407 -0.0758 10.6832 8.7291 14.9911 ...
##  $ x0: num  0.266 0.372 0.573 0.908 0.202 ...
##  $ x1: num  0.659 0.185 0.954 0.898 0.944 ...
##  $ x2: num  0.8587 0.0344 0.971 0.7451 0.2733 ...
##  $ x3: num  0.367 0.741 0.934 0.673 0.701 ...
##  $ f : num  5.51 3.58 8.69 8.75 16.19 ...
##  $ f0: num  1.481 1.841 1.948 0.569 1.184 ...
##  $ f1: num  3.74 1.45 6.74 6.02 6.6 ...
##  $ f2: num  2.98e-01 2.88e-01 8.61e-05 2.16 8.40 ...
##  $ f3: num  0 0 0 0 0 0 0 0 0 0 ...
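A quick sanity check (a sketch; it assumes gamSim constructs f as the sum of the component functions, which the perfect lm fit below also suggests):

# f should be the sum of the component functions (f3 is identically zero):
all.equal(dat$f, dat$f0 + dat$f1 + dat$f2 + dat$f3)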
fit_lm <- lm(f ~ f0 + f1 + f2 + f3, data = dat)
plot(dat$f, predict(fit_lm))
abline(0, 1)

The data used to demonstrate GAMs are available in the y and x variables:

fit_gam <- gam(y ~ s(x0) + s(x1) + s(x2) + s(x3), data = dat)

The idea is to simulate parameter vectors, and using such a simulated parameter vector we can draw observations. Doing this many times can be used to generate a prediction interval. We first extract the parameter estimates and their covariance matrix:

beta <- coef(fit_gam)
V <- vcov(fit_gam)

We then simulate a number of parameter vectors. This is done by assuming approximate normality and using the Cholesky trick: $\hat{\beta} + L \nu \quad \dot\sim \quad \text{MultivariateNormal}(\hat{\beta}, V),$ where $$V = L L^\top$$ is the Cholesky factorisation of the covariance matrix $$V$$ and $$\nu \sim N(0, I)$$ is a vector of independent standard normals.

num_beta_vecs <- 10000
Cv <- chol(V) # upper triangular with t(Cv) %*% Cv == V, so t(Cv) plays the role of L
set.seed(1)
nus <- rnorm(num_beta_vecs * length(beta))
beta_sims <- beta + t(Cv) %*% matrix(nus, nrow = length(beta), ncol = num_beta_vecs)
dim(beta_sims)
## [1]    37 10000

The simulated standard deviations should be similar to the calculated standard errors of the model fit:

d_beta <- cbind(summary(fit_gam)$se, apply(beta_sims, 1, sd))
head(d_beta)
##                  [,1]      [,2]
## (Intercept) 0.1036817 0.1039225
## s(x0).1     0.3977579 0.3988932
## s(x0).2     0.7249847 0.7212706
## s(x0).3     0.1967577 0.1982036
## s(x0).4     0.4260063 0.4265382
## s(x0).5     0.1450504 0.1444811
plot(d_beta[, 1], d_beta[, 2],
     xlab = "Calculated SE",
     ylab = "Simulated SE")
abline(0, 1)
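As a further check (a sketch, not in the original workflow), the empirical covariance of the simulated coefficient vectors should approximate V; recent versions of mgcv also ship rmvn(), which could be used instead of the manual Cholesky construction:

# The simulated coefficients should have covariance close to V:
emp_V <- cov(t(beta_sims))
max(abs(emp_V - V)) # should be small relative to the entries of V

# Alternative (assuming a recent mgcv with rmvn() available):
# beta_sims2 <- t(rmvn(num_beta_vecs, beta, V))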

We can now proceed to simulating observations. We do that by randomly sampling covariate values from the observed data, with replacement:

n_obs <- 100
sim_idx <- sample.int(nrow(dat), size = n_obs, replace = TRUE)
sim_dat <- dat[sim_idx, c("x0", "x1", "x2", "x3")]
dim(sim_dat)
## [1] 100   4
# sim_dat <- dat[, c("x0", "x1", "x2", "x3")]
# dim(sim_dat)

Instead of sampling, it is also possible to use e.g. a grid of values:

if (FALSE) { # Intentionally not run
  xs_range <- apply(dat[, c("x0", "x1", "x2", "x3")], 2, range)
  xs_seq <- apply(xs_range, 2, function(x) seq(from = x[1], to = x[2], length.out = 10))
  xs_seq_lst <- as.list(as.data.frame(xs_seq))
  sim_dat <- do.call(expand.grid, list(xs_seq_lst, KEEP.OUT.ATTRS = FALSE))
  dim(sim_dat)
}

Now we can calculate the linear predictors:

covar_sim <- predict(fit_gam, newdata = sim_dat, type = "lpmatrix")
linpred_sim <- covar_sim %*% beta_sims
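Here type = "lpmatrix" returns the matrix that maps coefficients to the linear predictor, so column $$j$$ of linpred_sim is the linear predictor under simulated parameter vector $$j$$. As a quick sanity check (a sketch, not in the original post), multiplying the lpmatrix by the fitted coefficients should reproduce predict() on the link scale:

# lpmatrix %*% coef equals predict() on the link scale:
all.equal(as.numeric(covar_sim %*% beta),
          as.numeric(predict(fit_gam, newdata = sim_dat))) # should be TRUE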

Now, the inverse link function must be applied to obtain the expected values. If e.g. family = Gamma(link = log) was used in gam(), then exp() must be applied. We used the Gaussian family with identity link, so the inverse link is the identity:

invlink <- function(x) x
exp_val_sim <- invlink(linpred_sim)

Now, the family itself must be used to draw observations around these expected values. If Gamma was used, then rgamma() (properly parameterised) must be used; a sketch of that case is given at the end of this post. We just used Gaussian, so:

y_sim <- matrix(rnorm(n = prod(dim(exp_val_sim)),
                      mean = exp_val_sim,
                      # summary(fit_gam)$scale is the estimated residual
                      # variance, so take the square root to get the sd:
                      sd = sqrt(summary(fit_gam)$scale)),
                nrow = nrow(exp_val_sim),
                ncol = ncol(exp_val_sim))
dim(y_sim)
## [1]   100 10000

So now entry $$(i, j)$$ of y_sim contains the $$i$$'th simulated observation's simulated response under parameter vector $$j$$ (column $$j$$ of beta_sims). We can then find a 95% prediction interval:

pred_int_sim <- apply(y_sim, 1, quantile, prob = c(0.025, 0.975))
dim(pred_int_sim)
## [1]   2 100

And e.g. plot it against x0:

plot(y ~ x0, data = dat)
sim_dat_x0ord <- order(sim_dat$x0)
lines(sim_dat$x0[sim_dat_x0ord], pred_int_sim[1L, sim_dat_x0ord], col = 2, lwd = 2)
lines(sim_dat$x0[sim_dat_x0ord], pred_int_sim[2L, sim_dat_x0ord], col = 2, lwd = 2)
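A rough empirical check (a sketch, not in the original post; it assumes sim_dat was built by the sampling above, not the grid): since the rows of sim_dat were sampled from dat, we can compute how often the corresponding observed responses fall inside their 95% prediction intervals. The fraction should be close to 0.95:

# Empirical coverage of the 95% prediction intervals for the sampled rows:
y_obs <- dat$y[sim_idx]
mean(y_obs >= pred_int_sim[1L, ] & y_obs <= pred_int_sim[2L, ])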

Confidence intervals (for the expected value) can be added for comparison:

plot(y ~ x0, data = dat)

sim_dat_x0ord <- order(sim_dat$x0)
lines(sim_dat$x0[sim_dat_x0ord], pred_int_sim[1L, sim_dat_x0ord], col = 2, lwd = 2)
lines(sim_dat$x0[sim_dat_x0ord], pred_int_sim[2L, sim_dat_x0ord], col = 2, lwd = 2)

sim_dat_pred <- predict(fit_gam, newdata = sim_dat, se = TRUE)
lines(sim_dat$x0[sim_dat_x0ord], sim_dat_pred$fit[sim_dat_x0ord], col = 1, lwd = 2)
upr_ci <- invlink(sim_dat_pred$fit + 2 * sim_dat_pred$se.fit)
lwr_ci <- invlink(sim_dat_pred$fit - 2 * sim_dat_pred$se.fit)
lines(sim_dat$x0[sim_dat_x0ord], upr_ci[sim_dat_x0ord], col = 3, lwd = 2)
lines(sim_dat$x0[sim_dat_x0ord], lwr_ci[sim_dat_x0ord], col = 3, lwd = 2)
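Finally, the Gamma(link = log) case mentioned earlier would only change the two family-specific steps. A sketch, assuming fit_gam had been fitted with family = Gamma(link = log) (not run here):

## Not run: Gamma(link = log) variant of the family-specific steps.
invlink <- exp                      # inverse of the log link
exp_val_sim <- invlink(linpred_sim) # simulated expected values (means)
phi <- summary(fit_gam)$scale       # estimated dispersion parameter
# Parameterise rgamma() so each draw has mean mu and variance phi * mu^2:
y_sim <- matrix(rgamma(n = prod(dim(exp_val_sim)),
                       shape = 1 / phi,
                       rate = 1 / (phi * exp_val_sim)),
                nrow = nrow(exp_val_sim),
                ncol = ncol(exp_val_sim))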
