[This article was first published on Yet Another Blog in Statistical Computing » S+/R, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Different from the confidence interval that is to address the uncertainty related to the conditional mean, the prediction interval is to accommodate the additional uncertainty associated with prediction errors. As a result, the prediction interval is always wider than the confidence interval in a regression model. In the context of risk modeling, the prediction interval is often used to address the potential model risk due to aforementioned uncertainties.

While calculating prediction interval of OLS regression based on the Gaussian distributional assumption is relatively straightforward with the off-shelf solution in R, it could be more complicated in a Generalized Linear Model, e.g. Poisson regression. In this post, I am going to show two empirical methods, one based on bootstrapping and the other based on simulation, calculating the prediction interval of a Poisson regression. Because of the high computing cost, the parallelism with foreach() function will be used to improve the efficiency.

First of all, let’s estimate a Poisson regression with glm() and generate a couple fake new data points to calculate model predictions. Since the toy data is very small with only 32 records with all categorical predictors, I doubled the sample size by rbind() to ensure the appropriate data coverage in the bootstrapping.

pkgs <- c('doParallel', 'foreach')
lapply(pkgs, require, character.only = T)
registerDoParallel(cores = 4)

data(AutoCollision, package = "insuranceData")
df <- rbind(AutoCollision, AutoCollision)
mdl <- glm(Claim_Count ~ Age + Vehicle_Use, data = df, family = poisson(link = "log"))
new_fake <- df[1:5, 1:2]


The first method shown below is based on the bootstrapping with following steps:

1. Bootstrapped the original model development sample by the random sample with replacements;

2. Repeated the above many times, e.g. 1000, to generate different bootstrapped samples;

3. Refitted models with bootstrapped samples;

4. Generated predictions with new data points, e.g. “new_fake”, but with refitted models;

5. Generated random numbers based on Poisson distribution with the mean, e.g. lambda, equal to the predicted values from refitted models

6. Collected all Poisson random numbers from the previous step and calculated the percentiles.

boot_pi <- function(model, pdata, n, p) {
odata <- model$data lp <- (1 - p) / 2 up <- 1 - lp set.seed(2016) seeds <- round(runif(n, 1, 1000), 0) boot_y <- foreach(i = 1:n, .combine = rbind) %dopar% { set.seed(seeds[i]) bdata <- odata[sample(seq(nrow(odata)), size = nrow(odata), replace = TRUE), ] bpred <- predict(update(model, data = bdata), type = "response", newdata = pdata) rpois(length(bpred), lambda = bpred) } boot_ci <- t(apply(boot_y, 2, quantile, c(lp, up))) return(data.frame(pred = predict(model, newdata = pdata, type = "response"), lower = boot_ci[, 1], upper = boot_ci[, 2])) } boot_pi(mdl, new_fake, 1000, 0.95) # pred lower upper #1 12.63040 6 21 #2 38.69738 25 55 #3 26.97271 16 39 #4 10.69951 4 18 #5 52.50839 35 70  The second method is based on the simulation and outlined as below: 1. Re-produced the model response variable, e.g. Claim_Count, by simulating Poisson random numbers with lambda equal to predicted values from the original model; 2. Repeated the above simulations many times, e.g. 1000, to generate many response series; 3. Generated 1000 updated model samples by replacing the original response with the new response generated from simulations; 4. Refitted models with these updated samples 5. Generated predictions with new data points, e.g. “new_fake”, but with refitted models; 6. Generated Poisson random numbers with lambda equal to the predicted values from refitted models 7. Collected all Poisson random numbers from the previous step and calculated the percentiles. sim_pi <- function(model, pdata, n, p) { odata <- model$data
yhat <- predict(model, type = "response")
lp <- (1 - p) / 2
up <- 1 - lp
set.seed(2016)
seeds <- round(runif(n, 1, 1000), 0)
sim_y <- foreach(i = 1:n, .combine = rbind) %dopar% {
set.seed(seeds[i])
sim_y <- rpois(length(yhat), lambda = yhat)
sdata <- data.frame(y = sim_y, odata[names(model\$x)])
refit <- glm(y ~ ., data = sdata, family = poisson)
bpred <- predict(refit, type = "response", newdata = pdata)
rpois(length(bpred),lambda = bpred)
}
sim_ci <- t(apply(sim_y, 2, quantile, c(lp, up)))
return(data.frame(pred = predict(model, newdata = pdata, type = "response"), lower = sim_ci[, 1], upper = sim_ci[, 2]))
}

sim_pi(mdl, new_fake, 1000, 0.95)
#      pred lower upper
#1 12.63040     6    21
#2 38.69738    26    52
#3 26.97271    17    39
#4 10.69951     4    18
#5 52.50839    38    68


As demonstrated above, after a large number of replications, outcomes from both methods are highly consistent.  To leave a comment for the author, please follow the link and comment on their blog: Yet Another Blog in Statistical Computing » S+/R.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

# Never miss an update! Subscribe to R-bloggers to receive e-mails with the latest R posts.(You will not see this message again.)

Click here to close (This popup will not appear again)