Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

## Introduction

In this post, we study a seemingly easy question; what is a 95% prediction interval in simple linear regression? We assume that our data comes from $y_i = 1 + 2x_i + \epsilon_i$ where $$\epsilon_i$$ are iid normal with mean 0 and standard deviation 3.

## R Code for Prediction

We can build a linear model and prediction intervals using the following R code:

set.seed(4172018)
x <- runif(30, 0, 10)
y <- 1 + 2 * x + rnorm(30, 0, 3)
mod <- lm(y~x)
predict(mod, newdata = data.frame(x = 5), interval = "pre")
##        fit      lwr      upr
## 1 12.03934 6.313055 17.76562

We see that we have an estimate of 12 (true value: $$1 + 2 \times 5$$), a lower prediction bound of 6.3 and an upper of 17.8. What percentage of future observations will fall in this interval?

sim_dat <- replicate(10000, {
new_point <- 1 + 2 * 5 + rnorm(1, 0, 3)
new_point > 6.313055 && new_point < 17.76562
})
mean(sim_dat)
## [1] 0.9281

We see that 9281 out of 10,000 future data points fall in this 95% prediction interval. Since we know the true way of generating data, we can compute the exact probability that future data will fall in this interval via:

pnorm(17.76562 , 11, 3) - pnorm(6.313055, 11, 3)
## [1] 0.9288329

What’s going on? How does constructing a 95% prediction interval lead to an interval that collects 92.88% of future data?

## Sampling Distribution

To understand this better, let’s estimate (via simulation) the sampling distribution of the probability that a prediction interval will contain future values.

sampling_dist <- replicate(1000, {
x <- runif(30, 0, 10)
y <- 1 + 2 * x + rnorm(30, 0, 3)
mod <- lm(y~x)
pre <- predict(mod, newdata = data.frame(x = 5), interval = "pre")
pnorm(pre[3], 11, 3) - pnorm(pre[2], 11, 3)
})
hist(sampling_dist)

min(sampling_dist) #Smallest percentage of future values in prediction interval
## [1] 0.7742589
mean(sampling_dist) #Expected percentage of future values in prediction interval
## [1] 0.9506352
mean(sampling_dist < 0.9) #percentage of times that prediction interval predicts less than 90% of future values
## [1] 0.084
density(sampling_dist)$x[which.max(density(sampling_dist)$y)] #Most likely percentage of future values contained in prediction interval
## [1] 0.9673389

The histogram of attained probabilities that a prediction interval will contain future data explains what is going on. The probabilities have expected value of 0.95, but the distribution is skew. The most likely outcome seems to be that the prediction interval contain about 97% of future data, but there is about a 8% chance that it will contain less than 90% of future data. The minimum amount of future data contained (in this sample size of 1000) is 77%!

## Sample Size

I have heard people say that the benefit of having more samples is that the prediction intervals get narrower. That is true, but also true is that a 95% prediction interval is also much more likely to contain about 95% of future values! Let’s see what happens when we have 300 and 3000 data points, rather than 30:

## Table of Standard Deviations

Finally, we estimate the standard deviation of the sampling distribution of the probability that a future value will be in a prediction interval for various sample sizes.

sample_sizes <- c(10, 20, 50, 100, 200, 500, 1000, 2000, 5000)
sds <- sapply(sample_sizes, function(n) {
sd(replicate(1000, {
x <- runif(n, 0, 10)
y <- 1 + 2 * x + rnorm(n, 0, 3)
mod <- lm(y~x)
pre <- predict(mod, newdata = data.frame(x = 5), interval = "pre")
pnorm(pre[3], 11, 3) - pnorm(pre[2], 11, 3)
}))
})
knitr::kable(data.frame(Sample Size = sample_sizes, Standard Dev = sds))
Sample.Size Standard.Dev
10 0.0705585
20 0.0386866
50 0.0242093
100 0.0162205
200 0.0117437
500 0.0071819
1000 0.0050827
2000 0.0037369
5000 0.0022930