# Prediction Bands

*Originally published on **LeaRning Stats** and kindly contributed to R-bloggers.*


## Introduction

In this post, we study a seemingly easy question: what is a 95% prediction interval in simple linear regression? We assume that our data come from \[ y_i = 1 + 2x_i + \epsilon_i \] where the \(\epsilon_i\) are iid normal with mean 0 and standard deviation 3.

## R Code for Prediction

We can build a linear model and prediction intervals using the following R code:

```r
set.seed(4172018)
x <- runif(30, 0, 10)
y <- 1 + 2 * x + rnorm(30, 0, 3)
mod <- lm(y ~ x)
predict(mod, newdata = data.frame(x = 5), interval = "prediction")
##        fit      lwr      upr
## 1 12.03934 6.313055 17.76562
```

We see that we have an estimate of 12 (true value: \(1 + 2 \times 5 = 11\)), a lower prediction bound of 6.3, and an upper bound of 17.8. What percentage of future observations will fall in this interval?
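For reference, the bounds that `predict()` returns here come from the standard textbook prediction interval for a new observation at \(x_0\) (this formula is not stated in the original post, but it is what `predict.lm` with `interval = "prediction"` computes):

\[ \hat{y}_0 \pm t_{n-2,\,0.975}\; s \sqrt{1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{\sum_{i=1}^n (x_i - \bar{x})^2}} \]

where \(\hat{y}_0\) is the fitted value at \(x_0\), \(s\) is the residual standard error, and \(t_{n-2,\,0.975}\) is the 97.5% quantile of the \(t\) distribution with \(n - 2\) degrees of freedom. Note that \(s\) and the slope and intercept are themselves estimates, which is what makes the interval's realized coverage random.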

```r
sim_dat <- replicate(10000, {
  new_point <- 1 + 2 * 5 + rnorm(1, 0, 3)
  new_point > 6.313055 && new_point < 17.76562
})
mean(sim_dat)
## [1] 0.9281
```

We see that 9281 out of 10,000 future data points fall in this 95% prediction interval. Since we know the true data-generating process, we can compute the exact probability that a future observation falls in this interval via:

```r
pnorm(17.76562, 11, 3) - pnorm(6.313055, 11, 3)
## [1] 0.9288329
```

What’s going on? How does constructing a 95% prediction interval lead to an interval that captures only 92.88% of future data?

## Sampling Distribution

To understand this better, let’s estimate (via simulation) the sampling distribution of the probability that a prediction interval will contain future values.

```r
sampling_dist <- replicate(1000, {
  x <- runif(30, 0, 10)
  y <- 1 + 2 * x + rnorm(30, 0, 3)
  mod <- lm(y ~ x)
  pre <- predict(mod, newdata = data.frame(x = 5), interval = "prediction")
  pnorm(pre[3], 11, 3) - pnorm(pre[2], 11, 3)
})
hist(sampling_dist)
```

```r
# Smallest percentage of future values in prediction interval
min(sampling_dist)
## [1] 0.7742589

# Expected percentage of future values in prediction interval
mean(sampling_dist)
## [1] 0.9506352

# Percentage of intervals that contain less than 90% of future values
mean(sampling_dist < 0.9)
## [1] 0.084

# Most likely percentage of future values contained in prediction interval
density(sampling_dist)$x[which.max(density(sampling_dist)$y)]
## [1] 0.9673389
```

The histogram of attained probabilities that a prediction interval will contain future data explains what is going on. The probabilities have an *expected value* of 0.95, but the distribution is skewed. The most likely outcome is that the prediction interval contains about 97% of future data, but there is about an 8% chance that it contains less than 90% of future data. The minimum coverage observed (in this sample of 1000 simulations) is 77%!

## Sample Size

I have heard people say that the benefit of having more samples is that the prediction intervals get narrower. That is true, but it is also true that a 95% prediction interval becomes much more likely to contain about 95% of future values! Let’s see what happens when we have 300 and 3000 data points, rather than 30:
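The code for this comparison did not survive extraction; a minimal sketch that re-runs the simulation above at the larger sample sizes (assuming the same model, the same prediction point \(x = 5\), and a helper function `coverage_dist` introduced here for convenience) might look like:

```r
# Sampling distribution of the coverage of a 95% prediction interval
# for a given sample size n
coverage_dist <- function(n, reps = 1000) {
  replicate(reps, {
    x <- runif(n, 0, 10)
    y <- 1 + 2 * x + rnorm(n, 0, 3)
    mod <- lm(y ~ x)
    pre <- predict(mod, newdata = data.frame(x = 5), interval = "prediction")
    # A future observation at x = 5 is Normal(11, 3)
    pnorm(pre[3], 11, 3) - pnorm(pre[2], 11, 3)
  })
}

par(mfrow = c(1, 2))
hist(coverage_dist(300), main = "n = 300", xlab = "Coverage")
hist(coverage_dist(3000), main = "n = 3000", xlab = "Coverage")
```

With larger samples the histograms concentrate tightly around 0.95, matching the shrinking standard deviations in the table below.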

## Table of Standard Deviations

Finally, we estimate the standard deviation of the sampling distribution of the probability that a future value will be in a prediction interval for various sample sizes.

```r
sample_sizes <- c(10, 20, 50, 100, 200, 500, 1000, 2000, 5000)
sds <- sapply(sample_sizes, function(n) {
  sd(replicate(1000, {
    x <- runif(n, 0, 10)
    y <- 1 + 2 * x + rnorm(n, 0, 3)
    mod <- lm(y ~ x)
    pre <- predict(mod, newdata = data.frame(x = 5), interval = "prediction")
    pnorm(pre[3], 11, 3) - pnorm(pre[2], 11, 3)
  }))
})
knitr::kable(data.frame(`Sample Size` = sample_sizes, `Standard Dev` = sds))
```

Sample Size | Standard Dev
---|---
10 | 0.0705585
20 | 0.0386866
50 | 0.0242093
100 | 0.0162205
200 | 0.0117437
500 | 0.0071819
1000 | 0.0050827
2000 | 0.0037369
5000 | 0.0022930
