
## What does under or over-dispersion look like?

One issue that often comes up in analysis with linear models is under- or over-dispersion.

For instance, if you are fitting a linear regression model, you are assuming the residuals (the differences between the line of best fit and the data points) are normally distributed. Residuals are said to be over-dispersed if their tails are ‘fatter’ than those of a normal bell curve; if instead they are too peaked in the middle, they are said to be under-dispersed.

Under- or over-dispersion is an issue because it can bias the calculation of p-values. Over-dispersion is often of particular concern because it can bias p-values downwards. If you take 0.05 as your significance threshold, over-dispersion means you would reject the null hypothesis more than 5% of the time when it is true. Thus, you may be more likely to falsely reject the null hypothesis.

If there is under-dispersion, the opposite is true: your p-values may be biased upwards. P-values that are biased high are also a problem, because they give you less power to detect real effects.

So, let’s take a look at under- and over-dispersion. First up, here is a normal bell curve:

x <- seq(-4, 4, length.out = 100)
p <- dnorm(x, mean = 0, sd = 1)       # standard normal density
plot(x, p, type = 'l', lwd = 2, col = "red",
     xlab = "Residual", ylab = "Density")


Now, we can add an over-dispersed curve to that. Here is one, calculated using the Student t distribution:

p_t <- dt(x, df = 1.4, ncp = 0)       # Student-t density with heavy tails

plot(x, p, type = 'l', lwd = 2, col = "red",
     xlab = "Residual", ylab = "Density")
lines(x, p_t, col = "darkblue", lwd = 2)


It wouldn’t matter how large we made the standard deviation of the normal curve; we would never get it to match the Student-t.
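We can check this numerically. Far enough into the tails, the t density with df = 1.4 exceeds the normal density whatever standard deviation we choose; the evaluation point x = 50 and the candidate standard deviations below are arbitrary choices for illustration:

```r
# normal tails decay like exp(-x^2), t tails only polynomially,
# so far out the t density wins for any normal standard deviation
sds <- c(1, 2, 5, 10)
dnorm(50, mean = 0, sd = sds)   # vanishingly small
dt(50, df = 1.4)                # larger than every value above
```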

Finally, let’s draw an under-dispersed distribution, using the Laplace distribution:

library(rmutil)
p_l <- dlaplace(x, m = 0, s = 0.4)    # Laplace density: narrow and peaked
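If you would rather avoid the rmutil dependency, the Laplace density is simple enough to compute directly from its formula, f(x) = exp(-|x − m| / s) / (2s). The helper name `dlaplace_base` here is my own:

```r
# Laplace density without rmutil: f(x) = exp(-|x - m| / s) / (2 * s)
dlaplace_base <- function(x, m = 0, s = 1) exp(-abs(x - m) / s) / (2 * s)

x <- seq(-4, 4, length.out = 100)        # same grid as above
p_l <- dlaplace_base(x, m = 0, s = 0.4)  # matches rmutil::dlaplace(x, m = 0, s = 0.4)
```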


Now plot them all together:

plot(x, p_l, xlim = c(-4, 4), lwd = 2, xlab = "Residuals", main = "", type = "l")
lines(x, p, col = "red", lwd = 2, lty = 2)
lines(x, p_t, col = "darkblue", lwd = 2)
legend("topright", legend = c("normal", "under-dispersed", "over-dispersed"),
       lty = c(2, 1, 1), col = c("red", "black", "darkblue"), lwd = 2)


Here’s what they look like if we sample values from each distribution and plot them on a normal QQ plot. The QQ plot is one of the standard checks of model residuals, so you may be familiar with it:

set.seed(1997)
par(mfrow = c(1,3))
qqnorm(rnorm(1000, mean = 0, sd = 1), main = "normal")
abline(0,1)
qqnorm(rt(1000, df = 1.4, ncp = 0), main = "over-dispersed")
abline(0,1)
qqnorm(rlaplace(1000, m = 0, s = 0.4), main = "under-dispersed")
abline(0,1)
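A rough numeric companion to the QQ plots is to compare sample standard deviations: the under-dispersed sample should have a smaller spread than the normal one, and the heavy-tailed t sample a much larger one. In this sketch I draw Laplace values as a random sign times an exponential variate (an equivalent construction that avoids rmutil); the sample size and seed are arbitrary:

```r
set.seed(1997)
n <- 10000
sd_norm <- sd(rnorm(n, mean = 0, sd = 1))
sd_t    <- sd(rt(n, df = 1.4))   # heavy tails inflate the spread

# Laplace(m = 0, s = 0.4) as sign * exponential; its true sd is 0.4 * sqrt(2), about 0.57
sd_laplace <- sd(sample(c(-1, 1), n, replace = TRUE) * rexp(n, rate = 1 / 0.4))

c(normal = sd_norm, t = sd_t, laplace = sd_laplace)
```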


Playing around with the parameters of each distribution should help your understanding. For instance, if you set the scale of the Laplace distribution to a larger number it will become over-dispersed, because it gains fatter tails than the normal (despite being more peaked at its mode).
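For example, with the scale raised to s = 1 the Laplace density is higher than the standard normal both at zero and out at x = 3. Here I use the Laplace density formula directly rather than rmutil’s dlaplace:

```r
# Laplace(m = 0, s = 1) vs standard normal: more peaked AND fatter-tailed
dlap <- function(x, s = 1) exp(-abs(x) / s) / (2 * s)

dlap(0); dnorm(0)   # Laplace is higher at the mode (0.5 vs ~0.399)
dlap(3); dnorm(3)   # and higher in the tail (~0.025 vs ~0.0044)
```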

Under-dispersion is more common than you might think. It can occur when there is a censoring process. For instance, perhaps your machine can only measure lengths to a certain precision, and any distance that is too small gets rounded down to zero.
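We can mimic that kind of censoring and watch the spread shrink. The detection limit of 0.5 below is an arbitrary choice for illustration:

```r
set.seed(1997)
true_lengths <- rnorm(10000, mean = 0, sd = 1)

# anything below the (hypothetical) detection limit gets recorded as zero
measured_lengths <- ifelse(abs(true_lengths) < 0.5, 0, true_lengths)

# the recorded values are less spread out than the true ones
sd(true_lengths)
sd(measured_lengths)
```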