Time to finally patch a hole in the leaky roof of my knowledge: what are Generalised Linear Models anyway?
< section id="groundwork-what-are-linear-models-anyway" class="level2">Groundwork: what are Linear Models anyway?
Generalised Linear Models (GLMs) are a short step from Linear Models, provided you have the right understanding of linear models. There’s also some stats jargon to get a handle on. So we’ll start with an intuitive explanation of linear models.
With a linear model, we have some quantity (the response variable) that we’re trying to predict from some other quantities (the predictors). The predictors are knowable things that you can measure with good precision, such as age and weight. The response variable is something with natural variation and/or measurement uncertainty, like maximum jump height. Different people of the same age and weight will have different maximum jumps. For every age and weight combination, we aim to predict the average jump height.
The natural variation usually (but not always) follows a normal distribution for given predictors. That is to say, we should expect a bell curve of jump heights for people who are 80 years of age and weigh 80kg. For people who are 18 years of age and weigh 60kg, we should expect a bell curve with a different (probably higher) mean but the same variance. Having the variance in jump height not change for different ages and weights is important; we’ll come back to that later.
What we do in a linear model is devise a weighted sum of our predictors (a linear combination, in the lingo) that best predicts the mean of the bell curve. Here’s the general form of it:

$$\mu = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n$$

Here $\mu$ is the mean of the response variable, the $x_i$ are the predictors, and the $\beta_i$ are the coefficients we need to estimate ($\beta_0$ being the intercept).
To determine those coefficients we use an algorithm like Ordinary Least Squares (OLS), which finds the coefficients that minimise the squared residuals, i.e. the squared distance between the observations and the predicted mean for any given predictors. We don’t need to know too much about how this works today, just that it’s efficient and useful.
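If you’d like to see that in action, here’s a quick sketch with some entirely made-up jump-height data (the numbers are invented purely for illustration): simulate a mean that depends linearly on age and weight, add constant-variance normal noise, and let R’s lm() recover the coefficients by OLS.

    set.seed(42)
    n      <- 200
    age    <- runif(n, 18, 80)
    weight <- runif(n, 50, 100)

    # Invented 'true' relationship: mean jump height falls with age and weight,
    # with the same residual standard deviation everywhere (constant variance)
    jump <- 90 - 0.5 * age - 0.3 * weight + rnorm(n, mean = 0, sd = 5)

    fit <- lm(jump ~ age + weight)
    coef(fit)  # estimates should land near the invented values 90, -0.5 and -0.3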
< section id="next-level-the-glm" class="level2">Next level: the GLM
With that under our belt, GLMs aren’t so scary. There are just two things to recognise.
First, a normal distribution isn’t always appropriate for the natural variation. What if rather than jump height, our response variable was something discrete, like number of pets? You can’t have a non-integer or negative number of pets. We’d want to swap the normal distribution for a discrete distribution from the exponential family (the class of distributions GLMs work with), most likely a Poisson distribution.
That leads directly to the second thing: the mean of a Poisson distribution must be positive, so we need to transform the result of that weighted sum in some way that maps it into the valid range of Poisson means. This transformation is called the link function. The form of this would be:

$$g(\mu) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n$$

where $g$ is the link function and $\mu$ is the mean of the response distribution, as before.
For different types of data, you’d choose a different type of distribution:
- Poisson for counts
- Binomial for binary probabilities
- Gamma for positive continuous data
- Normal for normally distributed data
Each of these has a “canonical” (standard) link function:
- Poisson: log
- Binomial: logit
- Gamma: inverse
- Normal: identity
You can choose another link function, but let’s leave that for another day.
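If you want to check those canonical links from R itself, the family constructor functions in base R report them, and swapping in a non-canonical link is just an argument (probit here is only an example, not something we’ll need today):

    poisson()$link                   # "log"
    binomial()$link                  # "logit"
    Gamma()$link                     # "inverse"
    gaussian()$link                  # "identity"

    binomial(link = "probit")$link   # a non-canonical choice: "probit"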
Remember that the linear model assumed the variance was constant? GLMs don’t need this, because for distributions other than the normal distribution, the variance is a function of the mean. The technical term for this property is heteroscedasticity (whereas constant variance is homoscedasticity).
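A tiny sketch of what that means in R: every family object also carries its variance function, written in terms of the mean.

    poisson()$variance(5)      # 5     -- variance equals the mean
    binomial()$variance(0.3)   # 0.21  -- mu * (1 - mu) for a single trial
    Gamma()$variance(5)        # 25    -- mu^2 (up to a dispersion parameter)
    gaussian()$variance(5)     # 1     -- constant, i.e. homoscedastic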
One more important consequence: because of the link function, the mean of the response variable no longer has to vary linearly with the predictors. In other words, we can model non-linear relationships, which is of course powerful.
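To make that concrete, here’s a small simulated sketch (made-up data again, nothing to do with the horses below): a Poisson GLM with a log link is a straight line on the link scale but a curve on the response scale.

    set.seed(1)
    x <- seq(0, 10, by = 0.5)
    y <- rpois(length(x), lambda = exp(0.2 + 0.25 * x))  # true mean grows exponentially with x

    fit <- glm(y ~ x, family = poisson)

    head(predict(fit, type = "link"))      # linear in x (this is the weighted sum)
    head(predict(fit, type = "response"))  # exp() of that, so a curve in x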
< section id="play-time" class="level2">Play time
Time to learn by playing. We can use a quirky dataset compiled by Russian statistician Ladislaus von Bortkiewicz in the late 1800s: deaths by horse kick per year for regiments in the Prussian army. In February 2025 Antony Unwin and Bill Venables updated the dataset with additional data and realised von Bortkiewicz’s original dream of publishing it to CRAN.
    library(tidyverse)
    library(knitr)
    library(Horsekicks)

    hk.tidy <- hkdeaths |>
      pivot_longer(
        c(kick, drown, fall),
        names_to = "death.type",
        values_to = "death.count"
      )

    hk.tidy |> head() |> kable()
year | corps | regiments | NCOs | vonB_kick | death.type | death.count |
---|---|---|---|---|---|---|
1875 | G | 8 | 0 | 0 | kick | 0 |
1875 | G | 8 | 0 | 0 | drown | 3 |
1875 | G | 8 | 0 | 0 | fall | 0 |
1875 | I | 6 | 0 | 0 | kick | 0 |
1875 | I | 6 | 0 | 0 | drown | 5 |
1875 | I | 6 | 0 | 0 | fall | 0 |
Unwin and Venables added deaths by drowning and deaths by falling off a horse, among other improvements. The data cover 14 corps, each with a fixed number of regiments, over 33 years. A regiment is about 500 soldiers, as far as I can tell. I have no idea where Prussia was or where it’s gone.
Let’s explore the data visually. Plotting deaths over time shows that equestrian misadventures stayed fairly stable, but deaths by drowning did decline. We will fit a GLM to these time series.
    prussian.colours <- c("#333333", "#F9BE11", "#BE0007")

    hk.year <- hk.tidy |>
      group_by(year, death.type) |>
      summarise(death.count = sum(death.count))

    ggplot(hk.year) +
      aes(x = year, y = death.count, group = death.type, colour = death.type) +
      geom_line() +
      scale_colour_manual(values = prussian.colours, name = "Death type") +
      labs(x = "Year", y = "Number of deaths") +
      theme_minimal()
Note that for each death type and each year we actually have a distribution across all regiments, not just a single number. We could choose to plot the distributions as a series of box plots.
    ggplot(hk.tidy) +
      aes(x = year, y = death.count, group = year, fill = death.type) +
      facet_wrap(~death.type, dir = "v") +
      geom_boxplot(alpha = 0.5) +
      scale_fill_manual(values = prussian.colours, name = "Death type") +
      theme_minimal() +
      labs(x = "Year", y = "Number of deaths")
Time for that model. Since you can’t literally be kicked half to death, at least not as far as Preussische Statistik was concerned, all the death statistics are integer counts. Therefore the obvious choice for the distribution is Poisson. We don’t have a reason to provide a different link function, so we just take the canonical link function for Poisson (log).
R makes this trivial to do. glm is a built-in function:
    glm(
      formula = drown ~ year,  # i.e. drownings are the response variable, year is the predictor
      family = poisson,
      data = hkdeaths
    )
    Call:  glm(formula = drown ~ year, family = poisson, data = hkdeaths)

    Coefficients:
    (Intercept)         year
       44.05820     -0.02256

    Degrees of Freedom: 461 Total (i.e. Null);  460 Residual
    Null Deviance:      854.3
    Residual Deviance: 766.9    AIC: 2170
The result is an intercept (which we don’t care much about) and a coefficient for the year of -0.022. Because of the log link, that coefficient acts multiplicatively: each year the expected number of drownings is multiplied by exp(-0.022) ≈ 0.978, so the model predicts drownings are decreasing by about 2.2% every year.
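If you want to double-check that 2.2% figure yourself, one way (assuming you store the fitted model first; fit.drown is just a name I’ve made up) is to exponentiate the coefficient:

    library(Horsekicks)

    fit.drown <- glm(drown ~ year, family = poisson, data = hkdeaths)

    # With a log link, coefficients act multiplicatively on the expected count
    exp(coef(fit.drown)["year"])  # roughly 0.978, i.e. about a 2.2% drop per year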
ggplot2 makes it even easier by allowing us to bung the above model into stat_smooth and run it for each death type.
    ggplot(hk.year) +
      aes(x = year, y = death.count, group = death.type, colour = death.type) +
      geom_point() +
      stat_smooth(
        method = "glm",
        formula = y ~ x,
        method.args = list(family = poisson)
      ) +
      scale_colour_manual(values = prussian.colours, name = "Death type") +
      theme_minimal() +
      labs(x = "Year", y = "Number of deaths")
Et voilà. That’s really all there was to it!