
For a long time, I’ve been curious about something. It is a truth I’ve seen casually dropped in textbooks, package documentation, and tweets: random effects and penalized smoothing splines are the same thing. It sounds so profound and enlightened. What does it mean? How are they the same? What deep statistical gnosis was I missing out on?

I have spent months, off and on, trying to understand this equivalence. I can’t give you the full mathematical treatment, but I have the gist of it and I can point you to the equations. In this post, I will try to highlight the connections between the two.

Here are the main takeaways:

• Mixed effects models use partial pooling to strike a balance between a grand mean (complete pooling) and individual means (no pooling).
• Smoothing splines work by penalizing model coefficients to reduce the model degrees of freedom.
• You can use the computational machinery of one framework to estimate the other.

## Mixed model review

Let’s review what these things mean. Mixed effects models, apparently the main focus of this blog over the years, are used to estimate “random” or “varying” effects. Here is the classic equation setup:

\begin{align*}
\mathbf{y} &= \mathbf{X\beta} + \mathbf{Zb} + \mathbf{\epsilon} \\
\mathbf{b} &\sim \textsf{Normal}(0, \sigma_b) \\
\mathbf{\epsilon} &\sim \textsf{Normal}(0, \sigma_y) \\
\mathbf{X} &: \textrm{fixed effects model matrix} \\
\mathbf{Z} &: \textrm{random effects model matrix} \\
\sigma_b, \sigma_y &: \textrm{variance components} \\
\sigma_b &: \textrm{where the magic happens}
\end{align*}

The magic here is the σb, as it ties all of the individual effects in b under a common distribution. If σb were replaced with a fixed number like 10, then all of the effects in b would be independent and unaware of each other: There would be no pooling of information between the groups.

Consider the radon dataset example from Gelman and Hill (2007). Radon measurements were taken in Minnesota counties. We would like to estimate the average radon measurement for each county. We have a repeated measures situation, and some counties have more observations than others. We use a mixed effects model to estimate a population distribution of county estimates, and the county-level estimates are randomly varying effects. They are drawn from a random distribution, the scale of which we estimate from the data.
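Following Gelman and Hill, the partial pooling idea can be written down compactly: a county's partially pooled estimate is approximately a precision-weighted average of its observed mean and the overall mean, where $n_j$ is the county's sample size. Small counties lean harder on the grand mean:

```latex
\begin{align*}
\hat{b}_j \approx
  \frac{\frac{n_j}{\sigma_y^2}\, \bar{y}_j + \frac{1}{\sigma_b^2}\, \bar{y}_{\textrm{all}}}
       {\frac{n_j}{\sigma_y^2} + \frac{1}{\sigma_b^2}}
\end{align*}
```

As $n_j$ grows, the estimate approaches the county's own mean; as $\sigma_b \rightarrow 0$, every county collapses onto the grand mean.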

library(tidyverse)
theme_set(theme_grey(base_size = 14))
library(brms)

b_radon <- brm(
  log_radon ~ 1 + (1 | county),
  data = radon,
  family = gaussian
)
b_radon
#>  Family: gaussian
#>   Links: mu = identity; sigma = identity
#> Formula: log_radon ~ 1 + (1 | county)
#>    Data: radon (Number of observations: 919)
#> Samples: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
#>          total post-warmup samples = 4000
#>
#> Group-Level Effects:
#> ~county (Number of levels: 85)
#>               Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
#> sd(Intercept)     0.30      0.05     0.22     0.40 1.00     1782     2894
#>
#> Population-Level Effects:
#>           Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
#> Intercept     1.35      0.05     1.26     1.45 1.00     2749     3198
#>
#> Family Specific Parameters:
#>       Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
#> sigma     0.77      0.02     0.73     0.80 1.00     7374     3105
#>
#> Samples were drawn using sampling(NUTS). For each parameter, Bulk_ESS
#> and Tail_ESS are effective sample size measures, and Rhat is the potential
#> scale reduction factor on split chains (at convergence, Rhat = 1).


Here sd(Intercept) corresponds to σb.

We can plot the observed county means alongside the model-estimated ones. First, I do some wrangling so that the difference between the observed means and the estimated means is computed for use later on.

radon_aug <- radon %>%
  group_by(county) %>%
  mutate(
    observed_mean = mean(log_radon),
    county_n = n()
  ) %>%
  ungroup() %>%
  # add posterior fitted values (.value) for each observation
  tidybayes::add_fitted_draws(b_radon) %>%
  mutate(
    observed_minus_model = observed_mean - .value
  ) %>%
  # summarize fitted values
  ggdist::median_qi(.value, observed_minus_model)

radon_aug$type <- "mixed model estimates"
radon$type <- "observed means"

ggplot(radon) +
aes(x = fct_infreq(county), y = log_radon) +
stat_summary(
aes(color = type, shape = type),
fun = mean,
geom = "point"
) +
geom_point(
aes(y = .value, color = type, shape = type),
data = radon_aug
) +
# want to include 0 in the figure
geom_blank(aes(y = 0)) +
labs(
x = "county (in decreasing order by sample size)",
) +
scale_color_manual(values = c("blue", "grey40")) +
labs(color = NULL, shape = NULL) +
theme(
axis.text.x = element_blank(),
axis.ticks.x = element_blank(),
panel.grid.major.x = element_blank(),
panel.grid.minor.x = element_blank(),
legend.title = element_blank(),
legend.position = "top",
legend.direction = "horizontal",
legend.justification = "left",
)

We see a classic example of partial pooling. For counties with many observations, the estimated mean is hardly adjusted. For counties with fewer observations, the estimate is pulled towards the population mean (Intercept in the summary above).

The following plot shows the difference between the observed means and the estimated means, subtracting the grey triangles from the blue squares in the plot above.

ggplot(radon_aug) +
aes(x = county_n, y = observed_minus_model) +
geom_point() +
labs(
x = "Number of observations in county",
y = "Observed mean - estimated mean"
)

The contention behind the smooths = random effects claim is that what we just did is a case of smoothing. These random effects are, in a way, smoothed fixed effects.
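To see that shrinkage outside of the radon example, here is a minimal, self-contained sketch using simulated data and the nlme package (which ships with R). The group sizes and effect values here are invented for illustration; the point is only that partially pooled group means land between the individual means and the population intercept.

```r
# A sketch of partial pooling with simulated, unbalanced groups.
library(nlme)

set.seed(1)
n_per_group <- c(2, 5, 20, 50)  # deliberately unbalanced group sizes
g <- factor(rep(seq_along(n_per_group), n_per_group))
y <- rnorm(sum(n_per_group), mean = rep(c(-1, -0.5, 0.5, 1), n_per_group), sd = 2)

no_pooling <- tapply(y, g, mean)  # individual group means
fit <- lme(y ~ 1, random = ~ 1 | g, data = data.frame(y, g))
partial <- fixef(fit) + ranef(fit)[["(Intercept)"]]  # partially pooled means

# Each partially pooled mean sits between its group mean and the
# population intercept; the smallest groups are pulled the hardest.
round(cbind(no_pooling, partial), 2)
```

The mixed model's group estimates never overshoot their observed means: they are always pulled toward the fixed-effect intercept, and more strongly for the groups with 2 or 5 observations than the one with 50.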

## But what’s smoothing?

Now let’s walk through a generalized additive model to demonstrate a penalized smoothing spline. That was a mouthful, but basically, additive models are like the smoothing expansion pack for the standard linear model. We’re still doing regression, but we have some new syntax, and our models can capture nonlinear relationships more easily.

I will walk through a basic example of how a spline’s basis functions are weighted to approximate a nonlinear trend, but this is not going to be a full tutorial. Other people have made video introductions to additive models or the mgcv package. I first learned them from a tutorial for linguists and then from the MGCV textbook, but there are other resources online.

We use the mcycle dataset which gives the head acceleration in a simulated motorcycle accident. We are going to fit a model, plot the smooth from it, and then we are going to work through what the model did.

library(mgcv)

mcycle <- MASS::mcycle %>%
tibble::rowid_to_column()

# Fit the model
gam_20 <- gam(
accel ~ 1 + s(times, bs = "cr", k = 20),
data = mcycle,
method = "REML"
)

mcycle$.fitted <- fitted(gam_20)

ggplot(mcycle) +
  aes(x = times, y = accel) +
  geom_point(alpha = .5) +
  geom_line(aes(y = .fitted), color = "blue") +
  labs(x = "time after impact [ms]", y = "acceleration [g]")

So what happened here? We will cover it visually.

### Splines are the sums of weighted wiggles

Let’s look at the regression formula.

formula(gam_20)
#> accel ~ 1 + s(times, bs = "cr", k = 20)

We told gam() to estimate accel using an intercept term and a smooth term on the time predictor (s(times, ...)). Specifically, we created our smooth using a cubic regression spline basis (bs = "cr") with k = 20 - 1 curves. Our model estimates a function by adding up smaller components called basis functions, and the space that defines those components is the basis. These basis functions are weighted and summed together to produce a smooth trend called a spline. The name spline is inspired by drafting splines, flexible strips of wood that can be weighted and anchored in place to make a nice curve.

To reiterate, conceptually, we are decomposing the times predictor into a bunch of individual wiggly lines (basis functions), and these are weighted and summed together to approximate some nonlinear function. My post on orthogonal polynomials illustrates the same principle but with polynomial basis functions. Richard McElreath provides a friendly 30-minute introduction to splines in a Bayesian model in his Statistical Rethinking course. One line I appreciate from his description is that with splines, we replace a predictor variable, like times, with a set of “synthetic” predictor variables.

An easy way to pull out the wiggles is to use the model matrix. We have 20 columns: one for the intercept and 19 for the basis functions.
model.matrix(gam_20) %>%
  tibble::trunc_mat(width = 72)
#> # Description: dbl[,20] [133 x 20]
#>    (Intercept) s(times).1 s(times).2 s(times).3 s(times).4
#>          <dbl>      <dbl>      <dbl>      <dbl>      <dbl>
#>  1           1    -0.143     -0.165   -0.154      -0.311
#>  2           1    -0.0317    -0.193   -0.133      -0.293
#>  3           1     0.292     -0.270   -0.0716     -0.242
#>  4           1     0.492     -0.309   -0.0357     -0.208
#>  5           1     0.669     -0.332   -0.00603    -0.175
#>  6           1     0.933      0.0792  -0.0455     -0.0280
#>  7           1     0.808      0.278   -0.0969     -0.0106
#>  8           1     0.730      0.383   -0.122      -0.00387
#>  9           1     0.286      0.851   -0.182       0.00169
#> 10           1     0.123      0.957   -0.143      -0.0130
#> # ... with 123 more rows, and 15 more variables: s(times).5 <dbl>,
#> #   s(times).6 <dbl>, s(times).7 <dbl>, s(times).8 <dbl>,
#> #   s(times).9 <dbl>, s(times).10 <dbl>, s(times).11 <dbl>,
#> #   s(times).12 <dbl>, s(times).13 <dbl>, s(times).14 <dbl>,
#> #   s(times).15 <dbl>, s(times).16 <dbl>, s(times).17 <dbl>,
#> #   s(times).18 <dbl>, s(times).19 <dbl>

To visualize the matrix, I am using a helper function from my personal R package for plotting matrices in ggplot2. What we see is times on the x axis and one line for the intercept and one for each of the basis functions.

# Helper function to plot the lines of a matrix
ggmatplot <- tjmisc::ggmatplot

# Helper function to label on a theme_grey() plot
annotate_grey <- tjmisc::annotate_label_grey

ggmatplot(cbind(mcycle$times, model.matrix(gam_20)), x_axis_column = 1) +
  annotate_grey("intercept", 0, 1.02, size = 5) +
  annotate_grey("individual\nbasis\nfunctions", 0, .16, size = 5) +
  expand_limits(y = 1.2) +
  labs(x = "time [ms]", title = NULL)

Now we can weight these by multiplying by the model coefficients. Here we use the diag(coef()) trick to prevent the weighted predictors from being summed together.

weighted_coefs <- model.matrix(gam_20) %*% diag(coef(gam_20))
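As a sanity check, the rows of this weighted matrix sum to the model's fitted values, since the spline is literally the sum of the weighted basis functions. (This snippet refits the model so it can run on its own.)

```r
# Summing the weighted basis functions row-wise recovers fitted(gam_20).
library(mgcv)

mcycle <- MASS::mcycle
gam_20 <- gam(
  accel ~ 1 + s(times, bs = "cr", k = 20),
  data = mcycle,
  method = "REML"
)

weighted_coefs <- model.matrix(gam_20) %*% diag(coef(gam_20))
all.equal(as.numeric(rowSums(weighted_coefs)), as.numeric(fitted(gam_20)))
#> [1] TRUE
```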

ggmatplot(cbind(mcycle$times, weighted_coefs), x_axis_column = 1) +
  annotate_grey("weighted intercept", 35, -40, size = 5) +
  annotate_grey("weighted basis functions", 0, 26, size = 5) +
  labs(x = "time [ms]", title = NULL)

We can see the two main inflection points in the dataset now. The basis functions around 20 ms and 30 ms become very active in order to push the spline away from 0 at those times.

If we sum the lines together, we get the regression line (the intercept plus the smoothing spline).

ggmatplot(
  cbind(mcycle$times, weighted_coefs),
  x_axis_column = 1,
  n_colors = 1
) +
  stat_summary(
    aes(group = 1),
    color = "maroon",
    fun = sum,
    geom = "line",
    size = 1.5
  ) +
  annotate_grey("sum", 10, -70, size = 5, color = "maroon") +
  labs(x = "time [ms]", title = NULL)

Our plots so far demonstrate regression with basis functions, but smoothing splines go one step further: They penalize wiggliness to prevent overfitting. The idea is as follows: For the above demonstration, we chose a 20-dimension spline basis (19 curves because 1 is removed for identifiability, apparently). But where did that number 20 come from? Thin air. What if we specified a dimension of 50? That’s 50 predictors (1 intercept and 49 basis functions). Isn’t it really easy to overfit the data with this approach?
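Formally, the penalty works by changing the objective the spline coefficients minimize. Instead of plain least squares, we add a wiggliness term, where S is a fixed penalty matrix built from the basis and λ controls the trade-off between fitting the data and staying smooth:

```latex
\begin{align*}
\hat{\mathbf{\beta}} =
  \underset{\mathbf{\beta}}{\arg\min} \;
  \lVert \mathbf{y} - \mathbf{X}\mathbf{\beta} \rVert^2 +
  \lambda\, \mathbf{\beta}^\intercal \mathbf{S} \mathbf{\beta}
\end{align*}
```

With λ = 0 we are back to ordinary regression on the basis functions; as λ grows, wiggly coefficient patterns get taxed more and more heavily.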

Well, let’s look at a 50-dimension version.

gam_50 <- gam(
accel ~ s(times, bs = "cr", k = 50),
data = mcycle,
method = "REML"
)

# Unpenalized (fixed degrees of freedom) versions with fx = TRUE
gam_20_fx <- gam(
  accel ~ s(times, bs = "cr", k = 20, fx = TRUE),
  data = mcycle,
  method = "REML"
)
gam_50_fx <- gam(
  accel ~ s(times, bs = "cr", k = 50, fx = TRUE),
  data = mcycle,
  method = "REML"
)

mcycle$.fitted20_fx <- fitted.values(gam_20_fx)
mcycle$.fitted50_fx <- fitted.values(gam_50_fx)

ggplot(mcycle) +
  aes(x = times, y = accel) +
  geom_point(alpha = .5) +
  geom_line(aes(y = .fitted20_fx, color = "20"), size = 1) +
  geom_line(aes(y = .fitted50_fx, color = "50"), size = 1) +
  labs(
    x = "time after impact [ms]",
    y = "acceleration [g]",
    color = "Dimension"
  )

Thus, when we disable the penalty (fx = TRUE requests an unpenalized smooth), the 50-dimension spline is free to wiggle all over the place.

### How much smoothing happened: Effective degrees of freedom

The smoothing parameter, λ, is a hyperparameter. It controls the spline coefficients (basis function weights), and it is estimated from the data. We can also set λ manually and crank it way, way up. In this case, the model tries to find the least wiggly curve that decreases modeling error: A straight line.

gam_20_sp <- gam(
  accel ~ s(times, bs = "cr", k = 20, sp = 10000000),
  data = mcycle,
  method = "REML"
)

mcycle$.fitted20_sp <- fitted.values(gam_20_sp)

ggplot(mcycle) +
aes(x = times, y = accel) +
geom_point() +
geom_line(aes(y = .fitted, color = "estimated"), size = 1) +
geom_line(aes(y = .fitted20_fx, color = "no smoothing"), size = 1) +
geom_line(aes(y = .fitted20_sp, color = "10,000,000"), size = 1) +
labs(
x = "time after impact [ms]",
y = "acceleration [g]",
color = "lambda"
)

We need some way to talk about how much smoothing took place. On the one hand, we might treat each basis function as an independent predictor that uses up a full degree of freedom in fitting the curve. On the other hand, we might penalize the basis function weights so much that they produce a straight line, so that the whole batch of predictors effectively acts like a single predictor variable would. That is, they are effectively estimating a curve that has just one degree of freedom’s worth of action in it. And indeed, this is how mgcv describes the smoothness of its models: It reports the effective (or estimated) degrees of freedom (EDFs) behind each smooth.
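For the curious, the EDF has a concrete definition (this is the standard penalized regression version, as in Wood's book on GAMs): it is the trace of the matrix that maps the observations onto the fitted values,

```latex
\begin{align*}
\textrm{EDF} = \textrm{tr}\!\left(
  \mathbf{X} \left(\mathbf{X}^\intercal\mathbf{X} + \lambda \mathbf{S}\right)^{-1} \mathbf{X}^\intercal
\right)
\end{align*}
```

With λ = 0 this trace equals the number of coefficients (one full degree of freedom each); as λ grows, it shrinks toward the straight-line case.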

If we look at the model summary, we see that our 20-dimension basis smooth has an EDF of 11.78 (see edf under Approximate significance of smooth terms).

summary(gam_20)
#>
#> Family: gaussian
#>
#> Formula:
#> accel ~ 1 + s(times, bs = "cr", k = 20)
#>
#> Parametric coefficients:
#>             Estimate Std. Error t value Pr(>|t|)
#> (Intercept)  -25.546      1.956  -13.06   <2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Approximate significance of smooth terms:
#>            edf Ref.df     F p-value
#> s(times) 11.78  13.97 33.93  <2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> R-sq.(adj) =  0.782   Deviance explained = 80.1%
#> -REML = 616.01  Scale est. = 509.01    n = 133


Similarly, our 50-dimension basis smooth uses only 12.81 effective degrees of freedom, but the unpenalized version uses all 49 of its basis function curves, a full 49 degrees of freedom.

gam_50
#>
#> Family: gaussian
#>
#> Formula:
#> accel ~ s(times, bs = "cr", k = 50)
#>
#> Estimated degrees of freedom:
#> 12.8  total = 13.81
#>
#> REML score: 616.068

gam_50_fx
#>
#> Family: gaussian
#>
#> Formula:
#> accel ~ s(times, bs = "cr", k = 50, fx = TRUE)
#>
#> Estimated degrees of freedom:
#> 49  total = 50
#>
#> REML score: 403.4616


## The big trick: Turn λ into a random effect variance in a mixed model

Okay, so far, here’s what we have:

• A spline decomposes a predictor into a number of wiggly basis functions.
• A penalized spline adds a penalty term to the model to reduce wiggliness.
• This penalty shrinks model coefficients so that the model uses fewer effective degrees of freedom.
• The amount of smoothing is controlled by a hyperparameter λ.

We saw another hyperparameter earlier on in this post whose job was to pull individual parameter estimates closer to 0: σb. Both of these hyperparameters are estimated from the data and perform shrinkage on a batch of related coefficients (random effects or basis function weights).

So, here’s the big thing… If you do a bunch of linear algebra (as in slide 7 here), you can express the smooth as a mixed model:

\begin{align*}
\mathbf{y} &= \mathbf{X}\mathbf{\beta} + \mathbf{Zb} + \mathbf{\epsilon} \\
\mathbf{b} &\sim \textsf{Normal}(0, \sigma/\lambda) \\
\mathbf{\epsilon} &\sim \textsf{Normal}(0, \sigma) \\
\mathbf{X}, \mathbf{Z} &: \textrm{matrices from transforming spline and penalty matrices} \\
\mathbf{X} &: \textrm{unpenalized (fixed) effects model matrix} \\
\mathbf{Z} &: \textrm{penalized (random) effects model matrix} \\
\lambda &: \textrm{smoothness parameter} \\
\sigma &: \textrm{residual variance component}
\end{align*}

And right there, on the second line, we see the mixed effects magic again: σ/λ plays the role of σb. Model coefficients are related under a common distribution so that they can share information with each other. We can smuggle penalized smooths into the mixed effects framework.

### So… you can turn splines into mixed models?

Yes. What this means is that we can use nlme or lme4 to estimate a smooth as a mixed effects model. mgcv provides this feature in its aptly named smooth2random() function.
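In fact, mgcv's gamm() function does this conversion for you, re-expressing the smooth as random effects and estimating them with nlme behind the scenes. A quick sketch with the mcycle data (the object name gamm_20 is just my label):

```r
# gamm() fits the same smooth by converting it to a mixed model for nlme.
library(mgcv)

mcycle <- MASS::mcycle

gamm_20 <- gamm(
  accel ~ s(times, bs = "cr", k = 20),
  data = mcycle,
  method = "REML"
)

# The result carries both representations of the same model:
class(gamm_20$lme)  # the underlying mixed model fit
class(gamm_20$gam)  # the smooth, repackaged for plotting and prediction
```

The variance component reported by the lme half is the σ/λ quantity from the equations above, wearing its random-effects costume.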

Below you can see the extent of the internal transformation needed to convert our nice wiggly cubic regression basis into matrices for the mixed effects framework.

# Construct a smoothing basis outside of a model
sm_raw <- smoothCon(
s(times, k = 20, bs = "cr"),
data = mcycle,
absorb.cons = TRUE,
diagonal.penalty = TRUE
)

re <- smooth2random(sm_raw[[1]], "", type = 2)

# 1 fixed effect and 18 random effect columns
# 1 fixed effect and 18 random effect columns
mixed_matrix <- cbind(re$Xf, re$rand$Xr)

ggmatplot(mixed_matrix) +
  labs(title = NULL)

I don’t understand it very well myself, and I probably won’t be converting smoothing bases into mixed effects model matrices and fitting them with lme4 anytime soon, but it’s useful to know about this idea because of the following fact.

### This point seems obscure, but it is what brms uses!

What happens when we fit a smooth in brms? We are fitting a mixed effects model.

b_gam_20 <- brm(
  accel ~ s(times, bs = "cr", k = 20),
  data = mcycle,
  family = gaussian,
  file = "b-gam-20"
)

summary(b_gam_20)
#>  Family: gaussian
#>   Links: mu = identity; sigma = identity
#> Formula: accel ~ s(times, bs = "cr", k = 20)
#>    Data: mcycle (Number of observations: 133)
#> Samples: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
#>          total post-warmup samples = 4000
#>
#> Smooth Terms:
#>               Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
#> sds(stimes_1)     5.00      1.14     3.21     7.66 1.00     1095     1579
#>
#> Population-Level Effects:
#>           Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
#> Intercept   -25.50      2.02   -29.39   -21.45 1.00     5454     2867
#> stimes_1      1.65      0.34     0.98     2.30 1.00     5156     3033
#>
#> Family Specific Parameters:
#>       Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
#> sigma    22.77      1.45    20.20    25.81 1.00     4851     3277
#>
#> Samples were drawn using sampling(NUTS). For each parameter, Bulk_ESS
#> and Tail_ESS are effective sample size measures, and Rhat is the potential
#> scale reduction factor on split chains (at convergence, Rhat = 1).

You see sds(stimes_1)? That’s the variance of the smooth weights. You see stimes_1? That’s the single fixed effect term (re$Xf) in the code above. I’m pretty sure about this because I learned smooth2random() from studying the brms source code.

Indeed, if we look at the actual Stan code used to fit the model, we see a mixed effects model. I suggest you only skim this code.

b_gam_20$model
#> // generated with brms 2.14.4
#> functions {
#> }
#> data {
#>   int<lower=1> N;  // total number of observations
#>   vector[N] Y;  // response variable
#>   // data for splines
#>   int Ks;  // number of linear effects
#>   matrix[N, Ks] Xs;  // design matrix for the linear effects
#>   // data for spline s(times,bs="cr",k=20)
#>   int nb_1;  // number of bases
#>   int knots_1[nb_1];  // number of knots
#>   // basis function matrices
#>   matrix[N, knots_1[1]] Zs_1_1;
#>   int prior_only;  // should the likelihood be ignored?
#> }
#> transformed data {
#> }
#> parameters {
#>   real Intercept;  // temporary intercept for centered predictors
#>   vector[Ks] bs;  // spline coefficients
#>   // parameters for spline s(times,bs="cr",k=20)
#>   // standarized spline coefficients
#>   vector[knots_1[1]] zs_1_1;
#>   real<lower=0> sds_1_1;  // standard deviations of spline coefficients
#>   real<lower=0> sigma;  // residual SD
#> }
#> transformed parameters {
#>   // actual spline coefficients
#>   vector[knots_1[1]] s_1_1;
#>   // compute actual spline coefficients
#>   s_1_1 = sds_1_1 * zs_1_1;
#> }
#> model {
#>   // likelihood including all constants
#>   if (!prior_only) {
#>     // initialize linear predictor term
#>     vector[N] mu = Intercept + rep_vector(0.0, N) + Xs * bs + Zs_1_1 * s_1_1;
#>     target += normal_lpdf(Y | mu, sigma);
#>   }
#>   // priors including all constants
#>   target += student_t_lpdf(Intercept | 3, -13.3, 35.6);
#>   target += student_t_lpdf(sds_1_1 | 3, 0, 35.6)
#>     - 1 * student_t_lccdf(0 | 3, 0, 35.6);
#>   target += std_normal_lpdf(zs_1_1);
#>   target += student_t_lpdf(sigma | 3, 0, 35.6)
#>     - 1 * student_t_lccdf(0 | 3, 0, 35.6);
#> }
#> generated quantities {
#>   // actual population-level intercept
#>   real b_Intercept = Intercept;
#> }

Okay, that’s a lot. But let me highlight and translate the key part of it.
\begin{align*}
\mathbf{S} &: \texttt{s\_1\_1} & \texttt{// actual spline coefficients}\\
\mathbf{\sigma} &: \texttt{sds\_1\_1} & \texttt{// standard deviations of spline coefficients}\\
\mathbf{Z} &: \texttt{zs\_1\_1} & \texttt{// standarized spline coefficients}\\
\mathbf{S} &= \sigma \cdot \mathbf{Z} & \texttt{s\_1\_1 = sds\_1\_1 * zs\_1\_1;} \\
\mathbf{Z} &\sim \textsf{Normal}(0, 1) & \texttt{target += std\_normal\_lpdf(zs\_1\_1);}\\
\mathbf{\sigma} &\sim \textsf{StudentT}(3, 0, 35.6) & \texttt{target += student\_t\_lpdf(sds\_1\_1 | 3, 0, 35.6) ...}\\
\end{align*}

The spline coefficients start as z-scores, drawn from a normal distribution with mean 0 and a standard deviation of 1. They are then scaled by σ, a scaling factor. This scaling factor acts as the hyperparameter that is learned from the data. This formulation of a mixed model is called the noncentered parameterization. It’s one of those topics you eventually run into in the Stanverse. Richard McElreath does a friendly tour of it as a matter of algebra. For a deeper dive, Michael Betancourt covers the topic in the context of debugging degenerate posteriors.

We can see completely analogous code in the Bayesian radon model above. I’ll just show the relevant part:

b_radon$model
#> ...
#> parameters {
#>   ...
#>   vector<lower=0>[M_1] sd_1;  // group-level standard deviations
#>   vector[N_1] z_1[M_1];  // standardized group-level effects
#> }
#> transformed parameters {
#>   vector[N_1] r_1_1;  // actual group-level effects
#>   r_1_1 = (sd_1 * (z_1));
#> }
#> model {
#>   ...
#>   target += student_t_lpdf(sd_1 | 3, 0, 2.5)
#>     - 1 * student_t_lccdf(0 | 3, 0, 2.5);
#>   target += std_normal_lpdf(z_1);
#> }


## Simple random effects are category smoothers

One consequence of this relationship is that you can walk it backwards: You can fit a simple random effect using a basis matrix and a penalty matrix. Indeed, mgcv provides a random effects ("re") smoother basis, so we can estimate our mixed model from above using a smooth.

gam_radon <- gam(
  log_radon ~ 1 + s(county, bs = "re"),
  data = radon,
  method = "REML"
)


In this case, the basis matrix is just a set of indicator variables, one column per county.

ggmatplot(model.matrix(gam_radon)[, -1])

(Yikes, that one stings the eyes!)

The penalty matrix is diagonal, as each county effect is penalized equally.

gam_radon %>%
  gratia::penalty() %>%
  gratia::draw() +
  theme(axis.text = element_blank())

mgcv knows that σ/λ = σb, and it will provide the random effect variance estimate for us with gam.vcomp() (variance components):

gam.vcomp(gam_radon)
#>
#> Standard deviations and 0.95 confidence intervals:
#>
#>             std.dev     lower     upper
#> s(county) 0.2976671 0.2202676 0.4022639
#> scale     0.7660696 0.7304109 0.8034692
#>
#> Rank: 2/2


Which closely matches the estimates from the Bayesian model above:

b_radon %>%
posterior_summary(pars = c("sd_county__Intercept", "sigma")) %>%
round(3)
#>                      Estimate Est.Error  Q2.5 Q97.5
#> sd_county__Intercept    0.305     0.046 0.220 0.400
#> sigma                   0.767     0.019 0.731 0.805


What I think is the coolest feature of random intercepts as smooths is what the effective degrees of freedom tell us:

nlevels(radon$county)
#> [1] 85

summary(gam_radon)
#>
#> Family: gaussian
#>
#> Formula:
#> log_radon ~ 1 + s(county, bs = "re")
#>
#> Parametric coefficients:
#>             Estimate Std. Error t value Pr(>|t|)
#> (Intercept)    1.350      0.047   28.72   <2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Approximate significance of smooth terms:
#>             edf Ref.df     F p-value
#> s(county) 39.57     84 1.573  <2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> R-sq.(adj) =  0.126   Deviance explained = 16.4%
#> -REML = 1092.4  Scale est. = 0.58686   n = 919


The data have 85 counties, but there are effectively 39.6 counties’ worth of parameters here. For what it’s worth, we find a similar number when we ask brms to use the LOO method to estimate the effective number of parameters of its model.

loo(b_radon)
#>
#> Computed from 4000 by 919 log-likelihood matrix
#>
#>          Estimate   SE
#> elpd_loo  -1083.8 28.7
#> p_loo        44.9  4.0
#> looic      2167.7 57.5
#> ------
#> Monte Carlo SE of elpd_loo is 0.1.
#>
#> Pareto k diagnostic values:
#>                          Count Pct.    Min. n_eff
#> (-Inf, 0.5]   (good)     917   99.8%   398
#>  (0.5, 0.7]   (ok)         2    0.2%   227
#>    (0.7, 1]   (bad)        0    0.0%   <NA>
#>    (1, Inf)   (very bad)   0    0.0%   <NA>
#>
#> All Pareto k estimates are ok (k < 0.7).
#> See help('pareto-k-diagnostic') for details.


Here p_loo is the effective number of parameters from a leave-one-out cross-validation method. It’s around 45. Our mgcv model has around 41.6 parameters (39.6 from the smooth, plus the intercept and sigma). These two parameter-estimating methods are unrelated (as far as I know), but they both seem to be telling us something similar about how much information in the county variable remains after partially pooling the model parameters.

For a deeper dive on random effects, Gavin Simpson recently wrote about how to use mgcv for random effects, so I encourage readers to look at that post.

## Okay, why are you telling me this?

The purpose of this post was to demystify the connection between penalized smooths and mixed effects models. Formally, it all boils down to a hyperparameter that penalizes model coefficients (or pools information) so that the model uses a smaller number of effective parameters. This connection enriches my understanding of the two kinds of models. A by-county random intercept is a category smoother. We can estimate a smooth by putting a prior on the variance of the basis function weights (provided that we incorporate the penalty matrix somehow). It’s wonderful.

Random effects are funny things. We are first taught them in a repeated measures framework as a way to handle a problem in the data (non-independence in clustered units). But they are so much more. To use a Richard McElreath phrase, they provide an “adaptive regularizing prior”: Something that constrains our estimates to reduce overfitting but is actively learned from the data. Hodges and Clayton (2011) outline this distinction between repeated-measures random effects and pooling/smoothing random effects in their paper Random Effects Old and New, and I think it’s worth keeping in mind when we use the phrase “random effects”.

Last knitted on 2021-02-26. Source code on GitHub.1

1. sessioninfo::session_info()
#> - Session info ---------------------------------------------------------------
#>  setting  value
#>  version  R version 4.0.4 (2021-02-15)
#>  os       Windows 10 x64
#>  system   x86_64, mingw32
#>  ui       RTerm
#>  language (EN)
#>  collate  English_United States.1252
#>  ctype    English_United States.1252
#>  tz       America/Chicago
#>  date     2021-02-26
#>
#>    V8               3.4.0      2020-11-04 
#>    vctrs            0.3.6      2020-12-17 
#>    withr            2.4.1      2021-01-26 
#>    xfun             0.21       2021-02-10 
#>    xml2             1.3.2      2020-04-23 
#>    xtable           1.8-4      2019-04-21 
#>    xts              0.12.1     2020-09-09 
#>    zoo              1.8-8      2020-05-02 
#>  source
#>  [source column wrapped onto its own block in the console output and
#>  cannot be matched line-by-line to the packages above: all packages
#>  were installed from CRAN (R 4.0.0-4.0.4), except gratia, installed
#>  from Github (gavinsimpson/gratia), and two locally installed
#>  packages, including tjmisc]
#>
#>  C:/Users/Tristan/Documents/R/win-library/4.0
#>  C:/Program Files/R/R-4.0.4/library
#>
#>  D -- DLL MD5 mismatch, broken installation.