Hierarchical linear models and lmer

[This article was first published on Anything but R-bitrary, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Hierarchical linear models and lmer

Article by Ben Ogorek
Graphics by Bob Forrest


Hierarchical Linear Models
My last article [1] featured linear models with random slopes. For estimation and prediction, we used the lmer function from the lme4 package[2].

Today we’ll consider another level in the hierarchy, one where slopes and intercepts are themselves linked to a linear predictor. We’ll simulate data to build intuition, derive the lmer formula using the linear mixed model $$ y = \mathbf{X \phi} + \mathbf{Z b} + \mathbf{\epsilon}, $$ and recover the system parameters.

Article sections


A realistic example

The nlme [3] package contains the data set MathAchieve, consisting of students, their socio-economic statuses (SES) and math achievement (MathAch) scores.
head(MathAchieve, 3)

## Grouped Data: MathAch ~ SES | School
##   School Minority    Sex    SES MathAch MEANSES
## 1   1224       No Female -1.528   5.876  -0.428
## 2   1224       No Female -0.588  19.708  -0.428
## 3   1224       No   Male -0.528  20.349  -0.428

If you inquired about the relationship between SES and MathAch scores, you’d have to acknowledge that the student isn’t the only unit. There is also the school. Notice how the variable MEANSES is constant over rows; it is a property of a school, not of a student. If the relationship between SES and MathAch was fundamentally different for schools according to MEANSES, what would that mean?

You can see Jon Fox’s analysis of these data [4]. We won’t be analyzing them. We’ll be making up our own.

A not-so-realistic example

For the remainder of this article, we will simulate data from a hierarchical linear model and then analyze it. This may sound circular, but it can also be informative. First, we’ll know if we are recovering the parameters (we chose them). We can quickly ask “what-if” questions, adjusting the sample sizes and parameter values. And finally, creating the expected data for a technique is a tactile way to understand its assumptions.

Our not-so-realistic data will feature $N$ abstract “units” at the top of the hierarchy, analogous to the schools in our realistic example and indexed by $i=1,\ldots, N$. These units will have an attribute (e.g., MEANSES) called $a_i$. Instead of students, we’ll call our second level “measurements” with index $j=1,\ldots,J$. Our response and predictor variables (e.g. MathAch and SES) will be called $y_{ij}$ and $x_{ij}$, respectively. As we will have $J$ measurements for every unit, there will be $M = NJ$ rows of simulated data.

Simulating data

The unit-level

We first generate data at the unit level, generating those characteristics from the standard normal distribution. That is, $a_i \sim N(0,1).$
rm(list = ls())

N <- 30
unit.df = data.frame(unit = c(1:N), a = rnorm(N))

head(unit.df, 3)

##   unit        a
## 1    1 -1.19142
## 2    2  0.54930
## 3    3 -0.06241

Every unit will have its own slope, $\alpha_i$, and intercept, $\beta_i$. The key is, these are now linearly related to $a_i$.
unit.df <-  within(unit.df, {
    E.alpha.given.a <-  1 - 0.15 * a
    E.beta.given.a <-  3 + 0.3 * a
head(unit.df, 3)

##   unit        a E.beta.given.a E.alpha.given.a
## 1    1 -1.19142          2.643          1.1787
## 2    2  0.54930          3.165          0.9176
## 3    3 -0.06241          2.981          1.0094

Adding noise

In a realistic scenario, a variable like $a_i$ probably won't tell you everything about unit $i$, including its exact slope and intercept. Thus for units $i=1,\ldots,N$, we'll need to simulate random disturbances that displace the slopes and intercepts from their conditional expectations given $a_i$.

Below we generate these random disturbances from the bivariate normal distribution, made easy thanks to the mvtnorm package [5,6].
q = 0.2
r = 0.9
s = 0.5
cov.matrix <-  matrix(c(q^2, r * q * s, r * q * s, s^2), nrow = 2,
    byrow = TRUE)
random.effects <-  rmvnorm(N, mean = c(0, 0), sigma = cov.matrix)

These disturbances are the random effects. Our first three (out of $N=30$) look like this:
##          [,1]    [,2]
## [1,]  0.19373  0.7516
## [2,]  0.07904  0.1776
## [3,] -0.30462 -0.5849

For the random effects covariance matrix, we used
##      [,1] [,2]
## [1,] 0.04 0.09
## [2,] 0.09 0.25

Based on our choices of $q$ and $s$ above, slopes vary more than intercepts. Also we've specified a correlation of $.9$ (i.e., $.09/ \sqrt{.04 *.25}$), so when the intercept's random effect is above its mean (of zero), the same is likely true for the slope.

Finally, we add the random effects to arrive at the unit-level intercepts ($\alpha_i$s) and slopes ($\beta_i$s).
unit.df$alpha = unit.df$E.alpha.given.a + random.effects[, 1]
unit.df$beta = unit.df$E.beta.given.a + random.effects[, 2]
head(unit.df, 3)

##   unit        a E.beta.given.a E.alpha.given.a  alpha  beta
## 1    1 -1.19142          2.643          1.1787 1.3724 3.394
## 2    2  0.54930          3.165          0.9176 0.9966 3.342
## 3    3 -0.06241          2.981          1.0094 0.7047 2.396

The relationship of $\alpha_i$ and $\beta_i$ to $a_i$ is depicted below.
Linear relationships among coefficients

The within-unit level

We now move to the within-unit level, generating a fixed $x$-grid to be shared among units.
J = 30
M = J * N  #Total number of observations
x.grid = seq(-4, 4, by = 8/J)[0:30]

within.unit.df <-  data.frame(unit = sort(rep(c(1:N), J)), j = rep(c(1:J),
    N), x =rep(x.grid, N))

Next, we generate data from the individual linear models based on $\alpha_i$, $\beta_i$, and within-unit level noise $\epsilon_{ij} \sim N(0, .75^2)$.
flat.df = merge(unit.df, within.unit.df)

flat.df <-  within(flat.df, y <-  alpha + x * beta + 0.75 * rnorm(n = M))

Below is a depiction of the $N=30$ linear systems.

Using lmer

Before moving back to lmer, we boil down our data set (which currently includes fields we wouldn't know in practice) to the essentials.
simple.df <-  flat.df[, c("unit", "a", "x", "y")]
head(simple.df, 3)

##   unit      a      x      y
## 1    1 -1.191 -4.000 -10.95
## 2    1 -1.191 -3.733 -11.68
## 3    1 -1.191 -3.467 -10.43

If we ignore $a_i$, we could still get random lines with the following syntax.
my.lmer <-  lmer(y ~ x + (1 + x | unit), data = simple.df)
cat("AIC =", AIC(my.lmer))

## AIC = 2226

For the purpose of comparison, we'll keep track of the Akaike information criterion (AIC), a general-purpose criterion for model comparison. Smaller is better, and there is good reason to believe we will find a model with a smaller AIC. For while $\alpha_i$ and $\beta_i$ vary, they vary not about fixed means, but rather about conditional means given $a_i$.

Separating fixed from random

To skip the math that's coming, see the next code block. But knowing what to put in lmer means understanding how to frame your problem in terms of the linear mixed model $$ \mathbf{y} = \mathbf{X \phi} + \mathbf{Z b} + \mathbf{\epsilon}. $$ In it, $\mathbf{X}$ and $\mathbf{Z}$ are $M$-row design matrices, and $\mathbf{y}$ and $\mathbf{\epsilon}$ are $M$ length vectors. In terms of explainable effects on $\mathbf{y}$, $\mathbf{X \phi}$ is the contribution from the fixed effects, and $\mathbf{Z b}$ is the is the contribution from the random effects.

As the relationship of $(\alpha_i, \beta_i)$ to $a_i$ depends on four quantities, so will $\mathbf{\phi}$ be a $4\times 1$ vector. As there are two random effects for every unit, $\mathbf{b}$ is a $2N \times 1$ vector.

The fixed effects vector is called $\mathbf{\phi}$ instead of the usual $\mathbf{\beta}$ because the latter is reserved for $\mathbf{\beta}_i = (\alpha_i, \beta_i)^T$ in the unit-level processes $$ \mathbf{y}_i = \mathbf{X}_i \mathbf{\beta_i} + \mathbf{\epsilon}_i. $$ It turns out that $\mathbf{\beta_i}$ is itself a function of $\phi$ and $\mathbf{b}_i$, written as $$ \mathbf{\beta_i} = \mathbf{A}_i \phi + \mathbf{b}_i, $$ where $$\mathbf{A}_i = \begin{bmatrix} 1 & a_i & 0 & 0 \\ 0 & 0 & 1 & a_i \end{bmatrix}, $$ $\mathbf{\phi} = (\phi_0, \phi_1, \phi_2, \phi_3)$, and $\mathbf{b}_i$ is the $2 \times 1$ random effect for the $i^{th}$ unit. Review your matrix multiplication if necessary, and convince yourself that this indeed what we've been working with.

Distributing $\mathbf{X}_i$ over the terms in the $\mathbf{\beta_i}$ equation leads to $$ \mathbf{y_i} = \mathbf{X}_i \mathbf{A}_i \phi + \mathbf{X}_i \mathbf{b}_i + \mathbf{\epsilon}_i. $$ We complete the linear mixed effects model picture by setting $$ \mathbf{X} = \begin{bmatrix} \mathbf{X}_1 \mathbf{A}_1 \\ \mathbf{X}_2 \mathbf{A}_2 \\ ... \\\mathbf{X}_N \mathbf{A}_N \end{bmatrix} $$ and $$ \mathbf{Z} = \begin{bmatrix} \mathbf{X}_1 & 0 & \ldots & 0\\ 0 & \mathbf{X}_2 & \ldots & 0\\ \vdots & \vdots & \ddots & \vdots\\ 0 & 0 &\ldots & \mathbf{X}_N \end{bmatrix}. $$

The lmer formula

Textbook problems often go from regression model to design matrix. We go the other way, from design matrix to linear model.

Fixed Effects

The design matrix for the fixed effects, $\mathbf{X}$, is seen above as a vertical stack of $\mathbf{X}_i \mathbf{A}_i$ matrices. If we understand $\mathbf{X}_i \mathbf{A}_i$, we understand $\mathbf{X}$. It is $$ \mathbf{X}_i \mathbf{A}_i = \begin{bmatrix} \mathbf{1}_J & \mathbf{x}_i\end{bmatrix} \begin{bmatrix} 1 & a_i & 0 & 0 \\ 0 & 0 & 1 & a_i \end{bmatrix}, $$ where $\mathbf{1}_J$ is a $J$-vector of ones and $\mathbf{x}_i$ is the within-unit covariate vector $(x_{i1}, x_{i2}, \ldots, x_{iJ})^T$. Multiplying these matrices, we arrive at $$ \mathbf{X}_i \mathbf{A}_i = \begin{bmatrix} 1 & a_i & x_{i1} & x_{i1} a_i\\ 1 & a_i & x_{i2} & x_{i2} a_i \\ \ldots & \dots & \ldots & \ldots \\ 1 & a_i & x_{iJ} & x_{iJ} a_i \end{bmatrix}. $$ These must be stacked vertically, but the structure is clear. This is the design matrix for the linear model $\text{E}(y|x, a) = \phi_0 + \phi_1 a + \phi_2 x + \phi_3 x a$. The fixed effects portion is just a simple linear model with interaction!

The random effects

After accounting for the fixed effects, the random effects are specified as if the coefficients were completely random. This part of the syntax is (1 + x | unit), specifying random effects for both the intercept and slope after adjusting for their conditional expectations given $a_i$.

The lmer code

The lmer formula is a concatenation of the linear model with interaction syntax and the random effects syntax.
my.lmer <-  lmer(y ~ x + a + x * a + (1 + x | unit), data = simple.df)

## Linear mixed model fit by REML 
## Formula: y ~ x + a + x * a + (1 + x | unit) 
##    Data: simple.df 
##   AIC  BIC logLik deviance REMLdev
##  2206 2244  -1095     2174    2190
## Random effects:
##  Groups   Name        Variance Std.Dev. Corr  
##  unit     (Intercept) 0.0315   0.177          
##           x           0.2842   0.533    0.915 
##  Residual             0.5641   0.751          
## Number of obs: 900, groups: unit, 30
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept)   0.9776     0.0410   23.85
## x             3.0670     0.0980   31.31
## a            -0.1108     0.0475   -2.34
## x:a           0.3577     0.1134    3.15
## Correlation of Fixed Effects:
##     (Intr) x      a     
## x    0.723              
## a   -0.025 -0.018       
## x:a -0.018 -0.025  0.723
Notice that this model has indeed achieved a lower AIC (2206) by incorporating $a_i$. Furthermore, you can find noisy versions of all our key parameters in the output including
  • the random effects correlation of .9 (.915) and standard deviations of .2 (.117) for the intercept and .5 (.533) for slope,
  • the standard deviation of $\epsilon_{ij}$ of .75 (.751),
  • the regression parameters for $\alpha_i$ and $\beta_i$ versus $a_i$ of 1.0 (.9776), 3.0 (3.0670), -0.15 (-0.1108), and 0.30 (.3577).
You can increase the samples sizes to get closer estimates on average.


This article did not discuss lmer's predictions of $\alpha_i$ and $\beta_i$ or their relationship to the individual estimates $\hat{\alpha}_i$ and $\hat{\beta}_i$ from lm. Nor did we investigate what happens one of the distributional assumptions is violated.

To readers who have made it this far, I hope your familiarity with random effects models has increased in general, and that linear mixed modeling tools, such as lmer, are more available to your specific hierarchical applications.


All images created by Bob Forrest using ggplot2 [7], Cairo [8] and R functions for base graphics [9]. Bob also helped choose the colors for the R syntax highlighting theme used. Custom syntax highlighting and html rendering made possible with the knitr package [10], and made enjoyable via RStudio's Rhtml editor [11]. Thanks to Josh Mills, Chris Nicholas, and Mike McCaslin for their helpful comments. Keep up with ours and other great articles on R-Bloggers, and follow Ben on Twitter (@baogorek) for the latest research updates.


  1. Ben Ogorek (2012). Random regression coefficients using lme4. Anything but R-bitrary. URL http://anythingbutrbitrary.blogspot.com/2012/06/random-regression-coefficients-using.html

  2. Douglas Bates, Martin Maechler and Ben Bolker (2011). lme4: Linear mixed-effects models using S4 classes. R package version 0.999375-42. URL http://CRAN.R-project.org/package=lme4

  3. Jose Pinheiro, Douglas Bates, Saikat DebRoy, Deepayan Sarkar and the R Development Core Team (2012). nlme: Linear and Nonlinear Mixed Effects Models. R package version 3.1-103.

  4. John Fox (2002). Linear Mixed Models. Appendix to An R and S-PLUS Companion to Applied Regression. URL http://cran.r-project.org/doc/contrib/Fox-Companion/appendix-mixed-models.pdf

  5. Alan Genz, Frank Bretz, Tetsuhisa Miwa, Xuefei Mi, Friedrich Leisch, Fabian Scheipl, Torsten Hothorn (2012). mvtnorm: Multivariate Normal and t Distributions. R package version 0.9-9992. URL http://CRAN.R-project.org/package=mvtnorm

  6. Alan Genz, Frank Bretz (2009), Computation of Multivariate Normal and t Probabilities. Lecture Notes in Statistics, Vol. 195., Springer-Verlage, Heidelberg. ISBN 978-3-642-01688-2.

  7. H. Wickham. ggplot2: elegant graphics for data analysis. Springer New York, 2009.

  8. Simon Urbanek and Jeffrey Horner (2011). Cairo: R graphics device using cairo graphics library for creating high-quality bitmap (PNG, JPEG, TIFF), vector (PDF, SVG, PostScript) and display (X11 and Win32) output. R package version 1.5-1. URL http://CRAN.R-project.org/package=Cairo

  9. R Development Core Team (2012). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0, URL http://www.R-project.org/

  10. Yihui Xie (2012). knitr: A general-purpose package for dynamic report generation in R. R package version 0.6. http://CRAN.R-project.org/package=knitr

  11. RStudio IDE for Windows. URL http://www.rstudio.com/ide/

To leave a comment for the author, please follow the link and comment on their blog: Anything but R-bitrary.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)