Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

This post was originally part of my previous post about linear models. However, I later decided to split it into several texts because it was effectively too long and complex to navigate.
If you struggle to follow the code in this page please refer to this post (for example for the necessary packages): Linear Models (lm, ANOVA and ANCOVA) in Agriculture

## Linear Mixed-Effects Models

This class of models are used to account for more than one source of random variation. For example, assume we have a dataset where again we are trying to model yield as a function of nitrogen level. However, this time the data were collected in many different farms. In this case would need to be consider a cluster and the model would need to take this clustering into account. Another common set of experiments where linear mixed-effects models are used is repeated measures where time provide an additional source of correlation between measures. For these models we do not need to worry about the assumptions from previous models, since these are very robust against all of them. For example, for unbalanced design with blocking, probably these methods should be used instead of the standard ANOVA.

At the beginning on this tutorial we explored the equation that supports linear model:

This equation can be seen as split into two components, the fixed and random effects. For fixed effect we refer to those variables we are using to explain the model. These may be factorial (in ANOVA), continuous or a mixed of the two (ANCOVA) and they can also be the blocks used in our design. The other component in the equation is the random effect, which provides a level of uncertainty that it is difficult to account in the model. For example, when we work with yield we might see differences between plants grown from similar soils and conditions. These may be related to the seeds or to other factors and are part of the within-subject variation that we cannot explain.

There are times however where in the data there are multiple sources of random variation. For example, data may be clustered in separate field or separate farms. This will provide an additional source of random variation that needs to be taken into account in the model. To do so the standard equation can be amended in the following way:

This is referred to as a random intercept model, where the random variation is split into a cluster specific variation u and the standard error term. A more complex form, that is normally used for repeated measures is the random slope and intercept model:

Where we add a new source of random variation v related to time T.

### Random Intercept Model for Clustered Data

In the following examples we will use the function lme in the package nlme, so please install and/or load the package first. For the dataset please look at the previous post.

Just to explain the syntax to use linear mixed-effects model in R for cluster data, we will assume that the factorial variable rep in our dataset describe some clusters in the data. To fit a mixed-effects model we are going to use the function lme from the package nlme. This function can work with unbalanced designs:

 lme1 = lme(yield ~ nf + bv * topo, random= ~1|rep, data=dat)


The syntax is very similar to all the models we fitted before, with a general formula describing our target variable yield and all the treatments, which are the fixed effects of the model. Then we have the option random, which allows us to include an additional random component for the clustering factor rep. In this case the ~1 indicates that the random effect will be associated with the intercept.

Once again we can use the function summary to explore our results:

 > summary(lme1)
Linear mixed-effects model fit by REML
Data: dat
AIC   BIC  logLik
27648.36 27740.46 -13809.18

Random effects:
Formula: ~1 | rep
(Intercept) Residual
StdDev:  0.798407 13.3573

Fixed effects: yield ~ nf + bv * topo
Value Std.Error  DF  t-value p-value
(Intercept) 327.3304 14.782524 3428 22.143068    0
nfN1      3.9643 0.788049 3428  5.030561    0
nfN2      5.2340 0.790104 3428  6.624449    0
nfN3      5.4498 0.789084 3428  6.906496    0
nfN4      7.5286 0.789551 3428  9.535320    0
nfN5      7.7254 0.789111 3428  9.789976    0
bv      -1.4685 0.085507 3428 -17.173569    0
topoHT   -233.3675 17.143956 3428 -13.612232    0
topoLO   -251.9750 20.967003 3428 -12.017693    0
topoW    -146.4066 16.968453 3428 -8.628162    0
bv:topoHT   1.1945 0.097696 3428 12.226279    0
bv:topoLO   1.4961 0.123424 3428 12.121624    0
bv:topoW    0.7873 0.097865 3428  8.044485    0



We can also use the function Anovato display the ANOVA table:

 > Anova(lme2, type=c("III"))
Analysis of Deviance Table (Type III tests)

Response: yield
Chisq Df Pr(>Chisq)
(Intercept) 752.25 1 < 2.2e-16 ***
nf     155.57 5 < 2.2e-16 ***
bv     291.49 1 < 2.2e-16 ***
topo    236.52 3 < 2.2e-16 ***
year    797.13 1 < 2.2e-16 ***
bv:topo   210.38 3 < 2.2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1



We might be interested in understanding if fitting a more complex model provides any advantage in terms of accuracy, compared with a model where no additional random effect is included. To do so we can compare this new model with mod6, which we created with the glsfunction and includes the same treatment structure.  We can do that with the function anova, specifying both models:

 > anova(lme1, mod6)
Model df   AIC   BIC  logLik  Test L.Ratio p-value
lme1   1 15 27648.36 27740.46 -13809.18
mod6   2 14 27651.21 27737.18 -13811.61 1 vs 2 4.857329 0.0275



As you can see there is a decrease in AIC for the model fitted with lme, and the difference is significant (p-value below 0.05). Therefore this new model where clustering is accounted for is better than the one without an additional random effect, even though only slightly. In this case we would need to decide if fitting a more complex model (which is probably more difficult to explain to readers) is the best way.

Another way to assess the accuracy of GLS and Mixed Effects Models is through the use of pseudo R-Squared, which are indexes that can be interpreted as the normal R-Squared but calculated in different ways, since in these more complex models we do not calculate the sum of squares.
There are two important functions for this, both included in the package MuMIn:

 library(MuMIn)
> r.squaredLR(mod6)
[1] 0.5469906
[1] 0.5470721
> r.squaredGLMM(lme1)
R2m    R2c
0.5459845 0.5476009


The first function r.squaredLR can be used for GLS models and provides both and R-Squared and an Adjusted R-Squared. The second function, r.squaredGLMM, is specific for mixed-effects models and provides two measures: R2m and R2c. The first reports the R2 of the model with just fixed effects, while the second the R2 of the full model.
In this case we can see again that the R2 are similar between models, and most importantly R2c is only slightly different compared to R2m, which means that including random effects does not improve the accuracy.

### Random Intercept and Slope for repeated measures

If we collected data at several time steps we are looking at a repeated measures analysis. The code to create such a model is the following:

 lme2 = lme(yield ~ nf + bv * topo + year, random= ~year|rep, data=dat)

summary(lme2)

Anova(lme2, type=c("III"))



The syntax is very similar to what we wrote before, except that now the random component includes both time and clusters. Again we can use summary and Anovato get more info about the model. We can also use again the function anova to compare this with the previous model:

 > anova(lme1, lme2)
Model df   AIC   BIC  logLik  Test L.Ratio p-value
lme1   1 15 27648.36 27740.46 -13809.18
lme2   2 18 26938.83 27049.35 -13451.42 1 vs 2 715.5247 <.0001
Warning message:
In anova.lme(lme1, lme2) :
fitted objects with different fixed effects. REML comparisons are not meaningful.



From this output it is clear that the new model is better that the one before and their difference in highly significant.

We can extract only the effects for the random components using the function ranef:

 > ranef(lme2)
(Intercept)     year
R1 -0.3468601 -1.189799e-07
R2 -0.5681688 -1.973702e-07
R3  0.9150289 3.163501e-07



This tells us the changes in yield for each cluster and time step. We can also do the same for the fixed effects, and this will return the coefficients of the model:

 > fixef(lme2)
(Intercept)     nfN1     nfN2     nfN3     nfN4
-1.133614e+04 3.918006e+00 5.132136e+00 5.368513e+00 7.464542e+00
nfN5      bv    topoHT    topoLO     topoW
7.639337e+00 -1.318391e+00 -2.049979e+02 -2.321431e+02 -1.136168e+02
year   bv:topoHT   bv:topoLO   bv:topoW
5.818826e+00 1.027686e+00 1.383705e+00 5.998379e-01



To have an idea of their confidence interval we can use the function intervals:

 > intervals(lme2, which = "fixed")
Approximate 95% confidence intervals

Fixed effects:
lower     est.     upper
(Intercept) -1.214651e+04 -1.133614e+04 -1.052576e+04
nfN1     2.526139e+00 3.918006e+00 5.309873e+00
nfN2     3.736625e+00 5.132136e+00 6.527648e+00
nfN3     3.974809e+00 5.368513e+00 6.762216e+00
nfN4     6.070018e+00 7.464542e+00 8.859065e+00
nfN5     6.245584e+00 7.639337e+00 9.033089e+00
bv     -1.469793e+00 -1.318391e+00 -1.166989e+00
topoHT   -2.353450e+02 -2.049979e+02 -1.746508e+02
topoLO   -2.692026e+02 -2.321431e+02 -1.950836e+02
topoW    -1.436741e+02 -1.136168e+02 -8.355954e+01
year     5.414742e+00 5.818826e+00 6.222911e+00
bv:topoHT  8.547273e-01 1.027686e+00 1.200644e+00
bv:topoLO  1.165563e+00 1.383705e+00 1.601846e+00
bv:topoW   4.264933e-01 5.998379e-01 7.731826e-01
attr(,"label")
[1] "Fixed effects:"



## Syntax with lme4

Another popular package to perform mixed-effects models we could also use the package lme4 and the function lmer.
For example, to fit the model with random intercept (what we called lme1) we would use the following syntax in lme4:

 > lmer1 = lmer(yield ~ nf + bv * topo + (1|rep), data=dat)
>
> summary(lmer1)
Linear mixed model fit by REML ['lmerMod']
Formula: yield ~ nf + bv * topo + (1 | rep)
Data: dat
REML criterion at convergence: 27618.4
Scaled residuals:
Min   1Q Median   3Q   Max
-3.4267 -0.7767 -0.1109 0.7196 3.6892
Random effects:
Groups  Name    Variance Std.Dev.
rep   (Intercept)  0.6375 0.7984
Residual       178.4174 13.3573
Number of obs: 3443, groups: rep, 3
Fixed effects:
Estimate Std. Error t value
(Intercept) 327.33043  14.78252 22.143
nfN1      3.96433  0.78805  5.031
nfN2      5.23400  0.79010  6.624
nfN3      5.44980  0.78908  6.906
nfN4      7.52862  0.78955  9.535
nfN5      7.72537  0.78911  9.790
bv      -1.46846  0.08551 -17.174
topoHT   -233.36750  17.14396 -13.612
topoLO   -251.97500  20.96700 -12.018
topoW    -146.40655  16.96845 -8.628
bv:topoHT   1.19446  0.09770 12.226
bv:topoLO   1.49609  0.12342 12.122
bv:topoW    0.78727  0.09786  8.044
Correlation matrix not shown by default, as p = 13 > 12.
Use print(x, correlation=TRUE) or
vcov(x)     if you need it


There are several differences between nlme and lme4 and I am not sure which is actually better. What I found is that probably lme4 is the most popular, but nlme is used for example to fit generalized addictive mixed effects models in the package mgcv. For this reason probably the best thing would be to know how to use both packages.

As you can see from the summary above, in this table there are no p-values, so it is a bit difficult to know which levels are significant for the model. We can solve this by installing and/or loading the package lmerTest. If we load lmerTest and run again the same function we would obtain the following summary table:

 > lmer1 = lmer(yield ~ nf + bv * topo + (1|rep), data=dat)
>
> summary(lmer1)
Linear mixed model fit by REML t-tests use Satterthwaite approximations to degrees of freedom [
lmerMod]
Formula: yield ~ nf + bv * topo + (1 | rep)
Data: dat
REML criterion at convergence: 27618.4
Scaled residuals:
Min   1Q Median   3Q   Max
-3.4267 -0.7767 -0.1109 0.7196 3.6892
Random effects:
Groups  Name    Variance Std.Dev.
rep   (Intercept)  0.6375 0.7984
Residual       178.4174 13.3573
Number of obs: 3443, groups: rep, 3
Fixed effects:
Estimate Std. Error     df t value Pr(>|t|)
(Intercept) 327.33043  14.78252 3411.00000 22.143 < 2e-16 ***
nfN1      3.96433  0.78805 3428.00000  5.031 5.14e-07 ***
nfN2      5.23400  0.79010 3428.00000  6.624 4.03e-11 ***
nfN3      5.44980  0.78908 3428.00000  6.906 5.90e-12 ***
nfN4      7.52862  0.78955 3428.00000  9.535 < 2e-16 ***
nfN5      7.72537  0.78911 3428.00000  9.790 < 2e-16 ***
bv      -1.46846  0.08551 3428.00000 -17.174 < 2e-16 ***
topoHT   -233.36750  17.14396 3429.00000 -13.612 < 2e-16 ***
topoLO   -251.97500  20.96700 3430.00000 -12.018 < 2e-16 ***
topoW    -146.40655  16.96845 3430.00000 -8.628 < 2e-16 ***
bv:topoHT   1.19446  0.09770 3429.00000 12.226 < 2e-16 ***
bv:topoLO   1.49609  0.12342 3430.00000 12.122 < 2e-16 ***
bv:topoW    0.78727  0.09786 3430.00000  8.044 1.33e-15 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Correlation matrix not shown by default, as p = 13 > 12.
Use print(x, correlation=TRUE) or
vcov(x)     if you need it


As you can see now the p-values are showing and we can assess the significance for each term.

Clearly, all the functions we used above for the function lme can be used also with the package lme4 and lmerTest. For example, we can produce the anova table with the function anova or compute the R Squared with the function r.squaredGLMM.

When we are dealing with random slope and intercept we would use the following syntax:

 lmer2 = lmer(yield ~ nf + bv * topo + year + (year|rep), data=dat)


R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.