Repeated measures with perennial crops

[This article was first published on R on The broken bridge between biologists and statisticians, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

In this post, I want to discuss a concept that is often mistaken by some of my collegues. With all crops, we are used to repeating experiments across years to obtain multi-year data; the structure of the resulting dataset is always the same and it is exemplified in the box below, that refers to a multi-year genotype experiment with winter wheat.

rm(list = ls())
library(tidyverse)
library(nlme)
library(emmeans)
filePath <- "https://www.casaonofri.it/_datasets/WinterWheat.csv"
dataset <- read.csv(filePath)
dataset <- dataset %>%
  mutate(across(c(1:3, 5), .fns = factor))
head(dataset)
##   Plot Block Genotype Yield Year
## 1    2     1 COLOSSEO  6.73 1996
## 2  110     2 COLOSSEO  6.96 1996
## 3  181     3 COLOSSEO  5.35 1996
## 4    2     1 COLOSSEO  6.26 1997
## 5  110     2 COLOSSEO  7.01 1997
## 6  181     3 COLOSSEO  6.11 1997

We can see that we have a column for the blocks, a column for the experimental factor (the genotype, in this instance), a column for the year and a column for the response variable (the yield, in this instance).

If we intend to take the genotype, year and block effects as fixed, we can fit the correct model with the lm() function and the resulting ANOVA table looks like the following one.

mod1 <- lm(Yield ~ Year/Block + Genotype*Year, data = dataset)
anova(mod1)
## Analysis of Variance Table
## 
## Response: Yield
##               Df  Sum Sq Mean Sq  F value    Pr(>F)    
## Year           6 159.279 26.5466 178.3996 < 2.2e-16 ***
## Genotype       7  11.544  1.6491  11.0824 2.978e-10 ***
## Year:Block    14   3.922  0.2801   1.8826   0.03738 *  
## Year:Genotype 42  27.713  0.6598   4.4342 6.779e-10 ***
## Residuals     98  14.583  0.1488                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

As we have made different experiments in different years (and with a different randomisation in each year, hopefully), the block effect can only be considered as nested within each year (‘Year/Block’ and not ’Year*Block’). As for the rest, the ANOVA table is very close to that for any types of two-factors factorial experiments.

The things change dramatically if we have a perennial crop, or a crop with a cycle lasting for more than one year, because yield measurements are taken repeatedly in the same plots across years. For this reason, even though the dataset looks very much like the previous one, it must be analysed in a totally different manner. Based on my experience as an editor of International Journals, I can say that several authors, still today, tend to forget this, resulting in several rejections or, at least, delays in publication.

Let’s consider the dataset below, that refers to the yield of lucerne genotypes in three consecutive years, taken from the same plots in a single experiment lasting for three years.

filePath <- "https://www.casaonofri.it/_datasets/alfalfa3years.csv"
dataset <- read.csv(filePath)
dataset <- dataset %>%
  mutate(across(c(1:3), .fns = factor))
head(dataset)
##   Block Genotype Year     Yield
## 1     1 4cascine 2006  6.631775
## 2     2 4cascine 2006  6.705397
## 3     3 4cascine 2006  6.499588
## 4     4 4cascine 2006  7.087686
## 5     1 4cascine 2007 14.964927
## 6     2 4cascine 2007 13.584865

If we analyse tha data as in the winter wheat example, we neglect that the observations are grouped within the same plots and, therefore, they are correlated. Consequently, the independence assumption is broken and it is no wonder that the manuscript must be rejected. What should we do, then?

First of all, we should build a new variable to uniquely identify each plot; it is easy, if we think that yield values taken for the same genotype in the same block must have been taken in the same plot.

# Refernce the plots
dataset$Plot <- dataset$Block:dataset$Genotype

Next, we can follow the usual rules to build the model (Piepho et al., 2004):

  1. Consider one single year and build the treatment model
  2. Consider one single year and build the block model
  3. Include the year factor into the model and combine the year with all the effects in the treatment model, by crossing or nesting as appropriate.
  4. Consider that the ‘plot’ factor in the block model references the randomisation units, i.e. those units which received the the genotypes by a randomisation process. Assign to this plot factor a random effect.
  5. Excluding the terms for randomisation units, nest the year in all the other terms in the block model.
  6. Combine random effects for randomisation units with the repeated factor, by using the colon operator, in order to derive the correct error terms to accommodate correlation structures.

The models at the different steps are as follows (with R notation):

  1. treatment model: YIELD ~ GENOTYPE
  2. block model: YIELD ~ BLOCK + BLOCK:PLOT
  3. treatment model: YIELD ~ GENOTYPE * YEAR
  4. block model: YIELD ~ BLOCK + (1|BLOCK:PLOT)
  5. block model: YIELD ~ BLOCK + BLOCK:YEAR + (1|BLOCK:PLOT)
  6. block model: YIELD ~ BLOCK + BLOCK:YEAR + (1|BLOCK:PLOT) + (1|BLOCK:PLOT:YEAR)

For analogy with the winter wheat experiment we take the block, year and genotype effects as fixed (but you can change this), so that the final model is:

YIELD ~ BLOCK + BLOCK:YEAR + GENOTYPE * YEAR + (1|BLOCK:PLOT) + (1|BLOCK:PLOT:YEAR)

where the last term does not need to be fitted in R, as it is the residual term, that is fitted by default. The resulting analysis (with ‘lme’) is:

mod2 <- lme(Yield ~ Block + Block:Year + Genotype*Year, 
           random = ~1|Plot,
           data = dataset)
anova(mod2)
##               numDF denDF   F-value p-value
## (Intercept)       1   114 27080.051  <.0001
## Block             3    57     2.139  0.1053
## Genotype         19    57     4.602  <.0001
## Year              2   114  2107.322  <.0001
## Block:Year        6   114     3.818  0.0017
## Year:Genotype    38   114     1.356  0.1115
emmeans(mod2, ~Genotype)
##  Genotype          emmean    SE df lower.CL upper.CL
##  4cascine           11.59 0.315 57    10.96     12.2
##  casalina           12.46 0.315 57    11.83     13.1
##  classe             11.59 0.315 57    10.96     12.2
##  costanza            9.89 0.315 57     9.26     10.5
##  dimitra            11.75 0.315 57    11.12     12.4
##  FDL0101            12.22 0.315 57    11.59     12.9
##  garisenda          11.76 0.315 57    11.13     12.4
##  LaBellaCampagnola  12.17 0.315 57    11.54     12.8
##  LaTorre            11.36 0.315 57    10.73     12.0
##  linfa              11.23 0.315 57    10.60     11.9
##  marina             11.76 0.315 57    11.13     12.4
##  Palladiana         10.89 0.315 57    10.26     11.5
##  PicenaGR           12.12 0.315 57    11.49     12.8
##  PR56S82            11.56 0.315 57    10.93     12.2
##  PR57Q53            11.70 0.315 57    11.07     12.3
##  prosementi         11.79 0.315 57    11.15     12.4
##  RivieraVicentina    9.98 0.315 57     9.35     10.6
##  robot              12.11 0.315 57    11.48     12.7
##  Selene             12.11 0.315 57    11.48     12.7
##  Zenith             11.94 0.315 57    11.31     12.6
## 
## Results are averaged over the levels of: Block, Year 
## Degrees-of-freedom method: containment 
## Confidence level used: 0.95

If we compare the above analyses with a ‘naive’ (wrong) analysis that neglects the repeated measures, we see big differences and, especially, we see that the SE for genotype means is much higher in the correct analysis.

mod3 <- lm(Yield ~ Year/Block + Genotype*Year, 
           data = dataset)
anova(mod3)
## Analysis of Variance Table
## 
## Response: Yield
##                Df  Sum Sq Mean Sq   F value    Pr(>F)    
## Year            2 2602.53 1301.27 1608.2407 < 2.2e-16 ***
## Genotype       19  104.27    5.49    6.7824 4.256e-13 ***
## Year:Block      9   21.80    2.42    2.9930  0.002449 ** 
## Year:Genotype  38   31.83    0.84    1.0351  0.424687    
## Residuals     171  138.36    0.81                        
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
emmeans(mod3, ~Genotype)
##  Genotype          emmean   SE  df lower.CL upper.CL
##  4cascine           11.59 0.26 171    11.07     12.1
##  casalina           12.46 0.26 171    11.94     13.0
##  classe             11.59 0.26 171    11.08     12.1
##  costanza            9.89 0.26 171     9.38     10.4
##  dimitra            11.75 0.26 171    11.24     12.3
##  FDL0101            12.22 0.26 171    11.71     12.7
##  garisenda          11.76 0.26 171    11.24     12.3
##  LaBellaCampagnola  12.17 0.26 171    11.66     12.7
##  LaTorre            11.36 0.26 171    10.84     11.9
##  linfa              11.23 0.26 171    10.72     11.7
##  marina             11.76 0.26 171    11.25     12.3
##  Palladiana         10.89 0.26 171    10.38     11.4
##  PicenaGR           12.12 0.26 171    11.61     12.6
##  PR56S82            11.56 0.26 171    11.05     12.1
##  PR57Q53            11.70 0.26 171    11.19     12.2
##  prosementi         11.79 0.26 171    11.27     12.3
##  RivieraVicentina    9.98 0.26 171     9.47     10.5
##  robot              12.11 0.26 171    11.59     12.6
##  Selene             12.11 0.26 171    11.60     12.6
##  Zenith             11.94 0.26 171    11.42     12.5
## 
## Results are averaged over the levels of: Block, Year 
## Confidence level used: 0.95

I just want to conclude by saying that the above mixed model analyses regards the design as a sort of ‘split-plot in time’ and it is not necessarily correct, as it assumes that the within-plot correlation is the same for all pairs of observations, regardless of their distance in time. Further analyses might be necessary to assess whether serial correlation structures are necessary. But we’ll consider this in a future post.

Thanks for reading and happy coding!

Andrea Onofri
Department of Agricultural, Food and Environmental Sciences
University of Perugia (Italy)
[email protected]


Reference

  1. Piepho, H.-P., Büchse, A., Richter, C., 2004. A Mixed Modelling Approach for Randomized Experiments with Repeated Measures. Journal of Agronomy and Crop Science 190, 230–247.
To leave a comment for the author, please follow the link and comment on their blog: R on The broken bridge between biologists and statisticians.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)