[This article was first published on R on The broken bridge between biologists and statisticians, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

The CoViD-19 situation in Italy is little by little improving and I feel a bit more optimistic. It’s time for a new post! I will go back to a subject that is rather important for most agronomists, i.e. the selection of crop varieties.

All farmers are perfectly aware that crop performances are affected both by the genotype and by the environment. These two effects are not purely additive and they often show a significant interaction. By this word, we mean that a genotype can give particularly good/bad performances in some specific environmental situations, which we may not expect, considering its average behaviour in other environmental conditions. The Genotype by Environment (GE) interaction may cause changes in the ranking of genotypes, depending on the environment and may play a key role in varietal recommendation, for a given mega-environment.

GE interactions are usually studied by way of Multi-Environment Trials (MET), where experiments are repeated across several years, locations or any combinations of those. Traditional techniques of data analyses, such as two-way ANOVA, give very little insight on the stability/reliability of genotypes across environments and, thus, other specialised techniques are necessary to shed light on interaction effects. I have already talked about stability analyses in other posts, such as in this post about the stability variance or in this other post about the environmental variance. Now, I would like to propose some simple explanations about AMMI analysis. AMMI stands for: Additive Main effect Multiplicative Interaction and it has become very much in fashion in the last 20-25 years.

# A MET with faba bean

This experiment consists of 12 faba bean genotypes (well, it was, indeed, 6 genotypes in two sowing dates; but, let’s disregard this detail from now on) in four blocks, two locations and three years (six environments, in all). The dataset is online available as ‘fabaBean.csv’. It has been published by Stagnari et al. (2007).

First of all, let’s load the dataset and transform the block variable into a factor. Let’s also inspect the two-way table of means, together with the marginal means for genotypes and environments, which will be useful later. In this post, we will make use of the packages ‘dplyr’ (Wickham et al., 2020), ‘emmeans’ (Lenth, 2020) and ‘aomisc’; this latter is the companion package for this website and must have been installed as detailed in this page here.

# options(width = 70)

rm(list=ls())
# library(devtools)
# install_github("OnofriAndreaPG/aomisc")
library(reshape)
library(emmeans)
library(aomisc)

fileName <- "https://www.casaonofri.it/_datasets/fabaBean.csv"
dataset <- transform(dataset, Block = factor(Block),
Genotype = factor(Genotype),
Environment = factor(Environment))
##      Genotype Block Environment Yield
## 1    Chiaro_A     1       bad_1  4.36
## 2    Chiaro_P     1       bad_1  2.76
## 3 Collameno_A     1       bad_1  3.01
## 4 Collameno_P     1       bad_1  2.50
## 5    Palomb_A     1       bad_1  3.85
## 6    Palomb_P     1       bad_1  2.21
#
# Two-ways table of means
GEmedie <- cast(Genotype ~ Environment, data = dataset,
value = "Yield", fun=mean)
GEmedie
## 1     Chiaro_A 4.1050 2.3400 4.1250 4.6325 2.4100 3.8500
## 2     Chiaro_P 2.5075 1.3325 4.2025 3.3225 1.4050 4.3175
## 3  Collameno_A 3.2500 2.1150 4.3825 3.8475 2.2325 4.0700
## 4  Collameno_P 1.9075 0.8475 3.8650 2.5200 0.9850 4.0525
## 5     Palomb_A 3.8400 2.0750 4.2050 5.0525 2.6850 4.6675
## 6     Palomb_P 2.2500 0.9725 3.2575 3.2700 0.8825 4.0125
## 7      Scuro_A 4.3700 2.1050 4.1525 4.8625 2.1275 4.2050
## 8      Scuro_P 3.0500 1.6375 3.9300 3.7200 1.7475 4.5125
## 9    Sicania_A 3.8300 1.9450 4.5050 3.9550 2.2350 4.2350
## 10   Sicania_P 3.2700 0.9900 3.7300 4.0475 0.8225 3.8950
## 11   Vesuvio_A 4.1375 2.0175 4.0275 4.5025 2.2650 4.3225
## 12   Vesuvio_P 2.1225 1.1800 3.5250 3.0950 0.9375 3.6275
#
# Marginal means for genotypes
apply(GEmedie, 1, mean)
##    Chiaro_A    Chiaro_P Collameno_A Collameno_P    Palomb_A
##    3.577083    2.847917    3.316250    2.362917    3.754167
##    Palomb_P     Scuro_A     Scuro_P   Sicania_A   Sicania_P
##    2.440833    3.637083    3.099583    3.450833    2.792500
##   Vesuvio_A   Vesuvio_P
##    3.545417    2.414583
#
# Marginal means for environments
apply(GEmedie, 2, mean)
## 3.220000 1.629792 3.992292 3.902292 1.727917 4.147292
#
# Overall mean
mean(as.matrix(GEmedie))
## [1] 3.103264

What model could we possibly fit to the above data? The basic two-way ANOVA model is:

$Y_{ijk} = \mu + \gamma_{jk} + g_i + e_j + ge_{ij} + \varepsilon_{ijk} \quad \quad (1)$

where the yield $$Y$$ for given block $$k$$, environment $$j$$ and genotype $$i$$ is described as a function of the effects of blocks within environments ($$\gamma$$), genotypes ($$g$$), environments ($$e$$) and GE interaction ($$ge$$). The residual error term $$\varepsilon$$ is assumed to be normal and homoscedastic, with standard deviation equal to $$\sigma$$. Let’s also assume that both the genotype and environment effects are fixed: this is useful for teaching purposes and it is reasonable, as we intend to study the behaviour of specific genotypes in several specific environments.

The interaction effect $$ge$$, under some important assumptions (i.e. balanced data, no missing cells and homoscedastic errors), is given by:

$ge_{ij} = Y_{ij.} - \left( \mu + g_i + e_j \right) = Y_{ij.} - Y_{i..} - Y_{.j.} + \mu \quad \quad (2)$

where $$Y_{ij.}$$ is the mean of the combination between the genotype $$i$$ and the environment $$j$$, $$Y_{i..}$$ is the mean for the genotype $$i$$ and $$Y_{.j.}$$ is the mean for the environment $$j$$. For example, for the genotype ‘Chiaro_A’ in the environment ‘bad_1’, the interaction effect was:

4.1050 - 3.577 - 3.22 + 3.103
## [1] 0.411

We see that the interaction was positive, in the sense that ‘Chiaro_A’, gave 0.411 tons per hectare more than we could have expected, considering its average performances across environments and the average performances of all genotypes in ‘bad_1’.

More generally, the two-way table of interaction effects can be obtained by doubly centring the matrix of means, as shown in the following box.

GE <- as.data.frame(t(scale( t(scale(GEmedie, center=T,
scale=F)), center=T, scale=F)))
print(round(GE, 3))
## Chiaro_A     0.411  0.236 -0.341  0.256  0.208 -0.771
## Chiaro_P    -0.457 -0.042  0.466 -0.324 -0.068  0.426
## Collameno_A -0.183  0.272  0.177 -0.268  0.292 -0.290
## Collameno_P -0.572 -0.042  0.613 -0.642 -0.003  0.646
## Palomb_A    -0.031 -0.206 -0.438  0.499  0.306 -0.131
## Palomb_P    -0.308  0.005 -0.072  0.030 -0.183  0.528
## Scuro_A      0.616 -0.059 -0.374  0.426 -0.134 -0.476
## Scuro_P     -0.166  0.011 -0.059 -0.179  0.023  0.369
## Sicania_A    0.262 -0.032  0.165 -0.295  0.160 -0.260
## Sicania_P    0.361 -0.329  0.048  0.456 -0.595  0.058
## Vesuvio_A    0.475 -0.054 -0.407  0.158  0.095 -0.267
## Vesuvio_P   -0.409  0.239  0.221 -0.119 -0.102  0.169

Please, note that the overall mean for all elements in ‘GE’ is zero and the sum of squares is equal to a fraction of the interaction sum of squares in ANOVA (that is $$RMSE/r$$; where $$r$$ is the number of blocks).

mean(unlist(GE))
## [1] 6.914424e-18
sum(GE^2)
## [1] 7.742996
mod <- lm(Yield ~ Environment/Block + Genotype*Environment, data = dataset)
anova(mod)
## Analysis of Variance Table
##
## Response: Yield
##                       Df Sum Sq Mean Sq  F value    Pr(>F)
## Environment            5 316.57  63.313 580.9181 < 2.2e-16 ***
## Genotype              11  70.03   6.366  58.4111 < 2.2e-16 ***
## Environment:Block     18   6.76   0.375   3.4450 8.724e-06 ***
## Environment:Genotype  55  30.97   0.563   5.1669 < 2.2e-16 ***
## Residuals            198  21.58   0.109
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
30.97/4
## [1] 7.7425

# Decomposing the GE matrix

It would be nice to be able to give a graphical summary of the GE matrix; in this regard, we could think of using Principal Component Analysis (PCA) via Singular Value Decomposition (SVD). This has been shown by Zobel et al. (1988) and, formerly, by Gollob (1968). May I just remind you a few things about PCA and SVD? No overwhelming math detail, I promise!

Most matrices (and our GE matrix) can be decomposed as the product of three matrices, according to:

$X = U D V^T \quad \quad (3)$

where $$X$$ is the matrix to be decomposed, $$U$$ is the matrix of the first $$n$$ eigenvectors of $$XX^T$$, $$V$$ is the matrix of the first $$n$$ eigenvectors of $$X^T X$$ and $$D$$ is the diagonal matrix of the first $$n$$ singular values of $$XX^T$$ (or $$X^T X$$; it does not matter, they are the same).

Indeed, if we want to decompose our GE matrix, it is more clever (and more useful to our purposes), to write the following matrices:

$S_g = U D^{1/2} \quad \quad (4)$

and:

$S_e = V D^{1/2} \quad \quad (5)$

so that

$GE = S_g \, S_e^T \quad \quad (6)$

$$S_g$$ is the matrix of row-scores (genotype scores) and $$S_e$$ is the matrix of column scores (environment scores). Let me give you an empirical proof, in the box below. In order to find $$S_g$$ and $$S_e$$, I will use a mathematical operation that is known as Singular Value Decomposition (SVD):

U <- svd(GE)$u V <- svd(GE)$v
D <- diag(svd(GE)$d) Sg <- U %*% sqrt(D) Se <- V %*% sqrt(D) row.names(Sg) <- levels(dataset$Genotype)
dfr <- mod$df.residual ge.lsm <- emmeans(mod, ~Genotype:Environment) ge.lsm <- data.frame(ge.lsm)[,1:3] ## Second step on least square means This second step assumes that the residual variances for all environments are homogeneous. If so (we’d better check this), we can take the expected marginal means (‘ge.lsm’) and submit them to AMMI analysis, by using the ‘AMMImeans()’ function. The syntax is fairly obvious; we also pass to it the RMSE and its degrees of freedom. The resulting object can be explored, by using the appropriate slots. AMMIobj <- AMMImeans(yield = ge.lsm$emmean,
genotype = ge.lsm$Genotype, environment = ge.lsm$Environment,
MSE = RMSE, dfr = dfr)
#
AMMIobj$genotype_scores ## PC1 PC2 ## Chiaro_A -0.60710888 -0.383732821 ## Chiaro_P 0.55192742 0.026531045 ## Collameno_A 0.08444877 -0.542185666 ## Collameno_P 0.80677055 -0.065752971 ## Palomb_A -0.32130513 0.110117240 ## Palomb_P 0.28104959 0.345909298 ## Scuro_A -0.62638795 0.139185954 ## Scuro_P 0.22961347 0.076555540 ## Sicania_A -0.06286803 -0.323857285 ## Sicania_P -0.21433211 0.683296898 ## Vesuvio_A -0.43786742 -0.007914342 ## Vesuvio_P 0.31605973 -0.058152890 # AMMIobj$environment_scores
##               PC1         PC2
## pap_1 -0.66137357  0.51268429
## pap_2 -0.06863235 -0.62703224
## pap_3  0.84633965  0.56736492
#
round(AMMIobj$summary, 4) ## PC Singular_value PC_SS Perc_of_Total_SS cum_perc ## 1 1 2.3000 5.2902 68.3220 68.3220 ## 2 2 1.1785 1.3888 17.9364 86.2584 ## 3 3 0.8035 0.6456 8.3375 94.5959 ## 4 4 0.5119 0.2621 3.3846 97.9806 ## 5 5 0.3954 0.1564 2.0194 100.0000 ## 6 6 0.0000 0.0000 0.0000 100.0000 # round(AMMIobj$anova, 4)
##   PC     SS DF     MS       F P.value
## 1  1 5.2902 15 0.3527 12.9437  0.0000
## 2  2 1.3888 13 0.1068  3.9208  0.0000
## 3  3 0.6456 11 0.0587  2.1539  0.0184
## 4  4 0.2621  9 0.0291  1.0687  0.3876
## 5  5 0.1564  7 0.0223  0.8198  0.5718
## 6  6 0.0000  5 0.0000  0.0000  1.0000

In detail, we can retrieve the genotype and environment scores, the proportion of the GE variance explained by each component and the significance of PCs.

Just to show you, the box below reports the code for AMMI analysis on raw data. Please, note that this only works with balanced data.

AMMIobj2 <- AMMI(yield = dataset$Yield, genotype = dataset$Genotype,
environment = dataset$Environment, block = dataset$Block)

I agree, these functions are not very ambitious. However, they are simple enough to be usable and give reliable results, as long as the basic assumptions for the method are respected. You may also consider to explore other more comprehensive R packages, such as ‘agricolae’ (de Mendiburu, 2020).

Thank you for reading, so far, and… happy coding!

Prof. Andrea Onofri
Department of Agricultural, Food and Environmental Sciences
University of Perugia (Italy)
Borgo XX Giugno 74
I-06121 - PERUGIA

# Literature references

1. Annichiarico, P. (1997). Additive main effects and multiplicative interaction (AMMI) analysis of genotype-location interaction in variety trials repeated over years. Theoretical applied genetics, 94, 1072-1077.
2. Ariyo, O. J. (1998). Use of additive main effects and multiplicative interaction model to analyse multilocation soybean varietal trials. J. Genet. and Breed, 129-134.
3. Cornelius, P. L. (1993). Statistical tests and retention of terms in the Additive Main Effects and Multiplicative interaction model for cultivar trials. Crop Science, 33,1186-1193.
4. Crossa, J. (1990). Statistical Analyses of multilocation trials. Advances in Agronomy, 44, 55-85.
5. Gollob, H. F. (1968). A statistical model which combines features of factor analytic and analysis of variance techniques. Psychometrika, 33, 73-114.
6. Lenth R., 2020. emmeans: Estimated Marginal Means, aka Least-Squares Means. R package version 1.4.6. https://github.com/rvlenth/emmeans.
7. de Mendiburu F., 2020. agricolae: Statistical Procedures for Agricultural Research. R package version 1.3-2. https://CRAN.R-project.org/package=agricolae.
8. Onofri, A., Ciriciofolo, E., 2007. Using R to perform the AMMI analysis on agriculture variety trials. R NEWS 7, 14–19.
9. Stagnari F., Onofri A., Jemison J., Monotti M. (2006). Multivariate analyses to discriminate the behaviour of faba bean (Vicia faba L. var. minor) varieties as affected by sowing time in cool, low rainfall Mediterranean environments. Agronomy For Sustainable Development, 27, 387–397.
10. Hadley Wickham, Romain François, Lionel Henry and Kirill Müller, 2020. dplyr: A Grammar of Data Manipulation. R package version 0.8.5. https://CRAN.R-project.org/package=dplyr.
11. Zobel, R. W., Wright, M.J., and Gauch, H. G. (1988). Statistical analysis of a yield trial. Agronomy Journal, 388-393.