ANOVAs and Geomorph


Within geomorph are several functions that perform analysis of variance (ANOVA), including
procD.lm()
procD.pgls()
advanced.procD.lm()
pairwiseD.test()
pairwise.slope.test()
trajectory.analysis()
bilat.symmetry()
plotAllometry()



Inherent in all of these functions is a common philosophy for ANOVA (although other philosophies exist).  The geomorph ANOVA philosophy is that: 
(1) resampling (randomization) procedures are used to generate empirical sampling distributions to assess significance of effects, 
(2) effect sizes are estimated as standard deviates from such sampling distributions, 
(3) sums of squares are calculated between nested models in sequential fashion, 
(4) using appropriate exchangeable units under the null hypothesis is permitted and encouraged.
This philosophy is most apparent in procD.lm, our most basic function, which is also the backbone for other functions.

The simplest way to explain the philosophy is to first describe how sums of squares (SS) are calculated (in geomorph, R, or any other stats program).  The residual SS (RSS) of a linear model (also called the sum of squared error, SSE) is found as follows:

1)   Obtain residuals from a linear model, shape ~ X  (+ error), where X indicates some predictive (independent) variable (or set of variables) for the matrix of dependent variables, called shape.  This is a bit of a simplification; if one wishes to have more detail about linear models, Collyer et al. (2015, Heredity) is the main source on which this blog post is based.  The model error is a matrix of residuals.  Let's call this matrix E, which has the same number of rows (observations) and columns (shape variables) as the matrix for shape.
2)   Find the sums of squares and cross-products (SSCP) matrix as EE^t, where the superscript, t, indicates matrix transposition.  This is an n × n symmetric matrix whose diagonal values are the squared distances of each observation from its predicted value.  In other words, the model shape ~ X indicates that, to some extent, the variation in shape should be predicted by some variable (or set of variables), X, but this prediction is probably not perfectly precise.  The squared distances are a measure, observation by observation, of the imprecision of the prediction of the linear model.  Sum these squared distances and we have RSS for the model.  By the way, if shape is a matrix of Procrustes residuals, then the squared distances are squared Procrustes distances in the space tangent to shape space.  Hence the name procD.lm: Procrustes (squared) Distances from a linear model.
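To make this concrete, here is a minimal R sketch using a small, hypothetical residual matrix E (5 observations, 4 shape variables); the numbers are arbitrary and only illustrate the algebra:

E <- matrix(rnorm(20), nrow = 5) # hypothetical residual matrix: 5 observations x 4 shape variables
SSCP <- E %*% t(E) # n x n sums of squares and cross-products matrix
diag(SSCP) # squared distance of each observation from its predicted value
sum(diag(SSCP)) # residual sum of squares (RSS) for the model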

Thus, for any model, shape ~ X, we have a set of predicted shapes and a set of residuals.  For any model, we can summarize the error of prediction by calculating RSS.  If we have two similar "nested" models (where "nested" indicates that all of the predictor variables in one model are contained in the other, along with other variables), we can compare the RSS of the two models to determine if one model is "better".  So, the fundamental component of ANOVA is to do a series of linear model comparisons, comparing the RSS between them to ascertain whether the variables contained in one but missing in the other increase the predictive power of the linear model, thus reducing the error.  The SS of the "effect" is the reduction in RSS produced by the change in variables.

For example, using the pupfish data in geomorph, let's generate four linear models (fits) that increase in complexity, and find their RSS values:

library(geomorph)
data(pupfish)
shape <- two.d.array(pupfish$coords)

fit1 <- lm(shape ~ 1) # model contains just an intercept
fit2 <- lm(shape ~ log(pupfish$CS)) # allometric scaling of shape
fit3 <- lm(shape ~ log(pupfish$CS) + pupfish$Sex) # previous model (fit2) + sexual dimorphism
fit4 <- lm(shape ~ log(pupfish$CS) * pupfish$Sex) # previous model (fit3) + interaction between sex and log(CS)

# A function for RSS

RSS <- function(fit) sum(diag(resid(fit)%*%t(resid(fit))))

# Apply to each model

RSS(fit1)
RSS(fit2)
RSS(fit3)
RSS(fit4)

#If done correctly, the results should be

> RSS(fit1)
[1] 0.05633287
> RSS(fit2)
[1] 0.04231359
> RSS(fit3)
[1] 0.03469854
> RSS(fit4)
[1] 0.03281081

What this illustrates is that by adding effects to the previous models, in sequence, starting with a model that contains only an intercept and ending with a model that contains log(CS), Sex, and log(CS)*Sex (in this particular sequence), the model error is reduced each time.  Notice that if one were to subtract consecutive RSS values, this is the same SS found when using procD.lm.  Also notice that RSS(fit1) is the "Total" SS and RSS(fit4) is the RSS for the "full" model from procD.lm.
E.g.,

> RSS(fit1) - RSS(fit2)
[1] 0.01401928
> RSS(fit2) - RSS(fit3)
[1] 0.007615054
> RSS(fit3) - RSS(fit4)
[1] 0.001887732
> RSS(fit4)
[1] 0.03281081
> RSS(fit1)
[1] 0.05633287

> procD.lm(pupfish$coords ~ log(pupfish$CS) * pupfish$Sex)

Type I (Sequential) Sums of Squares and Cross-products

Randomization of Raw Values used

                            df       SS        MS     Rsq       F       Z P.value
log(pupfish$CS)              1 0.014019 0.0140193 0.24887 21.3638 10.4250   0.001
pupfish$Sex                  1 0.007615 0.0076151 0.13518 11.6045  5.9722   0.001
log(pupfish$CS):pupfish$Sex  1 0.001888 0.0018877 0.03351  2.8767  1.4912   0.103
Residuals                   50 0.032811 0.0006562                               
Total                       53 0.056333   

Thus, ANOVA is primarily this process of comparing model error.  There are obviously other parts.  The mean square (MS) of each effect is the SS/df.  These values can be used to calculate F values as ratios of MS-effect to MS-error (or MS-random effect, if present in the model).  The R-squared values are effect SS divided by total SS.  These values are "transformations" of effect SS, and might have certain appeal.  However, in geomorph, the F values are merely descriptive, whereas in some ANOVA programs they are used to estimate P-values from parametric F-distributions.  Such an approach uses integration of parametric F probability density functions as a proxy for estimating the probability that the effect SS was observed by chance.  This approach involves (sometimes restrictive) assumptions, including that the number of shape variables is far fewer than the number of observations.  In this age of "big data" and high-performance computing, forcing parametric F-distributions onto analyses is often neither feasible, nor needed, nor advised.
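As a quick check (a sketch assuming the fits and the RSS function from the code above), the descriptive statistics in the table can be reproduced by hand for the log(CS) effect:

# Reproduce the log(CS) row of the procD.lm table from the fits above
SS.cs <- RSS(fit1) - RSS(fit2) # sequential SS for log(CS)
MS.cs <- SS.cs / 1 # mean square = SS / df (df = 1 here)
MS.err <- RSS(fit4) / 50 # residual mean square of the full model (residual df = 50)
MS.cs / MS.err # F value, approximately 21.36
SS.cs / RSS(fit1) # R-squared, approximately 0.249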

The last two columns highlight greater flexibility in geomorph ANOVA functions.  There are two components to this flexibility: (1) how SS is estimated and (2) which units are exchangeable under null hypotheses.  Let's address the first component first.  "Sequential" SS, also called "type 1" SS, has already been described.  This type of SS is estimated through the process of sequentially adding model effects.  Another common type of SS in stats programs is "marginal" SS, also called type 3 SS.  This process does not start with a model with only an intercept and move "forward"; rather, it starts with a model with all desired effects and systematically removes each effect, calculating SS as the difference in RSS between the full model and each "reduced" model.  The benefit of type 3 SS is that the order of terms in the model is not important.  The drawback is that the sum of SS for effects and residuals is not the same as the total (notice in the ANOVA table above that the sum of the first 4 SS values equals the 5th SS, which is the total).
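For example (a sketch using the shape fits defined earlier; the object names fit.CS.only and fit.Sex.only are just illustrative), marginal SS for the main-effects model fit3 can be found by dropping each term in turn and comparing RSS against fit3:

fit.CS.only <- lm(shape ~ log(pupfish$CS)) # Sex removed (this is the same as fit2)
fit.Sex.only <- lm(shape ~ pupfish$Sex) # log(CS) removed
RSS(fit.Sex.only) - RSS(fit3) # marginal SS for log(CS)
RSS(fit.CS.only) - RSS(fit3) # marginal SS for Sex

These are the marginal SS that advanced.procD.lm reports further below.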

These are the two main types of SS.  There are other types that pertain to more complex models (for example, type 2 SS is a hybrid of type 1 and type 3 SS, and involves removing all interactions from compared models when evaluating "main" effects).  We will not get into the complexity of estimating SS through various paradigms, but geomorph can handle any type of SS estimation.  The user just needs to be aware that ANOVA is really a model comparison approach.  For example, when using the anova function in base R, there are two implementations.  One performs all model comparisons for models that could be nested within the input model; the other compares two specific models.  Consider these examples for univariate data:

fit5 <- lm(pupfish$CS ~ pupfish$Sex + pupfish$Pop) # models sexual dimorphism and population differences
fit6 <- lm(pupfish$CS ~ pupfish$Sex * pupfish$Pop) # models population variation in sexual dimorphism

#One can do ANOVA two ways:

> anova(fit6)
Analysis of Variance Table

Response: pupfish$CS
                        Df  Sum Sq Mean Sq F value    Pr(>F)   
pupfish$Sex              1 1643.81 1643.81 29.4294 1.688e-06 ***
pupfish$Pop              1 1379.72 1379.72 24.7013 8.243e-06 ***
pupfish$Sex:pupfish$Pop  1  121.11  121.11  2.1682    0.1472   
Residuals               50 2792.81   55.86                     

Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

> anova(fit5, fit6)
Analysis of Variance Table

Model 1: pupfish$CS ~ pupfish$Sex + pupfish$Pop
Model 2: pupfish$CS ~ pupfish$Sex * pupfish$Pop
  Res.Df    RSS Df Sum of Sq      F Pr(>F)
1     51 2913.9                           
2     50 2792.8  1    121.11 2.1682 0.1472

The first ANOVA table performs a sequential SS calculation for all possible reduced nested models; the second ANOVA directly tests the interaction between Sex and Pop.  Notice that the SS for the interaction is the same in both ANOVAs.  Notice also that the latter approach reminds the user of the models compared, and the error produced by each one.  Either way, we ascertain that population variation in sexual dimorphism (for CS) is not significant.  Using the latter approach, ANY SS type can be implemented, as long as one constructs appropriate model comparisons.  F values might not be accurate as output (which is why they are only descriptive in geomorph… more on this below), but F values can always be recalculated easily by compiling all model comparisons desired.
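For instance, a marginal (type 3) test of Sex in fit5 is just another two-model comparison, here against a hypothetical reduced model (fit.noSex) that keeps Pop but drops Sex:

fit.noSex <- lm(pupfish$CS ~ pupfish$Pop) # fit5 with Sex removed
anova(fit.noSex, fit5) # marginal SS and F for Sex, adjusted for Pop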

The function advanced.procD.lm is our multivariate analog to the latter ANOVA approach.  The name is a bit of a misnomer, because it is more basic in that it does a single model comparison, as defined by the user.  (Other options make it more advanced, though.)  Here is an example of how one could perform marginal SS estimates and test them using advanced.procD.lm.

> advanced.procD.lm(fit3, ~ pupfish$Sex)  # log(CS) effect

#ANOVA with RRPP

                            df      SSE        SS      F      Z     P
pupfish$Sex                 52 0.040553                             
log(pupfish$CS)+pupfish$Sex 51 0.034699 0.0058541 8.6044 6.4858 0.001

> advanced.procD.lm(fit3, ~ log(pupfish$CS))  # Sex effect

#ANOVA with RRPP

                            df      SSE        SS      F      Z     P
log(pupfish$CS)             52 0.042314                             
log(pupfish$CS)+pupfish$Sex 51 0.034699 0.0076151 11.193 8.0163 0.001

#Notice that the SS are different compared to

> procD.lm(fit3)

Type I (Sequential) Sums of Squares and Cross-products

Randomization of Raw Values used

                df       SS        MS     Rsq      F       Z P.value
log(pupfish$CS)  1 0.014019 0.0140193 0.24887 20.606 10.0845   0.001
pupfish$Sex      1 0.007615 0.0076151 0.13518 11.193  5.7721   0.001
Residuals       51 0.034699 0.0006804                              
Total           53 0.056333  

The SS for Sex is the same because, in both cases, it is evaluated against a model that contains only log(CS); however, the SS for log(CS) is different because in one case it is compared to a model with only an intercept (sequential SS) and in the other case it is compared to a model with both log(CS) and Sex (marginal SS).  Although the choice of SS type is an important consideration, geomorph can accommodate any SS calculation approach.

The second component of flexibility is how one chooses exchangeable units under the null hypothesis.  For any model comparison, the null hypothesis is that the RSS for both models is the same (i.e., SS effect = 0).  If one were to randomize row vectors of residuals and recalculate RSS, RSS would be the same before and after randomization.  Thus, residuals (of the reduced model) are exchangeable units under the null hypothesis, as the error remains constant.  However, if one randomizes residuals, adds them to predicted values, and re-estimates the error of the full model using these pseudorandom values from the reduced model, the RSS of the full model will change, representing random outcomes while maintaining the null hypothesis.  The observed effect SS (itself one possible random outcome) can then be compared to a distribution of many such SS values.  Its standard deviate in the distribution (Z score) is a measure of effect size (in standard deviations), and its percentile (P-value) is the probability of finding an SS at least as large, by chance, if the null hypothesis were true.  When this probability is really low (usually 0.05 or less), we say the effect is "significant".
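To illustrate the logic, here is a minimal sketch of this resampling for the Sex effect (reduced model fit2 versus full model fit3), assuming the fits and the RSS function defined earlier; it is only a conceptual sketch, not the geomorph implementation:

SS.obs <- RSS(fit2) - RSS(fit3) # observed effect SS for Sex
n <- nrow(shape)
SS.rand <- replicate(999, {
  E.r <- resid(fit2)[sample(n), , drop = FALSE] # randomize reduced-model residuals
  y.r <- fitted(fit2) + E.r # pseudorandom values from the reduced model
  f.r <- lm(y.r ~ log(pupfish$CS)) # re-fit reduced model
  f.f <- lm(y.r ~ log(pupfish$CS) + pupfish$Sex) # re-fit full model
  RSS(f.r) - RSS(f.f) # effect SS for this random outcome
})
SS.dist <- c(SS.obs, SS.rand) # the observed value is one possible outcome
mean(SS.dist >= SS.obs) # P-value: proportion of SS at least as large as observed
(SS.obs - mean(SS.dist)) / sd(SS.dist) # Z: standard deviate of the observed SS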

This process of randomizing residuals is called the randomized residual permutation procedure (RRPP).  In most functions it is an option; in some functions (trajectory.analysis, plotAllometry, advanced.procD.lm) it is not offered as an option but is required for the analysis.  The other option is to randomize the shape data.  While we do not recommend this option, it is maintained because various other programs use this method, and we hope that users find consistent results between programs.

The choice between RRPP and full randomization (RRPP = FALSE) is important, as it can have a big impact.

E.g.,

> procD.lm(pupfish$coords ~ log(pupfish$CS) * pupfish$Sex, iter=999, RRPP = T)

Type I (Sequential) Sums of Squares and Cross-products

Randomized Residual Permutation Procedure used

                            df       SS        MS     Rsq       F       Z P.value
log(pupfish$CS)              1 0.014019 0.0140193 0.24887 21.3638 10.1528   0.001
pupfish$Sex                  1 0.007615 0.0076151 0.13518 11.6045  7.7452   0.001
log(pupfish$CS):pupfish$Sex  1 0.001888 0.0018877 0.03351  2.8767  2.4944   0.012
Residuals                   50 0.032811 0.0006562                               
Total                       53 0.056333       
                                   
> procD.lm(pupfish$coords ~ log(pupfish$CS) * pupfish$Sex, iter=999, RRPP = F)

Type I (Sequential) Sums of Squares and Cross-products

Randomization of Raw Values used

                            df       SS        MS     Rsq       F       Z P.value
log(pupfish$CS)              1 0.014019 0.0140193 0.24887 21.3638 10.0403   0.001
pupfish$Sex                  1 0.007615 0.0076151 0.13518 11.6045  5.8287   0.001
log(pupfish$CS):pupfish$Sex  1 0.001888 0.0018877 0.03351  2.8767  1.4898   0.113
Residuals                   50 0.032811 0.0006562                               
Total                       53 0.056333   

Notice that ignoring the alternative effects (by not using RRPP) meant failing to detect a significant interaction!

There are various other options that can be explored, including verbose output, pairwise comparisons, the number of random iterations, etc.  The most important options account for the differences among functions, and are best used after first making a determination with procD.lm for how to proceed, especially when analyses contain multiple groups.  Understanding how ANOVA works in geomorph, and how to interpret its output, will guide which of the other functions to use.

Mike
