**R – Aaron Schlegel**, and kindly contributed to R-bloggers)

MANOVA, or Multiple Analysis of Variance, is an extension of Analysis of Variance (ANOVA) to several dependent variables. The approach to MANOVA is similar to ANOVA in many regards and requires the same assumptions (normally distributed dependent variables with equal covariance matrices). This post will explore how MANOVA is performed and interpreted by analyzing the growth of six different apple tree rootstocks from 1918 to 1934 (Andrews and Herzberg 1985, pp. 357-360).

###### Multiple Analysis of Variance

In the MANOVA setting, each observation vector can have a model denoted as:

An ‘observation vector’ is a set of observations measured over several variables. With variables, becomes:

Thus, for the variable in in each vector , the model takes the form:

As before in ANOVA, the goal in multiple analysis of variance is to compare the groups to see if there are any significant differences. However, instead of a single variable, the comparisons will be made with the mean vectors of the samples. The null hypothesis can be formalized the same way in MANOVA:

With an alternative hypothesis that at least two are unequal. There are , where is the number of groups in the data, equalities that must be true for to be accepted.

###### MANOVA Between and Within Variation

The totals and means of the samples in the data are defined as:

- Total of the th sample:
- Overall total:
- Mean of the ith sample:
- Overall mean:

Similar to ANOVA, we are interested in partitioning the data’s total variation into variation between and within groups. In the case of ANOVA, this partitioning is done by calculating and ; however, in the multivariate case, we must extend this to encompass the variation in all the variables. Therefore, we must compute the between and within sum of squares for each possible comparison. This procedure results in the “hypothesis matrix” and “error matrix.”

The matrix is a square with the form:

Where the entries are equal to:

Thus, for example, the above equation for the entries and would take the form:

The error matrix is also and can be expressed similarly to .

With the entries equal to:

Therefore for corresponding entries and , the above equation is expressed as the following:

Once the and matrices are constructed, the mean vectors can be compared to determine if significant differences exist. There are several test statistics, of which the most common are Wilk’s lambda, Roy’s test, Pillai, and Lawley-Hotelling, that can be employed to test for significant differences. Each test statistic has specific properties and power and will be discussed in a future post. For now, the default Pillai test statistic from the `manova()`

function will suffice (and is the recommended statistic to use in most cases according to the documentation in `?summary(manova())`

).

The rootstock data were obtained from the companion FTP site of the book Methods of Multivariate Analysis by Alvin Rencher. The data contains four dependent variables as follows:

- trunk girth at four years (mm 100)
- extension growth at four years (m)
- trunk girth at 15 years (mm 100)
- weight of tree above ground at 15 years (lb 1000)

root <- read.table('ROOT.DAT', col.names = c('Tree.Number', 'Trunk.Girth.4.Years', 'Ext.Growth.4.Years', 'Trunk.Girth.15.Years', 'Weight.Above.Ground.15.Years')) root$Tree.Number <- as.factor(root$Tree.Number)

###### MANOVA in R

The `manova()`

function accepts a formula argument with the dependent variables formatted as a matrix and the grouping factor on the right of the `~`

.

dependent.vars <- cbind(root$Trunk.Girth.4.Years, root$Ext.Growth.4.Years, root$Trunk.Girth.15.Years, root$Weight.Above.Ground.15.Years)

Perform MANOVA and output a summary of the results.

root.manova <- summary(manova(dependent.vars ~ root$Tree.Number)) root.manova ## Df Pillai approx F num Df den Df Pr(>F) ## root$Tree.Number 5 1.3055 4.0697 20 168 1.983e-07 *** ## Residuals 42 ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The resultant MANOVA model reports a Pillai test statistic of and a p-value below , thus is rejected and it is concluded there are significant differences in the means.

Although the test rejected , we can test each variable individually with an ANOVA test. The `aov()`

function can output tests on individual variables when wrapped in a `summary()`

call.

summary(aov(dependent.vars ~ root$Tree.Number)) ## Response 1 : ## Df Sum Sq Mean Sq F value Pr(>F) ## root$Tree.Number 5 0.07356 0.0147121 1.931 0.1094 ## Residuals 42 0.31999 0.0076187 ## ## Response 2 : ## Df Sum Sq Mean Sq F value Pr(>F) ## root$Tree.Number 5 4.1997 0.83993 2.9052 0.0243 * ## Residuals 42 12.1428 0.28911 ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Response 3 : ## Df Sum Sq Mean Sq F value Pr(>F) ## root$Tree.Number 5 6.1139 1.22279 11.969 3.112e-07 *** ## Residuals 42 4.2908 0.10216 ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Response 4 : ## Df Sum Sq Mean Sq F value Pr(>F) ## root$Tree.Number 5 2.4931 0.49862 12.158 2.587e-07 *** ## Residuals 42 1.7225 0.04101 ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Only the first variable, trunk girth at four years, reports a p-value above 0.05. Thus it is concluded there are significant differences in the means for the variables except for trunk girth at four years amongst the six groups.

###### Replicating Multiple Analysis of Variance in R

For a deeper understanding of how MANOVA is calculated, we can replicate the results of the `manova()`

function by computing the and matrices as mentioned above. To calculate these matrices, first split the data into a list by group and find the mean vectors of each group. The sample sizes of each group, and the total mean vector will also be required.

root.group <- split(root[,2:5], root$Tree.Number) root.means <- sapply(root.group, function(x) { apply(x, 2, mean) }, simplify = 'data.frame') root.means ## 1 2 3 4 5 ## Trunk.Girth.4.Years 1.137500 1.157500 1.107500 1.09750 1.08000 ## Ext.Growth.4.Years 2.977125 3.109125 2.815250 2.87975 2.55725 ## Trunk.Girth.15.Years 3.738750 4.515000 4.455000 3.90625 4.31250 ## Weight.Above.Ground.15.Years 0.871125 1.280500 1.391375 1.03900 1.18100 ## 6 ## Trunk.Girth.4.Years 1.036250 ## Ext.Growth.4.Years 2.214625 ## Trunk.Girth.15.Years 3.596250 ## Weight.Above.Ground.15.Years 0.735000

n <- dim(root)[1] / length(unique(root$Tree.Number)) total.means <- colMeans(root[,2:5])

The and matrices can be computed with the following code.

H = matrix(data = 0, nrow = 4, ncol = 4) for (i in 1:dim(H)[1]) { for (j in 1:i) { H[i,j] <- n * sum((root.means[i,] - total.means[i]) * (root.means[j,] - total.means[j])) H[j,i] <- n * sum((root.means[j,] - total.means[j]) * (root.means[i,] - total.means[i])) } } E = matrix(data = 0, nrow = 4, ncol = 4) for (i in 1:dim(E)[1]) { for (j in 1:i) { b <- c() for (k in root.group) { a <- sum((k[,i] - mean(k[,i])) * (k[,j] - mean(k[,j]))) b <- append(b, a) } E[i,j] <- sum(b) E[j,i] <- sum(b) } }

Verify the results by comparing the computed matrices to the output of `summary(manova())`

.

root.manova$SS[1] ## $`root$Tree.Number` ## [,1] [,2] [,3] [,4] ## [1,] 0.07356042 0.5373852 0.3322646 0.208470 ## [2,] 0.53738521 4.1996619 2.3553885 1.637108 ## [3,] 0.33226458 2.3553885 6.1139354 3.781044 ## [4,] 0.20847000 1.6371084 3.7810437 2.493091

H ## [,1] [,2] [,3] [,4] ## [1,] 0.07356042 0.5373852 0.3322646 0.208470 ## [2,] 0.53738521 4.1996619 2.3553885 1.637108 ## [3,] 0.33226458 2.3553885 6.1139354 3.781044 ## [4,] 0.20847000 1.6371084 3.7810438 2.493091

root.manova$SS[2] ## $Residuals ## [,1] [,2] [,3] [,4] ## [1,] 0.3199875 1.696564 0.5540875 0.217140 ## [2,] 1.6965637 12.142790 4.3636125 2.110214 ## [3,] 0.5540875 4.363612 4.2908125 2.481656 ## [4,] 0.2171400 2.110214 2.4816562 1.722525

E ## [,1] [,2] [,3] [,4] ## [1,] 0.3199875 1.696564 0.5540875 0.217140 ## [2,] 1.6965637 12.142790 4.3636125 2.110214 ## [3,] 0.5540875 4.363613 4.2908125 2.481656 ## [4,] 0.2171400 2.110214 2.4816563 1.722525

The Pillai test statistic is denoted as and defined as:

Where represents the th nonzero eigenvalue of . Thus we can manually calculate the Pillai statistic with either of the following:

vs <- sum(diag(solve(E + H) %*% H)) # Will be used later in the post to find approximate F-statistic vs ## [1] 1.305472

sum((eigen(solve(E) %*% H)$values) / (1 + eigen(solve(E) %*% H)$values)) ## [1] 1.305472

Which is the same as the `manova()`

function output. The Pillai statistic is then used to determine the significance of differences of the mean vectors by comparing it to the critical value . The critical Pillai value is found by computing , , and which are also employed in Roy’s test (the Pillai test is an extension of Roy’s test). The values are defined as:

An approximate F-statistic can be found with the following equation:

k <- length(unique(root$Tree.Number)) p <- length(root[,2:5]) vh <- k - 1 ve <- dim(root)[1] - k

Where is the degrees of freedom for the hypothesis and is the degrees of freedom for the error.

s <- min(vh, p) m <- .5 * (abs(vh - p) - 1) N <- .5 * (ve - p - 1) f.approx <- ((2 * N + s + 1) * vs) / ((2 * m + s + 1) * (s - vs)) f.approx ## [1] 4.069718

root.manova$stats[,3][1] ## root$Tree.Number ## 4.069718

The calculated approximate F-statistic matches what was reported in the `manova()`

function.

###### Summary

This post explored the extension of ANOVA to multiple dependent variables known as multiple analysis of variance, or MANOVA for short, and how to perform the procedure with built-in R functions and manual computations. The concepts of ANOVA are extended and generalized to encompass variables, and thus the intuition and logic behind ANOVA also apply to the multivariate case. Future posts will examine more topics related to MANOVA including additional test statistics, unbalanced (unequal sample sizes) approaches and two-way classification.

###### References

Andrews, D. F., and Herzberg, A. M. (1985), Data, New York: Springer-Verlag.

Rencher, A. (n.d.). Methods of Multivariate Analysis (2nd ed.). Brigham Young University: John Wiley & Sons, Inc.

The post Multiple Analysis of Variance (MANOVA) appeared first on Aaron Schlegel.

**leave a comment**for the author, please follow the link and comment on their blog:

**R – Aaron Schlegel**.

R-bloggers.com offers

**daily e-mail updates**about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...