I’ve been reading about RLQ analysis, also known as the fourth corner method, for analyzing relationships between environmental characteristics and species traits. I was interested because I thought I might be using RLQ analysis to answer a specific set of questions (I’m not). However, I was still curious about it, and I wanted to know how it works.
For those of you have haven’t heard of it, RLQ analysis is a method by which one can uncover how the environment filters certain species traits. For example, you can determine whether a particular environment selects for species with rapid growth rates, high reproductive output, or whatever trait you choose to measure. It accomplishes this by, more or less, linking a description of the environment to species traits by measurements of species abundances.
You start with three data tables: The R matrix is a site x environment table: sites are rows and columns are environmental descriptors. The L matrix is a site x species table, where rows are sites and columns are abundances of specific species. The Q matrix is a species x trait table, where rows are species and columns are biological traits of those species. What RLQ analysis does, simply, is makes a new matrix that I’ll call V, which is a environment x traits matrix, and you then perform your standard PCA-esque eigendecomposition on that. Although not technically correct, RLQ analysis is simply thought of as nothing more than PCA on the matrix V. The vast majority of the work is actually in constructing V.
The ‘ade4′ package in R can do RLQ analyses. First, you do a principle components analysis on both R and Q and a correspondance analysis on L. You then pass these analyses to the rlq() function. To figure out how RLQ works, I took apart the rlq() function, then several secondary functions called by rlq(). It turns out, the PCA and CA on the R, L, and Q matrices aren’t actually used, you can do the whole thing by hand without those preliminary analyses.
So, without further ado, here’s how RLQ analysis works (NOTE: I’ve verified my results with those from the rlq() function to make sure they match:
library(ade4) #### READ IN THE DATA AND CLEAN IT UP #### traits <- read.csv('traits.csv') env <- read.csv('environment.csv') species <- read.csv('species.csv') species <- species[,-c(1:2)] env <- env[,-c(1:2)] traits <- traits[,-1] rownames(traits) <- colnames(species)
RLQ analysis operates on the matrix RLQ, which is calculated as R‘ D_site L D_species Q, where D_site and D_species are diagonal matrices of the row and column weights from the species matrix. As shown below, this is the same as R’ P Q, where P is the centered probability matrix
First, we have a site by species matrix, N, of raw abundances
N <- species
Convert N to a relativized species matrix P, where p_ij = n_ij / n++, where n++ is the total number of individuals (sum of the entire N matrix)
P <- N/sum(N)
Now divide each observation by its row weight (p_i+) and column weight (p_j+). The row and column weights are simply the sum of the observations in a row divided by the sum of the matrix (and similarly for columns). Once done, this gives p_ij = p_ij/(p_i+ p+j+)
row.w <- apply(P, MAR=1, function(x) sum(x)/sum(P)) col.w <- apply(P, MAR=2, function(x) sum(x)/sum(P)) P <- sweep(P, MAR=1, row.w, '/') P <- sweep(P, MAR=2, col.w, '/')
Next, subtract 1 from each observation, givin p_ij = p_ij(p_i+ p_j+) – 1, which equals (p_ij – p_i+p_j+)/(p_i+p_j+).
P <- P-1
This IS the chi-distance matrix used in correspondance analysis. You can verify this by checking the table from the dudi.coa function. They are the same.
However, we only want the centered matrix, we need to remove the weights in the denominator. Create diagonal matrices D_site and D_species of the row and column weights respectively. Then pre- and post-multiply the matrix P. This will yield a matrix L, where l_ij = p_ij – p_i+ p_j+
D_site <- diag(row.w) D_species <- diag(col.w) L <- D_site %*% as.matrix(P) %*% D_species
Now make the R’LQ matrix. First, center and standardize the columns of R and Q. The center is taken as the WEIGHTED average where the weights are the row weights (for the environment matrix) and species weights (for the species matrix).
# Calculate the weighted average for each trait and site traitAvg <- apply(traits, MAR=2, function(x) sum( x*col.w )/sum(col.w) ) envAvg <- apply(env, MAR=2, function(x) sum(x*row.w)/sum(row.w)) traitCent <- sweep(traits, 2, traitAvg) envCent <- sweep(env, 2, envAvg)
Calculate the weighted standard deviation. Since the values are now in deviations from the mean, the weighted variance is sum(x^2w) / sum(weights), and the standard deviation is the square root of this.
traitSD <- apply(traitCent, MAR=2, function(x) sqrt(sum(x^2 * col.w)/sum(col.w))) envSD <- apply(envCent, MAR=2, function(x) sqrt(sum(x^2 * row.w)/sum(row.w))) traitScale <- sweep(traitCent, MAR=2, traitSD, '/') envScale <- sweep(envCent, MAR=2, envSD, '/') R <- as.matrix(envScale) Q <- as.matrix(traitScale)
Next, V is just the R’ L Q product. This is actually the correlation matrix between the environment traits and the species traits, mediated by species abundances.
V <- t(R) %*% L %*% Q round(V, 3)
This is identical to the matrix operated on by the rlq() command. You can check this with by examining the table ($tab) returned by the rlq() function.
Next, get the cross-product matrix because the correlation matrix is not guaranteed to be either square or symmetric
Z <- crossprod(V, V)
The rest is a standard PCA-like eigen decomposition of V.
eigVals <- eigen(Z)$values eigVecs <- eigen(Z)$vectors sum(eigVals) ## THE SPECIES TRAIT LOADINGS ON EACH AXIS ARE THE EIGENVECTORS traitLoad <- data.frame(eigVecs) rownames(traitLoad) <- colnames(V) colnames(traitLoad) <- paste('Axis', 1:length(eigVals)) ## THE ENVIRONMENTAL TRAIT SCORES ARE CALCULATED EXACTLY AS IN PCA envScores <- data.frame(V %*% eigVecs) names(envScores) <- paste('Axis', 1:length(eigVals))
This gives the biplot: