(This article was first published on **Analysis with Programming**, and kindly contributed to R-bloggers)

To study this, it helps to split the variables into two groups and examine the relationship between the two sets. This can be done with canonical correlation analysis (CCA). An example from the health sciences (from Reference 2) involves variables related to exercise and health. On one hand, you have variables associated with exercise: observations such as the climbing rate on a stair stepper, how fast you can run, the amount of weight lifted on the bench press, the number of push-ups per minute, etc. On the other hand, you might have health variables such as blood pressure, cholesterol levels, glucose levels, body mass index, etc. So two types of variables are measured, and the relationships between the exercise variables and the health variables are to be studied.

### Methodology

Mathematically, the procedure is as follows:

- Divide the random variables into two groups, and assign these to the following random vectors: \begin{equation}\nonumber \mathbf{X} = [X_1,X_2,\cdots, X_p]^T\;\text{and}\;\mathbf{Y} = [Y_1,Y_2,\cdots, Y_q]^T \end{equation}
- Analogous to principal component analysis (PCA), we aim to find linear combinations \begin{equation}\nonumber \begin{aligned} U_1 = &\mathbf{a}_1^T\mathbf{X} = a_{11}X_1 + a_{12}X_2+\cdots + a_{1p}X_p\\ U_2 = &\mathbf{a}_2^T\mathbf{X} = a_{21}X_1 + a_{22}X_2+\cdots + a_{2p}X_p\\ &\qquad\quad\qquad\vdots\qquad\qquad\vdots\\ U_p = &\mathbf{a}_p^T\mathbf{X} = a_{p1}X_1 + a_{p2}X_2+\cdots + a_{pp}X_p \end{aligned} \end{equation} and \begin{equation}\nonumber \begin{aligned} V_1 = &\mathbf{b}_1^T\mathbf{Y}=b_{11}Y_1 + b_{12}Y_2+\cdots + b_{1q}Y_q\\ V_2 = &\mathbf{b}_2^T\mathbf{Y}=b_{21}Y_1 + b_{22}Y_2+\cdots + b_{2q}Y_q\\ &\qquad\quad\qquad\vdots\qquad\qquad\vdots\\ V_q = &\mathbf{b}_q^T\mathbf{Y}=b_{q1}Y_1 + b_{q2}Y_2+\cdots + b_{qq}Y_q \end{aligned} \end{equation} that will maximize the correlation \begin{equation}\nonumber Corr(U_i,V_i)=\frac{Cov(U_i,V_i)}{\sqrt{Var(U_i)}\sqrt{Var(V_i)}},\quad i=1,2,\cdots,n \end{equation} where $n = \min{(p, q)}$.
- The first pair of canonical variables is defined by \begin{equation}\nonumber Corr(U_1, V_1)=\rho_1=\sqrt{\rho_1^2}, \end{equation} where $\rho_1$, the first canonical correlation, is the square root of the largest of the eigenvalues $\rho_1^2\geq \rho_2^2\geq \cdots \geq \rho_n^2$ of the matrix $\mathbf{\Sigma}_{XX}^{-1/2}\mathbf{\Sigma}_{XY}\mathbf{\Sigma}_{YY}^{-1}\mathbf{\Sigma}_{XY}^{T}\mathbf{\Sigma}_{XX}^{-1/2}$. Here $\mathbf{\Sigma}_{XX}$ is the variance-covariance matrix of $\mathbf{X}$; $\mathbf{\Sigma}_{YY}$ is the variance-covariance matrix of $\mathbf{Y}$; and $\mathbf{\Sigma}_{XY}$ is the cross-covariance matrix between $\mathbf{X}$ and $\mathbf{Y}$. The second pair of canonical variables is then given by \begin{equation}\nonumber Corr(U_2, V_2)=\rho_2=\sqrt{\rho_2^2}, \end{equation} and so on.
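
The eigenvalue characterisation above can be checked numerically. The following is a minimal sketch, not the original analysis: it uses simulated data (the sizes `N`, `p`, `q` and the latent matrix `Z` are assumptions made for illustration), replaces the population covariance matrices with their sample estimates, and compares the result with base R's `cancor`.

```r
# Simulated stand-in data (an assumption for illustration): two variable sets
# that share two latent signals, with p = 3 and q = 4.
set.seed(1)
N <- 500; p <- 3; q <- 4
Z <- matrix(rnorm(N * 2), N, 2)
X <- Z %*% matrix(rnorm(2 * p), 2, p) + matrix(rnorm(N * p), N, p)
Y <- Z %*% matrix(rnorm(2 * q), 2, q) + matrix(rnorm(N * q), N, q)

# Sample estimates of the covariance blocks.
Sxx <- cov(X)
Syy <- cov(Y)
Sxy <- cov(X, Y)   # p x q cross-covariance between X and Y

# Inverse symmetric square root of Sxx via its eigendecomposition.
ex <- eigen(Sxx, symmetric = TRUE)
Sxx_inv_half <- ex$vectors %*% diag(1 / sqrt(ex$values)) %*% t(ex$vectors)

# Matrix whose eigenvalues are the squared canonical correlations.
M <- Sxx_inv_half %*% Sxy %*% solve(Syy) %*% t(Sxy) %*% Sxx_inv_half
M <- (M + t(M)) / 2   # remove rounding asymmetry before the symmetric solver

rho_sq <- eigen(M, symmetric = TRUE)$values[seq_len(min(p, q))]
rho <- sqrt(pmax(rho_sq, 0))

# Cross-check against base R's cancor().
rho_cancor <- cancor(X, Y)$cor
print(cbind(eigen = rho, cancor = rho_cancor))
```

The two columns should agree up to numerical error, since the scaling of the sample covariance cancels out in the correlation.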

For a more detailed theory of CCA, please refer to References 1 and 2 below. To continue, let’s apply this methodology to an image. We will use the Grass data from Bajorski (2012) and analyse it using R. Below is a description of the data.

### Data

The Grass data is a 64-by-64-pixel spectral image of a grass texture. Each pixel is represented by a spectral reflectance curve in 42 spectral bands, with reflectance given in percent.
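
To make this layout concrete, here is a small sketch of how such an image might be held in R. It assumes, hypothetically, a 4096-by-42 matrix with one row per pixel and one column per band, stored column by column; the object `grass_sim` is random stand-in data, not the Grass data, and `band_image` is a helper introduced only for this illustration.

```r
# Hypothetical stand-in for the Grass data: 4096 pixels (rows) by 42 bands
# (columns), with reflectance in percent. The values are random.
set.seed(42)
n_side <- 64; n_band <- 42
grass_sim <- matrix(runif(n_side^2 * n_band, min = 0, max = 100),
                    nrow = n_side^2, ncol = n_band)

# Reshape one band (a column of 4096 reflectances) into a 64 x 64 image matrix.
# Column-major pixel ordering is an assumption about the storage layout.
band_image <- function(data, k, side = 64) {
  matrix(data[, k], nrow = side, ncol = side)
}

img1 <- band_image(grass_sim, 1)
dim(img1)

# Draw the first 12 bands in a 3 x 4 grid when run interactively.
if (interactive()) {
  op <- par(mfrow = c(3, 4), mar = c(1, 1, 2, 1))
  for (k in 1:12) {
    image(band_image(grass_sim, k), col = gray.colors(256),
          axes = FALSE, main = paste("Band", k))
  }
  par(op)
}
```

Each row of `grass_sim` plays the role of one pixel's spectral signature, and each column is one band viewed as an image.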

### Analysis

To begin, let’s display the data in image form:

The code generates images of the first 12 spectral bands of the data, where we observe a significant change in the brightness of the twelfth band compared to the first band. The signature of all pixels across these bands is shown below:

Inspecting the above plot suggests that almost all bands are correlated; that is, if the reflectance of a given pixel in the $i$th band increases (or decreases), its reflectance in the $j$th band, $i\neq j$, is also expected to increase (or decrease), except in bands 30 and 31, where there seems to be no clear pattern. But that is subjective: we cannot tell exactly, because there are 4096 signatures (lines in the plot) that are likely to overlap and hide other important information. So, to see the relationship between all variables properly, here is the correlation matrix of all the spectral bands:
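
A correlation matrix of this kind can be computed with `cor` and drawn with `image`. The sketch below is only an illustration under stated assumptions: `bands` is simulated data in which every band shares one common signal, so the bands are correlated by construction, and the `cm.colors` palette is a guess at a cyan-to-magenta colour scheme rather than the setting used for the original figure.

```r
# Simulated bands (an assumption): each band is a shared signal plus noise,
# so the theoretical correlation between any two bands is 100 / 125 = 0.8.
set.seed(42)
n_pix <- 64 * 64; n_band <- 42
shared <- rnorm(n_pix)
bands <- sapply(seq_len(n_band),
                function(j) 50 + 10 * shared + rnorm(n_pix, sd = 5))
colnames(bands) <- paste0("band", seq_len(n_band))

# 42 x 42 correlation matrix of the spectral bands.
R <- cor(bands)

# Summarise the off-diagonal correlations instead of reading them off a plot.
off_diag <- R[upper.tri(R)]
print(summary(off_diag))

# Heat map of the correlation matrix when run interactively.
if (interactive()) {
  image(seq_len(n_band), seq_len(n_band), R, col = cm.colors(64),
        xlab = "Band", ylab = "Band", main = "Correlation between bands")
}
```

Summarising `off_diag` gives a numerical counterpart to the visual impression of how much of the matrix is highly correlated.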

The cyan colour covering almost 60 percent of the region indicates high correlation between the corresponding spectral bands, while the pronounced fuchsia colour indicates low correlation between those bands. Now let’s divide this data into two. From 42 bands we could form two equal sets of variables (each with 21 dimensions), but for the purpose of illustration we’ll consider unequal sets: the first 15 bands form the first group and the remaining bands 16 – 42 form the second group, hence $p=15$ and $q=27$. There are therefore $\min(p,q)=n=15$ pairs of canonical variables. Applying CCA, we have:
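
The split and the call to `cancor` can be sketched as follows. This uses simulated stand-in data rather than the Grass image (the matrix `bands` and its five latent signals are assumptions for illustration), so the numbers it prints will differ from those discussed in the text.

```r
# Simulated 4096 x 42 band matrix driven by five latent signals (an assumption).
set.seed(42)
n_pix <- 64 * 64; n_band <- 42
latent <- matrix(rnorm(n_pix * 5), n_pix, 5)
bands <- latent %*% matrix(rnorm(5 * n_band), 5, n_band) +
  matrix(rnorm(n_pix * n_band), n_pix, n_band)

# First group: bands 1 to 15 (p = 15); second group: bands 16 to 42 (q = 27).
X <- bands[, 1:15]
Y <- bands[, 16:42]

# Canonical correlation analysis with base R's cancor().
cc <- cancor(X, Y)

# One canonical correlation per pair: min(p, q) = 15 values, largest first.
print(length(cc$cor))
print(round(cc$cor, 4))
```

Because the simulated bands share only five latent signals, only the leading canonical correlations are expected to be large here; with real data the pattern depends on the image.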

The numerical output above consists of the $n=15$ canonical correlations. As we can see, the first five canonical correlations are very large, implying that the linear combinations in each of the first five pairs of canonical variables are highly correlated with each other. The subsequent correlations can be interpreted in a similar way. Next, we’ll examine the coefficients of the first five canonical variables to see which bands contribute most to these canonical correlations. The `cancor` function returns the following components:

- `cor` – correlations;
- `xcoef` – estimated coefficients for the x variables;
- `ycoef` – estimated coefficients for the y variables;
- `xcenter` – the values used to adjust the x variables; and,
- `ycenter` – the values used to adjust the y variables.

We are interested in `xcoef` and `ycoef`, and so the plot of the coefficients of the first three $U_i$ and $V_i$ random variables is shown below:

A closer look at the plot of the coefficients of the first three $U_i$ shows the loadings fluctuating between negative and positive values, so that $U_1$, $U_2$, and $U_3$ are contrasts of the spectral bands. A similar situation is observed in the plot of the coefficients of the first three $V_i$, and because of that we cannot give a more specific interpretation of these bands.
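
To see how `xcoef` and `ycoef` relate to the canonical variables, the sketch below rebuilds $U_1, U_2, U_3$ and $V_1, V_2, V_3$ from the coefficients and counts how often the coefficients of each $U_i$ change sign along the band index. It runs on simulated stand-in data (the matrix `bands` is an assumption), so the sign pattern is illustrative only.

```r
# Simulated 4096 x 42 band matrix driven by five latent signals (an assumption).
set.seed(42)
n_pix <- 64 * 64; n_band <- 42
latent <- matrix(rnorm(n_pix * 5), n_pix, 5)
bands <- latent %*% matrix(rnorm(5 * n_band), 5, n_band) +
  matrix(rnorm(n_pix * n_band), n_pix, n_band)
X <- bands[, 1:15]
Y <- bands[, 16:42]
cc <- cancor(X, Y)

# Coefficient vectors a_1..a_3 and b_1..b_3 (columns of xcoef and ycoef).
a <- cc$xcoef[, 1:3]
b <- cc$ycoef[, 1:3]

# Canonical variates: centre each group, then apply the coefficients.
U <- sweep(X, 2, cc$xcenter) %*% a
V <- sweep(Y, 2, cc$ycenter) %*% b

# Corr(U_i, V_i) should reproduce the first three canonical correlations.
uv_cor <- diag(cor(U, V))
print(rbind(from_variates = uv_cor, from_cancor = cc$cor[1:3]))

# Number of sign changes in the coefficients of each U_i across bands 1 to 15;
# many sign changes indicate a contrast between bands.
sign_changes <- colSums(diff(sign(a)) != 0)
print(sign_changes)

# Plot the coefficients when run interactively.
if (interactive()) {
  matplot(a, type = "b", pch = 1:3, xlab = "Band", ylab = "Coefficient",
          main = "Coefficients of U1, U2, U3")
}
```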

### Test of Canonical Dimension

The dimension of the canonical variates above is $n = 15$; let’s check whether all of these are statistically significant. We’ll use the CCP (Significance Tests for Canonical Correlation Analysis) R package, which contains the `p.asym` function that will do the job for us.

The above output tells us that, at the 0.05 level of significance, only the first 13 of the 15 canonical dimensions are significant.
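
For readers who want to see what a sequential dimension test computes, here is a base-R sketch using Bartlett's chi-square approximation to Wilks' lambda. This is a swapped-in technique for illustration, not the computation performed by `p.asym`, so its p-values need not match the CCP output. The function name `bartlett_cca` and the simulated data are assumptions.

```r
# Sequential test of canonical dimensions using Bartlett's chi-square
# approximation. Row k tests H0: canonical correlations k, ..., n are all zero.
bartlett_cca <- function(rho, N, p, q) {
  n <- length(rho)
  k <- 0:(n - 1)
  # Wilks' lambda after removing the first k canonical correlations.
  lambda <- sapply(k, function(j) prod(1 - rho[(j + 1):n]^2))
  stat <- -(N - 1 - (p + q + 1) / 2) * log(lambda)
  df <- (p - k) * (q - k)
  data.frame(dims = paste(k + 1, "to", n),
             wilks = lambda,
             chisq = stat,
             df = df,
             p.value = pchisq(stat, df, lower.tail = FALSE))
}

# Simulated example (an assumption): two shared signals, p = 3 and q = 4.
set.seed(7)
N <- 1000; p <- 3; q <- 4
z <- matrix(rnorm(N * 2), N, 2)
X <- cbind(z, 0) + matrix(rnorm(N * p), N, p)
Y <- cbind(z, 0, 0) + matrix(rnorm(N * q), N, q)

rho <- cancor(X, Y)$cor
res <- bartlett_cca(rho, N, p, q)
print(res)
```

Reading the table from the top, the number of significant canonical dimensions is the number of leading rows whose p-value falls below the chosen level.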

For more on CCA using R, please check Reference 3. If you want to perform it in SAS, you might want to check Reference 2, and for more on imaging I suggest Reference 1.

### References

- Bajorski, P. (2012). *Statistics for Imaging, Optics, and Photonics*. John Wiley & Sons, Inc.
- Stat 505 – Applied Multivariate Statistical Analysis. *Lesson 8: Canonical Correlation Analysis*. Eberly College of Science, Pennsylvania State University (Penn State). (accessed January 2, 2015)
- R Data Analysis Examples: Canonical Correlation Analysis. UCLA: Statistical Consulting Group. From http://www.ats.ucla.edu/stat/r/dae/canonical.htm (accessed January 4, 2015)
