(This article was first published on **Strange Attractors » R**, and kindly contributed to R-bloggers)

In a recent LinkedIn conversation, the topic of correlation between multiple financial indices was raised. While the actual details are not relevant, the discussion reminded me of one of the concerns I have whenever multivariate correlation is used—how to populate the correlation matrix.

First, some background. Unfortunately, most financial random variables are **not** normally distributed: they take more extreme values and have thicker tails than the normal distribution does. When dealing with a joint distribution of multiple random variables, each with thin-tailed marginals, the problem is compounded, as the joint distribution of those thin-tailed marginals has no chance of being thick-tailed. When dealing with financial variables, another family of multivariate copula should usually be considered.
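As a rough illustration of the tail point, the sketch below simulates a bivariate normal and a bivariate t with the same correlation and counts how often both components exceed their own 99th percentile at the same time. The sample size, the correlation of 0.5, the 3 degrees of freedom, and the helper `joint_tail` are all assumptions chosen for illustration.

```r
set.seed(123)
n   <- 1e5   # assumed sample size
rho <- 0.5   # assumed correlation
df  <- 3     # assumed degrees of freedom for the t

# Bivariate normal with correlation rho
z1 <- rnorm(n)
z2 <- rho * z1 + sqrt(1 - rho^2) * rnorm(n)

# Bivariate t: the same normals divided by a shared chi-square mixing variable
w  <- sqrt(df / rchisq(n, df))
t1 <- z1 * w
t2 <- z2 * w

# Share of observations where both components exceed their own p-quantile
joint_tail <- function(a, b, p = 0.99) {
  mean(a > quantile(a, p) & b > quantile(b, p))
}

tail_normal <- joint_tail(z1, z2)
tail_t      <- joint_tail(t1, t2)
c(normal = tail_normal, t = tail_t)
```

Because each series is compared against its own empirical quantile, the difference reflects the dependence structure rather than the marginals; joint extremes should be noticeably more frequent under the t.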

Nevertheless, assuming we are dealing with an elliptical copula (thicker tailed than the normal or not), the correlation matrix needs to be populated. When people discuss correlation, they almost always mean linear correlation, or, more precisely, the Pearson product-moment correlation coefficient. This correlation serves naturally for the normal and multivariate normal distributions. However, linear correlation is not necessarily the best metric when dealing with copulas. The Pearson product-moment correlation is notoriously sensitive to outliers. Moreover, it isn’t even a true measure of concordance. According to Scarsini’s axioms, if variables are absolutely co-monotonic, their measure of concordance must be 100%. Now consider the case where we have two vectors of positive variables, \(\vec{X}\) and \(\vec{Y}\). If we let \(X_i = Y_i\) then the Pearson correlation coefficient is 100%, as it should be. Now define a new variable \(Z = \ln(X)\). The natural logarithm is a strictly increasing function, so if \(X\) increases, \(Z\) *must* increase. Nevertheless, the Pearson correlation between \(X\) and \(Z\) is not 100%, since the increase is not linear, showing that the Pearson correlation is not a true measure of concordance.
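A minimal sketch of the logarithm example, using base R's `cor()`. The exponential sample is an assumption made only so that \(X\) is strictly positive; any positive, tie-free data would do.

```r
set.seed(42)
x <- rexp(1000)   # assumed data: strictly positive, so log(x) is defined
z <- log(x)       # strictly increasing transform of x

pearson  <- cor(x, z, method = "pearson")   # linear correlation of the values
spearman <- cor(x, z, method = "spearman")  # linear correlation of the ranks
kendall  <- cor(x, z, method = "kendall")   # concordant vs. discordant pairs

c(pearson = pearson, spearman = spearman, kendall = kendall)
```

Both rank-based measures come out at 1 because the transform preserves the ordering of the data, while the Pearson coefficient falls short of 1.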

When dealing with copulæ, we want a measure of correlation that is a true measure of concordance and not unduly affected by outliers. There are two other common measures of correlation which have both of these qualities: Spearman’s \(\rho\) and Kendall’s \(\tau\). Spearman’s correlation can be thought of as the linear correlation of the ranks of the data, as opposed to their values. Kendall’s rank correlation can be thought of as the excess of the proportion of concordant pairs over the proportion of discordant pairs, which is negative if the discordant pairs outnumber the concordant ones. Of the two, Kendall’s \(\tau\) is more frequently encountered when dealing with copulæ, as there is a direct functional relationship between its value and both the generating function of Archimedean copulæ and the correlation of any elliptical copula, a family of which both the multivariate normal and multivariate t copulæ are members. The relationship for elliptical copulæ is \(\tau = \frac{2}{\pi}\arcsin \rho\), so given the Kendall \(\tau\) value we can calculate the needed correlation as \(\rho = \sin\left(\frac{\pi}{2}\tau\right)\). This allows us to calculate the pairwise Kendall \(\tau\) value for each pair of variables and convert it to the corresponding \(\rho\) for use in the elliptical copula we choose.
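The conversion can be sketched in a couple of lines of base R. The data matrix `X` below is made up purely for illustration (three columns, two of which share a common normal factor); only `cor(..., method = "kendall")` and the sine transform come from the relationship above.

```r
set.seed(1)
n <- 500
# Hypothetical data: columns a and b share a common factor, c is independent
common <- rnorm(n)
X <- cbind(a = common + rnorm(n),
           b = common + rnorm(n),
           c = rnorm(n))

tau <- cor(X, method = "kendall")  # matrix of pairwise Kendall tau values
rho <- sin(pi / 2 * tau)           # elementwise conversion to elliptical rho

round(rho, 3)
```

The diagonal maps to exactly 1 because \(\sin(\pi/2) = 1\). Since each entry is converted on its own, nothing in this step guarantees that the assembled matrix is a valid correlation matrix.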

This leads us to another potential problem: it is not always the case that the matrix composed of the pairwise converted Kendall \(\tau\) values is itself a valid correlation matrix. Correlation matrices have to be positive semidefinite. A correlation matrix is simply a scaled covariance matrix, and the latter *must* be positive semidefinite, as the variance of any linear combination of the random variables must be non-negative.
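To make the problem concrete, here is a small made-up example: a 3×3 matrix whose pairwise entries each look like a plausible correlation but which cannot all hold at once. The matrix `M` and the weight vector `w` are assumptions for illustration. The check itself just looks at the sign of the smallest eigenvalue.

```r
# Hypothetical pseudo-correlation matrix: 1 and 2 move together, 1 and 3 move
# together, yet 2 and 3 are strongly opposed
M <- matrix(c(1.0,  0.9,  0.9,
              0.9,  1.0, -0.9,
              0.9, -0.9,  1.0), nrow = 3, byrow = TRUE)

ev     <- eigen(M, symmetric = TRUE, only.values = TRUE)$values
is_psd <- all(ev >= -1e-8)   # small tolerance for floating-point error

# "Variance" that M would imply for X1 - X2 - X3 with unit variances
w <- c(1, -1, -1)
implied_var <- drop(t(w) %*% M %*% w)

list(eigenvalues = ev, is_psd = is_psd, implied_var = implied_var)
```

The eigenvalues are 1.9, 1.9 and -0.8, and the implied variance of the combination is -2.4, which is impossible for real random variables.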

There are a number of ways to adjust these matrices so that they are positive semidefinite. The method I tend to use is one based on eigenvalues. This method has better properties than simpler shrinking methods and is easier to apply than scaling methods, all of which are described and discussed in . The eigenvalue method decomposes the pseudo-correlation matrix into its eigenvectors and eigenvalues and then achieves positive semidefiniteness by flooring all negative eigenvalues at 0. If truly positive definite matrices are needed, the negative eigenvalues can instead be replaced with a small positive number. Afterwards, the matrix is recomposed from the old eigenvectors and the new eigenvalues, and then scaled so that the diagonal entries are all 1s. A simple R function which reads in a pseudo-correlation matrix and returns a positive semidefinite correlation matrix after adjusting the eigenvalues and rescaling is:

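    CorrectCM <- function(CM) {
      n <- nrow(CM)
      # Decompose the pseudo-correlation matrix
      E <- eigen(CM)
      # Recompose with negative eigenvalues floored at 0
      CM1 <- E$vectors %*% tcrossprod(diag(pmax(E$values, 0), n), E$vectors)
      # Rescale so that the diagonal entries are all 1
      Balance <- diag(1 / sqrt(diag(CM1)))
      CM2 <- Balance %*% CM1 %*% Balance
      return(CM2)
    }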

To see it in action, here is a pseudo-correlation matrix:

    ##         [,1]     [,2]     [,3]     [,4]    [,5]
    ## [1,]  1.0000 -0.15548 -0.15639  0.17273  0.7062
    ## [2,] -0.1555  1.00000 -0.16063  0.07662  0.1895
    ## [3,] -0.1564 -0.16063  1.00000  0.08714  0.3326
    ## [4,]  0.1727  0.07662  0.08714  1.00000 -0.5027
    ## [5,]  0.7062  0.18945  0.33256 -0.50269  1.0000

The initial eigenvalues are:

    ## $values
    ## [1]  1.8123  1.1800  1.1569  0.9974 -0.1465
    ##
    ## $vectors
    ##           [,1]     [,2]     [,3]    [,4]    [,5]
    ## [1,]  0.549247  0.63648 -0.04933  0.1147 -0.5269
    ## [2,]  0.006179 -0.03264  0.67403 -0.6952 -0.2475
    ## [3,]  0.164849 -0.36532 -0.67875 -0.5254 -0.3203
    ## [4,] -0.329193  0.67247 -0.28180 -0.4493  0.3977
    ## [5,]  0.750164 -0.09036  0.05606 -0.1600  0.6327

After applying the correction, the correlation matrix is now:

    ##         [,1]     [,2]    [,3]     [,4]    [,5]
    ## [1,]  1.0000 -0.13310 -0.1281  0.13765  0.6263
    ## [2,] -0.1331  1.00000 -0.1473  0.06122  0.1611
    ## [3,] -0.1281 -0.14725  1.0000  0.06720  0.2922
    ## [4,]  0.1376  0.06122  0.0672  1.00000 -0.4476
    ## [5,]  0.6263  0.16112  0.2922 -0.44759  1.0000

and its eigenvalues are:

    ## $values
    ## [1] 1.729e+00 1.148e+00 1.141e+00 9.824e-01 2.887e-15
    ##
    ## $vectors
    ##           [,1]     [,2]     [,3]    [,4]    [,5]
    ## [1,] -0.546460  0.55609 -0.32051  0.1084 -0.5269
    ## [2,] -0.002108  0.26674  0.63673 -0.6812 -0.2437
    ## [3,] -0.171375 -0.61762 -0.44239 -0.5416 -0.3164
    ## [4,]  0.338972  0.48589 -0.53656 -0.4534  0.3944
    ## [5,] -0.746395 -0.04542  0.09076 -0.1590  0.6382
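For cases where a strictly positive definite matrix is needed, the following is one possible sketch of the "small positive number" variant described earlier. The function name `CorrectCMFloor` and its `floor` argument are my own labels for this illustration, and the input matrix is retyped from the rounded printout above, so results will only match the printed figures to about three decimal places.

```r
# Hypothetical variant: floor the eigenvalues at a small positive number
CorrectCMFloor <- function(CM, floor = 1e-8) {
  E    <- eigen(CM, symmetric = TRUE)
  vals <- pmax(E$values, floor)
  # Recompose: V %*% diag(vals) %*% t(V), written without building diag()
  CM1  <- E$vectors %*% (vals * t(E$vectors))
  # Rescale to a unit diagonal: same as diag(s) %*% CM1 %*% diag(s)
  s    <- 1 / sqrt(diag(CM1))
  CM2  <- CM1 * tcrossprod(s)
  (CM2 + t(CM2)) / 2   # remove floating-point asymmetry
}

# Pseudo-correlation matrix from the printout (upper triangle, row by row)
upper <- c(-0.15548, -0.15639, 0.17273,  0.7062,
           -0.16063,  0.07662, 0.18945,
            0.08714,  0.33256,
           -0.50269)
P <- diag(5)
P[lower.tri(P)] <- upper      # column-wise fill matches the row-wise upper triangle
P <- P + t(P) - diag(5)       # mirror into the upper triangle, keep unit diagonal

fixed      <- CorrectCMFloor(P, floor = 1e-6)
min_eigen  <- min(eigen(fixed, symmetric = TRUE, only.values = TRUE)$values)
max_change <- max(abs(fixed - P))
```

Rescaling by a positive diagonal matrix cannot turn a positive eigenvalue negative, so the result stays positive definite, although the smallest eigenvalue no longer equals `floor` exactly. `max_change` gives a quick sense of how far the correction moved the largest entry.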

One of these days I’d like to implement the scaling technique discussed in and compare the results. I’d be interested in hearing if anyone else has some experience or anecdotes about adjusting correlation matrices.
