Drowning in a glass of water: variance-covariance and correlation matrices

[This article was first published on R on The broken bridge between biologists and statisticians, and kindly contributed to R-bloggers].

One of the easiest tasks in R is to get the correlation between each pair of variables in a dataset. As an example, let’s take the first four columns of the ‘mtcars’ dataset, which ships with R. Getting the variance-covariance matrix and the correlation matrix is straightforward.

data(mtcars)
matr <- mtcars[,1:4]

#Covariances
cov(matr)
##              mpg        cyl       disp        hp
## mpg    36.324103  -9.172379  -633.0972 -320.7321
## cyl    -9.172379   3.189516   199.6603  101.9315
## disp -633.097208 199.660282 15360.7998 6721.1587
## hp   -320.732056 101.931452  6721.1587 4700.8669
#Correlations
cor(matr)
##             mpg        cyl       disp         hp
## mpg   1.0000000 -0.8521620 -0.8475514 -0.7761684
## cyl  -0.8521620  1.0000000  0.9020329  0.8324475
## disp -0.8475514  0.9020329  1.0000000  0.7909486
## hp   -0.7761684  0.8324475  0.7909486  1.0000000

It’s really a piece of cake! Unfortunately, a few days ago I had a covariance matrix without the original dataset and I wanted the corresponding correlation matrix. Although this is an easy task as well, at first I was stuck, because I could not find an immediate solution… So I started wondering how I could do it.

Given two variables X and Y measured on n units, their sample covariance (the quantity returned by ‘cov()’ in R) is:

\[cov(X, Y) = \frac{1}{n - 1} \sum\limits_{i=1}^{n} {(X_i - \bar{X})(Y_i - \bar{Y})}\]

where \(\bar{X}\) and \(\bar{Y}\) are the means of the two variables. The correlation is:

\[cor(X, Y) = \frac{cov(X, Y)}{\sigma_x \sigma_y} \]

where \(\sigma_x\) and \(\sigma_y\) are the standard deviations for X and Y.

The opposite relationship is clear:

\[ cov(X, Y) = cor(X, Y) \sigma_x \sigma_y\]
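As a quick sanity check, the definitions can be applied by hand to two columns of ‘mtcars’ and compared with the built-in functions. This is only a minimal sketch; the object names (x, y, cov_xy, cor_xy) are illustrative. Note that ‘cov()’ in R uses the n - 1 denominator.

```r
# Covariance and correlation of two mtcars columns, computed from the definitions
x <- mtcars$mpg
y <- mtcars$disp
n <- length(x)

# Sample covariance: sum of cross-products of deviations, divided by n - 1
cov_xy <- sum((x - mean(x)) * (y - mean(y))) / (n - 1)

# Correlation: the covariance scaled by both standard deviations
cor_xy <- cov_xy / (sd(x) * sd(y))

# Both comparisons should agree with the built-in functions
all.equal(cov_xy, cov(x, y))
all.equal(cor_xy, cor(x, y))
```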

Therefore, converting from covariance to correlation is pretty easy. For example, take the covariance between ‘disp’ and ‘mpg’ above (-633.097208); dividing it by the square roots of the two variances gives the correlation:

-633.097208 / (sqrt(36.324103) * sqrt(15360.7998))
## [1] -0.8475514

In the reverse direction, if we have the correlation (-0.8475514), the covariance is:

-0.8475514 * sqrt(36.324103) * sqrt(15360.7998)
## [1] -633.0972

My covariance matrix was pretty large, so I started wondering how I could convert the whole matrix at once. What I had to do was take each element of the covariance matrix and divide it by the square roots of the two diagonal elements that lie in the same row and in the same column (see below).

This is easily done by element-wise multiplication (the ‘*’ operator in R, not the matrix product ‘%*%’). I need a square matrix where the standard deviation of each variable is repeated along its row:

V <- cov(matr)
SM1 <- matrix(rep(sqrt(diag(V)), 4), 4, 4)
SM1
##            [,1]       [,2]       [,3]       [,4]
## [1,]   6.026948   6.026948   6.026948   6.026948
## [2,]   1.785922   1.785922   1.785922   1.785922
## [3,] 123.938694 123.938694 123.938694 123.938694
## [4,]  68.562868  68.562868  68.562868  68.562868

and another one where each standard deviation is repeated along its column:

SM2 <- matrix(rep(sqrt(diag(V)), each = 4), 4, 4)
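To see how the two helper matrices relate, here is a small sketch (the vector name s is my own shorthand for the standard deviations). It checks that the second matrix is just the transpose of the first, and that their element-wise product equals the outer product of the standard deviations, so that cell [i, j] holds the product of the i-th and j-th standard deviations.

```r
V <- cov(mtcars[, 1:4])
s <- sqrt(diag(V))                       # standard deviations of mpg, cyl, disp, hp

SM1 <- matrix(rep(s, 4), 4, 4)           # element [i, j] holds s[i]
SM2 <- matrix(rep(s, each = 4), 4, 4)    # element [i, j] holds s[j]

# SM2 is the transpose of SM1
all.equal(SM2, t(SM1))

# Their element-wise product has s[i] * s[j] in cell [i, j];
# outer() carries dimnames, so attributes are ignored in the comparison
all.equal(SM1 * SM2, outer(s, s), check.attributes = FALSE)
```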

Now I can take my covariance matrix (V) and divide it, element by element, by the product of these two matrices:

V / (SM1 * SM2)
##             mpg        cyl       disp         hp
## mpg   1.0000000 -0.8521620 -0.8475514 -0.7761684
## cyl  -0.8521620  1.0000000  0.9020329  0.8324475
## disp -0.8475514  0.9020329  1.0000000  0.7909486
## hp   -0.7761684  0.8324475  0.7909486  1.0000000

In fact, there is no need to use ‘rep’ when creating SM1, as ‘matrix()’ recycles the elements as needed.
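The recycling shortcut can be sketched as follows. The names s and R_manual are illustrative; the rest uses only base R.

```r
V <- cov(mtcars[, 1:4])
s <- sqrt(diag(V))

# matrix() recycles the 4 standard deviations to fill all 16 cells, column by column
SM1 <- matrix(s, 4, 4)

# The transpose puts the standard deviations along the columns instead
SM2 <- t(SM1)

# Same conversion as before, checked against cor() on the raw data
R_manual <- V / (SM1 * SM2)
all.equal(R_manual, cor(mtcars[, 1:4]))
```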

Going from correlation to covariance can be done similarly:

R <- cor(matr)
R * SM1 * SM2
##              mpg        cyl       disp        hp
## mpg    36.324103  -9.172379  -633.0972 -320.7321
## cyl    -9.172379   3.189516   199.6603  101.9315
## disp -633.097208 199.660282 15360.7998 6721.1587
## hp   -320.732056 101.931452  6721.1587 4700.8669
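Both directions can be wrapped in two small helpers. This is only a sketch: ‘cov_to_cor()’ and ‘cor_to_cov()’ are hypothetical names of my own, not functions from any package. One thing worth noting is that the correlation matrix alone is not enough to get back to covariances: the standard deviations must be supplied separately, because they are not stored in the correlation matrix.

```r
# Hypothetical helpers (names are illustrative, not from any package)
cov_to_cor <- function(V) {
  s <- sqrt(diag(V))       # standard deviations from the diagonal
  V / outer(s, s)          # divide cell [i, j] by s[i] * s[j]
}

cor_to_cov <- function(R, s) {
  R * outer(s, s)          # multiply cell [i, j] by s[i] * s[j]
}

V <- cov(mtcars[, 1:4])
s <- sqrt(diag(V))
R <- cov_to_cor(V)

# Round trip: covariance -> correlation -> covariance
all.equal(R, cor(mtcars[, 1:4]))
all.equal(cor_to_cov(R, s), V)

# The same conversion written as a true matrix product, V = D R D,
# where D is the diagonal matrix of standard deviations
D <- diag(s)
all.equal(D %*% R %*% D, V, check.attributes = FALSE)
```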

This is an easy task, but it got me stuck for a few minutes…

Later, I finally discovered that there is (at least) one function in R that takes care of the above task: the ‘cov2cor()’ function in the ‘stats’ package, which is attached by default, so no extra package needs to be loaded.

cov2cor(V)
##             mpg        cyl       disp         hp
## mpg   1.0000000 -0.8521620 -0.8475514 -0.7761684
## cyl  -0.8521620  1.0000000  0.9020329  0.8324475
## disp -0.8475514  0.9020329  1.0000000  0.7909486
## hp   -0.7761684  0.8324475  0.7909486  1.0000000
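Since my original problem was a covariance matrix without the data, here is a small sketch of that situation: a 2 x 2 covariance matrix typed in by hand from the ‘mpg’ and ‘cyl’ values printed at the top of this post (the object names V2 and R2 are illustrative). The off-diagonal element of the result should match the mpg-cyl correlation shown earlier, up to the rounding of the typed values.

```r
# A covariance matrix typed in by hand, with no dataset behind it
V2 <- matrix(c(36.324103, -9.172379,
               -9.172379,  3.189516),
             nrow = 2, byrow = TRUE,
             dimnames = list(c("mpg", "cyl"), c("mpg", "cyl")))

R2 <- cov2cor(V2)
R2

# Same result from the element-wise formula
s2 <- sqrt(diag(V2))
all.equal(R2, V2 / outer(s2, s2))
```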

It is really easy to drown in a glass of water!

