bigcor: Large correlation matrices in R

February 22, 2013
By

(This article was first published on Rmazing, and kindly contributed to R-bloggers)

As I am working with large gene expression matrices (microarray data) in my job, it is sometimes important to look at the correlation in gene expression of different genes. It has been shown that by calculating the Pearson correlation between genes, one can identify (by high \varrho values, i.e. > 0.9) genes that share a common regulation mechanism such as being induced/repressed by the same transcription factors:

http://www.jbc.org/content/279/17/17905.long

I had an idea. How about using my microarray data of gene expression of 40000 genes in 28 samples and calculate the correlation between all 40000 genes (variables). For this blog entry, we will use simulated data but the message is the same. So let’s create a 40000 x 28 matrix with random normal values and calculate all correlations:

MAT <- matrix(rnorm(40000 * 28), nrow = 28)
cor(MAT)

“Fehler in cor(MAT) : kann Vektor der Länge 1600000000 nicht allozieren”

Ok, R can’t allocate the correlation matrix. No wonder actually:
Each number of class “double” will hold 32 bytes. For a 40000 x 40000 correlation matrix we have
(32 * 40000 * 40000)/1024/1024/1024 = 47.68 GB (!!) which definitely is more than most people’s RAM.

I came across the ‘ff’ package and here we can preallocate an empty 40000 x 40000 matrix:

corMAT <- ff(vmode = "single", dim = c(40000, 40000))

Nice. However, since cor doesn’t calculate a correlation matrix this size, what to do?
Answer: All-combinations-blockwise correlation (Hu?).
With a few sleepless nights and a little help (http://stackoverflow.com/questions/14911489/large-covariance-matrix-in-r) a new function was born: bigcor.

What does it do?
1) It splits a matrix with N columns into equal size blocks of A_1 ... A_nB_1 ... B_n, etc. Block size can be defined by user, 1000-2000 is a good value because cor can handle this quite quickly.
2) For all combinations of blocks, the correlation matrix is calculated, so A/A, A/B, B/B etc.
3) The blockwise correlation matrices are transferred into the preallocated empty matrix at the appropriate position (where the correlations would usually reside). To ensure symmetry around the diagonal, this is done twice in the upper and lower triangle.

This way, the N x N empty correlation matrix is filled like a checkerboard with patches of n x n correlation sub-matrices. In our case, in which we split the N = 40000 columns into n = 8 blocks, we obtain 36 combinations (combn(1:8, 2) + 8; + 8 because of A/A, B/B etc) of blocks with dimension 5000 x 5000 each. This gives 36 x 5000 x 5000 x 2 (filling both triangles) – 8 x 5000 x 5000 (because the diagonals are created twice) = 1.6E9 = 40000 x 40000.

The function:

bigcor <- function(x, nblocks = 10, verbose = TRUE, ...)
{
library(ff, quietly = TRUE)
NCOL <- ncol(x)

## test if ncol(x) %% nblocks gives remainder 0
if (NCOL %% nblocks != 0) stop("Choose different 'nblocks' so that ncol(x) %% nblocks = 0!")

## preallocate square matrix of dimension
## ncol(x) in 'ff' single format
corMAT <- ff(vmode = "single", dim = c(NCOL, NCOL))

## split column numbers into 'nblocks' groups
SPLIT <- split(1:NCOL, rep(1:nblocks, each = NCOL/nblocks))

## create all unique combinations of blocks
COMBS <- expand.grid(1:length(SPLIT), 1:length(SPLIT))
COMBS <- t(apply(COMBS, 1, sort))
COMBS <- unique(COMBS)

## iterate through each block combination, calculate correlation matrix
## between blocks and store them in the preallocated matrix on both
## symmetric sides of the diagonal
for (i in 1:nrow(COMBS)) {
COMB <- COMBS[i, ]
G1 <- SPLIT[[COMB[1]]]
G2 <- SPLIT[[COMB[2]]]
if (verbose) cat("Block", COMB[1], "with Block", COMB[2], "\n")
flush.console()
COR <- cor(MAT[, G1], MAT[, G2], ...)
corMAT[G1, G2] <- COR
corMAT[G2, G1] <- t(COR)
COR <- NULL
}

gc()
return(corMAT)
}

It is advised to choose ‘nblocks’ so that blocksize is max. 2000 – 5000. It should also be so that number of colums divided by ‘nblocks’ gives a remainder of 0, so with 10 colums, either 2 or 5 (10 %% 5 = 0, 10 %% 2 = 0), but the function stops if this is not the case.
An example with a small matrix:

> MAT <- matrix(rnorm(8 * 10), nrow = 10)
> res <- bigcor(MAT, nblocks = 2)
Block 1 with Block 1 
Block 1 with Block 2 
Block 2 with Block 2 
> res
ff (open) single length=64 (64) dim=c(8,8) dimorder=c(1,2)
             [,1]         [,2]         [,3]         [,4]         [,5]         [,6]         [,7]         [,8]
[1,]  1.000000000 -0.046641227  0.324235857  0.262680113 -0.665063083 -0.004026082 -0.606948912 -0.192466438
[2,] -0.046641227  1.000000000  0.506007075  0.440847754  0.380443692 -0.748274207 -0.396612495  0.124028035
[3,]  0.324235857  0.506007075  1.000000000  0.472342789 -0.095279060 -0.391388297 -0.298102200  0.412145644
[4,]  0.262680113  0.440847754  0.472342789  1.000000000 -0.001338097 -0.515028238 -0.566384733  0.500462592
[5,] -0.665063083  0.380443692 -0.095279060 -0.001338097  1.000000000 -0.162814990  0.173953563  0.235193089
[6,] -0.004026082 -0.748274207 -0.391388297 -0.515028238 -0.162814990  1.000000000  0.564979374 -0.124034598
[7,] -0.606948912 -0.396612495 -0.298102200 -0.566384733  0.173953563  0.564979374  1.000000000  0.186116025
[8,] -0.192466438  0.124028035  0.412145644  0.500462592  0.235193089 -0.124034598  0.186116025  1.000000000

Similar to original 'cor':
> cor(MAT)
             [,1]        [,2]        [,3]         [,4]         [,5]         [,6]       [,7]       [,8]
[1,]  1.000000000 -0.04664123  0.32423586  0.262680111 -0.665063068 -0.004026082 -0.6069489 -0.1924664
[2,] -0.046641225  1.00000000  0.50600710  0.440847754  0.380443686 -0.748274237 -0.3966125  0.1240280
[3,]  0.324235855  0.50600710  1.00000000  0.472342793 -0.095279062 -0.391388298 -0.2981022  0.4121456
[4,]  0.262680111  0.44084775  0.47234279  1.000000000 -0.001338097 -0.515028219 -0.5663847  0.5004626
[5,] -0.665063068  0.38044369 -0.09527906 -0.001338097  1.000000000 -0.162814989  0.1739536  0.2351931
[6,] -0.004026082 -0.74827424 -0.39138830 -0.515028219 -0.162814989  1.000000000  0.5649793 -0.1240346
[7,] -0.606948889 -0.39661248 -0.29810221 -0.566384734  0.173953557  0.564979348  1.0000000  0.1861160
[8,] -0.192466433  0.12402803  0.41214564  0.500462599  0.235193095 -0.124034600  0.1861160  1.0000000

Timings for a 20000 x 20000 matrix:
> MAT <- matrix(rnorm(20000 * 10), nrow = 10)
> system.time(res <- bigcor(MAT, nblocks = 10))
Block 1 with Block 1 
Block 1 with Block 2 
Block 1 with Block 3 
Block 1 with Block 4 
Block 1 with Block 5 
Block 1 with Block 6 
Block 1 with Block 7 
Block 1 with Block 8 
Block 1 with Block 9 
Block 1 with Block 10 
Block 2 with Block 2 
Block 2 with Block 3 
Block 2 with Block 4 
Block 2 with Block 5 
Block 2 with Block 6 
Block 2 with Block 7 
Block 2 with Block 8 
Block 2 with Block 9 
Block 2 with Block 10 
Block 3 with Block 3 
Block 3 with Block 4 
Block 3 with Block 5 
Block 3 with Block 6 
Block 3 with Block 7 
Block 3 with Block 8 
Block 3 with Block 9 
Block 3 with Block 10 
Block 4 with Block 4 
Block 4 with Block 5 
Block 4 with Block 6 
Block 4 with Block 7 
Block 4 with Block 8 
Block 4 with Block 9 
Block 4 with Block 10 
Block 5 with Block 5 
Block 5 with Block 6 
Block 5 with Block 7 
Block 5 with Block 8 
Block 5 with Block 9 
Block 5 with Block 10 
Block 6 with Block 6 
Block 6 with Block 7 
Block 6 with Block 8 
Block 6 with Block 9 
Block 6 with Block 10 
Block 7 with Block 7 
Block 7 with Block 8 
Block 7 with Block 9 
Block 7 with Block 10 
Block 8 with Block 8 
Block 8 with Block 9 
Block 8 with Block 10 
Block 9 with Block 9 
Block 9 with Block 10 
Block 10 with Block 10 
       User      System verstrichen 
      38.98        7.54       50.28 

Roughly a minute. I think that's o.k.
Comments welcome (improvements, speed, memory management, 
parallel execution...)

Cheers,
Andrej

Filed under: General Tagged: big matrix, cor, correlation

To leave a comment for the author, please follow the link and comment on his blog: Rmazing.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...



If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Comments are closed.