bigcor: Large correlation matrices in R
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
As I am working with large gene expression matrices (microarray data) in my job, it is sometimes important to look at the correlation in gene expression of different genes. It has been shown that by calculating the Pearson correlation between genes, one can identify (by high values, i.e. > 0.9) genes that share a common regulation mechanism such as being induced/repressed by the same transcription factors:
http://www.jbc.org/content/279/17/17905.long
I had an idea. How about using my microarray data of gene expression of 40000 genes in 28 samples and calculate the correlation between all 40000 genes (variables). For this blog entry, we will use simulated data but the message is the same. So let’s create a 40000 x 28 matrix with random normal values and calculate all correlations:
MAT <- matrix(rnorm(40000 * 28), nrow = 28) cor(MAT)
“Fehler in cor(MAT) : kann Vektor der Länge 1600000000 nicht allozieren”
Ok, R can’t allocate the correlation matrix. No wonder actually:
Each number of class “double” will hold 32 bytes. For a 40000 x 40000 correlation matrix we have
(32 * 40000 * 40000)/1024/1024/1024 = 47.68 GB (!!) which definitely is more than most people’s RAM.
I came across the ‘ff’ package and here we can preallocate an empty 40000 x 40000 matrix:
corMAT <- ff(vmode = "single", dim = c(40000, 40000))
Nice. However, since cor doesn’t calculate a correlation matrix this size, what to do?
Answer: All-combinations-blockwise correlation (Hu?).
With a few sleepless nights and a little help (http://stackoverflow.com/questions/14911489/large-covariance-matrix-in-r) a new function was born: bigcor.
What does it do?
1) It splits a matrix with N columns into equal size blocks of , , etc. Block size can be defined by user, 1000-2000 is a good value because cor can handle this quite quickly.
2) For all combinations of blocks, the correlation matrix is calculated, so A/A, A/B, B/B etc.
3) The blockwise correlation matrices are transferred into the preallocated empty matrix at the appropriate position (where the correlations would usually reside). To ensure symmetry around the diagonal, this is done twice in the upper and lower triangle.
This way, the N x N empty correlation matrix is filled like a checkerboard with patches of n x n correlation sub-matrices. In our case, in which we split the N = 40000 columns into n = 8 blocks, we obtain 36 combinations (combn(1:8, 2) + 8; + 8 because of A/A, B/B etc) of blocks with dimension 5000 x 5000 each. This gives 36 x 5000 x 5000 x 2 (filling both triangles) – 8 x 5000 x 5000 (because the diagonals are created twice) = 1.6E9 = 40000 x 40000.
The function:
bigcor <- function(x, nblocks = 10, verbose = TRUE, ...) { library(ff, quietly = TRUE) NCOL <- ncol(x) ## test if ncol(x) %% nblocks gives remainder 0 if (NCOL %% nblocks != 0) stop("Choose different 'nblocks' so that ncol(x) %% nblocks = 0!") ## preallocate square matrix of dimension ## ncol(x) in 'ff' single format corMAT <- ff(vmode = "single", dim = c(NCOL, NCOL)) ## split column numbers into 'nblocks' groups SPLIT <- split(1:NCOL, rep(1:nblocks, each = NCOL/nblocks)) ## create all unique combinations of blocks COMBS <- expand.grid(1:length(SPLIT), 1:length(SPLIT)) COMBS <- t(apply(COMBS, 1, sort)) COMBS <- unique(COMBS) ## iterate through each block combination, calculate correlation matrix ## between blocks and store them in the preallocated matrix on both ## symmetric sides of the diagonal for (i in 1:nrow(COMBS)) { COMB <- COMBS[i, ] G1 <- SPLIT[[COMB[1]]] G2 <- SPLIT[[COMB[2]]] if (verbose) cat("Block", COMB[1], "with Block", COMB[2], "\n") flush.console() COR <- cor(MAT[, G1], MAT[, G2], ...) corMAT[G1, G2] <- COR corMAT[G2, G1] <- t(COR) COR <- NULL } gc() return(corMAT) }
It is advised to choose ‘nblocks’ so that blocksize is max. 2000 – 5000. It should also be so that number of colums divided by ‘nblocks’ gives a remainder of 0, so with 10 colums, either 2 or 5 (10 %% 5 = 0, 10 %% 2 = 0), but the function stops if this is not the case.
An example with a small matrix:
> MAT <- matrix(rnorm(8 * 10), nrow = 10) > res <- bigcor(MAT, nblocks = 2) Block 1 with Block 1 Block 1 with Block 2 Block 2 with Block 2 > res ff (open) single length=64 (64) dim=c(8,8) dimorder=c(1,2) [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [1,] 1.000000000 -0.046641227 0.324235857 0.262680113 -0.665063083 -0.004026082 -0.606948912 -0.192466438 [2,] -0.046641227 1.000000000 0.506007075 0.440847754 0.380443692 -0.748274207 -0.396612495 0.124028035 [3,] 0.324235857 0.506007075 1.000000000 0.472342789 -0.095279060 -0.391388297 -0.298102200 0.412145644 [4,] 0.262680113 0.440847754 0.472342789 1.000000000 -0.001338097 -0.515028238 -0.566384733 0.500462592 [5,] -0.665063083 0.380443692 -0.095279060 -0.001338097 1.000000000 -0.162814990 0.173953563 0.235193089 [6,] -0.004026082 -0.748274207 -0.391388297 -0.515028238 -0.162814990 1.000000000 0.564979374 -0.124034598 [7,] -0.606948912 -0.396612495 -0.298102200 -0.566384733 0.173953563 0.564979374 1.000000000 0.186116025 [8,] -0.192466438 0.124028035 0.412145644 0.500462592 0.235193089 -0.124034598 0.186116025 1.000000000 Similar to original 'cor': > cor(MAT) [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [1,] 1.000000000 -0.04664123 0.32423586 0.262680111 -0.665063068 -0.004026082 -0.6069489 -0.1924664 [2,] -0.046641225 1.00000000 0.50600710 0.440847754 0.380443686 -0.748274237 -0.3966125 0.1240280 [3,] 0.324235855 0.50600710 1.00000000 0.472342793 -0.095279062 -0.391388298 -0.2981022 0.4121456 [4,] 0.262680111 0.44084775 0.47234279 1.000000000 -0.001338097 -0.515028219 -0.5663847 0.5004626 [5,] -0.665063068 0.38044369 -0.09527906 -0.001338097 1.000000000 -0.162814989 0.1739536 0.2351931 [6,] -0.004026082 -0.74827424 -0.39138830 -0.515028219 -0.162814989 1.000000000 0.5649793 -0.1240346 [7,] -0.606948889 -0.39661248 -0.29810221 -0.566384734 0.173953557 0.564979348 1.0000000 0.1861160 [8,] -0.192466433 0.12402803 0.41214564 0.500462599 0.235193095 -0.124034600 0.1861160 1.0000000 Timings for a 20000 x 20000 matrix: > MAT <- matrix(rnorm(20000 * 10), nrow = 10) > system.time(res <- bigcor(MAT, nblocks = 10)) Block 1 with Block 1 Block 1 with Block 2 Block 1 with Block 3 Block 1 with Block 4 Block 1 with Block 5 Block 1 with Block 6 Block 1 with Block 7 Block 1 with Block 8 Block 1 with Block 9 Block 1 with Block 10 Block 2 with Block 2 Block 2 with Block 3 Block 2 with Block 4 Block 2 with Block 5 Block 2 with Block 6 Block 2 with Block 7 Block 2 with Block 8 Block 2 with Block 9 Block 2 with Block 10 Block 3 with Block 3 Block 3 with Block 4 Block 3 with Block 5 Block 3 with Block 6 Block 3 with Block 7 Block 3 with Block 8 Block 3 with Block 9 Block 3 with Block 10 Block 4 with Block 4 Block 4 with Block 5 Block 4 with Block 6 Block 4 with Block 7 Block 4 with Block 8 Block 4 with Block 9 Block 4 with Block 10 Block 5 with Block 5 Block 5 with Block 6 Block 5 with Block 7 Block 5 with Block 8 Block 5 with Block 9 Block 5 with Block 10 Block 6 with Block 6 Block 6 with Block 7 Block 6 with Block 8 Block 6 with Block 9 Block 6 with Block 10 Block 7 with Block 7 Block 7 with Block 8 Block 7 with Block 9 Block 7 with Block 10 Block 8 with Block 8 Block 8 with Block 9 Block 8 with Block 10 Block 9 with Block 9 Block 9 with Block 10 Block 10 with Block 10 User System verstrichen 38.98 7.54 50.28 Roughly a minute. I think that's o.k. Comments welcome (improvements, speed, memory management, parallel execution...) Cheers, Andrej
Filed under: General Tagged: big matrix, cor, correlation
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.