Multiple cores in R, revisited

August 10, 2011

The bigmemory package, in combination with doMC, provides at least a partial solution for sharing a large data set across multiple cores in R: several workers can operate on the same matrix without copying it. The approach also scales well; I've used it on files of several GB. The limitation is that all values in the matrix need to be of the same type (typically integer).
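
Because a big.matrix cannot mix types, character columns such as chromosome names and strand symbols have to be recoded numerically before loading. Here is a minimal sketch of such pre-processing (not from the original post; the file names and the chrX=23/chrY=24 coding are assumptions for illustration):

# Hypothetical pre-processing: recode chromosome names and strand
# symbols as integers so the whole table fits one integer big.matrix.
chr.levels <- c(paste0('chr', 1:22), 'chrX', 'chrY')  # chrX -> 23, chrY -> 24
bed <- read.table('regions.bed', sep='\t',
                  col.names=c('chr','start','end','strand'))
bed$chr    <- match(bed$chr, chr.levels)              # 'chrX' becomes 23, etc.
bed$strand <- ifelse(bed$strand == '+', 1L, 0L)       # '+' -> 1, '-' -> 0
write.table(bed, 'regions_numeric.bed', sep='\t',
            row.names=FALSE, col.names=FALSE, quote=FALSE)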

The following code reads in a bed-like file with numerical values for chromosome (1:24) and strand (1,0), processes the file by parallelizing over chromosomes, and returns the values as a list. Note the use of the descriptor to identify the shared object; any change to the shared matrix is immediately visible to all processes.

library(bigmemory)
library(doMC)
registerDoMC(cores=24)

# Read the bed-like file into a shared, integer-typed big.matrix
# ('filename' is the path to your tab-separated file).
bigtab <- read.big.matrix(filename, sep="\t",
                          col.names=c('chr','start','end','strand'),
                          type='integer', shared=TRUE)

# The descriptor is a small handle that other processes can use
# to attach the same shared-memory object.
descriptor <- describe(bigtab)

# One parallel task per chromosome; each worker attaches the
# shared matrix instead of receiving its own copy.
result <- foreach(chr = 1:24) %dopar% {
  tab <- attach.big.matrix(descriptor)
  tab.chr <- tab[tab[, 'chr'] == chr, ]
  # Do some stuff with these values
  # and return the result
}


Cool, huh?
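
As a quick illustration of the write-through behaviour mentioned above, here is a minimal sketch (not from the original post, reusing the bigtab and descriptor objects): each worker writes into its own chromosome's rows, and the parent process sees the changes without any explicit communication.

# Minimal sketch: each worker zeroes the strand column for its own
# chromosome, writing directly into the shared matrix.
foreach(chr = 1:24) %dopar% {
  tab <- attach.big.matrix(descriptor)
  idx <- which(tab[, 'chr'] == chr)
  if (length(idx) > 0)
    tab[idx, 'strand'] <- rep(0L, length(idx))  # write-through to shared memory
  NULL
}
head(bigtab[, 'strand'])  # the parent sees the zeroed values immediately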
