Multiple cores in R, revisited

[This article was first published on Recipes, scripts and genomics, and kindly contributed to R-bloggers.]

The bigmemory package, in combination with doMC, provides at least a partial solution for sharing a large data set across multiple cores in R. With this approach, several worker processes can operate on the same matrix instead of each holding its own copy. It also scales well: I’ve used it on files of several GB. The limitation is that all the values in the matrix need to be of the same type (typically integer).
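Because of the single-type restriction, a typical BED file with labels such as "chrX" or "+" has to be recoded as integers first. Here is a minimal sketch in base R. The toy table, the names `bed`, `encoded` and `encoded.file`, and the mapping (chrX to 23, chrY to 24, "+" to 1, "-" to 0) are assumptions made for illustration, not something prescribed by bigmemory.

```r
# Hypothetical BED-like table with character chromosome and strand columns
bed <- data.frame(chr    = c("chr1", "chr2", "chrX", "chrY"),
                  start  = c(100L, 200L, 300L, 400L),
                  end    = c(150L, 260L, 380L, 410L),
                  strand = c("+", "-", "+", "-"),
                  stringsAsFactors = FALSE)

# Assumed encoding: chr1..chr22 -> 1..22, chrX -> 23, chrY -> 24
chr.levels <- paste0("chr", c(1:22, "X", "Y"))

encoded <- data.frame(chr    = match(bed$chr, chr.levels),
                      start  = bed$start,
                      end    = bed$end,
                      strand = as.integer(bed$strand == "+"))

# Write without header, row names or quotes so every field is a plain integer
encoded.file <- tempfile(fileext = ".txt")
write.table(encoded, encoded.file, sep = "\t", quote = FALSE,
            row.names = FALSE, col.names = FALSE)
```

The resulting file contains only integers and tabs, so it can be loaded into a single integer matrix.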

The following code reads in a bed-like file with numerical values for chromosome (1:24) and strand (1, 0), processes the file by parallelizing over chromosomes, and returns the values as a list. Note the use of the descriptor to identify the shared object. Any change to the shared object is immediately visible to all processes.

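library(bigmemory)
library(doMC)
registerDoMC(cores = 24)  # one worker per chromosome

# 'filename' is the path to the tab-separated, all-integer input file.
# shared = TRUE is needed so that other processes can attach to the matrix.
bigtab <- read.big.matrix(filename, sep = "\t",
                          col.names = c("chr", "start", "end", "strand"),
                          type = "integer", shared = TRUE)

# The descriptor is a small object that identifies the shared matrix
descriptor <- describe(bigtab)

result <- foreach(chr = seq(1, 24)) %dopar% {
  # Attach to the shared matrix from within the worker
  tab <- attach.big.matrix(descriptor)
  tab.chr <- tab[tab[, "chr"] == chr, ]
  # Do some stuff with these values
  # and return result
}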
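To make the pattern concrete, here is a small self-contained sketch on toy data. It is only an illustration under stated assumptions: the six intervals, the names `toy`, `toy.file`, `out` and `out.desc`, and the "total interval length per chromosome" computation are all made up for the example. Each worker attaches to the shared input matrix and writes its result into its own row of a second shared matrix, which shows that changes made in the workers are visible to the parent process.

```r
library(bigmemory)
library(doMC)
registerDoMC(cores = 2)

# Hypothetical toy input: six intervals on chromosomes 1 to 3
toy <- data.frame(chr    = c(1L, 1L, 2L, 2L, 3L, 3L),
                  start  = c(10L, 50L, 5L, 100L, 1L, 20L),
                  end    = c(20L, 80L, 15L, 150L, 11L, 25L),
                  strand = c(1L, 0L, 1L, 1L, 0L, 0L))
toy.file <- tempfile(fileext = ".txt")
write.table(toy, toy.file, sep = "\t", quote = FALSE,
            row.names = FALSE, col.names = FALSE)

# Shared input matrix and its descriptor
bigtab <- read.big.matrix(toy.file, sep = "\t",
                          col.names = c("chr", "start", "end", "strand"),
                          type = "integer", shared = TRUE)
descriptor <- describe(bigtab)

# Shared output matrix: one row per chromosome, initialised to 0
out <- big.matrix(nrow = 3, ncol = 1, type = "integer", init = 0,
                  shared = TRUE)
out.desc <- describe(out)

invisible(foreach(chr = 1:3) %dopar% {
  tab <- attach.big.matrix(descriptor)
  res <- attach.big.matrix(out.desc)
  rows <- which(tab[, "chr"] == chr)
  # Each worker writes only to its own row, so the writes do not collide
  res[chr, 1] <- sum(tab[rows, "end"] - tab[rows, "start"])
  NULL
})

# The parent process sees the values written by the workers
out[, 1]
```

Note that doMC relies on forking, so this sketch is meant for Linux or macOS. Letting each worker write to a distinct row avoids the need for any locking; if several workers had to update the same cells, some form of synchronisation would be required.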


Cool, huh?
