Parallelism via “parSapply”

In an earlier post, I used mclapply to kick off parallel R processes and to demonstrate inter-process synchronization via the flock package. Although I have been using this approach to parallelism for a few years now, I have to admit it has some important disadvantages: it works only on a single machine, and it doesn’t work on Windows at all.

Hence, to test the flock package on Windows, I had to resort to an alternative implementation, using clusters from the parallel package. I liked it quite a bit and learned a few things along the way, so here it is.

require(RSQLite)
require(parallel)
require(flock)

db.path = "C:/ttt.sqlite"
lock.name = "C:/file.lock"

con <- dbConnect(RSQLite::SQLite(), dbname=db.path)
df <- data.frame(value=0)
dbWriteTable(con, "test", df, overwrite=TRUE)
dbDisconnect(con)

write.one.value <- function(val) {
   # Take an exclusive lock
   ll = lock(lock.name)
   
   # The critical section code
   con <- dbConnect(RSQLite::SQLite(), dbname=db.path)
   dbWriteTable(con, "test", data.frame(value=val), append=TRUE)
   dbDisconnect(con)
   
   # Release the lock
   unlock(ll)
}

write.values = function(cores, db.path, lock.name) {
   if(cores > 1) {
      cl = makeCluster(cores)
      
      # Load the required packages on each worker process
      clusterEvalQ(cl=cl, library(RSQLite))
      clusterEvalQ(cl=cl, library(flock))
      
      # Make the function arguments visible to the work-horse function on the
      # workers (envir=environment() exports the local values, not whatever
      # happens to sit in the global environment)
      clusterExport(cl, c("db.path", "lock.name"), envir=environment())
      
      tt = parSapply(cl=cl, 1:1000, write.one.value)
      
      stopCluster(cl)
   } else {
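      # Re-point write.one.value at this function's environment, so it can
      # see db.path and lock.name without passing them as arguments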
      environment(write.one.value) = environment()
      tt = lapply(1:1000, write.one.value)
   }
}

write.values(1, db.path, lock.name)
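
For a quick sanity check after a run, the table can be read back (a sketch, reusing db.path from above; the expected count assumes a fresh database created by the setup code at the top):

con <- dbConnect(RSQLite::SQLite(), dbname=db.path)
dbGetQuery(con, "select count(*) as cnt from test")   # expect 1001 rows: the initial 0 plus the 1000 appended values
dbDisconnect(con)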

The interesting piece is write.values. First, it shows a neat R feature: how to make variables visible in the callee without passing everything as arguments, simply by re-pointing the callee's environment. Second, it shows how to branch between single-core, single-machine execution (the lapply branch) and parallel execution (yes, via parSapply one can spawn processes even across multiple machines, depending on how the cluster is set up). Why is that important? The errors returned from the single-core (lapply) version are often clearer and more meaningful than those returned from the parallel version. In other words, until things are stable and the bugs are ironed out, the single-core version is essential.
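
To see the environment trick in isolation, here is a toy example (the names f, g and x are made up for this illustration):

f <- function() x + 1                  # x is a free variable, not an argument
g <- function(x) {
   environment(f) <- environment()     # f's free variables now resolve in g's frame
   f()
}
g(41)                                  # returns 42

As for running across machines: below is a minimal sketch of a PSOCK cluster spanning two hosts. The host names are placeholders, and the sketch assumes R, RSQLite and flock are installed on both machines, password-less SSH access from the master, and that db.path and lock.name point to a filesystem shared by all hosts (otherwise the file lock cannot synchronize workers on different machines):

cl <- makePSOCKcluster(c("host1", "host1", "host2", "host2"))
clusterEvalQ(cl, library(RSQLite))
clusterEvalQ(cl, library(flock))
clusterExport(cl, c("db.path", "lock.name"))
tt <- parSapply(cl, 1:1000, write.one.value)
stopCluster(cl)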
