Parallelism via “parSapply”

December 13, 2014
(This article was first published on Quintuitive » R, and kindly contributed to R-bloggers)

In an earlier post, I used mclapply to kick off parallel R processes and to demonstrate inter-process synchronization via the flock package. Although I have been using this approach to parallelism for a few years now, I admit it has some important disadvantages: it works only on a single machine, and it doesn't work on Windows.

Hence, to test the flock package on Windows, I had to resort to an alternative implementation, based on the parallel package. I liked it quite a bit and learned a few things on the way, so here it is.

require(RSQLite)
require(parallel)
require(flock)

db.path = "C:/ttt.sqlite"
lock.name = "C:/file.lock"

# Create the database with an empty "test" table
con <- dbConnect(RSQLite::SQLite(), dbname=db.path)
df <- data.frame(value=0)
dbWriteTable(con, "test", df, overwrite=TRUE)
dbDisconnect(con)

write.one.value <- function(val) {
   # Take an exclusive lock
   ll = lock(lock.name)
   
   # The critical section code
   con <- dbConnect(RSQLite::SQLite(), dbname=db.path)
   dbWriteTable(con, "test", data.frame(value=val), append=TRUE)
   dbDisconnect(con)
   
   # Release the lock
   unlock(ll)
}

write.values = function(cores, db.path, lock.name) {
   if(cores > 1) {
      cl = makeCluster(cores)
      
      # Load the packages into all worker processes
      clusterEvalQ(cl=cl, library(RSQLite))
      clusterEvalQ(cl=cl, library(flock))
      
      # Make the variables visible in the work-horse function. Without
      # envir=environment(), clusterExport would look for db.path and
      # lock.name in the global environment instead of among this
      # function's arguments.
      clusterExport(cl, c("db.path", "lock.name"), envir=environment())
      
      tt = parSapply(cl=cl, 1:1000, write.one.value)
      
      stopCluster(cl)
   } else {
      environment(write.one.value) = environment()
      tt = lapply(1:1000, write.one.value)
   }
   
   invisible(tt)
}

write.values(1, db.path, lock.name)

The interesting piece is write.values. First, it shows a neat R feature: how to make variables visible in the callee without passing everything as arguments. It also shows how to branch between single-core, single-machine execution (the lapply branch) and parallel execution (yes, via parSapply one can spawn processes even across multiple machines, depending on the setup of the cluster). Why is that important? Often the errors returned from the single-core (lapply) version are clearer and more meaningful than the errors returned from the parallel version. In other words, until things are stable and the bugs are out, the single-core version is essential.
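The environment trick from the lapply branch is worth seeing in isolation. A minimal sketch (the greet function and greeting variable are hypothetical, used purely for demonstration):

```r
# A function that references a variable it does not define itself
greet <- function(name) {
   paste(greeting, name)
}

demo <- function() {
   greeting <- "Hello,"
   # Re-point the function's enclosing environment at this call's
   # environment, so the lookup of 'greeting' resolves to the local
   # variable above - no extra argument needed
   environment(greet) <- environment()
   greet("world")
}

print(demo())   # "Hello, world"
```

Note that the assignment modifies a local copy of greet inside demo, leaving the global definition untouched. On the parallel side, makeCluster can also be given a character vector of host names (for instance makeCluster(c("host1", "host2")), with placeholder names here) to start workers over SSH on several machines, which is what lets parSapply scale beyond a single box.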
