Parallelization: Speed up Functions in a Package


Well, I bought a new computer a month back (i7, 8 GB memory). Finally more than one core and a chance to try parallelization. I saw this blog post a while back and was intrigued, and was further intrigued when I saw that plyr/reshape2 have some parallelization capabilities (LINK). Let me say up front that this is my first experience, so there may be better ways, but it sped up my code by more than four times.

[Image: parallel computing]

Let me warn you now: when I first read A No BS Guide to the Basics of Parallelization in R, I tried to see how many cores I had on my computer (this shows my ignorance, which may be of comfort to some of you; others will stop reading this blog post immediately). 1 is the loneliest number, especially if you're attempting to run on multiple cores.

Suggestion: if you type detectCores() and see 1, you can't run code in parallel, at least not by running it on different cores of your machine.
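Checking takes one line (the parallel package ships with R 2.14.0 and later):

    library(parallel)
    detectCores()  # if this prints 1, multi-core parallelization is out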

Background (skip this if you are short on time)
I'm working on a package (qdap) that has a function (pos) that takes a long time to run. It basically finds parts of speech by sentence (each sentence is a cell, and there are thousands of them). I rely on openNLP for the POS tagging, but the whole process is time consuming. I figured this was the perfect time to try parallelization out.

I skimmed the Task View for parallel computing, knew I was out of my league, and decided to focus on my problem rather than the whole parallelization concept. Back at wrathematics' blog post, I discovered that my silly Windows machine is not compatible with mclapply (forking isn't available on Windows), but saw hope with clusterApply(). Using ?clusterApply
I saw that parLapply is described as a parallel version of lapply. I like lapply, so I decided that was what I'd go with.
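To make that concrete before diving into functions, here is a bare-bones sketch of the socket-cluster pattern (my own toy example, not qdap code):

    library(parallel)
    cl <- makeCluster(2)                        # start 2 worker R sessions
    res <- parLapply(cl, 1:4, function(x) x^2)  # drop-in parallel lapply
    stopCluster(cl)                             # always shut the workers down
    unlist(res)                                 # 1 4 9 16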

Working with parallel coding in functions (skip to here)
These are the two major problems/differences I encountered with parLapply over lapply inside a function:

    1. You need to pass/export the functions and variables you'll be needing inside parLapply, using makeCluster and clusterExport. See Andy Garcia's helpful response to my question about this (LINK).
    2. You have to specify the envir argument of clusterExport as envir=environment(). See GSee's helpful response to my question about this (LINK).

Below is an example of taking a non-parallel function and making it run in parallel:

       library(parallel)
       detectCores()  # make sure you have > 1 core
      
      nonpar.test <- function(text.var, gc.rate=10){ 
          ntv <- length(text.var)
          require(parallel)
          pos <-  function(i) {
              paste(sapply(strsplit(tolower(i), " "), nchar), collapse=" | ")
          }
          lapply(seq_len(ntv), function(i) {
                  x <- pos(text.var[i])
                  if (i%%gc.rate==0) gc()
                  return(x)
              }
          )
      }
      
      nonpar.test(rep("I wish I ran in parallel.", 20))
      
      par.test <- function(text.var, gc.rate=10){ 
          ntv <- length(text.var)
          require(parallel)
          pos <-  function(i) {
              paste(sapply(strsplit(tolower(i), " "), nchar), collapse=" | ")
          }
      #======================================
          cl <- makeCluster(getOption("cl.cores", 4))  # start the worker pool
          clusterExport(cl=cl, varlist=c("text.var", "ntv", "gc.rate", "pos"), 
              envir=environment())  # export from the function's own environment
          out <- parLapply(cl, seq_len(ntv), function(i) {
      #======================================
                  x <- pos(text.var[i])
                  if (i%%gc.rate==0) gc()
                  return(x)
              }
          )
          stopCluster(cl)  # shut the workers down when finished
          out
      }
      
      par.test(rep("I wish I ran in parallel.", 20))

Notice that the code between the #==== lines (together with capturing the result and calling stopCluster() at the end) is all that changes. Once you get that down, working with parLapply is pretty easy.

Note:
It doesn't always make sense to run in parallel, as it takes time to make the cluster. In pos I added parallel as an argument because for smaller text vectors running in parallel doesn't make sense (it's slower); a rough sketch of that kind of switch follows.
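Here is a minimal sketch of such a fallback, with an arbitrary cutoff and a stand-in for the real work (my illustration, not the actual qdap code):

    library(parallel)
    pos.switch <- function(text.var, parallel = FALSE) {  # hypothetical wrapper
        if (!parallel || length(text.var) < 500) {   # 500 is an arbitrary cutoff
            return(lapply(text.var, nchar))          # stand-in for the real work
        }
        cl <- makeCluster(getOption("cl.cores", 4))
        on.exit(stopCluster(cl))                     # clean up even on error
        parLapply(cl, text.var, nchar)
    }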

Wonderings and future direction:
The pos function I have in qdap uses a progress bar. So far I haven't been able to make a progress bar work with parLapply, but it's less of a need because the parallel version was so much faster.
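One avenue I haven't tested (so treat this as an assumption rather than a recommendation): the pbapply package wraps the apply family with a progress bar, and its pblapply() can reportedly be handed a cluster via the cl argument:

    library(parallel)
    library(pbapply)  # assumes pbapply is installed
    cl <- makeCluster(2)
    res <- pblapply(1:100, function(x) {Sys.sleep(0.01); x^2}, cl = cl)
    stopCluster(cl)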

Benchmarking (1 run)

      > system.time(pos(rajSPLIT$dialogue, parallel=T))
         user  system elapsed 
         2.35    0.08  199.53 
      
      > system.time(pos(rajSPLIT$dialogue, progress.bar =F))
         user  system elapsed 
       816.61   16.74  833.47

This is benchmarked using rajSPLIT$dialogue, the text of Romeo and Juliet, a data set in qdap. It consists of 2,151 rows and 23,943 words.
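To see the start-up overhead point from the Note above in action, the toy functions from earlier can be timed the same way (small input, so the non-parallel version should win):

    x <- rep("I wish I ran in parallel.", 20)
    system.time(nonpar.test(x))  # tiny input: no cluster start-up cost
    system.time(par.test(x))     # cluster creation dominates here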

Hopefully this blog post is useful to those learning some parallelization. Check out the Task View, the documentation for the parallel package, and the vignette for the parallel package.

If you have suggestions for improvement, links, or help getting a progress bar to work with parLapply, please leave a comment.

