Site icon R-bloggers

parallelsugar: An implementation of mclapply for Windows

[This article was first published on Nathan VanHoudnos » rstats, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

An easy way to run R code in parallel on a multicore system is with the mclapply() function. Unfortunately, mclapply() does not work on Windows machines because the mclapply() implementation relies on forking and Windows does not support forking.

Previously, I published a hackish solution that implemented a fake mclapply() for Windows users with one of the Windows compatible parallel R strategies. You can find further details here.

Due to positive user feedback, I have wrapped that script into a simple R package: parallelsugar.

Installation

Step 0: If you do not already have devtools installed, install it using the instructions here. Note that for the purposes of this package, installing Rtools is not necessary.

Step 1: Install parallelsugar directly from my GitHub repository using install_github('nathanvan/parallelsugar'). For the purposes of this package, you may ignore the error about Rtools (unless you already have it installed, in which case the warning will not appear.)

> library(devtools)
WARNING: Rtools is required to build R packages, but is not currently
installed.
   ... snip ...
> install_github('nathanvan/parallelsugar')
Downloading github repo nathanvan/parallelsugar@master
Installing parallelsugar
  ... snip ...
* DONE (parallelsugar)

Usage examples

Basic Usage

On Windows, the following line will take about 40 seconds to run because by default, mclapply from the parallel package is implemented as a serial function on Windows systems.

library(parallel) 

system.time( mclapply(1:4, function(xx){ Sys.sleep(10) }) )
##    user  system elapsed 
##    0.00    0.00   40.06 

If we load parallelsugar, the default implementation of parallel::mclapply, which used fork based clusters, will be overwritten by parallelsugar::mclapply, which is implemented with socket clusters. The above line of code will then take closer to 10 seconds.

library(parallelsugar)
## 
## Attaching package: ‘parallelsugar’
## 
## The following object is masked from ‘package:parallel’:
## 
##     mclapply

system.time( mclapply(1:4, function(xx){ Sys.sleep(10) }) )
##    user  system elapsed 
##    0.04    0.08   12.98 

Use of global variables and packages

By design, parallelsugar approximates a fork based cluster — every object that is within scope to the master R process is copied over to the processes on the other sockets. This implies that

Be warned!

## Load a package 
library(Matrix)

## Define a global variable
a.global.variable <- Matrix::Diagonal(3)

## Define a global function 
wait.then.square <- function(xx){
  ## Wait for 5 seconds
  Sys.sleep(5);
  ## Square the argument
  xx^2 
}

## Check that it works with plain lapply
serial.output <- lapply( 1:4, function(xx) {
      return( wait.then.square(xx) + a.global.variable )
    }) 

## Test with the modified mclapply  
par.output <- mclapply( 1:4, function(xx) {
      return( wait.then.square(xx) + a.global.variable )
    })

## Are they equal? 
all.equal( serial.output, par.output )
## [1] TRUE

Request for feedback and help

I put this together because it helped to solve a specific problem that I was having. If it solves your problem, please let me know. If it needs to be modified to solve your problem, please either

To leave a comment for the author, please follow the link and comment on their blog: Nathan VanHoudnos » rstats.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.