An easy way to run R code in parallel on a multicore system is with the mclapply() function. Unfortunately, mclapply() does not run in parallel on Windows machines because its implementation relies on forking, which Windows does not support.
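A script that has to run on both kinds of systems can check the platform before asking for more than one core. The following is only a minimal sketch using base R and the parallel package; the choice of two cores on non-Windows systems is an arbitrary assumption.

```r
library(parallel)

## On Windows, mclapply() only accepts mc.cores = 1 and runs serially;
## asking for more cores there raises an error
is.windows <- .Platform$OS.type == "windows"

## Assumption for illustration: use two cores wherever forking is available
cores <- if (is.windows) 1L else 2L

res <- mclapply(1:4, function(xx) xx^2, mc.cores = cores)
```

This keeps the script portable, but on Windows the work is still done one element at a time.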
Previously, I published a hackish solution that implemented a fake mclapply() for Windows users with one of the Windows compatible parallel R strategies. You can find further details here.
Due to positive user feedback, I have wrapped that script into a simple R package: parallelsugar.
Installation
Step 0: If you do not already have devtools installed, install it using the instructions here. Note that for the purposes of this package, installing Rtools is not necessary.
Step 1: Install parallelsugar directly from my GitHub repository using install_github('nathanvan/parallelsugar'). For the purposes of this package, you may ignore the warning about Rtools. (If you already have Rtools installed, the warning will not appear.)
> library(devtools)
WARNING: Rtools is required to build R packages, but is not currently
installed.
... snip ...
> install_github('nathanvan/parallelsugar')
Downloading github repo nathanvan/parallelsugar@master
Installing parallelsugar
... snip ...
* DONE (parallelsugar)
Usage examples
Basic Usage
On Windows, the following line will take about 40 seconds to run because by default, mclapply from the parallel package is implemented as a serial function on Windows systems.
library(parallel)
system.time( mclapply(1:4, function(xx){ Sys.sleep(10) }) )
## user system elapsed
## 0.00 0.00 40.06
If we load parallelsugar, the default parallel::mclapply, which relies on forking, is masked by parallelsugar::mclapply, which is implemented with socket clusters. The same line of code then takes closer to 10 seconds.
library(parallelsugar)
##
## Attaching package: ‘parallelsugar’
##
## The following object is masked from ‘package:parallel’:
##
## mclapply
system.time( mclapply(1:4, function(xx){ Sys.sleep(10) }) )
## user system elapsed
## 0.04 0.08 12.98
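To give a rough idea of what a socket-based replacement involves, here is a minimal sketch built only on functions from the parallel package. It is not the parallelsugar source: the name socket.lapply and the n.workers argument are hypothetical, and the sketch leaves out the copying of the master session's objects and packages that parallelsugar performs.

```r
library(parallel)

## Hypothetical helper (not part of parallelsugar): apply FUN to each
## element of X using a socket cluster, which works on every platform
socket.lapply <- function(X, FUN, ..., n.workers = 2L) {
  ## Start one background R process per worker
  cl <- makeCluster(n.workers)
  ## Shut the workers down even if FUN raises an error
  on.exit(stopCluster(cl))
  ## Split X across the workers and collect the results as a list
  parLapply(cl, X, FUN, ...)
}

out <- socket.lapply(1:4, function(xx) { Sys.sleep(1); xx^2 })
```

Starting and stopping the worker processes has a fixed cost, which is one reason the socket-based timing above is somewhat more than 10 seconds.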
Use of global variables and packages
By design, parallelsugar approximates a fork based cluster: every object that is in scope in the master R process is copied over to the worker processes on the other sockets. This implies that
- you can quickly run out of memory, and
- you can waste a lot of time copying over unnecessary objects hanging around in your R session.
Be warned!
## Load a package
library(Matrix)
## Define a global variable
a.global.variable <- Matrix::Diagonal(3)
## Define a global function
wait.then.square <- function(xx){
  ## Wait for 5 seconds
  Sys.sleep(5)
  ## Square the argument
  xx^2
}
## Check that it works with plain lapply
serial.output <- lapply( 1:4, function(xx) {
  return( wait.then.square(xx) + a.global.variable )
})
## Test with the modified mclapply
par.output <- mclapply( 1:4, function(xx) {
  return( wait.then.square(xx) + a.global.variable )
})
## Are they equal?
all.equal( serial.output, par.output )
## [1] TRUE
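If copying the whole session is too expensive, one alternative is to manage a socket cluster by hand and export only the objects the workers need. The sketch below uses the parallel package directly rather than parallelsugar. As an assumption for illustration, it swaps the Matrix object for a base R diag(3) matrix so that no extra package has to be loaded on the workers, and it shortens the wait to 1 second.

```r
library(parallel)

## Stand-ins for the objects in the example above
a.global.variable <- diag(3)
wait.then.square <- function(xx) {
  Sys.sleep(1)
  xx^2
}

## Start two worker processes
cl <- makeCluster(2L)
## Copy only the two objects the workers actually need
clusterExport(cl, c("a.global.variable", "wait.then.square"),
              envir = environment())
par.output <- parLapply(cl, 1:4, function(xx) {
  wait.then.square(xx) + a.global.variable
})
stopCluster(cl)

## Compare against the serial result
serial.output <- lapply(1:4, function(xx) {
  wait.then.square(xx) + a.global.variable
})
all.equal(serial.output, par.output)
```

If the workers also need a package, it can be loaded on each of them with clusterEvalQ(cl, library(Matrix)) before the call to parLapply.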
Request for feedback and help
I put this together because it helped to solve a specific problem that I was having. If it solves your problem, please let me know. If it needs to be modified to solve your problem, please either
- open an issue on GitHub, or
- even better, fork, fix, and issue a pull request.
