Let’s Be Faster and More Parallel in R with the doParallel Package
Recently, and only recently, I have been exposed to large data structures: objects like data frames that are as big as 100 MB (if you don't know, you can find out the size of an object with the object.size(one_object) command). When you come to R from another background, you are mostly used to for or foreach loops; however, I have since come across the expressive beauty of lapply loops. In this blog post, I show you some options to boost the performance of loops in R.
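For example, a quick way to check this on a data frame (the data frame here is a hypothetical one, just for illustration):

# hypothetical example: two columns of a million doubles each
df <- data.frame(x = runif(1e6), y = runif(1e6))
print(object.size(df), units = "Mb")   # roughly "15.3 Mb"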
Case: Calculating Prime Numbers
Let’s imagine we are computing the prime numbers up to each value of n from 10 to 100000, using the following function:
getPrimeNumbers <- function(n) {
  # sieve of Eratosthenes: returns all primes up to n
  n <- as.integer(n)
  if (n > 1e6) stop("n too large")
  primes <- rep(TRUE, n)
  primes[1] <- FALSE
  last.prime <- 2L
  for (i in last.prime:floor(sqrt(n))) {
    # cross out every multiple of the current prime
    primes[seq.int(2L * last.prime, n, last.prime)] <- FALSE
    # advance to the next number still marked as prime
    last.prime <- last.prime + min(which(primes[(last.prime + 1):n]))
  }
  which(primes)
}
Note that the function is taken from this Stack Overflow answer: http://stackoverflow.com/questions/3789968/generate-a-list-of-primes-in-r-up-to-a-certain-number
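A quick sanity check before we start benchmarking:

getPrimeNumbers(30)
# [1]  2  3  5  7 11 13 17 19 23 29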
Now let's compare how each loop type performs:
for vs. lapply
Let's look at lapply:

index <- 10:100000
# apply getPrimeNumbers to every element of index; the result is a list
result <- lapply(index, getPrimeNumbers)
This is how you can do it using a for loop:

index <- 10:100000
result <- list()
for (i in seq_along(index)) {
  result[[i]] <- getPrimeNumbers(index[i])
}
You might also agree that lapply makes your code much more beautiful. The apply functions are, however, actually a bit slower in R than native for loops: the for loop finished in 55.4708 seconds on average over 10 runs, while lapply did the same in 57.00911 seconds.
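Those numbers are from my machine; here is a minimal sketch of how such timings can be measured (I averaged 10 runs; your numbers will differ):

index <- 10:100000

# elapsed wall-clock time of the plain for loop
t_for <- system.time({
  result <- list()
  for (i in seq_along(index)) result[[i]] <- getPrimeNumbers(index[i])
})["elapsed"]

# elapsed wall-clock time of lapply
t_lapply <- system.time(result <- lapply(index, getPrimeNumbers))["elapsed"]

c(for_loop = unname(t_for), lapply = unname(t_lapply))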
But can it be better? I used to think not; I complained a lot that R is slow (and, to be honest, it is), but there is room for improvement, so let's see.
doParallel::parLapply
Now let's go multi-core:
library(doParallel)

no_cores <- detectCores() - 1                # leave one core free for the OS and RStudio
cl <- makeCluster(no_cores, type = "FORK")   # FORK clusters are Unix-only; see below
result <- parLapply(cl, 10:100000, getPrimeNumbers)
stopCluster(cl)                              # always release the workers
The same loop took only 19.38573 seconds on average over 10 runs. Now, remember that detectCores() finds how many cores your CPU has; to keep the machine (and RStudio) from locking up, I used one fewer core than that. Also, make sure to invoke stopCluster so you free up the workers' resources.
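Note that type="FORK" only works on Unix-like systems (Linux and macOS). On Windows you would use a PSOCK cluster instead; since PSOCK workers start with an empty environment, the function has to be exported to them explicitly. A sketch:

library(doParallel)

no_cores <- detectCores() - 1
cl <- makeCluster(no_cores, type = "PSOCK")   # works on every platform
clusterExport(cl, "getPrimeNumbers")          # ship the function to the workers
result <- parLapply(cl, 10:100000, getPrimeNumbers)
stopCluster(cl)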
doParallel::foreach
library(doParallel)

no_cores <- detectCores() - 1
cl <- makeCluster(no_cores, type = "FORK")
registerDoParallel(cl)   # register the cluster as the %dopar% backend
result <- foreach(i = 10:100000) %dopar% getPrimeNumbers(i)
stopCluster(cl)
doParallel::foreach is very fast: the loop took only 14.87837 seconds on average over 10 runs!
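By default foreach collects the per-iteration results into a list. If you want them combined differently, the .combine argument handles that; for example, .combine = c flattens everything into one vector (a self-contained sketch):

library(doParallel)

cl <- makeCluster(detectCores() - 1, type = "FORK")
registerDoParallel(cl)
# each iteration returns a vector of primes; c() concatenates them all
all.primes <- foreach(i = 10:100000, .combine = c) %dopar% getPrimeNumbers(i)
stopCluster(cl)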
doParallel::mclapply
The last function I am going to showcase is the easy-to-use but not-very-impressive mclapply (strictly speaking it comes from the parallel package, which loading doParallel attaches for you):

library(doParallel)

cores <- detectCores() - 1
# mclapply forks the running R process; no cluster object is needed
result <- mclapply(10:100000, getPrimeNumbers, mc.cores = cores)
Although you don't need to create and stop clusters as with the other functions, it ran in around 42.62276 seconds on average: better than the for loop while using more cores, but worse than doParallel::foreach or doParallel::parLapply.
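One caveat: mclapply parallelizes by forking the R process, and forking is not available on Windows, where mc.cores > 1 raises an error. A portable guard looks like this (a sketch):

library(doParallel)

# fall back to a single core on Windows, where forking is unavailable
cores <- if (.Platform$OS.type == "windows") 1 else detectCores() - 1
result <- mclapply(10:100000, getPrimeNumbers, mc.cores = cores)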
Results
Now let's visualize the results using the amazing ggplot2, so we can see them in a more humanly understandable way:

library(ggplot2)

loopMethods <- c('for', 'lapply', 'doParallel::parLapply', 'doParallel::foreach', 'doParallel::mclapply')
runtime <- c(55.4708, 57.00911, 19.38573, 14.87837, 42.62276)
result <- data.frame(loopMethods, runtime)
colnames(result) <- c('loop type', 'runtime (sec)')

ggplot(result, aes(x = `loop type`, y = `runtime (sec)`)) +
  theme_bw() +
  geom_bar(stat = "identity")
Here's how it looks:
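If you would rather have the bars sorted by runtime, a small optional tweak on the same result data frame does it with reorder():

# same chart, bars ordered from fastest to slowest
ggplot(result, aes(x = reorder(`loop type`, `runtime (sec)`), y = `runtime (sec)`)) +
  theme_bw() +
  geom_bar(stat = "identity") +
  xlab("loop type")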
Let's Clear Out The Confusion
The reason for using the doParallel package is that with the older parallel package alone, fork-based parallelization does not work on Windows. doParallel aims to make parallel execution happen on all platforms (Unix, Linux, and Windows), so it's a very good wrapper.
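If you want a single script that picks a sensible backend everywhere, a sketch along these lines works (FORK where available, PSOCK otherwise):

library(doParallel)

# FORK is cheaper to start and shares memory, but is Unix-only
cluster_type <- if (.Platform$OS.type == "windows") "PSOCK" else "FORK"
cl <- makeCluster(detectCores() - 1, type = cluster_type)
if (cluster_type == "PSOCK") clusterExport(cl, "getPrimeNumbers")
result <- parLapply(cl, 10:100000, getPrimeNumbers)
stopCluster(cl)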