Simple Parallel randomForest with doMC package

I have been exploring how to speed up some of my R scripts and have started reading about some amazing corners of R. My first weapons were the Rcpp and RcppArmadillo packages. These are wonderful tools, and even for someone who has never written C++ before, there are enough examples and documentation to get started. I hope to share some of those approaches soon. The second, and the topic of this missive, is parallelization through the doMC package.

While answering a question on stats.stackexchange.com, I got to share a simple use case for the randomForest library, and I wanted to share it here as well. First, a little background on why this works. randomForest is an ensemble technique that trains a group of decision trees, each on a random subset of the training set (a group of trees: a forest, get it?). A new record is presented to the group, and whichever class the preponderance of the underlying decision trees choose becomes the classification. For our purposes, the most important aspect of this algorithm is that each decision tree is trained independently on a random subset of variables and records. In the parallel mindset, independence is next to godliness: it means you can spread the training of a forest over multiple cores or machines. Below is a simple single-machine example.

library("doMC")
library("randomForest")
data(iris)

registerDoMC(4)  # number of cores on the machine

# Train ten independent 50-tree forests in parallel (500 trees total) and
# stitch them together with randomForest's combine()
darkAndScaryForest <- foreach(y = seq(10), .combine = combine) %dopar% {
   set.seed(y)  # not strictly needed, but gives each forest a reproducible seed
   # norm.votes = FALSE keeps raw vote counts so combine() can sum them
   randomForest(Species ~ ., iris, ntree = 50, norm.votes = FALSE)
}

I passed randomForest's combine function (which stitches independently trained forests together) to foreach's similarly named .combine parameter (which controls how the output of each iteration is merged). The downside is that you get no OOB error rate or, more tragically, variable importance. But it will speed up your training.
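A quick sanity check shows what survives the merge. The sketch below reuses darkAndScaryForest from above; per the randomForest documentation, combine() concatenates the trees and sums the raw vote counts, but sets components such as err.rate and confusion to NULL.

# The combined object holds all ten forests' trees
darkAndScaryForest$ntree              # 500

# ...but the OOB machinery is gone after combine()
is.null(darkAndScaryForest$err.rate)  # TRUE

# Prediction still works as usual
predict(darkAndScaryForest, head(iris))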

Re-approaching problems with parallelization in mind is tough, but I have seen some very real speed improvements by doing so.
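If you want to put a number on that yourself, base R's system.time() makes a rough comparison easy. Here is a minimal sketch reusing the setup above (the rfSerial/rfParallel names are just for illustration); note that on a toy dataset like iris the forking overhead can swamp the gains, so the speedup really shows on larger data.

# Serial baseline: one call growing all 500 trees on a single core
system.time(
  rfSerial <- randomForest(Species ~ ., iris, ntree = 500)
)

# Parallel version: ten 50-tree forests spread across the registered cores
system.time(
  rfParallel <- foreach(y = seq(10), .combine = combine) %dopar%
    randomForest(Species ~ ., iris, ntree = 50, norm.votes = FALSE)
)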
