I have been exploring how to speed up some of my R scripts and have started reading about some amazing corners of R. My first weapon was the Rcpp and RcppArmadillo package. These are wonderful tools and even for someone that has never written c++ before, there are enough to examples and documentation to get started. I hope to share some of these approaches soon. The second and the topic of this missive is parallelization through the package doMC.
When answering a question on stats.stackoverflow, I got to share a simple use case with the randomForest library. I wanted to share that here as well. First a little background on why this works. randomForest is an ensemble technique that trains a group of decision trees on random subset of a training set( a group of trees, forest get it). A new record is presented to this group and what ever class the preponderance of underlying decision trees choose is the classification. For this example, one of the most important aspects of this algorithm is that each decision tree is independently trained on a random subset of variables and records. When in the parallel mindset independent is next godliness, what it means is that you can spread the training of a forest over multiple cores/machines. Below is a simple one machine example.
I passed the randomForest combine function(which will stitch independently trained forests together) to the similarly named .combine parameter( which controls the function on the output of the loop. The down side is you get no OOB error rate or more tragically variable importance. But this will speed up your training.
Re-approaching problems with parallelization in mind is a tough problem but i have seen some very real improvements in my speed by doing so.