There’s been some discussion on the Kaggle forums and on a few blogs about various ways to parallelize random forests, so I thought I’d add my thoughts on the issue.
Here’s my version of the ‘parRF’ function, which is based on the elegant version in the foreach vignette:
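A minimal sketch of such a function, consistent with the description that follows (the argument names `x`, `y`, and `mtry_values` are my own; the post only specifies that it takes a vector of mtry values and forwards extra arguments to randomForest):

```r
library(foreach)
library(randomForest)

parRF <- function(x, y, mtry_values, ...) {
  foreach(mtry = mtry_values,
          .combine = randomForest::combine,  # merge the fitted forests into one
          .multicombine = TRUE,
          .inorder = FALSE,                  # combine forests as they finish
          .packages = "randomForest") %dopar% {
    # One forest per element of mtry_values; extra arguments
    # (e.g. ntree) pass straight through to randomForest
    randomForest(x, y, mtry = mtry, ...)
  }
}
```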
This function works very simply: you pass it a vector of mtry values, and it fits one random forest per value and returns the combined result. You can also pass any additional arguments you want (like ntree) through to the randomForest function.
I think this function provides two improvements over previous implementations. First, you can use any parallel backend you want. doRedis is my current favorite, as it’s cross-platform and fault-tolerant, and it lets me commandeer idle laptops around the house/office when a random forest is taking too long to fit. Second, the argument .inorder=FALSE in the foreach call provides a small performance improvement, as it lets R combine the random forests as they finish, rather than forcing R to combine them in the order they started.
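Registering a doRedis backend takes just a couple of lines; the queue name "jobs" here is an arbitrary choice for illustration:

```r
library(doRedis)

# Point foreach's %dopar% at a Redis work queue
registerDoRedis("jobs")

# Start workers on this machine; idle machines elsewhere can join the
# same queue with redisWorker("jobs", host = "<redis-host>")
startLocalWorkers(n = 2, queue = "jobs")
```

Any other foreach backend (doParallel, doMC, etc.) can be swapped in by calling its register function instead; the parRF code itself doesn’t change.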
Let’s say you want a random forest with 5,000 trees. The default value for ntree is 500, so we use rep(4, 10) as the argument: ten mtry values of 4, giving ten forests of 500 trees each.
Maybe we’re unsure of the optimal mtry value, and want to combine two ensembles of 2,500 trees. Then we use the argument c(rep(3, 5), rep(4, 5)). This gives us 2,500 trees with mtry=3 and 2,500 with mtry=4. I like to think of this as a sort of meta-ensemble of decision trees, but I’ve yet to see it improve my predictive accuracy.
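Assuming a predictor matrix `x` and response `y` are already in memory (and a backend has been registered), the two calls described above would look something like this:

```r
# 5,000 trees total, all with mtry = 4
rf_single <- parRF(x, y, rep(4, 10))

# 2,500 trees with mtry = 3 plus 2,500 with mtry = 4
rf_mixed <- parRF(x, y, c(rep(3, 5), rep(4, 5)))

# The combined objects predict like any ordinary randomForest fit
# predict(rf_single, newdata = x_test)
```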
At the very least, this can help with those damn ‘out of memory’ errors I’ve been getting on my laptop when fitting random forests to large datasets.