by Mike Bowles
In two previous posts, A Thumbnail History of Ensemble Methods and Ensemble Packages in R, Mike Bowles — a machine learning expert and serial entrepreneur — laid out a brief history of ensemble methods and described a few of the many implementations in R. In this post Mike takes a detailed look at the Random Forests implementation in the RevoScaleR package that ships with Revolution R Enterprise.
Revolution Analytics' rxDForest() function provides an ideal tool for developing ensemble models on very large data sets. It allows the data scientist to do prototyping on a single CPU version of the random forest algorithm and to then shift with relative ease to a multi-core version for generating a higher-performance model on an extremely large data set. It’s convenient that the single-CPU and multiple-CPU versions operate on the same data, have many of the same input parameters and deliver the same types of performance summaries and analyses. Revolution Anaytics is one of a very small number of companies offering a true multi-core version of Random Forests. (I only know of one other)*.
The computationally intensive part of ensemble methods is training binary decision trees, and the computationally intense part of training a binary tree is split-point determination. Binary trees comprise a number of binary decisions that are of the form (attributeX < some number). The nodes in the tree pose this binary question and its answer determines whether an example goes to the left or right out of the node. To train the binary tree every possible split point for every attribute has to be tried in order to pick the best one. It’s easy to see why this split-point selection process consumes so much time — particularly on very large data sets. In a standard tree formulation (CART for example) the number of tests is equal to the number of points in the data set (not the number of examples (rows), the number of rows times the number of attributes (columns). This issue has been the subject of research for the last ten or so years.
The Google PLANET paper discusses the sensible idea of approximating the split-point selection process by aggregating points into bins instead of checking every possible value. More recent researchers have developed methods for generating approximate data histograms on streaming data. These methods are well-suited to the map-reduce environment and implemented in the Revolution Analytics version of binary decision trees and Random Forests. Their incorporation makes the computation faster and introduces a “binning” parameter that may be unfamiliar to long-time users of single-CPU versions of random forests.
The screen shots below the input and output for running the Revolution Analytics multi-cpu random forests on AWS. The software is being run through a server version of RStudio. Two CPUs are included in the cluster for building trees. The first screen shot shows the code input for building a predictive model on the UC Irvine data set for red wine taste scores. The code is the figure shows how little change is required for running Revolution Analytics’ multi-core version versus using one of the random forest packages available through CRAN.
The next two screen shots show some of the familiar output that’s available. The first plot shows the oob-prediction error as a function of the number of trees in the ensemble. The second plot gives variable importance. Those familiar with the wine taste data set will recognize that alcohol is correctly identified as the most significant feature for predicting wine taste.
* Editor's Note: H20 from 0xdata contains a multi-core Random Forest implementation.