[This article is adapted and expanded from presentations at Nvidia’s GTC conference in March, as well as UseR! in July, 2014.]
The Random Forest algorithm is a well-known tool for data analysis. It
yields robust predictive models for vastly different data sets, serving as a
sort of Swiss Army Knife in the data scientist’s toolkit. Given the need to
accommodate ever-larger data sets, scalable implementations are essential.
There are many opportunities to accelerate the algorithm but, in order to
appreciate the tradeoffs in implementation, an in-depth presentation of the
algorithm is in order.
The algorithm, in a nutshell
For expository purposes, we refer to the data as a (design) matrix with predictor-specific observations organized as columns. The rows of the matrix, then, correspond to observations on a given sample, across all observed predictors. The response, that is, the dependent variable being modelled, is a vector with as many components as there are samples.
A training phase first creates a series of decision trees, typically numbering in the hundreds. These are binary trees, each non-terminal node specifying a splitting predictor as well as a predicate to be evaluated on that predictor’s value, known as a splitting criterion. The terminal nodes specify score values. A tree derives a score for a given sample by walking from the root to a terminal according to the value of the splitting decision at each non-terminal. That is, the branch direction at a given node is determined by applying this predicate to the value of the splitting predictor for the sample. For a given sample, the value predicted by an individual tree is the score of the terminal reached. The value predicted by the entire forest is either the mean tree score for that sample, when the response is continuous, or the plurality of tree scores, when the response is categorical.
Two types of splitting criterion are employed, depending on the data type of the splitting predictor. If the predictor, say “x”, is continuous, then the predicate evaluates a partial order relation, such as “Is x less than or equal to 7.3?”. If the predictor, say “c”, is categorical with, for example, four categories, then the predicate evaluates a subset membership relation, such as “Is c one of the values 1, 2 or 4?”. In either case, the splitting criteria are Boolean-valued and so determine the sense of each branch taken as the tree is walked.
Each tree is constructed, or trained, from a separately-chosen subset of the original samples. These subsets are determined by random sampling, either with or without replacement, in a process known as bagging. Hence the training process begins on separate, independently-drawn submatrices of the original design matrix, each submatrix consisting of a different collection of rows. Note that if sampling is performed with replacement, it is likely that some rows will be selected more than once.
Decision tree construction
Training works by subdividing a sample set into two distinct subsets, recursively, until subdivision is no longer profitable. A root node is first created. Corresponding to the root, but not itself a part of the decision tree, is the collection of bagged samples from which training is initialized for that tree. It is this collection which is recursively subdivided as the tree is built. When a node is found to split – that is, when a subdivision is profitable – two subnodes are created on the decision tree and the corresponding splitting criterion and predictor are recorded on that node.
There are several reasons why it may not be profitable to split a node. The simplest reason is that there may be no nontrivial subdivisions of the sample set remaining. This may occur if the node corresponds to a single sample or multiple copies of the same sample. Another reason stems from heuristics: the user may have decided, a priori, that little useful predictive information is to be gained by splitting nodes with fewer than some number of distinct samples. Finally, it may be the case that the splitting algorithm itself could not identify a split capable of maximizing the information gain. More on this later.
What is the connection between node splitting and subdividing the samples? Recall that the process of walking a completed decision tree involves not merely asking a series of questions, but asking questions which are conditioned on answers to previous questions. Intuitively, the process of training a decision tree proceeds similarly. Not only is a criterion formulated according to the data (i.e., samples) at hand, but a subsequent criterion is formulated according to the data on which it is predicated. In other words, the samples available to a true- or false-sense subnode are precisely those samples for which the parent node’s splitting criterion is, respectively, true or false. Not only does a tree node split then but, in a very real sense, so does the collection of samples associated with it. Every branch taken in the decision tree implicitly divides the data into two subsets – a “bipartition” – either one conditioned on the sense of the branch taken.
How does a collection of samples lead to a splitting criterion? The idea is to collect the values of the response for each sample and then try to find bipartitions of this collection which maximize some measure of information gradient, typically Gini gain. That is, the algorithm tries to find a way to divide the samples into two distinct subsets over which the corresponding response values are somehow maximally different. Not every bipartition is evaluated, though. For one thing, the number of bipartitions grows exponentially with the number of distinct samples. More importantly, though, the algorithm is concerned only with bipartitions which are certain functions of the underlying predictor values at these samples, that is, criteria based on these values. For continuous predictors, the criteria are based on order relations and for discrete predictors they are based on set membership, as already described.
What role do the predictors play? At every node a random collection of predictors is selected. Bipartitions of the response values are evaluated on these predictors, over the set of samples corresponding to the node. The predictor (if any) maximizing the positive gain over this collection becomes the splitting predictor and the criterion at which the gain is maximal becomes the splitting criterion. It is worth pointing out that the randomly-sampled collection of predictors will vary at each node. That is, this collection is not fixed, but rather is sampled independently at each node.
Given a collection of available samples and predictors for a node, though, how is a splitting criterion actually determined? For a continuous predictor, the bipartitions arise from considering the samples in the respective predictor order and, at each position, dividing the ordered set into left and right components. For a discrete (categorical) predictor, the bipartitions consist exactly of all ways of dividing the available predictor values into distinct left and right components. The “winning” bipartition, then, indicates not only the splitting predictor and criterion but the left and right inheritance of samples, as well. In fact, the key challenge in accelerating the algorithm lies in economically maintaining the respective predictor-based orderings as the sample set itself splits into more, but smaller, subsets.
What happens to nodes that do not split? Besides being tagged as terminals in the decision tree, these nodes record the scores for their associated sample sets. Recall that score values for regression trees are means. The score for a regression tree terminal, then, is the mean response value of all associated samples. Similarly, the score for a classification tree terminal is the plurality categorical response value among all associated samples, with ties broken randomly.
Prediction and validation
The resulting forest is validated using the unsampled (out-of-bag) observations. The forest can also be saved and reapplied to separate data sets for prediction. Both validation and prediction apply the tree-walking procedure described above, but with different goals and over different data. Prediction applies the forest to a set of observations, usually different from the original, the goal being to predict an unknown response. Validation, on the other hand, applies the forest to the original observations, with an additional twist that each tree operates on a separate collection of samples. The goal of validation is to see how well the forest “predicts” the already-known response over the original data. The validation quality is often reported by a summary statistic, such as an R-squared value or a confusion matrix.