Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

GTC conference in March, as well as UseR! in July, 2014.]

The Random Forest algorithm is a well-known tool for data analysis. It
yields robust predictive models for vastly different data sets, serving as a
sort of Swiss Army Knife in the data scientist’s toolkit. Given the need to
accommodate ever-larger data sets, scalable implementations are essential.
There are many opportunities to accelerate the algorithm but, in order to
appreciate the tradeoffs in implementation, an in-depth presentation of the
algorithm is in order.

## The algorithm, in a nutshell

For expository purposes, we refer to the data as a (design) matrix with
predictor-specific observations organized as columns. The rows of the matrix,
then, correspond to observations on a given sample, across all observed
predictors. The response, that is, the dependent variable being
modelled, is a vector with as many components as there are samples.

## Decision trees

A training phase first creates a series of decision trees, typically
numbering in the hundreds. These are binary trees, each non-terminal node
specifying a splitting predictor as well as a predicate to be
evaluated on that predictor’s value, known as a splitting criterion.
The terminal nodes specify score values. A tree derives a score for a given
sample by walking from the root to a terminal according to the value of the
splitting decision at each non-terminal. That is, the branch direction at a
given node is determined by applying this predicate to the value of the
splitting predictor for the sample. For a given sample, the value predicted by
an individual tree is the score of the terminal reached. The value predicted by
the entire forest is either the mean tree score for that sample, when the
response is continuous, or the plurality of tree scores, when the response is
categorical.

Two types of splitting criterion are employed, depending on the data type of
the splitting predictor. If the predictor, say "x", is continuous, then the
predicate evaluates a partial order relation, such as "Is x less than
or equal to 7.3?". If the predictor, say "c", is categorical with, for example,
four categories, then the predicate evaluates a subset membership relation,
such as "Is c one of the values 1, 2 or 4?". In either case, the
splitting criteria are Boolean-valued and so determine the sense of each branch
taken as the tree is walked.

Each tree is constructed, or trained, from a separately-chosen subset of the
original samples. These subsets are determined by random sampling, either with
or without replacement, in a process known as bagging. Hence the
training process begins on separate, independently-drawn submatrices of the
original design matrix, each submatrix consisting of a different collection of
rows. Note that if sampling is performed with replacement, it is likely that
some rows will be selected more than once.

## Decision tree construction

Training works by subdividing a sample set into two distinct subsets,
recursively, until subdivision is no longer profitable. A root node is first
created. Corresponding to the root, but not itself a part of the decision tree,
is the collection of bagged samples from which training is initialized for that
tree. It is this collection which is recursively subdivided as the tree is
built. When a node is found to split – that is, when a subdivision is
profitable – two subnodes are created on the decision tree and the
corresponding splitting criterion and predictor are recorded on that node.

There are several reasons why it may not be profitable to split a node. The
simplest reason is that there may be no nontrivial subdivisions of the sample
set remaining. This may occur if the node corresponds to a single sample or
multiple copies of the same sample. Another reason stems from heuristics: the
user may have decided, a priori, that little useful predictive information is
to be gained by splitting nodes with fewer than some number of distinct
samples. Finally, it may be the case that the splitting algorithm itself could
not identify a split capable of maximizing the information gain. More on this
later.

What is the connection between node splitting and subdividing the samples?
Recall that the process of walking a completed decision tree involves not
merely asking a series of questions, but asking questions which are conditioned
on answers to previous questions. Intuitively, the process of training a
decision tree proceeds similarly. Not only is a criterion formulated according
to the data (i.e., samples) at hand, but a subsequent criterion is formulated
according to the data on which it is predicated. In other words, the samples
available to a true- or false-sense subnode are precisely those samples for
which the parent node’s splitting criterion is, respectively, true or false.
Not only does a tree node split then but, in a very real sense, so does the
collection of samples associated with it. Every branch taken in the decision
tree implicitly divides the data into two subsets – a "bipartition" – either
one conditioned on the sense of the branch taken.

How does a collection of samples lead to a splitting criterion? The idea is
to collect the values of the response for each sample and then try to find
bipartitions of this collection which maximize some measure of information
gradient, typically Gini gain. That is, the algorithm tries to find a way to
divide the samples into two distinct subsets over which the corresponding
response values are somehow maximally different. Not every bipartition is
evaluated, though. For one thing, the number of bipartitions grows
exponentially with the number of distinct samples. More importantly, though,
the algorithm is concerned only with bipartitions which are certain functions
of the underlying predictor values at these samples, that is, criteria based on
these values. For continuous predictors, the criteria are based on order
relations and for discrete predictors they are based on set membership, as

What role do the predictors play? At every node a random collection of
predictors is selected. Bipartitions of the response values are evaluated on
these predictors, over the set of samples corresponding to the node. The
predictor (if any) maximizing the positive gain over this collection becomes
the splitting predictor and the criterion at which the gain is maximal becomes
the splitting criterion. It is worth pointing out that the randomly-sampled
collection of predictors will vary at each node. That is, this collection is
not fixed, but rather is sampled independently at each node.

Given a collection of available samples and predictors for a node, though,
how is a splitting criterion actually determined? For a continuous predictor,
the bipartitions arise from considering the samples in the respective
predictor order and, at each position, dividing the ordered set into
left and right components. For a discrete (categorical) predictor, the
bipartitions consist exactly of all ways of dividing the available
predictor values into distinct left and right components. The
"winning" bipartition, then, indicates not only the splitting predictor and
criterion but the left and right inheritance of samples, as well. In fact, the
key challenge in accelerating the algorithm lies in economically maintaining
the respective predictor-based orderings as the sample set itself splits into
more, but smaller, subsets.

What happens to nodes that do not split? Besides being tagged as terminals
in the decision tree, these nodes record the scores for their associated sample
sets. Recall that score values for regression trees are means. The score for a
regression tree terminal, then, is the mean response value of all associated
samples. Similarly, the score for a classification tree terminal is the
plurality categorical response value among all associated samples, with ties
broken randomly.

## Prediction and validation

The resulting forest is validated using the unsampled (out-of-bag)
observations. The forest can also be saved and reapplied to separate data sets
for prediction. Both validation and prediction apply the tree-walking procedure
described above, but with different goals and over different data. Prediction
applies the forest to a set of observations, usually different from the
original, the goal being to predict an unknown response. Validation, on the
other hand, applies the forest to the original observations, with an additional
twist that each tree operates on a separate collection of samples. The goal of
validation is to see how well the forest "predicts" the already-known response
over the original data. The validation quality is often reported by a summary
statistic, such as an R-squared value or a confusion matrix.