**Mood Stochastic**, and kindly contributed to R-bloggers)

[This article is adapted and expanded from presentations at Nvidia’s

*GTC* conference in March, as well as *UseR!* in July, 2014.]

The Random Forest algorithm is a well-known tool for data analysis. It

yields robust predictive models for vastly different data sets, serving as a

sort of Swiss Army Knife in the data scientist’s toolkit. Given the need to

accommodate ever-larger data sets, scalable implementations are essential.

There are many opportunities to accelerate the algorithm but, in order to

appreciate the tradeoffs in implementation, an in-depth presentation of the

algorithm is in order.

## The algorithm, in a nutshell

For expository purposes, we refer to the data as a (design) matrix with

predictor-specific observations organized as columns. The rows of the matrix,

then, correspond to observations on a given sample, across all observed

predictors. The *response*, that is, the dependent variable being

modelled, is a vector with as many components as there are samples.

## Decision trees

A training phase first creates a series of decision trees, typically

numbering in the hundreds. These are binary trees, each non-terminal node

specifying a *splitting predictor* as well as a predicate to be

evaluated on that predictor’s value, known as a *splitting criterion*.

The terminal nodes specify score values. A tree derives a score for a given

sample by walking from the root to a terminal according to the value of the

splitting decision at each non-terminal. That is, the branch direction at a

given node is determined by applying this predicate to the value of the

splitting predictor for the sample. For a given sample, the value predicted by

an individual tree is the score of the terminal reached. The value predicted by

the entire forest is either the mean tree score for that sample, when the

response is continuous, or the plurality of tree scores, when the response is

categorical.

Two types of splitting criterion are employed, depending on the data type of

the splitting predictor. If the predictor, say "x", is continuous, then the

predicate evaluates a partial order relation, such as "Is *x* less than

or equal to 7.3?". If the predictor, say "c", is categorical with, for example,

four categories, then the predicate evaluates a subset membership relation,

such as "Is *c* one of the values 1, 2 or 4?". In either case, the

splitting criteria are Boolean-valued and so determine the sense of each branch

taken as the tree is walked.

Each tree is constructed, or trained, from a separately-chosen subset of the

original samples. These subsets are determined by random sampling, either with

or without replacement, in a process known as *bagging*. Hence the

training process begins on separate, independently-drawn submatrices of the

original design matrix, each submatrix consisting of a different collection of

rows. Note that if sampling is performed with replacement, it is likely that

some rows will be selected more than once.

## Decision tree construction

Training works by subdividing a sample set into two distinct subsets,

recursively, until subdivision is no longer profitable. A root node is first

created. Corresponding to the root, but not itself a part of the decision tree,

is the collection of bagged samples from which training is initialized for that

tree. It is this collection which is recursively subdivided as the tree is

built. When a node is found to split – that is, when a subdivision is

profitable – two subnodes are created on the decision tree and the

corresponding splitting criterion and predictor are recorded on that node.

There are several reasons why it may not be profitable to split a node. The

simplest reason is that there may be no nontrivial subdivisions of the sample

set remaining. This may occur if the node corresponds to a single sample or

multiple copies of the same sample. Another reason stems from heuristics: the

user may have decided, a priori, that little useful predictive information is

to be gained by splitting nodes with fewer than some number of distinct

samples. Finally, it may be the case that the splitting algorithm itself could

not identify a split capable of maximizing the information gain. More on this

later.

What is the connection between node splitting and subdividing the samples?

Recall that the process of walking a completed decision tree involves not

merely asking a series of questions, but asking questions which are conditioned

on answers to previous questions. Intuitively, the process of training a

decision tree proceeds similarly. Not only is a criterion formulated according

to the data (i.e., samples) at hand, but a subsequent criterion is formulated

according to the data on which it is predicated. In other words, the samples

available to a true- or false-sense subnode are precisely those samples for

which the parent node’s splitting criterion is, respectively, true or false.

Not only does a tree node split then but, in a very real sense, so does the

collection of samples associated with it. Every branch taken in the decision

tree implicitly divides the data into two subsets – a "bipartition" – either

one conditioned on the sense of the branch taken.

How does a collection of samples lead to a splitting criterion? The idea is

to collect the values of the response for each sample and then try to find

bipartitions of this collection which maximize some measure of information

gradient, typically Gini gain. That is, the algorithm tries to find a way to

divide the samples into two distinct subsets over which the corresponding

response values are somehow maximally different. Not every bipartition is

evaluated, though. For one thing, the number of bipartitions grows

exponentially with the number of distinct samples. More importantly, though,

the algorithm is concerned only with bipartitions which are certain functions

of the underlying predictor values at these samples, that is, criteria based on

these values. For continuous predictors, the criteria are based on order

relations and for discrete predictors they are based on set membership, as

already described.

What role do the predictors play? At every node a random collection of

predictors is selected. Bipartitions of the response values are evaluated on

these predictors, over the set of samples corresponding to the node. The

predictor (if any) maximizing the positive gain over this collection becomes

the splitting predictor and the criterion at which the gain is maximal becomes

the splitting criterion. It is worth pointing out that the randomly-sampled

collection of predictors will vary at each node. That is, this collection is

not fixed, but rather is sampled independently at each node.

Given a collection of available samples and predictors for a node, though,

how is a splitting criterion actually determined? For a continuous predictor,

the bipartitions arise from considering the samples in the respective

*predictor* order and, at each position, dividing the ordered set into

left and right components. For a discrete (categorical) predictor, the

bipartitions consist exactly of all ways of dividing the available

*predictor* values into distinct left and right components. The

"winning" bipartition, then, indicates not only the splitting predictor and

criterion but the left and right inheritance of samples, as well. In fact, the

key challenge in accelerating the algorithm lies in economically maintaining

the respective predictor-based orderings as the sample set itself splits into

more, but smaller, subsets.

What happens to nodes that do not split? Besides being tagged as terminals

in the decision tree, these nodes record the scores for their associated sample

sets. Recall that score values for regression trees are means. The score for a

regression tree terminal, then, is the mean response value of all associated

samples. Similarly, the score for a classification tree terminal is the

plurality categorical response value among all associated samples, with ties

broken randomly.

## Prediction and validation

The resulting forest is validated using the unsampled (*out-of-bag*)

observations. The forest can also be saved and reapplied to separate data sets

for prediction. Both validation and prediction apply the tree-walking procedure

described above, but with different goals and over different data. Prediction

applies the forest to a set of observations, usually different from the

original, the goal being to predict an unknown response. Validation, on the

other hand, applies the forest to the original observations, with an additional

twist that each tree operates on a separate collection of samples. The goal of

validation is to see how well the forest "predicts" the already-known response

over the original data. The validation quality is often reported by a summary

statistic, such as an R-squared value or a confusion matrix.

**leave a comment**for the author, please follow the link and comment on their blog:

**Mood Stochastic**.

R-bloggers.com offers

**daily e-mail updates**about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...