Version 0-1.8 of the Rborist implementation of the Random
Forest (TM) algorithm is now available from CRAN. Although most changes involve
refactoring to accommodate future updates, there are several bug fixes and
enhancements worth mentioning.
The new option maxLeaf allows a limit to be set on the number of
terminal nodes (i.e., leaves) in each trained tree. To avoid introducing
behavior dependent upon the training algorithm, the limit is enforced by
pruning leaves from fully-trained trees rather than by capping growth
during training.
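As a minimal sketch of the new option, the call below caps each tree at a hypothetical limit of 16 leaves; the data set and the remaining (default) arguments are illustrative, not drawn from the experiments reported here:

```r
library(Rborist)

# Illustrative only: train on a small built-in data set,
# pruning each tree back to at most 16 terminal nodes.
x <- data.matrix(iris[, 1:4])
y <- iris$Species
rb <- Rborist(x, y, maxLeaf = 16)
```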
A few users had reported premature termination of their R sessions. All
reproducible instances were found to originate from uncaught exceptions thrown
from glue code. These are now caught.
An error in splitting sparse representations resulted in egregious behavior
when employing autocompression. This error has been repaired.
There has been a redoubling of effort to ensure regular data-access
patterns, particularly within nested loops. While decision-tree training is at
heart an exercise in random memory access, there are tangible benefits to
keeping irregular accesses to a minimum. In particular, we continue to see
improvements to both the Arborist’s performance and its scalability as this
focus is maintained.
On the now well-known flight-delay data set, for example,
Rborist achieves parity with Xgboost, the
acknowledged leader in performance, as the size of the training set increases.
Experiments were performed on a single-node, 8-core Intel i7 machine with
32GB of DDR3 RAM. Two sets of timings were recorded for Rborist, one
employing factors to represent categorical predictors, the other employing a
one-hot encoding. Xgboost, which employs a one-hot encoding, was run with the
default fast (“approximate”) execution mode. Comparative timings are given in
the table below, with the AUC value of prediction in parentheses:
Finally, we note the passing of an unreported bug. As Zhao et al. note in
their 2017 JSS paper, the initial release of Rborist, 0-1.0,
“fails even on the small example of ParallelForest”, a census of household
incomes. Although identified in 2015, the failure was never reported to us,
and we remain unclear as to its provenance. Lamentable as this state of affairs is,
suffice it to say that version 0-1.8, training the example as a two-category
classification model, completes in approximately 25 seconds and infers the test
outcomes with an AUC value of 0.9.
Thanks go out to a number of people who have suggested new features,
isolated bugs and corrected documentation. These include Ryan Ballantine, Brent
Crossman, Chris Kennedy, Mark Juarez and Pantelis Hadjipantelis.