Version 0-1.8 of the Rborist implementation of the Random Forest (TM) algorithm is now available from CRAN. Although most changes involve refactoring to accommodate future updates, there are several bug fixes and enhancements worth mentioning.
The new option maxLeaf allows a limit to be set on the number of terminal nodes (i.e., leaves) in each trained tree. So as not to introduce behavior dependent upon the training algorithm, leaves are pruned from fully-trained trees rather than limited during training.
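The idea of pruning after training, rather than capping growth during it, can be illustrated with a minimal sketch. This is not Rborist's implementation: the `Node` class, the `gain` field and the rule of collapsing the weakest split first are all assumptions made for illustration, since the criterion Rborist uses to choose which leaves to merge is not described here.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    """Hypothetical binary tree node; `gain` is an assumed split-quality score."""
    gain: float = 0.0
    left: Optional["Node"] = None
    right: Optional["Node"] = None

    def is_leaf(self) -> bool:
        return self.left is None and self.right is None


def count_leaves(node: Node) -> int:
    """Number of terminal nodes in the subtree rooted at `node`."""
    if node.is_leaf():
        return 1
    return count_leaves(node.left) + count_leaves(node.right)


def collapsible(node: Node) -> List[Node]:
    """Internal nodes whose two children are both leaves."""
    if node.is_leaf():
        return []
    if node.left.is_leaf() and node.right.is_leaf():
        return [node]
    return collapsible(node.left) + collapsible(node.right)


def prune_to_max_leaf(root: Node, max_leaf: int) -> Node:
    """Collapse splits on an already-trained tree until it has at most
    `max_leaf` leaves. Each collapse merges two sibling leaves into their
    parent, so the leaf count drops by exactly one per iteration."""
    while count_leaves(root) > max_leaf:
        candidates = collapsible(root)
        if not candidates:  # the root is itself a leaf; nothing left to merge
            break
        # Assumed criterion: remove the split with the smallest gain first.
        weakest = min(candidates, key=lambda n: n.gain)
        weakest.left = None
        weakest.right = None
    return root
```

Because the tree is grown to completion before any leaves are merged, the pruned result depends only on the final tree and the chosen criterion, not on the order in which the trainer happened to create nodes.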
A few users had reported premature termination of their R sessions. All reproducible instances were found to originate from uncaught exceptions thrown from glue code. These are now caught.
An error in splitting sparse representations resulted in egregious behavior when employing autocompression. This error has been repaired.
We have redoubled our efforts to ensure regular data-access patterns, particularly within nested loops. While decision-tree training is at heart an exercise in random memory access, there are tangible benefits to keeping irregular accesses to a minimum. In particular, we continue to see improvements to both the Arborist’s performance and its scalability as this focus is maintained.
On the now well-known flight-delay data set, for example, Rborist achieves parity with Xgboost, the acknowledged leader in performance, as the size of the training set increases. Experiments were performed on an 8-core, single-node i7 machine with 32GB of DDR3 RAM. Two sets of timings were recorded for Rborist, one employing factors to represent categorical predictors, the other employing a one-hot encoding. Xgboost, which employs a one-hot encoding, was run with the default fast (“approximate”) execution mode. Comparative timings are given in the table below, with the AUC value of prediction in parentheses:
Finally, we note the passing of an unreported bug. As Zhao et al. note in their 2017 JSS paper, the initial release of Rborist, 0-1.0, “fails even on the small example of ParallelForest”, a census of household incomes. Although identified in 2015, the failure was never reported and we remain unclear as to its provenance. Lamentable as this state of affairs is, suffice it to say that version 0-1.8, training the example as a two-category classification model, completes in approximately 25 seconds and infers the test outcomes with an AUC value of 0.9.
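For readers unfamiliar with the AUC figures quoted above, the following is a minimal sketch of the standard rank-based definition for a two-category model: the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative one, with ties counted as half. The function name and the toy data are illustrative assumptions; this is not the code used to produce the reported values.

```python
from typing import Sequence


def auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Area under the ROC curve for binary labels (1 = positive, 0 = negative).

    Computed by comparing every positive score against every negative score.
    The pairwise loop is quadratic, which is acceptable for illustration but
    not for large test sets.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive correctly ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))


# Toy example: three of the four positive/negative pairs are ordered correctly.
print(auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```

A value of 0.5 corresponds to random ranking and 1.0 to a perfect separation of the two categories.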
Thanks go out to a number of people who have suggested new features, isolated bugs and corrected documentation. These include Ryan Ballantine, Brent Crossman, Chris Kennedy, Mark Juarez and Pantelis Hadjipantelis.