Rborist 0.1-6 now on CRAN

May 14, 2017

(This article was first published on Mood Stochastic, and kindly contributed to R-bloggers)

The latest release of the Rborist package, which provides an accelerated
implementation of the Random Forest (TM) algorithm, is available from CRAN.
Version 0.1-6 offers several notable improvements:

Sparse matrix representation

Sparse numeric dgCMatrix objects, as defined by the Matrix package, are now
accepted as input, provided an intra-column encoding is employed. This
representation is particularly useful, for example, in the case of one-hot
encodings.
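
To see why a column-compressed layout suits one-hot encodings, the following
minimal Python sketch converts a small one-hot matrix into three arrays that
mirror the x, i and p slots of a dgCMatrix (nonzero values, their row indices,
and per-column pointers). The helper names to_csc and one_hot and the toy
labels are made up for illustration. This is not Rborist's internal code.

```python
def one_hot(labels):
    """Dense one-hot encoding: one column per distinct label, in sorted order."""
    levels = sorted(set(labels))
    return [[1 if lab == lev else 0 for lev in levels] for lab in labels]


def to_csc(dense):
    """Convert a dense matrix (list of rows) to column-compressed form.

    Returns (x, i, p): the nonzero values, their row indices, and the
    column pointers, where column j occupies positions p[j]..p[j+1]-1.
    """
    n_col = len(dense[0]) if dense else 0
    x, i, p = [], [], [0]
    for col in range(n_col):
        for row, values in enumerate(dense):
            if values[col] != 0:
                x.append(float(values[col]))
                i.append(row)
        p.append(len(x))  # marks the end of this column's entries
    return x, i, p


# Hypothetical toy data: four observations of a three-level factor.
labels = ["b", "a", "c", "a"]
dense = one_hot(labels)
x, i, p = to_csc(dense)

print("dense cells:", len(dense) * len(dense[0]))  # 12 cells stored densely
print("nonzeros:   ", len(x))                      # 4 values stored sparsely
print("i =", i, "p =", p)
```

A one-hot matrix has exactly one nonzero per row, so the sparse form stores
one value per observation regardless of how many levels the factor has.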

Additionally, Rborist now autocompresses training data on a
per-predictor basis, compactly representing runs of any repeated value. This
space-saving feature is most useful when training iteratively, using the
preFormat feature.
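
The idea of compressing runs per predictor can be sketched with plain
run-length encoding. The Python below is a generic illustration only: the
function names, the "keep runs only when they save space" rule, and the toy
columns are assumptions made for this example, not a description of how
Rborist stores its data.

```python
from itertools import groupby


def rle_encode(column):
    """Return (value, run_length) pairs for consecutive equal values."""
    return [(value, sum(1 for _ in run)) for value, run in groupby(column)]


def rle_decode(runs):
    """Expand (value, run_length) pairs back into the original column."""
    out = []
    for value, length in runs:
        out.extend([value] * length)
    return out


def compress_columns(columns):
    """Decide per predictor whether run-length encoding is worthwhile."""
    compressed = {}
    for name, column in columns.items():
        runs = rle_encode(column)
        # Each run stores two numbers, so it pays off only if 2 * runs < length.
        if 2 * len(runs) < len(column):
            compressed[name] = ("rle", runs)
        else:
            compressed[name] = ("dense", list(column))
    return compressed


# Hypothetical toy predictors, invented for illustration.
columns = {
    "indicator": [0, 0, 0, 0, 1, 1, 0, 0, 0, 0],
    "measurement": [3.5, 1.2, 7.8, 3.5, 9.1, 0.4, 1.2, 6.6, 2.3, 5.0],
    "constant": [7, 7, 7, 7, 7, 7, 7, 7, 7, 7],
}
compressed = compress_columns(columns)
for name, (kind, payload) in compressed.items():
    print(name, kind, payload)
```

Here the run values are arbitrary (0, 1 and 7 all compress), while the column
with no repeated neighbours is left in dense form.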

Pruned representation

A new option thinLeaves allows trained forests to be recorded in a
slender format, economizing on storage.

Vignette

A vignette has been provided to guide users through Rborist’s various
capabilities. It is hoped that this will invite more users to try the package
and make it easier to use.

Improved scalability

Particular attention has been paid to limiting data movement and exploiting
data locality. This has paid dividends in the ability of the implementation to
scale across larger data sets.

The graph below illustrates recent progress by comparing execution times of
Rborist with Xgboost on a flight-delay data set. Xgboost is considered to be
among the fastest open-source packages implementing decision-tree methods.
The flight-delay data and execution scripts are hosted on Szilard Pafka’s
benchm-ml project on GitHub. One script was modified to extend the sample
limit from 10 million to 12.5 million rows, approximately the maximum
available from the data. Timings were performed on a two-socket Xeon server:

[Figure: Flight.jpg, execution times of Rborist and Xgboost on the flight-delay data]

Of particular interest is the inflection point apparent near one million
rows. This is likely due to crossing a level of the memory hierarchy. That is,
more and more data must be accessed from outside the L1 cache. Although
Xgboost remains faster throughout this regime, Rborist appears better able
to handle the transition, and the two are nearly even at 12.5 million rows.
Additional testing will be needed to learn how far these scaling trends
extend.

Thanks go out to Chris Kennedy, Christopher Brown, Carlos Ortega and Tal
Galili, whose comments and contributions helped make this a successful
release.
