A First Look at rxDForest()

January 30, 2014

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

by Joseph Rickert

Last July, I blogged about rxDTree(), the RevoScaleR function for building classification and regression trees on very large data sets. As I explained then, this function is an implementation of the algorithm introduced by Ben-Haim and Yom-Tov in their 2010 paper, which builds trees on histograms of the data rather than on the raw data itself. This algorithm is designed for parallel and distributed computing. Consequently, rxDTree() provides the best performance when it is running on a cluster: either a Microsoft HPC cluster or a Linux LSF cluster.
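To give a feel for why histograms are enough to choose a split, here is a minimal base R sketch. It is not the Ben-Haim and Yom-Tov algorithm itself and not what rxDTree() does internally: it assumes simulated data and fixed equal-width bins, whereas the published algorithm builds its histograms adaptively. The point is only that the split search below touches per-bin counts and sums and never the raw rows.

```r
set.seed(1)                                   # arbitrary seed, for reproducibility
n <- 5000                                     # hypothetical number of rows
x <- runif(n, 0, 100)                         # simulated predictor
y <- as.numeric(x > 60) + rnorm(n, sd = 0.3)  # response jumps at x = 60

breaks <- seq(0, 100, by = 5)                 # fixed equal-width bins (an assumption)
bin    <- cut(x, breaks = breaks, include.lowest = TRUE)

# Per-bin summaries: the only information the split search uses
binN   <- as.numeric(table(bin))
binSum <- as.numeric(tapply(y, bin, sum))

# Candidate splits are the interior bin boundaries
k    <- length(binN)
nL   <- cumsum(binN)[-k]
sumL <- cumsum(binSum)[-k]
nR   <- sum(binN) - nL
sumR <- sum(binSum) - sumL

# Reduction in the sum of squares, up to a constant: larger is better
gain      <- sumL^2 / nL + sumR^2 / nR
bestSplit <- breaks[-c(1, length(breaks))][which.max(gain)]
```

Because each worker only has to ship its bin counts and sums, summaries like these can be computed on separate chunks of the data and added together, which is what makes the approach attractive on a cluster.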

rxDForest() (new with Revolution R Enterprise 7.0) uses rxDTree() to take the next logical step and implement a random forest type algorithm for building both classification and regression forests. Each tree of the ensemble constructed by rxDForest() is built with a bootstrap sample that uses about 2/3 of the original data. The data not used in building a particular tree (the out-of-bag data) is used to make predictions with that tree. Each point of the original data set is fed through all of the trees that were built without it. The decision forest prediction for that data point aggregates the individual tree predictions: for classification problems it is a majority vote (the statistical mode), and for regression problems it is the mean of the predictions.
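The bootstrap and out-of-bag ideas can be sketched in a few lines of base R. This is an illustration under stated assumptions, not rxDForest() internals: the row count, the number of trees, and the helper names oobVote() and oobMean() are all hypothetical. Sampling n rows with replacement leaves roughly 63% of the distinct rows in each sample, which is the "about 2/3" figure.

```r
set.seed(42)                       # arbitrary seed, for reproducibility
n     <- 10000                     # hypothetical number of rows
nTree <- 50                        # hypothetical number of trees

# inBag[i, t] is TRUE when row i was drawn into the bootstrap sample of tree t
inBag <- sapply(seq_len(nTree), function(t) {
  seq_len(n) %in% sample.int(n, size = n, replace = TRUE)
})

# Average fraction of distinct rows per bootstrap sample;
# the expected value is 1 - (1 - 1/n)^n, roughly 0.632
inBagFraction <- mean(colMeans(inBag))

# Aggregate one row's predictions over the trees that did NOT see that row.
# pred holds one prediction per tree; inBagRow is that row of inBag.
oobVote <- function(pred, inBagRow) {        # classification: majority vote
  names(which.max(table(pred[!inBagRow])))
}
oobMean <- function(pred, inBagRow) {        # regression: mean
  mean(pred[!inBagRow])
}
```

For example, oobVote(c("a", "b", "b", "a", "b"), c(TRUE, FALSE, FALSE, FALSE, FALSE)) ignores the first tree, which saw the row, and takes the majority among the remaining four predictions.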

Only a couple of parameters need to be set to fit a decision forest: nTree specifies the number of trees to grow, and mTry specifies the number of variables to sample as split candidates at each tree node. Of course, many more parameters can be set to control the algorithm, including the parameters that control the underlying rxDTree() algorithm.
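What mTry controls can be shown with a small base R sketch. The helper splitCandidates() is hypothetical and is not part of RevoScaleR, and treating every non-response column of the mortgage data as a predictor is an assumption. At each node, only the sampled variables are considered for the split.

```r
# Column names from the mortgage data, minus the response (default)
predictors <- c("creditScore", "houseAge", "yearsEmploy", "ccDebt", "year")

# Hypothetical helper: draw mTry variables, without replacement, for one node
splitCandidates <- function(vars, mTry) {
  stopifnot(mTry >= 1, mTry <= length(vars))
  sample(vars, size = mTry)
}

set.seed(7)                        # arbitrary seed, for reproducibility
# Candidate sets for three different nodes; each node gets a fresh draw
nodeCandidates <- replicate(3, splitCandidates(predictors, mTry = 2),
                            simplify = FALSE)
```

A smaller mTry makes the individual trees less alike, since each node sees fewer of the variables, at the cost of weaker individual trees.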

The following is a small example of the rxDForest() function using the mortgage default data set that can be downloaded from Revolution Analytics' website. Here are the first three lines of data.

  creditScore houseAge yearsEmploy ccDebt year default
1         615       10           5   2818 2000       0
2         780       34           5   3575 2000       0
3         735       12           1   3184 2000       0

The idea is to see if the variables creditScore, houseAge, etc. are useful in predicting a default. The RevoScaleR code in the downloadable file RxDForest reads in the mortgage data, splits the data into a training file and a test file, uses rxDTree() to build a single tree (just to see what one looks like for this file) and plots the tree. Then rxDForest() is run against the training file to build an ensemble model, and this model is run against the test file to make predictions. Finally, the code plots the ROC curve for the decision forest ensemble model.
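The train/test step of that workflow can be sketched on an ordinary data frame in base R. Everything here is an assumption made for illustration: the data are simulated, the 80/20 proportion is a guess, and the name mdTest is hypothetical. The actual script works on RevoScaleR data files rather than in-memory data frames.

```r
set.seed(2014)                               # arbitrary seed, for reproducibility
n <- 1000                                    # hypothetical row count

# Simulated stand-in for the mortgage data (only three of the columns)
mortgage <- data.frame(
  creditScore = sample(500:800, n, replace = TRUE),
  ccDebt      = sample(0:15000, n, replace = TRUE),
  default     = rbinom(n, size = 1, prob = 0.05)
)

# Assign each row to the training set with probability 0.8 (an assumption)
isTrain <- runif(n) < 0.8
mdTrain <- mortgage[isTrain, ]
mdTest  <- mortgage[!isTrain, ]
```

Drawing one uniform random number per row, rather than sampling row indices, is a pattern that also works when the data are processed a chunk at a time.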

Here is what the first few nodes of the tree look like. (The full tree is printed at the bottom of the code in the file above.)

rxDTree(formula = form1, data = "mdTrain", maxDepth = 5)
File: C:\Users\Joe.Rickert\Documents\Revolution\RevoScaleR\mdTrain.xdf
Number of valid observations: 8000290
Number of missing observations: 0

Tree representation:
n= 8000290

node), split, n, deviance, yval
* denotes terminal node

 1) root 8000290 39472.30000 4.958445e-03
   2) ccDebt< 9085.5 7840182 21402.25000 2.737309e-03
     4) ccDebt< 7844 7384170 8809.46500 1.194447e-03

Here is a plot of the right part of the tree, drawn with RevoScaleR's createTreeView() function, which enables plot() to display the graph in your browser.




And, finally, here is the ROC curve for the decision forest model. (The text output describing the model is also in the file containing the code.)
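For readers who want to see what goes into such a curve, here is a minimal base R sketch that computes ROC points and the area under the curve from scratch. The data are simulated and the helper rocPoints() is hypothetical; this is not how the RevoScaleR script produces its plot.

```r
set.seed(123)                                # arbitrary seed, for reproducibility
n <- 2000                                    # hypothetical test set size

# Simulated stand-in for test set results: a 0/1 outcome and a model score,
# where higher scores are more likely to be defaults
default <- rbinom(n, size = 1, prob = 0.2)
score   <- rnorm(n, mean = ifelse(default == 1, 1, 0))

# Hypothetical helper: lower the threshold one observation at a time and
# record the true and false positive rates at each step
rocPoints <- function(actual, score) {
  actual <- actual[order(score, decreasing = TRUE)]
  tpr <- cumsum(actual == 1) / sum(actual == 1)
  fpr <- cumsum(actual == 0) / sum(actual == 0)
  data.frame(fpr = c(0, fpr), tpr = c(0, tpr))
}

roc <- rocPoints(default, score)

# Area under the curve by the trapezoid rule
auc <- sum(diff(roc$fpr) * (head(roc$tpr, -1) + tail(roc$tpr, -1)) / 2)

# To draw the curve:
# plot(roc$fpr, roc$tpr, type = "l",
#      xlab = "False positive rate", ylab = "True positive rate")
```

A model with no predictive value would trace the diagonal and give an area of about 0.5, so the further the curve bows toward the upper left corner, the better the model separates defaults from non-defaults.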


I plan to try rxDForest() out on a cluster with a bigger data set. When I do, I will let you know. 
