by Joseph Rickert
The rxDTree() function included in the RevoScaleR package distributed with Revolution R Enterprise is an an example of a new class of algorithms that are being developed to deal with very large data sets. Although the particulars differ, what these algorithms have in common is the use of approximations, methods of summarizing or compressing data and built-in parallelism. I think that it is really interesting to see something as basic to modern statistics as a tree algorithm rejuvenated this way.
In the nearly thirty years since Breiman et al. introduced classification and regression trees they have become part of the foundation for modern nonparametric statistics, machine learning and data mining. The basic implementation of these algorithms in R’s rpart() function (recursive partitioning and regression trees) and elsewhere have proved to be adequate for many large scale, industrial strength data analysis problems. Nevertheless, today’s very large data sets (“Big Data”) present significant challenges for decision trees. In part this is due to the need to sort all the numerical attributes used in a model in order to determine the split points. One approach to dealing with the issue is to avoid sorting the raw data altogether by working with an approximation of the data.
In a 2010 paper, Ben-Haim and Yom-Tov introduce a novel algorithm along these lines by using histograms to build trees. This algorithm, explicitly designed for parallel computing, takes the approach of implementing horizontal parallelism: each processing node sees all of the variables for a subset (chunk) of the data. These “compute” nodes build histograms of the data and the master node integrates the histograms and builds the tree. The details of the algorithm, its behavior and performance characteristics are described in a second, longer paper by the same authors.
One potential downside of the approach is that since the algorithm only examines a limited number of split points (the boundaries of the histogram bins), for a given data set, it may produce a tree that is different from what rpart() would build. In practice though, this is not as bad as it sounds. Increasing the number of bins improves the accuracy of the algorithm. Moreover, Ben-Haim and Yom-Tov provide both an analytical argument and empirical results that show the error rate of trees built with their algorithm approaches the error rate of the standard tree algorithm.
rxDTree() is an implementation of the Ben-Haim and Yom-Tov algorithm designed for working with very large data sets in a distributed computing environment. Most of the parameters controlling the behavior of rxDTree() are similar to those of rpart(). However, rxDTree() provides an additional parameter: maxNumBins specifies the maximum number of bins to use in building histograms and hence, controls the accuracy of the algorithm. For small data sets where you can test it out, specifying a large number of bins will enable rxDTree() to produce exactly the same results as rpart().
Because of the computational overhead involved with the histogram building mechanisms of rxDTree() you might expect it to be rather slow with small data. However, we have found that rxDTree performs well with respect to rpart() even for relatively small data sets. The following script gives some idea of the performance that can be expected from running on a reasonably complex data set. (All 59 explanatory variables are numeric.). The script reads in the segmentationData set from the caret package, replicates the data to produce a file containing 2,021,019 rows, specifies a model and then runs it using both rpart() and rxDTree().
########## BENCHMARKING rxDTree ON A CLUSTER ############# # # This script was created to show some simple benchmarks for the RevoScaleR # rxDTree function for building classification and regression trees on large data sets. # The benchmarks were run on a 5 node HPC cluster comprised of Intel 16 GB of RAM per node) # The script does the following: # 1. Fetch the 2,019 row by 61 columns segmentationData set from the caret package # 2. Set up a compute context to run the code on a Microsoft HPC Cluster # 3. Replicate the SegmentationData to create a file with 2,021,019 rows # 4. Set up the formula and other parameters for the model # 5. Run the rxDTree to build a classification model #------------------------------------------------------------------------------------------------- # Get SegmentationData from caret package library(caret) data(segmentationData) #dim:  2019 61 rxOptions(reportProgress = 0) # Set up comput Contect for HPC Cluster grxTestComputeContext <- RxHpcServer( dataPath = c("C:/data"), headNode="cluster-head2", #headNode="192.168.10.117", revoPath="C:\\Revolution\\R-Enterprise-Node-7.0\\R-2.15.3\\bin\\x64\\", shareDir=paste("AllShare\\", Sys.getenv("USERNAME"), sep=""), nodes = "COMPUTE11,COMPUTE12,COMPUTE13,COMPUTE10", consoleOutput=TRUE, autoCleanup=FALSE) rxSetComputeContext(grxTestComputeContext) # Replicate the data to build a larger file times <- 1000 # how many time to replicate the data set segmentationDataBig <- do.call(rbind, sapply(rep("segmentationData", times), as.name)) #elapsed time: 136.652 rxDataStep(inData = segmentationDataBig, outFile = "segmentationDataBig", overwrite = TRUE) # Build the formula and set parameters allvars <- names(segmentationData) xvars <- allvars[-c(1, 2, 3)] (form <- as.formula(paste("Class", "~", paste(xvars, collapse = "+")))) # cp <- 0.01 # Set the complecity parameter for the tree xval <- 0 # Don't do any cross validation maxdepth <- 5 # Set the maximum tree depth ##--------------------------------------------------------------------- # Run the model with rxDtree on the big xdf file system.time( dtree.model.xdf <- rxDTree(form, data = "segmentationDataBig", maxDepth = maxdepth, cp = cp, xVal = xval, blocksPerRead = 250) )["elapsed"] ##elapsed ## 50.52
On an Intel Quad core, 64-bit, system Q9300 with 2.5GHz processors and 8 GB of RAM the elapsed time for rpart()to build the tree was 312.27 seconds, while rxDTree() which took advantage of the parallelism possible with four cores ran in 71.39 seconds. The real payoff, however, comes from running rxDTree() on a cluster. The same model run on 5 node, Microsoft HPC assembled from similar commodity hardware (4 cores per node, each running at 3.20 GHz with16GB of RAM) took 50.52 seconds to build a tree from data stored in a binary .xdf file. We expect that elapsed time will scale linearly with the number of rows.
One last observation about the script is the use of the functions RxHpcServer() and rxSetComputeContext(). Together these provide a concrete example of how the distributed computing infrastructure of RevoScaleR preserves R's ease of use and working efficiency in a "Big Data" environment. The first function defines the "compute context" to be a Microsoft HPC cluster. The second function tells parallel external memory algorithms (PEMAs), like rxDTree, to use this cluster to execute. So, by merely changing one line of code PEMAs can be tested on a PC and then run on a cluster or other production environment. For more details on rxDTree have a look at Big Data Decision Trees with R.