Parallelizing Random Forests in R with BatchJobs and OpenLava
By: Gord Sissons and Feng Li
In his series of blog posts about machine learning, Trevor Stephens focuses on a survival model for the Titanic disaster and provides a tutorial explaining how decision trees tend to over-fit, yielding anomalous predictions.
How do we build a better predictive model? The answer, as Trevor observes, is to grow a whole forest of decision trees, let the models grow as deep as they will, and let these randomized models vote on the outcome. It turns out that a large collection of imperfect models leads to a more accurate result, as imperfections in the individual models tend to cancel each other out. In predicting survivability of the Titanic disaster, Trevor shows how randomForest can be used in skilled hands to build a more accurate predictive model.
Getting lost in the woods
For many problems, learning algorithms like random forest can take a long time to execute. This is especially true when we run models repeatedly to find optimal approaches and parameters. If models take hours or days to execute, it is easy to get lost in the woods. Long-running models make it difficult and time consuming to find the optimal approach to solving a complex problem.
The answer is to parallelize execution. Thankfully, packages like R’s machine learning library (mlr) help us parallelize machine learning algorithms like random forests without a lot of difficulty. We can leverage approaches like multi-core parallelism using parallel (now part of base R), socket and MPI clustering using snow, and workload managers deployed on high-performance computing environments using BatchJobs.
Bernd Bischl is one of the key developers of mlr, BatchJobs and parallelMap. He provides an example of how mlr and parallel can be used with socket level parallelism to run a resampling algorithm to assess the performance of a machine learning algorithm. We want to show how we can scale calculations larger, and collapse run-times beyond what can be achieved using multi-core or socket parallelism using an open source workload manager. Adding a workload manager offers many benefits including performance, scalability, manageability and reliability. It allows multiple users to share even transient cloud-based clusters efficiently.
First – a more “grown up” dataset
Datasets like iris are often used for experimentation because they are small and easy to work with, but to show parallel acceleration we need a larger dataset that yields longer run-times. We chose the adult data set from the machine learning repository at the University of California, Irvine. Don’t get the wrong impression from the name – the adult data set is PG friendly, and appropriate for this fine blog and its classy readership. The adult dataset is an extract from the 1994 US census database. With 32,561 rows it is hardly huge, but it is considerably larger than our Titanic passenger list (891 rows) or the even smaller iris dataset. It includes a range of demographic information for a set of citizens, including education, race, gender and marital status, and can be used for a variety of purposes, including building models to predict key measures like income – a statistic of interest not only to data scientists but to government entities and marketers as well.
Running the model without parallelization
First, we consider an example that runs a random forest machine learning computation with re-sampling. For our purposes we are less interested in the science than in showing how we can parallelize and accelerate the computation (while making sure, of course, that the parallelized run arrives at the same answer!).
If you don’t have the mlr and randomForest packages installed, you will need to add them to your R environment:
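A minimal setup sketch (the package names below are as published on CRAN):

```r
# One-time setup: install the required packages from CRAN
install.packages(c("mlr", "randomForest"))

# Load them into the current R session
library("mlr")
library("randomForest")
```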
You’ll also want the dataset itself. We downloaded the US census adult dataset into the directory ~/examples/data:
$ wget http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data
The R script loads the dataset into a dataframe called adult, and performs the random forest calculation as shown below.
library("mlr")
setwd("~/examples")
lrn = makeLearner("classif.randomForest")
adult = read.table("data/adult.data", sep = ",", header = FALSE,
                   col.names = c("age", "type_employer", "fnlwgt", "education",
                                 "education_num", "marital", "occupation",
                                 "relationship", "race", "sex", "capital_gain",
                                 "capital_loss", "hr_per_week", "country", "income"),
                   fill = FALSE, strip.white = TRUE)
adult.task = makeClassifTask(data = adult, target = "education")
rdesc = makeResampleDesc("CV", iters = 3)
system.time(res <- resample(lrn, adult.task, rdesc))
We create a learner object called lrn that will use the randomForest learner. makeClassifTask is used to create a classification task from the dataframe containing the census data; our target variable is arbitrarily set to education. We then create a re-sampling strategy, using cross-validation (similar to the example referenced previously) with three iterations. We then call resample() to fit the models, passing it the learner object, the task and the re-sampling strategy, and wrap the call in system.time() to measure elapsed execution time.
Finally we run the model above.
The results are shown below as they appear in the R console in RStudio. Note that our three iterations took 353 seconds, or almost six minutes, to complete. Obviously more iterations or a larger dataset would increase the run-time of the model further.
[Resample] cross-validation iter: 1
[Resample] cross-validation iter: 2
[Resample] cross-validation iter: 3
[Resample] Result: mmce.test.mean=0.0157
   user  system elapsed
352.430   0.577 352.989
> res
Resample Result
Task: adult
Learner: classif.randomForest
mmce.aggr: 0.02
mmce.mean: 0.02
mmce.sd: 0.00
>
It’s pretty easy to see how a larger model, more iterations or a different choice of methods could result in unacceptably long run-times. We could use multi-core or socket-level parallelism, but ideally we’d like to take advantage of as much computing resource as we can put our hands on at run-time on a distributed cluster.
Fortunately, parallelMap is now directly integrated into mlr, and this makes scaling to parallel back-ends seamless. Our choice of back-end is parameterized, so we can write algorithms once and choose the parallel back-end depending on the resources we have available when we run the model. To illustrate this, we re-run the same model, but instead of running it on a single node, we run it on a clustered environment running OpenLava, an open-source, Platform LSF-compatible workload manager now supported by BatchJobs.
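Because the back-end choice is parameterized, the same resampling code can be pointed at different levels of parallelism just by changing the start call. A sketch of the options (the function names are real parallelMap entry points; the CPU counts are arbitrary):

```r
library("parallelMap")

# Option 1: multi-core parallelism on the local machine
# parallelStartMulticore(cpus = 4)

# Option 2: a socket cluster of local R processes
# parallelStartSocket(cpus = 4)

# Option 3: batch jobs submitted to a workload manager such as OpenLava
# parallelStartBatchJobs(storagedir = getwd())

# ... the mlr resample() call goes here, unchanged ...

# parallelStop()  # shut down whichever back-end was started
```

Only the start call changes between back-ends; the modeling code itself stays exactly the same.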
The R script is modified to take advantage of the OpenLava clustered environment, with the required changes highlighted in the code below. We include two additional libraries, parallelMap and BatchJobs. After setting our working directory, we set up the batch environment for the OpenLava job.
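We have not reproduced batch.tmpl here; it is a brew template that BatchJobs fills in each time it submits a job. A minimal sketch for an LSF-compatible scheduler like OpenLava might look like the following (the specific #BSUB options are assumptions; adjust the queue name and resources to your site):

```
## batch.tmpl -- brew template consumed by BatchJobs
#BSUB -J <%= job.name %>    ## job name
#BSUB -o <%= log.file %>    ## log file for job output
#BSUB -q normal             ## queue to submit to (an assumption)

## Run the R script that BatchJobs generated for this job
R CMD BATCH --no-save --no-restore "<%= rscript %>" /dev/stdout
```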
After this, parallelizing our job is literally as simple as calling parallelStartBatchJobs() before the rest of the code and parallelStop() at the end of the code fragment. The mlr library functions interact with parallelMap and BatchJobs to handle the rest, automatically recognizing opportunities for parallelism and splitting the execution into multiple batch jobs.
library("parallelMap")
library("BatchJobs")
library("mlr")
setwd("~/examples")
conf = BatchJobs:::getBatchJobsConf()
conf$cluster.functions = makeClusterFunctionsOpenLava("../batch.tmpl")
storagedir = getwd()
parallelStartBatchJobs(storagedir = storagedir)
lrn = makeLearner("classif.randomForest")
adult = read.table("data/adult.data", sep = ",", header = FALSE,
                   col.names = c("age", "type_employer", "fnlwgt", "education",
                                 "education_num", "marital", "occupation",
                                 "relationship", "race", "sex", "capital_gain",
                                 "capital_loss", "hr_per_week", "country", "income"),
                   fill = FALSE, strip.white = TRUE)
adult.task = makeClassifTask(data = adult, target = "education")
rdesc = makeResampleDesc("CV", iters = 3)
system.time(res <- resample(lrn, adult.task, rdesc))
parallelStop()
Running the parallel version of the script, we see the output below:

> source('~/examples/resampling-ml-with-parallelization.R')
Loading required package: BBmisc
Sourcing configuration file: '/home/gord/.BatchJobs.R'
Starting parallelization in mode=BatchJobs-OpenLava.
Storing objects in files for BatchJobs slave jobs: .mlr.slave.options
Mapping in parallel: mode = BatchJobs; cpus = NA; elements = 3.
SubmitJobs |++++++++++++++++++++++++++++++++++++++++++++| 100% (00:00:00)
Waiting [S:0 D:3 E:0 R:0] |+++++++++++++++++++++++++++++++++++++| 100% (00:00:00)
[Resample] Result: mmce.test.mean=0.00654
   user  system elapsed
  0.819   0.077 160.607
Stopped parallelization. All cleaned up.
> res
Resample Result
Task: adult
Learner: classif.randomForest
mmce.aggr: 0.01
mmce.mean: 0.01
mmce.sd: 0.00
>
This is a simple example, but readers will see the value. While the example above involved three jobs, we could just as easily handle 30 or 300 jobs with no additional complexity on the part of the R programmer.
Despite the overhead of scheduling the simulation as three discrete batch jobs, we reduced the execution time from 353 seconds to just 161 seconds – more than twice as fast. As we scale up, we can expect the relative overhead of the batch scheduler to diminish, allowing us to approach linear scaling as the number of jobs and cluster nodes increases. Stay tuned for more posts on this topic.
Also – notice the marked decrease in load on our system (shown as user and system time relative to elapsed time in the output above). Despite the elapsed time of the calculation, the actual user and system CPU time on our RStudio host is minimal. This is because the OpenLava scheduler dispatched the work to other hosts in the cluster, leaving us free to do additional work in RStudio.
Behind the scenes, we can monitor execution on the cluster using the OpenLava bjobs command to list the status of jobs in the workload manager. Documentation for OpenLava is available here.
[gord@master data]$ bjobs
JOBID   USER   STAT  QUEUE    FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
773     gord   RUN   normal   master      master      *65f0-1    Apr  1 08:28
774     gord   RUN   normal   master      master      *65f0-2    Apr  1 08:28
775     gord   RUN   normal   master      master      *65f0-3    Apr  1 08:28
For readers who want to try this example on their own, you can download a zip file containing copies of the scripts and the relevant configuration files for BatchJobs here.
We’ve skipped over the details of how to configure OpenLava, but you can download the free installation RPM along with installation and configuration directions from Teraproc, or download the OpenLava sources from GitHub and build it yourself. If you want to deploy an R cluster in the cloud on Amazon EC2, you can visit Teraproc.com and try the R cluster-as-a-service offering for free.
For many machine learning algorithms, long run-times on limited resources can get in the way of productivity. Parallelizing execution to an open-source back-end brings several benefits:
- We can reduce model run-time
- We can run additional, more sophisticated models
- We can improve reliability by letting long-running jobs run under the control of a workload manager
- We can avoid tying up our desktop, allowing us to do useful work while models run
- We can share clusters, on-premises or in the cloud, among multiple users or workloads to improve efficiency