By Gabriel Vasconcelos
In this post I am going to discuss some features of Regression Trees an Random Forests. Regression Trees are know to be very unstable, in other words, a small change in your data may drastically change your model. The Random Forest uses this instability as an advantage through bagging (you can see details about bagging here) resulting on a very stable model.
The first question is how a Regression Tree works. Suppose, fore example, that we have the number of points scored by a set of basketball players and we want to relate it to the player’s weight an height. The Regression Tree will simply split the height-weight space and assign a number of points to each partition. The figure below shows two different representations for a small tree. In the left we have the tree itself and in the right how the space is partitioned (the blue line shows the first partition and the red lines the following partitions). The numbers in the end of the tree (and in the partitions) represent the value of the response variable. Therefore, if a basketball player is higher than 1.85 meters and weights more than 100kg it is expected to score 27 points (I invented this data =] ).
You might be asking how I chose the partitions. In general, in each node the partition is chosen through a simple optimization problem to find the best pair variable-observation based on how much the new partition reduces the model error.
What I want to illustrate here is how unstable a Regression Tree can be. The package
tree has some examples that I will follow here with some small modifications. The example uses computer CPUs data and the objective is to build a model for the CPU performance based on some characteristics. The data has 209 CPU observations that will be used to estimate two Regression Trees. Each tree will be estimate from a random re-sample with replacement. Since the data comes from the same place, it would be desirable to have similar results on both models.
library(ggplot2) library(reshape2) library(tree) library(gridExtra) data(cpus, package = "MASS") # = Load Data # = First Tree set.seed(1) # = Seed for Replication tree1 = tree(log(perf) ~ syct + mmin + mmax + cach + chmin + chmax, data = cpus[sample(1:209, 209, replace = TRUE), ]) plot(tree1); text(tree1)
# = Second Tree set.seed(10) tree2 = tree(log(perf) ~ syct + mmin + mmax + cach + chmin + chmax, data = cpus[sample(1:209,209, replace=TRUE), ]) plot(tree2); text(tree2)
As you can see, the two trees are different from the start. We can use some figures to verify. First let us calculate the predictions of each model in the real data (not the re-sample). The first figure is a scatterplot of both predictions and the second figure shows their boxplots. Although the scatterplot shows some relation between the two predictions, it is far from good.
# = Calculate predictions pred = data.frame(p1 = predict(tree1, cpus) ,p2 = predict(tree2, cpus)) # = Plots g1 = ggplot(data = pred) + geom_point(aes(p1, p2)) g2 = ggplot(data = melt(pred)) + geom_boxplot(aes(variable, value)) grid.arrange(g1, g2, ncol = 2)
As mentioned before, the Random Forest solves the instability problem using bagging. We simply estimate the desired Regression Tree on many bootstrap samples (re-sample the data many times with replacement and re-estimate the model) and make the final prediction as the average of the predictions across the trees. There is one small (but important) detail to add. The Random Forest adds a new source of instability to the individual trees. Every time we calculate a new optimal variable-observation point to split the tree, we do not use all variables. Instead, we randomly select 2/3 of the variables. This will make the individual trees even more unstable, but as I mentioned here, bagging benefits from instability.
The question now is: how much improvement do we get from the Random Forest. The following example is a good illustration. I broke the CPUs data into a training sample (first 150 observations) and a test sample (remaining observations) and estimated a Regression Tree and a Random Forest. The performance is compared using the mean squared error.
library(randomForest) # = Regression Tree tree_fs = tree(log(perf) ~ syct + mmin + mmax + cach + chmin + chmax, data = cpus[1:150, ]) # = Random Forest set.seed(1) # = Seed for replication rf = randomForest(log(perf) ~ syct + mmin + mmax + cach + chmin + chmax, data=cpus[1:150, ], nodesize = 10, importance = TRUE) # = Calculate MSE mse_tree = mean((predict(tree_fs, cpus[-c(1:150), ]) - log(cpus$perf)[-c(1:150)])^2) mse_rf = mean((predict(rf, cpus[-c(1:150), ]) - log(cpus$perf[-c(1:150)]))^2) c(rf = mse_rf, tree = mse_tree) ## rf tree ## 0.2884766 0.5660053
As you can see, the Regression Tree has an error twice as big as the Random Forest. The only problem is that by using a combination of trees any kind of interpretation becomes really hard. Fortunately, there are importance measures that allow us to at least know which variables are more relevant in the Random Forest. In our case, both importance measures pointed to the cache size as the most important variable.
importance(rf) ## %IncMSE IncNodePurity ## syct 22.60512 22.373601 ## mmin 19.46153 21.965340 ## mmax 24.84038 27.239772 ## cach 27.92483 33.536185 ## chmin 13.77196 13.352793 ## chmax 17.61297 8.379306
Finally, we can see how the model error decreases as we increase the number of trees in the Random Forest with the following code:
If you liked this post, you can find more details on Regression Trees and Random forest in the book Elements of Statistical learning, which can be downloaded direct from the authors page here.