By Gabriel Vasconcelos and Yuri Fonseca

We are happy to introduce our new machine learning method, Boosting Smooth Trees (BooST) (full article here). The model is joint work with professors Marcelo Medeiros and Álvaro Veiga. The BooST uses a different type of regression tree that allows us to estimate the derivatives of very general nonlinear models. In other words, the fitted model is differentiable and its derivatives have an analytical expression. As a consequence, we can estimate the partial effect of a characteristic on the response variable, which is much easier to interpret than traditional importance measures.

The idea behind the BooST is to replace traditional Classification and Regression Trees (CART), which are not differentiable, with smooth logistic trees. We show that, with this adaptation and under some assumptions, the BooST is a consistent estimator of the model’s derivatives.
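To make the contrast with CART concrete, here is a minimal sketch of a single smooth split in R. It is not the package's implementation: the function names, the parameters `gamma`, `c0`, `beta_left` and `beta_right`, and the exact logistic parameterization are assumptions chosen for illustration, and the paper may define them differently. The point is only that a logistic weight yields a derivative in closed form, whereas a hard split is piecewise constant.

```r
# Logistic transition: near 0 well below the split point c0, near 1 well above it.
# gamma controls the sharpness (a very large gamma approaches a hard CART split).
logistic <- function(x, gamma, c0) 1 / (1 + exp(-gamma * (x - c0)))

# Hypothetical one-split "smooth stump": a weighted average of two leaf values.
smooth_stump <- function(x, gamma, c0, beta_left, beta_right) {
  w <- logistic(x, gamma, c0)
  beta_left * (1 - w) + beta_right * w
}

# Closed-form derivative with respect to x, using d/dx logistic = gamma * w * (1 - w).
smooth_stump_deriv <- function(x, gamma, c0, beta_left, beta_right) {
  w <- logistic(x, gamma, c0)
  (beta_right - beta_left) * gamma * w * (1 - w)
}

# Hard CART-style stump for comparison: piecewise constant in x.
hard_stump <- function(x, c0, beta_left, beta_right) {
  ifelse(x > c0, beta_right, beta_left)
}

x <- seq(-2, 2, by = 0.5)
smooth_stump(x, gamma = 2, c0 = 0, beta_left = -1, beta_right = 1)
smooth_stump_deriv(x, gamma = 2, c0 = 0, beta_left = -1, beta_right = 1)
hard_stump(x, c0 = 0, beta_left = -1, beta_right = 1)
```

The hard stump has a zero derivative everywhere except at the jump, where it is undefined, so it carries no information about partial effects. The smooth stump's derivative is largest near `c0` and scales with the gap between the two leaf values.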

## Example

The example below shows that the BooST is very good at recovering the derivatives of nonlinear functions. The data was generated from the following model: $\displaystyle y_i = \cos( \pi [x_{i,1}+x_{i,2}])+ \varepsilon_i$

where $x_{i,1} \sim N(0,1)$, $x_{i,2}$ is a Bernoulli with $p = 0.5$ and $\varepsilon_i \sim N(0,1)$. Note that this is not an easy problem. The function that generates the data is non-monotonic, nonlinear and very smooth. The figure below shows how the BooST estimates the model and its derivatives with respect to $x_{i,1}$. We simulated 1000 data points and forced the data to have an $R^2$ of 0.5 (half of the variation in the data is noise).

## Real data example

To give a more practical example, suppose we want to estimate the effects of price changes on sales without any prior knowledge of the demand function. It is natural to think that these effects will depend on the current price level and possibly on some characteristics of the product. The BooST will estimate the partial effects of price on sales conditional on the current (or any) price and any other controls we choose. We will write a post specifically on this example in the future.

Consider yet another example, where we want to estimate how the price of a house changes as we move along the latitude or longitude. This is precisely what we did in the figure below using data from house sales in Melbourne (dataset scraped from the web by Tony Pino). The figure shows that if we are south of the CBD and move north, prices increase a lot. However, if we keep moving north, prices start to decrease as we move away from the CBD.

## Other considerations

The BooST is also very good for forecasting. We will come back to this topic with examples in future posts. Just keep in mind that using a very large number of characteristics may improve forecasting accuracy, but you will probably lose some interpretability of the partial effects. This is not because of the model itself but because of some theoretical problems you may run into.

For now we have an implementation of the BooST in R, which is fully documented and straightforward to use. It can be downloaded and installed from:

library(devtools)
install_github("gabrielrvsc/BooST")


Note that this implementation is not very fast and it is not adequate for very big problems with too many variables. We already have a faster Julia implementation that will be published soon. The even faster C++ version will take some time, but it is also coming. In the next posts we will explore some empirical applications, with code to replicate our examples.
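In the meantime, the simulated data from the Example section can be generated with a short R sketch. The seed and the way the noise is rescaled to reach an $R^2$ of about 0.5 are assumptions made here for illustration, since the post does not say how the $R^2$ was enforced. The sketch stops before fitting anything, so it does not rely on any function from the BooST package.

```r
set.seed(1)  # arbitrary seed, chosen only for reproducibility
n <- 1000

x1 <- rnorm(n)                          # x_{i,1} ~ N(0, 1)
x2 <- rbinom(n, size = 1, prob = 0.5)   # x_{i,2} ~ Bernoulli(0.5)

# Noise-free part of the model: cos(pi * (x1 + x2))
signal <- cos(pi * (x1 + x2))

# Assumption: rescale the noise to the same sample standard deviation as the
# signal, so that roughly half of the variation in y is noise (R2 near 0.5).
eps <- rnorm(n)
eps <- eps * sd(signal) / sd(eps)
y <- signal + eps

# True partial derivative of the regression function with respect to x1,
# the target that an estimated derivative should be compared against.
true_deriv <- -pi * sin(pi * (x1 + x2))

# R2 of the true regression function on this sample
r2 <- 1 - sum((y - signal)^2) / sum((y - mean(y))^2)
r2
```

Plotting `true_deriv` against `x1`, separately for each value of `x2`, gives the benchmark curve for the derivative estimates.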

## References

Fonseca, Y.; Medeiros, M.; Vasconcelos, G.; Veiga, A. “BooST: Boosting Smooth Trees for Partial Effect Estimation in Nonlinear Regressions.” arXiv preprint, available at https://arxiv.org/abs/1808.03698 (2018).