**Freakonometrics » R-english**, and kindly contributed to R-bloggers)

Yesterday, I did upload a post where I tried to show that “standard” regression models where not performing bad. At least if you include splines (multivariate splines) to take into accound joint effects, and nonlinearities. So far, I do not discuss the possible high number of features (but with boostrap procedures, it is possible to assess something related to variable importance, that people from machine learning like).

But my post was not complete: I was simply plotting the prediction obtained by some model. And it “looked like” the regression was nice, but so were the random forrest, the -nearest neighbour and boosting algorithm. What if we compare those models on new data?

Here is the code to create all the models (I did include another one, some kind of benchmark, where no covariates are included), based on 1,000 simulated values

> n <- 1000 > set.seed(1) > rtf <- function(a1, a2) { sin(a1+a2)/(a1+a2) } > df <- data.frame(x1=(runif(n, min=1, max=6)), + x2=(runif(n, min=1, max=6))) > df$m <- rtf(df$x1, df$x2) > df$y <- df$m+rnorm(n,sd=.1) > model_cste <- lm(y~1,data=df) > p_cste <- function(x1,x2) predict(model_cste,newdata=data.frame(x1=x1,x2=x2)) > model_lm <- lm(y~x1+x2,data=df) > p_lm <- function(x1,x2) predict(model_lm,newdata=data.frame(x1=x1,x2=x2)) > library(mgcv) > model_bs <- gam(y~s(x1,x2),data=df) > p_bs <- function(x1,x2) predict(model_bs,newdata=data.frame(x1=x1,x2=x2)) > library(rpart) > model_cart <- rpart(y~x1+x2,data=df,method="anova") > p_cart <- function(x1,x2) predict(model_cart,newdata=data.frame(x1=x1,x2=x2),type="vector") > library(randomForest) > model_rf <- randomForest(y~x1+x2,data=df) > p_rf <- function(x1,x2) as.numeric(predict(model_rf,newdata= + data.frame(x1=x1,x2=x2),type="response")) > k <- 10 > p_knn <- function(x1,x2){ + d <- (df$x1-x1)^2+(df$x2-x2)^2 + return(mean(df$y[which(rank(d)<=k)])) + } > library(dismo) > model_gbm <- gbm.step(data=df, gbm.x = 1:2, gbm.y = 4, + family = "gaussian", tree.complexity = 5, + learning.rate = 0.01, bag.fraction = 0.5) GBM STEP - version 2.9 Performing cross-validation optimisation of a boosted regression tree model for y and using a family of gaussian Using 1000 observations and 2 predictors creating 10 initial models of 50 trees folds are unstratified total mean deviance = 0.0242 tolerance is fixed at 0 ntrees resid. dev. 50 0.0195 now adding trees... 100 0.017 150 0.0154 200 0.0145 250 0.0139

(etc)

1650 0.0123 fitting final gbm model with a fixed number of 1150 trees for y mean total deviance = 0.024 mean residual deviance = 0.009 estimated cv deviance = 0.012 ; se = 0.001 training data correlation = 0.804 cv correlation = 0.705 ; se = 0.013 elapsed time - 0.11 minutes > p_boost <- function(x1,x2) predict(model_gbm,newdata=data.frame(x1=x1,x2=x2),n.trees=1200)

To test those models on new data (that is the goal of predictive model actually, being able to build up a generalized model, that performs well on new data), generate another sample

> n <- 500 > df_new <- data.frame(x1=(runif(n, min=1, max=6)), x2=(runif(n, min=1, max=6))) > df_new$m <- rtf(df_new$x1, df_new$x2) > df_new$y <- df_new$m+rnorm(n,sd=.1)

And then compare the observed values with the predicted ones. For instance on a graph

> output_model <- function(p=Vectorize(p_knn)){ + plot(df_new$y,p(df_new$x1,df_new$x2),ylim=c(-.45,.45),xlim=c(-.45,.45),xlab="Observed",ylab="Predicted") + abline(a=0,b=1,lty=2,col="grey") + }

For the linear model, we get

> output_model(Vectorize(p_lm))

For the k-nearest neighbour, we get

> output_model(Vectorize(p_knn))

With our boosted model, we get

> output_model(Vectorize(p_boost))

And finally, with our bivariate splines, we get

> output_model(Vectorize(p_bs))

It is also possible to consider some distance, e.g. the standard distance,

> sum_error_2 <- function(name_model){ + sum( (df_new$y - Vectorize(get(name_model))(df_new$x1,df_new$x2))^2 ) + }

Here, we enter the name of the prediction function (not the R object, we’ll see soon why) as the parameter of our function. In order to have valid conclusion, why not geneate hundreds of new samples, and compure the distance on the error.

> L2 <- NULL > for(s in 1:100){ + n <- 500 + df_new <- data.frame(x1=(runif(n, min=1, max=6)), x2=(runif(n, min=1, max=6))) + df_new$m <- rtf(df_new$x1, df_new$x2) + df_new$y <- df_new$m+rnorm(n,sd=.1) + list_models <- c("p_cste","p_lm","p_bs","p_cart","p_rf","p_knn","p_boost") + L2_error <- sapply(list_models,sum_error_2) + L2 <- rbind(L2,L2_error) + }

To compare our predictors, use

> colnames(L2) <- substr(colnames(L2),3,nchar(colnames(L2))) > boxplot(L2)

Our linear regression model is not performing well (**lm**). Clearly. But we’ve seen that already yesterday. And our bivariate spline model is (**bs**). Actually, it is even performing better that all other models considered here (**rf**, **knn** and even **boost**).

boxplot(L2,ylim=c(4.5,6.2))

There were a lot of discussion following my previous post, on commentaries, as well as on Twitter. What I’ve seen is that there should be some kind of trade-off: interpretability (econometric models) against precision (machine learning). It is clearly not that simple. A simple regression model with splines can perform better than any machine learning algorithm, from what we’ve seen here.

**leave a comment**for the author, please follow the link and comment on their blog:

**Freakonometrics » R-english**.

R-bloggers.com offers

**daily e-mail updates**about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...