*This article was first published on **Data based investing**, and kindly contributed to R-bloggers.*


### The machine learning models

The *glmnet* model is just like the linear model, except that it shrinks the coefficients according to a penalty to avoid overfitting. It therefore has very low flexibility, and it also performs automated feature selection (unless the *alpha* hyperparameter is exactly zero, as in ridge regression).
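
The effect of *alpha* is easy to see with the glmnet package. This is a minimal sketch on simulated data; the feature names are placeholders, not the article's actual data set:

```r
library(glmnet)

# Toy data standing in for the article's features; names are placeholders.
set.seed(1)
x <- matrix(rnorm(200 * 4), ncol = 4,
            dimnames = list(NULL, c("cape", "pb", "pe", "rate")))
y <- 0.5 * x[, "cape"] - 0.3 * x[, "pe"] + rnorm(200, sd = 0.1)

# alpha = 1 gives the lasso penalty, which can shrink coefficients exactly
# to zero (automated feature selection); alpha = 0 gives ridge regression,
# which only shrinks them towards zero.
fit <- cv.glmnet(x, y, alpha = 1)
coef(fit, s = "lambda.min")  # irrelevant features get zero coefficients
```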

*K-nearest-neighbors* makes its predictions by comparing the observation to similar observations.
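
The idea fits in a few lines of base R (a sketch only; the article itself fits KNN through caret): predict a new point as the average outcome of its k closest training observations.

```r
# k-nearest-neighbors regression in base R.
knn_predict <- function(X, y, x_new, k = 5) {
  # Euclidean distance from x_new to every training row
  d <- sqrt(rowSums(sweep(X, 2, x_new)^2))
  # Average the targets of the k closest observations
  mean(y[order(d)[1:k]])
}

set.seed(1)
X <- matrix(rnorm(100 * 2), ncol = 2)
y <- rowSums(X) + rnorm(100, sd = 0.1)
knn_predict(X, y, c(0, 0), k = 5)  # should be near zero for this data
```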

*MARS*, on the other hand, takes into account nonlinearities in the data and also considers interactions between the features.
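
In R, MARS is typically fitted with the earth package. A small sketch on simulated data (not the article's), showing how hinge functions capture nonlinearities and how `degree = 2` allows pairwise interactions:

```r
library(earth)

# Simulated data with a nonlinearity (hinge) and an interaction.
set.seed(1)
dat <- data.frame(x1 = runif(200), x2 = runif(200))
dat$y <- pmax(dat$x1 - 0.5, 0) + dat$x1 * dat$x2 + rnorm(200, sd = 0.05)

# degree = 2 lets MARS build products of two hinge functions,
# i.e. pairwise interactions between the features.
fit <- earth(y ~ x1 + x2, data = dat, degree = 2)
summary(fit)  # lists the selected hinge terms
```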

*XGBoost* is a tree model, which also takes both nonlinearities and interactions into account. However, it improves each tree by building it on the residuals of the previous tree (boosting), which may lead to better accuracy. Both MARS and *SVM* (support vector machines) are very flexible and may therefore overfit quite easily, especially if the data set is small. The XGBoost model is also quite flexible, but it does not overfit easily, since it performs regularization and pruning.
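
A sketch with the xgboost package (again on simulated data, not the article's), showing the regularization and pruning parameters mentioned above:

```r
library(xgboost)

set.seed(1)
X <- matrix(rnorm(200 * 4), ncol = 4)
y <- X[, 1]^2 + X[, 2] * X[, 3] + rnorm(200, sd = 0.1)

# Each boosting round fits a new tree to the residuals of the ensemble so
# far; lambda (L2 penalty) and gamma (minimum split gain, used to prune
# weak splits) limit overfitting.
fit <- xgboost(data = X, label = y, nrounds = 50, max_depth = 3,
               eta = 0.1, lambda = 1, gamma = 0.1, verbose = 0)
head(predict(fit, X))
```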

The models are trained using the *caret* wrapper, and the optimal hyperparameters are chosen using time slicing, a cross-validation technique suitable for time series. We will use five time slices to capture as many periods as possible while still having enough observations in each of them. The cross-validation is performed on the training set, which consists of 70 percent of the data; the remaining 30 percent is kept as a test set. The results are shown below:
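
The setup looks roughly like this in caret. The window sizes below are placeholders chosen to produce five slices on a hypothetical 140-observation training set, not the article's actual values:

```r
library(caret)

set.seed(1)
n_train <- 140  # hypothetical 70 % training portion, kept in time order
x_train <- data.frame(cape = rnorm(n_train), pb = rnorm(n_train))
y_train <- rnorm(n_train)

# Time slicing: train on a rolling window and evaluate on the observations
# immediately after it, so the model is never validated on past data.
ctrl <- trainControl(
  method        = "timeslice",
  initialWindow = 100,   # size of each training window
  horizon       = 8,     # size of each evaluation window
  fixedWindow   = TRUE,  # slide the window instead of growing it
  skip          = 7      # step between windows, so five slices are produced
)

fit <- train(x_train, y_train, method = "glmnet", trControl = ctrl)
```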

### Results


We will examine the model accuracies on the test set using two measures: mean absolute error (MAE) and R-squared (R²). The results are shown in the table below:

Model | MAE | R² |
---|---|---|
Naive model | 5.16 % | – |
Ensemble | 2.15 % | 48.2 % |
GLMNET | 3.00 % | 29.7 % |
KNN | 3.37 % | 10.6 % |
MARS | 10.70 % | 90.2 % |
SVM | 10.80 % | 13.1 % |
XGBoost | 2.17 % | 60.1 % |
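
For reference, both measures are easy to compute by hand. Here R-squared is taken as the squared correlation between actual and predicted values, which is caret's default definition (an assumption about how the article computed it), and the numbers are made up, not the article's data:

```r
# Mean absolute error and R-squared from actual vs. predicted returns.
mae       <- function(actual, pred) mean(abs(actual - pred))
r_squared <- function(actual, pred) cor(actual, pred)^2

# Tiny made-up example:
actual <- c(0.05, 0.07, 0.02, 0.10)
pred   <- c(0.06, 0.05, 0.03, 0.08)
mae(actual, pred)        # 0.015
r_squared(actual, pred)  # about 0.82
```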

*MARS* and *SVM* behave wildly on the test set and show signs of overfitting. Both have mean absolute errors about twice as high as that of the naïve model. Even though *MARS* has a high R-squared, its mean absolute error is high. This is why you cannot trust R-squared alone.

*Glmnet* has quite plausible predictions until the year 2009, most likely because of the rapid growth of the P/E ratio.

*K-nearest-neighbors* has not reacted to the data too much but still achieves quite a low MAE. Out of the single models, *XGBoost* has performed the best. The ensemble model, however, has performed slightly better as measured by the MAE. It also seems to be the most stable model, which is expected, since it combines the predictions of the other models.

The *MARS* model uses just the CAPE ratio as a feature, while the rest of the models use the features more evenly. Most of the models perform some sort of feature selection, which can also be seen from the plot.

### Future predictions

Next, we will look at the future predictions of the models, and more closely at *XGBoost*, by inspecting the composition of its prediction. The current values of the features are mostly obtained from the sources listed in the first chapter, but also from Trading Economics and multpl.

Model | 10-year CAGR prediction |
---|---|
Ensemble | 2.20 % |
GLMNET | 1.47 % |
KNN | 4.04 % |
MARS | -9.85 % |
SVM | 6.46 % |
XGBoost | 8.86 % |

The *MARS* model is the most pessimistic, with a return prediction that is quite strongly negative. The model should not be trusted too much, however, since it uses only one variable and does not behave well on the test data. The *XGBoost* model is surprisingly optimistic, with a prediction of almost nine percent per year. The prediction of the ensemble model is quite low but would be three percentage points higher without the *MARS* model.

Let’s then look at the *XGBoost* model more closely by using the xgboostExplainer library. The resulting plot is a waterfall chart which shows the composition of a single prediction, in this case the predicted CAGR (plus one) for the next ten years. The high CAPE ratio reduces the predicted CAGR by seven percentage points, but the P/B ratio increases it by six percentage points. This is because the model contains interactions between the CAPE and P/B ratios. The effect of the interest rate level is just a bit positive at two percentage points, but the currently high P/E ratio reduces it back to the same level. The rest of the features have a very small effect on the prediction.
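
The xgboostExplainer workflow looks roughly like this. This is a sketch with placeholders, assuming a fitted model `fit`, the `xgb.DMatrix` `dtrain` it was trained on, and a one-row matrix `X_new` of current feature values; argument details may differ between package versions:

```r
library(xgboost)
library(xgboostExplainer)  # available on GitHub, not CRAN

# Decompose the trees of the fitted model into per-feature contributions.
explainer <- buildExplainer(fit, dtrain, type = "regression")

# Waterfall chart splitting one prediction (here row 1 of X_new) into the
# contribution of each feature, as in the plot described above.
showWaterfall(fit, explainer, xgb.DMatrix(X_new), X_new, idx = 1,
              type = "regression")
```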

Be sure to follow me on Twitter for updates about new blog posts like this!
