Making accurate predictions using the vast amount of data produced by the stock markets and the economy itself is difficult. In this post we will examine the performance of five different machine learning models and predict the future ten-year returns for the S&P 500 using state-of-the-art libraries such as caret, xgboostExplainer and patchwork. We will use data from Shiller, Goyal and BLS. The training data covers the years 1948 to 1991, and the test data runs from 1991 only until 2009, because the target variable is the return over the following ten years and is therefore not yet known for more recent observations.
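To make the target variable concrete, here is a minimal Python sketch of how a ten-year forward return can be computed from a series of yearly index levels. The function name `forward_cagr` and the price series are hypothetical and are not taken from the R code of the analysis; the point is only to show why the most recent ten years cannot have a target yet.

```python
def forward_cagr(prices, horizon=10):
    """Annualized return from each year to `horizon` years later.

    Returns None for years whose future price is not yet known.
    """
    out = []
    for t, start_price in enumerate(prices):
        if t + horizon < len(prices):
            growth = prices[t + horizon] / start_price
            out.append(growth ** (1 / horizon) - 1)
        else:
            out.append(None)  # the target lies in the future
    return out


# Hypothetical index levels growing 7% per year (not real S&P 500 data).
prices = [100 * 1.07 ** i for i in range(15)]
target = forward_cagr(prices, horizon=10)
known = [x for x in target if x is not None]
```

With fifteen years of prices only the first five years get a target, which mirrors why the test set has to end ten years before the last available observation.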
Different investing strategies tend to work at different times, and you should expect the accuracy of the model you are using to move in cycles; sometimes the connection with returns is very strong, and sometimes very weak. Value investing is a great example of a strategy that has not really worked for the past twelve years (source, pdf). Spurious correlations are another cause of trouble since, for example, two stocks might move in tandem purely by chance. This highlights the need to manually select features that make intuitive sense.
We will use eight different predictors: P/E, P/D, P/B, the CAPE ratio, total return CAPE, inflation, unemployment rate and the 10-year US government bond rate. All five valuation measures are calculated for the entire S&P 500. Let’s start by inspecting the correlation clusters of the different predictors and the future ten-year return (without dividends), which is used as the target.
The different valuation measures are strongly correlated with each other, as expected. All except P/B have a very strong negative correlation with the future 10-year returns. CAPE and total return CAPE, which is a new measure that also considers reinvested dividends, are very strongly correlated with each other. Total return CAPE is also slightly less correlated with the future ten-year return than the normal CAPE.
The machine learning models
First, we will create a naïve model which predicts the future return to be the same as the average return in the training set. After training the five models we will also make one ensemble model of them to see if it can reach a higher accuracy than any of the five models, which is usually the case.
The models we are going to use are quite different from each other. The glmnet model is just like the linear model, except that it shrinks the coefficients according to a penalty to avoid overfitting. It therefore has very low flexibility and also performs automated feature selection (except if the alpha hyperparameter is exactly zero, as in ridge regression). K-nearest-neighbors makes its predictions by comparing the observation to similar observations. MARS, on the other hand, takes into account nonlinearities in the data, and also considers interactions between the features. XGBoost is a tree model, which also takes into account both nonlinearities and interactions. It however improves each tree by building it based on the residuals of the previous tree (boosting), which may lead to better accuracies. Both MARS and SVM (support vector machines) are really flexible and therefore may overfit quite easily, especially if the data set is small. The XGBoost model is also quite flexible but does not overfit easily since it performs regularization and pruning.
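The idea of building each tree on the residuals of the previous ones can be illustrated with a deliberately stripped-down Python sketch. It uses one-split trees (stumps) on a single feature and squared error, and it leaves out the regularization, pruning and other refinements that XGBoost adds, so it should be read as an illustration of boosting in general and not of the XGBoost implementation. All names and the data are hypothetical.

```python
def fit_stump(x, residuals):
    """Find the split on x that minimizes the squared error of the residuals."""
    best = None
    for threshold in sorted(set(x))[:-1]:  # keep both sides non-empty
        left = [r for xi, r in zip(x, residuals) if xi <= threshold]
        right = [r for xi, r in zip(x, residuals) if xi > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    return best[1:]


def boost(x, y, n_trees=50, learning_rate=0.3):
    """Start from the mean and repeatedly fit a stump to the current residuals."""
    base = sum(y) / len(y)
    preds = [base] * len(y)
    stumps = []
    for _ in range(n_trees):
        residuals = [yi - pi for yi, pi in zip(y, preds)]
        stump = fit_stump(x, residuals)
        stumps.append(stump)
        preds = [predict_one(stump, xi, pi, learning_rate)
                 for xi, pi in zip(x, preds)]
    return base, stumps, preds


def predict_one(stump, xi, current, learning_rate):
    """Add one stump's (shrunken) correction to the current prediction."""
    threshold, left_mean, right_mean = stump
    leaf = left_mean if xi <= threshold else right_mean
    return current + learning_rate * leaf


def mse(actual, predicted):
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical data: a valuation ratio and the following ten-year return.
x = [10, 12, 15, 18, 22, 25, 30, 35]
y = [0.12, 0.11, 0.09, 0.08, 0.05, 0.04, 0.01, -0.01]

base, stumps, boosted = boost(x, y)
base_mse = mse(y, [base] * len(y))
boosted_mse = mse(y, boosted)
```

Each round only has to explain what the earlier rounds got wrong, which is why the training error of the boosted predictions ends up below that of the starting mean. It is also why a boosted model needs some form of regularization to avoid fitting noise.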
Finally, we have the ensemble model which simply gives the mean of the predictions of all the models. Ensemble models are a quite popular strategy in machine learning competitions to reach accuracies beyond the accuracy of any single model.
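Averaging the predictions is simple enough to show in a few lines of Python. The model names follow the post, but the numbers below are made up for illustration and are not results from the analysis.

```python
def ensemble_mean(predictions_by_model):
    """Average the models' predictions, observation by observation."""
    columns = zip(*predictions_by_model.values())
    return [sum(col) / len(col) for col in columns]


# Hypothetical predicted ten-year CAGRs for two observations.
predictions = {
    "glmnet": [0.06, 0.04],
    "knn": [0.05, 0.05],
    "mars": [0.08, -0.02],
    "svm": [0.07, 0.01],
    "xgboost": [0.04, 0.07],
}

ensemble = ensemble_mean(predictions)

# Leaving one model out shows how much a single outlier moves the average.
without_mars = ensemble_mean(
    {name: p for name, p in predictions.items() if name != "mars"}
)
```

Because every model gets the same weight, one model with an extreme prediction can pull the ensemble noticeably, which is worth keeping in mind when one of the models behaves poorly on the test set.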
The models will be built using the caret wrapper, and the optimal hyperparameters are chosen using time slicing, which is a cross validation technique that is suitable for time series. We will use five timeslices to capture as many periods as possible while still having enough observations in each of them. We will do the cross validation on the training data, which consists of 70 percent of the data, while keeping the remaining 30 percent as a test set. The results are shown below:
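As an aside before the results, the time slicing described above can be sketched in Python as follows. This is only an illustration of the idea that each validation block must come after its training block in time; the function `time_slices`, the expanding window and the window sizes are assumptions and do not reproduce the exact settings used with caret in the analysis.

```python
def time_slices(n_obs, n_slices, horizon):
    """Build expanding-window (train, validation) index pairs.

    Every validation block directly follows its training block in time,
    so the model is never validated on data older than its training data.
    """
    initial = n_obs - n_slices * horizon
    if initial <= 0:
        raise ValueError("not enough observations for the requested slices")
    slices = []
    for k in range(n_slices):
        train_end = initial + k * horizon
        train = list(range(0, train_end))
        validation = list(range(train_end, train_end + horizon))
        slices.append((train, validation))
    return slices


# Hypothetical setup: 44 yearly observations, five slices of four years each.
slices = time_slices(n_obs=44, n_slices=5, horizon=4)
```

Ordinary k-fold cross validation would shuffle observations from different periods together, which lets the model peek at the future. Keeping the validation block strictly after the training block avoids that.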
The predictions are less accurate after the red line, which separates the training and test sets. The model has not seen the data on the right side of the line, so its accuracy can be thought of as a proxy for how well the model would perform in the future.
We will examine the model accuracies on the test set by using two measures: mean absolute error (MAE) and R-squared (R²). The results are shown in the table below:
The two most flexible models, MARS and SVM, behave wildly on the test set and show signs of overfitting. Both of them have mean absolute errors that are about twice as high as that of the naïve model. Even though MARS has a high R-squared, its mean absolute error is high. This is why you cannot trust R-squared alone. Glmnet makes quite plausible predictions until the year 2009, where they go astray, most likely because of the rapid growth of the P/E ratio. K-nearest-neighbors has not reacted to the data too much but still achieves a quite low MAE. Out of the single models, XGBoost has performed the best. The ensemble model has however performed slightly better as measured by the MAE. It also seems to be the most stable model, which is expected since it combines the predictions of the other models.
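A small Python example shows how a model can have a high R-squared and a high MAE at the same time. It assumes that R-squared is computed as the squared correlation between the predictions and the actual values, which to my understanding is what caret reports by default; the numbers are made up and are not results from the analysis.

```python
from statistics import mean


def mae(actual, predicted):
    """Mean absolute error."""
    return mean(abs(a - p) for a, p in zip(actual, predicted))


def r_squared(actual, predicted):
    """Squared Pearson correlation between actual and predicted values."""
    mean_a, mean_p = mean(actual), mean(predicted)
    cov = sum((a - mean_a) * (p - mean_p) for a, p in zip(actual, predicted))
    var_a = sum((a - mean_a) ** 2 for a in actual)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov ** 2 / (var_a * var_p)


# Hypothetical ten-year returns and two sets of predictions.
actual = [0.02, 0.04, 0.06, 0.08]
shifted = [a - 0.10 for a in actual]  # follows every move, but 10 points too low
naive = [0.05] * len(actual)          # always predicts the average

shifted_r2 = r_squared(actual, shifted)
shifted_mae = mae(actual, shifted)
naive_mae = mae(actual, naive)
```

The shifted predictions track the actual returns perfectly, so their R-squared is essentially one, yet every prediction is off by ten percentage points and the naïve model beats them clearly on MAE. R-squared is not computed for the naïve model here, since a constant prediction has no variance and the correlation is undefined.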
Let’s then look at the feature importances. They are calculated in different ways for the different model types but should still be somewhat comparable. The plotting is done using the library patchwork, which allows plotting to be done by just adding the plots together using a plus sign.
Upon closer inspection of the feature importances, we see that the MARS model uses just the CAPE ratio as a feature, while the rest of the models use the features more evenly. Most of the models perform some sort of feature selection, which can also be seen from the plot.
Lastly, we will predict the next ten years in the stock market and compare the predictions of the different models. We will also look closer at the best performing single model, XGBoost, by inspecting the composition of the prediction. The current values of the features are mostly obtained from the sources listed in the first chapter, but also from Trading Economics and multpl.
10-year CAGR prediction
The MARS model is the most pessimistic, with a return prediction that is quite strongly negative. The model should however not be trusted too much since it uses only one variable and does not behave well on the test data. The XGBoost model is surprisingly optimistic, with a prediction of almost nine percent per year. The prediction of the ensemble model is quite low but would be three percentage points higher without the MARS model.
Let’s then look at the XGBoost model more closely by using the xgboostExplainer library. The resulting plot is a waterfall chart which shows the composition of a single prediction, in this case the predicted CAGR (plus one) for the next ten years. The high CAPE ratio reduces the predicted CAGR by seven percentage points, but the P/B ratio increases it by six percentage points. This is because the model contains interactions between the CAPE and P/B ratios. The effect of the interest rate level is slightly positive at two percentage points, but the currently high P/E ratio cancels this out and brings the prediction back to the same level. The rest of the features have a very small effect on the prediction.
The benefit of predicting the returns of a single stock market is mostly limited to the fact that you can adjust your expectations for the future. However, predicting the returns of multiple stock markets and investing in the ones with the highest return predictions is most likely a very profitable strategy. Klement (2012) has shown that the CAPE ratio alone does quite a good job of predicting the returns of different stock markets. Adding more sensible variables to the model is likely to make it more stable and perhaps better at predicting the outcome.
The R code used in the analysis can be found here.
To leave a comment for the author, please follow the link and comment on their blog: Data based investing.