Using CART for Stock Market Forecasting

February 28, 2014

(This article was first published on The R Trader » R, and kindly contributed to R-bloggers)

There is an enormous body of literature both academic and empirical about market forecasting. Most of the time it mixes two market features: Magnitude and Direction. In this article I want to focus on identifying the market direction only. The goal I set myself, is to identify market conditions when the odds are significantly biased toward an up or a down market. This post gives an example of how CART (Classification And Regression Trees) can be used in this context. Before I proceed the usual reminder: What I present in this post is just a toy example and not an invitation to invest. It’s not a finished strategy either but a research idea that needs to be further researched, developed and tailored to individual needs.

1 – What is CART and why using it?

From, CART are a set of techniques for classification and prediction. The technique is aimed at producing rules that predict the value of an outcome (target) variable from known values of predictor (explanatory) variables. There are many different implementations but they are all sharing a general characteristic and that’s what I’m interested in. From Wikipedia, “Algorithms for constructing decision trees usually work top-down, by choosing a variable at each step that best splits the set of items. Different algorithms use different metrics for measuring “best”. These generally measure the homogeneity of the target variable within the subsets. These metrics are applied to each candidate subset, and the resulting values are combined (e.g., averaged) to provide a measure of the quality of the split”.

CART methodology exhibits some characteristics that are very well suited for market analysis:

  • Non parametric: CART can handle any type of statistical distributions
  • Non linear: CART can handle a large spectrum of dependency between variables (e.g., not limited to linear relationships)
  • Robust to outliers

There are various R packages dealing with Recursive Partitioning, I use here rpart for trees estimation and rpart.plot for trees drawing.

2 – Data & Experiment Design

Daily OHLC prices for most liquid ETFs from January 2000 to December 2013 extracted from Google finance. The in sample period goes from January 2000 to December 2010;  the rest of the dataset is the out of sample period. Before running any type of analysis the dataset has to be prepared for the task.

The target variable is the ETF weekly forward return defined as a two states of the world  outcome (UP or DOWN). If weekly forward return > 0 then the market in the UP state, DOWN state otherwise

The explanatory variables are a set of technical indicators derived from the initial daily OHLC dataset. Each indicator represents a well-documented market behavior.  In order to reduce the noise in the data and to try to identify robust relationships, each independent variable is considered to have a binary outcome.

  • Volatility (VAR1): High volatility is usually associated with a down market and low volatility with an up market. Volatility is defined as the 20 days raw ATR (Average True Range) spread to its moving average (MA).  If raw ATR > MA then VAR1 = 1, else VAR1 = -1.
  • Short term momentum (VAR2): The equity market exhibits short term momentum behavior  captured here by a 5 days simple moving averages (SMA). If  Price > SMA  then VAR2 = 1 else VAR2 = -1
  • Long term momentum (VAR3): The equity market exhibits long term momentum behavior  captured here by a 50 days simple moving averages (LMA). If Price > LMA then VAR3 = 1 else VAR3  = -1
  • Short term reversal (VAR4): This is captured by the CRTDR which stands for Close Relative To Daily Range and calculated as following:  CRTDR = {Close - Low }/ {High - Low}. If CRTDR > 0.5, then VAR4 = 1 else VAR4 = -1
  • Autocorrelation regime (VAR5):  The equity market tends to go through periods of negative and positive autocorrelation regimes. If returns autocorrelation over the last 5 days  > 0 then VAR5 = 1 else VAR5 = -1

I put below a tree example with some explanations


In the tree above, the path to reach node #4 is: VAR3 >=0 (Long Term Momentum >= 0)  and  VAR4 >= 0 (CRTDR >= 0).  The red rectangle indicates this is a DOWN leaf (e.g., terminal node) with a probability of 58% (1 – 0.42). In market terms this means that if Long Term Momentum is Up and CRTDR is > 0.5 then the probability of a positive return next week is 42% based on the in sample sample data. 18% indicates the proportion of the data set that falls into that terminal node (e.g., leaf).

There are many ways to use the above approach, I chose to estimate and combine all possible trees. From the in sample data, I collect all leaves from all possible trees and I gather them into a matrix. This is the “rules matrix”  giving the probability of next week beeing UP or DOWN.

3 – Results

I apply the rules in the above matrix to the out of sample data  (Jan 2011 – Dec 2013) and I compare the results to the real outcome. The problem with this approach is that a single point (week) can fall into several rules and even belong to UP and DOWN rules simultaneously. Therefore I apply a voting scheme. For a given week I sum up all the rules that apply to that week giving a +1 for an UP rule and -1 for a DOWN rule. If the sum is greater than 0 the week is classified as UP, if the sum is negative it’s a DOWN week and if the sum is equal to 0 there will be no position taken that week (return = 0)

The above methodology is applied to a set of very liquid ETFs. I plot below the out of sample equity curves along with the buy and hold strategy over the same period.


4 – Conclusion

Initial results seem encouraging even if the quality of the outcome varies greatly by instrument. However there is a huge room for improvement. I put below some directions for further analysis

  • Path optimality: The algorithm used here for defining the trees is optimal at each split but it doesn’t guarantee the optimality of the path. Adding a metric to measure the optimality of the path would certainly improve the above results.
  • Other variables: I chose the explanatory variables solely based on experience. It’s very likely that this choice is neither good nor optimal.
  • Backtest methodology: I used a simple In and Out of sample methodology. In a more formal backtest I would rather use a rolling or expanding window of in and out sample sub-periods (e.g., walk forward analysis)

As usual, any comments welcome


To leave a comment for the author, please follow the link and comment on their blog: The R Trader » R. offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Comments are closed.


Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)