# Using CART for Stock Market Forecasting

This article was first published on **The R Trader » R**, and kindly contributed to R-bloggers.

There is an enormous body of literature, both academic and empirical, about market forecasting. Most of the time it mixes two market features: magnitude and direction. In this article I want to focus on identifying the market direction only. The goal I set myself is to identify market conditions when the odds are significantly biased toward an up or a down market. This post gives an example of how CART (Classification And Regression Trees) can be used in this context. Before I proceed, the usual reminder: what I present in this post is just a toy example and not an invitation to invest. It’s not a finished strategy either, but a research idea that needs to be further researched, developed and tailored to individual needs.

**1 – What is CART and why use it?**

According to statistics.com, CART is a set of techniques for classification and prediction. The technique is aimed at producing rules that predict the value of an outcome (target) variable from known values of predictor (explanatory) variables. There are many different implementations, but they all share a general characteristic, and that’s what I’m interested in. From Wikipedia: “Algorithms for constructing decision trees usually work top-down, by choosing a variable at each step that best splits the set of items. Different algorithms use different metrics for measuring ‘best’. These generally measure the homogeneity of the target variable within the subsets. These metrics are applied to each candidate subset, and the resulting values are combined (e.g., averaged) to provide a measure of the quality of the split.”
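To make the idea of a "measure of the quality of the split" concrete, here is a minimal sketch in plain Python (not the R code behind this post). It assumes Gini impurity as the homogeneity metric, which is one common choice among several, and it combines the two subsets with a size-weighted average. The function names and the toy labels are hypothetical.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels (0 means perfectly homogeneous)."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def split_quality(left, right):
    """Size-weighted average impurity of the two subsets produced by a split."""
    n = len(left) + len(right)
    return (len(left) * gini(left) + len(right) * gini(right)) / n

# Hypothetical toy labels: a split that separates the two classes fairly well
left = ["UP", "UP", "UP", "DOWN"]
right = ["DOWN", "DOWN", "DOWN", "UP"]
parent = left + right

print(gini(parent))                # impurity before the split
print(split_quality(left, right))  # lower value = more homogeneous subsets
```

A tree-building algorithm of this kind would evaluate every candidate split this way and keep the one with the lowest combined impurity.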

CART methodology exhibits some characteristics that are very well suited for market analysis:

- *Non parametric*: CART can handle any type of statistical distribution
- *Non linear*: CART can handle a large spectrum of dependency between variables (e.g., not limited to linear relationships)
- *Robust to outliers*

There are various R packages dealing with recursive partitioning. Here I use **rpart** for tree estimation and **rpart.plot** for tree drawing.

**2 – Data & Experiment Design**

The data consists of daily OHLC prices for the most liquid ETFs from January 2000 to December 2013, extracted from Google Finance. The in sample period goes from January 2000 to December 2010; the rest of the dataset is the out of sample period. Before running any type of analysis, the dataset has to be prepared for the task.

The target variable is the ETF weekly forward return, defined as a two-state outcome (UP or DOWN). If the weekly forward return > 0 then the market is in the UP state; otherwise it is in the DOWN state.
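The labeling rule can be sketched in a few lines of plain Python (an illustration, not the original R code). It assumes the forward return is measured from one weekly close to the next, which the post does not specify; the prices and function names are hypothetical.

```python
def forward_returns(weekly_closes):
    """Return earned from each week's close to the following week's close."""
    return [nxt / cur - 1.0 for cur, nxt in zip(weekly_closes, weekly_closes[1:])]

def label_states(returns):
    """UP if the forward return is strictly positive, DOWN otherwise."""
    return ["UP" if r > 0 else "DOWN" for r in returns]

weekly_closes = [100.0, 102.0, 101.0, 101.0, 104.0]  # hypothetical prices
fwd = forward_returns(weekly_closes)
states = label_states(fwd)
print(states)
```

Note that a flat week (return exactly 0) falls into the DOWN state under this rule, and that the last week has no label because its forward return is not yet known.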

The explanatory variables are a set of technical indicators derived from the initial daily OHLC dataset. Each indicator represents a well-documented market behavior. In order to reduce the noise in the data and to try to identify robust relationships, each independent variable is considered to have a binary outcome.

- *Volatility (VAR1)*: High volatility is usually associated with a down market and low volatility with an up market. Volatility is defined as the spread between the 20-day raw ATR (Average True Range) and its moving average (MA). If raw ATR > MA then VAR1 = 1, else VAR1 = -1.
- *Short term momentum (VAR2)*: The equity market exhibits short term momentum behavior, captured here by a 5-day simple moving average (SMA). If Price > SMA then VAR2 = 1, else VAR2 = -1.
- *Long term momentum (VAR3)*: The equity market exhibits long term momentum behavior, captured here by a 50-day simple moving average (LMA). If Price > LMA then VAR3 = 1, else VAR3 = -1.
- *Short term reversal (VAR4)*
- *Autocorrelation regime (VAR5)*: The equity market tends to go through periods of negative and positive autocorrelation regimes. If returns autocorrelation over the last 5 days > 0 then VAR5 = 1, else VAR5 = -1.
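As an illustration of the binary encoding, the two momentum indicators can be sketched in plain Python (not the original R code). The sketch assumes the moving average includes the latest close, which the post does not state; the price series and function names are hypothetical.

```python
def sma(values, window):
    """Simple moving average of the last `window` values."""
    if len(values) < window:
        raise ValueError("not enough observations for the requested window")
    return sum(values[-window:]) / window

def momentum_flag(closes, window):
    """+1 if the latest close is above its moving average, -1 otherwise."""
    return 1 if closes[-1] > sma(closes, window) else -1

closes = [float(p) for p in range(100, 160)]  # hypothetical, steadily rising prices
var2 = momentum_flag(closes, 5)    # short term momentum (5-day SMA)
var3 = momentum_flag(closes, 50)   # long term momentum (50-day SMA)
print(var2, var3)
```

The other indicators follow the same pattern: compute a raw quantity, compare it with a threshold, and keep only the sign.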

I put below a tree example with some explanations.

In the tree above, the path to reach node #4 is: VAR3 >= 0 (Long Term Momentum >= 0) and VAR4 >= 0 (CRTDR > 0.5). The red rectangle indicates this is a DOWN leaf (i.e., terminal node) with a probability of 58% (1 – 0.42). In market terms this means that if Long Term Momentum is Up and CRTDR is > 0.5, then the probability of a positive return next week is 42% based on the in sample data. The 18% indicates the proportion of the data set that falls into that terminal node (i.e., leaf).

There are many ways to use the above approach; I chose to estimate and combine all possible trees. From the in sample data, I collect all leaves from all possible trees and gather them into a matrix. This is the “rules matrix”, giving the probability of next week being UP or DOWN.
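One way to picture a rules matrix is the following plain Python sketch. It is not the rpart-based procedure used in this post: as a stand-in for collecting tree leaves, it simply enumerates every conjunction of indicator values and scores each one on the in sample data. The `min_count` filter, the dictionary layout and the toy observations are all assumptions made for illustration.

```python
from itertools import combinations, product

def build_rules(rows, labels, min_count=2):
    """Enumerate conjunctions of indicator values and score each in sample.

    rows   : list of dicts mapping indicator name -> +1 or -1
    labels : list of "UP"/"DOWN" outcomes aligned with rows
    Returns a list of rule dicts (conditions, P(UP), coverage, direction).
    """
    names = sorted(rows[0])
    rules = []
    for k in range(1, len(names) + 1):
        for subset in combinations(names, k):
            for values in product((1, -1), repeat=k):
                cond = dict(zip(subset, values))
                hits = [lab for row, lab in zip(rows, labels)
                        if all(row[n] == v for n, v in cond.items())]
                if len(hits) < min_count:
                    continue  # too few observations to estimate a probability
                p_up = hits.count("UP") / len(hits)
                if p_up == 0.5:
                    continue  # no directional bias, so no rule
                rules.append({
                    "conditions": cond,
                    "p_up": p_up,
                    "coverage": len(hits) / len(rows),
                    "direction": "UP" if p_up > 0.5 else "DOWN",
                })
    return rules

# Hypothetical in sample observations with two indicators
rows = [
    {"VAR3": 1, "VAR4": 1}, {"VAR3": 1, "VAR4": 1}, {"VAR3": 1, "VAR4": 1},
    {"VAR3": 1, "VAR4": -1}, {"VAR3": 1, "VAR4": -1},
    {"VAR3": -1, "VAR4": 1}, {"VAR3": -1, "VAR4": -1}, {"VAR3": -1, "VAR4": -1},
]
labels = ["DOWN", "DOWN", "UP", "UP", "UP", "DOWN", "DOWN", "UP"]

rules = build_rules(rows, labels)
for rule in rules:
    print(rule)
```

Each row of the output plays the role of one leaf: a set of conditions, the in sample probability of an UP week, and the share of the data it covers.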

**3 – Results**

I apply the rules in the above matrix to the out of sample data (Jan 2011 – Dec 2013) and I compare the results to the real outcome. The problem with this approach is that a single point (week) can fall into several rules and even belong to UP and DOWN rules simultaneously. Therefore I apply a **voting scheme**. For a given week I sum up all the rules that apply to that week, giving +1 for an UP rule and -1 for a DOWN rule. If the sum is greater than 0 the week is classified as UP, if the sum is negative it’s a DOWN week, and if the sum is equal to 0 no position is taken that week (return = 0).
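The voting scheme is simple enough to sketch directly in plain Python (an illustration, not the original R code). The `"FLAT"` label for a tie is a hypothetical name for the "no position" case.

```python
def vote(active_rule_directions):
    """Sum +1 per UP rule and -1 per DOWN rule, then map the total to a call."""
    score = sum(1 if d == "UP" else -1 for d in active_rule_directions)
    if score > 0:
        return "UP"
    if score < 0:
        return "DOWN"
    return "FLAT"  # tie: no position, so the week's return is 0

print(vote(["UP", "UP", "DOWN"]))  # two UP rules outvote one DOWN rule
print(vote(["UP", "DOWN"]))        # tie
print(vote(["DOWN"]))              # single DOWN rule
```

A week that matches no rule at all also sums to 0 and is therefore treated as a tie in this sketch.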

The above methodology is applied to a set of very liquid ETFs. I plot below the out of sample equity curves along with the buy and hold strategy over the same period.
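For readers who want to reproduce this kind of comparison, here is a minimal plain Python sketch of how an equity curve can be built from weekly signals. It rests on assumptions the post does not spell out: a DOWN week is traded as a short position, returns are compounded, and costs are ignored. The returns and positions are hypothetical numbers, not results.

```python
def equity_curve(weekly_returns, positions=None):
    """Compound weekly returns into an equity curve starting at 1.0.

    positions: +1 (long), -1 (short) or 0 (flat) per week; None means buy and hold.
    """
    if positions is None:
        positions = [1] * len(weekly_returns)
    equity, curve = 1.0, []
    for r, pos in zip(weekly_returns, positions):
        equity *= 1.0 + pos * r
        curve.append(equity)
    return curve

weekly_returns = [0.02, -0.01, 0.03, -0.02]  # hypothetical market returns
positions = [1, -1, 0, -1]                   # hypothetical signals from the votes

buy_and_hold = equity_curve(weekly_returns)
strategy = equity_curve(weekly_returns, positions)
print(buy_and_hold)
print(strategy)
```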

**4 – Conclusion**

Initial results seem encouraging even if the quality of the outcome varies greatly by instrument. However, there is huge room for improvement. I put below some directions for further analysis:

- **Path optimality**: The algorithm used here for defining the trees is optimal at each split but it doesn’t guarantee the optimality of the path. Adding a metric to measure the optimality of the path would certainly improve the above results.
- **Other variables**: I chose the explanatory variables solely based on experience. It’s very likely that this choice is neither good nor optimal.
- **Backtest methodology**: I used a simple In and Out of sample methodology. In a more formal backtest I would rather use a rolling or expanding window of in and out sample sub-periods (e.g., walk forward analysis).
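The rolling and expanding window idea from the last point can be sketched as follows in plain Python. This is only an illustration of the splitting logic, with hypothetical window sizes and function name; it is not part of the analysis above.

```python
def walk_forward_splits(n_obs, train_size, test_size, expanding=False):
    """Yield (train_indices, test_indices) pairs that move forward through time."""
    start = 0
    while start + train_size + test_size <= n_obs:
        train_start = 0 if expanding else start  # expanding keeps all history
        train_end = start + train_size
        yield (range(train_start, train_end),
               range(train_end, train_end + test_size))
        start += test_size  # step forward by one out of sample block

for train, test in walk_forward_splits(n_obs=10, train_size=4, test_size=2):
    print(list(train), list(test))
```

With `expanding=True` the in sample window always starts at the first observation, so each re-estimation uses all the history available up to that point.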

As usual, any comments welcome
