The case for data snooping

October 25, 2013
By

(This article was first published on Shifting sands, and kindly contributed to R-bloggers)

When we are backtesting automated trading systems, accidental data snooping or look forward errors are an easy mistake to make. The nature of the error in this context is making our predictions using the data we are trying to predict. Typically, it comes from a mistake with our calculations of time offsets somewhere.

However, it can be a useful tool. If we give our system perfect forward knowledge:

1) We establish an upper bound for performance.
2) We can get a quick read if something is worth pursuing further, and
3) It can help highlight other coding errors.

The first two are pretty closely related. If our wonderful model is built using the values it is trying to predict, and still performs no better than random guessing, it’s probably not worth the effort trying to salvage it.

The flip side is when it performs well, that will be as good as it will ever get.

There are two main ways it can help identifying errors. Firstly, if our subsequent testing on non-snooped data provides comparable performance, we probably have another look ahead bug lurking somewhere.


Secondly, things like having amazing accuracy yet still performing poorly is another sign of a bug lurking somewhere.

Example

I wanted to compare SVM models when trained with actual prices vs a series of log returns, using the rolling model code I put up earlier. As a baseline, I also added in a 200 day simple moving average model.

(S) Indicates snooped data

A few things strike me about this.

For the SMA system, peeking ahead by a day only provides a small increase in accuracy. Given the longer-term nature of the 200 day SMA this is probably to be expected.

For the SVM trained systems, the results are somewhat contradictory.

For the look forward models, training on price data had much lower accuracy than the log returns, and the log return model performed much better. Note that both could have achieved 100% accuracy by predicting its first column of training data.

However, when not snooping, the models trained on closing prices did much better than those trained on returns. I’m not 100% sure there isn’t still some bug lurking somewhere, but hey if the code was off it would’ve shown up in the forward tested results no?

Feel free to take a look and have a play around with the code, which is up here.

To leave a comment for the author, please follow the link and comment on his blog: Shifting sands.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...



If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Comments are closed.