Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

by Ilya Kipnis

In this post, I will demonstrate how to obtain, stitch together, and clean data for backtesting using futures data from Quandl. Quandl was previously introduced in the Revolutions Blog.  Functions I will be using can be found in my IK Trading package available on my github page.

With backtesting, it’s often times easy to get data for equities and ETFs. However, ETFs are fairly recent financial instruments, making it difficult to conduct long-running backtests (most of the ETFs in inception before 2003 are equity ETFs), and with equities, they are all correlated in some way, shape, or form to their respective index (S&P 500, Russell, etc.), and their correlations generally go to 1 right as you want to be diversified.

An excellent source of diversification is the futures markets, which contain contracts on instruments ranging as far and wide as metals, forex, energies, and more. Unfortunately, futures are not continuous in nature, and data for futures are harder to find.

Thanks to Quandl, however, there is some freely available futures data. The link can be found here

The way Quandl structures its futures is that it uses two separate time series: the first is the front month, which is the contract nearest expiry, and the second is the back month, which is the next contract. Quandl’s rolling algorithm can be found here.

In short, Quandl rolls in a very simple manner; however, it is also incorrect, for all practical purposes. The reason being is that no practical trader holds a contract to expiry. Instead, they roll said contracts sometime before the expiry of the front month, based on some metric.

This algorithm uses the open interest cross to roll from front to back month and then lags that by a day (since open interest is observed at the end of trading days), and then “rolls” back when the front month open interest overtakes back month open interest (in reality, this is the back month contract becoming the new front month contract). Furthermore, the algorithm does absolutely no adjusting to contract prices. That is, if the front month is more expensive than the back month, a long position would lose the roll premium and a short position would gain it. This is in order to prevent the introduction of a dominating trend bias. The reason that the open interest is chosen is displayed in the following graph:

This is the graph of the open interest of the front month of oil in 2000 (black time series), and the open interest of the back month contract in red. They cross under and over in repeatable fashion, making a good choice on when to roll the contract.
Let’s look at the code:

quandClean <- function(stemCode, start_date=NULL, verbose=FALSE, ...) {

The arguments to the function are a stem code, a start date, end date, and two print arguments (for debugging purposes). The stem code takes the form of CHRIS/<>_<>, such as “CHRIS/CME_CL” for oil.

Require(Quandl)
if(is.null(start_date)) {start_date <- Sys.Date()-365*1000}
if(is.null(end_date)) {end_date <- Sys.Date()+365*1000}
frontCode <- paste0(stemCode, 1)
backCode <- paste0(stemCode, 2)
front <- Quandl(frontCode, type="xts", start_date=start_date, end_date=end_date, ...)
interestColname <- colnames(front)[grep(pattern="Interest", colnames(front))]
front <- front[,c("Open","High","Low","Settle","Volume",interestColname)]
colnames(front) <- c("O","H","L","C","V","OI")
back <- Quandl(backCode, type="xts", start_date=start_date, end_date=end_date, ...)
back <- back[,c("Open","High","Low","Settle","Volume",interestColname)]
colnames(back) <- c("BO","BH","BL","BS","BV","BI") #B for Back

#combine front and back for comparison
both <- cbind(front,back)

This code simply fetches both futures contracts from Quandl and combines them into one xts. Although Quandl takes a type argument, I have programmed this function specifically for xts types of objects, since I will use xts-dependent functionality later.
Let's move along.

#impute NAs in open interest with -1
both$BI[is.na(both$BI)] <- -1
both$OI[is.na(both$OI)] <- -1
both$lagBI <- lag(both$BI)
both$lagOI <- lag(both$OI)

#impute bad back month open-interest prints —
#if it is truly a low quantity, it won't make a
#difference in the computation.

both$OI[both$OI==-1] <- both$lagOI[both$OI==-1]
both$BI[both$BI==-1] <- both$lagBI[both$BI==-1]

This is the first instance of countermeasures in the function taken to counteract messy data. This imputes any open interest NAs with the value -1, and then imputing the first NA after a non NA day with the previous day's open interest. Usually, days on which open interest is not available are days after which the contract is lightly traded, so the values that will be imputed in cases during which the contract was not traded will be negligible. However, imputing an NA value with a zero during the midst of heavy trading has the potential to display the wrong contract as the one with the higher open interest.

both$OIdiff <- both$OI - both$BI both$tracker <- NA

#the formal open interest cross from front to back
both$tracker[both$OIdiff < 0] <- -1
both$tracker <- lag(both$tracker)

#since we have to observe OI cross, we roll next day
#any time we're not on the back contract, we're on the front contract
both$tracker[both$OIdiff > 0] <- 1
both$tracker <- na.locf(both$tracker)

This code sets up the system for keeping track of which contract is in use. When the difference in open interest crosses under zero, that's the formal open interest cross, and we roll a day later. On the other hand, when the open interest difference crosses back over zero, that isn't a cross. That is the back month contract becoming the front month contract. For instance, assume that you rolled to the June contract in the third week of May. Quandl would display the June contract as the back contract in May, but come June, that June contract is now the front contract instead. So therefore, there is no lag on the computation in the second instance.

frontRelevant <- both[both$tracker==1, c(1:6)] backRelevant <- both[both$tracker==-1, c(7:12)]
colnames(frontRelevant) <- colnames(backRelevant) <- c("Open","High","Low","Close","Volume","OpenInterest")
relevant <- rbind(frontRelevant, backRelevant)
relevant[relevant==0] <- NA

# remove any incomplete days, print a message saying
# how many removed days
# print them if desired
instrument <- gsub("CHRIS/", "", stemCode)
relevant$Open[is.na(relevant$Open)] <- relevant$Close[(which(is.na(relevant$Open))-1)]
NAs <- which(is.na(relevant$Open) | is.na(relevant$High) | is.na(relevant$Low) | is.na(relevant$Close))
if(verbose) {
if(verbose) { message(paste(instrument, “had”, length(NAs), “incomplete days removed from data.”)) }

print(relevant[NAs,])
}
if(length(NAs) > 0) {
relevant <- relevant[-NAs,]
}

Using the previous tracker variable, the code is then able to compile the relevant data for the futures contract. That is, front contract when the front contract is more heavily traded, and vice versa.

This code uses xts-dependent functionality with the rbind call. In this instance, there are two separate streams: the front month stream, and the back month stream. Through the use of xts functionality, it's possible to merge the two streams indexed by time.

Next, the code imputes all NA open values with the close (settle) from the previous trading day. In the case that opens are the only missing field, I opted for this over removing the observation entirely. Next, any observation with a missing open, high, low, or close value gets removed. This is simply my personal preference, rather than attempting to take some form of liberty with imputing data to the highs, lows, and closes based on the previous day, or some other pattern thereof.

If verbose is enabled, the function will print the actual data removed.

ATR <- ATR(HLC=HLC(relevant))

#Technically somewhat cheating, but could be stated in terms of
#lag 2, 1,and 0.

#A spike is defined as a data point on Close that's more than
#5 ATRs away from both
the preceding and following day.
spikes <- which(abs((relevant$Close-lag(relevant$Close))/ATR$atr) > 5 & abs((relevant$Close-lag(relevant$Close, -1))/ATR$atr) > 5)
if(verbose) {
message(paste(instrument, “had”, length(spikes),”spike days removed from data.”))
print(relevant[spikes,])
}

if(length(spikes) > 0){
relevant <- relevant[-spikes,]
}
out <- relevant
return(out)
}

Finally, some countermeasures against spiky types of data. I define a spike as a price move in the closing price which is 5 ATRs (in this case, n=14) away in either direction from both the previous and next day. Spikes are removed. After this, the code is complete.

To put this into perspective visually, here is a plot of the 30-day Federal Funds rate (CHRIS/CME_FF), from 2008, demonstrating all the improvements my process makes to Quandl’s raw data in comparison to the front month continuous (current) contract.

The raw, front-month data is displayed in black (the long lines are missing data from quandl, displayed as zeroes, but modified in scale for the sake of the plot). The results of the algorithm are presented in blue.

At the very beginning, it’s apparent that the more intelligent rolling algorithm adapts to what would be the new contract prices sooner. Secondly, all of those long bars on which Quandl had missing data have been removed so as not to interfere with calculations. Lastly, at the very end, that downward “spike” in prices has also been dealt with, making for what appears to be a significantly more correct pricing series.

To summarize, here's what the code does:

2) Keeps track of the proper contract at all time periods
3) Imputes or removes bad data, bad data being defined as incomplete observations or spikes in the data.

The result is an xts object practically identical to one downloaded with more common to find data, such as equities or ETFs, which allows for a greater array of diversification in terms of the instruments on which to backtest trading strategies, such as with the quantstrat package.

The results of such backtests can be found on my blog, and my two R packages (this functionality will be available in my IKTrading package) can be found on my Github page.