Futures price prediction using the order book data

[This article was first published on Commodity Stat Arb, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

It has been a couple of months since my last post; busy with lots of projects.

I had some fun playing around with data from Interactive Brokers API.  It turns out that it is relatively easily to get hold of the raw market data relating to both trades and order book changes for CME/NYMEX commodity futures.  For the purposes of this analysis I focused on the order book itself and what if anything one can imply forwards in time based on it.

All of the analysis was done in R.  I have been experimenting with R for a year or so and have found it to have many advantages for statistical analyses.  I put all the code at the bottom of the posting for those of you interested in doing something similar.

The raw data needs some manipulation to make it useful.  Each record is a single “order book event” as I think of it.  It could be a new order at a new price or a change to the number of buys or sells in the existing order book.  Only the first 10 prices are supplied.  As a result there are over 500,000 records for a single day’s 7 hours of trading during the pit hours (when most of the trade volume occurs).

For example, reconstructing a time series of the order book itself we see that as of
2012-02-17 13:52:55.302678
 time the order book consisted of these bids:
Price
2.689
2.690
2.691
2.692
2.694
2.695
2.685
2.686
2.687
2.688
quantity
18
37
44
34
10 
1
16 
6
16

And these offers:
Price
2.701
2.702
2.703
2.704
2.705
2.696
2.697
2.698
2.699
2.700
quantity
168 
11 
65  
10  
79 
21 
11 
12

So a rudimentary analysis of the data might focus on the imbalance between the sum of buys and the sum of sells.  In the above example there are a total of 190 bids showing and 391 offers showing.  Surely a difference of this magnitude would be meaningful?

For this sort of time series analysis we turn to the Vector Autoregression [VAR] models (as described in previous postings).  We have two time series, each potentially dependent on itself and the other.

The first step is to discretize the data into 1 second interval data since the raw data is event driven and not summarized.  The coding steps for this are outlined in R below.  Also we focus initially on the price implied by the average of the best bid and offer.  This is because trades may not occur contemporaneously with order book changes.  Therefore the best price indicator we have at any point in time is based on the best bid and offer.

The sample data we are using is from the 2nd hour of trading from 10am to 11am on Friday February 17th 2012.

Estimation results for equation P_chg: 
====================================== 
P_chg = P_chg.l1 + imbalance.l1 + P_chg.l2 + imbalance.l2 + P_chg.l3 + imbalance.l3 + P_chg.l4 + imbalance.l4 + P_chg.l5 + imbalance.l5 + const 
 
               Estimate Std. Error t value Pr(>|t|)    
P_chg.l1     -2.592e-04  1.675e-02  -0.015   0.9877    
imbalance.l1  7.496e-07  2.681e-07   2.796   0.0052 ** 
P_chg.l2      7.043e-02  1.676e-02   4.202 2.71e-05 ***
imbalance.l2  1.594e-07  3.571e-07   0.447   0.6553    
P_chg.l3      2.088e-02  1.679e-02   1.244   0.2138    
imbalance.l3 -6.813e-07  3.572e-07  -1.907   0.0566 .  
P_chg.l4     -6.725e-03  1.675e-02  -0.401   0.6881    
imbalance.l4  3.668e-07  3.536e-07   1.037   0.2997    
P_chg.l5     -6.367e-03  1.666e-02  -0.382   0.7023    
imbalance.l5 -5.575e-07  2.625e-07  -2.124   0.0337 *  
const         5.517e-06  9.708e-06   0.568   0.5698    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 
 
 
Residual standard error: 0.0005032 on 3584 degrees of freedom
Multiple R-Squared: 0.01049,   Adjusted R-squared: 0.007733 
F-statistic: 3.801 on 10 and 3584 DF,  p-value: 4.028e-05 

Looking at the results we see that the model is significant.  This means that we have a model that can explain the change in price at time t based on information available at time t-1.  Specifically it indicates that the price change at time t can be forecast with the price changes and order imbalance data at times t-1, t-2 and t-3.


Unfortunately when we look at the fitted values, we see that the model isn’t giving significant enough forecasts of price.  Fitted values range from -6.543e-04 to 6.409e-04.  I.e.: +/- 0.6 of a tick [see also chart 1].  Given that trading costs alone are approximately 1 tick we haven’t created a forecast that we can actually make any money from.  However it is interesting to get the intuitive validation that prices have a momentum component on a short term basis, and that changes in the order book (increasing number of buys vs. sells) is also a positive predictor for price.
More explanatory variables is one solution to trying to get more explanatory power. 
The data is distributed randomly in time; maybe the frequency of data or intensity of the order book event process is relevant [see chart 2].  Since intensity is highly skewed by adding 1 and taking the log of the data we get a more useful distribution for statistical fitting purposes.


Also, (borrowing from Hasbrouck’s Empirical Market Microstructure), it is also worth considering the serial covariance of prices.  Deviations of this from normal can be a signal of private information entering the market.  Chart 3 shows the rolling serial covariance of price changes.  Significantly positive periods may be predictive of market moving activity.


Rerunning the same analysis with these two additional variables we get the results:

stimation results for equation P_chg: 
====================================== 
P_chg = P_chg.l1 + order_chg.l1 + log_intensity.l1 + serial_cov.l1 + P_chg.l2 + order_chg.l2 + log_intensity.l2 + serial_cov.l2 + P_chg.l3 + order_chg.l3 + log_intensity.l3 + serial_cov.l3 + P_chg.l4 + order_chg.l4 + log_intensity.l4 + serial_cov.l4 + const 
 
                   Estimate Std. Error t value Pr(>|t|)    
P_chg.l1          3.031e-04  1.678e-02   0.018  0.98559    
order_chg.l1      7.210e-07  2.685e-07   2.686  0.00727 ** 
log_intensity.l1  2.574e-06  1.396e-05   0.184  0.85376    
serial_cov.l1     4.551e-01  1.147e+00   0.397  0.69145    
P_chg.l2          7.220e-02  1.678e-02   4.304 1.72e-05 ***
order_chg.l2      1.500e-07  3.576e-07   0.419  0.67503    
log_intensity.l2 -1.737e-05  1.446e-05  -1.202  0.22961    
serial_cov.l2    -5.224e-01  1.650e+00  -0.317  0.75153    
P_chg.l3          2.208e-02  1.678e-02   1.316  0.18837    
order_chg.l3     -6.861e-07  3.541e-07  -1.938  0.05274 .  
log_intensity.l3  8.787e-06  1.445e-05   0.608  0.54320    
serial_cov.l3     2.059e+00  1.649e+00   1.248  0.21204    
P_chg.l4         -9.888e-03  1.669e-02  -0.593  0.55352    
order_chg.l4     -1.387e-07  2.630e-07  -0.528  0.59786    
log_intensity.l4  1.606e-06  1.395e-05   0.115  0.90837    
serial_cov.l4    -1.964e+00  1.146e+00  -1.714  0.08663 .  
const             1.152e-05  2.271e-05   0.507  0.61204    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 
 
 
Residual standard error: 0.0005036 on 3579 degrees of freedom
Multiple R-Squared: 0.01048,   Adjusted R-squared: 0.006059 
F-statistic:  2.37 on 16 and 3579 DF,  p-value: 0.001621 


Unfortunately the range of fitted values is still about 0.6 of a tick: -6.601e-04 to 6.419e-04 and the p value is not as compelling as before.  We have added 2 variables and the model is no better a predictor than it was before.

So in summary, price movements can be predicted using a vector autoregression model, looking at lagged price change, order imbalance, message intensity and serial correlation.  However the strength of the price signal is small relative to transaction costs.

Many thanks to Bernhard Pfaff for his R package ‘vars’ as well as the companion book Analysis of Integrated and Cointegrated Time Series with R, and Joel Hasbrouck for his book Empirical Market Microstructure.


I can be reached at [email protected] 
___________________________________________________
For those interested in the R code see below.  I am aware there are some loops and other coding blunders but it does work:

Download data from IB:
library(IBrokers)
tws <- twsConnect(clientId = 1)
twsPrompt<-twsFuture(symbol="NG","NYMEX","20120227")
reqMktData(tws, twsList, CALLBACK = NULL, file = “data.dat”)

Load and analyze data:
setwd(‘C:/Users/Lloyd/Documents/trading/analysis/2012/intraday/’)
colType<-c("Date","POSIXct","integer","integer","integer","integer","integer","integer","numeric","integer","character")
depthData<-read.table("depthCL.dat",header=F, row.names = NULL, sep = " ", fill=TRUE) #,colClasses=colType

options(digits.secs=6)  # option for formatting dates with millisecond timestamps
cleanDates<-strptime(paste(depthData[,1],depthData[,2]),"%Y%m%d %H:%M:%OS")
cleanData<-array(0,c(length(cleanDates),8))
for (col in 3:10) cleanData[,(col-2)]<-as.numeric(as.vector(depthData[,col]))

# remove bad data
keep<-which(!is.na(cleanDates))
cleanData<-cleanData[keep,]
cleanDates<-cleanDates[keep]

keep<-which(!is.na(cleanData[,5]))
cleanData<-cleanData[keep,]
cleanDates<-cleanDates[keep]

bidPrices<-rep(0,10)
bidSizes<-rep(0,10)
bidAge<-rep(0,10)
offerPrices<-rep(0,10)
offerSizes<-rep(0,10)
offerAge<-rep(0,10)

dataRows<-length(cleanData[,1])
bidTotal<-rep(0,dataRows)                
offerTotal<-rep(0,dataRows)                
midPrice<-rep(0,dataRows)
fwdIndex<-rep(0,dataRows)

# loop through the data, recreating the working order book at each instant in time.
for (rowNum in 1:dataRows) {
  if ((cleanData[rowNum,6]==1)&(cleanData[rowNum,7]>0)) {
    P<-cleanData[rowNum,7]
    Q<-cleanData[rowNum,8]
    if (P %in% bidPrices) {
      bidSizes[which(bidPrices==P)]<-Q
      bidAge[which(bidPrices==P)]<-1
      bidAge[which(bidPrices!=P)]<-bidAge[which(bidPrices!=P)]+1
    } else {
      if (P>max(bidPrices)) {
        delete<-which.min(bidPrices)
        bidPrices[delete]<-P
        bidSizes[delete]<-Q
       
      } else if (P
        delete<-which.max(bidPrices)
        bidPrices[delete]<-P
        bidSizes[delete]<-Q
      }
    }
  } else if ((cleanData[rowNum,6]==0)&(cleanData[rowNum,7]>0)) {
    P<-cleanData[rowNum,7]
    Q<-cleanData[rowNum,8]
    if (P %in% offerPrices) {
      offerSizes[which(offerPrices==P)]<-Q
      offerAge[which(offerPrices==P)]<-1
      offerAge[which(offerPrices!=P)]<-offerAge[which(offerPrices!=P)]+1
    } else {
      if (P>max(offerPrices)) {
        delete<-which.min(offerPrices)
        offerPrices[delete]<-P
        offerSizes[delete]<-Q
       
      } else if (P
        delete<-which.max(offerPrices)
        offerPrices[delete]<-P
        offerSizes[delete]<-Q
      }
    }
  }
  bidTotal[rowNum]<-sum(bidSizes)
  offerTotal[rowNum]<-sum(offerSizes)
  if ((min(offerPrices)>0)&(max(bidPrices)>0)) midPrice[rowNum]<-(min(offerPrices)+max(bidPrices))/2
}

# create second indices for dates with correct start of day
ints<-seq(cleanDates[36],cleanDates[length(cleanDates)],"sec")
secondIndex<-findInterval(as.numeric(ints),as.numeric(cleanDates))
intensity<-diff(secondIndex)+1
imbalance<-bidTotal[secondIndex]-offerTotal[secondIndex]
priceChange<-diff(midPrice[secondIndex])

priceDiffs<-diff(midPrice[secondIndex])
results=NULL
library(vars)

# first look at order imbalance

alldat<-as.matrix(cbind(priceDiffs,imbalance[2:(length(imbalance))]))[921:4520,]
colnames(alldat)<-c("P_chg","imbalance")
varest<-VAR(alldat,p=1,type="const", lag.max=20, ic="AIC")
summary(varest)
summary(varest$varresult$P_chg$fitted.values)

serialCov<-NULL
for (ii in (921-60):(4520-60)) serialCov<-rbind(serialCov,cov(priceDiffs[(ii-1):(ii+59)],priceDiffs[(ii):(ii+60)]))
plot(serialCov)
length(serialCov)

alldat<-(cbind(priceDiffs,imbalance[2:(length(imbalance))],log(intensity, base = 10)[2:(length(imbalance))])[921:4520,])
alldat<-cbind(alldat[,1:3],serialCov*1000)  # 1000 times to avoid singularity
colnames(alldat)<-c("P_chg","order_chg","log_intensity","serial_cov")

varest<-VAR(alldat,type="const", lag.max=10, ic="AIC")
summary(varest)
summary(varest$varresult$P_chg$fitted.values)






To leave a comment for the author, please follow the link and comment on their blog: Commodity Stat Arb.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)