# Futures price prediction using the order book data

March 5, 2012

(This article was first published on Commodity Stat Arb, and kindly contributed to R-bloggers)

It has been a couple of months since my last post; busy with lots of projects.

I had some fun playing around with data from the Interactive Brokers API.  It turns out that it is relatively easy to get hold of the raw market data relating to both trades and order book changes for CME/NYMEX commodity futures.  For the purposes of this analysis I focused on the order book itself and what, if anything, it implies about prices forwards in time.

All of the analysis was done in R.  I have been experimenting with R for a year or so and have found it to have many advantages for statistical analyses.  I put all the code at the bottom of the posting for those of you interested in doing something similar.

The raw data needs some manipulation to make it useful.  Each record is a single “order book event,” as I think of it: either a new order at a new price, or a change to the number of buys or sells at a price already in the book.  Only the first 10 price levels on each side are supplied.  Even so, there are over 500,000 records for a single day’s 7 hours of pit-session trading (when most of the trade volume occurs).

For example, reconstructing a time series of the order book itself, as of 2012-02-17 13:52:55.302678 the book consisted of these bids:

 Price     2.685 2.686 2.687 2.688 2.689 2.690 2.691 2.692 2.694 2.695
 Quantity     16     8     6    16    18    37    44    34    10     1

And these offers:

 Price     2.696 2.697 2.698 2.699 2.700 2.701 2.702 2.703 2.704 2.705
 Quantity      6    79    21    11    12   168    11    65     8    10

So a rudimentary analysis of the data might focus on the imbalance between the total size bid and the total size offered.  In the example above there are 190 contracts bid and 391 offered.  Surely a difference of this magnitude should be meaningful?
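As a sketch of the calculation, using the snapshot above (the full book-reconstruction code is at the bottom of the post):

```r
# sizes at the ten visible bid and offer levels from the snapshot above
bidSizes   <- c(16, 8, 6, 16, 18, 37, 44, 34, 10, 1)
offerSizes <- c(6, 79, 21, 11, 12, 168, 11, 65, 8, 10)
imbalance  <- sum(bidSizes) - sum(offerSizes)   # 190 - 391 = -201
```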

For this sort of time series analysis we turn to vector autoregression (VAR) models (described in previous postings).  We have two time series, each potentially dependent on its own lags and on the other’s.

The first step is to discretize the data into one-second intervals, since the raw data is event driven rather than regularly sampled.  The coding steps for this are outlined in the R code below.  We also focus initially on the mid-price implied by the average of the best bid and offer.  This is because trades may not occur contemporaneously with order book changes; the best price indicator we have at any point in time is therefore the bid/offer midpoint.
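A minimal sketch of the discretization step, using hypothetical event times and mid-prices (the full code below does the same with findInterval on POSIX timestamps):

```r
# hypothetical event timestamps (in seconds) and the mid-price after each event
eventTimes <- c(0.2, 0.9, 1.1, 2.5, 2.6, 4.8)
midPrice   <- c(2.6945, 2.6950, 2.6955, 2.6950, 2.6945, 2.6960)

# one-second grid; for each grid point, the index of the last event at or before it
grid        <- 1:5
secondIndex <- findInterval(grid, eventTimes)
midBySecond <- midPrice[secondIndex]   # last observed mid-price each second
priceDiffs  <- diff(midBySecond)       # one-second mid-price changes
```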

The sample data is the second hour of trading, 10am to 11am, on Friday February 17th 2012.

Estimation results for equation P_chg:
======================================
P_chg = P_chg.l1 + imbalance.l1 + P_chg.l2 + imbalance.l2 + P_chg.l3 + imbalance.l3 + P_chg.l4 + imbalance.l4 + P_chg.l5 + imbalance.l5 + const

               Estimate Std. Error t value Pr(>|t|)
P_chg.l1     -2.592e-04  1.675e-02  -0.015   0.9877
imbalance.l1  7.496e-07  2.681e-07   2.796   0.0052 **
P_chg.l2      7.043e-02  1.676e-02   4.202 2.71e-05 ***
imbalance.l2  1.594e-07  3.571e-07   0.447   0.6553
P_chg.l3      2.088e-02  1.679e-02   1.244   0.2138
imbalance.l3 -6.813e-07  3.572e-07  -1.907   0.0566 .
P_chg.l4     -6.725e-03  1.675e-02  -0.401   0.6881
imbalance.l4  3.668e-07  3.536e-07   1.037   0.2997
P_chg.l5     -6.367e-03  1.666e-02  -0.382   0.7023
imbalance.l5 -5.575e-07  2.625e-07  -2.124   0.0337 *
const         5.517e-06  9.708e-06   0.568   0.5698
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1


Residual standard error: 0.0005032 on 3584 degrees of freedom
Multiple R-Squared: 0.01049,   Adjusted R-squared: 0.007733
F-statistic: 3.801 on 10 and 3584 DF,  p-value: 4.028e-05

Looking at the results we see that the model is statistically significant (F-test p-value of about 4e-05): we can explain some of the change in price at time t using information available before t.  Specifically, the price change at time t is forecast by the order imbalance at t-1 and the price change at t-2, with weaker effects at longer lags.

Unfortunately, when we look at the fitted values, the forecasts are too small to be useful.  They range from -6.543e-04 to 6.409e-04, i.e. roughly +/- 0.6 of a tick [see also chart 1].  Given that trading costs alone are approximately one tick, this is not a forecast we can actually make money from.  It is still interesting to get the intuitive validation that prices have a short-term momentum component, and that a shift in the order book towards more bids than offers is a positive predictor for price.

Adding explanatory variables is one way to try to get more explanatory power.  The order book events are distributed irregularly in time, so the frequency of data, i.e. the intensity of the order book event process, may itself be relevant [see chart 2].  Since the intensity is highly skewed, adding 1 and taking the log gives a more useful distribution for statistical fitting purposes.
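Sketched in R, with a hypothetical secondIndex (the event index at each one-second grid point, as produced by the code below):

```r
# hypothetical event indices at consecutive one-second grid points
secondIndex <- c(2, 3, 5, 5, 6, 9)

# events per one-second bucket; add 1 so empty seconds give log10(1) = 0
# rather than -Inf, then take logs to tame the skew
intensity     <- diff(secondIndex) + 1
log_intensity <- log(intensity, base = 10)
```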

Also, borrowing from Hasbrouck’s Empirical Market Microstructure, it is worth considering the serial covariance of price changes.  Deviations from its normal level can be a signal of private information entering the market.  Chart 3 shows the rolling serial covariance of price changes; significantly positive periods may be predictive of market-moving activity.
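A sketch of the rolling serial covariance over a 60-second window, on toy price changes (the full code at the bottom does the same with a loop over the real series):

```r
set.seed(1)
priceDiffs <- rnorm(300, sd = 5e-4)   # toy one-second mid-price changes
window <- 60

# covariance of each 60-observation window against itself shifted by one second
serialCov <- sapply(window:(length(priceDiffs) - 1), function(ii)
  cov(priceDiffs[(ii - window + 1):ii], priceDiffs[(ii - window + 2):(ii + 1)]))
```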

Rerunning the same analysis with these two additional variables gives the following results:

Estimation results for equation P_chg:
======================================
P_chg = P_chg.l1 + order_chg.l1 + log_intensity.l1 + serial_cov.l1 + P_chg.l2 + order_chg.l2 + log_intensity.l2 + serial_cov.l2 + P_chg.l3 + order_chg.l3 + log_intensity.l3 + serial_cov.l3 + P_chg.l4 + order_chg.l4 + log_intensity.l4 + serial_cov.l4 + const

                   Estimate Std. Error t value Pr(>|t|)
P_chg.l1          3.031e-04  1.678e-02   0.018  0.98559
order_chg.l1      7.210e-07  2.685e-07   2.686  0.00727 **
log_intensity.l1  2.574e-06  1.396e-05   0.184  0.85376
serial_cov.l1     4.551e-01  1.147e+00   0.397  0.69145
P_chg.l2          7.220e-02  1.678e-02   4.304 1.72e-05 ***
order_chg.l2      1.500e-07  3.576e-07   0.419  0.67503
log_intensity.l2 -1.737e-05  1.446e-05  -1.202  0.22961
serial_cov.l2    -5.224e-01  1.650e+00  -0.317  0.75153
P_chg.l3          2.208e-02  1.678e-02   1.316  0.18837
order_chg.l3     -6.861e-07  3.541e-07  -1.938  0.05274 .
log_intensity.l3  8.787e-06  1.445e-05   0.608  0.54320
serial_cov.l3     2.059e+00  1.649e+00   1.248  0.21204
P_chg.l4         -9.888e-03  1.669e-02  -0.593  0.55352
order_chg.l4     -1.387e-07  2.630e-07  -0.528  0.59786
log_intensity.l4  1.606e-06  1.395e-05   0.115  0.90837
serial_cov.l4    -1.964e+00  1.146e+00  -1.714  0.08663 .
const             1.152e-05  2.271e-05   0.507  0.61204
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1


Residual standard error: 0.0005036 on 3579 degrees of freedom
Multiple R-Squared: 0.01048,   Adjusted R-squared: 0.006059
F-statistic:  2.37 on 16 and 3579 DF,  p-value: 0.001621

Unfortunately the range of fitted values is still about 0.6 of a tick (-6.601e-04 to 6.419e-04), and the p-value is less compelling than before.  We have added two variables and the model is no better a predictor than it was.

So, in summary: price movements can be predicted using a vector autoregression on lagged price changes, order imbalance, message intensity and serial covariance.  However, the strength of the price signal is small relative to transaction costs.

Many thanks to Bernhard Pfaff for his R package ‘vars’ as well as the companion book Analysis of Integrated and Cointegrated Time Series with R, and Joel Hasbrouck for his book Empirical Market Microstructure.

I can be reached at [email protected]
___________________________________________________
For those interested, the R code is below.  I am aware there are some loops and other coding inelegancies, but it does work:

library(IBrokers)
tws <- twsConnect(clientId = 1)
twsPrompt<-twsFuture(symbol="NG","NYMEX","20120227")
# capture the order book (market depth) events to a file
reqMktDepth(tws, twsPrompt, CALLBACK = NULL, file = "data.dat")

colType<-c("Date","POSIXct","integer","integer","integer","integer","integer","integer","numeric","integer","character")
# read the captured records back in (adjust sep to match the captured file)
depthData<-read.csv("data.dat", colClasses=colType, header=FALSE)

options(digits.secs=6)  # option for formatting dates with millisecond timestamps
cleanDates<-strptime(paste(depthData[,1],depthData[,2]),"%Y%m%d %H:%M:%OS")
cleanData<-array(0,c(length(cleanDates),8))
for (col in 3:10) cleanData[,(col-2)]<-as.numeric(as.vector(depthData[,col]))

# drop records with unparseable timestamps
keep<-which(!is.na(cleanDates))
cleanData<-cleanData[keep,]
cleanDates<-cleanDates[keep]

# drop records with missing data
keep<-which(!is.na(cleanData[,5]))
cleanData<-cleanData[keep,]
cleanDates<-cleanDates[keep]

# state for the ten-level bid and offer books
bidPrices<-rep(0,10)
bidSizes<-rep(0,10)
bidAge<-rep(0,10)
offerPrices<-rep(0,10)
offerSizes<-rep(0,10)
offerAge<-rep(0,10)

dataRows<-length(cleanData[,1])
bidTotal<-rep(0,dataRows)
offerTotal<-rep(0,dataRows)
midPrice<-rep(0,dataRows)
fwdIndex<-rep(0,dataRows)

# loop through the data, recreating the working order book at each instant in time.
for (rowNum in 1:dataRows) {
  if ((cleanData[rowNum,6]==1) & (cleanData[rowNum,7]>0)) {          # bid-side event
    P<-cleanData[rowNum,7]
    Q<-cleanData[rowNum,8]
    if (P %in% bidPrices) {                                          # change at an existing level
      bidSizes[which(bidPrices==P)]<-Q
      bidAge[which(bidPrices==P)]<-1
      bidAge[which(bidPrices!=P)]<-bidAge[which(bidPrices!=P)]+1
    } else if (P>max(bidPrices)) {                                   # new best bid displaces the lowest level
      delete<-which.min(bidPrices)
      bidPrices[delete]<-P
      bidSizes[delete]<-Q
    } else if (P<min(bidPrices)) {                                   # new worst bid displaces the highest level
      delete<-which.max(bidPrices)
      bidPrices[delete]<-P
      bidSizes[delete]<-Q
    }
  } else if ((cleanData[rowNum,6]==0) & (cleanData[rowNum,7]>0)) {   # offer-side event
    P<-cleanData[rowNum,7]
    Q<-cleanData[rowNum,8]
    if (P %in% offerPrices) {
      offerSizes[which(offerPrices==P)]<-Q
      offerAge[which(offerPrices==P)]<-1
      offerAge[which(offerPrices!=P)]<-offerAge[which(offerPrices!=P)]+1
    } else if (P>max(offerPrices)) {
      delete<-which.min(offerPrices)
      offerPrices[delete]<-P
      offerSizes[delete]<-Q
    } else if (P<min(offerPrices)) {
      delete<-which.max(offerPrices)
      offerPrices[delete]<-P
      offerSizes[delete]<-Q
    }
  }
  bidTotal[rowNum]<-sum(bidSizes)
  offerTotal[rowNum]<-sum(offerSizes)
  if ((min(offerPrices)>0) & (max(bidPrices)>0))
    midPrice[rowNum]<-(min(offerPrices)+max(bidPrices))/2
}

# create a one-second grid over the sample and find the last event index at each second
ints<-seq(cleanDates[36],cleanDates[length(cleanDates)],"sec")
secondIndex<-findInterval(as.numeric(ints),as.numeric(cleanDates))
intensity<-diff(secondIndex)+1
imbalance<-bidTotal[secondIndex]-offerTotal[secondIndex]
priceDiffs<-diff(midPrice[secondIndex])   # one-second mid-price changes
results=NULL
library(vars)

# first look at order imbalance

alldat<-as.matrix(cbind(priceDiffs,imbalance[2:(length(imbalance))]))[921:4520,]
colnames(alldat)<-c("P_chg","imbalance")
varest<-VAR(alldat,p=1,type="const", lag.max=20, ic="AIC")
summary(varest)
summary(varest$varresult$P_chg$fitted.values)

# rolling 60-second serial covariance of the one-second price changes
serialCov<-NULL
for (ii in (921-60):(4520-60))
  serialCov<-rbind(serialCov,cov(priceDiffs[(ii-1):(ii+59)],priceDiffs[(ii):(ii+60)]))
plot(serialCov)
length(serialCov)

# second model: add log intensity and serial covariance as regressors
alldat<-(cbind(priceDiffs,imbalance[2:(length(imbalance))],log(intensity, base = 10)[2:(length(imbalance))])[921:4520,])
alldat<-cbind(alldat[,1:3],serialCov*1000) # 1000 times to avoid singularity
colnames(alldat)<-c("P_chg","order_chg","log_intensity","serial_cov")
varest<-VAR(alldat,type="const", lag.max=10, ic="AIC")
summary(varest)
summary(varest$varresult$P_chg$fitted.values)