[This article was first published on   Commodity Stat Arb, and kindly contributed to R-bloggers].  (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
            Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
It has been a couple of months since my last post; busy with lots of projects.
I had some fun playing around with data from Interactive Brokers API.  It turns out that it is relatively easily to get hold of the raw market data relating to both trades and order book changes for CME/NYMEX commodity futures.  For the purposes of this analysis I focused on the order book itself and what if anything one can imply forwards in time based on it.
All of the analysis was done in R.  I have been experimenting with R for a year or so and have found it to have many advantages for statistical analyses.  I put all the code at the bottom of the posting for those of you interested in doing something similar.
The raw data needs some manipulation to make it useful.  Each record is a single “order book event” as I think of it.  It could be a new order at a new price or a change to the number of buys or sells in the existing order book.  Only the first 10 prices are supplied.  As a result there are over 500,000 records for a single day’s 7 hours of trading during the pit hours (when most of the trade volume occurs).
For example, reconstructing a time series of the order book itself we see that as of 
2012-02-17 13:52:55.302678
 time the order book consisted of these bids:
| Price | 2.689  | 2.690  | 2.691  | 2.692  | 2.694  | 2.695  | 2.685  | 2.686  | 2.687  | 2.688 | 
| quantity | 18  | 37  | 44  | 34  | 10   | 1  | 16   | 8   | 6  | 16 | 
And these offers:
| Price | 2.701  | 2.702  | 2.703  | 2.704  | 2.705  | 2.696  | 2.697  | 2.698  | 2.699  | 2.700 | 
| quantity | 168   | 11   | 65    | 8   | 10    | 6   | 79   | 21   | 11   | 12 | 
So a rudimentary analysis of the data might focus on the imbalance between the sum of buys and the sum of sells.  In the above example there are a total of 190 bids showing and 391 offers showing.  Surely a difference of this magnitude would be meaningful?
For this sort of time series analysis we turn to the Vector Autoregression [VAR] models (as described in previous postings).  We have two time series, each potentially dependent on itself and the other.
The first step is to discretize the data into 1 second interval data since the raw data is event driven and not summarized.  The coding steps for this are outlined in R below.  Also we focus initially on the price implied by the average of the best bid and offer.  This is because trades may not occur contemporaneously with order book changes.  Therefore the best price indicator we have at any point in time is based on the best bid and offer. 
The sample data we are using is from the 2nd hour of trading from 10am to 11am on Friday February 17th 2012.
Estimation results for equation P_chg: ====================================== P_chg = P_chg.l1 + imbalance.l1 + P_chg.l2 + imbalance.l2 + P_chg.l3 + imbalance.l3 + P_chg.l4 + imbalance.l4 + P_chg.l5 + imbalance.l5 + const Estimate Std. Error t value Pr(>|t|) P_chg.l1 -2.592e-04 1.675e-02 -0.015 0.9877 imbalance.l1 7.496e-07 2.681e-07 2.796 0.0052 ** P_chg.l2 7.043e-02 1.676e-02 4.202 2.71e-05 *** imbalance.l2 1.594e-07 3.571e-07 0.447 0.6553 P_chg.l3 2.088e-02 1.679e-02 1.244 0.2138 imbalance.l3 -6.813e-07 3.572e-07 -1.907 0.0566 . P_chg.l4 -6.725e-03 1.675e-02 -0.401 0.6881 imbalance.l4 3.668e-07 3.536e-07 1.037 0.2997 P_chg.l5 -6.367e-03 1.666e-02 -0.382 0.7023 imbalance.l5 -5.575e-07 2.625e-07 -2.124 0.0337 * const 5.517e-06 9.708e-06 0.568 0.5698 --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 0.0005032 on 3584 degrees of freedom Multiple R-Squared: 0.01049, Adjusted R-squared: 0.007733 F-statistic: 3.801 on 10 and 3584 DF, p-value: 4.028e-05
Looking at the results we see that the model is significant.  This means that we have a model that can explain the change in price at time t based on information available at time t-1.  Specifically it indicates that the price change at time t can be forecast with the price changes and order imbalance data at times t-1, t-2 and t-3.
< !--[endif]-->
Unfortunately when we look at the fitted values, we see that the model isn’t giving significant enough forecasts of price.  Fitted values range from -6.543e-04 to 6.409e-04.  I.e.: +/- 0.6 of a tick [see also chart 1].  Given that trading costs alone are approximately 1 tick we haven’t created a forecast that we can actually make any money from.  However it is interesting to get the intuitive validation that prices have a momentum component on a short term basis, and that changes in the order book (increasing number of buys vs. sells) is also a positive predictor for price.
More explanatory variables is one solution to trying to get more explanatory power.  
The data is distributed randomly in time; maybe the frequency of data or intensity of the order book event process is relevant [see chart 2].  Since intensity is highly skewed by adding 1 and taking the log of the data we get a more useful distribution for statistical fitting purposes.
< !--[endif]-->
Also, (borrowing from Hasbrouck’s Empirical Market Microstructure), it is also worth considering the serial covariance of prices.  Deviations of this from normal can be a signal of private information entering the market.  Chart 3 shows the rolling serial covariance of price changes.  Significantly positive periods may be predictive of market moving activity.
< !--[endif]-->
Rerunning the same analysis with these two additional variables we get the results:
stimation results for equation P_chg: ====================================== P_chg = P_chg.l1 + order_chg.l1 + log_intensity.l1 + serial_cov.l1 + P_chg.l2 + order_chg.l2 + log_intensity.l2 + serial_cov.l2 + P_chg.l3 + order_chg.l3 + log_intensity.l3 + serial_cov.l3 + P_chg.l4 + order_chg.l4 + log_intensity.l4 + serial_cov.l4 + const Estimate Std. Error t value Pr(>|t|) P_chg.l1 3.031e-04 1.678e-02 0.018 0.98559 order_chg.l1 7.210e-07 2.685e-07 2.686 0.00727 ** log_intensity.l1 2.574e-06 1.396e-05 0.184 0.85376 serial_cov.l1 4.551e-01 1.147e+00 0.397 0.69145 P_chg.l2 7.220e-02 1.678e-02 4.304 1.72e-05 *** order_chg.l2 1.500e-07 3.576e-07 0.419 0.67503 log_intensity.l2 -1.737e-05 1.446e-05 -1.202 0.22961 serial_cov.l2 -5.224e-01 1.650e+00 -0.317 0.75153 P_chg.l3 2.208e-02 1.678e-02 1.316 0.18837 order_chg.l3 -6.861e-07 3.541e-07 -1.938 0.05274 . log_intensity.l3 8.787e-06 1.445e-05 0.608 0.54320 serial_cov.l3 2.059e+00 1.649e+00 1.248 0.21204 P_chg.l4 -9.888e-03 1.669e-02 -0.593 0.55352 order_chg.l4 -1.387e-07 2.630e-07 -0.528 0.59786 log_intensity.l4 1.606e-06 1.395e-05 0.115 0.90837 serial_cov.l4 -1.964e+00 1.146e+00 -1.714 0.08663 . const 1.152e-05 2.271e-05 0.507 0.61204 --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 0.0005036 on 3579 degrees of freedom Multiple R-Squared: 0.01048, Adjusted R-squared: 0.006059 F-statistic: 2.37 on 16 and 3579 DF, p-value: 0.001621
Unfortunately the range of fitted values is still about 0.6 of a tick: -6.601e-04 to 6.419e-04 and the p value is not as compelling as before.  We have added 2 variables and the model is no better a predictor than it was before.
So in summary, price movements can be predicted using a vector autoregression model, looking at lagged price change, order imbalance, message intensity and serial correlation.  However the strength of the price signal is small relative to transaction costs.
Many thanks to Bernhard Pfaff for his R package ‘vars’ as well as the companion book Analysis of Integrated and Cointegrated Time Series with R, and Joel Hasbrouck for his book Empirical Market Microstructure.
< o:p>
< o:p>I can be reached at lloyd.spencer@kahutrading.com 
___________________________________________________
For those interested in the R code see below.  I am aware there are some loops and other coding blunders but it does work:
Download data from IB:
library(IBrokers)< o:p>
tws <- twsConnect(clientId = 1)< o:p>
twsPrompt<-twsFuture(symbol=”NG”,”NYMEX”,”20120227″)< o:p>
reqMktData(tws, twsList, CALLBACK = NULL, file = “data.dat”)< o:p>
Load and analyze data:
setwd(‘C:/Users/Lloyd/Documents/trading/analysis/2012/intraday/’)< o:p>
colType<-c(“Date”,”POSIXct”,”integer”,”integer”,”integer”,”integer”,”integer”,”integer”,”numeric”,”integer”,”character”)< o:p>
depthData<-read.table(“depthCL.dat”,header=F, row.names = NULL, sep = ” “, fill=TRUE) #,colClasses=colType< o:p>
options(digits.secs=6)  # option for formatting dates with millisecond timestamps< o:p>
cleanDates<-strptime(paste(depthData[,1],depthData[,2]),”%Y%m%d %H:%M:%OS”)< o:p>
cleanData<-array(0,c(length(cleanDates),8))< o:p>
for (col in 3:10) cleanData[,(col-2)]<-as.numeric(as.vector(depthData[,col]))< o:p>
# remove bad data< o:p>
keep<-which(!is.na(cleanDates))< o:p>
cleanData<-cleanData[keep,]< o:p>
cleanDates<-cleanDates[keep]< o:p>
keep<-which(!is.na(cleanData[,5]))< o:p>
cleanData<-cleanData[keep,]< o:p>
cleanDates<-cleanDates[keep]< o:p>
bidPrices<-rep(0,10)< o:p>
bidSizes<-rep(0,10)< o:p>
bidAge<-rep(0,10)< o:p>
offerPrices<-rep(0,10)< o:p>
offerSizes<-rep(0,10)< o:p>
offerAge<-rep(0,10)< o:p>
dataRows<-length(cleanData[,1])< o:p>
bidTotal<-rep(0,dataRows)                 < o:p>
offerTotal<-rep(0,dataRows)                 < o:p>
midPrice<-rep(0,dataRows)< o:p>
fwdIndex<-rep(0,dataRows)< o:p>
# loop through the data, recreating the working order book at each instant in time.< o:p>
for (rowNum in 1:dataRows) {< o:p>
  if ((cleanData[rowNum,6]==1)&(cleanData[rowNum,7]>0)) {< o:p>
    P<-cleanData[rowNum,7]< o:p>
    Q<-cleanData[rowNum,8]< o:p>
    if (P %in% bidPrices) {< o:p>
      bidSizes[which(bidPrices==P)]<-Q< o:p>
      bidAge[which(bidPrices==P)]<-1< o:p>
      bidAge[which(bidPrices!=P)]<-bidAge[which(bidPrices!=P)]+1< o:p>
    } else {< o:p>
      if (P>max(bidPrices)) {< o:p>
        delete<-which.min(bidPrices)< o:p>
        bidPrices[delete]<-P< o:p>
        bidSizes[delete]<-Q< o:p>
        < o:p>
      } else if (P<min(bidPrices)) {< o:p>
        delete<-which.max(bidPrices)< o:p>
        bidPrices[delete]<-P< o:p>
        bidSizes[delete]<-Q< o:p>
      }< o:p>
    }< o:p>
  } else if ((cleanData[rowNum,6]==0)&(cleanData[rowNum,7]>0)) {< o:p>
    P<-cleanData[rowNum,7]< o:p>
    Q<-cleanData[rowNum,8]< o:p>
    if (P %in% offerPrices) {< o:p>
      offerSizes[which(offerPrices==P)]<-Q< o:p>
      offerAge[which(offerPrices==P)]<-1< o:p>
      offerAge[which(offerPrices!=P)]<-offerAge[which(offerPrices!=P)]+1< o:p>
    } else {< o:p>
      if (P>max(offerPrices)) {< o:p>
        delete<-which.min(offerPrices)< o:p>
        offerPrices[delete]<-P< o:p>
        offerSizes[delete]<-Q< o:p>
        < o:p>
      } else if (P<min(offerPrices)) {< o:p>
        delete<-which.max(offerPrices)< o:p>
        offerPrices[delete]<-P< o:p>
        offerSizes[delete]<-Q< o:p>
      }< o:p>
    }< o:p>
  }< o:p>
  bidTotal[rowNum]<-sum(bidSizes) < o:p>
  offerTotal[rowNum]<-sum(offerSizes)< o:p>
  if ((min(offerPrices)>0)&(max(bidPrices)>0)) midPrice[rowNum]<-(min(offerPrices)+max(bidPrices))/2< o:p>
}< o:p>
# create second indices for dates with correct start of day< o:p>
ints<-seq(cleanDates[36],cleanDates[length(cleanDates)],”sec”)< o:p>
secondIndex<-findInterval(as.numeric(ints),as.numeric(cleanDates))< o:p>
intensity<-diff(secondIndex)+1< o:p>
imbalance<-bidTotal[secondIndex]-offerTotal[secondIndex]< o:p>
priceChange<-diff(midPrice[secondIndex])< o:p>
priceDiffs<-diff(midPrice[secondIndex])< o:p>
results=NULL< o:p>
library(vars)< o:p>
# first look at order imbalance< o:p>
alldat<-as.matrix(cbind(priceDiffs,imbalance[2:(length(imbalance))]))[921:4520,]< o:p>
colnames(alldat)<-c(“P_chg”,”imbalance”)< o:p>
varest<-VAR(alldat,p=1,type=”const”, lag.max=20, ic=”AIC”)< o:p>
summary(varest)< o:p>
summary(varest$varresult$P_chg$fitted.values)< o:p>
serialCov<-NULL< o:p>
for (ii in (921-60):(4520-60)) serialCov<-rbind(serialCov,cov(priceDiffs[(ii-1):(ii+59)],priceDiffs[(ii):(ii+60)]))< o:p>
plot(serialCov)< o:p>
length(serialCov)< o:p>
alldat<-(cbind(priceDiffs,imbalance[2:(length(imbalance))],log(intensity, base = 10)[2:(length(imbalance))])[921:4520,])< o:p>
alldat<-cbind(alldat[,1:3],serialCov*1000)  # 1000 times to avoid singularity < o:p>
colnames(alldat)<-c(“P_chg”,”order_chg”,”log_intensity”,”serial_cov”)< o:p>
varest<-VAR(alldat,type=”const”, lag.max=10, ic=”AIC”)< o:p>
summary(varest)< o:p>
summary(varest$varresult$P_chg$fitted.values)< o:p>
To leave a comment for the author, please follow the link and comment on their blog:  Commodity Stat Arb.
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
