Evaluating Quandl Data Quality – part II

December 2, 2013
(This article was first published on The R Trader » R, and kindly contributed to R-bloggers)

This post is a more in-depth analysis of Quandl futures data vs. Bloomberg data. Since my last post, Quandl has expanded its futures database from 68 contracts to 200+. For practical reasons, I limit myself here to the initial list of 60+ contracts. I’m still comparing the “Front Month” contract between the two sources. When evaluating the differences, I want the following:

  • A sense of the scale of the differences
  • A sense of the time localization of the differences (if any)
  • A single number that captures both features above
  • A measure that is comparable across instruments

After a bit of thinking, I came up with the following metric:

 $$ D_t = \frac{P(\text{Quandl})_t - P(\text{Bloomberg})_t}{\text{Tick Size}} $$
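The metric can be sketched in a few lines of Python. This is only an illustration, not the code used for the post: the function name `tick_differences`, the dictionary layout, and the prices are all made up, and only dates present in both sources are compared (an assumption about how missing days are handled).

```python
from datetime import date


def tick_differences(quandl, bloomberg, tick_size):
    """Return {date: D_t} for the dates present in both price series.

    D_t = (Quandl price - Bloomberg price) / tick size, so the result is
    expressed in ticks and is comparable across instruments.
    """
    if tick_size <= 0:
        raise ValueError("tick_size must be positive")
    # Assumption: dates missing from either source are simply skipped.
    common_dates = sorted(set(quandl) & set(bloomberg))
    return {d: (quandl[d] - bloomberg[d]) / tick_size for d in common_dates}


# Made-up prices for illustration only (not real data).
quandl_prices = {
    date(2013, 11, 25): 1800.50,
    date(2013, 11, 26): 1801.25,
    date(2013, 11, 27): 1805.00,
}
bloomberg_prices = {
    date(2013, 11, 25): 1800.50,
    date(2013, 11, 26): 1800.75,
    date(2013, 11, 27): 1805.00,
}

# The E-mini S&P 500 tick size is 0.25 index points.
diffs = tick_differences(quandl_prices, bloomberg_prices, tick_size=0.25)
for day, d_t in diffs.items():
    print(day, d_t)
```

Dividing by the tick size is what makes the number comparable across instruments: a two-tick gap means the same thing for an equity index future as for a commodity future, whatever the price level.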

As an example, below is the chart of the above formula over time for the E-mini S&P 500 contract.

[Figure: D_t over time for the E-mini S&P 500 front month contract (ES1)]

I plotted the same chart for each of the 60 contracts listed in my previous post. Interested readers can find all the charts here.

From my perspective there are essentially two main sources of differences: first, plainly wrong data points that are far from the actual prices, and second, a difference in the data building process (i.e. the construction methodology for the front month contract). A mix of both is very likely at play here. To quantify this, I defined one additional metric: Mean Absolute Difference (MAD).

 $$ \mathrm{MAD} = \frac{1}{n} \sum_{t=1}^{n} |D_t| \quad \text{for } D_t \neq 0 $$
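As a minimal sketch, the MAD can be computed as below. It assumes that n counts only the days on which the two sources disagree, which is how I read the "D_t <> 0" condition; the function name and the sample values are invented for illustration.

```python
def mean_absolute_difference(diffs):
    """Mean of |D_t| over the non-zero differences only.

    Assumption: days on which both sources agree (D_t == 0) are excluded
    from both the sum and the count n.
    """
    nonzero = [abs(d) for d in diffs if d != 0]
    if not nonzero:
        # The two sources agree everywhere, so there is nothing to average.
        return 0.0
    return sum(nonzero) / len(nonzero)


# Made-up tick differences: two disagreements out of five days.
example_diffs = [0.0, 2.0, 0.0, -4.0, 0.0]
mad = mean_absolute_difference(example_diffs)
print(mad)
```

Under this reading, the MAD measures how large the disagreements are when they occur, regardless of how often they occur. A single wildly wrong data point in an otherwise clean series can therefore produce a very large value.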

Instrument                 Quandl Symbol      Bloomberg Ticker    MAD (ticks)
Soybean Oil                OFDP/FUTURE_BO1    BO1 Comdty             12254897
Russian Ruble              OFDP/FUTURE_RU1    RU1 Curncy                29653
DJ-UBS Commodity Index     OFDP/FUTURE_AW1    DNA Index                  3041
S&P500 Volatility Index    OFDP/FUTURE_VX1    UX1 Index                  2453
Cocoa                      OFDP/FUTURE_CC1    CC1 Comdty                 1552
Lean Hogs                  OFDP/FUTURE_LN1    LH1 Comdty                  391

Ranking the 60+ contracts by MAD immediately identifies the largest differences: Soybean Oil, Russian Ruble, DJ-UBS Commodity Index, S&P500 Volatility Index, Cocoa, and Lean Hogs. These are the obvious candidates for immediate checking.
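The ranking step could look something like the sketch below, using the six MAD values reported above. The 1000-tick cutoff is purely hypothetical and only shows how a review list might be derived; it is not a threshold used in this analysis.

```python
# MAD values (in ticks) taken from the figures reported in this post.
mad_by_symbol = {
    "OFDP/FUTURE_LN1": 391,
    "OFDP/FUTURE_CC1": 1552,
    "OFDP/FUTURE_BO1": 12254897,
    "OFDP/FUTURE_VX1": 2453,
    "OFDP/FUTURE_RU1": 29653,
    "OFDP/FUTURE_AW1": 3041,
}

# Sort from the largest MAD to the smallest.
ranked = sorted(mad_by_symbol.items(), key=lambda item: item[1], reverse=True)

# Hypothetical cutoff, for illustration only.
REVIEW_THRESHOLD_TICKS = 1000
to_review = [symbol for symbol, mad in ranked if mad > REVIEW_THRESHOLD_TICKS]

for symbol, mad in ranked:
    print(f"{symbol:<18} {mad:>10}")
```

In practice the cutoff would depend on the instrument and on what the data is used for, so a ranked list is probably more useful than any fixed threshold.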

I put together what I think is the basis for a systematic data checking approach. It can obviously be refined in many ways, but those refinements depend largely on what one wants to do with the data and which contracts are relevant to the analyst. For example, I assume that most people care more about accurate data for the E-mini S&P 500 contract than for the Milk contract.

As usual, any comments are welcome.
