Evaluating Quandl Data Quality – part II

[This article was first published on The R Trader » R, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

This post is a more in depth analysis of Quandl futures data vs. Bloomberg data. Since my last post Quandl has updated its futures database to 200+ contracts from 68 contracts originally. For practical reasons, I limit myself here to the initial list of 60+ contracts. I’m still comparing the “Front Month” contract between the two sources. When evaluating the differences, I want the following:

  • Evaluate the scale of the differences
  • Evaluate the time localization of the differences (if any)
  • A single number that captures both features above
  • A measure that is comparable across instruments

After a bit of thinking, I came up with the below metric:

 D_t  =  {P(Quandl)_t  -  P(Bloomberg)_t }/ {Tick Size}

As an example, below is the chart of the above formula over time for the E-mini S&P 500 contract.


I plotted the same chart for each of the 60 contracts in the list of my previous post. Interested readers can find all the charts here.

From my perspective there are essentially two main sources of differences. First, plain wrong data points largely off compared to the reality and second a difference in the data building process (i.e. construction methodology for the front month contract). A mix of both is very likely to happen here.  In order to quantify this, I defined one additional metric: Mean Absolute Differences (MAD).

  0″ title=”MAD=sum{t=1}{n}{Abs(D_t)}/n for D_t <> 0″/>

InstrumentQuandl SymbolBloomberg TickerMAD
Soybean OilOFDP/FUTURE_BO1BO1 Comdty12254897
Russian RubleOFDP/FUTURE_RU1RU1 Curncy29653
DJ-UBS Commodity IndexOFDP/FUTURE_AW1DNA Index3041
S&P500 Volatility IndexOFDP/FUTURE_VX1UX1 Index2453
CocoaOFDP/FUTURE_CC1CC1 Comdty1552
Lean HogsOFDP/FUTURE_LN1LH1 Comdty391

Ranking the 60+ contracts on MAD allows to identify immediately large differences which are: Soybean Oil, Russian Ruble, DJ-UBS Commodity Index, S&P500 Volatility Index, Cocoa, and Lean Hogs. Those are the obvious candidates for immediate checking.

I put together what I think is the basis for a systematic data checking approach. It can obviously be refined in many ways but those refinements are largely dependent upon what one want to do with the data and which contracts are relevant to the analyst. As an example I assume that it is more relevant for most people to have accurate data for the E-mini S&P 500 contract than for the Milk contract.

As usual any comments welcome

To leave a comment for the author, please follow the link and comment on their blog: The R Trader » R.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)