How to Check Data Quality using R

[This article was first published on R programming, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Share on FacebookTweet about this on TwitterShare on LinkedInShare on Google+


How to check data quality

By Milind Paradkar

Do You Use Clean Data?

Always go for clean data! Why is it that experienced traders/authors stress this point in their trading articles/books so often? As a novice trader, you might be using the freely available data from sources like Google or Yahoo finance. Do such sources provide accurate, quality data?

We decided to do a quick check and took a sample of 143 stocks listed on the National Stock Exchange of India Ltd (NSE). For these stocks, we downloaded the 1-minute intraday data for the period 1/08/2016 – 19/08/2016. The aim was to check whether Google finance captured every 1-minute bar during this period for each of the 143 stocks.

NSE’s trading session starts at 9:15 am and ends at 15:30 pm IST, thus comprising of 375 minutes. For 14 trading sessions, we should have 5250 data points for each of these stocks. We wrote a simple code in R to perform the check.

Here is our finding. Out of the 143 stocks scanned, 89 stocks had data points less than 5250, that’s more than 60% of our sample set!! The table shown below lists downs 10 such stocks from those 89 stocks.


Let’s take the case of PAGEIND. Google finance has captured only 4348 1-minute data points for the stock, thus missing 902 points!!

Example – Missing the 1306 minute bar on 20160801:

Missing the 1306 minute bar on 20160801

Example – Missing the 1032 minute bar on 20160802:

Missing the 1032 minute bar on 20160802

If a trader is running an intraday strategy which generates buy/sell signals based on 1-minute bars, the strategy is bound to give some false signals.

As can be seen from the quick check above, data quality from free sources or from cheap data vendors is not always guaranteed. Many of the cheap data vendors source the data from Yahoo finance and provide it to their clients. Poor data feed is a big issue faced by many traders and you will find many traders complaining about the same on various trading forums.

Backtesting a trading strategy using such data will give false results. If are using the data in live trading and in case there is a server problem with Google or Yahoo finance, it will lead to a delay in the data feed. As a trader, you don’t want to be in a position where you have an open trade, and the data feed stops or is delayed. When trading with real money, one is always advised to use quality data from reliable data vendors. After all, Data is Everything!

Next Step

If you’re a retail trader interested in learning various aspects of Algorithmic trading, check out the Executive Programme in Algorithmic Trading (EPAT). The course covers training modules like Statistics & Econometrics, Financial Computing & Technology, and Algorithmic & Quantitative Trading. The course equips you with the required skillsets to be a successful trader.

Download Data Files

  • Do You Use Clean Data.rar
    • 15 Day Intraday Historical
    • F&O Stock List.csv
    • R code – Google_Data_Quality_Check.txt
    • R code – Stock price data.txt



Share on FacebookTweet about this on TwitterShare on LinkedInShare on Google+

The post How to Check Data Quality using R appeared first on .

To leave a comment for the author, please follow the link and comment on their blog: R programming. offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)