Data sanity checks: Data Proofer (and R analogues?)

May 20, 2016

(This article was first published on R – Civil Statistician, and kindly contributed to R-bloggers)

I just heard about Data Proofer (h/t Nathan Yau), a test suite of sanity-checks for your CSV dataset.

It checks a few basic things you’d really want to know but might forget to check yourself, like whether any rows are exact duplicates, or whether any columns are totally empty.
There are things I always forget to check until they cause a bug, like whether geographic coordinates are in range: latitude within -90 to 90 degrees, longitude within -180 to 180.
And there are things I never think to check, though I should, like whether there are exactly 65k rows (65,536 is the row limit of the old Excel .xls format, so this probably means a truncated export) or whether integers are exactly at certain common cutoff/overflow values.
I like the idea of automating this. It certainly wouldn’t absolve me of the need to think critically about a new dataset, but it might flag some things I wouldn’t have caught otherwise.
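
To make the idea concrete, here is a minimal base-R sketch of a few such checks. It is not Data Proofer’s code and not testdat’s API: the function name `sanity_check`, its arguments, the list of "suspicious" integer limits, and the choice to treat 65,535 rows (a header using up one Excel row) as suspect are all my own assumptions for illustration.

```r
# Hypothetical helper (not from Data Proofer or testdat): basic sanity checks
# on a data frame. `lat` and `lon` are optional names of coordinate columns.
sanity_check <- function(df, lat = NULL, lon = NULL) {
  # A column counts as empty if every value is NA or a blank string.
  is_blank <- function(x) all(is.na(x) | trimws(as.character(x)) == "")
  empty_cols <- names(df)[vapply(df, is_blank, logical(1))]

  # Assumed list of common cutoff/overflow values (16- and 32-bit maxima).
  limits <- c(32767, 65535, 2147483647, 4294967295)
  num_cols <- names(df)[vapply(df, is.numeric, logical(1))]
  hits_limit <- vapply(df[num_cols], function(x) any(x %in% limits), logical(1))

  # Row indices whose coordinate lies outside [-bound, bound]; NAs are skipped.
  out_of_range <- function(col, bound) {
    if (is.null(col)) return(integer(0))
    which(abs(df[[col]]) > bound)
  }

  list(
    n_duplicate_rows   = sum(duplicated(df)),          # exact repeats of earlier rows
    empty_columns      = empty_cols,
    excel_row_limit    = nrow(df) %in% c(65535, 65536), # assumption: header may use a row
    columns_at_limit   = num_cols[hits_limit],
    bad_latitude_rows  = out_of_range(lat, 90),
    bad_longitude_rows = out_of_range(lon, 180)
  )
}

# Toy data with one duplicated row, an empty column, a value at a 16-bit
# limit, and a few out-of-range coordinates.
toy <- data.frame(
  id    = c(1, 2, 2, 32767),
  lat   = c(40.4, 95, 95, -12.1),
  lon   = c(-79.9, 10, 10, 181),
  notes = NA
)
res <- sanity_check(toy, lat = "lat", lon = "lon")
str(res)
```

On the toy data this should report one duplicate row, the empty `notes` column, the `id` column as sitting at a limit, and the rows with impossible coordinates. A flag here only means "look at this", not "this is wrong": a dataset can legitimately contain the value 32767 or have 65,536 rows.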

(They also do some statistical checks for outliers. But being a statistician, this is one thing I do remember to do myself, and I’d like to think I do it more carefully than any simple automated check.)

Does an R package like this exist already? The closest thing in spirit that I’ve seen is testdat, though I haven’t played with that yet. If not, maybe testdat could add some more of Data Proofer’s checks. It’d become an even more valuable tool to run whenever you load or import any tabular dataset for the first time.
