**R – Civil Statistician**, and kindly contributed to R-bloggers)

I just heard about Data Proofer (h/t Nathan Yau), a test suite of sanity-checks for your CSV dataset.

It checks a few basic things you’d really want to know but might forget to check yourself, like whether any rows are exact duplicates, or whether any columns are totally empty.

There are things I always forget to check until they cause a bug, like whether geographic coordinates are within -180 to 180 degrees latitude or longitude.

And there are things I never think to check, though I should, like whether there are exactly 65k rows (probably an error exporting from Excel) or whether integers are exactly at certain common cutoff/overflow values.

I like the idea of automating this. It certainly wouldn’t absolved me of the need to think critically about a new dataset—but it might flag some things I wouldn’t have caught otherwise.

(They also do some statistical checks for outliers; but being a statistician, this is one thing I do remember to do myself. (I’d like to think) I do it more carefully than any simple automated check.)

Does an R package like this exist already? The closest thing in spirit that I’ve seen is `testdat`

, though I haven’t played with that yet. If not, maybe `testdat`

could add some more of Data Proofer’s checks. It’d become an even more valuable tool to run whenever you load or import any tabular dataset for the first time.

**leave a comment**for the author, please follow the link and comment on their blog:

**R – Civil Statistician**.

R-bloggers.com offers

**daily e-mail updates**about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...