It checks a few basic things you’d really want to know but might forget to check yourself, like whether any rows are exact duplicates, or whether any columns are totally empty.
There are things I always forget to check until they cause a bug, like whether geographic coordinates are within -180 to 180 degrees latitude or longitude.
And there are things I never think to check, though I should, like whether there are exactly 65k rows (probably an error exporting from Excel) or whether integers are exactly at certain common cutoff/overflow values.
I like the idea of automating this. It certainly wouldn’t absolved me of the need to think critically about a new dataset—but it might flag some things I wouldn’t have caught otherwise.
(They also do some statistical checks for outliers; but being a statistician, this is one thing I do remember to do myself. (I’d like to think) I do it more carefully than any simple automated check.)
Does an R package like this exist already? The closest thing in spirit that I’ve seen is
testdat, though I haven’t played with that yet. If not, maybe
testdat could add some more of Data Proofer’s checks. It’d become an even more valuable tool to run whenever you load or import any tabular dataset for the first time.