Data set with all the conceivable errors

June 14, 2016

(This article was first published on Peter Solymos - R related posts, and kindly contributed to R-bloggers)

As I was preparing for an R intro course
I came up with the idea of creating a fake data set that is stuffed full
of all the conceivable errors one can imagine.
Just in case my imagination falls short, I'd appreciate any suggestions
in the comments so that I can incorporate more errors.

There is a Hungarian saying about the veterinarian's horse to describe
a case that exhibits all the possible conditions a subject can suffer from
(read more of the etymology here).
I would like to create a data set that shows all the
possible errors a data set can exhibit. This data set would then be used in
the aforementioned course to make the participants' lives miserable, or rather,
their experience more diverse.

So far I have been able to come up with the following issues:

  • ill-formatted entries, usually from GIS output: "1,234,567.0058654" (the commas need to be stripped and the value converted to numeric; the trailing digits are irrelevant but take up space)
  • special characters (e.g. from MS Word) where UTF-8 or ASCII is expected
  • mixed case typos: "W-123" vs. "w-123"
  • leading/trailing whitespace: "W-123" vs. "W-123 "
  • MS Excel turning values into dates (e.g. 0-3 works fine, but 3-5 becomes 05-Mar)
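
To give a flavor of what course participants would have to deal with, here is a minimal sketch in base R. The data frame `messy`, its column names, and its values are all made up for illustration, and the clean-up steps are just one possible approach, not a complete recipe.

```r
# Hypothetical messy data set; every name and value here is invented
messy <- data.frame(
  site  = c("W-123", "w-123", "W-123 ", " W-124"),
  area  = c("1,234,567.0058654", "2,000.5", "15", "1,000,000"),
  range = c("0-3", "05-Mar", "0-3", "3-5"),
  note  = c("it's", "it\u2019s", "ok", "ok"),  # \u2019 is a curly apostrophe
  stringsAsFactors = FALSE
)

clean <- messy

# Leading/trailing whitespace and mixed case: trim, then force upper case
clean$site <- toupper(trimws(clean$site))

# Ill-formatted numbers: drop the commas, convert to numeric, and
# round away the digits that carry no information (2 is an arbitrary choice)
clean$area <- round(as.numeric(gsub(",", "", clean$area, fixed = TRUE)), 2)

# Excel date damage: the original value cannot be recovered with certainty,
# so only flag entries that look like "05-Mar"
clean$range_suspect <- grepl("^[0-9]{1,2}-[A-Za-z]{3}$", clean$range)

# Special characters: iconv() returns NA for elements it cannot
# convert to ASCII, which makes them easy to flag
clean$non_ascii <- is.na(iconv(clean$note, from = "UTF-8", to = "ASCII"))

unique(clean$site)
clean[clean$range_suspect | clean$non_ascii, ]
```

Flagging rather than silently fixing the date-like and non-ASCII entries is a deliberate choice in this sketch: "05-Mar" could have started out as 3-5 or 5-3, so a human has to decide.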

I don't imagine that this list can ever be complete, and right now it is far from it.
If you have struggled with a problem in the past and would like others to
learn from it, please leave a comment and I will expand the list.
