Can you spot the Error?

[This article was first published on Statistical Graphics and more » R, and kindly contributed to R-bloggers.]

Peter Huber referred to “the rawness of raw data”, a kind of data we would not expect to find in a textbook. The book by Fahrmeir and Tutz on multivariate modelling lists the visual impairment data from Liang et al., 1992 in table 3.12:

Visual Impairment Data from Liang et al. as found in Fahrmeir and Tutz

Nothing wrong here at first sight; but how would you tell? Some people are actually able to look at a non-trivial table and spot “the round peg in the square hole”, but that just won’t work for the rest of us.

As you might guess, I am going to make a case for graphics here.

Let’s start with what the mainstream would do: plot the data in a dotplot-like display using the trellis paradigm of conditioning. I used ggplot2 to make sure the trellis display is state of the art. A simple

  library(ggplot2)
  qplot(count, side, data=visual2, colour=impaired) + facet_grid(age ~ race)

gives me:

The visual impairment data in a trellis display

(I still have a hard time finding that syntax intuitive …) Surprisingly, this plot is already sufficient to spot the “problem” in the data, although some important properties of the data can’t be seen here.

A mosaic plot makes the whole thing even easier:

(impairment cases highlighted, left and right is left and right)
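A comparable mosaic plot can be sketched in base R. This is only an illustration under stated assumptions: the post does not say which tool produced the plot above, the `toy` data frame and all of its counts are invented (they are not the values from table 3.12), and only the column names follow the `qplot` call.

```r
# Toy counts, invented for illustration; NOT the values from table 3.12.
# expand.grid() varies the first variable fastest, so the counts below
# run no/yes within left/right within age within race.
toy <- expand.grid(
  impaired = c("no", "yes"),
  side     = c("left", "right"),
  age      = c("61-70", "70+"),
  race     = c("white", "black")
)
toy$count <- c(90, 10, 88, 12,   # white, 61-70
               80, 20, 78, 22,   # white, 70+
               85, 15, 84, 16,   # black, 61-70
               70, 30, 55, 35)   # black, 70+: right-eye total is short

# Cross-tabulate; xtabs() sums `count` within each cell.
tab <- xtabs(count ~ race + age + side + impaired, data = toy)

# Tile areas are proportional to the cell counts, so a group with too few
# cases on one side shows up as a visibly narrower tile.
mosaicplot(tab, color = TRUE, main = "Toy visual impairment counts")
```

The order of the variables in the formula determines the order in which the plot is split, so it is worth trying a few orderings.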

The left and right cases are (what a surprise) always of the same size, except for the 70+, black group. It is hard to believe that this group contains 110 cyclopes who have no right eye.

In the mosaic plot the higher proportion of impaired right eyes for 70+ blacks immediately catches the eye, but what reveals the error is the missing independence between race and side for 70+. That implies that we have too few cases here, and what is '226' in the table should actually be '336'.
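The same consistency check can also be done numerically: every person has a left and a right eye, so the left and right totals within each age and race group should agree. The following is a minimal sketch, assuming a data frame with the columns used in the `qplot` call; the helper `side_mismatch()` is hypothetical and the toy counts are invented, not taken from the table.

```r
# Hypothetical helper (not from the original post): list the age x race
# groups whose left-eye and right-eye totals disagree.
side_mismatch <- function(d) {
  # Total count per age x race x side, summed over impairment status
  tot <- aggregate(count ~ age + race + side, data = d, FUN = sum)
  left  <- tot[tot$side == "left",  c("age", "race", "count")]
  right <- tot[tot$side == "right", c("age", "race", "count")]
  m <- merge(left, right, by = c("age", "race"),
             suffixes = c("_left", "_right"))
  m$diff <- m$count_left - m$count_right
  # Keep only the groups where the two sides do not match
  m[m$diff != 0, ]
}

# Toy counts, invented for illustration only.
toy <- data.frame(
  age      = rep(c("61-70", "70+"), each = 4),
  race     = "black",
  side     = rep(rep(c("left", "right"), each = 2), times = 2),
  impaired = rep(c("yes", "no"), times = 4),
  count    = c(10, 90, 12, 88,   # 61-70: left = 100, right = 100
               20, 80, 25, 65)   # 70+:   left = 100, right = 90
)

side_mismatch(toy)
```

Applied to the table as printed, a check like this should report a difference of 110 for the 70+, black group, which is exactly the gap between 226 and 336.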

Here is the (corrected) data.
