There’s no mistake in the barley data

[This article was first published on Revolutions, and kindly contributed to R-bloggers.]

Statistics has many canonical data sets. For classification, we have Fisher's iris data. For Big Data statistics, the canonical data set used in many examples is the Airlines data. And for dotplots, we have the barley data, first popularized by Bill Cleveland in the landmark 1993 text Visualizing Data. Cleveland's innovations in data visualization were hugely influential in the S language and (later) R's lattice and ggplot2 packages, and the panel chart of the barley data shown below is one of the best known.

[Figure: panel dotplot of barley yields by variety and site, 1931 and 1932]

The chart above shows the yields for several varieties of barley (Trebi, Glabron and so on) planted at each of six sites in Minnesota (Duluth, Grand Rapids, etc.) in the years 1931 (pink) and 1932 (blue). The reason this data set has become legendary appears in the “Morris” panel, where, unlike at all the other sites, the yields in 1932 exceeded those in 1931 for all barley varieties. This is a great demonstration of the power of dotplots and panel graphics: the reversal is obvious at a glance. In his book, Cleveland said that “either an extraordinary natural event, such as disease or a local weather anomaly, produced a strange coincidence, or the years for Morris were inadvertently reversed”, and “on the basis of the evidence, the mistake hypothesis would appear to be the more likely.”
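For readers who want to reproduce a chart along these lines, the two-year barley data ship with R's lattice package as the `barley` data frame. The following is a minimal sketch rather than the exact code behind the figure: the panel layout, legend placement and axis label are assumptions, and the per-site summary at the end is an added illustration, not part of Cleveland's analysis.

```r
library(lattice)  # recommended package bundled with R; provides `barley`

# Panel dotplot: one panel per site, varieties on the y-axis,
# and the two years distinguished by colour within each panel.
# (layout, legend position and axis label are illustrative choices)
p <- dotplot(variety ~ yield | site, data = barley, groups = year,
             layout = c(1, 6), auto.key = list(space = "right"),
             xlab = "Barley yield (bushels/acre)")
print(p)

# Mean yield for each site/year combination (a 6 x 2 matrix),
# then the 1932 minus 1931 difference for each site.
site_means <- tapply(barley$yield, list(barley$site, barley$year), mean)
year_diff <- site_means[, "1932"] - site_means[, "1931"]
print(round(year_diff, 1))
```

A site whose difference has the opposite sign from the others stands out immediately in `year_diff`, which is the numerical counterpart of what the panel chart shows visually.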

But it now appears that, despite Cleveland's suggestion, the data are correct after all. In a paper in the American Statistician published last year, Kevin Wright notes that in that time period local effects of weather (especially drought), insects and disease had a greater impact on barley yields than any overall year-to-year effects, and that the results at Morris were not surprising. As evidence, Kevin offers extended barley yield data (available in his R package agridat) covering 10 years and 18 varieties. As you can see in the chart below, there is significant variation across years and within sites. Take a look at 1934, for example: a bounty of barley in Duluth, but a meagre crop in St Paul:

[Figure: barley yields by site and year, 1927–1936]

So it goes to show that in Cleveland's original example, it wasn't a data error that led to the “unusual” results at the Morris site. Rather, they are an expected consequence of the year-to-year variation of yields at each of the growing sites. But it's no less interesting a data set for showing off the power of dot plots and panel charts, as you can see from several other examples included in Kevin Wright's paper linked below. (With thanks to Kevin for describing this example to me at the useR! 2014 poster session. You can see a version of his poster here.)

Wright, K. (2013). Revisiting Immer's Barley Data. The American Statistician, 67(3), 129–133.
