There’s no mistake in the barley data

July 21, 2014
(This article was first published on Revolutions, and kindly contributed to R-bloggers)

Statistics has many canonical data sets. For classification statistics, we have Fisher's iris data. For Big Data statistics, the canonical data set used in many examples is the Airlines data. And for dotplots, we have the barley data, first popularized by Bill Cleveland in the landmark 1993 text Visualizing Data. Cleveland's innovations in data visualization were hugely influential in the S language and (later) R's lattice and ggplot2 packages, and the panel chart of the barley data shown below is one of the best known.

[Figure: barley yields by variety at six Minnesota sites, 1931 and 1932]

The chart above shows the yields for several different varieties of barley (Trebi, Glabron and so on) planted at each of six different sites in Minnesota (Duluth, Grand Rapids, etc.) in the years 1931 (pink) and 1932 (blue). The reason this data set has become legendary appears in the "Morris" panel, where, unlike at the other sites, the yields in 1932 exceeded those in 1931 for all barley varieties. Spotting that reversal at a glance is a great demonstration of the power of dotplots and panel graphics. In his book, Cleveland said that "either an extraordinary natural event, such as disease or a local weather anomaly, produced a strange coincidence, or the years for Morris were inadvertently reversed", and "on the basis of the evidence, the mistake hypothesis would appear to be the more likely."
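For readers who want to redraw the chart, here is a minimal sketch in R. It assumes the `barley` data frame that ships with the lattice package (columns `yield`, `variety`, `year`, `site`), which holds the 1931 and 1932 data. The plot settings and the name `higher_1932` are illustrative choices, not a reproduction of Cleveland's exact figure.

```r
# Sketch only: relies on the `barley` data frame from the lattice package.
library(lattice)

# One panel per site, one row per variety, one symbol per year.
p <- dotplot(variety ~ yield | site, data = barley, groups = year,
             auto.key = list(space = "right"),
             xlab = "Barley yield (bushels/acre)",
             layout = c(1, 6), aspect = 0.5)
print(p)

# Numeric check of the pattern in the chart: arrange the yields as a
# variety x site x year array (each cell holds a single observation),
# then count, per site, the varieties whose 1932 yield beat their 1931 yield.
y <- with(barley, tapply(yield, list(variety, site, year), mean))
higher_1932 <- colSums(y[, , "1932"] > y[, , "1931"])
print(higher_1932)
```

If the data match the chart, Morris should stand out in `higher_1932` as the site where the most varieties yielded more in 1932 than in 1931.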

But it now appears that, despite Cleveland's suggestion, the data are correct after all. In a paper published last year in The American Statistician, Kevin Wright notes that in that time period local effects of weather (especially drought), insects and disease had a greater impact on barley yields than any overall year-to-year effects, and that the results at Morris were not surprising. As evidence, Kevin offers extended barley yield data (available in his R package agridat) covering 10 years and 18 varieties. As you can see in the chart below, there is significant variation across years and within sites. Take a look at 1934, for example: a bounty of barley in Duluth, but a meagre crop in St Paul:

[Figure: barley yields by site, 1927 to 1936]

So in Cleveland's original example, it wasn't a data error that led to the "unusual" results at the Morris site. Rather, those results are an expected consequence of the year-to-year variation of yields at each of the growing sites. But it is no less interesting a data set for showing off the power of dot plots and panel charts, as you can see from several other examples included in Kevin Wright's paper linked below. (With thanks to Kevin for describing this example to me at the useR! 2014 poster session. You can see a version of his poster here.)

Wright, K. (2013). Revisiting Immer's Barley Data. The American Statistician, 67(3), 129–133.
