Looking after Datasets
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
by Antony Unwin
University of Augsburg, Germany
David Moore's definition of data: numbers that have been given a context.
Here is some context for the finch dataset:
Fig 1: Illustrations of the beaks of four of Darwin's finches from “The Voyage of the Beagle”. Note that only one of these (fortis) is included in the dataset.
R's package system is one of its great strengths, offering powerful additional capabilities of all kinds and including many interesting real datasets. Of course not all packages are as good as they might be, and as Bill Venables memorably put it on R-help in 2007: “Most packages are very good, but I regret to say some are pretty inefficient and others downright dangerous.” So you have to treat any packages with care (“caveat downloader”, as the R-omans might have said) and any datasets supplied must be handled carefully too.
You might think that supplying a dataset in an R package would be a simple matter: You include the file, you write a short general description mentioning the background and giving the source, you define the variables. Perhaps you provide some sample analyses and discuss the results briefly. Kevin Wright's agridat package is exemplary in these respects.
As it happens, there are a couple of other issues that turn out to be important. Is the dataset or a version of it already in R and is the name you want to use for the dataset already taken? At this point the experienced R user will correctly guess that some datasets have the same name but are quite different (e.g., movies, melanoma) and that some datasets appear in many different versions under many different names. The best example I know is the Titanic dataset, which is available in the datasets package. You will also find titanic (COUNT, prLogistic, msme), titanic.dat (exactLoglinTest), titan.Dat (elrm), titgrp (COUNT), etitanic (earth), ptitanic (rpart.plot), Lifeboats (vcd), TitanicMat (RelativeRisk), Titanicp (vcdExtra), TitanicSurvival (effects), Whitestar (alr4), and one package, plotrix, includes a manually entered version of the dataset in one of its help examples. The datasets differ on whether the crew is included or not, on the number of cases, on information provided, on formatting, and on discussion, if any, of analyses. Versions with the same names in different packages are not identical. There may be others I have missed.
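A rough way to check for such name clashes is to search the dataset listings of the packages you have installed. The sketch below relies on base R's data() listing; the helper name find_datasets is hypothetical, and the search only sees packages installed locally, not everything on CRAN.

```r
# Hypothetical helper: list datasets whose names match a pattern,
# searching only packages installed on this machine (not all of CRAN).
find_datasets <- function(pattern,
                          packages = .packages(all.available = TRUE)) {
  # data(package = ...) returns a listing whose $results element is a
  # character matrix with columns "Package", "LibPath", "Item", "Title".
  listing <- suppressWarnings(data(package = packages)$results)
  keep <- grepl(pattern, listing[, "Item"], ignore.case = TRUE)
  as.data.frame(listing[keep, c("Package", "Item", "Title"), drop = FALSE],
                stringsAsFactors = FALSE)
}

# Search the 'datasets' package only; drop the 'packages' argument
# to search everything installed (slower).
titanic_hits <- find_datasets("titan", packages = "datasets")
print(titanic_hits)
```

A name match says nothing about whether two versions hold the same cases or variables, so each hit still has to be compared by hand.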
The issue came up because I was looking for a dataset of the month for the website of my book “Graphical Data Analysis with R”. The plan is to choose a dataset from one of the recently released or revised R packages and publish a brief graphical analysis to illustrate and reinforce the ideas presented in the book while showing some interesting information about the data. The dataset finch in dynRB looked rather nice: five species of finch with nine continuous variables and just under 150 cases. It looked promising and what’s more it is related to Darwin’s work and there was what looked like an original reference from 1904.
Figure 2 shows the distribution of species:
Fig 2: The numbers of birds of the five different species in the dataset. The distribution is a little unbalanced with over half the birds being from one species, but that is real data for you.
Some of the variable names are clear enough (TailL must be the length of the tail), but what on earth could N.UBkL be? The help for the dataset only says it is a numeric vector. As a first resort I tried to find the 1904 reference on the web and it was surprisingly easy. The complete book is available and searchable from Cornell University Library. N.UBkL must be 'Maxilla from Nostril', i.e. the distance from nose to upper beak — obvious in retrospect really.
Naturally, once you have an original source in front of you, you explore a bit more. It turns out that the dataset only includes the birds found on one island, although the species may be found on more than one. That is OK (although the package authors could have told us). All cases with any missing values have been dropped (9 out of 155). You can understand why that might have been done (methods cannot handle missing values?), but mentioning it would have been nice. Information is available on sex for each bird in the original, but is not included in the dataset. Perhaps sex is not so relevant for their studies, but it is surely potentially very interesting to others. It is possible that the dataset was actually passed on to the authors by someone else and the authors themselves never looked at the original source. This would be by no means unusual in academic circles (sadly).
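Recording how many cases are lost when incomplete rows are dropped takes only a few lines, and the result can go straight into a dataset's help page. Here is a minimal sketch using base R's complete.cases(); the data frame birds and its values are made up for illustration and are not the finch measurements.

```r
# Toy stand-in data (hypothetical values, not the finch measurements).
birds <- data.frame(
  species = c("a", "a", "b", "b", "c"),
  wing    = c(60, NA, 72, 70, 65),
  tail    = c(40, 42, NA, 45, 41)
)

# Flag the rows that have no missing values in any column.
complete <- complete.cases(birds)

# Count the cases that would be dropped and the gaps per variable,
# so that both can be reported in the documentation.
n_dropped <- sum(!complete)
missing_by_variable <- colSums(is.na(birds))
cat(n_dropped, "of", nrow(birds), "cases have missing values\n")
print(missing_by_variable)

# The reduced dataset, if an analysis really needs complete cases.
birds_complete <- birds[complete, ]
```

Keeping the full data and filtering at analysis time leaves the choice to the user rather than making it for them.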
There is an extensive literature on Darwin's finches (which incidentally are not finches at all) and a key feature differentiating the species is the beak, as you can see in Fig 1. We can explore differences between the species' beaks more quantitatively by displaying the data in a suitable way:
Fig 3: A parallel coordinate plot of the nine measurements made on each bird with the five species distinguished by colour. The first two beak variables (BeakW and BeakH) separate the two bigger species from the other three. The following three variables (LBeakL, UBeakL, and N.UBkL) separate the smaller species from one another.
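For readers who want to try this kind of display, a parallel coordinate plot coloured by group can be drawn with parcoord() from the MASS package. This is only a sketch of the general technique, not necessarily how Fig 3 was produced, and it uses the built-in iris data as a stand-in because the column layout of the finch dataset is not assumed here.

```r
# Stand-in data: iris has four measurements and a species factor.
measurements <- iris[, 1:4]

# One colour index per case, determined by its species.
species_colour <- as.integer(iris$Species)

# parcoord() rescales each variable to its own range and draws one
# line per case across the variables; var.label = TRUE prints the
# minimum and maximum of each variable on its axis.
MASS::parcoord(measurements, col = species_colour, var.label = TRUE)
```

Variables on which the coloured bundles of lines separate cleanly are the ones that discriminate between groups, which is how the beak variables are read in Fig 3.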
Could the two bigger species be separated from one another using discriminant analysis or some machine learning technique? Possibly. I have not tried, but it is worth noting that these two species are considered to be two subspecies of the same species, so demonstrable differences are not so likely.
If you have got this far, you will realise that I am grateful to the package authors for providing this dataset in R and I appreciate their efforts. I just wish they had made a little more effort. When you think of how much care and effort went into collecting the real datasets we use (how long would it take you to collect so many birds, classify them and measure them?), we should take more trouble in looking after datasets and offer the original collectors of the data more respect and gratitude.
This is all true of so many datasets in R and you begin to wonder if there should not be a society for the protection of datasets (SPODS?). That might prevent them being so abused and maltreated. Far worse has been done to datasets in R than anything I have detailed here, but this is a family blog and details of graver abuses might upset sensitive readers.
To end on an optimistic note, some further googling led to the discovery of the complete data from the 1904 reference for all the species (there are not just 5 taxa, but 32) for all the Galapagos islands, with the sex variable, and with the cases with missing values. The source was the Dryad Digital Repository, a site that, I confess, was unknown to me. “The Dryad Digital Repository is a curated resource that makes the data underlying scientific publications discoverable, freely reusable, and citable.” Sounds good. We should encourage more sites like that, and we should encourage providers of datasets in R to look after any data in their care better.
And returning to Moore's definition of data, wouldn't it be a help to distinguish proper datasets from mere sets of numbers in R?
