There seems to be a scale that’s tipping at the moment: data or code that should be published* often is not. That may seem like an odd statement to anyone in science, but it is easy to show that most publications (at least in the geophysics community) present data only as a map, a scatter plot, or some other inaccessible form – a visualization rather than the data itself. Where is the spirit of reproducibility in such a practice?
Publications that do supplement the paper with data usually provide it as a table in an ASCII flat file. And yes, this is what I’m arguing for, but I have come across some hideously formatted tables (the worst are tables embedded in HTML or in a PDF) that nearly make me abandon all hope for the work.
So why am I writing this? Well, I’m preparing a paper for submission right now and thinking about how I want to publish the data (besides typeset tables or graphs), and I think I’ve come up with a solution for smaller datasets: an R package of datasets on CRAN.
What is CRAN? The Comprehensive R Archive Network. It’s a place for users of the R language to publish their code and/or data in a way that’s usable to the R community. Before the package (or an updated version of it) is accepted, it is checked for consistency; this means the only worry should be whether or not the code and/or data in the package is a pile of useless crap.
For the purposes of data publishing, we would need to consider a few things:
- Identification. How will the user know what to access?
  - The package would need to be identified somehow, either by a topic, a journal, or an author/working-group. I’d choose author, since it assigns responsibility to them.
  - The dataset in the package will need to be associated with a published work – perhaps a Digital Object Identifier handle?
- Obviously, once the data are in the package they may be accessed in the R language.
- Normalization. Flat-file tables, for example, are not necessarily well normalized, so I propose publishing the data in as normalized a form as the author can stand. This will allow for robust subsetting of the data with, for example, sqldf (if SQL commands are your fancy).
- Size. What happens when the dataset to be archived is very large? I propose the package author find a repository to host the file, and then write a function in the package which accesses the remote file.
- Versions. I’m certainly no expert in version control, but there should be strict rules here. 0.1-0 might be the first place to start, reading digit by digit as ‘zeroth version’ . ‘one dataset’ - ‘no minor updates/fixes’.
- Functions. Internal functions should probably be avoided unless they are necessary, used to reproduce a calculation (and hence a dataset), or needed for accessing large datasets [see the note on size above].
- Methods and Classes.
  - The class should be something specific either to the package or to the dataset – I would argue for the dataset (i.e. the publication).
  - There should be at least print, summary, and print.summary methods for the new class. If, as proposed above, the class is specific to the dataset, the methods could be easily customized.
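To illustrate the normalization point, here is a minimal sketch with two hypothetical tables (the station and measurement names are invented for illustration). Base R is used for the join and subset; the sqldf package would perform the equivalent operation with an SQL statement on the same data frames.

```r
# Hypothetical normalized tables: stations, and measurements keyed to them.
stations <- data.frame(
  sta_id = c("S1", "S2"),
  lat    = c(45.1, 47.3),
  lon    = c(-122.6, -120.9)
)
measurements <- data.frame(
  sta_id = c("S1", "S1", "S2"),
  time   = 1:3,
  value  = c(0.12, 0.34, 0.56)
)

# Join and subset -- the kind of operation normalization makes easy.
joined <- merge(stations, measurements, by = "sta_id")
north  <- subset(joined, lat > 46)

# With sqldf, the equivalent would be something like:
# sqldf("SELECT * FROM measurements m
#        JOIN stations s ON m.sta_id = s.sta_id WHERE s.lat > 46")
```

Keeping station metadata in its own table means a coordinate fix touches one row, not every measurement.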
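For the large-dataset case, the accessor shipped in the package might look something like the sketch below: fetch the hosted file on first use, cache it locally, and return a data frame. The URL and file names are placeholders, not a real repository.

```r
# Sketch of an accessor for a remotely hosted dataset.
# The URL below is a placeholder -- substitute the real repository address.
get_remote_data <- function(url = "https://example.org/mydata.csv",
                            cache = file.path(tempdir(), "mydata.csv")) {
  if (!file.exists(cache)) {
    download.file(url, destfile = cache, mode = "wb")  # fetch once
  }
  read.csv(cache)  # return the cached copy as a data.frame
}
```

The caching step matters: the user pays the download cost once per session, and the package itself stays small enough for CRAN.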
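Under the versioning scheme above, the version string would simply live in the package’s DESCRIPTION file. A minimal, entirely hypothetical example (package name and title invented):

```
Package: smithdata
Version: 0.1-0
Title: Datasets Accompanying a Hypothetical Smith (2011) Paper
Description: Data published in support of a journal article; the DOI of
    the associated work would be recorded here.
License: GPL-2
```

Bumping the middle digit would then signal a new dataset, and the trailing digit a fix to an existing one.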
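The methods-and-classes point can be sketched as a small S3 class; the class name `pubdata` and the citation text are invented for illustration.

```r
# Hypothetical S3 class tying a data.frame to its publication.
as_pubdata <- function(df, citation) {
  structure(df, citation = citation, class = c("pubdata", "data.frame"))
}

# print: show the citation, then the data.
print.pubdata <- function(x, ...) {
  cat("Published dataset --", attr(x, "citation"), "\n")
  print.data.frame(x, ...)
  invisible(x)
}

# summary: bundle the citation with the usual column summaries.
summary.pubdata <- function(object, ...) {
  s <- list(citation = attr(object, "citation"),
            stats    = summary.data.frame(object, ...))
  class(s) <- "summary.pubdata"
  s
}

# print.summary: how the summary object displays itself.
print.summary.pubdata <- function(x, ...) {
  cat("Summary of:", x$citation, "\n")
  print(x$stats)
  invisible(x)
}

d <- as_pubdata(data.frame(x = 1:3), "Smith (2011)")
```

After this, `print(d)` and `summary(d)` dispatch to the dataset-specific methods automatically, which is exactly the customizability argued for above.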
So I suppose this post has gone on long enough, but the argument seems sound given some fundamental considerations. Data should be published, and the R/CRAN package versioning system may provide just the right opportunity for doing so easily, reproducibly, and in a way that benefits many more than just the scientific community with journal access. It would also force more people to learn R!
*I apologize if you’re not granted access to this piece.