For aaaagggggggeeeeeeesssssss now, I’ve been wittering on about how just publishing “open data” is okay insofar as it goes, but it’s often not that helpful, or at least, not as useful as it could be. Yes, it’s a Good Thing when a dataset is published in support of a report; but have you ever tried reproducing the charts, tables, or summary figures mentioned in the report from the data supplied along with it?
If a report is generated “from source” using something like Rmd (RMarkdown), which can blend text with analysis code and a means to import the data used in the analysis, as well as the automatically generated outputs, (such as charts, tables, or summary figures) obtained by executing the code over the loaded in data, third parties can see exactly how the data was turned into reported facts. And if you need to run the analysis again with a more recent dataset, you can do. (See here for an example.)
But publishing details about how to do the lengthy first mile of any piece of data analysis – finding the data, loading it in, and then cleaning and shaping it enough so that you can actually start to use it – has additional benefits too.
In the above linked example, the Rmd script links to a local copy of a dataset I’d downloaded onto my local computer. But if I’d written a properly reusable, reproducible script, I should have done at least one of the following two things:
- either added a local copy of the data to the repository and checked that the script correctly linked relatively to it;
- and/or provided the original download link for the datafile (and the HTML web page on which the link could be found) and loaded the data in from that URL.
Where the license of a dataset allows sharing, the first option is always a possibility. But where the license does not allow sharing on, the second approach provides a de facto way of sharing the data without actually sharing it directly yourself. I may not be giving you a copy of the data, but I am giving you some of the means by which you can obtain a copy of the data for yourself.
As well as getting round licensing requirements that limit sharing of a dataset but allow downloading of it for personal use, this approach can also be handy in other situations.
For example, where a dataset is available from a particular URL but authentication is required to access it (this often needs a few more tweaks when trying to write the reusable downloader! A stop-gap is to provide the URL in reproducible report document and explicitly instruct the reader to download the dataset locally using their own credentials, then load it in from the local copy).
Or as Paul Bivand pointed out via Twitter, in situations “where data is secure like pupil database, so replication needs independent ethical clearance”. In a similar vein, we might add where data is commercial, and replication may be forbidden, or where additional costs may be incurred. And where the data includes personally identifiable information, such as data published under a DPA exemption as part of a public register, it may be easier all round not to publish your own copy or copies of data from such a register.
Sharing recipes also means you can share pathways to the inclusion of derived datasets, such as named entity tags extracted from a text using free, but non-shareable, (or at least, attributable) license key restricted services, such as the named entity extraction services operated by Thomson Reuters OpenCalais, Microsoft Cognitive Services, IBM Alchemy or Associated Press. That is, rather than tagging your dataset and then sharing and analysing the tagged data, publish a recipe that will allow a third party to tag the original dataset themselves and then analyse it.