How Might Data Journalists Show Their Working? Sweave

Posted on November 1, 2011 by Tony Hirst in R bloggers | 0 Comments

[This article was first published on OUseful.Info, the blog... » Rstats, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

If part of the role of data journalism is to make transparent the justification behind claims that are, or aren’t, backed up by data, there’s good reason to suppose that the journalists should be able to back up their own data-based claims with evidence about how they made use of the data. Posting links to raw data helps to a certain extent – at least third parties can then explore the data themselves and check the claims the press are making – but you could also argue that the journalists should also make their notes available regarding how they worked the data. (The same is true in public reports, where summary statistics and charts are included in a report, along with a link to the raw data, but no transparency in how the summary reports/charts were actually produced from the data.)

In Power Tools for Aspiring Data Journalists: R, I explored how we might use the R statistical programming language to replicate a chart that appeared in one of Ben Goldacre’s Bad Science columns. I included code snippets in the post, along with the figures they generated. But is there a way of getting even closer to the source, as it were, and produce documents that essentially generate their output from some sort of “source code”?

For example, take this view of my working relating to the production of the funnel chart described in Goldacre’s column:

You can find the actual “source code” for that document here: bowel cancer funnel plot working notes If you load it into something like RStudio, you can “run” the code and generate your own PDF from it.

The “source” of the document includes both text and R code. When the Sweave document is processed, the R code contained within the document is executed and the results also included in the document. The charts shown in the report are generated directly from the code included in the document, using data pulled in to the document form a source referenced within the document. If the source data is changed, or the R code is changed, what’s contained in the output document will change as well.

This sort of workflow will be familiar to many experimental scientists, but I wonder: is it something that data journalists have considered, at least as a way of keeping working notes about data related projects they are working on?

PS as well as Sweave, see dexy.it, which generalises the Sweave approach to allow you to create self-documenting software/code. Educators, also take note…;-)