Like all governments, the UK government is responsible for producing reports of official statistics on an ongoing basis. That process has traditionally been a highly manual one: extract data from government systems, load it into a mainframe statistical analysis tool and run models and forecasts, extract the results to a spreadsheet to prepare data for presentation, and ultimately combine it all in a manual document editing tool to produce the final report. The process in the UK looks much like this today:
Matt Upson, a Data Scientist at the UK Government Digital Service, is looking to modernize this process with a reproducible analytical pipeline. This new process, based on the UK Government's Technology Service Manual for new IT deployments, aims to simplify the process by using R — the open-source programming language for statistical analysis — to automate the data extraction, analysis, and document generation tasks.
The development of the new process is underway now, with the first target being a report on culture and sporting impacts on the UK economy:
"Towards the end of 2016 we embarked on a project with a team in the Department for Culture, Media, and Sport (DCMS) who are responsible for the production of the Economic Estimates for DCMS Sectors Statistical First Release (SFR). Currently this publication is produced with a mix of manual and semi-manual processes. Our aim was to see if we could speed up production of the SFR, whilst maintaining the high standard of the publication and QA."
Central to the process is automating the report with R Markdown, a system for programmatically laying out a document while encapsulating all of the R code for data preparation, analysis, tabulation and charting in a single document that collaborators can share. As a comprehensive language, R can handle all of these tasks natively, without the need to bring other systems into the chain. It is also an ideal tool for the statistical data analysis required for these reports, which must follow the guidance of the Aqua Book, the UK government's manual for producing quality analysis. (The Aqua Book is well worth reading for any data scientist, with excellent guidance on designing studies in the face of uncertain data.)
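To make the idea concrete, here is a minimal sketch of a report in which the analysis code and the narrative live in one source file. Everything in it is an assumption for illustration: the data frame `sector_gva`, its values, the helper `summarise_by_year()` and the report title are hypothetical and are not taken from the DCMS project. The document source is assembled as a character vector only so that the example fits in a single script.

```r
# Hypothetical data standing in for an extract from a government system.
# The values are made up purely for illustration.
sector_gva <- data.frame(
  sector = c("Creative", "Creative", "Sport", "Sport"),
  year   = c(2014, 2015, 2014, 2015),
  gva    = c(10, 12, 5, 6)
)

# Analysis step: total the (made-up) GVA figures across sectors for each year.
summarise_by_year <- function(df) {
  aggregate(gva ~ year, data = df, FUN = sum)
}

totals <- summarise_by_year(sector_gva)

# Build a small R Markdown source. The chunk fence is generated with strrep()
# so that this script contains no literal fence lines of its own.
fence <- strrep("`", 3)
rmd_lines <- c(
  "---",
  "title: \"Illustrative statistical release\"",
  "output: html_document",
  "---",
  "",
  "Total GVA in the latest year was `r tail(totals$gva, 1)`.",
  "",
  paste0(fence, "{r totals-table, echo=FALSE}"),
  "knitr::kable(totals)",
  fence
)

rmd_path <- tempfile(fileext = ".Rmd")
writeLines(rmd_lines, rmd_path)

# Render only when rmarkdown and pandoc are available on this machine.
if (requireNamespace("rmarkdown", quietly = TRUE) &&
    rmarkdown::pandoc_available()) {
  rmarkdown::render(rmd_path, quiet = TRUE)
}
```

Because the figure in the sentence and the table are both computed from `totals` at render time, re-running the script against a new extract updates the text and the table together, with no copying between tools.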
Naturally, the government has specific standards for the presentation of data, and the R-based process needs to be able to reproduce that look in the final report. The GDS team has produced a package for R, called govstyle, which standardizes the look of data charts while upgrading them to more modern design principles. For example, this chart from a report on truancy in UK schools:
has been upgraded and reproduced in R to look like this:
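As a rough illustration of how a shared chart style can be applied in R, the sketch below defines a reusable ggplot2 theme and adds it to a plot. It does not use govstyle itself: the function `theme_report()` is a hypothetical stand-in written for this example, and the `absence` data frame contains made-up numbers rather than figures from the truancy report.

```r
library(ggplot2)

# Hypothetical house style: a plain theme with most grid lines removed.
# This stands in for the kind of theme a styling package would provide.
theme_report <- function(base_size = 12) {
  theme_minimal(base_size = base_size) +
    theme(
      panel.grid.minor   = element_blank(),
      panel.grid.major.x = element_blank(),
      axis.line.x        = element_line(colour = "black"),
      legend.position    = "bottom"
    )
}

# Made-up data, for illustration only.
absence <- data.frame(
  year        = rep(2010:2013, times = 2),
  school_type = rep(c("Primary", "Secondary"), each = 4),
  rate        = c(1.0, 0.9, 0.8, 0.8, 2.1, 1.9, 1.7, 1.6)
)

# Every chart in a report can add the same theme to get a consistent look.
p <- ggplot(absence, aes(x = year, y = rate, colour = school_type)) +
  geom_line() +
  geom_point() +
  labs(x = "Year", y = "Rate (%)", colour = NULL,
       title = "Illustrative absence rate by school type") +
  theme_report()
```

Keeping the style in one function means a change to the house style is made once and picked up by every chart the next time the report is rendered.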
An automated workflow allows for the inclusion of modern DevOps processes, too. Dependency management for R packages is handled with packrat. Test-driven development comes courtesy of the testthat package, and automated testing (including data verification and code coverage analysis) is provided by Travis CI. Source code control and collaboration are provided by GitHub, which is also where the entire process is documented as an R package: the eesectors package.
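A minimal sketch of what data verification with testthat can look like follows. The function `check_extract()` and the column names it expects are hypothetical, invented for this example; they are not the checks implemented in the eesectors package.

```r
library(testthat)

# Hypothetical validation step: fail early if an extract is malformed.
check_extract <- function(df) {
  stopifnot(is.data.frame(df))
  required <- c("sector", "year", "gva")
  missing_cols <- setdiff(required, names(df))
  if (length(missing_cols) > 0) {
    stop("Missing columns: ", paste(missing_cols, collapse = ", "))
  }
  if (anyNA(df$gva)) {
    stop("gva contains missing values")
  }
  if (any(df$gva < 0)) {
    stop("gva contains negative values")
  }
  invisible(df)
}

test_that("a well-formed extract passes through unchanged", {
  good <- data.frame(sector = "Sport", year = 2015, gva = 6)
  expect_identical(check_extract(good), good)
})

test_that("malformed extracts are rejected with a clear message", {
  expect_error(
    check_extract(data.frame(sector = "Sport", year = 2015)),
    "Missing columns"
  )
  expect_error(
    check_extract(data.frame(sector = "Sport", year = 2015, gva = NA_real_)),
    "missing values"
  )
})
```

In a package layout, tests like these would typically live under `tests/testthat/` so that a continuous integration service can run them on every commit.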
For more on the UK government's use of R for official statistics reporting, see the post below from the UK Government Digital Service.
Data at GDS: Reproducible Analytical Pipeline