At the last LondonR meeting Francine Bennett from Mastodon C shared some of her experience and findings from an analysis of a large prescriptions data set from the UK’s National Health Service (NHS). However, it was her last slide that I found the most thought-provoking. It asked for a definition of the term "test driven analysis".
Francine explained that test driven development (TDD) is a concept often used in software development for quality assurance, and she wondered if a similar approach could also be used for data analysis. Unfortunately the audience couldn’t provide her with an answer, but many said that they face similar challenges. So do I.
Indeed, how do I go about test driven analysis? How do I know that I haven’t made a mistake when I start an analysis of a new data set? Well, I don’t. But I try to mitigate the risks. Similar to TDD, I consider which outputs I should expect from my analysis. Those expected outputs form the test scenarios of my analysis. Basically, I try to write down everything I know before I start working with the data, for example:
- any other data sets or reports I can use for cross referencing,
- any back-of-the-envelope analysis I can carry out to provide ballpark answers,
- any relativities and ratios which should hold true,
- any known boundaries and thresholds,
- test scenarios for my code with small well known data, for which I know the outcome,
- names of experts, who could sense check and peer review my output.
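
As a rough illustration of how some of these notes could be turned into executable checks, here is a minimal Python sketch. Everything in it is an assumption made for the example: the record fields `items` and `cost`, the function names `mean_cost_per_item` and `check_expectations`, and the numbers are hypothetical and do not come from Francine's analysis. It covers three items from the list: known boundaries, a relativity that should hold, and a ballpark figure written down in advance, all exercised on a small data set with a known outcome.

```python
def mean_cost_per_item(records):
    """Hypothetical analysis step: total cost divided by total items."""
    total_cost = sum(r["cost"] for r in records)
    total_items = sum(r["items"] for r in records)
    return total_cost / total_items


def check_expectations(records, result, ballpark, tolerance=0.5):
    """Return the list of violated expectations (empty means all passed)."""
    problems = []

    # Known boundaries: item counts must be positive, costs non-negative.
    if any(r["items"] <= 0 or r["cost"] < 0 for r in records):
        problems.append("boundary")
        return problems  # the remaining checks assume valid records

    # Relativity that should hold: the overall mean lies between the
    # smallest and the largest per-record mean.
    per_record = [r["cost"] / r["items"] for r in records]
    eps = 1e-9  # small slack for floating point rounding
    if not (min(per_record) - eps <= result <= max(per_record) + eps):
        problems.append("ratio")

    # Ballpark: the result should be within `tolerance` (as a fraction)
    # of the back-of-the-envelope estimate noted before the analysis.
    if abs(result - ballpark) > tolerance * ballpark:
        problems.append("ballpark")

    return problems


# Small, well known data for which the outcome can be worked out by hand:
# (10 + 30) / (2 + 3) = 8.
small = [{"items": 2, "cost": 10.0}, {"items": 3, "cost": 30.0}]
result = mean_cost_per_item(small)
problems = check_expectations(small, result, ballpark=7.0)
```

The point of the sketch is the order of work rather than the particular checks: the expectations are written down first, and the analysis only counts as plausible once `problems` comes back empty.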
But most importantly, I try to think long and hard about which questions I want to answer, following the advice of John Tukey: "Far better an approximate answer to the right question, which is often vague, than an exact answer to the wrong question, which can always be made precise."