Test-Driven Analysis?
At the last LondonR meeting, Francine Bennett from Mastodon C shared some of her experience and findings from an analysis of a large prescriptions data set from the UK's National Health Service (NHS). However, it was her last slide that I found the most thought-provoking. It asked for the definition of the following term: "Test-driven analysis?" Francine explained that test-driven development (TDD) is a concept often used in software development for quality assurance, and she wondered whether a similar approach could also be used for data analysis. Unfortunately, the audience couldn't provide her with an answer, but many expressed that they face similar challenges. So do I.
Indeed, how do I go about test-driven analysis? How do I know that I haven't made a mistake when I start an analysis of a new data set? Well, I don't. But I try to mitigate the risks. Similar to TDD, I consider which outputs I should expect from my analysis. Those outputs form the test scenarios of my analysis. Basically, I try to write down everything I know before I start working with the data, e.g.
- any other data sets or reports I can use for cross-referencing,
- any back-of-the-envelope analysis I can carry out to provide ballpark answers,
- any relativities and ratios which should hold true,
- any known boundaries and thresholds,
- test scenarios for my code with small, well-known data for which I know the outcome (see the sketch after this list),
- names of experts who could sense-check and peer review my output.
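To make this concrete, here is a minimal sketch in base R of what such checks might look like. The data frame `rx`, its columns `items` and `act_cost`, and the thresholds are made up for illustration; they are not the real NHS field names.

```r
# Hypothetical sanity checks for a prescriptions data set;
# column names and thresholds are illustrative only.
check_prescriptions <- function(rx) {
  # structural expectation: the columns I rely on exist
  stopifnot(c("items", "act_cost") %in% names(rx))
  # known boundaries: counts and costs cannot be negative
  stopifnot(rx$items >= 0, rx$act_cost >= 0)
  # back-of-the-envelope check: average cost per item in a plausible range
  avg_cost <- sum(rx$act_cost) / sum(rx$items)
  stopifnot(avg_cost > 1, avg_cost < 100)
  invisible(TRUE)
}

# a small, well-known test case for which I know the outcome
toy <- data.frame(items = c(2, 5), act_cost = c(10, 40))
check_prescriptions(toy)
```

Failing any of these checks doesn't prove the analysis is wrong, but it tells me where to look first.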
3 comments:
Hadley's assertthat package may be very useful to get started implementing some of these ideas.
https://github.com/hadley/assertthat
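For instance, one of the boundary checks from the list might look something like this (a rough sketch; `rx` and its columns are toy examples, not the real data):

```r
library(assertthat)  # install.packages("assertthat")

# toy stand-in for the real prescriptions data
rx <- data.frame(items = c(2, 5), act_cost = c(10, 40))

assert_that(has_name(rx, "act_cost"))  # the column I rely on exists
assert_that(noNA(rx$act_cost))         # no missing costs
assert_that(all(rx$act_cost >= 0))     # costs cannot be negative
```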
Hi Erik, thanks for the link. I wasn't aware of that package. Christopher Gandrud published a post today pointing out that everyone makes coding mistakes and that finding them should be easy: http://christophergandrud.blogspot.co.uk/2013/04/reinhart-rogoff-everyone-makes-coding.html
For completeness, there's another Hadley package: https://github.com/hadley/devtools/wiki/Testing