In 1854 there was a massive cholera outbreak in Soho, London: in three days over 120 people died from the disease. Famously, John Snow plotted the locations of the deaths on a map and found that they clustered around a pump in Broad Street. He suggested that the pump be taken out of service, which helped to end the epidemic. The episode then helped him formulate his theory that cholera spreads through dirty water.
This analysis is famous as it is often considered to be:
- The first epidemiological analysis of disease – trying to understand the spread of cases by factors in the environment
- The first geographical analysis of disease data – plotting points on a map and looking for relationships
So, that’s what I did – and the data is available to download as SnowGIS.zip. All of the data is provided in TIFF (with TFW) or SHP formats, ready for loading into ArcGIS, QGIS, R, or anything else. There is a README in the zip file, but read on below for more details on what’s included.
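If you want to use the TIFFs without a full GIS package, the accompanying TFW (world file) is what ties pixels to map coordinates. It is a six-line text file holding the coefficients of an affine transform, in the order A, D, B, E, C, F. The sketch below parses one and converts a pixel position to map coordinates. The numbers in the example are made up for illustration and are not taken from SnowGIS.zip, and the function names are my own.

```python
from typing import NamedTuple, Tuple


class WorldFile(NamedTuple):
    """The six world-file coefficients, in the order they appear in the file."""
    a: float  # x-scale: pixel width in map units
    d: float  # rotation/skew term contributing to y
    b: float  # rotation/skew term contributing to x
    e: float  # y-scale: pixel height, normally negative for north-up images
    c: float  # map x of the centre of the upper-left pixel
    f: float  # map y of the centre of the upper-left pixel


def parse_world_file(text: str) -> WorldFile:
    """Parse the contents of a .tfw file (six numbers, one per line)."""
    values = [float(token) for token in text.split()]
    if len(values) != 6:
        raise ValueError(f"expected 6 values in a world file, got {len(values)}")
    return WorldFile(*values)


def pixel_to_map(wf: WorldFile, col: float, row: float) -> Tuple[float, float]:
    """Apply the affine transform to a (column, row) pixel position."""
    x = wf.a * col + wf.b * row + wf.c
    y = wf.d * col + wf.e * row + wf.f
    return x, y


if __name__ == "__main__":
    # Hypothetical world file: 0.5 m pixels, no rotation, invented origin.
    example = "0.5\n0.0\n0.0\n-0.5\n529000.0\n181500.0\n"
    wf = parse_world_file(example)
    print(pixel_to_map(wf, 10, 20))
```

For the real data you would read the text from the TFW file that sits next to each TIFF; the resulting coordinates should then be in the same system as the shapefiles.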
To create the data I took a copy of Snow’s original map, georeferenced it to the Ordnance Survey National Grid, warped it to fit correctly, and then digitised the locations of the deaths and the pumps. This allows the data that Snow collected to be overlaid on a modern OS map:
The pumps are shown in blue, and the size of the red circles indicates the number of deaths at that location. Of course, the data can be overlaid on the original map created by Snow (so you can check I digitised it properly!):
- How about performing some sort of statistical cluster analysis on the deaths data? Does it identify the correct pump as the source?
- What if the data were only provided in aggregated form? Lots of healthcare data is provided in that way today because of privacy concerns – but if the data had been provided aggregated to (for example) census output areas or a standard grid, would the right pump have been identified?
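As a starting point for both questions, here is a minimal sketch. It is not a formal cluster analysis: it simply assigns each death location to its nearest pump by straight-line distance and totals the deaths per pump, then repeats the exercise after aggregating the deaths to a square grid and using only the cell centres. The coordinates are invented for illustration (they are not the SnowGIS data), and the pump names and function names are my own.

```python
from collections import Counter
from math import floor, hypot

# Invented coordinates in metres, NOT taken from SnowGIS.
PUMPS = {"A": (0.0, 0.0), "B": (200.0, 0.0), "C": (0.0, 200.0)}
DEATHS = [  # (x, y, number of deaths at that location)
    (10.0, 5.0, 3),
    (-20.0, 15.0, 2),
    (30.0, -10.0, 4),
    (90.0, 10.0, 1),
    (110.0, 0.0, 1),
    (10.0, 120.0, 1),
]


def nearest_pump(x, y, pumps):
    """Return the name of the pump closest to (x, y) in a straight line."""
    return min(pumps, key=lambda name: hypot(x - pumps[name][0], y - pumps[name][1]))


def deaths_per_pump(deaths, pumps):
    """Total the deaths whose location is nearest to each pump."""
    totals = Counter()
    for x, y, count in deaths:
        totals[nearest_pump(x, y, pumps)] += count
    return totals


def aggregate_to_grid(deaths, cell_size):
    """Sum deaths per square grid cell and report each cell by its centre."""
    cells = Counter()
    for x, y, count in deaths:
        cells[(floor(x / cell_size), floor(y / cell_size))] += count
    return [
        ((i + 0.5) * cell_size, (j + 0.5) * cell_size, count)
        for (i, j), count in cells.items()
    ]


if __name__ == "__main__":
    print("point data:", deaths_per_pump(DEATHS, PUMPS))
    for size in (100.0, 400.0):
        gridded = aggregate_to_grid(DEATHS, size)
        print(f"{size:.0f} m grid:", deaths_per_pump(gridded, PUMPS))
```

With these toy numbers, the point data and the 100 m grid both attribute most deaths to pump A, while the 400 m grid no longer does, because the cell centres fall closer to other pumps. Whether the real data behaves the same way is exactly the open question. Straight-line distance is also a simplification, since people walked along streets to reach a pump.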