John Snow’s famous cholera analysis data in modern GIS formats

January 6, 2012

(This article was first published on Robin's Blog » R, and kindly contributed to R-bloggers)

In 1854 there was a massive cholera outbreak in Soho, London: in three days, over 120 people died from the disease. Famously, John Snow plotted the locations of the deaths on a map and found that they clustered around a pump in Broad Street. He suggested that the pump be taken out of service, which helped to end the epidemic, and the finding helped him formulate his theory that cholera spreads through contaminated water.

This analysis is famous as it is often considered to be:

  • The first epidemiological analysis of disease: trying to understand the spread of cases through factors in the environment
  • The first geographical analysis of disease data: plotting points on a map and looking for relationships

Snow’s work is often used as a case study in courses on GIS and the geographies of health. So I thought: why not convert Snow’s data into a format that works with modern GIS software, so that students (and others, of course) can analyse the data themselves with all the capabilities of modern tools?

So, that’s what I did, and the data is available to download as a zip file. All of the data is provided in TIFF (with TFW) or SHP formats, ready for loading into ArcGIS, QGIS, R, or anything else. There is a README in the zip file, but read on below for more details on what’s included.
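For anyone unfamiliar with the TFW half of that pairing: a world file is a six-line plain-text sidecar that tells GIS software how pixel positions in the TIFF map to real-world coordinates. The sketch below shows that affine mapping in plain Python. The coefficient values are invented for illustration and are not taken from the files in the download, and `parse_world_file` and `pixel_to_map` are hypothetical helper names.

```python
from typing import NamedTuple, Tuple


class WorldFile(NamedTuple):
    """Six affine coefficients, in the order they appear in a world file."""
    a: float  # line 1: pixel size in x (map units per pixel)
    d: float  # line 2: rotation term (0 for a north-up image)
    b: float  # line 3: rotation term (0 for a north-up image)
    e: float  # line 4: pixel size in y (negative for a north-up image)
    c: float  # line 5: x coordinate of the centre of the upper-left pixel
    f: float  # line 6: y coordinate of the centre of the upper-left pixel


def parse_world_file(text: str) -> WorldFile:
    """Parse the contents of a world file (one number per line)."""
    values = [float(token) for token in text.split()]
    if len(values) != 6:
        raise ValueError(f"expected 6 coefficients, got {len(values)}")
    return WorldFile(*values)


def pixel_to_map(wf: WorldFile, col: float, row: float) -> Tuple[float, float]:
    """Convert a pixel position (column, row) to map coordinates (x, y)."""
    x = wf.a * col + wf.b * row + wf.c
    y = wf.d * col + wf.e * row + wf.f
    return x, y


# Invented example values: 0.5 m pixels, north-up, arbitrary origin.
EXAMPLE_TFW = """0.5
0.0
0.0
-0.5
529000.25
181399.75
"""

if __name__ == "__main__":
    wf = parse_world_file(EXAMPLE_TFW)
    print(pixel_to_map(wf, 0, 0))      # centre of the upper-left pixel
    print(pixel_to_map(wf, 100, 200))  # 100 columns right, 200 rows down
```

In practice GIS packages apply this transform for you when they load the TIFF, so the sketch is only meant to show what the extra file contains.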

To create the data I took a copy of Snow’s original map, georeferenced it to the Ordnance Survey National Grid, warped it to fit correctly, and then digitised the locations of the deaths and the pumps. This allows the data that Snow collected to be overlaid on a modern OS map:

The pumps are shown in blue, and the size of the red circles indicates the number of deaths at that location. Of course, the data can be overlaid on the original map created by Snow (so you can check I digitised it properly!):

So, that’s basically the data that’s included in the zip file (plus a greyscale version of the OS map to make for easier visualisation in certain circumstances). The question then is: what can you do with it? I’d be very interested to see what you do, but here are a few ideas:

  • How about performing some sort of statistical cluster analysis on the deaths data? Does it identify the correct pump as the source?
  • What if the data were only provided in aggregated form? Lots of healthcare data is provided in that way today because of privacy concerns – but if the data had been provided aggregated to (for example) census output areas or a standard grid, would the right pump have been identified?
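
As a possible starting point for both ideas, here is a minimal Python sketch. It is not a formal statistical cluster analysis. It swaps in a much simpler technique: each death location is assigned to its nearest pump by straight-line distance and the deaths are totalled per pump, first on point data and then on the same data aggregated to a square grid. The coordinates, counts and pump names are toy values invented for illustration (they are not the digitised data), and the function names are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple

Point = Tuple[float, float]
Death = Tuple[float, float, int]  # x, y, number of deaths at that location

# Toy data in metres, invented for illustration only (not Snow's data).
PUMPS: Dict[str, Point] = {"A": (0.0, 0.0), "B": (200.0, 0.0), "C": (0.0, 200.0)}
DEATHS: List[Death] = [
    (10.0, 5.0, 3), (-20.0, 15.0, 2), (30.0, -10.0, 4),
    (180.0, 20.0, 1), (90.0, 10.0, 2), (20.0, 190.0, 1),
]


def nearest_pump(x: float, y: float, pumps: Dict[str, Point]) -> str:
    """Return the name of the pump closest to (x, y) in a straight line."""
    return min(
        pumps,
        key=lambda name: math.hypot(x - pumps[name][0], y - pumps[name][1]),
    )


def deaths_by_nearest_pump(deaths: List[Death], pumps: Dict[str, Point]) -> Counter:
    """Total the deaths assigned to each pump by nearest-pump allocation."""
    tally: Counter = Counter()
    for x, y, count in deaths:
        tally[nearest_pump(x, y, pumps)] += count
    return tally


def aggregate_to_grid(deaths: List[Death], cell_size: float) -> List[Death]:
    """Sum deaths per square grid cell and report them at the cell centres."""
    cells: Counter = Counter()
    for x, y, count in deaths:
        key = (math.floor(x / cell_size), math.floor(y / cell_size))
        cells[key] += count
    return [
        ((i + 0.5) * cell_size, (j + 0.5) * cell_size, n)
        for (i, j), n in cells.items()
    ]


if __name__ == "__main__":
    print("point data:", dict(deaths_by_nearest_pump(DEATHS, PUMPS)))
    gridded = aggregate_to_grid(DEATHS, cell_size=80.0)
    print("80 m grid: ", dict(deaths_by_nearest_pump(gridded, PUMPS)))
```

With these toy values, one location that is closest to pump A on the point data is reassigned to pump B once it is snapped to the centre of an 80 m cell, which is the kind of effect the aggregation question is about. Straight-line distance also ignores the street layout, so a walking-distance measure would be a natural refinement.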

So, have fun, and please let me know in the comments what you’ve done with the data (particularly if you do any useful analyses or use it in teaching).
