[This article was first published on What You're Doing Is Rather Desperate » R, and kindly contributed to R-bloggers].

Sometimes, several strands of thought come together in one place. For me right now, it’s the Wikipedia page “Ebola virus epidemic in West Africa”, which got me thinking about the perennial topic of “data wrangling”, how best to provide public data and why I can’t shake my irritation with the term “data science”. Not to mention Ebola, of course.

I imagine that a lot of people with an interest in biological data are following this story and thinking “how can I visualise the numbers for myself?” Maybe you’d like to reproduce the plots in the Timeline section of that Wikipedia entry. Surprise: the raw numbers are not that easy to obtain.

Ebola cases and deaths by country and by date, from Wikipedia

The Wikipedia page includes a data table, which is a starting point. It’s not especially well-designed (the image shows the headers and a few rows) and the notes underneath suggest that a large amount of manual intervention was required to obtain the numbers.

The last column contains hyperlinked references. Now we see why so much manual work was required. The citations link out to two main types of information:

1. Paragraphs of free text with the numbers buried somewhere in them, like this example
2. Infographic-style reports in PDF format, like this example
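
To give a feel for the first case, here is a minimal sketch of pulling numbers out of a sentence with a regular expression. The sentence and its figures are invented for illustration, not quoted from any report, and the pattern assumes commas are used only as thousands separators.

```r
# an invented sentence in the style of a situation report;
# the wording and numbers are hypothetical
txt <- "A total of 1,234 cases, including 567 deaths, had been reported from 3 countries."

# find every run of digits, allowing commas inside the run
hits <- regmatches(txt, gregexpr("[0-9][0-9,]*", txt))[[1]]

# strip the commas and convert to numeric
nums <- as.numeric(gsub(",", "", hits))
nums
```

The pattern finds the numbers but says nothing about which one is cases, which is deaths and which is something else entirely. That labelling step is where the manual work comes in.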

That’s more wrangling than I have time for just now; OK, so the Wikipedia table it is. Still a little more “wrangling” to get the data out of that HTML table.

library(XML)
library(ggplot2)
library(reshape2)

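# get all tables on the page
# (URL assumed from the page title mentioned above)
ebola <- readHTMLTable("http://en.wikipedia.org/wiki/Ebola_virus_epidemic_in_West_Africa",
                       stringsAsFactors = FALSE)
# thankfully our table has a name; it is table #5
# this is not something you can really automate
head(names(ebola))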
# [1] "Ebola virus epidemic in West Africa"
# [2] "Nigeria Ebola areas-2014"
# [3] "Treatment facilities in West Africa"
# [4] "Democratic Republic of Congo-2014"
# [5] "Ebola cases and deaths by country and by date"
# [6] "NULL"

ebola <- ebola$`Ebola cases and deaths by country and by date`

# again, manual examination reveals that we want rows 2-71 and columns 1-3
ebola.new <- ebola[2:71, 1:3]
colnames(ebola.new) <- c("date", "cases", "deaths")

# need to fix up a couple of cases that contain text other than the numbers
ebola.new$cases[27]  <- "759"
ebola.new$deaths[27] <- "467"

# get rid of the commas; convert to numeric
ebola.new$cases  <- gsub(",", "", ebola.new$cases)
ebola.new$cases  <- as.numeric(ebola.new$cases)
ebola.new$deaths <- gsub(",", "", ebola.new$deaths)
ebola.new$deaths <- as.numeric(ebola.new$deaths)

# the days in the dates are encoded 1-31, hence %e in the date format below
# are we there yet? quick and dirty attempt to reproduce Wikipedia plot
# melt to long format, keeping date as the id column
ebola.m <- melt(ebola.new, id.vars = "date")
ggplot(ebola.m, aes(as.Date(date, "%e %b %Y"), value)) +
  geom_point(aes(color = variable)) +
  coord_trans(y = "log10") +
  xlab("Date") +
  labs(title = "Cumulative totals log scale") +
  theme_bw()


Result: a plot of cumulative cases and deaths by date, with the y-axis on a log scale.
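
The comma-stripping and numeric conversion in the code above are repeated for each column, so they could be folded into a small helper. What follows is only a sketch: the function name `to_count` and the sample strings are hypothetical, and it assumes the count is the first run of digits in each cell, which may not hold for every row of the real table.

```r
# Hypothetical helper: take the first run of digits (commas allowed)
# from each string and convert it to a number; NA where none is found.
to_count <- function(x) {
  # regexpr gives the position of the first match, or -1 for no match
  m <- regexpr("[0-9][0-9,]*", x)
  hit <- !is.na(m) & m > -1
  out <- rep(NA_character_, length(x))
  # regmatches returns one substring per matched element
  out[hit] <- regmatches(x, m)
  as.numeric(gsub(",", "", out))
}

# invented sample values, not taken from the Wikipedia table
raw <- c("1,603", "759 (approx.)", "n/a")
to_count(raw)
```

A cell that starts with a bracketed footnote number would fool this, so a quick look at the rows that come back as NA or look out of sequence is still worthwhile.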

We can complain: if only the WHO, CDC and other organisations provided data as a web service. Or even as files in CSV format. Anything but PDF. But right now at least, they do not. So hats off to the heroic efforts of the Wikipedian so-called “data janitors”. From that article:

“Data wrangling is a huge — and surprisingly so — part of the job,” said Monica Rogati, vice president for data science at Jawbone

Surprising? Not to scientists (who don’t qualify the profession with the redundant word “data”). “Key hurdle to insights”, says the article title? Not really – just part and parcel of the job. I’d even argue that effective wrangling is where most of the skills are required. So perhaps, think twice before belittling people’s extensive skill sets with terms like “janitor”. You might need them to wrangle your data some day.
