One of the best resources during the 2013-16 West African Ebola outbreak was Caitlin Rivers‘ Github repo, which was probably one of the best ways to stay up to date on the numbers. For the current outbreak, she has also set up a Github repo, with really frequent updates straight from the WHO’s DON data and the information from DRC Ministry of Public Health (MdlS) mailing list.1 Using R, I have set up a simple script that I only have to run every time I want a pre-defined visualisation of the current situation. I am usually doing this on a remote RStudio server, which makes matters quite easy for me to quickly generate data on the fly from RStudio.
Obtaining the most recent data
Using the following little script, I grab the latest from the
ebola-drc Github repo:
# Fetch most recent DRC data. library(magrittr) library(curl) library(readr) library(dplyr) current_drc_data <- Sys.time() %>% format("%d%H%M%S%b%Y") %>% paste("raw_data/drc/", "drc-", ., ".csv", sep = "") %T>% curl_fetch_disk("https://raw.githubusercontent.com/cmrivers/ebola_drc/master/drc/data.csv", .) %>% read_csv()
curl (the R implementation) to fetch the most recent data and save it as a timestamped2 file in the data folder I set up just for that purpose.3 Simply sourcing this script (
source("fetch_drc_data.R")) should then load the current DRC dataset into the environment.4
We need to do a little data munging. First, we melt down the data frame using
melt function. Melting takes ‘wide’ data and converumnts it into ‘long’ data – for example, in our case, the original data had a row for each daily report for each health zone, and a column for the various combinations of
cases/deaths. Melting the data frame down creates a variable type column (say,
confirmed_deaths and a value column (giving the value, e.g.
lubridate,5 the dates are parsed, and the values are stored in a numeric format.
library(magrittr) library(reshape2) library(lubridate) current_drc_data %<>% melt(value_name = "value", measure.vars = c("confirmed_cases", "confirmed_deaths", "probable_cases", "probable_deaths", "suspect_cases", "suspect_deaths", "ruled_out")) current_drc_data$event_date <- lubridate::ymd(current_drc_data$event_date) current_drc_data$report_date <- lubridate::ymd(current_drc_data$report_date) current_drc_data$value <- as.numeric(current_drc_data$value)
Next, we drop
ruled_out cases, as they play no significant role for the current visualisation.
current_drc_data <- current_drc_data[current_drc_data$variable != "ruled_out",]
We also need to split the type labels into two different columns, so as to allow plotting them as a matrix. Currently, data type labels (the
variable column) has both the certainty status (
probable) and the type of indicator (
cases vs deaths) in a single variable, separated by an underscore. We’ll use
str_split_fixed to split the variable names by underscore, and join them into two separate columns,
mm, the latter denoting mortality/morbidity status.
current_drc_data %<>% cbind(., str_split_fixed(use_series(., variable), "_", 2)) %>% subset(select = -c(variable)) %>% set_colnames(c("event_date", "report_date", "health_zone", "value", "suspicion", "mm"))
Let’s filter out the health zones that are being observed but have no relevant data for us yet:
relevant_health_zones <- current_drc_data %>% subset(select = c("health_zone", "value")) %>% group_by(health_zone) %>% summarise(totals = sum(value, na.rm=TRUE)) %>% dplyr::filter(totals > 0) %>% use_series(health_zone)
This gives us a vector of all health zones that are currently reporting cases. We can filter our DRC data for that:
current_drc_data %<>% dplyr::filter(health_zone %in% relevant_health_zones)
This whittles down our table by a few rows. Finally, we might want to create a fake health zone that summarises all other health zones’ respective data:
totals <- current_drc_data %>% group_by(event_date, report_date, suspicion, mm) %>% summarise(value = sum(value), health_zone=as.factor("DRC total")) # Reorder totals to match the core dataset totals <- totals[,c(1,2,6,5,3,4)]
Finally, we bind these together to a single data frame:
current_drc_data %<>% rbind.data.frame(totals)
Of course, all this was in pursuance of cranking out a nice visualisation. For this, we need to do a couple of things, including first ensuring that “DRC total” is treated separately and comes last:
regions <- current_drc_data %>% use_series(health_zone) %>% unique() regions[!regions == "DRC total"] regions %<>% c("DRC total") current_drc_data$health_zone_f <- factor(current_drc_data$health_zone, levels = regions)
I normally start out by declaring the colour scheme I will be using. In general, I tend to use the same few colour schemes, which I keep in a few gists. For simple plots, I prefer to use no more than five colours:
colour_scheme <- c(white = rgb(238, 238, 238, maxColorValue = 255), light_primary = rgb(236, 231, 216, maxColorValue = 255), dark_primary = rgb(127, 112, 114, maxColorValue = 255), accent_red = rgb(240, 97, 114, maxColorValue = 255), accent_blue = rgb(69, 82, 98, maxColorValue = 255))
With that sorted, I can invoke the
ggplot method, storing the plot in an object,
p. This is so as to facilitate later retrieval by the
p <- ggplot(current_drc_data, aes(x=event_date, y=value)) + # Title and subtitle ggtitle(paste("Daily EBOV status", "DRC", Sys.Date(), sep=" - ")) + labs(subtitle = "(c) Chris von Csefalvay/CBRD (cbrd.co) - @chrisvcsefalvay") + # This facets the plot based on the factor vector we created ear facet_grid(health_zone_f ~ suspicion) + geom_path(aes(group = mm, colour = mm, alpha = mm), na.rm = TRUE) + geom_point(aes(colour = mm, alpha = mm)) + # Axis labels ylab("Cases") + xlab("Date") + # The x-axis is between the first notified case and the last xlim(c("2018-05-08", Sys.Date())) + scale_x_date(date_breaks = "7 days", date_labels = "%m/%d") + # Because often there's an overlap and cases that die on the day of registration # tend to count here as well, some opacity is useful. scale_alpha_manual(values = c("cases" = 0.5, "deaths" = 0.8)) + scale_colour_manual(values = c("cases" = colour_scheme[["accent_blue"]], "deaths" = colour_scheme[["accent_red"]])) + # Ordinarily, I have these derive from a theme package, but they're very good # defaults and starting poinnnnnntsssssts theme(panel.spacing.y = unit(0.6, "lines"), panel.spacing.x = unit(1, "lines"), plot.title = element_text(colour = colour_scheme[["accent_blue"]]), plot.subtitle = element_text(colour = colour_scheme[["accent_blue"]]), axis.line = element_line(colour = colour_scheme[["dark_primary"]]), panel.background = element_rect(fill = colour_scheme[["white"]]), panel.grid.major = element_line(colour = colour_scheme[["light_primary"]]), panel.grid.minor = element_line(colour = colour_scheme[["light_primary"]]), strip.background = element_rect(fill = colour_scheme[["accent_blue"]]), strip.text = element_text(colour = colour_scheme[["light_primary"]]) )
The end result is a fairly appealing plot, although if the epidemic goes on, one might want to consider getting rid of the point markers. All that remains is to insert an automatic call to the
ggsave function to save the image:
Sys.time() %>% format("%d%H%M%S%b%Y") %>% paste("DRC-EBOV-", ., ".png", sep="") %>% ggsave(plot = p, device="png", path="visualisations/drc/", width = 8, height = 6)
Of course, being a lazy epidemiologist, this is the kind of stuff that just has to be automated! Since I run my entire RStudio instance on a remote machine, it would make perfect sense to regularly run this script.
cronR package comes with a nice widget, which will allow you to simply schedule any task. Old-school command line users can, of course, always resort to ye olde command line based scheduling and execution. One important caveat: the context of cron execution will not necessarily be the same as of your R project or indeed of the R user. Therefore, when you source a file or refer to paths, you may want to refer to fully qualified paths, i.e.
/home/my_user/my_R_stuff/script.R rather than merely
cronR is very good at logging when things go awry, so if the plots do not start to magically accumulate at the requisite rate, do give the log a check.
The next step is, of course, to automate uploads to Twitter. But that’s for another day.
References [ + ]
|1.||↑||Disclaimer: Dr Rivers is a friend, former collaborator and someone I hold in very high esteem. I’m also from time to time trying to contribute to these repos.|
|2.||↑||My convention for timestamps is the military DTG convention of
|3.||↑||It is, technically, bad form to put the path in the file name for the
|4.||↑||If for whatever reason you would prefer to keep only one copy of the data around and overwrite it every time, quite simply provide a fixed filename into
|5.||↑||As an informal convention of mine, when using the simple conversion functions of