Automagic epi curve plotting: part I
One of the best resources during the 2013–16 West African Ebola outbreak was Caitlin Rivers' Github repo, which was probably the easiest way to stay up to date on the numbers. For the current outbreak, she has also set up a Github repo, with really frequent updates straight from the WHO's DON data and the DRC Ministry of Public Health (MdlS) mailing list.1 Using R, I have set up a simple script that I only have to run whenever I want a pre-defined visualisation of the current situation. I usually do this on a remote RStudio server, which makes it easy to quickly generate the plots on the fly.
Obtaining the most recent data
Using the following little script, I grab the latest data from the ebola-drc Github repo:

# Fetch most recent DRC data.
library(magrittr)
library(curl)
library(readr)
library(dplyr)

current_drc_data <- Sys.time() %>%
  format("%d%H%M%S%b%Y") %>%
  paste("raw_data/drc/", "drc-", ., ".csv", sep = "") %T>%
  curl_fetch_disk("https://raw.githubusercontent.com/cmrivers/ebola_drc/master/drc/data.csv", .) %>%
  read_csv()
This uses curl (the R implementation) to fetch the most recent data and save it as a timestamped2 file in the data folder I set up just for that purpose.3 Simply sourcing this script (source("fetch_drc_data.R")) should then load the current DRC dataset into the environment.4
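In practice, the workflow looks roughly like the sketch below. The script name fetch_drc_data.R and the dplyr::glimpse() check are just illustrative choices on my part, not anything mandated by the repo:

# Assumes the snippet above has been saved as fetch_drc_data.R in the project root
source("fetch_drc_data.R")

# Quick sanity check on what came down the wire
dplyr::glimpse(current_drc_data)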
Data munging
We need to do a little data munging. First, we melt down the data frame using reshape2's melt function. Melting takes 'wide' data and converts it into 'long' data: in our case, the original data had a row for each daily report for each health zone, and a column for each combination of confirmed/probable/suspected over cases/deaths. Melting the data frame down creates a variable type column (say, confirmed_deaths) and a value column (giving the value, e.g. 3). Using lubridate,5 the dates are parsed, and the values are stored in a numeric format.
library(magrittr)
library(reshape2)
library(lubridate)

current_drc_data %<>% melt(value.name = "value",
                           measure.vars = c("confirmed_cases", "confirmed_deaths",
                                            "probable_cases", "probable_deaths",
                                            "suspect_cases", "suspect_deaths",
                                            "ruled_out"))

current_drc_data$event_date <- lubridate::ymd(current_drc_data$event_date)
current_drc_data$report_date <- lubridate::ymd(current_drc_data$report_date)
current_drc_data$value <- as.numeric(current_drc_data$value)
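If melting is new to you, here is a tiny self-contained illustration of what it does. The toy data frame is made up purely for demonstration and has nothing to do with the DRC dataset:

library(reshape2)

# A 'wide' toy example: one row per report, one column per indicator
toy_wide <- data.frame(report_date      = c("2018-05-10", "2018-05-11"),
                       confirmed_cases  = c(2, 4),
                       confirmed_deaths = c(1, 1))

# Melt it 'long': one row per report/indicator combination
toy_long <- melt(toy_wide,
                 id.vars       = "report_date",
                 variable.name = "variable",
                 value.name    = "value")

toy_long
#   report_date         variable value
# 1  2018-05-10  confirmed_cases     2
# 2  2018-05-11  confirmed_cases     4
# 3  2018-05-10 confirmed_deaths     1
# 4  2018-05-11 confirmed_deaths     1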
Next, we drop ruled_out cases, as they play no significant role in the current visualisation.
current_drc_data <- current_drc_data[current_drc_data$variable != "ruled_out",]
We also need to split the type labels into two different columns, so as to allow plotting them as a matrix. Currently, the data type label (the variable column) contains both the certainty status (confirmed, suspected or probable) and the type of indicator (cases vs deaths) in a single variable, separated by an underscore. We'll use stringr's str_split_fixed to split the variable names at the underscore, and join them into two separate columns, suspicion and mm, the latter denoting mortality/morbidity status.
library(stringr)

current_drc_data %<>% cbind(., str_split_fixed(use_series(., variable), "_", 2)) %>%
  subset(select = -c(variable)) %>%
  set_colnames(c("event_date", "report_date", "health_zone", "value", "suspicion", "mm"))
Let’s filter out the health zones that are being observed but have no relevant data for us yet:
relevant_health_zones <- current_drc_data %>%
  subset(select = c("health_zone", "value")) %>%
  group_by(health_zone) %>%
  summarise(totals = sum(value, na.rm = TRUE)) %>%
  dplyr::filter(totals > 0) %>%
  use_series(health_zone)
This gives us a vector of all health zones that are currently reporting cases. We can filter our DRC data for that:
current_drc_data %<>% dplyr::filter(health_zone %in% relevant_health_zones)
This whittles down our table by a few rows. We might also want to create a 'fake' health zone that summarises the data of all the other health zones:
totals <- current_drc_data %>%
  group_by(event_date, report_date, suspicion, mm) %>%
  summarise(value = sum(value),
            health_zone = as.factor("DRC total"))

# Reorder totals to match the column order of the core dataset
totals <- totals[, c(1, 2, 6, 5, 3, 4)]
Finally, we bind these together into a single data frame:
current_drc_data %<>% rbind.data.frame(totals)
Visualising it!
Of course, all this was in pursuance of cranking out a nice visualisation. For this, we need to do a couple of things, including first ensuring that “DRC total” is treated separately and comes last:
regions <- current_drc_data %>%
  use_series(health_zone) %>%
  unique()
regions <- regions[regions != "DRC total"]
regions %<>% c("DRC total")

current_drc_data$health_zone_f <- factor(current_drc_data$health_zone, levels = regions)
I normally start out by declaring the colour scheme I will be using. In general, I tend to use the same few colour schemes, which I keep in a few gists. For simple plots, I prefer to use no more than five colours:
colour_scheme <- c(white         = rgb(238, 238, 238, maxColorValue = 255),
                   light_primary = rgb(236, 231, 216, maxColorValue = 255),
                   dark_primary  = rgb(127, 112, 114, maxColorValue = 255),
                   accent_red    = rgb(240,  97, 114, maxColorValue = 255),
                   accent_blue   = rgb( 69,  82,  98, maxColorValue = 255))
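As an optional sanity check of the palette (a convenience of mine, not part of the original workflow), the scales package that ggplot2 depends on can render the colours as labelled swatches:

library(scales)

# Plot the five colours as labelled swatches
show_col(colour_scheme)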
With that sorted, I can invoke the ggplot method, storing the plot in an object, p. This makes it easier to retrieve it later with the ggsave method.
library(ggplot2)

p <- ggplot(current_drc_data, aes(x = event_date, y = value)) +
  # Title and subtitle
  ggtitle(paste("Daily EBOV status", "DRC", Sys.Date(), sep = " - ")) +
  labs(subtitle = "(c) Chris von Csefalvay/CBRD (cbrd.co) - @chrisvcsefalvay") +
  # This facets the plot based on the factor vector we created earlier
  facet_grid(health_zone_f ~ suspicion) +
  geom_path(aes(group = mm, colour = mm, alpha = mm), na.rm = TRUE) +
  geom_point(aes(colour = mm, alpha = mm)) +
  # Axis labels
  ylab("Cases") +
  xlab("Date") +
  # The x-axis runs from the first notified case to today
  scale_x_date(limits = c(as.Date("2018-05-08"), Sys.Date()),
               date_breaks = "7 days", date_labels = "%m/%d") +
  # Because often there's an overlap and cases that die on the day of registration
  # tend to count here as well, some opacity is useful.
  scale_alpha_manual(values = c("cases" = 0.5, "deaths" = 0.8)) +
  scale_colour_manual(values = c("cases" = colour_scheme[["accent_blue"]],
                                 "deaths" = colour_scheme[["accent_red"]])) +
  # Ordinarily, I have these derive from a theme package, but they're very good
  # defaults and starting points
  theme(panel.spacing.y = unit(0.6, "lines"),
        panel.spacing.x = unit(1, "lines"),
        plot.title = element_text(colour = colour_scheme[["accent_blue"]]),
        plot.subtitle = element_text(colour = colour_scheme[["accent_blue"]]),
        axis.line = element_line(colour = colour_scheme[["dark_primary"]]),
        panel.background = element_rect(fill = colour_scheme[["white"]]),
        panel.grid.major = element_line(colour = colour_scheme[["light_primary"]]),
        panel.grid.minor = element_line(colour = colour_scheme[["light_primary"]]),
        strip.background = element_rect(fill = colour_scheme[["accent_blue"]]),
        strip.text = element_text(colour = colour_scheme[["light_primary"]]))
The end result is a fairly appealing plot, although if the epidemic goes on, one might want to consider getting rid of the point markers. All that remains is to insert an automatic call to the ggsave function to save the image:
Sys.time() %>%
  format("%d%H%M%S%b%Y") %>%
  paste("DRC-EBOV-", ., ".png", sep = "") %>%
  ggsave(plot = p, device = "png", path = "visualisations/drc/", width = 8, height = 6)
Automation
Of course, being a lazy epidemiologist, this is the kind of stuff that just has to be automated! Since I run my entire RStudio instance on a remote machine, it makes perfect sense to run this script on a regular schedule. The cronR package comes with a nice RStudio add-in that lets you schedule any task with a few clicks. Old-school command line users can, of course, always resort to ye olde command-line-based scheduling and execution. One important caveat: the context of cron execution will not necessarily be the same as that of your R project, or indeed of your R user. Therefore, when you source a file or refer to paths, you may want to use fully qualified paths, i.e. /home/my_user/my_R_stuff/script.R rather than merely script.R. cronR is very good at logging when things go awry, so if the plots do not start to magically accumulate at the requisite rate, do give the log a check.
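For those who prefer to stay in R rather than use the add-in, scheduling might look something like the sketch below. The script path, log location, time and job id are all placeholders for whatever your own setup uses:

library(cronR)

# Build the Rscript command cron will execute, with a log file so that
# failures leave a trace (paths are illustrative -- substitute your own, fully qualified)
cmd <- cron_rscript("/home/my_user/my_R_stuff/update_drc_plot.R",
                    rscript_log = "/home/my_user/my_R_stuff/logs/update_drc_plot.log")

# Run the job every day at 07:00
cron_add(cmd,
         frequency = "daily",
         at = "07:00",
         id = "drc-ebov-epicurve",
         description = "Fetch DRC EBOV data and refresh the epi curve plot")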
The next step is, of course, to automate uploads to Twitter. But that’s for another day.
References

1. Disclaimer: Dr Rivers is a friend, former collaborator and someone I hold in very high esteem. I'm also, from time to time, trying to contribute to these repos.
2. My convention for timestamps is the military DTG convention of DDHHMMSSMONYEAR, so e.g. 7.15am on 21 May 2018 would be 21071500MAY2018.
3. It is, technically, bad form to put the path in the file name for the curl::curl_fetch_disk() function, given that curl::curl_fetch_disk() offers the path parameter just for that. However, due to the intricacies of piping data forward, this is arguably the best way to do it, as it allows the filename to be reused for the read_csv() function, too.
4. If for whatever reason you would prefer to keep only one copy of the data around and overwrite it every time, quite simply provide a fixed filename to curl_fetch_disk().
5. As an informal convention of mine, when using the simple conversion functions of lubridate such as ymd, I tend to note the source package, hence lubridate::ymd over simply ymd.