An R View into Epidemiology
If you have been tracking the numbers for the COVID-19 pandemic, you have probably looked at dozens of models and tried to make some comparisons. Even in the best of circumstances it is difficult to compare models, and this is especially true if you don’t have sufficient domain knowledge. Experts tend to leave out assumptions and background material that they know other experts will take for granted. This leaves newcomers pretty much on their own.
It has been my experience that a good way for an R-literate person to begin to acquire knowledge in a new field is to find some appropriate packages, study the vignettes, work through the examples, and read whatever source material they reference. So, this post shows how one might go about finding those packages. I also thought it would be interesting to see what kind of special resources are available to epidemiologists working in R beyond the basic statistical infrastructure and packages for data manipulation and visualization.
Because there is no epidemiology task view, a good place to start is to search CRAN directly. (Note that there are task views on differential equations, spatial statistics, time series and other tools used by epidemiologists, so I confined my search to the basics.)
The two main packages I used to search were pkgsearch, which searches CRAN package data, and dlstats, which retrieves package download information from the RStudio mirror. The pkg_search() function takes a query string as input and returns information about packages that match the query, including a score and the number of downloads in the previous month. Finding a useful string that returns a reasonable list of packages takes some hunting skill, which comes from iterating over plausible strings.
    library(pkgsearch)
    Epi <- pkg_search(query = "epidemiology epidemic", size = 200)
On the day I did the search, the above query returned a list of 98 packages.
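Iterating over plausible strings can be made a little more systematic. The sketch below is one way to compare candidate queries by the number of packages each returns. The helper count_hits() and its search_fn argument are hypothetical names introduced here for illustration; they are not part of pkgsearch, and the candidate strings in the commented call are only examples.

```r
# Hypothetical helper: count how many packages each candidate query returns.
# search_fn defaults to pkgsearch::pkg_search(), but any function that accepts
# `query` and `size` arguments and returns a data frame will work.
count_hits <- function(queries, search_fn = pkgsearch::pkg_search, size = 200) {
  vapply(
    queries,
    function(q) nrow(search_fn(query = q, size = size)),
    integer(1)
  )
}

# Example call (needs the pkgsearch package and an internet connection):
# count_hits(c("epidemiology", "epidemic", "epidemiology epidemic"))
```

Each count is capped at size, so a query that hits the cap probably needs narrowing, while one that returns only a handful of packages is probably too specific.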
Then the columns score, a measure of how well a package matches the query, and downloads_last_month, a proxy for quality, can help filter the results down to a short list of packages to examine.
    library(dplyr)

    df <- as_tibble(Epi) %>%
      filter(score >= 10, downloads_last_month > 830) %>%
      select(package, title, downloads_last_month) %>%
      arrange(-downloads_last_month)
    print(df, n = nrow(df))

    # A tibble: 23 x 3
       package      title                                          downloads_last_mo…
       <chr>        <chr>                                                       <int>
     1 epitools     "Epidemiology Tools"                                         8480
     2 Epi          "A Package for Statistical Analysis in Epide…                8212
     3 epiR         "Tools for the Analysis of Epidemiological D…                5888
     4 EpiEstim     "Estimate Time Varying Reproduction Numbers …                5477
     5 epiDisplay   "Epidemiological Data Display Package"                       4707
     6 table1       "Tables of Descriptive Statistics in HTML"                   3640
     7 EpiModel     "Mathematical Modeling of Infectious Disease…                3502
     8 haplo.stats  "Statistical Analysis of Haplotypes with Tra…                2723
     9 SpatialEpi   "Methods and Data for Spatial Epidemiology"                  2502
    10 R0           "Estimation of R0 and Real-Time Reproduction…                2468
    11 popEpi       "Functions for Epidemiological Analysis usin…                2081
    12 epitrix      "Small Helpers and Tricks for Epidemics Anal…                1993
    13 surveillance "Temporal and Spatio-Temporal Modeling and M…                1986
    14 EpiContactT… "Epidemiological Tool for Contact Tracing"                   1181
    15 EpiCurve     "Plot an Epidemic Curve"                                     1173
    16 epibasix     "Elementary Epidemiological Functions for Ep…                1126
    17 epicontacts  "Handling, Visualisation and Analysis of Epi…                1030
    18 pubh         "A Toolbox for Public Health and Epidemiolog…                 997
    19 powerSurvEpi "Power and Sample Size Calculation for Survi…                 963
    20 mem          "The Moving Epidemic Method"                                  867
    21 epimdr       "Functions and Data for \"Epidemics: Models …                 867
    22 DSAIDE       "Dynamical Systems Approach to Infectious Di…                 861
    23 episensr     "Basic Sensitivity Analysis of Epidemiologic…                 855
Most of the packages in the short list turned out to be “professional” packages in the sense that they provide essential functions but are rather light on documentation; they are aimed at working professionals. Still, my search did turn up a few that are suitable for self-study. The package epimdr, for example, is associated with Bjørnstad’s book Epidemics: Models and Data Using R as well as the Coursera course Epidemics - the Dynamics of Infectious Diseases. And the vignette for the epiR package references the free CDC online course Principles of Epidemiology in Public Health Practice.
Six of the packages on the short list (DSAIDE, epicontacts, EpiEstim, EpiModel, epitrix, and surveillance) have all either been developed or authorized by the R Epidemics Consortium (RECON), an international non-profit organization with a mission to “create the next generation of analytics tools for informing the response to disease outbreaks, health emergencies and humanitarian crises, using the R software and other free, open-source resources”. This group not only develops software and builds models, but its members also go onsite to help fight disease outbreaks. These packages are mostly very well documented and useful to experts and students alike.
The DSAIDE package provides a tutorial on infectious diseases.
The epicontacts package provides a collection of tools for representing epidemiological contact data.
The EpiEstim package is targeted towards estimating time-varying reproduction numbers from epidemic curves.
The EpiModel package, which is documented with a JSS paper and its own tutorial website, provides a number of advanced epidemiological models, including deterministic compartmental models, stochastic individual contact models, and network models that go beyond the simple assumption of random contact among all members in a compartment. EpiModel was featured in Tim Churches’ March post.
The epitrix package contains a number of utility functions for infectious disease modeling, including a function to anonymize data.
The surveillance package, which supports spatio-temporal analysis, is well documented with seven vignettes.
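For readers new to the field, it may help to see what a deterministic compartmental model under the random-contact assumption looks like. The following is a minimal base R sketch of a Susceptible-Infectious-Recovered (SIR) model, stepped forward with a simple fixed-step Euler scheme. It does not use EpiModel or any other package from the list; the function name sir_euler() and all parameter values are assumptions chosen purely for illustration.

```r
# Minimal SIR model integrated with a fixed-step Euler scheme (base R only).
# beta:  transmission rate per day (illustrative value, not an estimate)
# gamma: recovery rate per day (illustrative value, not an estimate)
sir_euler <- function(beta, gamma, s0, i0, r0 = 0, days = 100, dt = 0.1) {
  n_steps <- round(days / dt)
  n_pop <- s0 + i0 + r0
  s <- numeric(n_steps + 1)
  i <- numeric(n_steps + 1)
  r <- numeric(n_steps + 1)
  s[1] <- s0
  i[1] <- i0
  r[1] <- r0
  for (k in seq_len(n_steps)) {
    # Random mixing: new infections are proportional to S * I / N.
    new_inf <- beta * s[k] * i[k] / n_pop * dt
    new_rec <- gamma * i[k] * dt
    s[k + 1] <- s[k] - new_inf
    i[k + 1] <- i[k] + new_inf - new_rec
    r[k + 1] <- r[k] + new_rec
  }
  data.frame(time = (0:n_steps) * dt, S = s, I = i, R = r)
}

out <- sir_euler(beta = 0.5, gamma = 0.2, s0 = 999, i0 = 1)
tail(out, 1)
```

Every person who leaves one compartment enters another, so S + I + R stays equal to the population size at every step. The stochastic and network models in EpiModel relax the random-mixing assumption that the new_inf line encodes.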
Finally, taking a look at package download history indicates which packages continue to be useful over time, and in this case, provides some idea of the demand for infectious disease modeling. Here is the download history of the top five packages on the short list.
    library(dlstats)
    library(ggplot2)
    library(plotly)

    # Get download history for top 5
    top_5 <- df %>% slice(1:5)
    dl_stats <- cran_stats(top_5$package)
    p <- ggplot(dl_stats, aes(end, downloads, colour = package)) +
      geom_line() +
      xlab("Month") +
      ggtitle("Download History of Top 5 Epidemiology Packages")
    fig <- ggplotly(p)
    fig
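A table of totals can be a useful complement to the plot. The sketch below assumes a data frame with the same end, downloads, and package columns used in the plotting code above, and sums downloads by package and calendar year. The data frame dl_mock is a synthetic stand-in so the example runs offline; the package names and numbers in it are made up.

```r
# Synthetic stand-in for the download history: one row per package per month.
# The package names and numbers are made up for illustration only.
dl_mock <- data.frame(
  end       = as.Date(c("2019-12-31", "2020-01-31", "2020-02-29",
                        "2019-12-31", "2020-01-31", "2020-02-29")),
  downloads = c(100, 120, 150, 40, 45, 90),
  package   = rep(c("pkgA", "pkgB"), each = 3)
)

# Total downloads per package per calendar year.
dl_mock$year <- format(dl_mock$end, "%Y")
yearly <- aggregate(downloads ~ package + year, data = dl_mock, FUN = sum)
yearly
```

With real data, dl_stats would take the place of dl_mock.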
I’ll close with this: if you are R-literate, you can be pretty confident that you will be able to find tutorials, models and reference implementations to help you learn something about any field that benefits from statistical analysis. If you are an expert in the field, there will be something for you too.
When looking for R packages for a particular application, first look to see if there is a task view. If not, R provides some pretty good tools to help you search.