As Google learned, predicting the spread of influenza, even with mountains of data, is notoriously difficult. Nonetheless, bioinformatician and R user Shirin Glander has created a two-part tutorial about predicting flu deaths with R (part 2 here). The analysis is based on just 136 cases of influenza A H7N9 in China in 2013 (data provided in the outbreaks package), so the intent was not to create a generally predictive model. But by providing all of the R code and graphics, Shirin has created a useful example of real-world predictive modeling with R.
The tutorial covers loading and cleaning the data (including a nice example of using the mice package to impute missing values) and begins with some exploratory data visualizations. I was particularly impressed by the use of density charts (using the stat_density2d function from ggplot2) to highlight differences in the scatterplots of flu cases ending in death and recovery.
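To give a flavor of those two steps, here is a minimal sketch of imputing missing values with mice and then overlaying 2D density contours on a scatterplot. It is not Shirin's code: the data frame `toy` and its columns are synthetic and made up for illustration, and the choice to keep only the first of the imputed data sets is a simplifying assumption.

```r
library(mice)
library(ggplot2)

set.seed(42)

# Synthetic stand-in data (NOT the real H7N9 cases); column names are hypothetical
n <- 100
toy <- data.frame(
  age = round(rnorm(n, mean = 55, sd = 15)),
  days_onset_to_hospital = rpois(n, lambda = 4),
  outcome = factor(sample(c("Death", "Recover"), n, replace = TRUE))
)

# Knock out 20 values in one column to mimic incomplete case records
toy$days_onset_to_hospital[sample(n, 20)] <- NA

# Multiple imputation; for numeric columns mice defaults to
# predictive mean matching ("pmm")
imp <- mice(toy, m = 5, seed = 42, printFlag = FALSE)

# Keep the first completed data set (a simplification for this sketch)
toy_complete <- mice::complete(imp, 1)

# Scatterplot with one set of 2D density contours per outcome group
p <- ggplot(toy_complete,
            aes(x = age, y = days_onset_to_hospital, color = outcome)) +
  geom_point(alpha = 0.6) +
  stat_density2d()
print(p)
```

With real data, it may be worth comparing several of the `m` imputed data sets rather than relying on just one.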
For the statistical analysis, Shirin applies several different kinds of predictive models, including:
- Decision trees (implemented using rpart and visualized using fancyRpartPlot from the rattle package)
- Random Forests (using caret's “rf” training method)
- Elastic-Net Regularized Generalized Linear Models (using caret's “glmnet” training method)
- K-nearest neighbors classification (using caret's “kknn” training method)
- Penalized Discriminant Analysis (using caret's “pda” training method)
- and in Part 2, Extreme gradient boosting using the xgboost package and various preprocessing techniques from the caret package
Due to the limited data size, there's not much difference between the models: in each case, 13-15 of the 23 cases (roughly 57% to 65%) were classified correctly. Nonetheless, the post provides a useful template for applying several different model types to the same data set, and using the power of the caret package to normalize the data and optimize the models.
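The general pattern of fitting several caret methods to the same data can be sketched as follows. This is an illustration under stated assumptions rather than the tutorial's actual code: the data are synthetic, the column names are hypothetical, and the resampling settings (5-fold cross-validation repeated 3 times) are arbitrary choices. It also assumes the randomForest, glmnet, kknn and mda packages are installed, since caret calls them behind the scenes.

```r
library(caret)

set.seed(27)

# Synthetic two-class data (NOT the real flu cases); column names are hypothetical
n <- 80
toy <- data.frame(
  age = round(rnorm(n, mean = 55, sd = 15)),
  days_to_hospital = rpois(n, lambda = 4),
  gender_f = rbinom(n, size = 1, prob = 0.3)
)
score <- toy$age + 3 * toy$days_to_hospital + rnorm(n, sd = 15)
toy$outcome <- factor(ifelse(score > 70, "Death", "Recover"))

# Hold out roughly 30% of the cases for testing, stratified by outcome
idx <- createDataPartition(toy$outcome, p = 0.7, list = FALSE)
train_df <- toy[idx, ]
test_df  <- toy[-idx, ]

# The same resampling scheme and preprocessing for every model
ctrl <- trainControl(method = "repeatedcv", number = 5, repeats = 3)
methods <- c("rf", "glmnet", "kknn", "pda")

fits <- lapply(methods, function(m) {
  train(outcome ~ ., data = train_df, method = m,
        preProcess = c("center", "scale"),  # normalize the predictors
        trControl = ctrl)                   # tune via repeated cross-validation
})
names(fits) <- methods

# Share of held-out cases each model classifies correctly
accuracy <- sapply(fits, function(fit) {
  mean(predict(fit, newdata = test_df) == test_df$outcome)
})
print(round(accuracy, 2))
```

Because every model goes through the same `train()` call, swapping in another method is a one-word change, which is much of caret's appeal for this kind of side-by-side comparison.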
(By the way, if you like the style of Shirin's blog, she's also created a useful guide to creating an R blog using Github, JekyllBootstrap, and RMarkdown.)
For Shirin's complete analyses of the flu data, including the R code, follow the links below.
Shirin's playgRound: Can we predict flu deaths with Machine Learning and R? (part 2) (via Thomas Dinsmore's ML/DL blog)