Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

It’s a common situation to have data from multiple processes in a “long” data format, for example a table with columns measurement and process_that_produced_measurement. It’s also natural to split that data apart to analyze or transform it, per-process — and then to bring the results of that data processing together, for comparison. Such a work pattern is called “Split-Apply-Combine,” and we discuss several R implementations of this pattern here. In this article we show a simple example of one such implementation, replyr::gapply, from our latest package, replyr.

Illustration by Boris Artzybasheff. Image: James Vaughn, some rights reserved.

The example task is to evaluate how several different models perform on the same classification problem, in terms of deviance, accuracy, precision and recall. We will use the “default of credit card clients” data set from the UCI Machine Learning Repository.

To keep this post short, we will skip over the preliminary data processing and the modeling; if you are interested, the code for the full example is available here. We will fit a logistic regression model (GLM), a generalized additive model (GAM), and a random forest model (ranger implementation) to a training set, and evaluate the models’ performance on a hold-out set.

# load the file of model fitting and prediction functions
source("modelfitting.R")
algolist = list(glm=glm_predictor,
gam=gam_predictor,
rangerRF=ranger_predictor)

# define outcome column and variables
outcome = "defaults"
varlist =  ...

# Fit models for each algorithm and gather together the
# predictions each model makes on a test set.
predictors = fit_models(algolist, outcome, varlist, train)
predframe = make_predictions(predictors, test, outcome)

##     column     class nunique
## 2 defaults   logical       2
## 3    model character       3
## 1     pred   numeric   17973
## # A tibble: 3 × 2
##      model     n
##      <chr> <dbl>
## 1      gam  5997
## 2      glm  5997
## 3 rangerRF  5997


The results of the evaluation are in a single data frame predframe, with columns defaults (the true outcome: whether or not this customer defaulted on their loan in the next month); pred (the predicted probability of default); and model (the model that made the prediction).

To evaluate each model’s performance, we write a function metric_row that takes a frame of predictions and true outcomes, and returns a data frame of all the performance metrics (deviance explained, accuracy, precision, and recall; the implementations for each metric are not shown here). This is the function we wish to apply to each group of data.

metric_row = function(subframe,
yvar,
pred,
label) {
confmat = cmat(subframe[[yvar]], subframe[[pred]])
devExplained = sigr::formatChiSqTest(subframe, pred, yvar)$pseudoR2 tframe = data.frame(devExplained=devExplained, accuracy=accuracy(confmat), precision=precision(confmat), recall=recall(confmat)) tframe$model = subframe[[label]][1]
# assuming there is only one label
tframe
}

# example outcome of metric_row, for the glm model
metric_row(subset(predframe, model=="glm"), outcome, "pred", "glm")

##   devExplained  accuracy precision    recall
## 1    0.1125283 0.8094047 0.7238979 0.2335329


In this case our data processing returns a one-row data frame but you could return a multirow frame. For example, if the data we process includes predictions for both the training and test sets, we could return a data frame with one row each for test and training performance.

We would like to use split-apply-combine on all the data, to return a frame of performance metrics for all the models that we evaluated. We can do that explicitly, of course (additionally sorted by deviance explained, descending):

#
# Compute performance metrics for all the model types
# Order by deviance explained
#

split(predframe, predframe\$model) %>%
lapply(function(fi) {metric_row(fi, outcome,
'pred',
'model')}) %>%
dplyr::bind_rows() %>%
dplyr::arrange(desc(devExplained))

replyr::gapply provides a convenient function to wrap most of the above pipe.

#
# Compute performance metrics for all the model types
# Order by deviance explained
#
function(fi) metric_row(fi,outcome,
'pred',
'model'),
partitionMethod = 'split') %>%
dplyr::arrange(desc(devExplained))

##   devExplained  accuracy precision    recall    model
## 1    0.1810591 0.8174087 0.6704385 0.3547904      gam
## 2    0.1767817 0.8180757 0.6680384 0.3645210 rangerRF
## 3    0.1125283 0.8094047 0.7238979 0.2335329      glm


The partitionMethod = 'split' argument tells gapply to split the data using base::split, rather than partitioning the data using dplyr::group_by before applying the user-supplied function. dplyr::group_by is the default partitioning method, but isn’t suitable for the function (metric_row) that I want to apply.

Conclusion

replyr::gapply implements the split-order-apply pattern in a convenient wrapper function. It supports dplyr grouped operations and explicit data partitioning (as in base::split), and can be used on any dplyr-supported back-end. The replyr package is on CRAN; the most recent development version is available on Github.