intubate <||> R stat functions in data science pipelines

intubate <||>, XBRL, bioPN, sbioPN, and stats with R

5 years ago

[This article was first published on intubate <||>, XBRL, bioPN, sbioPN, and stats with R, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

The aim of intubate (logo <||>) is to offer a painless way to add R functions that are non-pipe-aware to data science pipelines implemented by magrittr with the operator %>%, without having to rely on workarounds of varying complexity.

Installation

the latest released version from CRAN (1.0.0) with

install.packages("intubate")

the latest development version from github (1.4.0) with

# install.packages("devtools")
devtools::install_github("rbertolusso/intubate")

Pipelines

dplyr, by Hadley Wickham, Romain Francois, and RStudio, is used here to illustrate data transformations.

Suppose you have the following code:

library(dplyr)

tmp <- filter(LifeCycleSavings, dpi >= 1000)
to_fit <- select(tmp, sr, pop15, pop75)

and you would like to avoid creating the temporary object tmp. One approach could be the following:

to_fit <- select(filter(LifeCycleSavings, dpi >= 1000), sr, pop15, pop75)

The problem with this approach is that, as the number of intermediate steps increase, it is error prone and becomes more complicated to understand.

Pipes in R are made possible by the package magrittr, by Stefan Milton Bache and Hadley Wickham. They provide an elegant alternative:

library(magrittr)

LifeCycleSavings %>% 
  filter(dpi >= 1000) %>% 
  select(sr, pop15, pop75) ->
to_fit

Pipelines seem to be a popular way, these days, of doing data science in R. If you need an introduction about pipelines, please follow this link (http://r4ds.had.co.nz/transform.html) to the chapter on data transformation of the forthcoming book “R for Data Science” by Garrett Grolemund and Hadley Wickham.

R statistical functions and pipelines

Suppose you want to perform a regression analysis of sr on pop15 and pop75 (assuming for the sake of argument that it is a valid analysis to perform).

As most R functions are not pipeline-aware, you should do something like the following:

fitted <- lm(sr ~ ., to_fit)
summary(fitted)

## 
## Call:
## lm(formula = sr ~ ., data = to_fit)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.5438 -2.1996  0.4071  2.2060  5.4754 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  38.5981     9.6146   4.015 0.000898 ***
## pop15        -0.6574     0.2481  -2.650 0.016843 *  
## pop75        -2.7315     1.2458  -2.193 0.042536 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.558 on 17 degrees of freedom
## Multiple R-squared:  0.3213,	Adjusted R-squared:  0.2415 
## F-statistic: 4.024 on 2 and 17 DF,  p-value: 0.03709

This is an absolutely correct approach.

But what if, in addition to the data transformation, you would also like to perform your data modeling/analysis under the same pipeline paradigm (by adding lm to it), which would impart notation consistency and would avoid the need of creating the temporary object to_fit?

LifeCycleSavings %>% 
  filter(dpi >= 1000) %>% 
  select(sr, pop15, pop75) %>% 
  lm(sr ~ .) %>%               ## Adding lm to the pipeline
  summary()

## Error in as.data.frame.default(data): cannot coerce class ""formula"" to a data.frame

You get an error.

The reason of this failure is that pipeline-aware functions (such as the ones in dplyr that were specifically designed to work in pipelines) receive the data as the first parameter, and most statistical procedures (or graphical functions such as the ones in package lattice) that work with formulas to specify the model, such as lm and lots of other rock solid reliable functions that implement well established statistical procedures, receive the data as the second parameter.

There are alternatives that allow to include lm (and others) in the pipeline without errors and without intubate. They require workarounds of varying levels of complexity. Some of the possible approaches are illustrated in the post Workarounds to include R stat functions in data science pipelines.

If you choose intubate is because you do not want to bother about workarounds when working with pipelines that include statistical procedures, or other non-pipe-aware functions.

By the way, intubate also implements three extensions for pipelines called intubOrders, intuEnv, and intuBags. These extensions will be treated in following posts.

intubate

The original aim of intubate is to offer a painless way to add R functions that are non-pipe-aware to data science pipelines implemented by ‘magrittr’ with the operator %>%, without having to rely on workarounds of varying complexity.

## install.packages("intubate")
library(intubate)

To this end, intubate provides interfaces (such as ntbt_lm) that let you do:

LifeCycleSavings %>% 
  filter(dpi >= 1000) %>% 
  select(sr, pop15, pop75) %>%
  ntbt_lm(sr ~ pop15 + pop75) %>% 
  summary()

## 
## Call:
## lm(formula = sr ~ pop15 + pop75)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.5438 -2.1996  0.4071  2.2060  5.4754 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  38.5981     9.6146   4.015 0.000898 ***
## pop15        -0.6574     0.2481  -2.650 0.016843 *  
## pop75        -2.7315     1.2458  -2.193 0.042536 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.558 on 17 degrees of freedom
## Multiple R-squared:  0.3213,	Adjusted R-squared:  0.2415 
## F-statistic: 4.024 on 2 and 17 DF,  p-value: 0.03709

without error.

With intubate, you can push boundaries and do things like:

LifeCycleSavings %>% 
  filter(dpi >= 1000) %>% 
  select(sr, pop15, pop75) %>%
  ntbt_lm(sr ~ pop15 + pop75) %>% 
  ntbt_plot(which = 1) %>%        ## Adding a residual plot
  summary()

## 
## Call:
## lm(formula = sr ~ pop15 + pop75)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.5438 -2.1996  0.4071  2.2060  5.4754 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  38.5981     9.6146   4.015 0.000898 ***
## pop15        -0.6574     0.2481  -2.650 0.016843 *  
## pop75        -2.7315     1.2458  -2.193 0.042536 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.558 on 17 degrees of freedom
## Multiple R-squared:  0.3213,	Adjusted R-squared:  0.2415 
## F-statistic: 4.024 on 2 and 17 DF,  p-value: 0.03709

(as plot returns NULL, intubate automatically forwards its input so summary receives the result of lm).

Currently, intubate provides more than 450 interfaces to 88 data science related packages in CRAN or in R installation (list of packages in Appendix).

Moreover, the user can:
- call non-pipe-aware functions “on the fly” without needing to create an interface;
- create interfaces “on demand”.

Calling non-pipe-aware functions “on the fly”

If you do not want to use interfaces, you can use the function ntbt to call the non-pipe-aware functions directly “on the fly”:

LifeCycleSavings %>% 
  ntbt(lm, sr ~ pop15 + pop75)

## 
## Call:
## lm(formula = sr ~ pop15 + pop75)
## 
## Coefficients:
## (Intercept)        pop15        pop75  
##     30.6277      -0.4708      -1.9341

Note: this approach works with any function, including the ones lacking interfaces.

For example, lsfit does not currently have an interface provided by intubate, but you still can call it “on the fly” with ntbt:

LifeCycleSavings %>%
  ntbt_plot(pop75, sr) %>%
  ntbt(lsfit, pop75, sr) %>%    # Calling lsfit "on the fly" with ntbt
  abline()

Creating interfaces “on demand”

If you like to use interfaces, and intubate does not provide one, you can create your own “on demand”. For example, to create an interface to lsfit, all that is needed is the following line:

ntbt_lsfit <- intubate

The only thing you need to remember is that the name of an interface must start with ntbt_ followed by the name of the interfaced function (lsfit in this particular case), no matter which function you want to interface.

You can now use the newly created interface as any other provided by intubate:

LifeCycleSavings %>%
  ntbt_plot(pop75, sr) %>%
  ntbt_lsfit(pop75, sr) %>%    # Using just created "on demand" interface
  abline()

Just in case, let’s clarify that the intubate machinery does not perform any statistical computation. The interfaced functions (those that are already well tested) are the ones performing the computations.

Non-formula variants:

Some functions offer non-formula variants (or both variants). For example, including cor.test in a pipeline in any of its variants produces an error:

LifeCycleSavings %>% 
  filter(dpi >= 1000) %>% 
  select(sr, pop15, pop75) %>%
  cor.test(pop15, pop75)   ## Non-formula variant

## Error in match.arg(alternative): object 'pop75' not found

or:

LifeCycleSavings %>% 
  filter(dpi >= 1000) %>% 
  select(sr, pop15, pop75) %>%
  cor.test(~ pop15 + pop75)   ## Formula variant

## Error in cor.test.default(., ~pop15 + pop75): 'x' and 'y' must have the same length

Both variants work when using any of the approaches provided by intubate:

LifeCycleSavings %>% 
  filter(dpi >= 1000) %>% 
  select(sr, pop15, pop75) %>%
  ntbt_cor.test(pop15, pop75)   ## Non-formula variant

## 
## 	Pearson's product-moment correlation
## 
## data:  pop15 and pop75
## t = -2.4193, df = 18, p-value = 0.02636
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.76924958 -0.06766132
## sample estimates:
##        cor 
## -0.4953505

or:

LifeCycleSavings %>% 
  filter(dpi >= 1000) %>% 
  select(sr, pop15, pop75) %>%
  ntbt(cor.test, ~ pop15 + pop75)   ## Formula variant

## 
## 	Pearson's product-moment correlation
## 
## data:  pop15 and pop75
## t = -2.4193, df = 18, p-value = 0.02636
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.76924958 -0.06766132
## sample estimates:
##        cor 
## -0.4953505

Appendix

Packages containing interfaces

The 88 R packages that have interfaces implemented so far are:

adabag: Multiclass AdaBoost.M1, SAMME and Bagging
AER: Applied Econometrics with R
aod: Analysis of Overdispersed Data
ape: Analyses of Phylogenetics and Evolution
arm: Data Analysis Using Regression and Multilevel/Hierarchical Models
betareg: Beta Regression
brglm: Bias reduction in binomial-response generalized linear models
caper: Comparative Analyses of Phylogenetics and Evolution in R
car: Companion to Applied Regression
caret: Classification and Regression Training
coin: Conditional Inference Procedures in a Permutation Test Framework
CORElearn: Classification, Regression and Feature Evaluation
drc: Analysis of Dose-Response Curves
e1071: Support Vector Machines
earth: Multivariate Adaptive Regression Splines
EnvStats: Environmental Statistics, Including US EPA Guidance
fGarch: Rmetrics – Autoregressive Conditional Heteroskedastic Modelling
flexmix: Flexible Mixture Modeling
forecast: Forecasting Functions for Time Series and Linear Models
frontier: Stochastic Frontier Analysis
gam: Generalized Additive Models
gbm: Generalized Boosted Regression Models
gee: Generalized Estimation Equation Solver
glmnet: Lasso and Elastic-Net Regularized Generalized Linear Models
glmx: Generalized Linear Models Extended
gmnl: Multinomial Logit Models with Random Parameters
gplots: Various R Programming Tools for Plotting Data
gss: General Smoothing Splines
graphics: The R Graphics Package
hdm: High-Dimensional Metrics
Hmisc: Harrell Miscellaneous
ipred: Improved Predictors
iRegression: Regression Methods for Interval-Valued Variables
ivfixed: Instrumental fixed effect panel data model
kernlab: Kernel-Based Machine Learning Lab
kknn: Weighted k-Nearest Neighbors
klaR: Classification and Visualization
lars: Least Angle Regression, Lasso and Forward Stagewise
lattice: Trellis Graphics for R
latticeExtra: Extra Graphical Utilities Based on Lattice
leaps: Regression Subset Selection
lfe: Linear Group Fixed Effects
lme4: Linear Mixed-Effects Models using ‘Eigen’ and S4
lmtest: Testing Linear Regression Models
MASS: Robust Regression, Linear Discriminant Analysis, Ridge Regression, Probit Regression, …
MCMCglmm: MCMC Generalised Linear Mixed Models
mda: Mixture and Flexible Discriminant Analysis
metafor: Meta-Analysis Package for R
mgcv: Mixed GAM Computation Vehicle with GCV/AIC/REML Smoothness Estimation
minpack.lm: R Interface to the Levenberg-Marquardt Nonlinear Least-Squares Algorithm Found in MINPACK, Plus Support for Bounds
mhurdle: Multiple Hurdle Tobit Models
mlogit: Multinomial logit model
mnlogit: Multinomial Logit Model
modeltools: Tools and Classes for Statistical Models
nlme: Linear and Nonlinear Mixed Effects Models
nlreg: Higher Order Inference for Nonlinear Heteroscedastic Models
nnet: Feed-Forward Neural Networks and Multinomial Log-Linear Models
ordinal: Regression Models for Ordinal Data
party: A Laboratory for Recursive Partytioning
partykit: A Toolkit for Recursive Partytioning
plotrix: Various Plotting Functions
pls: Partial Least Squares and Principal Component Regression
pROC: Display and Analyze ROC Curves
pscl: Political Science Computational Laboratory, Stanford University
psychomix: Psychometric Mixture Models
psychotools: Infrastructure for Psychometric Modeling
psychotree: Recursive Partitioning Based on Psychometric Models
quantreg: Quantile Regression
randomForest: Random Forests for Classification and Regression
Rchoice: Discrete Choice (Binary, Poisson and Ordered) Models with Random Parameters
rminer: Data Mining Classification and Regression Methods
rms: Regression Modeling Strategies
robustbase: Basic Robust Statistics
rpart: Recursive Partitioning and Regression Trees
RRF: Regularized Random Forest
RWeka: R/Weka Interface
sampleSelection: Sample Selection Models
sem: Structural Equation Models
spBayes: Univariate and Multivariate Spatial-temporal Modeling
stats: The R Stats Package (glm, lm, loess, lqs, nls, …)
strucchange: Testing, Monitoring, and Dating Structural Changes
survey: Analysis of Complex Survey Samples
survival: Survival Analysis
SwarmSVM: Ensemble Learning Algorithms Based on Support Vector Machines
systemfit: Estimating Systems of Simultaneous Equations
tree: Classification and Regression Trees
vcd: Visualizing Categorical Data
vegan: Community Ecology Package

Bugs and Feature requests

The robustness and generality of the interfacing machinery still needs to be further verified (and very likely improved), as there are thousands of potential functions to interface and certainly some are bound to fail when interfaced. Some have already been addressed when implementing provided interfaces (as their examples failed).

The goal is to make intubate each time more robust by addressing the peculiarities of newly discovered failing functions.

For the time being, only cases where the interfaces provided with intubate fail will be considered as bugs.

Cases of failing user defined interfaces or when using ntbt to call functions directly that do not have interfaces provided with released versions of intubate, will be considered feature requests.

Of course, it will be greatly appreciated, if you have some coding skills and can follow the code of the interface, if you could provide the proposed solution, that shouldn’t break anything else, together with the feature request.

Workarounds to include R stat functions in data science pipelines

To leave a comment for the author, please follow the link and comment on their blog: intubate <||>, XBRL, bioPN, sbioPN, and stats with R.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.