Workarounds to include R stat functions in data science pipelines

[This article was first published on intubate <||>, XBRL, bioPN, sbioPN, and stats with R, and kindly contributed to R-bloggers].

This post explores some of the workarounds you can use
to include non-pipe-aware functions in magrittr pipelines
without using intubate and, at the end, shows the intubate alternative. See
intubate <||> R stat functions in data science pipelines for an introduction.

Some workarounds to include non-pipe-aware functions in pipelines.

library(magrittr)

Example 1:

Using lm directly in a data pipeline will raise an error:

LifeCycleSavings %>%
  lm(sr ~ .)
## Error in as.data.frame.default(data): cannot coerce class "formula" to a data.frame

lm can be added directly to the pipeline,
without error, by specifying the name of the parameter
associated with the model (formula in this case).

LifeCycleSavings %>%
  lm(formula = sr ~ .)
## 
## Call:
## lm(formula = sr ~ ., data = .)
## 
## Coefficients:
## (Intercept)        pop15        pop75          dpi         ddpi  
##  28.5660865   -0.4611931   -1.6914977   -0.0003369    0.4096949
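
A minimal sketch of why naming the argument helps, written without the pipe so that it runs in base R. It assumes that x %>% f(args) places x as the first unnamed argument of f, so each piped call is rewritten by hand here; the variable names are only illustrative.

```r
# What `LifeCycleSavings %>% lm(formula = sr ~ .)` amounts to: the data frame
# is the first unnamed argument and, because `formula` is already matched by
# name, it falls into the next free parameter, `data`.
fit_named <- lm(LifeCycleSavings, formula = sr ~ .)

# The conventional call, for comparison.
fit_usual <- lm(sr ~ ., data = LifeCycleSavings)

# Both fits should give the same coefficients.
all.equal(coef(fit_named), coef(fit_usual))

# What `LifeCycleSavings %>% lm(sr ~ .)` amounts to: the data frame lands in
# `formula` and the formula in `data`, so the call fails.
res <- tryCatch(lm(LifeCycleSavings, sr ~ .), error = function(e) e)
inherits(res, "error")
```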

The drawback of this approach is that not all functions
use formula to specify the model.

So far I have encountered 5 variants:

  • formula
  • x
  • object
  • model, and
  • fixed
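
Rather than memorizing these names, one can look them up. The sketch below assumes the model is always the first parameter of the function; first_arg is a hypothetical helper written for this illustration, not part of any package.

```r
library(lattice)  # xyplot, tmd
library(nlme)     # gls, lme

# Hypothetical helper: the name of a function's first formal argument.
first_arg <- function(f) names(formals(f))[1]

# One function per variant discussed in this post.
model_arg <- sapply(
  list(lm = lm, xyplot = xyplot, tmd = tmd, gls = gls, lme = lme),
  first_arg
)
model_arg
```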

The following are examples of functions using the other variants.

Example 2:

Using xyplot directly in a data pipeline will raise an error

library(lattice)
iris %>%
  xyplot(Sepal.Length + Sepal.Width ~ Petal.Length + Petal.Width | Species,
         scales = "free", layout = c(2, 2),
         auto.key = list(x = .6, y = .7, corner = c(0, 0)))
## Error in UseMethod("xyplot"): no applicable method for 'xyplot' applied to an object of class "data.frame"

unless x is specified.

iris %>%
  xyplot(x = Sepal.Length + Sepal.Width ~ Petal.Length + Petal.Width | Species,
         scales = "free", layout = c(2, 2),
         auto.key = list(x = .6, y = .7, corner = c(0, 0)))

[Figure: the resulting lattice plot, one panel per Species]

Example 3:

Using tmd (a different function in the same package)
directly in a data pipeline will raise an error

library(lattice)

iris %>%
  tmd(Sepal.Length + Sepal.Width ~ Petal.Length + Petal.Width | Species,
      scales = "free", layout = c(2, 2),
      auto.key = list(x = .6, y = .7, corner = c(0, 0)))
## Error in UseMethod("tmd"): no applicable method for 'tmd' applied to an object of class "data.frame"

unless object is specified.

iris %>%
  tmd(object = Sepal.Length + Sepal.Width ~ Petal.Length + Petal.Width | Species,
      scales = "free", layout = c(2, 2),
      auto.key = list(x = .6, y = .7, corner = c(0, 0)))

[Figure: the resulting Tukey mean-difference plot, one panel per Species]

Example 4:

Using gls directly in a data pipeline
will raise an error

library(nlme)

Ovary %>%
  gls(follicles ~ sin(2*pi*Time) + cos(2*pi*Time),
      correlation = corAR1(form = ~ 1 | Mare))
## Error in gls(., follicles ~ sin(2 * pi * Time) + cos(2 * pi * Time), correlation = corAR1(form = ~1 | : 
## model must be a formula of the form "resp ~ pred"

unless model is specified.

Ovary %>%
  gls(model = follicles ~ sin(2*pi*Time) + cos(2*pi*Time),
      correlation = corAR1(form = ~ 1 | Mare))
## Generalized least squares fit by REML
##   Model: follicles ~ sin(2 * pi * Time) + cos(2 * pi * Time) 
##   Data: . 
##   Log-restricted-likelihood: -780.7273
## 
## Coefficients:
##        (Intercept) sin(2 * pi * Time) cos(2 * pi * Time) 
##         12.2163982         -2.7747122         -0.8996047 
## 
## Correlation Structure: AR(1)
##  Formula: ~1 | Mare 
##  Parameter estimate(s):
##       Phi 
## 0.7532079 
## Degrees of freedom: 308 total; 305 residual
## Residual standard error: 4.616172

Example 5:

Using lme directly in a data pipeline
will raise an error

library(nlme)

Orthodont %>%
  lme(distance ~ age)
## Error in (function (fixed, data = sys.frame(sys.parent()), random, correlation = NULL, : formal argument "data" matched by multiple actual arguments

unless fixed(!) is specified.

Orthodont %>%
  lme(fixed = distance ~ age)
## Linear mixed-effects model fit by REML
##   Data: . 
##   Log-restricted-likelihood: -221.3183
##   Fixed: distance ~ age 
## (Intercept)         age 
##  16.7611111   0.6601852 
## 
## Random effects:
##  Formula: ~age | Subject
##  Structure: General positive-definite
##             StdDev    Corr  
## (Intercept) 2.3270339 (Intr)
## age         0.2264276 -0.609
## Residual    1.3100399       
## 
## Number of Observations: 108
## Number of Groups: 27

Having to remember the name of the
parameter associated with the model in each case
is inconvenient, may be error prone, and gives an
inconsistent look and feel to an otherwise elegant
interface.

Moreover, it is considered good practice
in R not to name the first two arguments (and in
pipes the first is implicit), and to
name the remaining ones.

Not having to specify the name of the
model argument completely hides the heterogeneity of names
that can be associated with it. You only write the model
and completely forget which name has been assigned to it.

More complicated workarounds

There are functions that rely on the order of the parameters
(such as aggregate, cor.test, and 28 others I have found so far) that will still
raise an error even if you name the model.

In fact, it is not always true that naming the parameters
in a function call lets you write them in any order you want.

One example is cor.test:

1) Unnamed parameters in the natural order. Works

cor.test(~ CONT + INTG, USJudgeRatings)
## 
## 	Pearson's product-moment correlation
## 
## data:  CONT and INTG
## t = -0.8605, df = 41, p-value = 0.3945
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.4168591  0.1741182
## sample estimates:
##        cor 
## -0.1331909

2) Named parameters in the natural order. Works

cor.test(formula = ~ CONT + INTG, data = USJudgeRatings)
## 
## 	Pearson's product-moment correlation
## 
## data:  CONT and INTG
## t = -0.8605, df = 41, p-value = 0.3945
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.4168591  0.1741182
## sample estimates:
##        cor 
## -0.1331909

3) Named parameters with the order changed. Doesn’t work!

cor.test(data = USJudgeRatings, formula = ~ CONT + INTG)
## Error in cor.test.default(data = USJudgeRatings, formula = ~CONT + INTG): argument "x" is missing, with no default
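
One way to read this behaviour, offered as an assumption rather than a statement about the internals of cor.test: when nothing matches the generic's first parameter, x, dispatch falls back on whichever argument was supplied first, so putting data first selects the default method instead of the formula method. The sketch below reproduces the pattern with a hypothetical generic, describe, that is not part of any package.

```r
# Hypothetical generic with the same shape as cor.test: the generic's first
# parameter is `x`, while the formula method calls its first parameter `formula`.
describe <- function(x, ...) UseMethod("describe")
describe.formula <- function(formula, data, ...) "formula method"
describe.default <- function(x, ...) "default method"

# Formula supplied first: dispatch sees an object of class "formula".
natural_order <- describe(formula = mpg ~ wt, data = mtcars)

# Data supplied first: dispatch sees a data frame, for which no method is
# defined here, so the default method is chosen.
changed_order <- describe(data = mtcars, formula = mpg ~ wt)

c(natural_order, changed_order)
```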

Let’s see what happens if we want to add these cases to the %>% pipeline.

Example of error 1: cor.test

Using cor.test directly in a data pipeline
will raise an error

USJudgeRatings %>%
  cor.test(~ CONT + INTG)
## Error in cor.test.default(., ~CONT + INTG): 'x' and 'y' must have the same length

even when specifying formula (as it should be according to
the documentation).

USJudgeRatings %>%
  cor.test(formula = ~ CONT + INTG)
## Error in cor.test.default(., formula = ~CONT + INTG): argument "y" is missing, with no default

Was it y then?

USJudgeRatings %>%
  cor.test(y = ~ CONT + INTG)
## Error in cor.test.default(., y = ~CONT + INTG): 'x' and 'y' must have the same length

No…

Was it x then?

USJudgeRatings %>%
  cor.test(x = ~ CONT + INTG)
## Error in cor.test.formula(., x = ~CONT + INTG): 'formula' missing or invalid

No

Example of error 2: aggregate

Using aggregate directly in a data pipeline
will raise an error

ToothGrowth %>%
  aggregate(len ~ ., mean)
## Error in aggregate.data.frame(., len ~ ., mean): 'by' must be a list

even when specifying formula

ToothGrowth %>%
  aggregate(formula = len ~ ., mean)
## Error in match.fun(FUN): argument "FUN" is missing, with no default

or other variants.

Example of error 3: lda

Using lda directly in a data pipeline
will raise an error

library(MASS)

Iris <- data.frame(rbind(iris3[,,1], iris3[,,2], iris3[,,3]),
                   Sp = rep(c("s","c","v"), rep(50,3)))
Iris %>%
  lda(Sp ~ .)
## Error in lda.default(x, grouping, ...): nrow(x) and length(grouping) are different

even when specifying formula.

Iris %>%
  lda(formula = Sp ~ .)
## Error in lda.default(x, grouping, ...): argument "grouping" is missing, with no default

or other variants.

Let’s try another strategy. Let’s see
if the %$% operator, which
exposes the variables inside
the data structure by name, can be of help.

Iris %$%
  lda(Sp ~ .)
## Error in terms.formula(formula, data = data): '.' in formula and no 'data' argument

Still no…

One last try…

Iris %$%
  lda(Sp ~ Sepal.L. + Sepal.W. + Petal.L. + Petal.W.)
## Call:
## lda(Sp ~ Sepal.L. + Sepal.W. + Petal.L. + Petal.W.)
## 
## Prior probabilities of groups:
##         c         s         v 
## 0.3333333 0.3333333 0.3333333 
## 
## Group means:
##   Sepal.L. Sepal.W. Petal.L. Petal.W.
## c    5.936    2.770    4.260    1.326
## s    5.006    3.428    1.462    0.246
## v    6.588    2.974    5.552    2.026
## 
## Coefficients of linear discriminants:
##                 LD1         LD2
## Sepal.L. -0.8293776  0.02410215
## Sepal.W. -1.5344731  2.16452123
## Petal.L.  2.2012117 -0.93192121
## Petal.W.  2.8104603  2.83918785
## 
## Proportion of trace:
##    LD1    LD2 
## 0.9912 0.0088

Finally! But… we had to specify all the variables
(and there may be many of them), and use %$% instead of %>%.
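
A rough base R sketch of what is going on, assuming that %$% behaves much like with(): the columns of the data frame become visible as variables, but no data argument reaches the modelling function, so a . on the right-hand side of the formula cannot be expanded. lm and LifeCycleSavings are used here instead of lda and Iris only to keep the sketch free of extra packages.

```r
# Columns are visible by name inside with(), so an explicit formula works.
fit <- with(LifeCycleSavings, lm(sr ~ pop15 + pop75 + dpi + ddpi))

# The `.` shorthand needs a `data` argument to know which columns to expand
# to; inside with() there is none, so the call fails and we keep the message.
msg <- tryCatch(
  with(LifeCycleSavings, lm(sr ~ .)),
  error = function(e) conditionMessage(e)
)
msg
```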

There is still another workaround that allows
these functions to be used directly in a pipeline.
It requires the use of another function (with)
encapsulating the offending function. Here it goes:

Iris %>%
  with(lda(Sp ~ ., .))
## Call:
## lda(Sp ~ ., data = .)
## 
## Prior probabilities of groups:
##         c         s         v 
## 0.3333333 0.3333333 0.3333333 
## 
## Group means:
##   Sepal.L. Sepal.W. Petal.L. Petal.W.
## c    5.936    2.770    4.260    1.326
## s    5.006    3.428    1.462    0.246
## v    6.588    2.974    5.552    2.026
## 
## Coefficients of linear discriminants:
##                 LD1         LD2
## Sepal.L. -0.8293776  0.02410215
## Sepal.W. -1.5344731  2.16452123
## Petal.L.  2.2012117 -0.93192121
## Petal.W.  2.8104603  2.83918785
## 
## Proportion of trace:
##    LD1    LD2 
## 0.9912 0.0088

In the case of aggregate, it goes like this:

ToothGrowth %>%
  with(aggregate(len ~ ., ., mean))
##   supp dose   len
## 1   OJ  0.5 13.23
## 2   VC  0.5  7.98
## 3   OJ  1.0 22.70
## 4   VC  1.0 16.77
## 5   OJ  2.0 26.06
## 6   VC  2.0 26.14

In addition, there is the added complexity of
interpreting the meaning of each of those dots
(unfortunately, they do not mean the same thing),
which may cause confusion, particularly at a future
time when you have to remember why you
did this to yourself… The first . tells the model to include on its
right-hand side all the variables in the data except len;
the second . is the name of the data
structure passed by the pipe. Yes, it is called .!
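
The two roles of the dot can be seen without the pipe. This is a sketch under the assumption that the pipe simply makes its left-hand side available under the name ., which is emulated here with an ordinary assignment; agg_dots and agg_plain are illustrative names.

```r
# Emulate what the pipe does: bind the left-hand side to a variable named `.`
. <- ToothGrowth

# First `.`: formula shorthand for "every column except len".
# Second `.`: the variable `.` defined above, passed as the data argument.
agg_dots <- with(., aggregate(len ~ ., ., mean))

# The same call where the only `.` left is the formula shorthand.
agg_plain <- aggregate(len ~ ., data = ToothGrowth, FUN = mean)

# Both calls should produce the same table of means.
all.equal(agg_dots, agg_plain)
```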

It is also a solution for the earlier cor.test case
(and it should work in any case):

USJudgeRatings %>%
  with(cor.test(~ CONT + INTG, .))
## 
## 	Pearson's product-moment correlation
## 
## data:  CONT and INTG
## t = -0.8605, df = 41, p-value = 0.3945
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.4168591  0.1741182
## sample estimates:
##        cor 
## -0.1331909

Undoubtedly, there may be more elegant workarounds that
I am unaware of. But the point is that, no matter how elegant,
they will still be, well, workarounds.
You are forcing misbehaving functions
into something that is unnatural to them:

  • in some cases you had to name the parameters,
  • in others you had to use %$% instead of %>% and were not allowed
    to use . in your model definition,
  • if you wanted to use %>% you also had to use
    with and include . as the second parameter.

The idea of avoiding such “hacks”
motivated me to write intubate.

The intubate alternative

library(intubate)

For Example 1:

No need to specify formula.

LifeCycleSavings %>%
  ntbt(lm, sr ~ .)
## 
## Call:
## lm(formula = sr ~ ., data = .)
## 
## Coefficients:
## (Intercept)        pop15        pop75          dpi         ddpi  
##  28.5660865   -0.4611931   -1.6914977   -0.0003369    0.4096949

or

<span class="n">LifeCycleSavings</span><span class="w"> </span><span class="o">%>%</span><span class="w"> 
  </span><span class="n">ntbt_lm</span><span class="p">(</span><span class="n">sr</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">.</span><span class="p">)</span><span class="w">
</span>
## 
## Call:
## lm(formula = sr ~ ., data = .)
## 
## Coefficients:
## (Intercept)        pop15        pop75          dpi         ddpi  
##  28.5660865   -0.4611931   -1.6914977   -0.0003369    0.4096949
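
To picture what a data-first interface has to do for Example 1, here is a minimal sketch. The helper data_first is hypothetical and is not how intubate is implemented; it also assumes that the wrapped function accepts an argument called data, which is true for lm but not for every function.

```r
library(magrittr)

# Hypothetical helper (NOT part of intubate): receive the data first,
# then forward the remaining arguments to `fun`, passing the data
# through the `data` argument of `fun`.
data_first <- function(data, fun, ...) {
  fun(..., data = data)
}

# The data frame can now come from the left-hand side of the pipe.
fit_piped <- LifeCycleSavings %>%
  data_first(lm, sr ~ .)

# Same model fitted the usual way, for comparison.
fit_direct <- lm(sr ~ ., data = LifeCycleSavings)

all.equal(coef(fit_piped), coef(fit_direct))
```

The sketch only covers the simplest case; the point is that the model can stay unnamed while the data still arrives as the first argument.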

For Example 2:

No need to specify x.

<span class="n">iris</span><span class="w"> </span><span class="o">%>%</span><span class="w">
  </span><span class="n">ntbt</span><span class="p">(</span><span class="n">xyplot</span><span class="p">,</span><span class="w"> </span><span class="n">Sepal.Length</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">Sepal.Width</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">Petal.Length</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">Petal.Width</span><span class="w"> </span><span class="o">|</span><span class="w"> </span><span class="n">Species</span><span class="p">,</span><span class="w">
       </span><span class="n">scales</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"free"</span><span class="p">,</span><span class="w"> </span><span class="n">layout</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="m">2</span><span class="p">),</span><span class="w">
       </span><span class="n">auto.key</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">list</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">.6</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">.7</span><span class="p">,</span><span class="w"> </span><span class="n">corner</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="m">0</span><span class="p">)))</span><span class="w">
</span>


or

<span class="n">iris</span><span class="w"> </span><span class="o">%>%</span><span class="w">
  </span><span class="n">ntbt_xyplot</span><span class="p">(</span><span class="n">Sepal.Length</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">Sepal.Width</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">Petal.Length</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">Petal.Width</span><span class="w"> </span><span class="o">|</span><span class="w"> </span><span class="n">Species</span><span class="p">,</span><span class="w">
              </span><span class="n">scales</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"free"</span><span class="p">,</span><span class="w"> </span><span class="n">layout</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="m">2</span><span class="p">),</span><span class="w">
              </span><span class="n">auto.key</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">list</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">.6</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">.7</span><span class="p">,</span><span class="w"> </span><span class="n">corner</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="m">0</span><span class="p">)))</span><span class="w">
</span>


For Example 3:

No need to specify object.

<span class="n">iris</span><span class="w"> </span><span class="o">%>%</span><span class="w">
  </span><span class="n">ntbt</span><span class="p">(</span><span class="n">tmd</span><span class="p">,</span><span class="w"> </span><span class="n">Sepal.Length</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">Sepal.Width</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">Petal.Length</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">Petal.Width</span><span class="w"> </span><span class="o">|</span><span class="w"> </span><span class="n">Species</span><span class="p">,</span><span class="w">
       </span><span class="n">scales</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"free"</span><span class="p">,</span><span class="w"> </span><span class="n">layout</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="m">2</span><span class="p">),</span><span class="w">
       </span><span class="n">auto.key</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">list</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">.6</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">.7</span><span class="p">,</span><span class="w"> </span><span class="n">corner</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="m">0</span><span class="p">)))</span><span class="w">
</span>


or

<span class="n">iris</span><span class="w"> </span><span class="o">%>%</span><span class="w">
  </span><span class="n">ntbt_tmd</span><span class="p">(</span><span class="n">Sepal.Length</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">Sepal.Width</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">Petal.Length</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">Petal.Width</span><span class="w"> </span><span class="o">|</span><span class="w"> </span><span class="n">Species</span><span class="p">,</span><span class="w">
           </span><span class="n">scales</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"free"</span><span class="p">,</span><span class="w"> </span><span class="n">layout</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="m">2</span><span class="p">),</span><span class="w">
           </span><span class="n">auto.key</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">list</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">.6</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">.7</span><span class="p">,</span><span class="w"> </span><span class="n">corner</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="m">0</span><span class="p">)))</span><span class="w">
</span>


For Example 4:

No need to specify model.

<span class="n">Ovary</span><span class="w"> </span><span class="o">%>%</span><span class="w">
  </span><span class="n">ntbt</span><span class="p">(</span><span class="n">gls</span><span class="p">,</span><span class="w"> </span><span class="n">follicles</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="nf">sin</span><span class="p">(</span><span class="m">2</span><span class="o">*</span><span class="nb">pi</span><span class="o">*</span><span class="n">Time</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="nf">cos</span><span class="p">(</span><span class="m">2</span><span class="o">*</span><span class="nb">pi</span><span class="o">*</span><span class="n">Time</span><span class="p">),</span><span class="w">
       </span><span class="n">correlation</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">corAR1</span><span class="p">(</span><span class="n">form</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="m">1</span><span class="w"> </span><span class="o">|</span><span class="w"> </span><span class="n">Mare</span><span class="p">))</span><span class="w">
</span>
## Generalized least squares fit by REML
##   Model: follicles ~ sin(2 * pi * Time) + cos(2 * pi * Time) 
##   Data: NULL 
##   Log-restricted-likelihood: -780.7273
## 
## Coefficients:
##        (Intercept) sin(2 * pi * Time) cos(2 * pi * Time) 
##         12.2163982         -2.7747122         -0.8996047 
## 
## Correlation Structure: AR(1)
##  Formula: ~1 | Mare 
##  Parameter estimate(s):
##       Phi 
## 0.7532079 
## Degrees of freedom: 308 total; 305 residual
## Residual standard error: 4.616172

or

<span class="n">Ovary</span><span class="w"> </span><span class="o">%>%</span><span class="w">
  </span><span class="n">ntbt_gls</span><span class="p">(</span><span class="n">follicles</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="nf">sin</span><span class="p">(</span><span class="m">2</span><span class="o">*</span><span class="nb">pi</span><span class="o">*</span><span class="n">Time</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="nf">cos</span><span class="p">(</span><span class="m">2</span><span class="o">*</span><span class="nb">pi</span><span class="o">*</span><span class="n">Time</span><span class="p">),</span><span class="w">
           </span><span class="n">correlation</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">corAR1</span><span class="p">(</span><span class="n">form</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="m">1</span><span class="w"> </span><span class="o">|</span><span class="w"> </span><span class="n">Mare</span><span class="p">))</span><span class="w">
</span>
## Generalized least squares fit by REML
##   Model: follicles ~ sin(2 * pi * Time) + cos(2 * pi * Time) 
##   Data: NULL 
##   Log-restricted-likelihood: -780.7273
## 
## Coefficients:
##        (Intercept) sin(2 * pi * Time) cos(2 * pi * Time) 
##         12.2163982         -2.7747122         -0.8996047 
## 
## Correlation Structure: AR(1)
##  Formula: ~1 | Mare 
##  Parameter estimate(s):
##       Phi 
## 0.7532079 
## Degrees of freedom: 308 total; 305 residual
## Residual standard error: 4.616172

For Example 5:

No need to specify fixed.

<span class="n">Orthodont</span><span class="w"> </span><span class="o">%>%</span><span class="w">
  </span><span class="n">ntbt</span><span class="p">(</span><span class="n">lme</span><span class="p">,</span><span class="w"> </span><span class="n">distance</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">age</span><span class="p">)</span><span class="w">
</span>
## Linear mixed-effects model fit by REML
##   Data: . 
##   Log-restricted-likelihood: -221.3183
##   Fixed: distance ~ age 
## (Intercept)         age 
##  16.7611111   0.6601852 
## 
## Random effects:
##  Formula: ~age | Subject
##  Structure: General positive-definite
##             StdDev    Corr  
## (Intercept) 2.3270339 (Intr)
## age         0.2264276 -0.609
## Residual    1.3100399       
## 
## Number of Observations: 108
## Number of Groups: 27

or

<span class="n">Orthodont</span><span class="w"> </span><span class="o">%>%</span><span class="w">
  </span><span class="n">ntbt_lme</span><span class="p">(</span><span class="n">distance</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">age</span><span class="p">)</span><span class="w">
</span>
## Linear mixed-effects model fit by REML
##   Data: . 
##   Log-restricted-likelihood: -221.3183
##   Fixed: distance ~ age 
## (Intercept)         age 
##  16.7611111   0.6601852 
## 
## Random effects:
##  Formula: ~age | Subject
##  Structure: General positive-definite
##             StdDev    Corr  
## (Intercept) 2.3270339 (Intr)
## age         0.2264276 -0.609
## Residual    1.3100399       
## 
## Number of Observations: 108
## Number of Groups: 27
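
Examples 1 to 5 each required a different parameter name in the naming workaround (formula, x, object, model, fixed). The following sketch shows one way to look those names up; the helper first_arg is made up for this illustration, and it assumes the lattice and nlme packages are installed.

```r
# Hypothetical helper: name of the first formal argument of a function.
first_arg <- function(fun) names(formals(fun))[1]

# Functions used in Examples 1 to 5.
first_args <- c(
  lm     = first_arg(stats::lm),
  xyplot = first_arg(lattice::xyplot),
  tmd    = first_arg(lattice::tmd),
  gls    = first_arg(nlme::gls),
  lme    = first_arg(nlme::lme)
)

print(first_args)
```

Five functions, five different names to remember, which is the burden that the unnamed ntbt calls above avoid.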

For Example of error 1:

It simply works.

<span class="n">USJudgeRatings</span><span class="w"> </span><span class="o">%>%</span><span class="w">
  </span><span class="n">ntbt</span><span class="p">(</span><span class="n">cor.test</span><span class="p">,</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">CONT</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">INTG</span><span class="p">)</span><span class="w">
</span>
## 
## 	Pearson's product-moment correlation
## 
## data:  CONT and INTG
## t = -0.8605, df = 41, p-value = 0.3945
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.4168591  0.1741182
## sample estimates:
##        cor 
## -0.1331909

or

<span class="n">USJudgeRatings</span><span class="w"> </span><span class="o">%>%</span><span class="w">
  </span><span class="n">ntbt_cor.test</span><span class="p">(</span><span class="o">~</span><span class="w"> </span><span class="n">CONT</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">INTG</span><span class="p">)</span><span class="w">
</span>
## 
## 	Pearson's product-moment correlation
## 
## data:  CONT and INTG
## t = -0.8605, df = 41, p-value = 0.3945
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.4168591  0.1741182
## sample estimates:
##        cor 
## -0.1331909

For Example of error 2:

It simply works.

<span class="n">ToothGrowth</span><span class="w"> </span><span class="o">%>%</span><span class="w">
  </span><span class="n">ntbt</span><span class="p">(</span><span class="n">aggregate</span><span class="p">,</span><span class="w"> </span><span class="n">len</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">.</span><span class="p">,</span><span class="w"> </span><span class="n">mean</span><span class="p">)</span><span class="w">
</span>
##   supp dose   len
## 1   OJ  0.5 13.23
## 2   VC  0.5  7.98
## 3   OJ  1.0 22.70
## 4   VC  1.0 16.77
## 5   OJ  2.0 26.06
## 6   VC  2.0 26.14

or

<span class="n">ToothGrowth</span><span class="w"> </span><span class="o">%>%</span><span class="w">
  </span><span class="n">ntbt_aggregate</span><span class="p">(</span><span class="n">len</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">.</span><span class="p">,</span><span class="w"> </span><span class="n">mean</span><span class="p">)</span><span class="w">
</span>
##   supp dose   len
## 1   OJ  0.5 13.23
## 2   VC  0.5  7.98
## 3   OJ  1.0 22.70
## 4   VC  1.0 16.77
## 5   OJ  2.0 26.06
## 6   VC  2.0 26.14

For Example of error 3:

It simply works.

<span class="n">Iris</span><span class="w"> </span><span class="o">%>%</span><span class="w">
  </span><span class="n">ntbt</span><span class="p">(</span><span class="n">lda</span><span class="p">,</span><span class="w"> </span><span class="n">Sp</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">.</span><span class="p">)</span><span class="w">
</span>
## Call:
## lda(Sp ~ ., data = .)
## 
## Prior probabilities of groups:
##         c         s         v 
## 0.3333333 0.3333333 0.3333333 
## 
## Group means:
##   Sepal.L. Sepal.W. Petal.L. Petal.W.
## c    5.936    2.770    4.260    1.326
## s    5.006    3.428    1.462    0.246
## v    6.588    2.974    5.552    2.026
## 
## Coefficients of linear discriminants:
##                 LD1         LD2
## Sepal.L. -0.8293776  0.02410215
## Sepal.W. -1.5344731  2.16452123
## Petal.L.  2.2012117 -0.93192121
## Petal.W.  2.8104603  2.83918785
## 
## Proportion of trace:
##    LD1    LD2 
## 0.9912 0.0088

or

<span class="n">Iris</span><span class="w"> </span><span class="o">%>%</span><span class="w">
  </span><span class="n">ntbt_lda</span><span class="p">(</span><span class="n">Sp</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">.</span><span class="p">)</span><span class="w">
</span>
## Call:
## lda(Sp ~ ., data = .)
## 
## Prior probabilities of groups:
##         c         s         v 
## 0.3333333 0.3333333 0.3333333 
## 
## Group means:
##   Sepal.L. Sepal.W. Petal.L. Petal.W.
## c    5.936    2.770    4.260    1.326
## s    5.006    3.428    1.462    0.246
## v    6.588    2.974    5.552    2.026
## 
## Coefficients of linear discriminants:
##                 LD1         LD2
## Sepal.L. -0.8293776  0.02410215
## Sepal.W. -1.5344731  2.16452123
## Petal.L.  2.2012117 -0.93192121
## Petal.W.  2.8104603  2.83918785
## 
## Proportion of trace:
##    LD1    LD2 
## 0.9912 0.0088

I think the approach intubate proposes
looks consistent, elegant, simple and clean,
less error-prone, and easy to follow (of course,
keep in mind that I have a vested interest in the
success of intubate).

After all, the complication should be in
the analysis you are performing,
and not in how you are performing it.

