RObservations #37: Demystifying the tapply() function and comparing it to the “tidy” approach.

[This article was first published on r – bensstats, and kindly contributed to R-bloggers.]

Introduction

Many seasoned base R users rely on the tapply() function in many contexts and talk about how powerful it is. However, many new R users have either never seen tapply() or have seen it and are unsure how it works. The documentation is not very helpful in explaining it either:

[tapply() applies] a function to each cell of a ragged array, that is to each (non-empty) group of values given by a unique combination of the levels of certain factors.

While I saw other programmers use this function, I was unsure how it worked or when I would need it. In this blog I attempt to change that by explaining the cryptic description, showing some applications with my commentary, and comparing tapply() to the “tidy” approach with tidyverse.

My inspiration for writing this blog came from seeing Dr. Norm Matloff’s blog, where he mentions the use of tapply() and his thoughts on the tidyverse. For a more thorough treatment of his critique of the tidyverse and “tidy” methods, check out his formal essay here.

Now, on to understanding and using tapply().

What is a ragged array?

After digging around on Google, I found a good definition of ragged arrays (also referred to as jagged arrays) on GeeksforGeeks:

A [r]agged array is an array of arrays such that member arrays can be of different sizes […].

The Stan user guide however offers a better definition:

Ragged arrays are arrays that are not rectangular, but have different sized entries. This kind of structure crops up when there are different numbers of observations per entry. A general approach to dealing with ragged structure is to move to a full database-like data structure […].
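To make these definitions concrete, here is a minimal base R sketch. The vectors `x` and `g` are made up for illustration and are not part of the original examples. Splitting a vector by a factor gives groups of different lengths, which is the kind of "ragged" structure that tapply() operates on.

```r
# Hypothetical toy data: six values and a grouping factor with uneven group sizes
x <- c(5, 3, 8, 1, 9, 2)
g <- factor(c("a", "a", "a", "b", "b", "c"))

# split() returns a list with one vector per factor level: a ragged array
cells <- split(x, g)
lengths(cells)   # the group sizes differ (3, 2 and 1)

# Applying a function to each cell by hand ...
manual <- sapply(cells, sum)

# ... gives the same numbers as tapply()
via_tapply <- tapply(x, g, sum)
all.equal(as.vector(manual), as.vector(via_tapply))
```

Seen this way, tapply() is roughly "split the values by the factor, then apply the function to each piece".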

To understand this further, we can run the str() function on the warpbreaks dataset (the dataset used in the tapply() documentation):

str(warpbreaks)


## 'data.frame':    54 obs. of  3 variables:
##  $ breaks : num  26 30 54 25 70 52 51 26 67 18 ...
##  $ wool   : Factor w/ 2 levels "A","B": 1 1 1 1 1 1 1 1 1 1 ...
##  $ tension: Factor w/ 3 levels "L","M","H": 1 1 1 1 1 1 1 1 1 2 ...

Although the dataset is a data frame, it can also be understood as a ragged array. Each of its 54 rows records (1) the number of warp breaks per loom, (2) the wool type (two types) and (3) the level of tension applied (three levels). Grouping the breaks values by one or both factors gives the cells of the array, and the number of breaks can then be aggregated within each cell. We can do this by loading tidyverse and using the group_by() and summarize() functions, or we can use the tapply() function.
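As a rough check of that reading, the sketch below (base R only) splits breaks into one cell per combination of wool and tension. The name `cells` is hypothetical. In warpbreaks every cell happens to hold the same number of observations, so this particular array is not actually ragged, but tapply() would work the same way if the group sizes differed.

```r
# One cell of `breaks` values per wool/tension combination
cells <- with(warpbreaks, split(breaks, list(wool, tension)))
length(cells)     # 2 wool types x 3 tension levels = 6 cells
lengths(cells)    # number of observations in each cell

# The same counts laid out as a wool-by-tension table
table(warpbreaks$wool, warpbreaks$tension)
```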

Using tapply() vs tidyverse

Using one dimension

Suppose we are interested in determining the total number of breaks for each wool type, ignoring tension level. To do this with tapply() we would write:

tapply(X=warpbreaks$breaks,
       INDEX=warpbreaks$wool,
       FUN = sum)


##   A   B 
## 838 682
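The same call works with any grouping factor. As a sketch, swapping wool for tension gives the per-tension totals. The result is a named one-dimensional array, so individual groups can be pulled out by name. Here with() is used only to avoid repeating `warpbreaks$`, and `by_tension` is a hypothetical variable name.

```r
# Total breaks per tension level, ignoring wool type
by_tension <- with(warpbreaks, tapply(breaks, tension, sum))
by_tension

# Look up a single group by its level name
by_tension["L"]

# FUN can be any function of a vector, e.g. the mean breaks per tension level
with(warpbreaks, tapply(breaks, tension, mean))
```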

To do the same thing using tidyverse we would need to do the following:

library(tidyverse)
warpbreaks %>% 
  group_by(wool) %>% 
  summarize(total_breaks=sum(breaks)) %>% 
  ungroup() %>% 
  pivot_wider(names_from=wool,
              values_from=total_breaks)


## # A tibble: 1 x 2
##       A     B
##   <dbl> <dbl>
## 1   838   682

Using two dimensions

Suppose we are now interested in determining the total number of breaks for each combination of wool type and tension level. To do this with tapply() we would write:

tapply(X=warpbreaks$breaks,
       INDEX=warpbreaks[,-1],
       FUN = sum)


##     tension
## wool   L   M   H
##    A 401 216 221
##    B 254 259 169
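INDEX can also be given as an explicit list of factors rather than a slice of the data frame. The sketch below names the list elements so the dimension labels are kept, then indexes the resulting matrix. The variable name `totals` is hypothetical.

```r
# Same aggregation, with INDEX spelled out as a named list of factors
totals <- with(warpbreaks,
               tapply(breaks, list(wool = wool, tension = tension), sum))

totals["A", "L"]   # a single cell: wool A at low tension
rowSums(totals)    # collapses tension, giving the per-wool totals
colSums(totals)    # collapses wool, giving the per-tension totals
```

Because the result is an ordinary matrix, the usual base R tools for matrices apply to it directly.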

To do this equivalently with tidyverse we would write:

warpbreaks %>% 
  group_by(wool,tension) %>% 
  summarize(total_breaks=sum(breaks)) %>% 
  ungroup() %>% 
  pivot_wider(names_from=tension,
              values_from=total_breaks)


## # A tibble: 2 x 4
##   wool      L     M     H
##   <fct> <dbl> <dbl> <dbl>
## 1 A       401   216   221
## 2 B       254   259   169

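If we benchmark these two approaches on this dataset, tapply() is the clear winner: in both the two-dimensional and the one-dimensional case it runs roughly 185 times faster than the tidyverse pipeline.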

library(rbenchmark)

benchmark(
  'tidyverse'= {warpbreaks %>% 
  group_by(wool,tension) %>% 
  summarize(total_breaks=sum(breaks)) %>% 
  ungroup() %>% 
  pivot_wider(names_from=tension,
              values_from=total_breaks)},
  'tapply()'=tapply(X=warpbreaks$breaks,
       INDEX=warpbreaks[,-1],
       FUN = sum),
  replications = 1000
  ) 


##        test replications elapsed relative user.self sys.self user.child sys.child
## 2  tapply()         1000    0.08    1.000      0.08     0.00         NA        NA
## 1 tidyverse         1000   14.87  185.875     14.69     0.05         NA        NA


benchmark(
  'tidyverse'= {warpbreaks %>% 
  group_by(wool) %>% 
  summarize(total_breaks=sum(breaks)) %>% 
  ungroup() %>% 
  pivot_wider(names_from=wool,
              values_from=total_breaks)},
  'tapply()'=tapply(X=warpbreaks$breaks,
       INDEX=warpbreaks$wool,
       FUN = sum),
  replications = 1000
  ) 


##        test replications elapsed relative user.self sys.self user.child sys.child
## 2  tapply()         1000    0.05        1      0.04     0.00         NA        NA
## 1 tidyverse         1000    9.25      185      9.22     0.01         NA        NA
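These timings come from a 54-row dataset, where fixed per-call overhead is likely to dominate. As a hedged sketch, the code below repeats the two-dimensional comparison on a larger synthetic data frame. The data frame `big`, its size and its simulated values are assumptions made purely for illustration. It uses base R's system.time() and only runs the dplyr version if that package is installed. No timings are shown because they depend on the machine and package versions.

```r
# Hypothetical larger dataset with the same column layout as warpbreaks
set.seed(1)
n <- 1e6
big <- data.frame(
  breaks  = rpois(n, lambda = 28),
  wool    = factor(sample(c("A", "B"), n, replace = TRUE)),
  tension = factor(sample(c("L", "M", "H"), n, replace = TRUE),
                   levels = c("L", "M", "H"))
)

# Time the tapply() version
t_tapply <- system.time(
  res_tapply <- tapply(big$breaks, big[, c("wool", "tension")], sum)
)

# Time the dplyr version, but only if dplyr is available
if (requireNamespace("dplyr", quietly = TRUE)) {
  t_dplyr <- system.time(
    res_dplyr <- dplyr::summarize(
      dplyr::group_by(big, wool, tension),
      total_breaks = sum(breaks),
      .groups = "drop"
    )
  )
  # Elapsed seconds for each approach
  print(c(tapply = t_tapply[["elapsed"]], dplyr = t_dplyr[["elapsed"]]))
}
```

Running something like this on data closer in size to a real workload gives a better sense of whether the gap seen above carries over.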

Conclusion

From the benchmarks above, it's clear that tapply() is a faster way of aggregating this data than the “tidy” approach.

In my experience, “tidy” methods let code be written in the same order in which a solution to a problem is thought out. While that may not be ideal for computational speed or teaching, in my own work it has let me approach problems systematically and write code as the solution develops. But, after having worked through a few examples of using tapply(), I will definitely have it as one of my go-to functions and will experiment further with it.

What do you think? Let me know in the comments!

Thank you for reading!
