# RObservations #37: Demystifying the tapply() function and comparing it to the “tidy” approach.

This article was first published on **r – bensstats**, and kindly contributed to R-bloggers.

# Introduction

Many seasoned base R users rely on the `tapply()` function in many contexts and talk about how powerful it is. However, many new R users have either never seen `tapply()` or have seen it and are unsure how it works. The documentation is not very helpful in explaining it either:

[tapply() applies] a function to each cell of a ragged array, that is to each (non-empty) group of values given by a unique combination of the levels of certain factors.

While I saw other programmers use this function, I was unsure how it worked and when I would need to use it. In this blog I attempt to change that and explain the cryptic description by showing some applications with my commentary, along with how it compares to the “tidy” approach with `tidyverse`.

My inspiration for writing this blog came from seeing Dr. Norm Matloff’s blog, where he mentions the use of `tapply()` and his thoughts on the tidyverse. For a more thorough treatment of his critique of the tidyverse and “tidy” methods, check out his formal essay here.

Now let's get into understanding and using `tapply()`.

# What is a ragged array?

After digging around on Google, I found a good definition of ragged arrays (also referred to as jagged arrays) on GeeksforGeeks:

A [r]agged array is an array of arrays such that member arrays can be of different sizes […].

The Stan user guide, however, offers a better definition:

Ragged arrays are arrays that are not rectangular, but have different sized entries. This kind of structure crops up when there are different numbers of observations per entry. A general approach to dealing with ragged structure is to move to a full database-like data structure […].
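To make the definition concrete, here is a minimal sketch in base R. The `scores` and `group` vectors are made-up toy data (they are not part of `warpbreaks`), chosen so that each group has a different number of observations.

```r
# Hypothetical toy data: three groups with different numbers of observations
scores <- c(90, 85, 70, 60, 75, 88)
group  <- factor(c("a", "a", "a", "b", "c", "c"))

# split() returns one vector per group; the vectors differ in length,
# so together they form a ragged array
ragged <- split(scores, group)
group_sizes <- lengths(ragged)
print(group_sizes)
## a b c 
## 3 1 2 

# tapply() applies a function to each of those cells
group_means <- tapply(scores, group, mean)
print(group_means)
```

Each element of `ragged` is one "cell" in the sense of the `tapply()` documentation: the values that belong to one level of the factor.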

To understand this further, we can call the `str()` function on the `warpbreaks` dataset (the dataset used in the `tapply()` documentation):

```r
str(warpbreaks)
## 'data.frame':    54 obs. of  3 variables:
##  $ breaks : num  26 30 54 25 70 52 51 26 67 18 ...
##  $ wool   : Factor w/ 2 levels "A","B": 1 1 1 1 1 1 1 1 1 1 ...
##  $ tension: Factor w/ 3 levels "L","M","H": 1 1 1 1 1 1 1 1 1 2 ...
```

While the dataset is a data frame, it can also be understood as a ragged array. Each of the 54 observations records (1) the number of warp breaks per loom, (2) the wool type (two types) and (3) the level of tension applied (three levels). With the two factor variables it is possible to aggregate the number of breaks. We can do this by importing `tidyverse` and using the `group_by()` and `summarize()` functions, or we can use the `tapply()` function.
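One way to read the documentation's description is as a two-step process: split the values into groups, then apply a function to each group. The sketch below spells those steps out with base R's `split()` and `sapply()` and compares the result with a single `tapply()` call. It is meant as an illustration of that reading, not as a statement of how `tapply()` is implemented; the variable names are my own.

```r
# Step 1: split the breaks into one vector per tension level
by_tension <- split(warpbreaks$breaks, warpbreaks$tension)

# Step 2: apply the function to each piece
manual <- sapply(by_tension, sum)

# tapply() does both steps in one call
direct <- tapply(warpbreaks$breaks, warpbreaks$tension, sum)

print(manual)
print(direct)
```

Both calls give the same totals per tension level; `tapply()` simply saves the intermediate list.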

# Using tapply() vs tidyverse

## Using one dimension

Suppose we are interested in determining the total number of breaks for each wool type, ignoring tension level. To do this with `tapply()` we would write:

```r
tapply(X=warpbreaks$breaks, INDEX=warpbreaks$wool, FUN = sum)
##   A   B 
## 838 682
```

To do the same thing using `tidyverse` we would need to do the following:

```r
library(tidyverse)

warpbreaks %>%
  group_by(wool) %>%
  summarize(total_breaks=sum(breaks)) %>%
  ungroup() %>%
  pivot_wider(names_from=wool, values_from=total_breaks)
## # A tibble: 1 x 2
##       A     B
##   <dbl> <dbl>
## 1   838   682
```
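The `FUN` argument is not limited to `sum`, and any extra arguments given to `tapply()` are passed on to `FUN`. The sketch below shows both with base R only. The data frame `wb` is a hypothetical copy of `warpbreaks` with one value set to `NA`, created purely to illustrate `na.rm`.

```r
# Mean number of breaks per wool type
mean_by_wool <- tapply(warpbreaks$breaks, warpbreaks$wool, mean)
print(mean_by_wool)

# Hypothetical copy with a missing value in the first row (wool type A)
wb <- warpbreaks
wb$breaks[1] <- NA

# Without na.rm the total for wool A is NA
sum_with_na <- tapply(wb$breaks, wb$wool, sum)
print(sum_with_na)

# na.rm = TRUE is forwarded to sum(), so the NA is dropped
sum_na_rm <- tapply(wb$breaks, wb$wool, sum, na.rm = TRUE)
print(sum_na_rm)
```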

## Using two dimensions

Suppose we are now interested in determining the total number of breaks for each combination of wool type and tension level. To do this with `tapply()` we would write:

```r
tapply(X=warpbreaks$breaks, INDEX=warpbreaks[,-1], FUN = sum)
##     tension
## wool   L   M   H
##    A 401 216 221
##    B 254 259 169
```

To do this equivalently with tidyverse we would write:

```r
warpbreaks %>%
  group_by(wool,tension) %>%
  summarize(total_breaks=sum(breaks)) %>%
  ungroup() %>%
  pivot_wider(names_from=tension, values_from=total_breaks)
## # A tibble: 2 x 4
##   wool      L     M     H
##   <fct> <dbl> <dbl> <dbl>
## 1 A       401   216   221
## 2 B       254   259   169
```
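The two-factor `tapply()` call returns a matrix, whereas the “tidy” pipeline works with one row per group before `pivot_wider()` reshapes it. If the long layout is what you want, one possible base R route is sketched below. The named list passed as `INDEX` and the column name `total_breaks` are my own choices for illustration.

```r
# Same table as warpbreaks[,-1] would give, but with an explicitly named list as INDEX
totals <- tapply(
  X     = warpbreaks$breaks,
  INDEX = list(wool = warpbreaks$wool, tension = warpbreaks$tension),
  FUN   = sum
)

# Convert the wool-by-tension matrix into one row per group
totals_long <- as.data.frame(as.table(totals), responseName = "total_breaks")
print(totals_long)
```

The names given in the `INDEX` list become the dimension names of the matrix and, after conversion, the column names of the data frame.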

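If we benchmark the `tapply()` and tidyverse approaches, we see that `tapply()` is the clear winner: over 1000 replications on this small dataset it runs roughly 185 times faster in both the two-factor and the one-factor case.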

```r
library(rbenchmark)

# Two grouping factors (wool and tension)
benchmark(
  'tidyverse'= {warpbreaks %>%
      group_by(wool,tension) %>%
      summarize(total_breaks=sum(breaks)) %>%
      ungroup() %>%
      pivot_wider(names_from=tension, values_from=total_breaks)},
  'tapply()'=tapply(X=warpbreaks$breaks, INDEX=warpbreaks[,-1], FUN = sum),
  replications = 1000
)
##        test replications elapsed relative user.self sys.self user.child sys.child
## 2  tapply()         1000    0.08    1.000      0.08     0.00         NA        NA
## 1 tidyverse         1000   14.87  185.875     14.69     0.05         NA        NA

# One grouping factor (wool)
benchmark(
  'tidyverse'= {warpbreaks %>%
      group_by(wool) %>%
      summarize(total_breaks=sum(breaks)) %>%
      ungroup() %>%
      pivot_wider(names_from=wool, values_from=total_breaks)},
  'tapply()'=tapply(X=warpbreaks$breaks, INDEX=warpbreaks$wool, FUN = sum),
  replications = 1000
)
##        test replications elapsed relative user.self sys.self user.child sys.child
## 2  tapply()         1000    0.05        1      0.04     0.00         NA        NA
## 1 tidyverse         1000    9.25      185      9.22     0.01         NA        NA
```

# Conclusion

From the examples outlined above, it's clear that `tapply()` makes for a more efficient method of aggregating data than using the “tidy” approach.

In my experience, “tidy” methods let code be written in the same way a solution to a given problem is thought out. While this may not be ideal for computational speed or teaching, in my own work it has allowed me to approach problems systematically and to write code as the solution is developed. But, after having worked through a few examples of using `tapply()`, I will definitely have it as one of my go-to functions and will experiment further with it.

What do you think? Let me know in the comments!

Thank you for reading!
