Wrangling Data with R: A Guide to the tapply() Function

Steven P. Sanderson II, MPH

10 hours ago

[This article was first published on Steve's Data Tips and Tricks, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

< section id="introduction" class="level1">

Introduction

Hey R enthusiasts! Today we’re diving into the world of data manipulation with a fantastic function called tapply(). This little gem lets you apply a function of your choice to different subgroups within your data.

Imagine you have a dataset on trees, with a column for tree height and another for species. You might want to know the average height for each species. tapply() comes to the rescue!

< section id="understanding-the-syntax" class="level1">

Understanding the Syntax

Let’s break down the syntax of tapply():

tapply(X, INDEX, FUN, simplify = TRUE)

X: This is the vector or variable you want to perform the function on.
INDEX: This is the factor variable that defines the groups. Each level in the factor acts as a subgroup for applying the function.
FUN: This is the function you want to apply to each subgroup. It can be built-in functions like mean() or sd(), or even custom functions you write!
simplify (optional): By default, simplify = TRUE (recommended for most cases). This returns a nice, condensed output that’s easy to work with. Setting it to FALSE gives you a more complex structure.

< section id="examples-in-action" class="level1">

Examples in Action

< section id="example-1-average-tree-height-by-species" class="level2">

Example 1: Average Tree Height by Species

Let’s say we have a data frame trees with columns “height” (numeric) and “species” (factor):

# Sample data
trees <- data.frame(height = c(20, 30, 25, 40, 15, 28),
                    species = c("Oak", "Oak", "Maple", "Pine", "Maple", "Pine"))

# Average height per species
average_height <- tapply(trees$height, trees$species, mean)
print(average_height)

Maple   Oak  Pine 
   20    25    34

This code calculates the average height for each species in the “species” column and stores the results in average_height. The output will be a named vector showing the average height for each unique species.

< section id="example-2-exploring-distribution-with-summary-statistics" class="level2">

Example 2: Exploring Distribution with Summary Statistics

We can use tapply() with summary() to get a quick overview of how a variable is distributed within groups. Here, we’ll see the distribution of height within each species:

summary_by_species <- tapply(trees$height, trees$species, summary)
print(summary_by_species)

$Maple
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   15.0    17.5    20.0    20.0    22.5    25.0 

$Oak
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   20.0    22.5    25.0    25.0    27.5    30.0 

$Pine
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
     28      31      34      34      37      40

This code applies the summary() function to each subgroup defined by the “species” factor. The output will be a data frame showing various summary statistics (like minimum, maximum, quartiles) for the height of each species.

< section id="example-3-custom-function-for-identifying-tall-trees" class="level2">

Example 3: Custom Function for Identifying Tall Trees

Let’s create a custom function to find trees that are taller than the average height of their species:

tall_trees <- function(height, avg_height) {
    height > avg_height
}

# Find tall trees within each species
tall_trees_by_species <- tapply(trees$height, trees$species, mean(trees$height),FUN=tall_trees)
print(tall_trees_by_species)

$Maple
[1] FALSE FALSE

$Oak
[1] FALSE  TRUE

$Pine
[1] TRUE TRUE

Here, we define a function tall_trees() that takes a tree’s height and the average height (passed as arguments) and returns TRUE if the tree’s height is greater. We then use tapply() with this custom function. The crucial difference here is that we use mean(trees$height) within the FUN argument to calculate the average height for each group outside of the custom function. This ensures the average height is calculated correctly for each subgroup before being compared to individual tree heights. The output will be a logical vector for each species, indicating which trees are taller than the average.

< section id="give-it-a-try" class="level1">

Give it a Try!

This is just a taste of what tapply() can do. There are endless possibilities for grouping data and applying functions. Try it out on your own datasets! Here are some ideas:

Calculate the median income for different age groups.
Find the most frequent word used in emails sent by different departments.
Group customers by purchase history and analyze their average spending.

Remember, R is all about exploration. So dive in, play with tapply(), and see what insights you can uncover from your data!

To leave a comment for the author, please follow the link and comment on their blog: Steve's Data Tips and Tricks.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.