# Wrangling Data with R: A Guide to the tapply() Function

First published on **Steve's Data Tips and Tricks** and kindly contributed to R-bloggers.


# Introduction

Hey R enthusiasts! Today we’re diving into the world of data manipulation with a fantastic function called `tapply()`. This little gem lets you apply a function of your choice to different subgroups within your data.

Imagine you have a dataset on trees, with a column for tree height and another for species. You might want to know the average height for each species. `tapply()` comes to the rescue!

# Understanding the Syntax

Let’s break down the syntax of `tapply()`:

```r
tapply(X, INDEX, FUN, simplify = TRUE)
```

- **X**: This is the vector or variable you want to perform the function on.
- **INDEX**: This is the factor variable that defines the groups (or a list of factors, each the same length as `X`). Each level in the factor acts as a subgroup for applying the function. Values that are not already factors are converted with `as.factor()`.
- **FUN**: This is the function you want to apply to each subgroup. It can be a built-in function like `mean()` or `sd()`, or even a custom function you write!
- **simplify (optional)**: By default, `simplify = TRUE` (recommended for most cases). When `FUN` returns a single value per group, this gives you a compact array named by group that’s easy to work with. Setting it to `FALSE`, or using a `FUN` that returns more than one value per group, gives you a list-like array with one element per group.
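To make the effect of `simplify` concrete, here is a minimal sketch using a small made-up vector of scores and groups (`scores` and `groups` are illustrative names, not from the original post). With a function that returns one value per group, the two settings hold the same numbers in different containers.

```r
# Hypothetical data: six scores split into two groups
scores <- c(10, 20, 30, 40, 50, 60)
groups <- factor(c("a", "b", "a", "b", "a", "b"))

# Default: scalar results are simplified into a named numeric array
simplified <- tapply(scores, groups, mean)
print(simplified)

# simplify = FALSE: one list element per group
unsimplified <- tapply(scores, groups, mean, simplify = FALSE)
print(unsimplified)

# Same values, different containers
print(is.numeric(simplified))
print(is.list(unsimplified))
```

The simplified form is usually easier to plot or index, while the list form is useful when you plan to pass each group’s result to another function.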

# Examples in Action

## Example 1: Average Tree Height by Species

Let’s say we have a data frame `trees` with columns “height” (numeric) and “species” (a character column, which `tapply()` converts to a factor for grouping):

```r
# Sample data
trees <- data.frame(
  height = c(20, 30, 25, 40, 15, 28),
  species = c("Oak", "Oak", "Maple", "Pine", "Maple", "Pine")
)

# Average height per species
average_height <- tapply(trees$height, trees$species, mean)
print(average_height)
```

```
Maple   Oak  Pine 
   20    25    34 
```

This code calculates the average height for each species in the “species” column and stores the results in `average_height`. The output is a one-dimensional array named by species, which you can index like a named vector, for example `average_height["Oak"]`.
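Real data often has missing values. Any extra arguments you give `tapply()` after `FUN` are passed on to `FUN`, so `na.rm = TRUE` can be forwarded to `mean()`. The sketch below uses a made-up copy of the sample data with one missing height (`trees_na` is a hypothetical name used only for this illustration).

```r
# Hypothetical variant of the sample data with one missing height
trees_na <- data.frame(
  height  = c(20, NA, 25, 40, 15, 28),
  species = c("Oak", "Oak", "Maple", "Pine", "Maple", "Pine")
)

# Without na.rm, a group that contains an NA yields NA
mean_with_na <- tapply(trees_na$height, trees_na$species, mean)
print(mean_with_na)

# na.rm = TRUE is forwarded to mean() for every group
mean_no_na <- tapply(trees_na$height, trees_na$species, mean, na.rm = TRUE)
print(mean_no_na)
```

Only the Oak group is affected here: its mean is `NA` in the first call and is based on the single remaining Oak height in the second.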

## Example 2: Exploring Distribution with Summary Statistics

We can use `tapply()` with `summary()` to get a quick overview of how a variable is distributed within groups. Here, we’ll see the distribution of height within each species:

```r
summary_by_species <- tapply(trees$height, trees$species, summary)
print(summary_by_species)
```

```
$Maple
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   15.0    17.5    20.0    20.0    22.5    25.0 

$Oak
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   20.0    22.5    25.0    25.0    27.5    30.0 

$Pine
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
     28      31      34      34      37      40 
```

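This code applies the `summary()` function to each subgroup defined by the “species” column. Because `summary()` returns several values per group, the output is a list with one element per species rather than a data frame. Each element holds the summary statistics (minimum, quartiles, median, mean, maximum) for the height of that species.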
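The per-species summaries can be awkward to compare side by side. One option, shown here as a sketch rather than as part of the original example, is to stack them into a single matrix with `do.call(rbind, ...)`. The name `summary_matrix` is introduced only for this illustration.

```r
# Same sample data as in Example 1, repeated so this block runs on its own
trees <- data.frame(
  height  = c(20, 30, 25, 40, 15, 28),
  species = c("Oak", "Oak", "Maple", "Pine", "Maple", "Pine")
)

summary_by_species <- tapply(trees$height, trees$species, summary)

# Stack the per-species summaries into one matrix, one row per species
summary_matrix <- do.call(rbind, summary_by_species)
print(summary_matrix)

# Columns can then be pulled out by name
print(summary_matrix[, "Mean"])
```

A matrix like this is easier to scan, and you can pass it to `as.data.frame()` if a data frame is what you need.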

## Example 3: Custom Function for Identifying Tall Trees

Let’s create a custom function to find trees that are taller than the average height of their species:

```r
# Return TRUE for each height above the mean of the heights it receives
tall_trees <- function(height) {
  height > mean(height)
}

# tapply() calls tall_trees() once per species with that species' heights
tall_trees_by_species <- tapply(trees$height, trees$species, FUN = tall_trees)
print(tall_trees_by_species)
```

```
$Maple
[1]  TRUE FALSE

$Oak
[1] FALSE  TRUE

$Pine
[1]  TRUE FALSE
```

Here, we define a function `tall_trees()` that takes the heights for one group and returns TRUE for every height above that group’s mean. Because `tapply()` calls `FUN` once per species with only that species’ heights, `mean(height)` inside the function is the species average. Computing the mean **inside** the custom function is the crucial part: any extra argument you pass to `tapply()`, such as `mean(trees$height)`, is evaluated once and handed unchanged to every group, so each tree would be compared with the overall mean (about 26.3) instead of its species mean. The output is a list with one logical vector per species, indicating which trees are taller than their species average.
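If you want the flags lined up with the rows of the data frame instead of grouped by species, base R’s `ave()` is a handy companion to `tapply()`. The sketch below is an alternative approach, not the method used in the example above; the column name `is_tall` is made up for illustration.

```r
# Same sample data as in Example 1, repeated so this block runs on its own
trees <- data.frame(
  height  = c(20, 30, 25, 40, 15, 28),
  species = c("Oak", "Oak", "Maple", "Pine", "Maple", "Pine")
)

# ave() returns the group mean repeated for every row (FUN defaults to mean)
species_mean <- ave(trees$height, trees$species)

# Compare each tree with the mean of its own species
trees$is_tall <- trees$height > species_mean
print(trees)
```

Because `ave()` returns a vector the same length as its input, the comparison stays aligned with the original rows, which makes it easy to filter with `trees[trees$is_tall, ]`.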

# Give it a Try!

This is just a taste of what `tapply()` can do. There are endless possibilities for grouping data and applying functions. Try it out on your own datasets! Here are some ideas:

- Calculate the median income for different age groups.
- Find the most frequent word used in emails sent by different departments.
- Group customers by purchase history and analyze their average spending.
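As a starting point for the first idea, here is a minimal sketch that bins ages with `cut()` and then takes the median income per bin. The `survey` data frame, its values, and the age breaks are all made up for illustration.

```r
# Hypothetical survey data: ages and incomes are invented for this example
survey <- data.frame(
  age    = c(23, 35, 47, 52, 29, 41, 63, 38),
  income = c(32000, 54000, 61000, 75000, 40000, 58000, 49000, 51000)
)

# Bin ages into groups; right = FALSE makes each interval [lower, upper)
survey$age_group <- cut(
  survey$age,
  breaks = c(20, 30, 40, 50, 70),
  labels = c("20-29", "30-39", "40-49", "50-69"),
  right  = FALSE
)

# Median income within each age group
median_income <- tapply(survey$income, survey$age_group, median)
print(median_income)
```

The same pattern works for the other ideas: build or pick a grouping column first, then hand it to `tapply()` as `INDEX`.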

Remember, R is all about exploration. So dive in, play with `tapply()`, and see what insights you can uncover from your data!
