
## Introduction

I have received a few queries recently that can be categorized as “How do I collapse a list of categories or values into a shorter list of categories or values?” For example, one user wanted to collapse species of fish into their respective families. Another user wanted to collapse years into decades. Data munging such as this is common in fisheries. Thus, I provide a quick demonstration here of one way to accomplish these tasks using tools from the tidyverse.

This post requires the `dplyr`, `magrittr`, and `plyr` packages. Note, however, that `plyr` is not loaded below because I am only going to use one specific function from `plyr` (i.e., `mapvalues()`) and I have found that `plyr` and `dplyr` don’t always “play well” together.[^1]

Because I am creating random example data below, I set the random number seed to make the results reproducible.
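A minimal sketch of this setup step follows. The seed value is arbitrary and is an assumption made for illustration; any fixed integer makes the random draws repeatable.

```r
# Load the packages that are attached in this post (plyr is deliberately not loaded)
library(dplyr)
library(magrittr)

# Arbitrary seed (an assumption for illustration); fixing it makes sample() repeatable
set.seed(678394)
first_draw <- sample(1:100, 5)

# Resetting the same seed reproduces the same draw
set.seed(678394)
second_draw <- sample(1:100, 5)
identical(first_draw, second_draw)
```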

## Create A Sample of Data

The following creates a very simple sample of 250 individuals on which the species (as a short abbreviation) and year of capture were recorded.
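One way such a sample might be created is sketched below. The five species abbreviations, the range of years, the seed, and the name `df` are all assumptions made for illustration. The species variable is explicitly made a factor because recent versions of R no longer convert character columns to factors by default.

```r
# Hypothetical species abbreviations and capture years (assumed for illustration)
spp <- c("BLG", "LMB", "SMB", "WAE", "YEP")
yrs <- 1980:2009

set.seed(678394)  # arbitrary seed, for reproducibility only
df <- data.frame(
  species = factor(sample(spp, 250, replace = TRUE), levels = spp),
  year = sample(yrs, 250, replace = TRUE)
)

# Inspect the structure and the first few rows
str(df)
head(df)
```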

## Example 1 – Recode and Collapse Categories

The `mutate()` function may be used to add a new variable to a data.frame. The `mapvalues()` function (from `plyr`) may be used to efficiently recode character (or factor) values in a vector. Because `mapvalues()` operates on a vector, it must be used within `mutate()` to add a new variable with the recoded values to a data.frame. When used within `mutate()`, the first argument to `mapvalues()` is the vector that contains the original data to be recoded. A vector of categories for these original data is then given in `from=` and a vector of new categories for these data is given in `to=`.

I find it simplest to first create vectors of categories for `from=` and `to=` and then use them in `mapvalues()`. For example, the use of `levels()` below extracts (and saves into `short`) the vector of species abbreviations found in the `species` variable of the example data.
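A sketch of this extraction, using a small stand-in data.frame with assumed species abbreviations, is given here. Note that `levels()` returns `NULL` for a plain character vector, so the variable must be a factor.

```r
# Small stand-in for the example data (abbreviations are assumed for illustration)
spp <- c("BLG", "LMB", "SMB", "WAE", "YEP")
set.seed(678394)  # arbitrary seed
df <- data.frame(species = factor(sample(spp, 250, replace = TRUE), levels = spp))

# Extract the vector of species abbreviations (the factor levels)
short <- levels(df$species)
short
```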

“New categories” that correspond to each of the original categories may then be entered into a vector. For example, the `long` vector below contains the long-form names for each species (in the same order as the abbreviations in `short`) and `family` contains the corresponding family names.
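With the assumed abbreviations used in the sketches here, the vectors of new categories might look like the following. The key requirement is that each vector has the same length and order as `short`.

```r
# Original categories (assumed abbreviations, for illustration)
short <- c("BLG", "LMB", "SMB", "WAE", "YEP")

# Long-form names, in the same order as the abbreviations in short
long <- c("Bluegill", "Largemouth Bass", "Smallmouth Bass",
          "Walleye", "Yellow Perch")

# Family names, again in the same order as short
family <- c("Centrarchidae", "Centrarchidae", "Centrarchidae",
            "Percidae", "Percidae")

# The recoding only works if all three vectors have the same length
length(short) == length(long) && length(long) == length(family)
```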

“Column bind” these vectors together to make sure that the categories are correctly matched across the vectors.
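A sketch of this check with `cbind()`, using the same assumed vectors, is shown here. Each row of the resulting character matrix should describe a single species.

```r
# Assumed vectors of original and new categories (for illustration)
short <- c("BLG", "LMB", "SMB", "WAE", "YEP")
long <- c("Bluegill", "Largemouth Bass", "Smallmouth Bass",
          "Walleye", "Yellow Perch")
family <- c("Centrarchidae", "Centrarchidae", "Centrarchidae",
            "Percidae", "Percidae")

# Bind as columns; read across each row to confirm the categories line up
check <- cbind(short, long, family)
check
```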

The combined use of `mutate()` and `mapvalues()` below demonstrates how these vectors may be used to change the original abbreviated names to long-form names or family names. In addition, the last use of `mapvalues()` shows how to change the long-form names to family names. This last example is, of course, repetitive, but it is used here to demonstrate how `mutate()` allows a variable that was “just created” to be immediately used.
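The sketch below shows one way this might look with the assumed data and vectors from above. The new variable names (`longname`, `fam`, and `fam2`) are hypothetical. They were chosen so that they differ from the names of the `long` and `family` vectors; otherwise, once a `long` variable exists in the data.frame, `mutate()` would find that variable rather than the vector of the same name.

```r
library(dplyr)

# Assumed example data and category vectors (for illustration)
short <- c("BLG", "LMB", "SMB", "WAE", "YEP")
long <- c("Bluegill", "Largemouth Bass", "Smallmouth Bass",
          "Walleye", "Yellow Perch")
family <- c("Centrarchidae", "Centrarchidae", "Centrarchidae",
            "Percidae", "Percidae")
set.seed(678394)  # arbitrary seed
df <- data.frame(species = factor(sample(short, 250, replace = TRUE),
                                  levels = short))

df <- mutate(df,
  # abbreviations to long-form names
  longname = plyr::mapvalues(species, from = short, to = long),
  # abbreviations to family names
  fam = plyr::mapvalues(species, from = short, to = family),
  # long-form names (just created above) to family names
  fam2 = plyr::mapvalues(longname, from = long, to = family)
)
head(df)
```

The two family variables should be identical, which can be confirmed with `all(df$fam == df$fam2)`.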

Note in the code above that the use of `plyr::` in front of `mapvalues()` allows the user to use the `mapvalues()` function from `plyr` without loading the entire `plyr` package.[^2] As noted previously, this idiom is used here to avoid potential conflicts between `plyr` and `dplyr`.

Note that this use of `mapvalues()` and `mutate()` is described in Section 2.2.7 of my book Introductory Fisheries Analyses with R.

## Example 2 – Collapse Values into Categories

The `case_when()` function (from `dplyr`) may be used to efficiently collapse discrete values into categories.[^3] This function also operates on vectors and, thus, must be used within `mutate()` to add a variable to a data.frame. The arguments to `case_when()` are a series of two-sided formulae where the left side is a condition based on the original data and the right side is the value that should appear in the new variable when that condition is `TRUE`. For example, the code below creates a new variable called `decade` that identifies the decade that corresponds to the year-of-capture variable. The first line in `case_when()` says “if the year variable is in the values from 1980 to 1989, then the new category should be ‘1980s’.”[^4]
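A sketch of this use of `case_when()` follows, assuming years of capture that range from 1980 to 2009 (the range and the seed are assumptions for illustration).

```r
library(dplyr)

# Assumed example data: years of capture between 1980 and 2009
set.seed(678394)  # arbitrary seed
df <- data.frame(year = sample(1980:2009, 250, replace = TRUE))

df <- mutate(df,
  decade = case_when(
    year %in% 1980:1989 ~ "1980s",
    year %in% 1990:1999 ~ "1990s",
    year %in% 2000:2009 ~ "2000s"
  )
)
head(df)
```

Any year that matches none of the conditions would receive `NA`, so it is worth checking that the conditions cover the full range of the data.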

The lines in `case_when()` operate sequentially (like a series of “if” statements) such that the above operation can be more succinctly coded as below. Also note in this example the resulting variable is numeric rather than categorical (simply as an example).
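A sketch of this more succinct form, under the same assumption that all years fall between 1980 and 2009, is given here. The variable name `decade2` is hypothetical. Because the conditions are evaluated in order, a year reaches the second condition only if it failed the first, and the final `TRUE` acts as a catch-all for everything left over.

```r
library(dplyr)

# Assumed example data: years of capture between 1980 and 2009
set.seed(678394)  # arbitrary seed
df <- data.frame(year = sample(1980:2009, 250, replace = TRUE))

df <- mutate(df,
  decade2 = case_when(
    year < 1990 ~ 1980,  # 1980 to 1989
    year < 2000 ~ 1990,  # 1990 to 1999 (earlier years were already matched)
    TRUE ~ 2000          # everything else, i.e., 2000 to 2009
  )
)
head(df)
```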

## Footnotes

[^1]: This may not be a concern with recent versions of `plyr` and `dplyr`. However, I have been bitten by enough problems when I have both of these packages loaded that I prefer to use the cautionary approach that I take in this post.

[^2]: The `FSA` package imports `mapvalues` from `plyr` and then exports it. Thus, if you have loaded the `FSA` package then you will not need to use the `plyr::` idiom.

[^3]: You may also want to consider `cut()` for this purpose or, for collapsing continuous data into categories, `lencat()` from the `FSA` package.

[^4]: The colon operator creates a sequence of all integers between the two numbers separated by the colon. The `%in%` operator is used in conditional statements to determine if a value is contained within a vector, returning `TRUE` if it is and `FALSE` if it is not.