I have received a few queries recently that can be categorized as “How do I collapse a list of categories or values into a shorter list of categories or values?” For example, one user wanted to collapse species of fish into their respective families. Another user wanted to collapse years into decades. Data munging such as this is common in fisheries. Thus, I provide a quick demonstration here of one way to accomplish these tasks using tools from the tidyverse.
This post requires the dplyr and plyr packages. Note, however, that plyr is not loaded below because I am only going to use one specific function from plyr (mapvalues()) and I have found that plyr and dplyr don’t always “play well” together.[^1]
Because I am creating random example data below, I set the random number seed to make the results reproducible.
Create A Sample of Data
The following creates a very simple sample of 250 individuals on which the species (as a short abbreviation) and year of capture were recorded.
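A minimal sketch of such a data.frame is shown here. The seed value, the five species abbreviations, and the range of years are assumptions made for illustration only. The species variable is stored as a factor so that its levels can be extracted later.

```r
# Arbitrary seed (an assumption); any fixed value makes the sample reproducible
set.seed(678394)

# Hypothetical species abbreviations and capture years (assumptions)
dat <- data.frame(
  species = factor(sample(c("BKT", "BLG", "BNT", "LMB", "SMB"), 250, replace = TRUE)),
  year    = sample(1980:2017, 250, replace = TRUE)
)

head(dat)   # first six individuals
str(dat)    # species is a factor, year is an integer
```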
Example 1 – Recode and Collapse Categories
The mutate() function may be used to add a new variable to a data.frame. The mapvalues() function (from plyr) may be used to efficiently recode character (or factor) values in a vector. Because mapvalues() operates on a vector, it must be used within mutate() to add a new variable with the recoded values to a data.frame. When used within mutate(), the first argument to mapvalues() is the vector that contains the original data to be recoded. A vector of categories for these original data is then given in from= and a vector of new categories for these data is given in to=.
I find it simplest to first create vectors of categories for from= and to= and then use them in mapvalues(). For example, the use of levels() below extracts (and saves into short) the vector of species abbreviations found in the species variable of the example data.
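This step might look like the sketch below, which uses a small hypothetical stand-in for the example data (the species codes are assumptions).

```r
# Hypothetical stand-in for the example data; species must be a factor
dat <- data.frame(species = factor(c("LMB", "BKT", "BLG", "LMB", "BNT", "SMB")))

# Extract the species abbreviations (factor levels are sorted alphabetically)
short <- levels(dat$species)
short
```

Note that levels() returns NULL for a character vector. If species were stored as character rather than as a factor, sort(unique(dat$species)) would be one alternative.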
“New categories” that correspond to each of the original categories may then be entered into a vector. For example, the long vector below contains the long-form names for each species (in the same order as the abbreviations in short), and family contains the corresponding family names.
“Column bind” these vectors together to make sure that the categories are correctly matched across the vectors.
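As a sketch, with hypothetical species abbreviations (the particular species are assumptions, not taken from the example data), the three vectors and the check with cbind() might look like this.

```r
# Original categories (hypothetical abbreviations, in alphabetical order)
short  <- c("BKT", "BLG", "BNT", "LMB", "SMB")

# New categories, entered in the same order as short
long   <- c("Brook Trout", "Bluegill", "Brown Trout",
            "Largemouth Bass", "Smallmouth Bass")
family <- c("Salmonidae", "Centrarchidae", "Salmonidae",
            "Centrarchidae", "Centrarchidae")

# Each row should show one species with its matching names
chk <- cbind(short, long, family)
chk
```

Reading across each row of the result is a quick way to catch a name that was entered out of order.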
The combined use of mutate() and mapvalues() below demonstrates how these vectors may be used to change the original abbreviated names to long-form names or family names. In addition, the last use of mapvalues() shows how to change the long-form names to family names. This last example is, of course, repetitive, but it is used here to demonstrate how mutate() allows a variable that was “just created” to be immediately used.
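A self-contained sketch of this idea follows. The data, the seed, and the new variable names (name, fam, and fam2) are assumptions made for illustration. The new variables are deliberately named differently from the long and family vectors because, inside mutate(), a column name takes precedence over an object of the same name in the workspace.

```r
library(dplyr)

# Hypothetical categories (assumptions), in matching order
short  <- c("BKT", "BLG", "BNT", "LMB", "SMB")
long   <- c("Brook Trout", "Bluegill", "Brown Trout",
            "Largemouth Bass", "Smallmouth Bass")
family <- c("Salmonidae", "Centrarchidae", "Salmonidae",
            "Centrarchidae", "Centrarchidae")

# Hypothetical example data
set.seed(1)
dat <- data.frame(species = factor(sample(short, 250, replace = TRUE),
                                   levels = short))

dat <- mutate(dat,
  # abbreviation to long-form name
  name = plyr::mapvalues(species, from = short, to = long),
  # abbreviation to family (five categories collapse to two)
  fam  = plyr::mapvalues(species, from = short, to = family),
  # long-form name to family, using the name variable that was just created
  fam2 = plyr::mapvalues(name, from = long, to = family)
)

head(dat)
```

The fam and fam2 variables should be identical because they reach the family names by two different routes.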
Note in the code above that the use of plyr:: in front of mapvalues() allows the user to use the mapvalues() function from plyr without loading the entire plyr package.[^2] As noted previously, this idiom is used here to avoid potential conflicts between plyr and dplyr.
Note that this use of mutate() is described in Section 2.2.7 of my book Introductory Fisheries Analyses with R.
Example 2 – Collapse Values into Categories
The case_when() function (from dplyr) may be used to efficiently collapse discrete values into categories.[^3] This function also operates on vectors and, thus, must be used with mutate() to add a variable to a data.frame. The arguments to case_when() are a series of two-sided formulae where the left side is a conditioning statement based on the original data and the right side is the value that should appear in the new variable when that condition is TRUE. For example, the code below creates a new variable called decade that identifies the decade that corresponds to the year-of-capture variable. The first line in case_when() says “if the year variable is in the values from 1980 to 1989 then the new category should be ‘1980s’.”[^4]
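A sketch of this approach is shown here, assuming hypothetical capture years from 1980 to 2017 and an arbitrary seed.

```r
library(dplyr)

# Hypothetical years of capture (assumption)
set.seed(1)
dat <- data.frame(year = sample(1980:2017, 250, replace = TRUE))

dat <- mutate(dat,
  decade = case_when(
    year %in% 1980:1989 ~ "1980s",   # 1980:1989 is the sequence 1980, 1981, ..., 1989
    year %in% 1990:1999 ~ "1990s",
    year %in% 2000:2009 ~ "2000s",
    year %in% 2010:2019 ~ "2010s"
  )
)

# Cross-tabulate to check that each year fell into the intended decade
table(dat$year, dat$decade)
```

Any year that does not satisfy one of the conditions would be given NA, so the cross-tabulation is a useful check.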
The lines in case_when() operate sequentially (like a series of “if-else” statements), so each value receives the result from the first condition that is TRUE. Thus, the above operation can be more succinctly coded as below. Also note in this example that the resulting variable is numeric rather than categorical (simply as an example).
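One way to write the more succinct version is sketched here, again with hypothetical years and an arbitrary seed. Because the conditions are checked in order, a year of 1995 fails the first condition, passes the second, and is never compared with the later ones.

```r
library(dplyr)

# Hypothetical years of capture (assumption)
set.seed(1)
dat <- data.frame(year = sample(1980:2017, 250, replace = TRUE))

dat <- mutate(dat,
  decade = case_when(
    year < 1990 ~ 1980,   # only years before 1990 match here
    year < 2000 ~ 1990,   # reached only if the condition above was FALSE
    year < 2010 ~ 2000,
    TRUE        ~ 2010    # everything that remains
  )
)

table(dat$year, dat$decade)
```

All of the right-side values are numbers here. case_when() expects the right sides to be of compatible types, so numbers and character strings should not be mixed.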
[^1] This may not be a concern with recent versions of dplyr. However, I have been bitten by enough problems when I have both of these packages loaded that I prefer to use the cautionary approach that I take in this post.
[^2] The FSA package imports mapvalues() from plyr and then exports it. Thus, if you have loaded the FSA package then you will not need to use the plyr:: prefix.
[^3] You may also want to consider cut() for this purpose or, for collapsing continuous data into categories, lencat() from the FSA package.
[^4] The colon operator creates a sequence of all integers between the two numbers separated by the colon. The %in% operator is used in conditional statements to determine if a value is contained within a vector, returning TRUE if it is and FALSE if it is not.