# Working With Factors In R – Tutorial forcats Package

This article was first published on **R Statistics Blog** and kindly contributed to R-bloggers.


## Introduction to factors in R

In the R language, factors represent categorical variables. Conceptually, a categorical variable takes a limited number of distinct values, which can be represented by either character or integer values. Understanding factors is critical to statistical modeling in R because categorical variables are treated differently in statistical models than continuous variables. By the end of this tutorial on the forcats package, you will be able to inspect levels, change the order of levels, change the values of levels, combine levels, and add or drop levels more efficiently.

But before that, let us learn a bit more about factors in R.

The **first and foremost** thing to remember is that a factor variable in R is stored as a vector of integer values. Each integer is a code that points to one of the character values used to display the levels. You can check that with the `str()` function. When you check the structure of a data frame, you will see that each factor variable is shown with its levels followed by integer codes such as 2, 2, 1 after the colon (:). In the built-in `mtcars` data, `cyl` is numeric, so we convert it to a factor first. Let’s take a look for a better understanding.

```r
# Convert cyl to a factor (it is numeric in the built-in data)
mtcars$cyl <- factor(mtcars$cyl)

# Checking structure
str(mtcars)

# Output
# 'data.frame': 32 obs. of 11 variables:
#  $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
#  $ cyl : Factor w/ 3 levels "4","6","8": 2 2 1 2 3 2 3 1 1 2 ...
#  $ disp: num 160 160 108 258 360 ...
#  $ hp  : num 110 110 93 110 175 105 245 62 95 123 ...
#  $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
#  $ wt  : num 2.62 2.88 2.32 3.21 3.44 ...
#  $ qsec: num 16.5 17 18.6 19.4 17 ...
#  $ vs  : num 0 0 1 1 0 1 0 1 1 1 ...
#  $ am  : num 1 1 1 0 0 0 0 0 0 0 ...
#  $ gear: num 4 4 4 3 3 3 3 4 4 4 ...
#  $ carb: num 4 4 1 1 2 1 4 2 2 4 ...
```
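To make the integer storage concrete, here is a minimal sketch in base R. The vector `sizes` is made-up illustration data, not part of `mtcars`.

```r
# A small illustrative factor (hypothetical data)
sizes <- factor(c("low", "high", "low", "mid"))

typeof(sizes)      # the underlying storage type is integer
levels(sizes)      # levels are sorted alphabetically by default
as.integer(sizes)  # integer codes that index into the levels
```

Each code is simply the position of the value in `levels(sizes)`.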

**Second**, both numeric and character variables can be converted to a factor variable using the base R functions `as.factor()` or `factor()`. The forcats package offers `as_factor()` for the same purpose.
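As a small sketch of how these conversions differ, the example below compares base R `factor()` with `forcats::as_factor()` on a made-up character vector `x` (an assumption for illustration only).

```r
library(forcats)

# Hypothetical character vector
x <- c("b", "a", "c", "a")

levels(factor(x))     # base R sorts the levels
levels(as_factor(x))  # forcats keeps the order of first appearance
```

Which behavior you want depends on whether the order in the data carries meaning.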

**Third**, the levels of factors are always stored as character values. You can check the levels of factor variables using the `levels()` function in R.

**Fourth**, factors in R can be either **ordered** or **unordered**. Please do not ignore this point; in some analyses or statistical models, the order of the levels matters.
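The difference between ordered and unordered factors can be sketched in base R as follows. The `sizes` vector is hypothetical data chosen for illustration.

```r
# Hypothetical ordered factor with an explicit level order
sizes <- factor(
  c("small", "large", "medium"),
  levels = c("small", "medium", "large"),
  ordered = TRUE
)

is.ordered(sizes)  # TRUE for an ordered factor
sizes < "large"    # comparisons follow the level order, not the alphabet
```

Comparisons like `<` are only meaningful for ordered factors; for an unordered factor R returns `NA` with a warning.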

Now that you are aware of factors in R, let’s learn how to perform some of the most frequently used tasks involving them. Apart from base R helpers such as `factor()` and `levels()`, the functions mentioned in this tutorial come from the `forcats` package. The best part of using `forcats` is consistency: its functions take a factor as the first argument and, in most cases, return a factor, while `fct_count()` returns a tibble.

## Convert and check levels of factor variables

As mentioned earlier, we will use the `factor()` function to convert the `cyl` variable from the `mtcars` data to a factor and check its class to confirm the conversion. We will then check the levels of the variable using the `levels()` function, which will validate our third point that levels are represented as characters.

**Converting a numeric variable to a factor variable**

```r
library(forcats)

mtcars$cyl <- factor(mtcars$cyl)
class(mtcars$cyl)

# Output
# [1] "factor"
```

**Check levels of a factor variable**

```r
levels(mtcars$cyl)

# Output
# [1] "4" "6" "8"
```

You can see the output values are enclosed in double quotes, confirming that levels are stored as character values.
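Because the levels are characters and the stored values are integer codes, converting a factor back to numbers needs some care. A minimal base R sketch, using a short made-up vector rather than the full `mtcars` column:

```r
# Hypothetical factor built from numeric values
cyl <- factor(c(6, 4, 8, 6))

as.numeric(cyl)                # returns the integer codes, not the values
as.numeric(as.character(cyl))  # goes through the levels to recover the values
```

The second form is the usual way to get the original numbers back.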

## Inspecting levels of factor variables

Here we will see how to get the count of each level within a factor using `fct_count()`. While we do so, you will also learn how to sort the levels by count using the `sort =` argument. We will then learn how to get the unique values, removing duplicates, using the `fct_unique()` function.

**Count the number of values**

```r
fct_count(mtcars$cyl, sort = TRUE)

# Output
# # A tibble: 3 x 2
#   f         n
#   <fct> <int>
# 1 8        14
# 2 4        11
# 3 6         7
```

**Remove duplicates to get unique values**

```r
fct_unique(mtcars$cyl)

# Output
# [1] 4 6 8
# Levels: 4 6 8
```
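Since `fct_count()` returns a tibble, its columns can be used like any other data frame columns. The sketch below also uses the `prop = TRUE` argument to add proportions; the object name `counts` is just an illustrative choice.

```r
library(forcats)

cyl <- factor(mtcars$cyl)

# Counts sorted by frequency, with a proportion column p
counts <- fct_count(cyl, sort = TRUE, prop = TRUE)
counts

counts$f[1]  # the most frequent level
counts$n[1]  # its count
```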

## Changing the order of levels for a factor variable

There could be multiple reasons why you would want to change the order of levels in a factor variable. As this tutorial is only about the R programming language and the `forcats` package, the why and when of ordering factor levels is out of scope. However, we will still discuss the different approaches one can take to reorder the levels of a factor variable.

**Manually ordering levels of a factor variable**

Here the choice is yours, that is, you decide how to reorder the levels. Let’s say you want to reorder the levels of the `cyl` variable; then you can use the `fct_relevel()` function as illustrated below.

```r
fct_relevel(mtcars$cyl, c("8", "4", "6"))
```
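You do not have to list every level. As a brief sketch, `fct_relevel()` moves only the levels you name and accepts an `after` argument to control where they land:

```r
library(forcats)

cyl <- factor(mtcars$cyl)

levels(fct_relevel(cyl, "8"))               # move "8" to the front
levels(fct_relevel(cyl, "4", after = Inf))  # move "4" to the end
```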

**Reorder factor levels based on the appearance in data**

The `fct_inorder()` function reorders the levels of a factor variable based on the order in which they first appear in the data. Below you will notice that 6 appears first, then 4 and lastly 8, and the factor levels are arranged in that order.

```r
fct_inorder(mtcars$cyl)

# Output
# [1] 6 6 4 6 8 6 8 4 4 6 6 8 8 8 8 8 8 4 4 4 4 8 8 8 8 4 4 4 8 6
# [31] 8 4
# Levels: 6 4 8
```

**Order factor levels based on the frequency**

The `fct_infreq()` function from the `forcats` package arranges the levels of a factor based on each level’s frequency. The level with the highest frequency takes the first place, followed by the less frequent levels. Most cars in the dataset have 8 cylinders, followed by 4 and 6 cylinders.

```r
fct_infreq(mtcars$cyl)

# Output
# [1] 6 6 4 6 8 6 8 4 4 6 6 8 8 8 8 8 8 4 4 4 4 8 8 8 8 4 4 4 8 6
# [31] 8 4
# Levels: 8 4 6
```

**Reversing the order of levels**

If you are interested in reversing the order of the levels of a factor, you can use the `fct_rev()` function. You can see we end up with the exact reverse order. If you wish, you can check the original order using the `levels()` function.

```r
fct_rev(mtcars$cyl)

# Output
# [1] 6 6 4 6 8 6 8 4 4 6 6 8 8 8 8 8 8 4 4 4 4 8 8 8 8 4 4 4 8 6
# [31] 8 4
# Levels: 8 6 4
```
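The reordering functions return factors, so they can be chained. As a small sketch, combining `fct_infreq()` with `fct_rev()` orders the levels from least to most frequent:

```r
library(forcats)

cyl <- factor(mtcars$cyl)

levels(fct_infreq(cyl))           # most frequent level first
levels(fct_rev(fct_infreq(cyl)))  # least frequent level first
```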

**Reorder factor levels based on the relationship with other variables**

I have not used the `fct_reorder()` function so far, but while I was reading through the documentation of the `forcats` package, I found it interesting and am therefore sharing it with you. The example here is borrowed from the package documentation itself. The function is mostly useful for display purposes.

In the example, we order the levels of the `color` variable based on its relationship with the `a` variable. Specifically, we look at the minimum value of column `a` for each level, and you will find that the level `blue` has the minimum value of 1, then `red` has 2, and so on.

The package provides another function called `fct_reorder2()`, which takes into account the relationship with two variables instead of just one.

```r
df <- tibble::tribble(
  ~color,   ~a, ~b,
  "blue",    1,  2,
  "green",   6,  2,
  "purple",  3,  3,
  "red",     2,  3,
  "yellow",  5,  1
)
df$color <- factor(df$color)

fct_reorder(df$color, df$a, min)
```
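To see the effect on the levels directly, here is a minimal sketch that rebuilds the same `color` and `a` columns in a base `data.frame` and compares the level order before and after. The `.desc = TRUE` call is an added illustration of the descending option.

```r
library(forcats)

# Same color and a values as in the example above
df <- data.frame(
  color = factor(c("blue", "green", "purple", "red", "yellow")),
  a     = c(1, 6, 3, 2, 5)
)

levels(df$color)                                              # alphabetical
levels(fct_reorder(df$color, df$a, .fun = min))               # ascending by a
levels(fct_reorder(df$color, df$a, .fun = min, .desc = TRUE)) # descending by a
```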

## Add or drop factor levels in R

The three functions which are important to know from the addition and deletion perspective are:

- **fct_expand()** – use it to add a new level.
- **fct_explicit_na()** – use it if you wish to make NA an explicit level. This way, when you plot charts, NAs will also appear.
- **fct_drop()** – use it to drop unused levels.

Below we have code snippets with examples for better understanding.

```r
# Adding a factor level
fct_expand(mtcars$cyl, "7")

# Converting NA to a factor level
f1 <- factor(c(1, 1, NA, NA, 2, 2, NA, 2, 1, 2, 2))
f2 <- fct_explicit_na(f1, na_level = "(Unknown)")

# Dropping unused factor levels
fac1 <- factor(c("aa", "bb"), c("aa", "bb", "cc"))
fac2 <- fct_drop(fac1)
fac2
```
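The sketch below inspects the levels after each of these operations. It assumes forcats 1.0.0 or later, where `fct_explicit_na()` is deprecated in favor of `fct_na_value_to_level()`; on older versions, keep using `fct_explicit_na()` as shown above.

```r
library(forcats)

cyl <- factor(mtcars$cyl)

# The new level is appended at the end
levels(fct_expand(cyl, "7"))

# Assumes forcats >= 1.0.0: NA becomes an explicit level
f1 <- factor(c(1, 1, NA, 2))
f2 <- fct_na_value_to_level(f1, level = "(Unknown)")
levels(f2)

# The unused level "cc" is removed
fac1 <- factor(c("aa", "bb"), levels = c("aa", "bb", "cc"))
levels(fct_drop(fac1))
```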

## Changing values of factor levels in R

The task of changing the levels of variables can be done in multiple ways. One, you may be interested in manual recoding. Two, you may be interested in collapsing the levels into fewer groups. Three, you may be interested in clubbing the least/most common levels into a single level. Four, you may just want to keep/drop some levels and rename everything else as **Other**.

Below is an illustration of how to achieve the above tasks using the functions from the `forcats` package in R.

**Use** `fct_collapse()` to manually combine levels into defined groups.

Below we collapse 4 and 6 to form another group called Other.

```r
fct_collapse(mtcars$cyl, Other = c("4", "6"))
```

**Use** `fct_other()` to replace the levels that you don’t want to keep with Other. You can also mention the levels that you want to drop; here, the levels mentioned in the `drop =` argument will be renamed to Other. The code below produces the same grouping as above, although the order of the levels in the output may differ.

```r
# Example showing keep as argument
fct_other(mtcars$cyl, keep = c("8"))

# Example showing drop as argument
fct_other(mtcars$cyl, drop = c("4", "6"))
```
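One way to check that the two approaches group the values identically is to compare the results as character vectors. This is a sketch; the object names `collapsed` and `kept` are illustrative.

```r
library(forcats)

cyl <- factor(mtcars$cyl)

collapsed <- fct_collapse(cyl, Other = c("4", "6"))
kept      <- fct_other(cyl, keep = "8")

# Same value for every observation
identical(as.character(collapsed), as.character(kept))

# The level order is worth inspecting separately
levels(collapsed)
levels(kept)
```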

**Use** `fct_lump()` to group the most/least common levels into a single level. The function is very powerful and offers other criteria for deciding which levels to combine. I encourage you to read more about the function using `help(fct_lump)`. Below we preserve the `n` most common values. This again results in the same grouping as above.

```r
fct_lump(mtcars$cyl, n = 1)
```

We also have different variants of the above function.

- `fct_lump_min()`: lumps levels that appear fewer than `min` times.
- `fct_lump_prop()`: lumps levels that appear fewer than `prop * n` times.
- `fct_lump_n()`: lumps all levels except for the `n` most frequent (or least frequent if `n < 0`).
- `fct_lump_lowfreq()`: lumps together the least frequent levels, ensuring that "other" is the smallest level.
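The variants listed above can be sketched on a small made-up factor. The vector `x` and the thresholds are hypothetical values chosen so that two rare levels get lumped in each case.

```r
library(forcats)

# Hypothetical data: a = 10, b = 5, c = 2, d = 1 observations
x <- factor(c(rep("a", 10), rep("b", 5), rep("c", 2), rep("d", 1)))

levels(fct_lump_n(x, n = 2))          # keep the 2 most frequent levels
levels(fct_lump_min(x, min = 3))      # lump levels seen fewer than 3 times
levels(fct_lump_prop(x, prop = 0.2))  # lump levels below 20% of observations

table(fct_lump_n(x, n = 2))           # counts after lumping
```

In all three calls, `c` and `d` end up in the `Other` level for this data.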

**Use** `fct_recode()` if you wish to replace the values of the levels manually. The other function which you can use to achieve the same task is `fct_relabel()`.

Here we rename the levels of the `cyl` factor variable to cyl4, cyl6, and cyl8.

```r
fct_recode(mtcars$cyl, cyl4 = "4", cyl6 = "6", cyl8 = "8")
```

The same task can also be achieved using `fct_relabel()`. The function accepts either a regular function or a purrr-style formula, as used in `purrr::map()`. The purrr package is an amazing package, and if you have not explored it yet, I strongly recommend that you do. A detailed tutorial on the purrr package is linked from the original post.
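A minimal sketch of the `fct_relabel()` approach, showing both the formula and the plain-function form (the argument name `lvl` is an illustrative choice):

```r
library(forcats)

cyl <- factor(mtcars$cyl)

# purrr-style formula: .x holds the current levels
levels(fct_relabel(cyl, ~ paste0("cyl", .x)))

# Equivalent call with an ordinary function
levels(fct_relabel(cyl, function(lvl) paste0("cyl", lvl)))
```

Unlike `fct_recode()`, this applies one rule to all levels, which is convenient when there are many of them.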

## Combining factors with different levels

Often, we get data from different sources, so some information may be available in one source and some in another. For categorical variables, this means we may have to patch together factors from these sources so that they share the same levels. The image in the original post illustrates what we mean by factors coming from different sources.

**To combine factors with different levels, you can use** the `fct_c()` function.

```r
# Creating two factors with different levels
fac1 <- factor("aa")
fac2 <- factor("bb")

fct_c(fac1, fac2)
```

**To standardize the factor levels across different sources, use** `fct_unify()`. This is another approach that you can use. Here every factor in the list will have all the levels, irrespective of whether the value is present in that factor or not.

```r
# Creating two factors with different levels
fac1 <- factor("aa")
fac2 <- factor("bb")

fct_unify(list(fac1, fac2))

# Output
# [[1]]
# [1] aa
# Levels: aa bb
#
# [[2]]
# [1] bb
# Levels: aa bb
```

Notice how we have both the levels in both the vectors mentioned as part of levels.
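If you only need to know which levels the sources have between them, `lvls_union()` from forcats returns that set. The sketch below pairs it with `fct_c()` on the same two factors:

```r
library(forcats)

fac1 <- factor("aa")
fac2 <- factor("bb")

# Union of the levels across both factors
lvls_union(list(fac1, fac2))

# A single combined factor carrying both levels
combined <- fct_c(fac1, fac2)
levels(combined)
length(combined)
```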

With this, we come to the end of the tutorial on the forcats package. I hope you find this tutorial helpful and start incorporating some of the functionality discussed here.

Happy Learning!
