Handling Categorical Data in R – Part 3
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
This is part 3 of a series on “Handling Categorical Data in R where we are learning to read, store, summarize, visualize & manipulate categorical data..”
In part 1 and part 2 of this series, we understood how R stores categorical data using factor and got an overview of tables in R while learning to summarize data.
In this article, we will learn to manipulate/reshape categorical data by changing the value and order of levels/categories.
Table of Contents
Resources
You can download all the data sets, R scripts, practice questions and their solutions from our GitHub repository.
Introduction
In this section, our focus will be on handling the levels of a categorical variable, and exploring the forcats package for the same. We will basically look at 3 key operations or transformations we would like to do when it comes to factors which are:
- change value of levels
- add or remove levels
- change order of levels
Before we start working with the value of the levels, let us read the case study data and take a quick look at some of the functions we used in the previous articles.
# read data data <- readRDS('analytics.rds')
We will store the source of traffic as channel
instead of referring to the column in the data.frame
every time.
channel <- data$channel
Let us go back to the function we used for tabulating data, fct_count()
. If you observe the result, it is in the same order as displayed by levels()
.
fct_count(channel) ## # A tibble: 8 x 2 ## f n ## <fct> <int> ## 1 (Other) 6073 ## 2 Affiliates 7388 ## 3 Direct 39853 ## 4 Display 3375 ## 5 Organic Search 139668 ## 6 Paid Search 4395 ## 7 Referral 35615 ## 8 Social 8031
If you want to sort the results by the count i.e. most common level comes at the top, use the sort
argument.
fct_count(channel, sort = TRUE) ## # A tibble: 8 x 2 ## f n ## <fct> <int> ## 1 Organic Search 139668 ## 2 Direct 39853 ## 3 Referral 35615 ## 4 Social 8031 ## 5 Affiliates 7388 ## 6 (Other) 6073 ## 7 Paid Search 4395 ## 8 Display 3375
If you want to view the proportion along with the count, set the prop
argument to TRUE
.
fct_count(channel, prop = TRUE) ## # A tibble: 8 x 3 ## f n p ## <fct> <int> <dbl> ## 1 (Other) 6073 0.0248 ## 2 Affiliates 7388 0.0302 ## 3 Direct 39853 0.163 ## 4 Display 3375 0.0138 ## 5 Organic Search 139668 0.571 ## 6 Paid Search 4395 0.0180 ## 7 Referral 35615 0.146 ## 8 Social 8031 0.0329
One of the important steps in data preparation/sanitization is to check if the levels are valid i.e. only levels which should be present in the data are actually present. fct_match()
can be used to check validity of levels. It returns a logical vector if the level is present and an error if not.
table(fct_match(channel, "Social")) ## ## FALSE TRUE ## 236367 8031
Change Value of Levels
In this section, we will learn how to change the value of the levels. In order to keep it interesting, we will state an objective from our case study and then map it into a function from the forcats package.
Combine both Paid & Organic Search into a single level, Search
In this case, we want to change the value of two levels, Paid Search & Organic Search and give them the common value Search. You can also look at it as collapsing two levels into one. There are two functions we can use here:
fct_collapse()
fct_recode()
Let us look at fct_collapse()
first. After specifying the categorical variable, we specify the new value followed by a character vector of the existing values. Remember, the new value is not enclosed in quotes (single or double) but the existing values must be a character
vector.
fct_count( fct_collapse( channel, Search = c("Paid Search", "Organic Search") ) ) ## # A tibble: 7 x 2 ## f n ## <fct> <int> ## 1 (Other) 6073 ## 2 Affiliates 7388 ## 3 Direct 39853 ## 4 Display 3375 ## 5 Search 144063 ## 6 Referral 35615 ## 7 Social 8031
In the case of fct_recode()
, each value being changed must be specified in a new line. Similar to
fct_collapse()
, the new value is not enclosed in quotes but the existing values must be.
fct_count( fct_recode( channel, Search = "Paid Search", Search = "Organic Search" ) ) ## # A tibble: 7 x 2 ## f n ## <fct> <int> ## 1 (Other) 6073 ## 2 Affiliates 7388 ## 3 Direct 39853 ## 4 Display 3375 ## 5 Search 144063 ## 6 Referral 35615 ## 7 Social 8031
Retain only those channels which have driven a minimum traffic of 5000 to the website
Instead of having all the channels, we desire to retain only those channels which have driven at least 5000 visits to the website. What about the rest of the channels which have driven less than 5000? We will recategorize them as Other. Keep in mind that we already have a (Other) level in our data. fct_lump_min()
will lump together all levels which do not have a minimum count specified. In our case study, only Display drives less than 5000 visits and it will be categorized into Other.
fct_count(fct_lump_min(channel, 5000)) ## # A tibble: 7 x 2 ## f n ## <fct> <int> ## 1 (Other) 6073 ## 2 Affiliates 7388 ## 3 Direct 39853 ## 4 Organic Search 139668 ## 5 Referral 35615 ## 6 Social 8031 ## 7 Other 7770
Retain only top 3 referring channels and categorize rest into Other
Suppose you decide to retain only the top 3 channels in terms of the traffic driven to the website. In our
case study, these are Direct, Organic Search and Referral. We want to retain these 3 levels and categorize the rest as Other. fct_lump_n()
will retain top n
levels by count/frequency and lump the rest into Other.
fct_count(fct_lump_n(channel, 3)) ## # A tibble: 4 x 2 ## f n ## <fct> <int> ## 1 Direct 39853 ## 2 Organic Search 139668 ## 3 Referral 35615 ## 4 Other 29262
In our case, n
is 3 and hence the top 3 channels in terms of traffic driven are retained while the rest are
lumped into Other.
Retain only those channels which have driven at least 2% of the overall traffic
In the second scenario above, we retained channels based on minimum traffic driven by them to the website.
The criteria was count of visits. If you want to specify the criteria as a percentage or proportion instead of count, use fct_lump_prop()
. The criteria is a value between 0 and 1. In our case study, we want to retain channels that have driven at least 2% of the overall traffic. Hence, we have specified the criteria as 0.02
.
fct_count(fct_lump_prop(channel, 0.02)) ## # A tibble: 7 x 2 ## f n ## <fct> <int> ## 1 (Other) 6073 ## 2 Affiliates 7388 ## 3 Direct 39853 ## 4 Organic Search 139668 ## 5 Referral 35615 ## 6 Social 8031 ## 7 Other 7770
As you can see, only Display drives less than 2% of overall traffic and has been lumped into Other.
Retain the following channels and merge the rest into Other
- Organic Search
- Direct
- Referral
In the previous scenarios, we have been retaining or lumping channels based on some criteria like count or
percentage of traffic driven to the website. In this scenario, we want to retain certain levels by specifying
their labels and combine the rest into Other. While we can use fct_collapse()
or fct_recode()
, a more
appropriate function would be fct_other()
. We will do a comparison of the three functions in a short while.
fct_other()
has two arguments, keep
and drop
. keep
is used when we know the levels we want to retain
and drop
is used when we know the levels we want to drop. In this scenario, we know the levels we want to
retain and hence we will use the keep
argument and specify them. Organic Search, Direct and Referral will be retained while the rest of the channels will be lumped into Other.
fct_count( fct_other( channel, keep = c("Organic Search", "Direct", "Referral")) ) ## # A tibble: 4 x 2 ## f n ## <fct> <int> ## 1 Direct 39853 ## 2 Organic Search 139668 ## 3 Referral 35615 ## 4 Other 29262
Merge the following channels into Other and retain rest of them:
- Display
- Paid Search
In this scenario, we know the levels we want to drop and hence we will use the drop
argument and specify
them. Display and Paid Search will be lumped into Other while the rest of the channels will be retained.
fct_count( fct_other( channel, drop = c("Display", "Paid Search") ) ) ## # A tibble: 7 x 2 ## f n ## <fct> <int> ## 1 (Other) 6073 ## 2 Affiliates 7388 ## 3 Direct 39853 ## 4 Organic Search 139668 ## 5 Referral 35615 ## 6 Social 8031 ## 7 Other 7770
In the previous scenario, we said we will compare fct_other()
with fct_collapse()
and fct_recode()
. Let us use the other two functions as well and see the difference.
# collapse fct_count( fct_collapse( channel, Other = c("(Other)", "Affiliate", "Display", "Paid Search", "Social") ) ) ## Warning: Unknown levels in `f`: Affiliate ## # A tibble: 5 x 2 ## f n ## <fct> <int> ## 1 Other 21874 ## 2 Affiliates 7388 ## 3 Direct 39853 ## 4 Organic Search 139668 ## 5 Referral 35615 # recode fct_count( fct_recode( channel, Other = "(Other)", Other = "Affiliate", Other = "Display", Other = "Paid Search", Other = "Social" ) ) ## Warning: Unknown levels in `f`: Affiliate ## # A tibble: 5 x 2 ## f n ## <fct> <int> ## 1 Other 21874 ## 2 Affiliates 7388 ## 3 Direct 39853 ## 4 Organic Search 139668 ## 5 Referral 35615
As you can observe, fct_other()
requires less typing and is easier to specify.
Anonymize the data set before sharing it with your colleagues
Anonymizing data is extremely important when you are sharing sensitive data with others. Here, we want
to anonymize the channels which drive traffic to the website so that we can share it with others without
divulging the names of the channels. fct_anon()
allows us to anonymize the levels in the data. Using the
prefix
argument, we can specify the prefix to be used while anonymizing the data.
fct_count(fct_anon(channel, prefix = "ch_")) ## # A tibble: 8 x 2 ## f n ## <fct> <int> ## 1 ch_1 6073 ## 2 ch_2 39853 ## 3 ch_3 35615 ## 4 ch_4 3375 ## 5 ch_5 8031 ## 6 ch_6 7388 ## 7 ch_7 139668 ## 8 ch_8 4395
Key Functions
Function | Description |
---|---|
fct_collapse()
|
Collapse factor levels |
fct_recode()
|
Recode factor levels |
fct_lump_min()
|
Lump factor levels with count lesser than specified value |
fct_lump_n()
|
Lump all levels except the top n levels |
fct_lump_prop()
|
Lump factor levels with count lesser than specified proportion |
fct_lump_lowfreq()
|
Lump together least frequent levels |
fct_other()
|
Replace levels with Other level |
fct_anon()
|
Anonymize factor levels |
Add / Remove Levels
In this small section, we will learn to:
- add new levels
- drop levels
- make missing values explicit
Add a new level, Blog
fct_expand()
allows us to add new levels to the data. The label of the new level must be specified after the variable name and must be enclosed in quotes. If the level already exists, it will be ignored. Let us add a new level, Blog
.
levels(fct_expand(channel, "Blog")) ## [1] "(Other)" "Affiliates" "Direct" "Display" ## [5] "Organic Search" "Paid Search" "Referral" "Social" ## [9] "Blog"
Drop existing level
On the other hand, fct_drop()
will drop levels which have no values i.e. unused levels. If you want
to drop only specific levels, use the only
argument and specify the name of the level in quotes. Let us drop the new level we added in the previous example.
levels(fct_drop(fct_expand(channel, "Blog"))) ## [1] "(Other)" "Affiliates" "Direct" "Display" ## [5] "Organic Search" "Paid Search" "Referral" "Social"
Make missing values explicit
In our data set, the gender column has many missing values, and in R, missing values are represented by NA
. Suppose you are sharing the data or analysis with someone who is not an R user, and does not know what NA
represents. In such a scenario, we can use the fct_explicit_na()
function to make the missing values in the gender column explicit i.e. it will appear as (Missing)
instead of NA
. This will help non R users to understand that there are missing values in the data.
fct_count(fct_explicit_na(data$gender)) ## # A tibble: 3 x 2 ## f n ## <fct> <int> ## 1 female 40565 ## 2 male 61617 ## 3 (Missing) 142216
Key Functions
Function | Description |
---|---|
fct_expand()
|
Add additional levels to a factor |
fct_drop()
|
Drop unused factor levels |
fct_explicit_na()
|
Make missing values explicit |
Change Order of Levels
In this last section, we will learn how to change the order of the levels. We will look at the following scenarios from our case study:
We want to make
- Organic Search the first level
- Referral the third level
- Display the last level
Make Organic Search the first level
In this scenario, we want the levels to appear in a certain order. In the first case, we want Organic Search to be the first level. fct_relevel()
allows us to manually reorder the levels. To move a level to the beginning, specify the label (it must be enclosed in quotes).
levels(channel) ## [1] "(Other)" "Affiliates" "Direct" "Display" ## [5] "Organic Search" "Paid Search" "Referral" "Social" levels(fct_relevel(channel, "Organic Search")) ## [1] "Organic Search" "(Other)" "Affiliates" "Direct" ## [5] "Display" "Paid Search" "Referral" "Social"
Make Referral the third level
The after
argument is useful when we want to move the level to the end or anywhere between the beginning and end. In the second case, we want Referral to be the third level. After specifying the label, use the after
argument and specify the level after which Referral should appear. Since we want to move it to the third position, we will set the value of after
to 2
i.e. Referral should come after the second position.
levels(channel) ## [1] "(Other)" "Affiliates" "Direct" "Display" ## [5] "Organic Search" "Paid Search" "Referral" "Social" levels(fct_relevel(channel, "Referral", after = 2)) ## [1] "(Other)" "Affiliates" "Referral" "Direct" ## [5] "Display" "Organic Search" "Paid Search" "Social"
Make Display the last level
In this last case, we want to move Display to the end. If you know the number of levels, you can specify a value here. In our data, there are eight channels i.e. eight levels, so we can set the value of after
to 7
. What happens when we do not know the number of levels or if they tend to vary? In such cases, to move a level to the end, set the value of after
to Inf
.
levels(channel) ## [1] "(Other)" "Affiliates" "Direct" "Display" ## [5] "Organic Search" "Paid Search" "Referral" "Social" levels(fct_relevel(channel, "Display", after = Inf)) ## [1] "(Other)" "Affiliates" "Direct" "Organic Search" ## [5] "Paid Search" "Referral" "Social" "Display"
Let us now look at a scenario where we want to order the levels by
- frequency (largest to smallest)
- order of appearance (in data)
Order levels by frequency
In the first case, the levels with the most frequency should appear at the top. fct_infreq()
will order the
levels by their frequency.
# reorder levels levels(channel) ## [1] "(Other)" "Affiliates" "Direct" "Display" ## [5] "Organic Search" "Paid Search" "Referral" "Social" levels(fct_infreq(channel)) ## [1] "Organic Search" "Direct" "Referral" "Social" ## [5] "Affiliates" "(Other)" "Paid Search" "Display"
Order levels by appearance
In the second case, the order of the levels should be the same as the order of their appearance in the data. fct_inorder()
will order the levels according to the order in which they appear in the data.
# reorder levels levels(channel) ## [1] "(Other)" "Affiliates" "Direct" "Display" ## [5] "Organic Search" "Paid Search" "Referral" "Social" levels(fct_inorder(channel)) ## [1] "Organic Search" "Direct" "Referral" "Affiliates" ## [5] "(Other)" "Social" "Display" "Paid Search"
Reverse the order of the levels
The order of the levels can be reversed using fct_rev()
.
# reorder levels levels(channel) ## [1] "(Other)" "Affiliates" "Direct" "Display" ## [5] "Organic Search" "Paid Search" "Referral" "Social" levels(fct_rev(channel)) ## [1] "Social" "Referral" "Paid Search" "Organic Search" ## [5] "Display" "Direct" "Affiliates" "(Other)"
Randomly shuffle the order of the levels
The order of the levels can be randomly shuffled using fct_shuffle()
.
# reorder levels levels(channel) ## [1] "(Other)" "Affiliates" "Direct" "Display" ## [5] "Organic Search" "Paid Search" "Referral" "Social" levels(fct_shuffle(channel)) ## [1] "Paid Search" "Direct" "Social" "Affiliates" ## [5] "Referral" "(Other)" "Display" "Organic Search"
Key Functions
Function | Description |
---|---|
fct_relevel()
|
Reorder factor levels |
fct_shift()
|
Shift factor levels |
fct_infreq()
|
Reorder factor levels by frequency |
fct_rev()
|
Reverse order of factor levels |
fct_inorder()
|
Reorder factor levels by first appearance |
fct_shuffle()
|
Randomly shuffle factor levels |
Your Turn…
Display the count/frequency of the following variables in the descending order
device
landing_page
exit_page
Check if
laptop
is a level in thedevice
column.Combine the following levels in
landing_page
intoAccount
My Account
Register
Sign In
Your Info
Combine levels in
landing_page
that drive less than 1000 visits.Get top 10 landing and exit pages.
Get landing pages that drive at least 5% of the total traffic to the website.
Retain only the following levels in the
browser
column:Chrome
Firefox
Safari
Edge
Anonymize landing and exit page levels.
Make
Home
first level in thelanding_page
column.Make
Apparel
second level in thelanding_page
column.Make
Specials
last level in thelanding_page
column.Order the levels in the browser by frequency:
Order the levels in landing page by appearance:
Shuffle the levels in os
Reverse the levels in browser
*As the reader of this blog, you are our most important critic and commentator. We value your opinion and want to know what we are doing right, what we could do better, what areas you would like to see us publish in, and any other words of wisdom you are willing to pass our way.
We welcome your comments. You can email to let us know what you did or did not like about our blog as well as what we can do to make our post better.*
Email: [email protected]
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.