Handling Categorical Data in R – Part 2

[This article was first published on Rsquared Academy Blog - Explore Discover Learn, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

This is part 2 of a series on “Handling Categorical Data in R where we are learning to read, store, summarize, visualize & manipulate categorical data..”

In part 1 of this series, we understood what categorical data is, how R stores it using factor and got an overview of the case study we will be working on throughout this series.

In this article, we will learn to summarize categorical data. In the process, we will do a deep dive on working with tables in R and explore a diverse set of packages.

Resources

You can download all the data sets, R scripts, practice questions and their solutions from our GitHub repository.

Introduction

Categorical data cannot be summarized in the same way as numeric data. It does not make sense to look at range, standard deviation etc. since data consists of a few distinct values only. So how do we summarize such data? We can look at

  • count/frequency
  • proportion
  • cumulative frequency
  • cross table
  • contingency table etc.

In this section, we will explore the above ways of summarizing categorical data. We will also spend some time learning about tables as you will be using them extensively while working with categorical data. R has many packages for tabulating data and we list and explore all of them in the R scripts shared in the GitHub repository.

Number of Categories

From our case study, we want to know the number of devices used to browse the website, the name of the devices and the proportion of traffic they drive to our website. Let us read the case study data set before we analyze the website traffic.

# read data
data <- readRDS('analytics.rds')

Let us begin with the number of devices. To view the number of groups/categories in a categorical variable, use nlevels().

nlevels(data$device)
## [1] 3

There are 3 categories of devices used by the visitors to browse the website. This can also be used for data sanitization i.e. as an analyst you know that there are only 3 valid categories of device into which any visitor can be classified into. If you see more than 3 categories, you might want to check if there are any issues in data collection or processing. Now that we know there are 3 categories of devices, let us check if they are valid. The levels() function will return the labels of the groups.

Category Names

Knowing the number of levels is useful but not sufficient. levels() is one of the most useful functions when it comes to dealing with categorical data.

levels(data$device)
## [1] "Desktop" "Mobile"  "Tablet"

Other functions that you can use include unique() and fct_unique(). Both these functions will return the unique names/labels along with the levels while levels() returns the labels of the levels.

unique(data$device)
## [1] Desktop Mobile  Tablet 
## Levels: Desktop Mobile Tablet
fct_unique(data$device)
## [1] Desktop Mobile  Tablet 
## Levels: Desktop Mobile Tablet

Names & Counts

So we have checked the number of devices and their names. Let us now examine their distribution i.e. count/frequency. table() and summary() will display the levels and their counts while fct_count() will return a tibble with 2 columns (level & count). It is extremely useful for further data processing or visualization (using ggplot2).

table(data$device)
## 
## Desktop  Mobile  Tablet 
##  177282   63482    3634
fct_count(data$device)
## # A tibble: 3 x 2
##   f            n
##   <fct>    <int>
## 1 Desktop 177282
## 2 Mobile   63482
## 3 Tablet    3634
summary(data$device)
## Desktop  Mobile  Tablet 
##  177282   63482    3634

Tables

In the previous section, we used the table() function to tabulate categorical data. We will recreate the tabulation for device and store it in a new variable tab.

tab <- table(data$device)
tab
## 
## Desktop  Mobile  Tablet 
##  177282   63482    3634

What does this function return? It is not a vector, list, data.frame or matrix. Let us use the class() function to check the class of the object returned by table(). It returns an object of the class table. This is a new type of object. Let us spend some time understanding tables as they are useful for organizing and summarizing categorical data. table is also the most used object when it comes to dealing with categorical data.

The table() function returns the counts of the categories but let us say we want to view the proportion or percentage instead of counts i.e. the proportion or percentage of traffic driven to our website by the different devices. The proportions() or prop.table() function comes in handy in such cases. It takes a table object as input (tab in our case).

prop.table(tab)
## 
##    Desktop     Mobile     Tablet 
## 0.72538237 0.25974844 0.01486919
proportions(tab)
## 
##    Desktop     Mobile     Tablet 
## 0.72538237 0.25974844 0.01486919

To get the percentages, multiply the output by 100. Use the round() function to round the decimal places according to your requirements.

proportions(tab) * 100
## 
##   Desktop    Mobile    Tablet 
## 72.538237 25.974844  1.486919
round(proportions(tab) * 100, 2)
## 
## Desktop  Mobile  Tablet 
##   72.54   25.97    1.49

So far, we have used table() to tabulate a single categorical variable. It can be used for a lot more than just tabulating data. We can examine the relationship between two categorical variables as well as create multidimensional tables. Let us look at the relationship between gender and device in our case study. Does gender affect the type of device used? To answer this, we will create a two way or cross table. In the table() function, we can specify multiple variables by separating them with a comma.

tab2 <- table(data$gender, data$device)
tab2
##         
##          Desktop Mobile Tablet
##   female   32803   7268    494
##   male     46418  14503    696
##   <NA>     98061  41711   2444

Keep in mind that the order of the variables matter. Rows represent the first variable while column represents the second.

table(data$device, data$gender)
##          
##           female  male  <NA>
##   Desktop  32803 46418 98061
##   Mobile    7268 14503 41711
##   Tablet     494   696  2444

The proportions() function works with two way tables as well.

proportions(tab2)
##         
##              Desktop      Mobile      Tablet
##   female 0.134219593 0.029738378 0.002021293
##   male   0.189927904 0.059341729 0.002847814
##   <NA>   0.401234871 0.170668336 0.010000082
proportions(tab2) * 100
##         
##             Desktop     Mobile     Tablet
##   female 13.4219593  2.9738378  0.2021293
##   male   18.9927904  5.9341729  0.2847814
##   <NA>   40.1234871 17.0668336  1.0000082

We would like to introduce another function at this point of time, margin.table(). What does this function do? It computes the marginal frequencies i.e. the sum of the rows or columns. It takes a table object as input. The margin argument allows us to specify whether we want the sum of rows or columns. 1 indicates rows and 2 indicates columns.

margin.table(tab2, 1) # sum of rows
## 
## female   male   <NA> 
##  40565  61617 142216
margin.table(tab2, 2) # sum of columns
## 
## Desktop  Mobile  Tablet 
##  177282   63482    3634

If the margin argument is NULL (which it is by default), the function returns the sum of all cells of the table.

margin.table(tab2)
## [1] 244398

table() does not display row or column labels. It does display the group labels though. Let us revisit the output from tab2. You can observe that while it includes the group labels, the row and column labels are missing. The output from the dimnames() function shows the group labels of the variables but the row & column labels are absent.

dimnames(tab2)
## [[1]]
## [1] "female" "male"   NA      
## 
## [[2]]
## [1] "Desktop" "Mobile"  "Tablet"
names(tab2)
## NULL
names(dimnames(tab2)) 
## [1] "" ""

The output from names(dimnames(tab2)) is also empty. Let us add the variable names as the row & column labels to tab2.

names(dimnames(tab2)) <- c("Gender", "Device")
tab2
##         Device
## Gender   Desktop Mobile Tablet
##   female   32803   7268    494
##   male     46418  14503    696
##   <NA>     98061  41711   2444

Now look at the output from tab2 and you can observe the difference. The same is also visible when we run dimnames(tab2).

dimnames(tab2)
## $Gender
## [1] "female" "male"   NA      
## 
## $Device
## [1] "Desktop" "Mobile"  "Tablet"

To add margin totals to the table, use addmargins(). Like proportions() and margin.table(), it also takes a table object as the input.

addmargins(tab2)
##         Device
## Gender   Desktop Mobile Tablet    Sum
##   female   32803   7268    494  40565
##   male     46418  14503    696  61617
##   <NA>     98061  41711   2444 142216
##   Sum     177282  63482   3634 244398

rowSums() returns the row total while colSums() returns the column total. They are similar to margin.table().

rowSums(tab2)
## female   male   <NA> 
##  40565  61617 142216
colSums(tab2)
## Desktop  Mobile  Tablet 
##  177282   63482    3634

xtabs() is another way of creating multidimensional tables in R. In comparison to table(), it

  • uses formula notation for input
  • the data argument ensures variable names are referenced instead of using $ i.e. data$variable
  • displays row & column labels by default
tabx <- xtabs(~gender+device, data = data)
tabx
##         device
## gender   Desktop Mobile Tablet
##   female   32803   7268    494
##   male     46418  14503    696
##   <NA>     98061  41711   2444

The following functions work with xtabs() as well

  • proportions()
  • margin.table()
  • addmargins()
proportions(tabx)
##         device
## gender       Desktop      Mobile      Tablet
##   female 0.134219593 0.029738378 0.002021293
##   male   0.189927904 0.059341729 0.002847814
##   <NA>   0.401234871 0.170668336 0.010000082
margin.table(tabx, 1)
## gender
## female   male   <NA> 
##  40565  61617 142216
margin.table(tabx, 2)
## device
## Desktop  Mobile  Tablet 
##  177282   63482    3634
addmargins(tabx)
##         device
## gender   Desktop Mobile Tablet    Sum
##   female   32803   7268    494  40565
##   male     46418  14503    696  61617
##   <NA>     98061  41711   2444 142216
##   Sum     177282  63482   3634 244398

So far, we have been working with one or two dimensional tables. Both the table() and xtabs() functions are capable of creating multidimensional tables. Keep in mind that multidimensional tables are complex and it becomes increasingly difficult to understand or interpret them.

tab3 <- xtabs(~gender+device+channel, data = data)
tab3
## , , channel = (Other)
## 
##         device
## gender   Desktop Mobile Tablet
##   female     786    258      0
##   male      1063    507     19
##   <NA>      2173   1186     81
## 
## , , channel = Affiliates
## 
##         device
## gender   Desktop Mobile Tablet
##   female    1314     60      0
##   male      1714    169      0
##   <NA>      3518    548     65
## 
## , , channel = Direct
## 
##         device
## gender   Desktop Mobile Tablet
##   female    4785    977     59
##   male      7010   2381     95
##   <NA>     15824   8292    430
## 
## , , channel = Display
## 
##         device
## gender   Desktop Mobile Tablet
##   female     123    753    104
##   male       210    491     73
##   <NA>       554    911    156
## 
## , , channel = Organic Search
## 
##         device
## gender   Desktop Mobile Tablet
##   female   17109   4480    282
##   male     25016   9563    448
##   <NA>     54071  27223   1476
## 
## , , channel = Paid Search
## 
##         device
## gender   Desktop Mobile Tablet
##   female     645    230     22
##   male       887    478     26
##   <NA>      1274    782     51
## 
## , , channel = Referral
## 
##         device
## gender   Desktop Mobile Tablet
##   female    7387     74      0
##   male      9251    185      0
##   <NA>     18052    615     51
## 
## , , channel = Social
## 
##         device
## gender   Desktop Mobile Tablet
##   female     654    436     27
##   male      1267    729     35
##   <NA>      2595   2154    134

ftable stands for flat tables and is useful for printing attractive tables. It makes it easy to read and interpret multidimensional tables. In the next example, we will use ftable() to print the tables we have created in the previous examples and compare the outputs.

ftable(tabx)
##        device Desktop Mobile Tablet
## gender                             
## female          32803   7268    494
## male            46418  14503    696
## NA              98061  41711   2444
ftable(tab2)
##        Device Desktop Mobile Tablet
## Gender                             
## female          32803   7268    494
## male            46418  14503    696
## NA              98061  41711   2444
ftable(tab3)
##                channel (Other) Affiliates Direct Display Organic Search Paid Search Referral Social
## gender device                                                                                      
## female Desktop             786       1314   4785     123          17109         645     7387    654
##        Mobile              258         60    977     753           4480         230       74    436
##        Tablet                0          0     59     104            282          22        0     27
## male   Desktop            1063       1714   7010     210          25016         887     9251   1267
##        Mobile              507        169   2381     491           9563         478      185    729
##        Tablet               19          0     95      73            448          26        0     35
## NA     Desktop            2173       3518  15824     554          54071        1274    18052   2595
##        Mobile             1186        548   8292     911          27223         782      615   2154
##        Tablet               81         65    430     156           1476          51       51    134

By default, missing values (NAs) are excluded from tables. Let us modify the gender data from our case study a bit and see how the table() function deals with missing values. We won’t explicitly specify NA as a level while recreating the gender data.

gen <- as.factor(as.character(data$gender))
table(gen)
## gen
## female   male 
##  40565  61617

As you can see, table() excludes missing values while tabulating the data. In order to ensure that missing values are also counted, we can use the useNA argument. It can take two values:

  • ifany
  • always

In the first case, it will show NA as a level and the count only if there are missing values in the data. In the second case, it will always show NA as a level irrespective of whether there are missing values in the data or not.

table(gen, useNA = "ifany")
## gen
## female   male   <NA> 
##  40565  61617 142216
table(data$device, useNA = "always")
## 
## Desktop  Mobile  Tablet    <NA> 
##  177282   63482    3634       0

In this final section on tables, we will learn how to select/access the different parts of a table. We will use [ operator to select rows and columns of a table (it is similar to selecting data from a data.frame). Below are a few examples:

  • select first row
tab2[1, ]           
## Desktop  Mobile  Tablet 
##   32803    7268     494
  • select first column
tab2[, 1]           
## female   male   <NA> 
##  32803  46418  98061
  • select first two rows
tab2[1:2, ]         
##         Device
## Gender   Desktop Mobile Tablet
##   female   32803   7268    494
##   male     46418  14503    696
  • select first two columns
tab2[, 1:2]        
##         Device
## Gender   Desktop Mobile
##   female   32803   7268
##   male     46418  14503
##   <NA>     98061  41711
  • select nth row
tab2[2, ]          
## Desktop  Mobile  Tablet 
##   46418   14503     696
  • select nth column
tab2[, 2]          
## female   male   <NA> 
##   7268  14503  41711
  • select row by group label
tab2["female", ]   
## Desktop  Mobile  Tablet 
##   32803    7268     494
  • select column by group label
tab2[, "Mobile"]   
## female   male   <NA> 
##   7268  14503  41711

Before we end this section, let us learn how to test if an object is of class table using is.table().

is.table(tab2)
## [1] TRUE

Next, we will look at different R packages for two way/contingency tables.

Contingency Table

For cross tables with output similar to SAS or SPSS, use any of the below:

  • CrossTable() from the gmodels package
  • ds_cross_table() from the descriptr package
gmodels::CrossTable(data$device, data$gender)
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## | Chi-square contribution |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  102182 
## 
##  
##              | data$gender 
##  data$device |    female |      male | Row Total | 
## -------------|-----------|-----------|-----------|
##      Desktop |     32803 |     46418 |     79221 | 
##              |    58.228 |    38.334 |           | 
##              |     0.414 |     0.586 |     0.775 | 
##              |     0.809 |     0.753 |           | 
##              |     0.321 |     0.454 |           | 
## -------------|-----------|-----------|-----------|
##       Mobile |      7268 |     14503 |     21771 | 
##              |   218.694 |   143.975 |           | 
##              |     0.334 |     0.666 |     0.213 | 
##              |     0.179 |     0.235 |           | 
##              |     0.071 |     0.142 |           | 
## -------------|-----------|-----------|-----------|
##       Tablet |       494 |       696 |      1190 | 
##              |     0.986 |     0.649 |           | 
##              |     0.415 |     0.585 |     0.012 | 
##              |     0.012 |     0.011 |           | 
##              |     0.005 |     0.007 |           | 
## -------------|-----------|-----------|-----------|
## Column Total |     40565 |     61617 |    102182 | 
##              |     0.397 |     0.603 |           | 
## -------------|-----------|-----------|-----------|
## 
## 
descriptr::ds_cross_table(data, device, gender)
##     Cell Contents
##  |---------------|
##  |     Frequency |
##  |       Percent |
##  |       Row Pct |
##  |       Col Pct |
##  |---------------|
## 
##  Total Observations:  244398 
## 
## ----------------------------------------------------------------------------
## |              |                          gender                           |
## ----------------------------------------------------------------------------
## |       device |       female |         male |           NA |    Row Total |
## ----------------------------------------------------------------------------
## |      Desktop |        32803 |        46418 |        98061 |       177282 |
## |              |        0.134 |         0.19 |        0.401 |              |
## |              |         0.19 |         0.26 |         0.55 |         0.73 |
## |              |         0.81 |         0.75 |         0.69 |              |
## ----------------------------------------------------------------------------
## |       Mobile |         7268 |        14503 |        41711 |        63482 |
## |              |         0.03 |        0.059 |        0.171 |              |
## |              |         0.11 |         0.23 |         0.66 |         0.26 |
## |              |         0.18 |         0.24 |         0.29 |              |
## ----------------------------------------------------------------------------
## |       Tablet |          494 |          696 |         2444 |         3634 |
## |              |        0.002 |        0.003 |         0.01 |              |
## |              |         0.14 |         0.19 |         0.67 |         0.01 |
## |              |         0.01 |         0.01 |         0.02 |              |
## ----------------------------------------------------------------------------
## | Column Total |        40565 |        61617 |       142216 |       244398 |
## |              |        0.166 |        0.252 |        0.582 |              |
## ----------------------------------------------------------------------------

We list and explore different R packages for summarizing categorical data in our GitHub repository.

Key Functions

Your Turn…

  1. Display the number of levels in

    • browser
    • channel
    • landing_page
    • exit_page
  2. Display the categories in

    • os
    • channel
    • browser
    • gender
    • user_type
  3. Display the count/frequency of

    • channel
    • user_type
  4. Examine the distribution of the following and summarize your observations:

    • channel by user_type
    • device by purchase_flag
    • channel by device
    • channel by purchase_flag
    • user_type by purchase_flag

*As the reader of this blog, you are our most important critic and commentator. We value your opinion and want to know what we are doing right, what we could do better, what areas you would like to see us publish in, and any other words of wisdom you are willing to pass our way.

We welcome your comments. You can email to let us know what you did or did not like about our blog as well as what we can do to make our post better.*

Email:

To leave a comment for the author, please follow the link and comment on their blog: Rsquared Academy Blog - Explore Discover Learn.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)