Exploring Categorical Data With Inspectdf

[This article was first published on Alastair Rushworth, and kindly contributed to R-bloggers].

What’s inspectdf and what’s it for?

I often find myself viewing and reviewing dataframes throughout the course of an analysis, and a substantial amount of time can be spent rewriting the same code to do this. inspectdf is an R package designed to make common exploratory tools a bit more useful and easier to use.

In particular, it’s very powerful to be able to quickly see the contents of categorical features. In this article, we’ll look at how to use the inspect_cat() function from inspectdf to summarise and visualise categorical columns.

First of all, you’ll need to have the inspectdf package installed. You can get it from GitHub using:

library(devtools)
install_github("alastairrushworth/inspectdf")

Then load the package in. We’ll also load dplyr for the starwars data and for the pipe %>%.

library(inspectdf)
library(dplyr)

# check out the starwars help file
?starwars

Tabular summaries using inspect_cat()

The starwars data that comes bundled with dplyr has 7 columns of class character, which makes it a nice candidate for illustrating the use of inspect_cat(). We can see this quickly using the inspect_types() function from inspectdf.

starwars %>% inspect_types()

## # A tibble: 4 x 4
##   type        cnt  pcnt col_name 
##   <chr>     <int> <dbl> <list>   
## 1 character     7 53.8  <chr [7]>
## 2 list          3 23.1  <chr [3]>
## 3 numeric       2 15.4  <chr [2]>
## 4 integer       1  7.69 <chr [1]>
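For comparison, a similar tally can be put together by hand in base R. This is only a rough sketch of the idea, not how inspect_types() is implemented; the names col_types and type_tab are made up for illustration, and the counts you see will depend on your installed version of dplyr.

```r
library(dplyr)  # only needed here for the starwars data

# First class of each column (a column can carry more than one class)
col_types <- vapply(starwars, function(x) class(x)[1], character(1))

# Tally columns by type, most frequent first
type_tab <- sort(table(col_types), decreasing = TRUE)
type_tab

# Percentage of columns of each type
round(100 * prop.table(type_tab), 1)
```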

Using inspect_cat() is very straightforward:

star_cat <- starwars %>% inspect_cat()
star_cat

## # A tibble: 7 x 5
##   col_name     cnt common common_pcnt levels           
##   <chr>      <int> <chr>        <dbl> <list>           
## 1 eye_color     15 brown        24.1  <tibble [15 × 3]>
## 2 gender         5 male         71.3  <tibble [5 × 3]> 
## 3 hair_color    13 none         42.5  <tibble [13 × 3]>
## 4 homeworld     49 Naboo        12.6  <tibble [49 × 3]>
## 5 name          87 Ackbar        1.15 <tibble [87 × 3]>
## 6 skin_color    31 fair         19.5  <tibble [31 × 3]>
## 7 species       38 Human        40.2  <tibble [38 × 3]>

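So what does this tell us? Each row in the tibble returned by inspect_cat() corresponds to one categorical column (factor, logical or character) in the starwars dataframe. cnt is the number of unique levels in the column, common is the most frequently occurring level, common_pcnt is the percentage of rows taken up by that level, and levels is a named list of tibbles holding the relative frequency of every level.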
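Because the summary is itself a tibble, the usual dplyr verbs should work on it. The sketch below keeps only the columns with a modest number of unique levels and sorts them by how dominant the most common level is. The cut-off of 15 and the name few_levels are arbitrary choices for illustration, and the column names are assumed to match the output printed above.

```r
library(inspectdf)
library(dplyr)

star_cat <- starwars %>% inspect_cat()

# Keep columns with at most 15 unique levels, most concentrated first
few_levels <- star_cat %>%
  filter(cnt <= 15) %>%
  arrange(desc(common_pcnt)) %>%
  select(col_name, cnt, common, common_pcnt)

few_levels
```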

A table of relative frequencies of eye_color can be retrieved by typing:

star_cat$levels$eye_color

## # A tibble: 15 x 3
##    value           prop   cnt
##    <chr>          <dbl> <int>
##  1 brown         0.241     21
##  2 blue          0.218     19
##  3 yellow        0.126     11
##  4 black         0.115     10
##  5 orange        0.0920     8
##  6 red           0.0575     5
##  7 hazel         0.0345     3
##  8 unknown       0.0345     3
##  9 blue-gray     0.0115     1
## 10 dark          0.0115     1
## 11 gold          0.0115     1
## 12 green, yellow 0.0115     1
## 13 pink          0.0115     1
## 14 red, blue     0.0115     1
## 15 white         0.0115     1

There isn’t anything here that can’t be obtained by using the base table() function with some post-processing. inspect_cat() automates some of that functionality and wraps it into a single, convenient function.
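To make that concrete, here is roughly what the base R route might look like for eye_color. It is a sketch rather than a copy of what inspect_cat() does internally: the names eye_tab and eye_freq are invented for this example, and the ordering of tied levels may differ.

```r
library(dplyr)  # only needed here for the starwars data

# Count each unique eye colour, keeping NA as its own level if present
eye_tab <- table(starwars$eye_color, useNA = "ifany")

# Post-process the table into value / prop / cnt columns
eye_freq <- data.frame(
  value = names(eye_tab),
  prop  = as.numeric(eye_tab) / sum(eye_tab),
  cnt   = as.integer(eye_tab),
  stringsAsFactors = FALSE
)

# Most common levels first
eye_freq <- eye_freq[order(-eye_freq$cnt), ]
head(eye_freq)
```

That is a fair amount of typing for a single column, and it would need to be wrapped in a loop to cover every categorical column in the dataframe.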

Visualising categorical columns with show_plot()

An important feature of inspectdf is the ability to visualise dataframe summaries. Visualising categories can be challenging, because categorical columns can be very rich and contain many unique levels. A simple stacked barplot can be produced using show_plot():

star_cat %>% show_plot()

Like the star_cat tibble returned by inspect_cat(), each row of the plot is a single column, split by the relative frequency of occurrence of each unique entry.

Combining rare entries with show_plot()

Some of the categorical columns, like name, seem to have a lot of unique entries. We should expect this: names are often unique (or almost unique) in a small dataset. If we scaled this analysis up to a dataset with millions of rows, there would be so many names with very small relative frequencies that the name bars would be very difficult to see. show_plot() can help with this too!

star_cat %>% show_plot(high_cardinality = 1)

By setting the argument high_cardinality = 1, all entries that occur only once are combined into a single group labelled high cardinality. This makes it easier to see when some entries occur only once (or extremely rarely).
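The same kind of grouping can be imitated by hand with dplyr, which may help to show what the plot is doing. This is a sketch under the assumption that the threshold means "entries seen at most this many times are combined"; it is not taken from the show_plot() source, and threshold and eye_lumped are names invented for this example.

```r
library(dplyr)

# Plays the role of high_cardinality = 1 (assumed meaning: entries that
# occur at most this many times are combined into one group)
threshold <- 1

eye_lumped <- starwars %>%
  count(eye_color, name = "cnt") %>%
  mutate(value = ifelse(cnt <= threshold, "high cardinality", eye_color)) %>%
  group_by(value) %>%
  summarise(cnt = sum(cnt)) %>%
  mutate(prop = cnt / sum(cnt)) %>%
  arrange(desc(cnt))

eye_lumped
```

In the eye_color table shown earlier, the entries with a count of 1 would all end up in the single high cardinality row.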

Playing with color options in show_plot()

It’s been pointed out that the default ggplot color theme isn’t particularly friendly to color-blind audiences. A more color-blind friendly theme is available by specifying col_palette = 1:

star_cat %>% show_plot(col_palette = 1)

I’m also quite fond of the 80s theme, which you get by choosing col_palette = 2:

star_cat %>% show_plot(col_palette = 2)

There are 5 palettes at the moment, so have a play around. Note that the color palettes have not yet hit the CRAN version of inspectdf. That will come soon in an update, but for now you can get them from the GitHub version of the package using the code at the start of the article.

Comments? Suggestions? Issues?

Any feedback is welcome! Find me on Twitter at rushworth_a or write a GitHub issue.
