Exploring Categorical Data With Inspectdf

[This article was first published on Alastair Rushworth, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Exploring categorical data with inspectdf

What’s inspectdf and what’s it for?

I often find myself viewing and reviewing dataframes throughout the
course of an analysis, and a substantial amount of time can be spent
rewriting the same code to do this. inspectdf is an R package designed
to make common exploratory tools a bit more useful and easy to use.

In particular, it’s very powerful be able to quickly see the contents of
categorical features. In this article, we’ll summarise how to use the
inspect_cat() function from inspectdf for summarising and
visualising categorical columns.

First of all, you’ll need to have the inspectdf package installed. You
can get it from github using

<span class="n">library</span><span class="p">(</span><span class="n">devtools</span><span class="p">)</span><span class="w">
</span><span class="n">install_github</span><span class="p">(</span><span class="s2">"alastairrushworth/inspectdf"</span><span class="p">)</span><span class="w">

Then load the package in. We’ll also load dplyr for the starwars
data and for the pipe %>%.

<span class="n">library</span><span class="p">(</span><span class="n">inspectdf</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">dplyr</span><span class="p">)</span><span class="w">

</span><span class="c1"># check out the starwars help file</span><span class="w">
</span><span class="o">?</span><span class="n">starwars</span><span class="w">

Tabular summaries using inspect_cat()

The starwars data that comes bundled with dplyr has 7 columns that
have character class, and is therefore a nice candidate for illustrating
the use of inspect_cat. We can see this quickly using the
inspect_types() function from inspectdf.

<span class="n">starwars</span><span class="w"> </span><span class="o">%>%</span><span class="w"> </span><span class="n">inspect_types</span><span class="p">()</span><span class="w">
## # A tibble: 4 x 4
##   type        cnt  pcnt col_name 
##   <chr>     <int> <dbl> <list>   
## 1 character     7 53.8  <chr [7]>
## 2 list          3 23.1  <chr [3]>
## 3 numeric       2 15.4  <chr [2]>
## 4 integer       1  7.69 <chr [1]>

Using inspect_cat() is very straightforward:

<span class="n">star_cat</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">starwars</span><span class="w"> </span><span class="o">%>%</span><span class="w"> </span><span class="n">inspect_cat</span><span class="p">()</span><span class="w">
</span><span class="n">star_cat</span><span class="w">
## # A tibble: 7 x 5
##   col_name     cnt common common_pcnt levels           
##   <chr>      <int> <chr>        <dbl> <list>           
## 1 eye_color     15 brown        24.1  <tibble [15 × 3]>
## 2 gender         5 male         71.3  <tibble [5 × 3]> 
## 3 hair_color    13 none         42.5  <tibble [13 × 3]>
## 4 homeworld     49 Naboo        12.6  <tibble [49 × 3]>
## 5 name          87 Ackbar        1.15 <tibble [87 × 3]>
## 6 skin_color    31 fair         19.5  <tibble [31 × 3]>
## 7 species       38 Human        40.2  <tibble [38 × 3]>

So what does this tell us? Each row in the tibble returned from
inspect_cat() corresponds to each categorical column (factor,
logical or character) in the starwars dataframe.

  • The cnt column tells you how many unique levels there are for each
    column. For example, there are 15 unique entries in the eye_color
  • The common column prints the most commonly occurring entry. For
    example, the most common eye_color is brown. The percentage
    occurrence is 24.1% which is shown under common_pcnt.
  • A full list of levels and occurrence frequency is provided in the
    list column levels.

A table of relative frequencies of eye_color can be retrieved by

<span class="n">star_cat</span><span class="o">$</span><span class="n">levels</span><span class="o">$</span><span class="n">eye_color</span><span class="w">
## # A tibble: 15 x 3
##    value           prop   cnt
##    <chr>          <dbl> <int>
##  1 brown         0.241     21
##  2 blue          0.218     19
##  3 yellow        0.126     11
##  4 black         0.115     10
##  5 orange        0.0920     8
##  6 red           0.0575     5
##  7 hazel         0.0345     3
##  8 unknown       0.0345     3
##  9 blue-gray     0.0115     1
## 10 dark          0.0115     1
## 11 gold          0.0115     1
## 12 green, yellow 0.0115     1
## 13 pink          0.0115     1
## 14 red, blue     0.0115     1
## 15 white         0.0115     1

There isn’t anything here that can’t be obtained by using the base
table() function with some post-processing. inspect_cat() automates
some of that functionality and wraps it into a single, convenient

Visualising categorical columns with show_plot()

An important feature of inspectdf is the ability to visualise
dataframe summaries. Visualising categories can be challenging, because
categorical columns can be very rich and contain many unique levels. A
simple stacked barplot can be produced using show_plot()

<span class="n">star_cat</span><span class="w"> </span><span class="o">%>%</span><span class="w"> </span><span class="n">show_plot</span><span class="p">()</span><span class="w">

Like the star_cat tibble returned by inspect_cat(), each row of the
plot is a single column, split by the relative frequency of occurrence
of each unique entry.

  • Some of the bars are labelled, but in cases where the bars are
    small, the labels are not shown. If you encounter categorical
    columns with really long strings, labels can be suppressed
    altogether with show_plot(text_labels = FALSE).
  • Missing values or NAs are shown as gray bars. In this case, there
    are quite a few starwars characters whose homeworld is not
    unknown or missing.

Combining rare entries with show_plot()

Some of the categorical columns like name seems to have a lot of
unique entries. We should expect this – names often are unique (or
almost) in a small dataset. If we scaled this analysis up to a dataset
with millions of rows, there would be so many names with very small
relative frequencies that the name bars would be very difficult to see.
show_plot() can help with this too!

<span class="n">star_cat</span><span class="w"> </span><span class="o">%>%</span><span class="w"> </span><span class="n">show_plot</span><span class="p">(</span><span class="n">high_cardinality</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">)</span><span class="w">

By setting the argument high_cardinality = 1 all entries that occur
only once are combined into a single group labelled high
. This makes it easier to see when some entries occur only
once (or extremely rarely).

  • In the above, it’s now obvious that no two people in the starwars
    data share the same name, and that many come from a unique
    homeworld or species.
  • By setting high_cardinality = 2 or even greater, it’s possible to
    group the ‘long-tail’ of rare categories even further. With larger
    datasets, this becomes increasingly important for visualisation.
  • A practical reason to combine rare entries, is plotting speed – it
    can take a long time to render a plot with tens of thousands (or
    more) unique bars! Using the high_cardinality argument can reduce
    this dramatically.

Playing with color options in show_plot()

It’s been pointed out that the default ggplot color theme isn’t
particularly friendly to color-blind audiences. A more color-blind
friendly theme
available by specifying col_palette = 1:

<span class="n">star_cat</span><span class="w"> </span><span class="o">%>%</span><span class="w"> </span><span class="n">show_plot</span><span class="p">(</span><span class="n">col_palette</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">)</span><span class="w">

I’m also quite fond of the 80s theme by choosing col_palette = 2:

<span class="n">star_cat</span><span class="w"> </span><span class="o">%>%</span><span class="w"> </span><span class="n">show_plot</span><span class="p">(</span><span class="n">col_palette</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">)</span><span class="w">

There are 5 palettes at the moment, so have a play around. Note that the
color palettes have not yet hit the CRAN version of inspectdf – that
will come soon in an update, but for now you can get them from the
github version of the package using the code at the start of the

Comments? Suggestions? Issues?

Any feedback is welcome! Find me on twitter at
rushworth_a or write a github

To leave a comment for the author, please follow the link and comment on their blog: Alastair Rushworth.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)