forcats 0.1.0 🐈🐈🐈🐈

August 31, 2016

(This article was first published on RStudio Blog, and kindly contributed to R-bloggers)

If you use packages from the tidyverse (like tibble and readr) you don’t need to worry about getting factors when you don’t want them. But factors are a useful data structure in their own right, particularly for modelling and visualisation, because they allow you to control the order of the levels. Working with factors in base R can be a little frustrating because of a handful of missing tools. The goal of forcats is to fill in those missing pieces so you can access the power of factors with a minimum of pain.

Install forcats with:

install.packages("forcats")

forcats provides two main types of tools to change either the values or the order of the levels. I’ll call out some of the most important functions below, using the included gss_cat dataset, which contains a selection of categorical variables from the General Social Survey.

library(dplyr)
library(ggplot2)
library(forcats)

gss_cat
#> # A tibble: 21,483 × 9
#>    year       marital   age   race        rincome            partyid
#>   <int>        <fctr> <int> <fctr>         <fctr>             <fctr>
#> 1  2000 Never married    26  White  $8000 to 9999       Ind,near rep
#> 2  2000      Divorced    48  White  $8000 to 9999 Not str republican
#> 3  2000       Widowed    67  White Not applicable        Independent
#> 4  2000 Never married    39  White Not applicable       Ind,near rep
#> 5  2000      Divorced    25  White Not applicable   Not str democrat
#> 6  2000       Married    25  White $20000 - 24999    Strong democrat
#> # ... with 2.148e+04 more rows, and 3 more variables: relig <fctr>,
#> #   denom <fctr>, tvhours <int>

Change level values

You can recode specified factor levels with fct_recode():

gss_cat %>% count(partyid)
#> # A tibble: 10 × 2
#>              partyid     n
#>               <fctr> <int>
#> 1          No answer   154
#> 2         Don't know     1
#> 3        Other party   393
#> 4  Strong republican  2314
#> 5 Not str republican  3032
#> 6       Ind,near rep  1791
#> # ... with 4 more rows

gss_cat %>%
  mutate(partyid = fct_recode(partyid,
    "Republican, strong"    = "Strong republican",
    "Republican, weak"      = "Not str republican",
    "Independent, near rep" = "Ind,near rep",
    "Independent, near dem" = "Ind,near dem",
    "Democrat, weak"        = "Not str democrat",
    "Democrat, strong"      = "Strong democrat"
  )) %>%
  count(partyid)
#> # A tibble: 10 × 2
#>                 partyid     n
#>                  <fctr> <int>
#> 1             No answer   154
#> 2            Don't know     1
#> 3           Other party   393
#> 4    Republican, strong  2314
#> 5      Republican, weak  3032
#> 6 Independent, near rep  1791
#> # ... with 4 more rows

Note that unmentioned levels are left as is, and the order of the levels is preserved.
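
The same behaviour is easier to see on a small factor. The following is a minimal sketch using a made-up factor (size is hypothetical and not part of gss_cat); it also illustrates that assigning the same new name to several old levels merges them.

```r
library(forcats)

# Hypothetical toy factor with an explicit level order (not from gss_cat)
size <- factor(c("lo", "mid", "hi", "mid"), levels = c("lo", "mid", "hi"))

# Rename one level: new name on the left, old name on the right.
# "lo" and "hi" are not mentioned, so they are left alone, and the
# renamed level keeps its original position.
renamed <- fct_recode(size, "medium" = "mid")
levels(renamed)
#> [1] "lo"     "medium" "hi"

# Giving two old levels the same new name merges them into one level
merged <- fct_recode(size, "not hi" = "lo", "not hi" = "mid")
levels(merged)
#> [1] "not hi" "hi"
```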

fct_lump() allows you to lump the rarest (or most common) levels into a new “other” level. The default behaviour is to collapse the smallest levels into other, ensuring that it’s still the smallest level. For the religion variable, that tells us that Protestants outnumber all other religions, which is interesting, but we probably want more levels.

gss_cat %>% 
  mutate(relig = fct_lump(relig)) %>% 
  count(relig)
#> # A tibble: 2 × 2
#>        relig     n
#>       <fctr> <int>
#> 1      Other 10637
#> 2 Protestant 10846

Alternatively, you can supply a number of levels to keep, n, or a minimum proportion for inclusion, prop. If you use negative values, fct_lump() will change direction, and combine the most common values while preserving the rarest.

gss_cat %>% 
  mutate(relig = fct_lump(relig, n = 5)) %>% 
  count(relig)
#> # A tibble: 6 × 2
#>        relig     n
#>       <fctr> <int>
#> 1      Other   913
#> 2  Christian   689
#> 3       None  3523
#> 4     Jewish   388
#> 5   Catholic  5124
#> 6 Protestant 10846

gss_cat %>% 
  mutate(relig = fct_lump(relig, prop = -0.10)) %>% 
  count(relig)
#> # A tibble: 12 × 2
#>                     relig     n
#>                    <fctr> <int>
#> 1               No answer    93
#> 2              Don't know    15
#> 3 Inter-nondenominational   109
#> 4         Native american    23
#> 5               Christian   689
#> 6      Orthodox-christian    95
#> # ... with 6 more rows
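
To see what n and prop do without the size of gss_cat getting in the way, here is a small sketch on a made-up factor (x and its counts are hypothetical). It only checks how many values end up in the pooled level, because where that level sits in the level order may differ between forcats versions.

```r
library(forcats)

# Hypothetical toy factor: "a" x5, "b" x3, "c" x1, "d" x1
x <- factor(c(rep("a", 5), rep("b", 3), "c", "d"))

# Keep the 2 most common levels; "c" and "d" are pooled into "Other"
top2 <- fct_lump(x, n = 2)
table(top2)[["Other"]]
#> [1] 2

# Keep levels that make up more than 20% of the values; the result is the
# same here because "c" and "d" each account for only 10%
by_prop <- fct_lump(x, prop = 0.2)
table(by_prop)[["Other"]]
#> [1] 2
```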

Change level order

There are four simple helpers for common operations:

  • fct_relevel() is similar to stats::relevel() but allows you to move any number of levels to the front.
  • fct_inorder() orders according to the first appearance of each level.
  • fct_infreq() orders from most common to rarest.
  • fct_rev() reverses the order of levels.
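
As a quick illustration of the four helpers, the sketch below applies each one to the same made-up factor (f is hypothetical, chosen so that every helper gives a different order).

```r
library(forcats)

# Hypothetical toy factor: "c" appears 3 times, "a" twice, "b" once
f <- factor(c("b", "a", "c", "c", "a", "c"))

levels(f)                         # default: alphabetical
#> [1] "a" "b" "c"

levels(fct_relevel(f, "b", "c"))  # move "b" and "c" to the front
#> [1] "b" "c" "a"

levels(fct_inorder(f))            # order of first appearance
#> [1] "b" "a" "c"

levels(fct_infreq(f))             # most common first
#> [1] "c" "a" "b"

levels(fct_rev(f))                # reverse the current order
#> [1] "c" "b" "a"
```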

fct_reorder() and fct_reorder2() are useful for visualisations. fct_reorder() reorders the factor levels by another variable. This is useful when you map a categorical variable to position, as shown in the following example which shows the average number of hours spent watching television across religions.

relig <- gss_cat %>%
  group_by(relig) %>%
  summarise(
    age = mean(age, na.rm = TRUE),
    tvhours = mean(tvhours, na.rm = TRUE),
    n = n()
  )

ggplot(relig, aes(tvhours, relig)) + geom_point()

ggplot(relig, aes(tvhours, fct_reorder(relig, tvhours))) +
  geom_point()
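
Without the plot, the effect of fct_reorder() can be checked by looking at the levels directly. This is a minimal sketch with made-up values (groups and value are hypothetical, with one value per level).

```r
library(forcats)

# Hypothetical summary data: one value per level
groups <- factor(c("a", "b", "c"))
value  <- c(3, 1, 2)

levels(groups)
#> [1] "a" "b" "c"

# Levels are sorted by a summary of `value` within each level (the median
# by default), in ascending order
levels(fct_reorder(groups, value))
#> [1] "b" "c" "a"
```

Only the level order changes; the values stored in the factor stay the same, which is why it is safe to do the reordering inside aes().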

fct_reorder2() extends the same idea to plots where a factor is mapped to another aesthetic, like colour. The defaults are designed to make legends easier to read for line plots, as shown in the following example looking at marital status by age.

# Proportion of each marital status within each age
by_age <- gss_cat %>%
  filter(!is.na(age)) %>%
  count(age, marital) %>%
  group_by(age) %>%
  mutate(prop = n / sum(n))

ggplot(by_age, aes(age, prop)) +
  geom_line(aes(colour = marital))

ggplot(by_age, aes(age, prop)) +
  geom_line(aes(colour = fct_reorder2(marital, age, prop))) +
  labs(colour = "marital")
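
The ordering rule behind fct_reorder2() can be sketched on a tiny made-up data set (lines is hypothetical). The assumption here is the default behaviour in current forcats releases: each level is summarised by the y value at its largest x, and levels are then sorted in descending order, so the legend matches the order of the line ends from top to bottom.

```r
library(forcats)

# Hypothetical data: three "lines", each observed at x = 1 and x = 2
lines <- data.frame(
  g = factor(rep(c("a", "b", "c"), each = 2)),
  x = rep(c(1, 2), times = 3),
  y = c(1, 5,   # line a ends at y = 5
        2, 9,   # line b ends at y = 9
        3, 1)   # line c ends at y = 1
)

# Sort levels by the y value at the largest x, highest first
levels(fct_reorder2(lines$g, lines$x, lines$y))
#> [1] "b" "a" "c"
```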

Learning more

You can learn more about forcats in R for Data Science, and on the forcats website.

Please let me know if you have more factor problems that forcats doesn’t help with!
