R pitfall #3: friggin’ factors

December 15, 2011

(This article was first published on Quantum Forest » rblogs, and kindly contributed to R-bloggers)

I received an email from one of my students expressing deep frustration with a seemingly simple problem. He had a factor containing names of potato lines and wanted to set some levels to NA. Using simple letters as example names, he was baffled by the result of the following code:

lines = factor(LETTERS)
lines
# [1] A B C D E F G H...
# Levels: A B C D E F G H...

linesNA = ifelse(lines %in% c('C', 'G', 'P'), NA, lines)
linesNA
#  [1]  1  2 NA  4  5  6 NA  8...

The factor has been converted to its underlying integer codes and there is no trace of the level names. Even forcing the result back to a factor does not recover them; the codes themselves become the levels. Newbie frustration guaranteed!

linesNA = factor(ifelse(lines %in% c('C', 'G', 'P'), NA, lines))
linesNA
# [1] 1    2    <NA> 4    5    6    <NA> 8...
# Levels: 1 2 4 5 6 8...

Under the hood, factors are integer vectors (of class factor) with an associated character vector that describes the levels (see Patrick Burns’s R Inferno PDF for details). We can deal directly with the levels using this:

linesNA = lines
levels(linesNA)[levels(linesNA) %in% c('C', 'G', 'P')] = NA
linesNA
# [1] A    B    <NA> D    E    F    <NA> H...
# Levels: A B D E F H...
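
To see the two pieces that make up a factor, it can help to take one apart. The following is only a small sketch using base R functions; the names codes and viaIfelse are illustrative and not part of the student's code.

```r
lines = factor(LETTERS)

typeof(lines)                    # "integer": the storage type behind the factor
codes = unclass(lines)           # integer codes, with the levels kept as an attribute
head(as.integer(lines))          # the bare codes
head(levels(lines))              # the labels those codes index into

# ifelse() builds its result from the logical test vector and does not
# carry over the factor's class or levels, so only the codes survive
viaIfelse = ifelse(lines %in% c('C', 'G', 'P'), NA, lines)
class(viaIfelse)                 # "integer"
```

This is why editing levels(linesNA) works where ifelse() does not: the assignment changes the labels in place and never rebuilds the vector.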

We could operate directly on lines; linesNA is there only to maintain consistency with the previous code. Another way of doing the same would be:

# Convert to character first so ifelse() returns names, not integer codes
linesNA = factor(ifelse(lines %in% c('C', 'G', 'P'),
                 NA, as.character(lines)))
linesNA
# [1] A    B    <NA> D    E    F    <NA> H...
# Levels: A B D E F H...
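
A third possibility, sketched here as an assumption rather than something from the original email exchange, is to assign NA to the matching elements and then discard the unused levels with droplevels() from base R. The name toNA is hypothetical.

```r
lines = factor(LETTERS)
toNA = c('C', 'G', 'P')            # hypothetical name for the lines to blank out

linesNA = lines
linesNA[linesNA %in% toNA] = NA    # the values become NA, but C, G and P are still levels
linesNA = droplevels(linesNA)      # drop the levels that no longer occur

levels(linesNA)
```

Subset assignment keeps the object a factor throughout, so there is no detour through integer codes or character vectors.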

I can believe that there are good reasons for the default behavior of operations on factors, but the results can drive people crazy (at least rhetorically speaking).
