# R pitfall #3: friggin’ factors

December 15, 2011

(This article was first published on Quantum Forest » rblogs, and kindly contributed to R-bloggers)

I received an email from one of my students expressing deep frustration with a seemingly simple problem. He had a factor containing the names of potato lines and wanted to set some levels to `NA`. Using single letters as example names, he was baffled by the result of the following code:

```r
lines = factor(LETTERS)
lines
# [1] A B C D E F G H...
# Levels: A B C D E F G H...

linesNA = ifelse(lines %in% c('C', 'G', 'P'), NA, lines)
linesNA
#  [1]  1  2 NA  4  5  6 NA  8...
```

The factor has been converted to its underlying integer codes, and there is no trace of the level names. Even forcing the result back into a factor does not recover them, because the codes themselves become the new levels. Newbie frustration guaranteed!

```r
linesNA = factor(ifelse(lines %in% c('C', 'G', 'P'), NA, lines))
linesNA
# [1] 1    2    <NA> 4    5    6    <NA> 8...
# Levels: 1 2 4 5 6 8...
```

Under the hood, factors are integer vectors (of class factor) with an associated character vector that describes the levels (see Patrick Burns’s R Inferno PDF for details). We can deal directly with the levels using this:

```r
linesNA = lines
levels(linesNA)[levels(linesNA) %in% c('C', 'G', 'P')] = NA
linesNA
# [1] A    B    <NA> D    E    F    <NA> H...
# Levels: A B D E F H...
```
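
The storage described above can be inspected directly. The following is only a minimal sketch (the variable name `codes` is introduced here for illustration and is not part of the original email) showing the integer codes, the labels they point to, and what is left once `ifelse` has built its result:

```r
lines <- factor(LETTERS)

typeof(lines)           # storage type of the underlying vector
as.integer(lines)[1:8]  # the integer codes stored in the factor
levels(lines)[1:8]      # the character labels those codes point to

# ifelse() builds its result from the logical test, so the factor's class and
# levels are not carried over: only the integer codes survive.
codes <- ifelse(lines %in% c('C', 'G', 'P'), NA, lines)
class(codes)
levels(codes)           # NULL, the labels are gone
```

Wrapping `codes` in `factor()` can then only use the numbers themselves as labels, which matches the second result shown earlier.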

We could operate directly on `lines`; the copy `linesNA` is there only to maintain consistency with the previous code. Another way of doing the same, going through the character labels rather than the codes, would be:

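```r
# Convert to character *inside* ifelse() so that the labels, not the
# integer codes, are what gets kept
linesNA = factor(ifelse(lines %in% c('C', 'G', 'P'),
                        NA, as.character(lines)))
linesNA
# [1] A    B    <NA> D    E    F    <NA> H...
# Levels: A B D E F H...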
```
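
A third option, sketched here as an assumption rather than something from the original exchange, is to assign `NA` to the matching elements and then drop the unused levels with `droplevels`. The helper name `set_levels_na` and its arguments are hypothetical and chosen only for this example:

```r
lines <- factor(LETTERS)

# Hypothetical helper (not from the original post): set the elements whose
# label is in `drop` to NA, then remove the levels that are no longer used.
set_levels_na <- function(f, drop) {
  f[f %in% drop] <- NA  # subassignment keeps the factor class and its levels
  droplevels(f)         # discard the now-unused levels
}

linesNA <- set_levels_na(lines, c('C', 'G', 'P'))
head(linesNA, 8)
levels(linesNA)
```

Because the assignment happens on the factor itself, the result never passes through the bare integer codes, so the labels are not lost along the way.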

I can believe that there are good reasons for the default behavior of operations on factors, but the results can drive people crazy (at least rhetorically speaking).
