(This article was first published on **R programming – The Chemical Statistician**, and kindly contributed to R-bloggers)

When I first encountered R, I learned to use the levels() function to find the possible values of a categorical variable. However, I recently noticed something very strange about this function.

Consider the built-in data set “iris” and its factor variable “Species”. Here are the possible values of “Species”, as shown by the levels() function.

    > levels(iris$Species)
    [1] "setosa"     "versicolor" "virginica"

Now, let’s remove all rows containing “setosa”. I will use the table() function to confirm that no rows contain “setosa”, and then I will apply the levels() function to “Species” again.

    > iris2 = subset(iris, Species != 'setosa')
    > table(iris2$Species)

        setosa versicolor  virginica 
             0         50         50 
    > levels(iris2$Species)
    [1] "setosa"     "versicolor" "virginica"

The new data set “iris2” does not have any rows containing “setosa” as a possible value of “Species”, yet the levels() function still shows “setosa” in its output.

According to the user G5W on Stack Overflow, this is a desirable behaviour for the levels() function. Here is my interpretation of the intent of the creators of base R: The possible values of a factor are a fundamental attribute of that variable, which should not be altered because of changes in the data.
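The idea that levels belong to the variable rather than to the data can be checked directly. The following is a minimal sketch, not taken from the Stack Overflow discussion; the objects `f` and `f_sub` are made-up names for illustration, and droplevels() is the base R function for discarding levels that no longer occur.

```r
# A small factor with three levels (hypothetical example data)
f <- factor(c("a", "b", "c", "b"))

# The levels are stored as an attribute of the factor object
attr(f, "levels")

# Subsetting keeps that attribute, even though "a" no longer occurs
f_sub <- f[f != "a"]
levels(f_sub)   # still lists "a"
table(f_sub)    # "a" is reported with a count of 0

# droplevels() removes levels that have no observations
levels(droplevels(f_sub))
```

Whether to keep or drop unused levels depends on the analysis: keeping them makes zero counts visible in tables, while dropping them avoids empty groups in summaries and plots.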

This can cause confusion and lead to wrong conclusions about the data, so here is my solution: From now on, I will use the unique() function to find the values of a categorical variable that are actually present in the data. Here is the result.

    > unique(iris2$Species)
    [1] versicolor virginica 
    Levels: setosa versicolor virginica

This is the output that I expect; “setosa” does not appear in the resulting vector. However, unique() still shows the original levels, which include “setosa”; that’s a nice feature.
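When only the values that actually occur are wanted, without the leftover levels, the result of unique() can be converted to a plain character vector. This is a sketch under the same setup as above; `present` and `iris3` are hypothetical names introduced here, and the droplevels() variant is an alternative that is not part of the approach described in this post.

```r
# Recreate the subset used above (iris ships with R's datasets package)
iris2 <- subset(iris, Species != "setosa")

# Values that actually occur, as a plain character vector with no levels
present <- as.character(unique(iris2$Species))
present

# Alternative: drop the unused levels from every factor in the data frame
iris3 <- droplevels(iris2)
levels(iris3$Species)
```

The first form leaves `iris2` unchanged, so the original levels remain available; the second changes the factor itself, after which levels() and unique() agree.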

I thank my colleagues Layne Newhouse, Jack Davis, and Dmity Shopin for their valuable discussion about this on LinkedIn.
