When I first encountered R, I learned to use the levels() function to find the possible values of a categorical variable.  However, I recently noticed something very strange about this function.

Consider the built-in data set “iris” and its factor variable “Species”.  Here are the possible values of “Species”, as shown by the levels() function.

> levels(iris$Species)

[1] "setosa" "versicolor" "virginica"

Now, let’s remove all rows containing “setosa”.  I will use the table() function to confirm that no rows contain “setosa”, and then I will apply the levels() function to “Species” again.

> iris2 = subset(iris, Species != 'setosa')
> table(iris2$Species)

    setosa versicolor  virginica 
         0         50         50 

> levels(iris2$Species)

[1] "setosa" "versicolor" "virginica"

The new data set “iris2” has no rows in which “Species” is “setosa”, yet the levels() function still shows “setosa” in its output.

According to the user G5W on Stack Overflow, this is a desirable behaviour for the levels() function.  Here is my interpretation of the intent of the creators of base R: The levels of a factor are a fundamental attribute of that variable, and they should not be altered just because the data change.
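To see why the unused level persists, it may help to look at how a factor is stored.  The following is a minimal sketch using only base R; the toy vectors “f” and “f2” are my own illustrative names and are not part of the iris example.

```r
# A factor stores integer codes plus a "levels" attribute.
f <- factor(c("a", "b", "c", "b"))

# Subsetting keeps the levels attribute intact,
# even when a level no longer occurs in the data.
f2 <- f[f != "a"]

levels(f2)          # the full set of levels, including "a"
attr(f2, "levels")  # the same information, read directly from the attribute
as.integer(f2)      # the underlying integer codes
table(f2)           # counts every level, so "a" is reported with a count of 0
```

Because table() counts by level rather than by observed value, it reports a zero for “a”, which matches what happened with “setosa” above.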

This can cause confusion, and even wrong conclusions, if you expect levels() to list only the values that are actually present in the data.  So here is my solution: From now on, I will use the unique() function to find the values that a factor actually takes.  Here is the result.

> unique(iris2$Species)

[1] versicolor virginica 
Levels: setosa versicolor virginica


This is the output that I expect; “setosa” does not appear in the resulting vector.  However, unique() still shows the original levels, which include “setosa”; that’s a nice feature.
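If you want only the values that are present, without the trailing “Levels” line, base R offers a couple of options.  The sketch below is one possible approach rather than the definitive one; it rebuilds the “iris2” subset so that it runs on its own, and the names “present” and “iris3” are ones I made up for illustration.

```r
# Recreate the subset used above (iris ships with base R's datasets package).
iris2 <- subset(iris, Species != "setosa")

# Values actually present, as a plain character vector (no Levels line).
present <- as.character(unique(iris2$Species))
present

# Alternatively, drop the unused levels so that levels() matches the data.
iris3 <- droplevels(iris2)
levels(iris3$Species)
```

Note that droplevels() returns a modified copy; “iris2” itself keeps all three levels unless you assign the result back to it.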


I thank my colleagues Layne Newhouse, Jack Davis, and Dmity Shopin for their valuable discussion about this on LinkedIn.

