When I first encountered R, I learned to use the levels() function to find the possible values of a categorical variable. However, I recently noticed something very strange about this function.
Consider the built-in data set “iris” and its factor variable “Species”. Here are the possible values of “Species”, as shown by the levels() function.
> levels(iris$Species)
[1] "setosa"     "versicolor" "virginica"
Now, let’s remove all rows containing “setosa”. I will use the table() function to confirm that no rows contain “setosa”, and then I will apply the levels() function to “Species” again.
> iris2 = subset(iris, Species != 'setosa')
> table(iris2$Species)
    setosa versicolor  virginica
         0         50         50
> levels(iris2$Species)
[1] "setosa"     "versicolor" "virginica"
The new data set “iris2” does not have any rows containing “setosa” as a possible value of “Species”, yet the levels() function still shows “setosa” in its output.
According to the user G5W on Stack Overflow, this is desirable behaviour for the levels() function. Here is my interpretation of the intent of the creators of base R: the possible values of a factor are a fundamental attribute of that variable, and they should not be altered because of changes in the data.
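One way to see this is to look at how a factor is stored. The sketch below uses only base R functions and rebuilds “iris2” so that it runs on its own. It shows that the levels live in an attribute of the vector, separate from the integer codes that record each row's value. Subsetting changes which codes occur, but it leaves the attribute untouched.

```r
# Rebuild the subset from the example above
iris2 <- subset(iris, Species != "setosa")

# A factor is an integer vector with a "levels" attribute attached
typeof(iris2$Species)            # "integer"
attr(iris2$Species, "levels")    # "setosa" "versicolor" "virginica"

# The integer codes index into the full set of levels;
# code 1 ("setosa") simply no longer occurs in the data
sort(unique(as.integer(iris2$Species)))   # 2 3
```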
This can cause a lot of confusion and produce misleading information, so here is my solution: From now on, I will use the unique() function to find the values of a factor that actually appear in the data. Here is the result.
> unique(iris2$Species)
[1] versicolor virginica
Levels: setosa versicolor virginica
This is the output that I expect; “setosa” does not appear in the resulting vector. However, unique() still shows the original levels, which include “setosa” – that’s a nice feature.
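If the leftover level itself is unwanted, base R offers two related options, sketched here as a possible complement to unique() rather than a replacement for it. Wrapping the result in as.character() gives a plain character vector with no levels attached. droplevels() returns a copy of a factor or data frame with the unused levels removed. The name “iris3” is hypothetical and used only for this illustration.

```r
# Rebuild the subset from the example above
iris2 <- subset(iris, Species != "setosa")

# Values actually present, as a plain character vector (no "Levels:" line)
as.character(unique(iris2$Species))   # "versicolor" "virginica"

# Drop unused levels from every factor column of the data frame
iris3 <- droplevels(iris2)            # iris3 is a hypothetical name
levels(iris3$Species)                 # "versicolor" "virginica"
table(iris3$Species)                  # only the two remaining species are counted
```

Note that droplevels() discards the record that “setosa” was ever a possible value, so it fits best when that history is no longer needed.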
I thank my colleagues Layne Newhouse, Jack Davis, and Dmity Shopin for their valuable discussion about this on LinkedIn.