Use unique() instead of levels() to find the possible values of a character variable in R

Posted on March 10, 2018 by Eric Cai - The Chemical Statistician in R bloggers

[This article was first published on R programming – The Chemical Statistician, and kindly contributed to R-bloggers].

When I first encountered R, I learned to use the levels() function to find the possible values of a categorical variable. However, I recently noticed something very strange about this function.

Consider the built-in data set “iris” and its categorical variable “Species”, which R stores as a factor. Here are the possible values of “Species”, as shown by the levels() function.

> levels(iris$Species)

[1] "setosa" "versicolor" "virginica"

Now, let’s remove all rows containing “setosa”. I will use the table() function to confirm that no rows contain “setosa”, and then I will apply the levels() function to “Species” again.

> iris2 = subset(iris, Species != 'setosa')
> table(iris2$Species)

    setosa versicolor virginica 
         0         50        50 


> levels(iris2$Species)

[1] "setosa" "versicolor" "virginica"

The new data set “iris2” does not have any rows containing “setosa” as a possible value of “Species”, yet the levels() function still shows “setosa” in its output.

According to the user G5W on Stack Overflow, this is desirable behaviour for the levels() function. Here is my interpretation of the intent of the creators of base R: the possible values of a categorical variable are fundamental attributes of that variable, which should not be altered because of changes in the data.
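A short sketch may make this behaviour easier to see. It assumes only base R and the built-in “iris” data; the names iris2 and codes_in_use are just illustrative. A factor keeps its levels in a separate attribute alongside integer codes, and subsetting removes rows without touching that attribute.

```r
# Recreate the subset used in this post
iris2 <- subset(iris, Species != "setosa")

# "Species" is a factor: integer codes plus a "levels" attribute
class(iris2$Species)

# The levels attribute is carried over unchanged from the original data
attr(iris2$Species, "levels")

# The integer codes that actually occur in the remaining rows;
# code 1 ("setosa") is no longer used
codes_in_use <- sort(unique(as.integer(iris2$Species)))
codes_in_use
```

The levels describe what the variable could contain, while the codes describe what it currently contains, which is why the two can disagree after subsetting.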

This can cause a lot of confusion, because levels() may report values that no longer occur in the data. So here is my solution: from now on, I will use the unique() function to find the values of a categorical variable that are actually present. Here is the result.

> unique(iris2$Species)

[1] versicolor virginica 
Levels: setosa versicolor virginica

This is the output that I expect; “setosa” does not appear in the resulting vector. However, unique() still shows the original levels, which include “setosa” – that’s a nice feature.
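If a plain vector of the values is more convenient than a factor, a couple of related base R options are sketched below. This is not part of the approach described above, only a possible extension: as.character() strips the factor wrapper from the result of unique(), and droplevels() returns a copy of the factor with unused levels removed. The variable names are hypothetical.

```r
# Recreate the subset used in this post
iris2 <- subset(iris, Species != "setosa")

# Values actually present, as a plain character vector
present <- as.character(unique(iris2$Species))
present

# A copy of the factor with unused levels dropped,
# so that levels() agrees with the data
species_dropped <- droplevels(iris2$Species)
levels(species_dropped)

# Levels that are defined but have no rows left
unused <- setdiff(levels(iris2$Species), present)
unused
```

Note that droplevels() discards the information that “setosa” was ever a possible value, so whether to use it depends on whether that history matters for the analysis.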

I thank my colleagues Layne Newhouse, Jack Davis, and Dmity Shopin for their valuable discussion about this on LinkedIn.
