A question and an answer about recoding several factors simultaneously in R

[This article was first published on eKonometrics, and kindly contributed to R-bloggers].

Data manipulation is a breeze with packages like plyr and dplyr. Recoding factors, which can be a daunting task for variables with many categories, is easy to accomplish with these packages. However, it is important for those learning data science to understand how base R works.

In this regard, I seek help from R specialists about recoding factors in base R. My question is why one notation for recoding factors works while the other doesn’t. I’m sure that, for R enthusiasts, the answer and the solution are straightforward. So, here’s the question.

In the following code, I generate a vector with six categories (a through f) and 300 observations. I convert the vector to a factor and tabulate it.
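The original listing did not survive in this copy of the post. A minimal sketch of such a setup might look like the following; the seed, the use of sample(), and the variable name x are assumptions made for illustration.

```r
set.seed(42)  # assumed seed, only so the sketch is reproducible

# Draw 300 observations from six categories, a through f
x <- sample(letters[1:6], size = 300, replace = TRUE)

# Convert to a factor; fixing the levels guarantees the order a, b, c, d, e, f
x <- factor(x, levels = letters[1:6])

table(x)              # counts by level label
table(as.numeric(x))  # the same counts by internal integer code
```

The two tables report the same counts, once under the labels and once under the integer codes that R stores internally.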



Note that by using as.numeric(), I can see the internal integer code that corresponds to each level label. Let’s say I would like to recode categories a and f as missing. I can accomplish this with the following code.
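The listing is missing here as well. One way to write this recoding by position, assuming the same hypothetical x as before, is sketched below.

```r
set.seed(42)  # assumed seed, for reproducibility only
x <- factor(sample(letters[1:6], 300, replace = TRUE), levels = letters[1:6])

# The index has one entry per observation, so it lines up with x itself:
# every observation whose internal code is 1 or 6 becomes NA
x[as.numeric(x) %in% c(1, 6)] <- NA

table(x, useNA = "ifany")
```

The levels a and f still exist after this step, with zero counts; droplevels(x) would remove them if that is wanted.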



Here, 1 and 6 are the internal codes for a and f.

Note that I have used the position of the levels rather than the levels themselves to convert the values to missing.

So far so good.

Now let’s assume that I would like to relabel categories a and f as grades. I thought the following code would work, but it didn’t. It returns varying and erroneous results.
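The failing listing is not preserved in this copy. Judging from the answer quoted further down, the attempt presumably indexed levels(x) with a comparison on as.numeric(x). The sketch below reconstructs that idea under this assumption; the label "Grade A" and the names idx and y are illustrative.

```r
set.seed(42)  # assumed seed, for reproducibility only
x <- factor(sample(letters[1:6], 300, replace = TRUE), levels = letters[1:6])

idx <- as.numeric(x) == 1  # logical, one entry per observation
length(idx)                # 300
length(levels(x))          # 6

# Presumed failing attempt: a 300-long logical index applied to 6 level labels.
# Which labels get overwritten depends on where the TRUEs happen to fall.
y <- x                       # work on a copy so x stays intact
levels(y)[idx] <- "Grade A"  # "Grade A" is an assumed label
levels(y)
```

The mismatch in lengths is the warning sign: the index describes observations, while the vector being indexed describes levels.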
However, when I refer to the level labels explicitly, the script works as intended. See the script below.
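The working listing is also missing from this copy. A sketch of relabelling by explicit level name, with "Grade A" and "Grade F" as assumed labels, could be:

```r
set.seed(42)  # assumed seed, for reproducibility only
x <- factor(sample(letters[1:6], 300, replace = TRUE), levels = letters[1:6])
counts_before <- table(x)

# levels(x) == "a" has length 6, the same as levels(x),
# so exactly one label is replaced in each assignment
levels(x)[levels(x) == "a"] <- "Grade A"
levels(x)[levels(x) == "f"] <- "Grade F"

table(x)
```

Only the labels change; the counts per category stay the same as before the relabelling.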
Hence the question: why does one method work while the other doesn’t? I look forward to responses from R experts.

The Answer


lebatsnok (https://stackoverflow.com/users/2787952/lebatsnok) answered the question on Stack Overflow. The solution is simple. The following code works:
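The answer's listing is not reproduced in this copy. A sketch in the same spirit, indexing the levels by position with a short integer vector, might look like this; the grade labels are again assumed.

```r
set.seed(42)  # assumed seed, for reproducibility only
x <- factor(sample(letters[1:6], 300, replace = TRUE), levels = letters[1:6])
counts_before <- table(x)

# Positions 1 and 6 of levels(x) hold the labels a and f
levels(x)[c(1, 6)] <- c("Grade A", "Grade F")

levels(x)
table(x)
```

Here the index refers to positions within levels(x), not to observations, so its length never exceeds the number of levels.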



The problem with my approach, as explained by lebatsnok, is the following:

“levels(x) is a character vector with length 6, as.numeric(x) [once compared with a level code] is a logical vector with length 300. So you’re trying to index a short vector with a much longer logical vector. In such an indexing, the index vector acts like a ‘switch’, TRUE indicating that you want to see an item in this position in the output, and FALSE indicating that you don’t. So which elements of levels(x) are you asking for? (This will be random, you can make it reproducible with set.seed if that matters.)”
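The "switch" behaviour can be seen without any randomness. In the sketch below, levs stands in for levels(x), and idx is a made-up logical pattern of length 300 rather than a real comparison such as as.numeric(x) == 1; both names are hypothetical.

```r
levs <- c("a", "b", "c", "d", "e", "f")         # stand-in for levels(x)
idx  <- rep(c(FALSE, TRUE, FALSE), times = 100)  # hypothetical index, length 300

res <- levs[idx]
length(res)        # 100: one result per TRUE in idx
res[!is.na(res)]   # "b" "e": TRUE fell on positions 2 and 5
sum(is.na(res))    # 98: every TRUE beyond position 6 returns NA

# A short positional index asks for exactly the elements intended
levs[c(1, 6)]      # "a" "f"
```

Only the first six entries of the long index can select a real label, so the outcome is decided by whatever the first six observations happen to be.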







