Recoding Variables in R: Pedagogic Considerations

October 20, 2012

(This article was first published on Citizen-Statistician » R Project, and kindly contributed to R-bloggers)

I was creating a dataset this last week in which I had to partition the observed responses to show how the ANOVA model partitions the variability. I had the observed Y (in this case prices for 113 bottles of wine), and a categorical predictor X (the region of France that each bottle of wine came from). I was going to add three columns to this data, the first showing the marginal mean, the second showing the effect, and the third showing the residual. To create the variable indicating the effect, I essentially wanted to recode a particular region to a particular effect:

  • Bordeaux ==> 9.11
  • Burgundy ==> 4.20
  • Languedoc ==> –9.30
  • Rhone ==> –0.75
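
For readers who want to see where such effects come from, here is a minimal sketch of the partition. It assumes each effect is a region's mean price minus the overall mean; the data frame `wine_toy` and its prices are made up for illustration and are not the 113 bottles described above.

```r
# Hypothetical toy data: made-up prices, NOT the author's wine data.
wine_toy <- data.frame(
  Region = c("Bordeaux", "Bordeaux", "Burgundy", "Burgundy", "Rhone", "Rhone"),
  Price  = c(30, 34, 26, 28, 20, 24),
  stringsAsFactors = FALSE
)

# Marginal (grand) mean: the same value repeated on every row.
wine_toy$Mean <- mean(wine_toy$Price)

# ave() returns each row's group mean; the effect is its deviation
# from the grand mean (an assumption about how the effects were defined).
group_mean <- ave(wine_toy$Price, wine_toy$Region)
wine_toy$Effect <- group_mean - wine_toy$Mean

# Residual: what remains after removing the mean and the effect.
wine_toy$Residual <- wine_toy$Price - wine_toy$Mean - wine_toy$Effect
```

By construction, the three new columns add back up to the observed price on every row, which is the partition the ANOVA model describes.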

As I was considering how to do this, it struck me that several options were available to me. Here are two solutions that come up when Googling how to do this.

Use the recode() function from the car package.

library(car)

wine$Effect <- recode(wine$Region,
  " 'Bordeaux' = 9.11;
    'Burgundy' = 4.20;
    'Languedoc' = -9.30;
    'Rhone' = -0.75 " )

This is a commonly suggested solution. The recode specification is one long string with its own nested quotation marks, however, which makes it likely that students (and teachers) will commit a syntax error. This is especially true when recoding a categorical variable into another categorical variable. R-wise (it’s a technical term), it also produces a factor, even though the intent was clearly to produce numerical values. This is easily fixable, but not with as.numeric() alone: applied directly to a factor, as.numeric() returns the internal level codes, so the conversion needs as.numeric(as.character()). Either way, it can lead to confusion.
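
The factor-to-number pitfall can be seen in a small base R sketch. The factor `effect_factor` below is a hypothetical stand-in for a recoded variable; it does not come from the car package or from the wine data.

```r
# A hypothetical factor whose labels look like numbers,
# similar to what a recode step might hand back.
effect_factor <- factor(c("9.11", "4.2", "-9.3", "-0.75", "9.11"))

# as.numeric() alone returns the internal level codes (1, 2, 3, ...),
# not the numbers shown in the labels.
codes <- as.numeric(effect_factor)

# Converting to character first recovers the intended values.
values <- as.numeric(as.character(effect_factor))
```

Printing `codes` and `values` side by side is a quick way to show students why the extra as.character() step matters.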

Another solution is to use indexing.

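# Give every bottle the Bordeaux effect first ...
wine$Effect <- 9.11
# ... then overwrite the rows belonging to each of the other regions.
wine$Effect[wine$Region == "Burgundy"] <- 4.20
wine$Effect[wine$Region == "Languedoc"] <- -9.30
wine$Effect[wine$Region == "Rhone"] <- -0.75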

This solution is canonical in that it is clean and the R code is concise. (Note: This is what I ended up using to create this re-coded variable.) In my experience, however, that same conciseness means that students without a programming background don’t initially understand it. This alone makes it unattractive pedagogically.
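
One way to unpack the indexing approach for such students is to show the intermediate logical vector explicitly. This is only a sketch; the vector `region` holds hypothetical values rather than the actual wine data.

```r
# Hypothetical region labels for four bottles.
region <- c("Bordeaux", "Burgundy", "Rhone", "Burgundy")

# The comparison returns one TRUE/FALSE per element.
is_burgundy <- region == "Burgundy"

# Start with one value everywhere, then overwrite only the positions
# where the logical vector is TRUE.
effect <- rep(9.11, length(region))
effect[is_burgundy] <- 4.20
effect[region == "Rhone"] <- -0.75
```

Naming the logical vector (`is_burgundy`) before using it as an index separates the two ideas, the test and the assignment, that the one-line version combines.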

A better solution pedagogically seems to be to create a new data frame of key-value pairs (a lookup table, close in spirit to what computer science calls a hash table or dictionary) and then use the join() function from the plyr package to 'join' the original data frame and the new data frame.

library(plyr)

# Lookup table: one row per region with its effect
key <- data.frame(
  Region = c("Bordeaux", "Burgundy", "Languedoc", "Rhone"),
  Effect = c(9.11, 4.20, -9.30, -0.75)
)
wine <- join(wine, key, by = "Region")

For me this is a useful way to teach how to recode variables. It has a direct link to the Excel VLOOKUP function, and also to ideas of relational databases. It also generalizes to merging any two data sets that share a common variable.
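
The VLOOKUP analogy can also be sketched in base R, without plyr. This is an alternative illustration using match() and merge(), not the join() call above; `wine_toy` is a hypothetical four-row stand-in for the wine data.

```r
# Hypothetical stand-in for the wine data (regions only).
wine_toy <- data.frame(
  Region = c("Rhone", "Bordeaux", "Burgundy", "Rhone"),
  stringsAsFactors = FALSE
)

# The same lookup table of key-value pairs.
key <- data.frame(
  Region = c("Bordeaux", "Burgundy", "Languedoc", "Rhone"),
  Effect = c(9.11, 4.20, -9.30, -0.75),
  stringsAsFactors = FALSE
)

# match() finds, for each bottle, the row of `key` holding its region;
# indexing Effect by those rows looks up the value, VLOOKUP-style.
rows <- match(wine_toy$Region, key$Region)
wine_toy$Effect <- key$Effect[rows]

# merge() expresses the same lookup as a relational join. By default it
# keeps only matching rows and sorts the result by the key column.
merged <- merge(wine_toy["Region"], key, by = "Region")
```

Because merge() sorts the result by the key column by default, the rows of `merged` do not come back in the original order, which is worth pointing out to students before they compare the two results.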

R-wise, it is not difficult syntax, since almost every student has successfully used the data.frame() function to create a data frame. The join() function is also easily explained.
