Indexing with factors

November 8, 2012
(This article was first published on 4D Pie Charts » R, and kindly contributed to R-bloggers)

This is a silly problem that bit me again recently. It’s an elementary mistake that I’ve somehow repeatedly failed to learn to avoid in eight years of R coding. Here’s an example to demonstrate.

Suppose we create a data frame containing the heights of ten adults, with their gender as a categorical column.

(heights <- data.frame(
  height_cm = c(153, 181, 150, 172, 165, 149, 174, 169, 198, 163),
  gender    = c("female", "male", "female", "male", "male", "female", "female", "male", "male", "female")
))

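Using a factory-fresh copy of R older than version 4.0.0, the gender column will be stored as a factor with two levels: “female” and then “male”. (From R 4.0.0 onwards, the default is stringsAsFactors = FALSE, so you need to pass stringsAsFactors = TRUE to data.frame to reproduce this.) This is all well and good, though the column can be kept as characters by setting stringsAsFactors = FALSE.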
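Passing the argument explicitly removes any dependence on session defaults. Here is a minimal sketch on a cut-down two-row data frame; the names as_factor and as_chars are made up for illustration.

```r
# Same data, two storage choices for the gender column.
as_factor <- data.frame(gender = c("female", "male"), stringsAsFactors = TRUE)
as_chars  <- data.frame(gender = c("female", "male"), stringsAsFactors = FALSE)

class(as_factor$gender)   # "factor"
levels(as_factor$gender)  # "female" "male"
class(as_chars$gender)    # "character"
```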

Now suppose that we want to assign a body weight to these people, based upon a gender average.

avg_body_weight_kg <- c(male = 78, female = 63)

Pop quiz: what does this next line of code give us?

avg_body_weight_kg[heights$gender]  

Well, the first value of heights$gender is “female”, so the first value should be 63, and the second value of heights$gender is “male”, so the second value should be 78, and so on. Let’s try it.

avg_body_weight_kg[heights$gender]  
#  male female   male female female   male   male female female   male 
#    78     63     78     63     63     78     78     63     63     78 

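Uh-oh, the values are reversed. So what really happened? When you use a factor as an index, R silently uses its underlying integer codes rather than its labels. The levels are sorted alphabetically, so “female” is stored as 1 and “male” as 2. That means that the first index of “female” is converted to 1, which picks out the first element of avg_body_weight_kg, giving a value of 78, and so on.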
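You can inspect the integer codes directly. This is a small sketch using a three-element factor (called gender here for illustration) rather than the full data frame.

```r
gender <- factor(c("female", "male", "female"))
avg_body_weight_kg <- c(male = 78, female = 63)

levels(gender)      # "female" "male"
as.integer(gender)  # 1 2 1

# Indexing with the factor behaves like indexing with its integer codes.
identical(
  avg_body_weight_kg[gender],
  avg_body_weight_kg[as.integer(gender)]
)  # TRUE
```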

The fundamental problem is that there are two natural interpretations of a factor index – character indexing or integer indexing. Since these can give conflicting results, ideally R would provide a warning when you use a factor index. Until such a change gets implemented, I suggest that best practice is to always explicitly convert factors to integer or to character before you use them in an index.

avg_body_weight_kg[as.character(heights$gender)]  
avg_body_weight_kg[as.integer(heights$gender)]
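To see how the two conversions differ, here is a short sketch on a three-element factor; by_name and by_code are illustrative names. The character version looks values up by label, which is what was wanted here, while the integer version makes the position-based lookup explicit.

```r
gender <- factor(c("female", "male", "male"))
avg_body_weight_kg <- c(male = 78, female = 63)

by_name <- avg_body_weight_kg[as.character(gender)]  # lookup by label
by_code <- avg_body_weight_kg[as.integer(gender)]    # lookup by position

unname(by_name)  # 63 78 78
unname(by_code)  # 78 63 63
```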
