Changing the column names for model.matrix output

[This article was first published on R – Statistical Odds & Ends, and kindly contributed to R-bloggers].

In a previous post, I showed how to include a dummy variable for the baseline level in the output of the model.matrix function. In this post, I show how to change the column names of model.matrix's output to make downstream parsing a little easier.

Let’s use the iris dataset again:

data(iris)
str(iris)
# 'data.frame':	150 obs. of  5 variables:
#  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
#  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
#  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
#  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
#  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

x <- model.matrix(Sepal.Length ~ Species, iris)
head(x)
#   (Intercept) Speciesversicolor Speciesvirginica
# 1           1                 0                0
# 2           1                 0                0
# 3           1                 0                0
# 4           1                 0                0
# 5           1                 0                0
# 6           1                 0                0

Notice the default behavior for the column names of the returned matrix: for a given level, the column name is the name of the variable concatenated with the name of the level, with no spaces in between. For example, the last column in the matrix above represents the virginica level of the Species variable.
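
As a small illustration of this naming rule, the sketch below uses a made-up data frame (the names d, a and ab are hypothetical, not from the iris example) in which two different variable/level pairs concatenate to the same string. It is only meant to show that the default names cannot always be mapped back to a unique variable and level.

```r
# Hypothetical data: variable "a" has level "bc", variable "ab" has level "c".
d <- data.frame(
  a  = factor(c("0", "bc", "0", "bc")),
  ab = factor(c("0", "0", "c", "c"))
)

m <- model.matrix(~ a + ab, data = d)

# Both dummy columns are named by pasting variable and level together,
# so "a" + "bc" and "ab" + "c" give the same column name.
colnames(m)
# "(Intercept)" "abc"         "abc"
```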

Because the variable and level names are concatenated with nothing in between, it can be hard to separate the two parts of a column name programmatically. We can make our lives easier by having model.matrix insert a special character, such as a period (.), between the variable name and the level name.

We can achieve this through the contrasts.arg argument of model.matrix. Its formal default is NULL, in which case model.matrix falls back on the contrasts of each factor; in our example this has the same effect as passing list(Species = contrasts(iris$Species)). The code below shows what contrasts(iris$Species) is:

contrasts(iris$Species)
#            versicolor virginica
# setosa              0         0
# versicolor          1         0
# virginica           0         1
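
As a quick sanity check, the sketch below passes this contrast matrix explicitly and compares the result with a call that leaves contrasts.arg untouched. The object names x_default and x_explicit are introduced here for illustration only.

```r
# Model matrix with contrasts.arg left at its default
x_default <- model.matrix(Sepal.Length ~ Species, iris)

# Model matrix with the factor's own contrast matrix passed explicitly
x_explicit <- model.matrix(
  Sepal.Length ~ Species,
  iris,
  contrasts.arg = list(Species = contrasts(iris$Species))
)

# Column names and entries should agree between the two calls
identical(colnames(x_default), colnames(x_explicit))
all(x_default == x_explicit)
```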

We can modify the column names of contrasts(iris$Species) to achieve the desired effect:

speciesContrast <- contrasts(iris$Species)
colnames(speciesContrast) <- paste0(".", colnames(speciesContrast))
x <- model.matrix(
  Sepal.Length ~ Species, 
  iris,
  contrasts.arg = list(Species = speciesContrast)
)
head(x)
#   (Intercept) Species.versicolor Species.virginica
# 1           1                  0                 0
# 2           1                  0                 0
# 3           1                  0                 0
# 4           1                  0                 0
# 5           1                  0                 0
# 6           1                  0                 0
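
With the separator in place, the column names can be split back into variable and level. The following is a minimal sketch of that, assuming the separator does not otherwise appear in the variable or level names; the object names x_dot, x_default and parts are chosen here for illustration. It also checks that renaming the contrast columns changes only the names, not the entries.

```r
# Rebuild the model matrix with "." inserted before each level name
species_contrast <- contrasts(iris$Species)
colnames(species_contrast) <- paste0(".", colnames(species_contrast))
x_dot <- model.matrix(
  Sepal.Length ~ Species,
  iris,
  contrasts.arg = list(Species = species_contrast)
)

# The entries match those of the default model matrix
x_default <- model.matrix(Sepal.Length ~ Species, iris)
all(x_dot == x_default)

# Split the dummy-variable column names (dropping the intercept) on a literal "."
parts <- strsplit(colnames(x_dot)[-1], ".", fixed = TRUE)
variable_names <- vapply(parts, `[`, character(1), 1)
level_names    <- vapply(parts, `[`, character(1), 2)
```

Note that a period is a common character in R variable names (Sepal.Length, for instance), so splitting on it is only safe when neither the variable nor the level names contain one. A rarer separator avoids this problem.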

We can do this programmatically for all factor variables in a data frame too. Here is our example data frame:

df <- data.frame(x = factor(rep(c("a", "b", "c"), times = 3)),
                 y = factor(rep(c("d", "e", "f"), times = 3)),
                 z = 1:9)
str(df)
# 'data.frame':	9 obs. of  3 variables:
#  $ x: Factor w/ 3 levels "a","b","c": 1 2 3 1 2 3 1 2 3
#  $ y: Factor w/ 3 levels "d","e","f": 1 2 3 1 2 3 1 2 3
#  $ z: int  1 2 3 4 5 6 7 8 9

x <- model.matrix(~ ., data = df)
head(x)
#   (Intercept) xb xc ye yf z
# 1           1  0  0  0  0 1
# 2           1  1  0  1  0 2
# 3           1  0  1  0  1 3
# 4           1  0  0  0  0 4
# 5           1  1  0  1  0 5
# 6           1  0  1  0  1 6

Here is the code that adds --- between the variable and level names:

ChangeColnames <- function(x) {
  colnames(x) <- paste0("---", colnames(x))
  x
}

x <- model.matrix(
  ~ .,
  data = df,
  contrasts.arg = lapply(df[, sapply(df, is.factor), drop = FALSE],
                         function(x) ChangeColnames(contrasts(x)))
)
head(x)
#   (Intercept) x---b x---c y---e y---f z
# 1           1     0     0     0     0 1
# 2           1     1     0     1     0 2
# 3           1     0     1     0     1 3
# 4           1     0     0     0     0 4
# 5           1     1     0     1     0 5
# 6           1     0     1     0     1 6
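
To show the payoff for downstream parsing, here is a sketch that rebuilds the same matrix and then splits each column name on the --- separator into a small lookup table. The helper add_sep and the objects factor_cols and parsed are hypothetical names introduced for this example. Columns without a separator, such as the intercept and the numeric variable z, get NA as their level.

```r
df <- data.frame(x = factor(rep(c("a", "b", "c"), times = 3)),
                 y = factor(rep(c("d", "e", "f"), times = 3)),
                 z = 1:9)

# Prefix every column name of a contrast matrix with the separator
add_sep <- function(m, sep = "---") {
  colnames(m) <- paste0(sep, colnames(m))
  m
}

# Keep only the factor columns, then build one contrast matrix per factor
factor_cols <- Filter(is.factor, df)
x <- model.matrix(
  ~ .,
  data = df,
  contrasts.arg = lapply(factor_cols, function(f) add_sep(contrasts(f)))
)

# Split each column name on the literal separator
parts <- strsplit(colnames(x), "---", fixed = TRUE)
parsed <- data.frame(
  column   = colnames(x),
  variable = vapply(parts, `[`, character(1), 1),
  level    = vapply(parts, `[`, character(1), 2),  # NA when there is no separator
  stringsAsFactors = FALSE
)
parsed
```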
