How to include all levels of a factor variable in a model matrix in R

[This article was first published on R – Statistical Odds & Ends, and kindly contributed to R-bloggers.]

In R, the model.matrix function is used to create the design matrix for regression. In particular, it is used to expand factor variables into dummy variables (also known as “one-hot encoding”).

Let’s see this in action on the iris dataset:

data(iris)
str(iris)
# 'data.frame':	150 obs. of  5 variables:
#  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
#  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
#  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
#  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
#  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

x <- model.matrix(Sepal.Length ~ Species, iris)
head(x)
#   (Intercept) Speciesversicolor Speciesvirginica
# 1           1                 0                0
# 2           1                 0                0
# 3           1                 0                0
# 4           1                 0                0
# 5           1                 0                0
# 6           1                 0                0

model.matrix returns a column of ones labeled (Intercept) by default. Also note that while the Species factor has 3 levels (“setosa”, “versicolor” and “virginica”), the return value of model.matrix only has dummy variables for the latter two levels. For a factor variable, model.matrix (under R's default treatment contrasts) treats the first level of the factor, i.e. the first entry of levels(), as the “baseline” level and does not produce a dummy variable for it. This avoids perfect multicollinearity: with an intercept in the model, the dummy variables for all levels would sum to the intercept column.
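To see which level acts as the baseline, the short sketch below inspects levels() and uses relevel() to move a different level to the front. It is only an illustration; the name iris2 is made up for this example.

```r
data(iris)

# The baseline is the first entry of levels(), not the first value seen in the data.
levels(iris$Species)

# relevel() moves the chosen level to the front, making it the new baseline.
iris2 <- transform(iris, Species = relevel(Species, ref = "virginica"))
levels(iris2$Species)

# The dummy columns now cover the two non-baseline levels, setosa and versicolor.
colnames(model.matrix(Sepal.Length ~ Species, data = iris2))
```

Releveling changes which level is absorbed into the intercept; it does not change the number of columns.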

However, there are situations where we might want dummy variables to be produced for all levels, including the baseline level. (For example, when we do regularized regression, where multicollinearity no longer implies unidentifiability of the model.) We can induce this behavior by passing a specific value to the contrasts.arg argument:

x <- model.matrix(
  Sepal.Length ~ Species,
  data = iris,
  contrasts.arg = list(Species = contrasts(iris$Species, contrasts = FALSE)))
head(x)
#   (Intercept) Speciessetosa Speciesversicolor Speciesvirginica
# 1           1             1                 0                0
# 2           1             1                 0                0
# 3           1             1                 0                0
# 4           1             1                 0                0
# 5           1             1                 0                0
# 6           1             1                 0                0
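The redundancy that the default encoding avoids can be checked directly on this matrix. The sketch below rebuilds it under the made-up name x_full, confirms that the three dummy columns add up to the intercept column, and compares the number of columns with the numerical rank reported by qr().

```r
data(iris)

# Design matrix with a dummy column for every level of Species.
x_full <- model.matrix(
  Sepal.Length ~ Species,
  data = iris,
  contrasts.arg = list(Species = contrasts(iris$Species, contrasts = FALSE)))

# Every row has exactly one dummy equal to 1, so the dummies sum to the intercept.
dummies <- x_full[, -1]
all(rowSums(dummies) == x_full[, "(Intercept)"])

# Four columns, but only three of them are linearly independent.
ncol(x_full)
qr(x_full)$rank
```

An unpenalized least-squares fit on this matrix would therefore not have a unique coefficient vector, which is why the full encoding is mainly of interest when a penalty is applied.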

Let’s have a closer look at what we passed as the value of Species in the list:

contrasts(iris$Species, contrasts = FALSE)
#            setosa versicolor virginica
# setosa          1          0         0
# versicolor      0          1         0
# virginica       0          0         1

Notice that there are 3 columns, one for each level. If we hadn't passed this special value in, the default contrasts would have had just 2 columns, one for each non-baseline level, matching the dummy variables in the first model.matrix output:

contrasts(iris$Species, contrasts = TRUE)
#            versicolor virginica
# setosa              0         0
# versicolor          1         0
# virginica           0         1
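One way to read these contrast matrices: each observation's dummy values are the row of the contrast matrix that corresponds to its level. The sketch below checks this for both encodings. The names C_full, C_default, x_full and x_default are made up for the example.

```r
data(iris)

C_full    <- contrasts(iris$Species, contrasts = FALSE)  # 3 x 3 identity
C_default <- contrasts(iris$Species, contrasts = TRUE)   # 3 x 2, baseline row all zero

x_full <- model.matrix(Sepal.Length ~ Species, data = iris,
                       contrasts.arg = list(Species = C_full))
x_default <- model.matrix(Sepal.Length ~ Species, data = iris)

# Look up one contrast-matrix row per observation, by level name.
lev <- as.character(iris$Species)

# After dropping the intercept column, the model matrix is exactly those rows.
all(unname(C_full[lev, ])    == unname(x_full[, -1]))
all(unname(C_default[lev, ]) == unname(x_default[, -1]))
```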

It’s easy to modify the code above to include the baseline level for a different factor variable in another data frame. The code below is an example of how you can include the baseline level for all factor variables in the data frame.

df <- data.frame(x = factor(rep(c("a", "b", "c"), times = 3)),
                 y = factor(rep(c("d", "e", "f"), times = 3)),
                 z = 1:9)
str(df)
# 'data.frame':	9 obs. of  3 variables:
#  $ x: Factor w/ 3 levels "a","b","c": 1 2 3 1 2 3 1 2 3
#  $ y: Factor w/ 3 levels "d","e","f": 1 2 3 1 2 3 1 2 3
#  $ z: int  1 2 3 4 5 6 7 8 9

# default: no dummy variable for baseline level
x <- model.matrix(~ ., data = df)
head(x)
#   (Intercept) xb xc ye yf z
# 1           1  0  0  0  0 1
# 2           1  1  0  1  0 2
# 3           1  0  1  0  1 3
# 4           1  0  0  0  0 4
# 5           1  1  0  1  0 5
# 6           1  0  1  0  1 6

# dummy variables for baseline levels included
x <- model.matrix(
  ~ .,
  data = df,
  contrasts.arg = lapply(df[, sapply(df, is.factor), drop = FALSE],
                         contrasts, contrasts = FALSE))
head(x)
#   (Intercept) xa xb xc yd ye yf z
# 1           1  1  0  0  1  0  0 1
# 2           1  0  1  0  0  1  0 2
# 3           1  0  0  1  0  0  1 3
# 4           1  1  0  0  1  0  0 4
# 5           1  0  1  0  0  1  0 5
# 6           1  0  0  1  0  0  1 6
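The pattern above can be wrapped in a small helper. The sketch below is one possible version: the function name full_model_matrix is hypothetical, and it assumes every factor column in data appears in the formula. It also shows, for comparison, that simply dropping the intercept expands only the first factor fully.

```r
# Hypothetical helper: model matrix with dummy columns for all factor levels.
full_model_matrix <- function(formula, data) {
  is_fac <- vapply(data, is.factor, logical(1))
  if (!any(is_fac)) {
    # No factors, so there is nothing to expand.
    return(model.matrix(formula, data = data))
  }
  model.matrix(
    formula,
    data = data,
    contrasts.arg = lapply(data[is_fac], contrasts, contrasts = FALSE))
}

df <- data.frame(x = factor(rep(c("a", "b", "c"), times = 3)),
                 y = factor(rep(c("d", "e", "f"), times = 3)),
                 z = 1:9)

# All levels of both factors appear.
colnames(full_model_matrix(~ ., df))

# Removing the intercept gives all levels for x only; y still loses its baseline.
colnames(model.matrix(~ . - 1, data = df))
```

If the intercept column is not wanted, for instance because the fitting routine adds its own, it can be dropped from the result afterwards by name.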
