The lm() function with categorical predictors

[This article was first published on Anything but R-bitrary, and kindly contributed to R-bloggers.]

What’s with those estimates?
By Ben Ogorek


In R, categorical variables can be added to a regression using the lm() function without a hint of extra work. But have you ever looked at the resulting estimates and wondered exactly what they were?

First, let’s define a data set.
# Simulate n observations in each of three categories, with a common
# standard deviation sigma and true means of 8, 9.5, and 11
set.seed(12255)
n <- 30
sigma <- 2.0

AOV.df <- data.frame(category = rep(c("category1", "category2", "category3"), each = n),
                     j        = rep(1:n, times = 3),
                     y        = c(8.0  + sigma * rnorm(n),
                                  9.5  + sigma * rnorm(n),
                                  11.0 + sigma * rnorm(n)))
In words, we have thirty observations in each of three categories, randomly drawn from normal distributions with a common standard deviation of 2.0 and means of 8, 9.5, and 11, respectively. In statistical notation, where i = 1, 2, 3 identifies the category and j = 1, 2, ..., 30 identifies the observation within a category, we have y1j ~ N(8, 2²), y2j ~ N(9.5, 2²), and y3j ~ N(11, 2²).

To visualize the data, Winston Chang's Cookbook for R illustrates a very nice use of ggplot2 (Wickham, 2009) for plotting overlaid distributions, modified slightly below.

library(ggplot2)

# Overlaid histograms of y for the three categories
ggplot(AOV.df, aes(x = y, fill = category)) +
  geom_histogram(binwidth = 1.5, alpha = .5, position = "identity")
Clearly there is separation, although more for categories 1 vs 3 and 2 vs 3 than 1 vs 2.
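
To put numbers on what the plot shows, we can also compute the sample mean of y within each category (a quick check using base R's aggregate(); with this seed, the sample means should land reasonably close to the true values of 8, 9.5, and 11):

# Sample mean of y within each category
aggregate(y ~ category, data = AOV.df, FUN = mean)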

Let's use regression to recover the means in our simulation. As promised, the lm() function needs no coaxing to accept categorical variables:
AOV.lm <- lm(y ~ category, data = AOV.df)
summary(AOV.lm)
The output, on the other hand, can be puzzling to the uninitiated:


Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)         8.4514     0.3405  24.823  < 2e-16 ***
categorycategory2   0.8343     0.4815   1.733   0.0867 .  
categorycategory3   3.0017     0.4815   6.234 1.58e-08 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

Residual standard error: 1.865 on 87 degrees of freedom
Multiple R-squared: 0.3225, Adjusted R-squared: 0.307 
F-statistic: 20.71 on 2 and 87 DF,  p-value: 4.403e-08 

The estimate of our common standard deviation, i.e., the "Residual standard error," is 1.865, close to the true value of 2.000. But what about the intercept of 8.45, category 2's estimate of 0.83, and the lack of a category 1 estimate? Furthermore, why is category 3's associated p-value so very small, whereas category 2's is only marginally significant? Surely there is nearly as much separation between categories 2 and 3 as between categories 1 and 3. Take a moment to form your opinion.

With category means of 8, 9.5, and 11 in mind, and standard errors on the order of 1/2, a natural guess is that the intercept is estimating 8, category 2's parameter is estimating 1.5 so that 8 + 1.5 = 9.5, and category 3's parameter is estimating 3.0 so that 8 + 3 = 11. Seems a little indirect, doesn't it?

The Classical Effects Model

The classical effects model, as presented in Rawlings, Pantula, and Dickey (1998, Chapter 9.2), describes the generation of our data as yij = μ + τi + εij, for categories i = 1, 2, 3 and within-category observations j = 1, ..., 30. The effects model is nice; you can think of μ as a baseline level, and τ1, τ2, and τ3 as deviations from that baseline depending on the category. This model is one of the basics of classical statistics.

But we statisticians have an embarrassing little secret - despite the appeal of the classical effects model, you can't actually know the parameters. And I don't mean that you can't know them without error; I mean that you can't even estimate them. In statistical terminology, these quantities (I'm supposed to refer to them as "functions" of the model parameters) are not estimable.

When you think about the three category means and our four location parameters μ, τ1, τ2, and τ3, it should make sense - there's just not enough information in the data to identify these four parameters. In actuality, the classical effects model is known to be overparameterized. (Remember when I told you it was a secret? I lied.)
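
To see the overparameterization concretely, we can build the design matrix of the classical effects model by hand: an intercept column plus a 0/1 indicator column for each category. The three indicator columns add up to the intercept column, so the matrix has rank 3 even though there are 4 parameters (a small sketch; lm() never constructs this particular matrix itself):

# Intercept column plus an indicator column for each of the three categories
X <- cbind(intercept = 1, model.matrix(~ category - 1, data = AOV.df))
dim(X)      # 90 rows and 4 columns
qr(X)$rank  # but only rank 3: the columns are linearly dependent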

It all seems pointless at first, and Rawlings, Pantula, and Dickey do present an alternative, the means model: yij = μi + εij. If you guessed that the μi's are just the category means, you'd be right. So why doesn't R's lm() function use the μi's as parameters?
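
Incidentally, R will fit the means model if you ask it to, by suppressing the intercept; the fitted coefficients are then just the three category sample means (shown for comparison only - this is not what lm() does by default):

# Removing the intercept gives the cell means parameterization:
# one coefficient per category, each estimating that category's mean
means.lm <- lm(y ~ category - 1, data = AOV.df)
summary(means.lm)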

It turns out that there is some utility in the effects model after all. Even though you can't estimate  μ,τ1, τ2, and τ3 individually, consider what happens when you take a difference of observations from two categories:

y2j - y1j = (μ + τ2 + ε2j) - (μ + τ1 + ε1j)
                = τ2 - τ1 + (ε2j - ε1j).

The term in the parentheses is just noise (a random variable with an expectation of zero), which means that y2j - y1j is an estimate of τ2 - τ1. The function τ2 - τ1 is estimable! Linking the means model with the classical effects model, we have μ2 - μ1 = τ2 - τ1.
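
We can verify the estimability claim numerically: the difference between the category 2 and category 1 sample means reproduces the categorycategory2 coefficient from the fit above (a quick check, not part of the original derivation):

# Difference of the category 2 and category 1 sample means ...
with(AOV.df, mean(y[category == "category2"]) - mean(y[category == "category1"]))

# ... matches the coefficient lm() reported for category 2
coef(AOV.lm)["categorycategory2"]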


I'll take a moment to reflect on the value proposition of the effects model over the cell means model. I'll argue that the effects model more clearly articulates the quantities we are interested in - the differences between the groups. When there are multiple factors, additive effects provide a way to simplify a model: the number of cell means grows exponentially with the number of factors, but in the absence of interaction, the number of effects grows on the order of the number of factors. Rawlings, Pantula, and Dickey, citing Hocking (1985), suggest that the means model is "more direct," but that the effects model "conveys the structure of the data, which in turn generates logical hypotheses and sums of squares in the analysis." What do you think?
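
As a rough illustration of that growth (the factor sizes here are mine, chosen only for the arithmetic): with three crossed factors having 4, 3, and 2 levels, a cell means model needs one parameter per cell, while an additive effects model needs only an intercept plus one fewer parameter than levels for each factor.

# Parameter counts for three crossed factors with 4, 3, and 2 levels
prod(c(4, 3, 2))         # cell means model: 4 * 3 * 2 = 24 parameters
1 + sum(c(4, 3, 2) - 1)  # additive effects model: 1 + 3 + 2 + 1 = 7 parameters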

Back to the lm() function


If you tried to estimate the parameters of the overparameterized classical effects model with standard least squares, you'd run into problems, specifically with the matrix inversion that is involved. There are ways around this. One is to attack the inversion problem directly with a generalized inverse, or pseudoinverse (Strang, 1993). Another is to use a reparameterization.
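
As a sketch of the first route (and only a sketch), here the overparameterized normal equations are solved with MASS::ginv(). The individual components of the solution are not meaningful on their own, but differences of the τ components reproduce the estimable contrasts:

library(MASS)  # for ginv()

# Design matrix of the overparameterized effects model, as constructed earlier
X <- cbind(intercept = 1, model.matrix(~ category - 1, data = AOV.df))

# One of infinitely many least squares solutions, via the Moore-Penrose inverse
b <- ginv(t(X) %*% X) %*% t(X) %*% AOV.df$y

# The individual entries of b are not estimable, but tau2 - tau1 is,
# and it matches lm()'s categorycategory2 coefficient
b[3] - b[2]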

We already saw that there is a relationship between the parameters in the cell means model and the classical effects model. Indeed, the cell means parameterization is one way to proceed, but it's not the one that R uses. R's lm() function uses a reparameterization called the reference cell model, where one of the τi's is set to zero to allow for a solution. Rawlings, Pantula, and Dickey say it is usually the last τi, but in the case of the lm() function, it is actually the first.
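
The choice of reference level is governed by the contrasts in use. R's default for unordered factors, contr.treatment, takes the first level as the baseline; contr.SAS is the analogous scheme that uses the last level (shown purely to illustrate the difference):

# Default treatment contrasts: the first level is the reference (the all-zero row)
contr.treatment(3)

# SAS-style contrasts: the last level is the reference instead
contr.SAS(3)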

With τ1 set to zero, the mean of category 1, μ + τ1, is really just μ, or μ* as we'll call it, because our constraint was really quite arbitrary. Think back to our lm() output - you guessed it, μ* is the intercept, and also the mean of category 1. Its p-value comes from a test of whether the true category 1 mean equals 0, which is not likely to be useful.
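
You can confirm this directly from the earlier fit (a quick check in the console):

# The fitted intercept ...
coef(AOV.lm)["(Intercept)"]

# ... is exactly the category 1 sample mean
mean(AOV.df$y[AOV.df$category == "category1"])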

Also, since τ1 is set to zero, τ2 - τ1, estimable from the difference between the category means, is really just τ2, or τ2* since the constraint was arbitrary. Again, thinking back to our lm() output, the quantity being estimated for category 2 is the difference in means between category 2 and category 1. Similarly, category 3's coefficient is the difference between the category 3 mean and the category 1 mean. The p-values for these tests are more likely to be meaningful, as they are testing group-level differences. Now it should be clear why category 2's coefficient has a larger p-value than category 3's (refer to the plot) and why there is no category 1 coefficient (hint: it'd be really boring).
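
If a different baseline category is more natural for your problem, relevel() changes the reference level before fitting, and the coefficients then estimate differences from that category instead (a small illustration on a copy of the data; the coefficient names follow the same pattern):

# Refit with category3 as the reference level
AOV.df2 <- AOV.df
AOV.df2$category <- relevel(factor(AOV.df2$category), ref = "category3")
summary(lm(y ~ category, data = AOV.df2))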

To finish the discussion, set n = 100000. Regenerate the data, and run the lm(). You should arrive at:


Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)       7.997530   0.006334  1262.7   <2e-16 ***
categorycategory2 1.499163   0.008957   167.4   <2e-16 ***
categorycategory3 4.004012   0.008957   447.0   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

Residual standard error: 2.003 on 299997 degrees of freedom
Multiple R-squared: 0.4048, Adjusted R-squared: 0.4048 
F-statistic: 1.02e+05 on 2 and 299997 DF,  p-value: < 2.2e-16 

It is clear that the intercept is indeed the mean of category 1, and that category 2's coefficient is 1.5 = 9.5 - 8, the difference between category 2's and category 1's means. Likewise, category 3's coefficient estimates the difference between category 3's and category 1's means.

Discussion

This article was meant to answer the question of why the lm() function works the way it does for a single-factor regressor, written for those more familiar with regression and less familiar with the classical analysis of variance. We only touched briefly on switching between parameterizations, and did not cover getting p-values for the various testing combinations or what happens when there are multiple factors. Please let me know whether this article was helpful, and if there is demand for future articles on this topic.

The R learning module from UCLA's Academic Technology Services (see the references below) discusses R's contrasts() function and other tools for switching between parameterizations, but I have not myself gone through the examples.


All highlighted R-code segments were created by Pretty R at inside-R.org. Keep up with ours and other great articles relating to R on R-bloggers, and follow me on Twitter (@baogorek) for my latest research updates.

References


  • Hocking, R. R. (1985). The analysis of linear models. Brooks/Cole, Monterey, California, USA.
  • Plotting Distributions (ggplot2). Cookbook for R. URL http://wiki.stdout.org/rcookbook/Graphs/Plotting%20distributions%20%28ggplot2%29/ (accessed April 6, 2012).
  • Pretty R Syntax Highlighter. inside-R.org. URL http://www.inside-r.org/pretty-r (accessed April 8, 2012).
  • Rawlings, J. O., S. G. Pantula, and D. A. Dickey. (1998). Applied regression analysis: a research tool. Second edition. Springer-Verlag, New York, New York, USA.
  • R Development Core Team (2011). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0, URL http://www.R-project.org/.
  • R Learning Module: Coding for categorical variables in regression models. UCLA: Academic Technology Services, Statistical Consulting Group. URL http://www.ats.ucla.edu/stat/r/modules/dummy_vars.htm (accessed April 7, 2012).
  • Strang, G. (1993). "The fundamental theorem of linear algebra." American Mathematical Monthly, Volume 100, Number 9, pp. 848-859.
  • Wickham, H. (2009). ggplot2: elegant graphics for data analysis. Springer New York.
