# Factor codings for models in R

(This article was first published on Heidi's stats blog - Rbloggers, and kindly contributed to R-bloggers)

I am holding an exercise on generalised models these days. Preparing a task on factor coding
in generalised linear models, I realised that the help on the internet on that is not so
easy to understand. At least what I found. So in order to help people who find this topic
confusing, I want to help out a little here.

Consider the following data

``````set.seed(29)
x <- gl(5, 10) # 5 factor's levels, each replicated 10 times
y <- rnorm(n = length(x), mean = as.integer(x), sd = 0.1)   # means= 1,2,3,4,5
dd <- data.frame(x = x, y = y)

library(ggplot2)
ggplot(aes(x = x, y = y), data = dd) + geom_boxplot()
``````

``````means <- tapply(y, x, mean)
means
``````
``````##         1         2         3         4         5
## 0.9962925 2.0227058 3.0230409 3.9482690 5.0189041
``````

Now we want to model the mean of y given x using the `lm()` function with the following codings:
dummy-coding, treatment-coding (where the reference category is 5), effect-coding
and split-coding.

To make the theory more general, we have a categorical variable ( X ) with ( K ) categories ( (a_1, dots, a_K) )

## Dummy-coding

Look at each level separately:
[ E(Y|X=a_{k}) = beta_k, quad k=1,dots,K ]

``````# dummy (each level singularly)
lm_dummy <- lm( y ~ x - 1, contrasts = list(x = contr.treatment(5)))
summary(lm_dummy)
``````
``````##
## Call:
## lm(formula = y ~ x - 1, contrasts = list(x = contr.treatment(5)))
##
## Residuals:
##       Min        1Q    Median        3Q       Max
## -0.191692 -0.065303 -0.002686  0.054485  0.214752
##
## Coefficients:
##    Estimate Std. Error t value Pr(>|t|)
## x1  0.99629    0.03293   30.25   <2e-16 ***
## x2  2.02271    0.03293   61.43   <2e-16 ***
## x3  3.02304    0.03293   91.80   <2e-16 ***
## x4  3.94827    0.03293  119.90   <2e-16 ***
## x5  5.01890    0.03293  152.41   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1041 on 45 degrees of freedom
## Multiple R-squared:  0.9991,	Adjusted R-squared:  0.999
## F-statistic: 1.014e+04 on 5 and 45 DF,  p-value: < 2.2e-16
``````
``````coef(lm_dummy)
``````
``````##        x1        x2        x3        x4        x5
## 0.9962925 2.0227058 3.0230409 3.9482690 5.0189041
``````
``````means
``````
``````##         1         2         3         4         5
## 0.9962925 2.0227058 3.0230409 3.9482690 5.0189041
``````

## Treatment-coding

Compare each category to the dummy-category ( a_d ):

 [ E(Y X=a_{d}) = beta_0 ]
 [ E(Y X=a_k) = beta_0 + beta_k, quad kneq d ]
``````# treatment (restrict one level to constant term, all other difference from it)
lm_treatment <- lm( y ~ x, contrast = list(x = contr.treatment(5, base = 5)))
summary(lm_treatment)
``````
``````##
## Call:
## lm(formula = y ~ x, contrasts = list(x = contr.treatment(5, base = 5)))
##
## Residuals:
##       Min        1Q    Median        3Q       Max
## -0.191692 -0.065303 -0.002686  0.054485  0.214752
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  5.01890    0.03293  152.41   <2e-16 ***
## x1          -4.02261    0.04657  -86.38   <2e-16 ***
## x2          -2.99620    0.04657  -64.34   <2e-16 ***
## x3          -1.99586    0.04657  -42.86   <2e-16 ***
## x4          -1.07064    0.04657  -22.99   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1041 on 45 degrees of freedom
## Multiple R-squared:  0.9951,	Adjusted R-squared:  0.9947
## F-statistic:  2293 on 4 and 45 DF,  p-value: < 2.2e-16
``````
``````coef(lm_treatment)
``````
``````## (Intercept)          x1          x2          x3          x4
##    5.018904   -4.022612   -2.996198   -1.995863   -1.070635
``````
``````c(means[5], means[1:4] - means[5])
``````
``````##         5         1         2         3         4
##  5.018904 -4.022612 -2.996198 -1.995863 -1.070635
``````

## Effect-coding

Compare each category to the mean:

 [ E(Y X=a_k) = beta_0 + beta_k, quad k=1,dots,K-1 ]
 [ E(Y X=a_K) = beta_0 – sumlimits_{j=1}^{K-1} beta_j ]
``````# effect(deviation from overall average)
lm_effect <- lm(y ~ x, contrast = list(x = contr.sum(5)))
summary(lm_effect)
``````
``````##
## Call:
## lm(formula = y ~ x, contrasts = list(x = contr.sum(5)))
##
## Residuals:
##       Min        1Q    Median        3Q       Max
## -0.191692 -0.065303 -0.002686  0.054485  0.214752
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  3.00184    0.01473  203.84   <2e-16 ***
## x1          -2.00555    0.02945  -68.09   <2e-16 ***
## x2          -0.97914    0.02945  -33.24   <2e-16 ***
## x3           0.02120    0.02945    0.72    0.475
## x4           0.94643    0.02945   32.13   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1041 on 45 degrees of freedom
## Multiple R-squared:  0.9951,	Adjusted R-squared:  0.9947
## F-statistic:  2293 on 4 and 45 DF,  p-value: < 2.2e-16
``````
``````coef(lm_effect)
``````
``````## (Intercept)          x1          x2          x3          x4
##  3.00184245 -2.00554992 -0.97913668  0.02119841  0.94642653
``````
``````c(mean(means), means[1:4] - mean(means))
``````
``````##                       1           2           3           4
##  3.00184245 -2.00554992 -0.97913668  0.02119841  0.94642653
``````

## Split-coding

Compare each category to the previous category (for ordered categories):

 [ E(Y X=a_1) = beta_0 ]
 [ E(Y X=a_k) = beta_0 + sumlimits_{j=1}^{k-1} beta_j, quad k=2,dots,K ]
``````# split coding
c <- rbind( c(0,0,0,0),
c(1,0,0,0),
c(1,1,0,0),
c(1,1,1,0),
c(1,1,1,1) )

lm_split <- lm(y ~ x, contrast = list(x = c))
summary(lm_split)
``````
``````##
## Call:
## lm(formula = y ~ x, contrasts = list(x = c))
##
## Residuals:
##       Min        1Q    Median        3Q       Max
## -0.191692 -0.065303 -0.002686  0.054485  0.214752
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  0.99629    0.03293   30.25   <2e-16 ***
## x1           1.02641    0.04657   22.04   <2e-16 ***
## x2           1.00034    0.04657   21.48   <2e-16 ***
## x3           0.92523    0.04657   19.87   <2e-16 ***
## x4           1.07064    0.04657   22.99   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1041 on 45 degrees of freedom
## Multiple R-squared:  0.9951,	Adjusted R-squared:  0.9947
## F-statistic:  2293 on 4 and 45 DF,  p-value: < 2.2e-16
``````
``````coef(lm_split)
``````
``````## (Intercept)          x1          x2          x3          x4
##   0.9962925   1.0264132   1.0003351   0.9252281   1.0706351
``````
``````c(means[1], diff(means))
``````
``````##         1         2         3         4         5
## 0.9962925 1.0264132 1.0003351 0.9252281 1.0706351
``````

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...