Factor codings for models in R

[This article was first published on Heidi's stats blog - Rbloggers, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

I am holding an exercise on generalised models these days. Preparing a task on factor coding
in generalised linear models, I realised that the help on the internet on that is not so
easy to understand. At least what I found. So in order to help people who find this topic
confusing, I want to help out a little here.

Consider the following data

<span class="n">set.seed</span><span class="p">(</span><span class="m">29</span><span class="p">)</span>
<span class="n">x</span> <span class="o"><-</span> <span class="n">gl</span><span class="p">(</span><span class="m">5</span><span class="p">,</span> <span class="m">10</span><span class="p">)</span> <span class="c1"># 5 factor's levels, each replicated 10 times
</span><span class="n">y</span> <span class="o"><-</span> <span class="n">rnorm</span><span class="p">(</span><span class="n">n</span> <span class="o">=</span> <span class="n">length</span><span class="p">(</span><span class="n">x</span><span class="p">),</span> <span class="n">mean</span> <span class="o">=</span> <span class="n">as.integer</span><span class="p">(</span><span class="n">x</span><span class="p">),</span> <span class="n">sd</span> <span class="o">=</span> <span class="m">0.1</span><span class="p">)</span>   <span class="c1"># means= 1,2,3,4,5
</span><span class="n">dd</span> <span class="o"><-</span> <span class="n">data.frame</span><span class="p">(</span><span class="n">x</span> <span class="o">=</span> <span class="n">x</span><span class="p">,</span> <span class="n">y</span> <span class="o">=</span> <span class="n">y</span><span class="p">)</span>

<span class="n">library</span><span class="p">(</span><span class="n">ggplot2</span><span class="p">)</span>
<span class="n">ggplot</span><span class="p">(</span><span class="n">aes</span><span class="p">(</span><span class="n">x</span> <span class="o">=</span> <span class="n">x</span><span class="p">,</span> <span class="n">y</span> <span class="o">=</span> <span class="n">y</span><span class="p">),</span> <span class="n">data</span> <span class="o">=</span> <span class="n">dd</span><span class="p">)</span> <span class="o">+</span> <span class="n">geom_boxplot</span><span class="p">()</span>

plot of chunk unnamed-chunk-1

<span class="n">means</span> <span class="o"><-</span> <span class="n">tapply</span><span class="p">(</span><span class="n">y</span><span class="p">,</span> <span class="n">x</span><span class="p">,</span> <span class="n">mean</span><span class="p">)</span>
<span class="n">means</span>
##         1         2         3         4         5 
## 0.9962925 2.0227058 3.0230409 3.9482690 5.0189041

Now we want to model the mean of y given x using the lm() function with the following codings:
dummy-coding, treatment-coding (where the reference category is 5), effect-coding
and split-coding.

To make the theory more general, we have a categorical variable ( X ) with ( K ) categories ( (a_1, dots, a_K) )

Dummy-coding

Look at each level separately:
[ E(Y|X=a_{k}) = beta_k, quad k=1,dots,K ]

<span class="c1"># dummy (each level singularly)
</span><span class="n">lm_dummy</span> <span class="o"><-</span> <span class="n">lm</span><span class="p">(</span> <span class="n">y</span> <span class="o">~</span> <span class="n">x</span> <span class="o">-</span> <span class="m">1</span><span class="p">,</span> <span class="n">contrasts</span> <span class="o">=</span> <span class="n">list</span><span class="p">(</span><span class="n">x</span> <span class="o">=</span> <span class="n">contr.treatment</span><span class="p">(</span><span class="m">5</span><span class="p">)))</span>
<span class="n">summary</span><span class="p">(</span><span class="n">lm_dummy</span><span class="p">)</span>
## 
## Call:
## lm(formula = y ~ x - 1, contrasts = list(x = contr.treatment(5)))
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.191692 -0.065303 -0.002686  0.054485  0.214752 
## 
## Coefficients:
##    Estimate Std. Error t value Pr(>|t|)    
## x1  0.99629    0.03293   30.25   <2e-16 ***
## x2  2.02271    0.03293   61.43   <2e-16 ***
## x3  3.02304    0.03293   91.80   <2e-16 ***
## x4  3.94827    0.03293  119.90   <2e-16 ***
## x5  5.01890    0.03293  152.41   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1041 on 45 degrees of freedom
## Multiple R-squared:  0.9991,	Adjusted R-squared:  0.999 
## F-statistic: 1.014e+04 on 5 and 45 DF,  p-value: < 2.2e-16
<span class="n">coef</span><span class="p">(</span><span class="n">lm_dummy</span><span class="p">)</span>
##        x1        x2        x3        x4        x5 
## 0.9962925 2.0227058 3.0230409 3.9482690 5.0189041
<span class="n">means</span>
##         1         2         3         4         5 
## 0.9962925 2.0227058 3.0230409 3.9482690 5.0189041

Treatment-coding

Compare each category to the dummy-category ( a_d ):

[ E(Y X=a_{d}) = beta_0 ]
[ E(Y X=a_k) = beta_0 + beta_k, quad kneq d ]
<span class="c1"># treatment (restrict one level to constant term, all other difference from it)
</span><span class="n">lm_treatment</span> <span class="o"><-</span> <span class="n">lm</span><span class="p">(</span> <span class="n">y</span> <span class="o">~</span> <span class="n">x</span><span class="p">,</span> <span class="n">contrast</span> <span class="o">=</span> <span class="n">list</span><span class="p">(</span><span class="n">x</span> <span class="o">=</span> <span class="n">contr.treatment</span><span class="p">(</span><span class="m">5</span><span class="p">,</span> <span class="n">base</span> <span class="o">=</span> <span class="m">5</span><span class="p">)))</span>
<span class="n">summary</span><span class="p">(</span><span class="n">lm_treatment</span><span class="p">)</span>
## 
## Call:
## lm(formula = y ~ x, contrasts = list(x = contr.treatment(5, base = 5)))
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.191692 -0.065303 -0.002686  0.054485  0.214752 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  5.01890    0.03293  152.41   <2e-16 ***
## x1          -4.02261    0.04657  -86.38   <2e-16 ***
## x2          -2.99620    0.04657  -64.34   <2e-16 ***
## x3          -1.99586    0.04657  -42.86   <2e-16 ***
## x4          -1.07064    0.04657  -22.99   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1041 on 45 degrees of freedom
## Multiple R-squared:  0.9951,	Adjusted R-squared:  0.9947 
## F-statistic:  2293 on 4 and 45 DF,  p-value: < 2.2e-16
<span class="n">coef</span><span class="p">(</span><span class="n">lm_treatment</span><span class="p">)</span>
## (Intercept)          x1          x2          x3          x4 
##    5.018904   -4.022612   -2.996198   -1.995863   -1.070635
<span class="n">c</span><span class="p">(</span><span class="n">means</span><span class="p">[</span><span class="m">5</span><span class="p">],</span> <span class="n">means</span><span class="p">[</span><span class="m">1</span><span class="o">:</span><span class="m">4</span><span class="p">]</span> <span class="o">-</span> <span class="n">means</span><span class="p">[</span><span class="m">5</span><span class="p">])</span>
##         5         1         2         3         4 
##  5.018904 -4.022612 -2.996198 -1.995863 -1.070635

Effect-coding

Compare each category to the mean:

[ E(Y X=a_k) = beta_0 + beta_k, quad k=1,dots,K-1 ]
[ E(Y X=a_K) = beta_0 – sumlimits_{j=1}^{K-1} beta_j ]
<span class="c1"># effect(deviation from overall average)
</span><span class="n">lm_effect</span> <span class="o"><-</span> <span class="n">lm</span><span class="p">(</span><span class="n">y</span> <span class="o">~</span> <span class="n">x</span><span class="p">,</span> <span class="n">contrast</span> <span class="o">=</span> <span class="n">list</span><span class="p">(</span><span class="n">x</span> <span class="o">=</span> <span class="n">contr.sum</span><span class="p">(</span><span class="m">5</span><span class="p">)))</span>
<span class="n">summary</span><span class="p">(</span><span class="n">lm_effect</span><span class="p">)</span>
## 
## Call:
## lm(formula = y ~ x, contrasts = list(x = contr.sum(5)))
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.191692 -0.065303 -0.002686  0.054485  0.214752 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.00184    0.01473  203.84   <2e-16 ***
## x1          -2.00555    0.02945  -68.09   <2e-16 ***
## x2          -0.97914    0.02945  -33.24   <2e-16 ***
## x3           0.02120    0.02945    0.72    0.475    
## x4           0.94643    0.02945   32.13   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1041 on 45 degrees of freedom
## Multiple R-squared:  0.9951,	Adjusted R-squared:  0.9947 
## F-statistic:  2293 on 4 and 45 DF,  p-value: < 2.2e-16
<span class="n">coef</span><span class="p">(</span><span class="n">lm_effect</span><span class="p">)</span>
## (Intercept)          x1          x2          x3          x4 
##  3.00184245 -2.00554992 -0.97913668  0.02119841  0.94642653
<span class="n">c</span><span class="p">(</span><span class="n">mean</span><span class="p">(</span><span class="n">means</span><span class="p">),</span> <span class="n">means</span><span class="p">[</span><span class="m">1</span><span class="o">:</span><span class="m">4</span><span class="p">]</span> <span class="o">-</span> <span class="n">mean</span><span class="p">(</span><span class="n">means</span><span class="p">))</span>
##                       1           2           3           4 
##  3.00184245 -2.00554992 -0.97913668  0.02119841  0.94642653

Split-coding

Compare each category to the previous category (for ordered categories):

[ E(Y X=a_1) = beta_0 ]
[ E(Y X=a_k) = beta_0 + sumlimits_{j=1}^{k-1} beta_j, quad k=2,dots,K ]
<span class="c1"># split coding
</span><span class="n">c</span> <span class="o"><-</span> <span class="n">rbind</span><span class="p">(</span> <span class="n">c</span><span class="p">(</span><span class="m">0</span><span class="p">,</span><span class="m">0</span><span class="p">,</span><span class="m">0</span><span class="p">,</span><span class="m">0</span><span class="p">),</span>
            <span class="n">c</span><span class="p">(</span><span class="m">1</span><span class="p">,</span><span class="m">0</span><span class="p">,</span><span class="m">0</span><span class="p">,</span><span class="m">0</span><span class="p">),</span>
            <span class="n">c</span><span class="p">(</span><span class="m">1</span><span class="p">,</span><span class="m">1</span><span class="p">,</span><span class="m">0</span><span class="p">,</span><span class="m">0</span><span class="p">),</span>
            <span class="n">c</span><span class="p">(</span><span class="m">1</span><span class="p">,</span><span class="m">1</span><span class="p">,</span><span class="m">1</span><span class="p">,</span><span class="m">0</span><span class="p">),</span>
            <span class="n">c</span><span class="p">(</span><span class="m">1</span><span class="p">,</span><span class="m">1</span><span class="p">,</span><span class="m">1</span><span class="p">,</span><span class="m">1</span><span class="p">)</span> <span class="p">)</span>

<span class="n">lm_split</span> <span class="o"><-</span> <span class="n">lm</span><span class="p">(</span><span class="n">y</span> <span class="o">~</span> <span class="n">x</span><span class="p">,</span> <span class="n">contrast</span> <span class="o">=</span> <span class="n">list</span><span class="p">(</span><span class="n">x</span> <span class="o">=</span> <span class="n">c</span><span class="p">))</span>
<span class="n">summary</span><span class="p">(</span><span class="n">lm_split</span><span class="p">)</span>
## 
## Call:
## lm(formula = y ~ x, contrasts = list(x = c))
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.191692 -0.065303 -0.002686  0.054485  0.214752 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.99629    0.03293   30.25   <2e-16 ***
## x1           1.02641    0.04657   22.04   <2e-16 ***
## x2           1.00034    0.04657   21.48   <2e-16 ***
## x3           0.92523    0.04657   19.87   <2e-16 ***
## x4           1.07064    0.04657   22.99   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1041 on 45 degrees of freedom
## Multiple R-squared:  0.9951,	Adjusted R-squared:  0.9947 
## F-statistic:  2293 on 4 and 45 DF,  p-value: < 2.2e-16
<span class="n">coef</span><span class="p">(</span><span class="n">lm_split</span><span class="p">)</span>
## (Intercept)          x1          x2          x3          x4 
##   0.9962925   1.0264132   1.0003351   0.9252281   1.0706351
<span class="n">c</span><span class="p">(</span><span class="n">means</span><span class="p">[</span><span class="m">1</span><span class="p">],</span> <span class="n">diff</span><span class="p">(</span><span class="n">means</span><span class="p">))</span>
##         1         2         3         4         5 
## 0.9962925 1.0264132 1.0003351 0.9252281 1.0706351

To leave a comment for the author, please follow the link and comment on their blog: Heidi's stats blog - Rbloggers.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)