Level fit summaries can be tricky in R

October 1, 2012
By

(This article was first published on Win-Vector Blog » R, and kindly contributed to R-bloggers)

Model level fit summaries can be tricky in R. A quick read of model fit summary data for factor levels can be misleading. We describe the issue and demonstrate techniques for dealing with them.When modeling you often encounter what are commonly called categorical variables, which are called factors in R. Possible values of categorical variables or factors are called levels. We will demonstrate some of the needed in interpreting level based summaries, using R as an example.

Consider the following data frame:

d <- data.frame(x=c('a','a',
  'b','b','b','b','b','b','b','b','b','b', 'b', 'b',
  'c','c','c','c','c','c','c','c','c','c'),
   y=c(-5,5,
      1,1,1,1,1,1,1,1,1,1,1,0.8,
      -1,-1,-1,-1,-1,-1,-1,-1,-1,-1))

When we use such a frame in a model (such as a lm()) the categorical variable x needs to be converted into a numerical representation to be usable. The common way to do this is to introduce each observed value or level into a new zero/one indicator variable. For our input x we would expect three new numeric indicator varaibles: xa, xb and xc. Then the values of x are encoded in these three variables as given in the following table:









xa xb xc
x==’a’ 1.00 0.00 0.00
x==’b’ 0.00 1.00 0.00
x==’c’ 0.00 0.00 1.00

This sort of bookkeeping rapidly gets painful, so R has a lot of code deep down in its implementation of factors and contrasts to perform this encoding for you. However no matter how much you mess with contrasts(), C() or other commands (setting treatment to things like contr.sum and so on) you can not get over the fact that R really really wants to use 2 variables to encode 3 levels (and it will fight mightily to prevent you from fixing this). Largely this is because R is trying to prevent a linear dependence from making it to the fitter. Notice in our above table all encodings obey the linear relation xa + xb + xc = 1. This means one of these variables is redundant and something like the reasonable linear model xa + xb + xc is equivalent to the unreasonable model 1000000 xa + 1000000 xb + 1000000 xc - 999999. Now my strong preference would be to use modeling code that can deal with this by using simple methods like ridge regression. But R, like most traditional stat programs, doesn’t trust its linear algebra system to deal with linear dependencies- so they don’t allow the above encoding and force us to use variations of the encoding seen in the next table:









xb xc
x==’a’ 0.00 0.00
x==’b’ 1.00 0.00
x==’c’ 0.00 1.00

In this encoding both b and c are written as difference from the “hidden value” a. We no longer have any redundancy as a‘s encoding ([0,0]) no longer totals to 1. However, this leads to a new subtle problem in fit reporting. Consider fitting a linear model for y ~ a x + b on the example data:

m <- lm(y~x,data=d)
summary(m)

Call:
lm(formula = y ~ x, data = d)

Residuals:
    Min      1Q  Median      3Q     Max 
-5.0000  0.0000  0.0083  0.0167  5.0000 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)  1.541e-15  1.091e+00   0.000    1.000
xb           9.833e-01  1.179e+00   0.834    0.414
xc          -1.000e+00  1.196e+00  -0.836    0.412

Residual standard error: 1.544 on 21 degrees of freedom
Multiple R-squared: 0.3002,	Adjusted R-squared: 0.2336 
F-statistic: 4.505 on 2 and 21 DF,  p-value: 0.02355 

The default factor encoding chose a as the reference or hidden level. The coefficients on remaining two levels (xb and xc) now represent how much boost you get relative to the hidden level. In this case there are large error bars around the hidden level (it only occurred twice and with wildly different y values). So we have horrid t-values and no reported evidence that the effects of b and c are significantly different than a. But this is misleading. This report does not tell us if there is a significant difference between xb and xc, the only reason we can’t tell if they are far from the hidden estimate of xa is uncertainty in estimating the (hidden) coefficient of xa itself. Uncertainties like this are not distances and don’t obey the triangle inequality or transitivity- not being able to show either of xb or xc is far from xa does not mean we can not show xb is far from xc. It just wasn’t in this report.

To fix this we apply a function to “re-level” the factor x so the hidden level to the most popular value (i.e. choosing which level is hidden and choosing it in such a way we expect the error bars to be small):

setRefToMostCommonLevel <- function(f) { 
   f <- as.factor(f) 
   t <- table(f)
   relevel(f,ref=as.integer(which(t>=max(t))[[1]]))
}
d$x <- setRefToMostCommonLevel(d$x)
d$x
m <- lm(y~x,data=d)
summary(m)

With the most popular level (b) as the reference we get the following result:

Call:
lm(formula = y ~ x, data = d)

Residuals:
    Min      1Q  Median      3Q     Max 
-5.0000  0.0000  0.0083  0.0167  5.0000 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)   
(Intercept)   0.9833     0.4456   2.207  0.03860 * 
xa           -0.9833     1.1789  -0.834  0.41362   
xc           -1.9833     0.6609  -3.001  0.00681 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

Residual standard error: 1.544 on 21 degrees of freedom
Multiple R-squared: 0.3002,	Adjusted R-squared: 0.2336 
F-statistic: 4.505 on 2 and 21 DF,  p-value: 0.02355 

We now have p ≤ 0.00681 significance (which is good, smaller is better for reported significances) that the outcome (y) difference between the level c the hidden level b is unlikely under the null-hypothesis.

Of course what was hiding the significance report were large error bars on the reference level. So we could try running the following code to select the level with the least variance in outcome to minimize the size of the reference error bars. So here is code that picks the level with minimum sample variance (a trade-off between number of observations and variation on associated y values):

setRefToLeastVarLevel <- function(f,y) { 
   f <- as.factor(f) 
   t <- sapply(levels(f),function(fi) var(y[f==fi]))
   if(all(sapply(t,is.na))) {
     setRefToMostCommonLevel(f)
   } else {
     relevel(f,
       ref=as.integer(which((!sapply(t,is.na))&(t<=min(t,na.rm=T)))[[1]]))
   }
}
d$x <- setRefToLeastVarLevel(d$x,d$y)
d$x
m <- lm(y~x,data=d)
summary(m)

This picks c as the reference level of minimum variance resulting in the following:

Call:
lm(formula = y ~ x, data = d)

Residuals:
    Min      1Q  Median      3Q     Max 
-5.0000  0.0000  0.0083  0.0167  5.0000 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)   
(Intercept)  -1.0000     0.4881  -2.049  0.05320 . 
xb            1.9833     0.6609   3.001  0.00681 **
xa            1.0000     1.1957   0.836  0.41237   
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

Residual standard error: 1.544 on 21 degrees of freedom
Multiple R-squared: 0.3002,	Adjusted R-squared: 0.2336 
F-statistic: 4.505 on 2 and 21 DF,  p-value: 0.02355 

Another approach is to look for summaries that are invariant under the choice of reference level. One such summary, though it is an clinical effect size not a significance, is the standard deviation of the levels (with the missing level added into the vector). This statistic is the root mean square coefficient size and is invariant over level choice (due to our appending a zero to represent the coefficient of the reference level):

sd(c(m$coefficients[grep('^x',names(m$coefficients))],0))

[1] 0.9916783

Unfortunately there is no easy relevel() invariant significance term, as when the reference level estimate is itself uncertain all reported differences are also uncertain and when the references level estimate is certain we can have many other levels return significant or insignificant.

The lesson is you have to be a bit careful in reading the significance summaries on factor levels. Model summary reports are only on how each level relates to the hidden level (not the value of the level relative to zero). So the items in the report could be facts about the levels you see or could be facts about the hidden level- you can’t tell which. This is fairly traditional in statistics.

We feel a better approach would be for the fitter to not use hidden levels (add the redundant level back in) and then enforce a bias for small coefficients by an explicit regularization term. In fact many forms of regularization don’t really make sense until you get more explicit control of the level encoding. Another fix would be to add a linear constraint that states the expected impact of the fit level coefficients is zero over the training data (which would eat one degree of freedom and make handing missing values easier, this would be written as ca xa + cb xb + cc xc = 0; where ca,cb and cc are the number of training rows with each of the given levels). But this thinking in term of effects is at odds with the traditional design of factors and contrasts in R. So it would be a lot of visible effort to try and trick R into consistently working in this way (and all of these approaches would complicate computation of significance levels a bit).

We advise: take just a little care and experiment with relevel() when appropriate. Always be wary when reading reported model coefficient significance tables.

To leave a comment for the author, please follow the link and comment on his blog: Win-Vector Blog » R.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...



If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Comments are closed.