Generating ANOVA-like table from GLMM using parametric bootstrap

Posted on February 26, 2015 by grumble10 in R bloggers | 0 Comments

[This article was first published on biologyforfun » R, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

This article may also be found on RPubs: http://rpubs.com/hughes/63269

In the list of worst to best way to test for effect in GLMM the list on http://glmm.wikidot.com/faq state that parametric bootstrapping is among the best options. PBmodcomp in the pbkrtest package implement such parametric bootstrapping by comparing a full model to a null one. The function simulate data (the response vector) from the null model then fit these data to the null and full model and derive a likelihood ratio test for each of the simulated data. Then we can compare the observed likelihood ratio test to the null distribution generated from the many simulation and derive a p-value. The advantage of using such a method over the classical p-values derived from a chi-square test on the likelihood ratio test is that in the parametric bootstrap we do not assume any null distribution (like chi-square) but instead derive our own null distribution from the model and the data at hand. We do not make the assumption then that the likelihood ratio test statistic is chi-square distributed. I have made a little function that wraps around the PBmodcomp function to compute bootstrapped p-values for each term in a model by sequentially adding them. This lead to anova-like table that are typically obtained when one use the command anova on a glm object.

#the libraries used
library(lme4)
library(arm)
library(pbkrtest)
#the function
anova_merMod<-function(model,rand,w=NULL,seed=round(runif(1,0,100),0),nsim=50){
  data<-model@frame
  if(!is.null(w)){
    data<-data[,-grep("(weights)",names(data))]
  }
  
  resp<-names(model.frame(model))[1]
  #generate a list of reduced model formula
  fs<-list()
  fs[[1]]<-as.formula(paste(resp,"~ 1 +",rand))
  nb_terms<-length(attr(terms(model),"term.labels"))
  if(nb_terms>1){
    for(i in 1:nb_terms){
      tmp<-c(attr(terms(model),"term.labels")[1:i],rand)
      fs[[i+1]]<-reformulate(tmp,response=resp)
    }      
  }

  #fit the reduced model to the data
  
  fam<-family(model)[1]$family
  if(fam=="gaussian"){
    m_fit<-lapply(fs,function(x) lmer(x,data,REML=FALSE))
  } else if(fam=="binomial"){
    m_fit<-lapply(fs,function(x) glmer(x,data,family=fam,weights=w))
  }  else{
    m_fit<-lapply(fs,function(x) glmer(x,data,family=fam))
  }

  #compare nested model with one another and get LRT values (ie increase in the likelihood of the models as parameters are added)
  tab_out<-NULL
  
  for(i in 1:(length(m_fit)-1)){
    comp<-PBmodcomp(m_fit[[i+1]],m_fit[[i]],seed=seed,nsim=nsim)    
    term_added<-attr(terms(m_fit[[i+1]]),"term.labels")[length(attr(terms(m_fit[[i+1]]),"term.labels"))]
    #here are reported the bootstrapped p-values, ie not assuming any parametric distribution like chi-square to the LRT values generated under the null model
    #these p-values represent the number of time the simulated LRT value (under null model) are larger than the observe one
    tmp<-data.frame(term=term_added,LRT=comp$test$stat[1],p_value=comp$test$p.value[2])
    tab_out<-rbind(tab_out,tmp)
    print(paste("Variable ",term_added," tested",sep=""))
  }  
  print(paste("Seed set to:",seed))
  return(tab_out)  
}

You pass your GLMM model to the function together with the random part as character (see example below), if you fitted a binomial GLMM you also need to provide the weights as a vector, you can then set a seed and the last argument is the number of simulation to do, it is set by default to 50 for rapid checking purpose but if you want to report these results larger values (ie 1000, 10000) should be used.

Let’s look at a simple LMM example:

data(grouseticks)
m<-lmer(TICKS~cHEIGHT+YEAR+(1|BROOD),grouseticks)
summary(m)
## Linear mixed model fit by REML ['lmerMod']
## Formula: TICKS ~ cHEIGHT + YEAR + (1 | BROOD)
##    Data: grouseticks
## 
## REML criterion at convergence: 2755
## 
## Scaled residuals: 
##    Min     1Q Median     3Q    Max 
## -3.406 -0.246 -0.036  0.146  5.807 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  BROOD    (Intercept) 87.3     9.34    
##  Residual             28.1     5.30    
## Number of obs: 403, groups:  BROOD, 118
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept)   5.4947     1.6238    3.38
## cHEIGHT      -0.1045     0.0264   -3.95
## YEAR96        4.1910     2.2424    1.87
## YEAR97       -4.3304     2.2708   -1.91
## 
## Correlation of Fixed Effects:
##         (Intr) cHEIGH YEAR96
## cHEIGHT -0.091              
## YEAR96  -0.726  0.088       
## YEAR97  -0.714  0.052  0.518
anova_merMod(model=m,rand="(1|BROOD)")
## [1] "Variable cHEIGHT tested"
## [1] "Variable YEAR tested"
## [1] "Seed set to: 63"
##      term   LRT p_value
## 1 cHEIGHT 14.55 0.01961
## 2    YEAR 14.40 0.01961

The resulting table show for each term in the model the likelihood ratio test, which is basically the decrease in deviance when going from the null to the full model and the p value, you may look at the PBtest line in the details of ?PBmodcomp to see how it is computed.

Now let’s see how to use the function with binomial GLMM:

#simulate some binomial data
x1<-runif(100,-2,2)
x2<-runif(100,-2,2)
group<-gl(n = 20,k = 5)
rnd.eff<-rnorm(20,mean=0,sd=1.5)
p<-1+0.5*x1-2*x2+rnd.eff[group]+rnorm(100,0,0.3)
y<-rbinom(n = 100,size = 10,prob = invlogit(p))
prop<-y/10
#fit a model
m<-glmer(prop~x1+x2+(1|group),family="binomial",weights = rep(10,100))
summary(m)
## Generalized linear mixed model fit by maximum likelihood (Laplace
##   Approximation) [glmerMod]
##  Family: binomial  ( logit )
## Formula: prop ~ x1 + x2 + (1 | group)
## Weights: rep(10, 100)
## 
##      AIC      BIC   logLik deviance df.resid 
##    288.6    299.1   -140.3    280.6       96 
## 
## Scaled residuals: 
##    Min     1Q Median     3Q    Max 
## -2.334 -0.503  0.181  0.580  2.466 
## 
## Random effects:
##  Groups Name        Variance Std.Dev.
##  group  (Intercept) 1.38     1.18    
## Number of obs: 100, groups:  group, 20
## 
## Fixed effects:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)    0.748      0.287    2.61   0.0092 ** 
## x1             0.524      0.104    5.02  5.3e-07 ***
## x2            -2.083      0.143  -14.56  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##    (Intr) x1    
## x1  0.090       
## x2 -0.205 -0.345
#takes some time
anova_merMod(m,rand = "(1|group)",w = rep(10,100))
## [1] "Variable x1 tested"
## [1] "Variable x2 tested"
## [1] "Seed set to: 98"
##   term      LRT p_value
## 1   x1   0.0429 0.80392
## 2   x2 502.0921 0.01961

For binomial model, the model must be fitted with proportion data and a vector of weights (ie the number of binomial trial) must be passed to the ‘w’ argument. Some warning message may pop up at the end of the function, this comes from convergence failure in PBmodcomp, this do not affect the results, you may read the article from the pbkrtest package: http://www.jstatsoft.org/v59/i09/ to understand better where this comes from.

Happy modeling and as Ben Bolker say: “When all else fails, don’t forget to keep p-values in perspective: http://www.phdcomics.com/comics/archive.php?comicid=905 “

Filed under: R and Stat Tagged: bootstrap, GLMM, R