[This article was first published on Statistical Reflections of a Medical Doctor » R, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

In the first technical post in this series, I conducted a numerical investigation of the biasedness of random effect predictions in generalized linear mixed models (GLMM), such as the ones used in the Surgeon Scorecard, I decided to undertake two explorations: firstly, the behavior of these estimates as more and more data are gathered for each individual surgeon and secondly whether the limiting behavior of these estimators critically depends on the underlying GLMM family. Note that the first question directly assesses whether the random effect estimators reflect the underlying (but unobserved) “true” value of the individual practitioner effect in logistic regression models for surgical complications. On the other hand the second simulation examines a separate issue, namely whether the non-linearity of the logistic regression model affects the convergence rate of the random effect predictions towards their true value.

For these simulations we will examine three different ranges of dataset sizes for each surgeon:

• small data (complication data from between 20-100 cases/ surgeon are available)
• large data (complications from between 200-1000 cases/surgeon)
• extra large data (complications from between 1000-2000 cases/surgeon)

We simulated 200 surgeons (“random effects”) from a normal distribution with a mean of zero and a standard deviation of 0.26, while the population average complication rate was set t0 5%. These numbers were chosen to reflect the range of values (average and population standard deviation) of the random effects in the Score Card dataset, while the use of 200 random effects was a realistic compromise with the computational capabilities of the Asus Transformer T100 2 in 1 laptop/tablet that I used for these analyses.

The following code was used to simulate the logistic case for small data (the large and extra large cases were simulated by changing the values of the Nmin and Nmax variables).

library(lme4)
library(mgcv)
## helper functions
logit<-function(x) log(x/(1-x))
invlogit<-function(x) exp(x)/(1+exp(x))

## simulate cases
simcase<-function(N,p) rbinom(N,1,p)
## simulation scenario
pall<-0.05; # global average
Nsurgeon<-200; # number of surgeons
Nmin<-20; # min number of surgeries per surgeon
Nmax<-100; # max number of surgeries per surgeon

## simulate individual surgical performance
## how many simulations of each scenario
set.seed(123465); # reproducibility
ind<-rnorm(Nsurgeon,0,.26) ; # surgical random effects
logitind<-logit(pall)+ind ; # convert to logits
pind<-invlogit(logitind); # convert to probabilities
Nsim<-sample(Nmin:Nmax,Nsurgeon,replace=TRUE); # number of cases per surgeon

complications<-data.frame(ev=do.call(c,mapply(simcase,Nsim,pind,SIMPLIFY=TRUE)),
id=do.call(c,mapply(function(i,N) rep(i,N),1:Nsurgeon,Nsim)))
complications$id<-factor(complications$id)



A random effect and fixed effect model were fit to these data (the fixed effect model is simply a series of independent fits to the data for each random effect):

## Random Effects

fit2<-glmer(ev~1+(1|id),data=complications,family=binomial,nAGQ=2)
ran2<-ranef(fit2)[["id"]][,1]
c2<-cor(ran2,ind)
int2<-fixef(fit2)
ranind2<-ran2+int2

## Fixed Effects

fixfit<-vector("numeric",Nsurgeon)
for(i in 1:Nsurgeon) {
fixfit[i]<-glm(ev~1,data=subset(complications,id==i),family="binomial")$coef[1] }  The corresponding Gaussian GLMM cases were simulated by making minor changes to these codes. These are shown below: simcase<-function(N,p) rnorm(N,p,1) fit2<-glmer(ev~1+(1|id),data=complications,nAGQ=2) fixfit[i]<-glm(ev~1,data=subset(complications,id==i),family="gaussian")$coef[1]



The predicted random effects were assessed against the simulated truth by smoothing regression splines. In these regressions, the intercept yields the bias of the average of the predicted random effects vis-a-vis the truth, while the slope of the regression quantifies the amount of shrinkage effected by the mixed model formulation. For unbiased estimation not only would we like the intercept to be zero, but also the slope to be equal to one. In this case, the predicted random effect would be equal to its true (simulated) value. Excessive shrinkage would result in a slope that is substantially different from one. Assuming that the bias (intercept) is not different from zero, the relaxation of the slope towards one quantifies the consistency and the bias (or rather its rate of convergence) of these estimators using simulation techniques (or so it seems to me).

The use of smoothing (flexible), rather than simple linear regression, to quantify these relationships does away with a restrictive assumption: that the amount of shrinkage is the same throughout the range of the random effects:

## smoothing spline (flexible) fit
fitg<-gam(ran2~s(ind)
## linear regression
fitl<-lm(ran2~ind)


The following figure shows the results of the flexible regression (black with 95% CI, dashed black) v.s. the linear regression (red) and the expected (blue) line (intercept of zero, slope of one).

Predicted v.s. simulated random effects for logistic and linear mixed regression as a function of the number of observations per random effect (cluster size)

Several observations are worth noting in this figure.
First
, the flexible regression was indistinguishable from a linear regression in all cases; hence the red and black lines overlap. Stated in other terms, the amount of shrinkage was the same across the range of the random effect values.
Second, the intercept in all flexible models was (within machine precision) equal to zero. Consequently, when estimating a group of random effects their overall mean is (unbiasedly) estimated.
Third, the amount of shrinkage of individual random effects appears to be excessive for small sample sizes (i.e. few cases per surgeon). Increasing the number of cases decreases the shrinkage, i.e. the black and red lines come closer to the blue line as N is increased from 20-100 to 1000-2000. Conversely, for small cluster sizes the amount of shrinkage is so excessive that one may lose the ability to distinguish between individuals with very different complication rates. This is reflected by a regression line between the predicted and the simulated random effect value that is nearly horizontal.
Fourth, the rate of convergence of the predicted random effects to their true value critically depends upon the linearity of the regression model. In particular, the shrinkage of logistic regression model with 1000-2000 observations per case is almost the same at that of a linear model with 20-100 for the parameter values considered in this simulation.

An interesting question is whether these observations (overshrinkage of random effects from small sample sizes in logistic mixed regression) reflects the use of random effects in modeling, or whether they are simply due to the interplay between sample size and the non-linearity of the statistical model. Hence, I turned to fixed effects modeling of the same datasets. The results of these analyses are summarized in the following figure:

Difference between fixed effect estimates of random effects(black histograms) v.s. random effects predictions (density estimators: red lines) relative to their simulated (true) values

One notes that the distribution of the differences between the random and fixed effects relative to the true (simulated) values is nearly identical for the linear case (second row). In other words, the use of the implicit constraint of the mixed model, offers no additional advantage when estimating individual performance in this model. On the other hand there is a value in applying mixed modeling techniques for the logistic regression case. In particular, outliers (such as those arising for small samples) are eliminated by the use of random effect modeling. The difference between the fixed and the random effect approach progressively decreases for large sample sizes, implying that the benefit of the latter approach is lost for “extra large” cluster sizes.

One way to put these differences into perspective is to realize that the random effects for the logistic model correspond to log-odd ratios, relative to the population mean. Hence the difference between the predicted random effect and its true value, when exponentiated, corresponds to an Odd Ratio (OR). A summary of the odds ratios over the population of the random effects as a function of cluster size is shown below.

Metric 20-100  200-1000 1000-2000
Min    0.5082   0.6665    0.7938
Q25    0.8901   0.9323    0.9536
Median 1.0330   1.0420    1.0190
Mean   1.0530   1.0410    1.0300
Q75    1.1740   1.1340    1.1000
Max    1.8390   1.5910    1.3160



Even though the average Odds Ratio is close to 1, a substantial number of predicted random effects are far from the true value and yield ORs that are greater than 11% in either direction for small cluster sizes. These observations have implications for the Score Card (or similar projects): if one were to use Random Effects modeling to focus on individuals, then unless the cluster sizes (observations per individual) are substantial, one would run a substantial risk of misclassifying individuals, even though one would be right on average!

One could wonder whether these differences between the simulated truth and the predicted random effects arise as a result of the numerical algorithms of the lme4 package. The latter was used by both the Surgeon Score Card project and our simulations so far and thus it would be important to verify that it performs up to specs. The major tuning variable for the algorithm is the order of the Adaptive Gaussian Quadrature (argument nAGQ). We did not find any substantial departures when the order of the quadrature was varied from 0 to 1 and 2. However, there is a possibility that the algorithm fails for all AGQ orders as it has to calculate probabilities that are numerically close to the boundary of the parameter space. We thus decided to fit the same model from a Bayesian perspective using Markov Chain Monte Carlo (MCMC) methods. The following code will fit the Bayesian model and graphs the true values of the effects used in the simulated dataset against the Bayesian estimates (the posterior mean) and also the lme4 predictions. The latter tend to be numerically close to the posterior mode of the random effects when a Bayesian perspective is adopted.

## Fit the mixed effects logistic model from R using openbugs

library("glmmBUGS")
library(nlme)
fitBUGS = glmmBUGS(ev ~ 1, data=complications, effects="id", family="bernoulli")
startingValues = fitBUGS$startingValues source("getInits.R") require(R2WinBUGS) fitBUGSResult = bugs(fitBUGS$ragged, getInits, parameters.to.save = names(getInits()),
model.file="model.txt", n.chain=3, n.iter=6000, n.burnin=2000, n.thin=10,
program="openbugs", working.directory=getwd())

fitBUGSParams = restoreParams(fitBUGSResult , fitBUGS$ragged) sumBUGS<-summaryChain(fitBUGSParams ) checkChain(fitBUGSParams ) ## extract random effects cnames<-as.character(sort(as.numeric(row.names(sumBUGS$FittedRid))))
fitBUGSRE<-sumBUGS\$Rid[cnames,1]

## plot against the simulated (true) effects and the lme4 estimates

hist(ind,xlab="RE",ylim=c(0,3.8),freq=FALSE,main="")
lines(density(fitBUGSRE),main="Bayesian",xlab="RE",col="blue")
lines(density(ran2),col="red")
legend(legend=c("Truth","lme4","MCMC"),col=c("black","red","blue"),
bty="n",x=0.2,y=3,lwd=1)



The following figure shows the histogram of the true values of the random effects (black), the frequentist(lme4) estimates (red) and the Bayesian posterior means (blue).

It can be appreciated that both the Bayesian estimates and the lme4 predictions demonstrate considerable shrinkage relative to the true values for small cluster sizes (20-100). Hence, an lme4 numerical quirk seems an unlikely explanation for the shrinkage observed in the simulation.

Summing up:

• Random Effect modeling of binomial outcomes require a substantial number of observations per individual (cluster size) for the procedure to yield estimates of individual effects that are numerically indistinguishable  from the true values
• Fixed effect modeling is even worse an approach for this problem
• Bayesian fitting procedures do not appear to yield numerically different effects from their frequentist counterparts

These features should raise the barrier for accepting a random effects logistic modeling approach when the focus is on individual rather than population average effects. Even though the procedure is certainly preferable to fixed effects regression, the direct use of the value of the predicted individual random effects as an effect measure will be problematic for small cluster sizes (e.g. a small number of procedures per surgeon). In particular, a substantial proportion of these estimated effects is likely to be far from the truth even if the model is unbiased on the average. These observations are of direct relevance to the Surgical Score Card in which the number of observations per surgeon were far lower than the average value in our simulations: 60 (small), 600 (large) and 1500 (extra large):

 Procedure Code N (procedures) N(surgeons) Procedures per surgeon 51.23 201,351 21,479 9.37 60.5 78,763 5,093 15.46 60.29 73,752 7,898 9.34 81.02 52,972 5,624 9.42 81.07 106,689 6,214 17.17 81.08 102,716 6,136 16.74 81.51 494,576 13,414 36.87 81.54 1,190,631 18,029 66.04 Total 2,301,450 83,887 27.44