**Wiekvoet**, and kindly contributed to R-bloggers)

That is what I read the other day. For calculation of descriptive statistics, values below the LLOQ (lower limit of quantification) were set to…. Then I wondered, wasn’t there a trick in JAGS to incorporate the presence of missing data while estimating parameters of a distribution. How would that compare with standard methods such as imputing LLOQ at 0, half LLOQ, LLOQ or just ignoring the missing data? A simulation seems to be called for.

### Bayes estimation

A bit of searching gave me a way function to estimate the mean and standard deviation, here wrapped in a function so resulting data has a nice shape.

library(R2jags)

library(plyr)

library(ggplot2)

Bsensnorm <- function(yy,limit) {

isCensored <- is.na(yy)

model1 <- function() {

for ( i in 1:N ) {

isCensored[i] ~ dinterval( yy[i] , limit )

yy[i] ~ dnorm( mu , tau )

}

tau <- 1/pow(sigma,2)

sigma ~ dunif(0,100)

mu ~ dnorm(0,1E-6)

}

datain <- list(yy=yy,isCensored=1-as.numeric(isCensored),N=length(yy),limit=limit)

params <- c(‘mu’,’sigma’)

inits <- function() {

list(mu = rnorm(0,1),

sigma = runif(0,1))

}

jagsfit <- jags(datain, model=model1, inits=inits,

parameters=params,progress.bar=”gui”,

n.chains=1,n.iter=7000,n.burnin=2000,n.thin=2)

data.frame( system=’Bayes Uninf’,

mu_out=jagsfit$BUGSoutput$mean$mu,

sigma_out=jagsfit$BUGSoutput$mean$sigma)

}

### Somewhat informative prior

BsensInf <- function(yy,limit) {

isCensored <- is.na(yy)

model2 <- function() {

for ( i in 1:N ) {

isCensored[i] ~ dinterval( yy[i] , limit )

yy[i] ~ dnorm( mu , tau )

}

tau <- 1/pow(sigma,2)

sigma ~ dnorm(1,1) %_% T(0.001,)

mu ~ dnorm(0,1E-6)

}

datain <- list(yy=yy,isCensored=1-as.numeric(isCensored),N=length(yy),limit=limit)

params <- c(‘mu’,’sigma’)

inits <- function() {

list(mu = abs(rnorm(0,1))+.01,

sigma = runif(.1,1))

}

jagsfit <- jags(datain, model=model2, inits=inits, parameters=params,progress.bar=”gui”,

n.chains=1,n.iter=7000,n.burnin=2000,n.thin=2)

data.frame( system=’Bayes Inf’,

mu_out=jagsfit$BUGSoutput$mean$mu,

sigma_out=jagsfit$BUGSoutput$mean$sigma)

}

### Simple imputation

L0sensnorm <- function(yy,limit) {

yy[is.na(yy)] <- 0

data.frame(system=’as 0′,mu_out=mean(yy),sigma_out=sd(yy))

}

Lhsensnorm <- function(yy,limit) {

yy[is.na(yy)] <- limit/2

data.frame(system=’as half LOQ’,mu_out=mean(yy),sigma_out=sd(yy))

}

Llsensnorm <- function(yy,limit) {

yy[is.na(yy)] <- limit

data.frame(system=’as LOQ’,mu_out=mean(yy),sigma_out=sd(yy))

}

LNAsensnorm <- function(yy,limit) {

yy <- yy[!is.na(yy)]

data.frame(system=’as missing’,mu_out=mean(yy),sigma_out=sd(yy))

}

### Simulations

limitest <- function(yy,limit) {

yy[yy<limit] <- NA

do.call(rbind,list(

Bsensnorm(yy,limit),

BsensInf(yy,limit),

L0sensnorm(yy,limit),

Lhsensnorm(yy,limit),

Llsensnorm(yy,limit),

LNAsensnorm(yy,limit)))

}

simtest <- function(n,muin,sdin,limit) {

numnotna<- 0

while(numnotna<n/2) {

yy <- rnorm(n,muin,sdin)

numnotna <- sum(yy>limit)

}

ll <- limitest(yy,limit)

ll$n_na <- sum(yy<limit)

ll

}

datain <- expand.grid(n=c(5,6,8,10,12,15,20,50,80),muin=seq(1,3.5,.05),sdin=1,limit=1)

dataout <- mdply(datain,simtest)

### Results for means

After all simulations, a difference is calculated between the simulation mean and the estimated means. This made interpreting the plots much more easy.

dataout$dif <- dataout$mu_out-dataout$muin

#### Uninformed prior did not perform with very little data

ggplot(dataout[dataout$n %in% c(5,6,8) & dataout$system %in% c(‘Bayes Uninf’,’Bayes Inf’),],aes(x=muin,y=dif)) + #group=Source,

geom_smooth(method=’loess’) +

geom_point() +

facet_grid(system ~n , drop=TRUE) +

scale_x_continuous(‘simulation mu’) +

scale_y_continuous(‘difference observed and simulation mu’)

#### At low number of observations variation of the mean swamps bias.

ggplot(dataout[dataout$n %in% c(5,6,8) & dataout$system!=’Bayes Uninf’,],aes(x=muin,y=dif)) + #group=Source,

geom_smooth(method=’loess’) +

geom_point() +

facet_grid(system ~n , drop=TRUE) +

scale_x_continuous(‘simulation mu’) +

scale_y_continuous(‘difference observed and simulation mu’)

#### At intermediate number of observations method choice gets more important

ggplot(dataout[dataout$n %in% c(10,12,15) & dataout$system!=’Bayes xUninf’,],aes(x=muin,y=dif)) + #group=Source,

geom_smooth(method=’loess’) +

geom_point() +

facet_grid(system ~n , drop=TRUE) +

scale_x_continuous(‘simulation mu’) +

scale_y_continuous(‘difference observed and simulation mu’)

#### At large number of observations bias gets far too large

At this point the missing option is not plotted any more, it is just not competitive. Setting at LLOQ gets a big positive bias, while 0 gets a negative bias, especially for 80 observations. There is no point in setting up an informed prior. Half LLOQ seems least biased out of the simple methods.

geom_smooth(method=’loess’) +

geom_point() +

facet_grid(system ~n , drop=TRUE) +

scale_x_continuous(‘simulation mu’) +

scale_y_continuous(‘difference observed and simulation mu’)

### Standard deviation is negative biased for simple imputation

ggplot(dataout[dataout$n%in% c(5,6,8) & dataout$system!=’Bayes Uninf’,],aes(x=muin,y=sigma_out)) + #group=Source,

geom_smooth(method=’loess’) +

geom_point() +

facet_grid(system ~n , drop=TRUE) +

scale_x_continuous(‘simulation mu’) +

scale_y_continuous(‘estimated sigma’)

ggplot(dataout[dataout$n%in% c(10,12,15) & dataout$system!=’Bayes Uninf’,],aes(x=muin,y=sigma_out)) + #group=Source,

geom_smooth(method=’loess’) +

geom_point() +

facet_grid(system ~n , drop=TRUE) +

scale_x_continuous(‘simulation mu’) +

scale_y_continuous(‘estimated sigma’)

With a large number of observations half LLOQ results in a negative bias on the standard deviation.

ggplot(dataout[dataout$n %in% c(20,50,80) & dataout$system!=’as missing’,],aes(x=muin,y=sigma_out)) + #group=Source,

geom_smooth(method=’loess’) +

geom_point() +

facet_grid(system ~n , drop=TRUE) +

scale_x_continuous(‘simulation mu’) +

scale_y_continuous(‘estimated sigma’)

### Conclusion

Finally, setting data to <LLOQ is something that has implications. As a statistician, I think using <LLOQ to display data in listings and tables is a good thing. However, when subsequent calculations by statisticians are needed, they may cause more trouble than they resolve. I have seen simple PCA go down the drain by <LLOQ, which is, truth to say, a shame given the effort in getting data.

**leave a comment**for the author, please follow the link and comment on his blog:

**Wiekvoet**.

R-bloggers.com offers

**daily e-mail updates**about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...