Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

In my last post I showed how one can easily summarize the outcome of a logistic regression. Here I want to show how this really depends on the data-points that are used to estimate the model. Taking a cue from the evolution of a correlation I have plotted the estimated Odds Ratios (ORs) depending on the number of included participants. The result is bad news for those working with small (< 750 participants) data-sets. “eval_reg” Function to estimate model parameters for subsets of data

eval_reg<-function(model){
mod<-model
dat<-mod\$data[sample(nrow(mod\$data)),]
vars<-names(coef(mod))
est<-data.frame(matrix(nrow=nrow(dat), ncol=length(vars)))
pb <- txtProgressBar(min = 50, max = nrow(dat), style = 3)

for(i in 50:nrow(dat)){
try(boot_mod<-update(mod, data=dat[1:i,]))
try(est[i,]<-exp(coef(boot_mod)))
setTxtProgressBar(pb, i)
}
est\$mod_nr<-1:length(dat[,1])
names(est)<-c(vars, ‘mod_nr’)
return(est)
}

As I randomized the order of data you can run it again and again to arrive at an even deeper mistrust as some of the resulting permutations will look like they stabilize earlier. On the balance you need to set the random-number seed to make it reproducible.

Run and plot the development

set.seed(29012001)

mod_eval<-eval_reg(gp_mod)

tmp<-melt(mod_eval,id=’mod_nr’)
tmp2<-tmp[tmp\$variable!='(Intercept)',]

ticks<-c(seq(.1, 1, by =.1), seq(0, 10, by =1), seq(10, 100, by =10))

ggplot(tmp2, aes(y=value, x = mod_nr, color = variable)) +
geom_line() +
geom_hline(y=1, linetype=2) +
labs(title = ‘Evolution of logistic regression’, y = ‘OR’, x = ‘number of participants’) +
scale_y_log10(breaks=ticks, labels = as.character(ticks)) +
theme_bw()

Update 29-01-2013:

I added my definition of the ticks on the log-scale. The packages needed are ggplot2 and reshape.