Bias in high dimensional optimism corrected bootstrap procedure

[This article was first published on R – intobioinformatics, and kindly contributed to R-bloggers].

I have been working on high dimensional analysis to predict drug response in rheumatoid arthritis patients, and I was concerned to find that the procedure called optimism corrected bootstrapping gives increasingly over-optimistic results as p (the number of features) increases. Optimism corrected bootstrapping tries to estimate how much a model's apparent performance is inflated by overfitting. The model is refitted on bootstrap resamples of the data; each refitted model is tested on its own bootstrap sample and on the original data, and the average difference between the two is the estimate of the optimism. This estimate is then subtracted from the apparent performance of the model fitted to, and tested on, the whole dataset. The statistical procedure is described here: http://thestatsgeek.com/2014/10/04/adjusting-for-optimismoverfitting-in-measures-of-predictive-ability-using-bootstrapping/. For this blog post, I am going to stick to using R code to show the problem I have found with this method.
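To make the steps concrete, here is a minimal base R sketch of the procedure as described above. It is not caret's internal implementation: it assumes a plain logistic regression (glm) instead of glmnet, uses AUC as the performance measure, and the helper names `auc` and `optimism_corrected_auc` as well as the toy data are made up for illustration.

```r
# AUC via the rank-sum (Mann-Whitney) identity; y is 0/1, score is numeric.
auc <- function(y, score) {
  r <- rank(score)
  n_pos <- sum(y == 1)
  n_neg <- sum(y == 0)
  (sum(r[y == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Hypothetical helper: optimism corrected AUC for a logistic regression.
# dat must contain a 0/1 outcome column called y; all other columns are predictors.
optimism_corrected_auc <- function(dat, n_boot = 200) {
  # Apparent performance: fit on the whole dataset and test on the same data
  fit_full <- glm(y ~ ., data = dat, family = binomial)
  apparent <- auc(dat$y, predict(fit_full, newdata = dat))

  optimism <- replicate(n_boot, {
    idx   <- sample(nrow(dat), replace = TRUE)
    boot  <- dat[idx, , drop = FALSE]
    fit_b <- glm(y ~ ., data = boot, family = binomial)
    auc_boot <- auc(boot$y, predict(fit_b, newdata = boot))  # tested on itself
    auc_orig <- auc(dat$y, predict(fit_b, newdata = dat))    # tested on original data
    auc_boot - auc_orig
  })

  c(apparent  = apparent,
    optimism  = mean(optimism),
    corrected = apparent - mean(optimism))
}

# Toy data (an assumption for this sketch): two pure-noise predictors
set.seed(1)
n <- 100
dat <- data.frame(y = rbinom(n, 1, 0.5), x1 = rnorm(n), x2 = rnorm(n))
res <- optimism_corrected_auc(dat)
print(res)
```

The corrected value is simply the apparent AUC minus the average optimism over the bootstrap fits, so any bias in the optimism estimate carries straight through to the final number.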

Let’s see a caret example of this problem using the iris dataset. I am just going to add a little Gaussian noise to reduce the predictive power of the variables.

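library(caret)
# Add Gaussian noise (mean 0, sd 1) to every row of each predictor
n <- nrow(iris)
iris$Sepal.Length <- iris$Sepal.Length + rnorm(n, mean = 0, sd = 1)
iris$Sepal.Width <- iris$Sepal.Width + rnorm(n, mean = 0, sd = 1)
iris$Petal.Length <- iris$Petal.Length + rnorm(n, mean = 0, sd = 1)
iris$Petal.Width <- iris$Petal.Width + rnorm(n, mean = 0, sd = 1)
# Keep two species so this is a binary problem, then drop the unused factor level
iris <- subset(iris, Species == 'versicolor' | Species == 'virginica')
iris$Species <- droplevels(iris$Species)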

So now let's run a glmnet model predicting Species with either cross validation (CV) or optimism corrected bootstrapping.

ctrl <- trainControl(method = 'cv',
                     summaryFunction = twoClassSummary,
                     classProbs = TRUE,
                     savePredictions = TRUE,
                     verboseIter = TRUE)
fit1 <- train(Species ~ ., data = iris,
              method = "glmnet", # preProc=c("center", "scale")
              trControl = ctrl, metric = "ROC")
print(fit1)

ROC
0.800
0.800
0.800
0.800
0.800
0.788
0.800
0.796
0.780

ctrl <- trainControl(method = 'optimism_boot',
                     summaryFunction = twoClassSummary,
                     classProbs = TRUE,
                     savePredictions = TRUE,
                     verboseIter = TRUE)
fit2 <- train(Species ~ ., data = iris,
              method = "glmnet", # preProc=c("center", "scale")
              trControl = ctrl, metric = "ROC")
print(fit2)

ROC
0.8072208
0.8075397
0.8073676
0.8070116
0.8069301
0.8071842
0.8062909
0.8052974
0.8038033

So they are about the same in low dimensions. However, if we replace all the variables of iris with many features that are just noise, a different picture emerges.

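# 100 samples x 1000 features of pure Gaussian noise
test <- matrix(rnorm(100*1000, mean = 0, sd = 1),
               nrow = 100, ncol = 1000, byrow = TRUE)
iris <- cbind(iris,test)
# Drop the four original measurements, keeping Species and the noise features
iris <- iris[,-1:-4]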

Now cross validation can be run, and we can see, correctly, that the data has essentially no predictive power.

ctrl <- trainControl(method = 'cv',
                     summaryFunction = twoClassSummary,
                     classProbs = TRUE,
                     savePredictions = TRUE,
                     verboseIter = TRUE)
fit3 <- train(Species ~ ., data = iris,
              method = "glmnet", # preProc=c("center", "scale")
              trControl = ctrl, metric = "ROC")
print(fit3)

ROC
0.484
0.488
0.480
0.460
0.436
0.400
0.400
0.416
0.332

And now the optimism corrected bootstrap can be run. A very positive result is obtained, even with just 1000 noise variables:

ctrl <- trainControl(method = 'optimism_boot',
                     summaryFunction = twoClassSummary,
                     classProbs = TRUE,
                     savePredictions = TRUE,
                     verboseIter = TRUE)
fit4 <- train(Species ~ ., data = iris,
              method = "glmnet", # preProc=c("center", "scale")
              trControl = ctrl, metric = "ROC")
print(fit4)

ROC
0.9361760
0.9361600
0.9345920
0.9200960
0.9123520
0.8929600
0.8940960
0.8886560
0.8356055

So, I think the main message of these results is to avoid the optimism corrected bootstrap, especially in higher dimensions. This effect also occurs with lower dimensionality than shown here, e.g. 100 variables. I tend to use LOOCV on smaller datasets and repeated CV on larger datasets. Simple bootstrapping, in my opinion, is also more suitable than optimism corrected bootstrapping, which unfortunately appears to be a disaster of a method.
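For reference, the alternatives mentioned above can be requested in caret by changing the method argument of trainControl. The fold, repeat and resample counts below are illustrative choices, not values used anywhere in this post.

```r
library(caret)

# Leave-one-out CV, for smaller datasets
ctrl_loocv <- trainControl(method = "LOOCV",
                           summaryFunction = twoClassSummary,
                           classProbs = TRUE)

# Repeated k-fold CV, for larger datasets (10 folds x 5 repeats is an assumption)
ctrl_repcv <- trainControl(method = "repeatedcv",
                           number = 10, repeats = 5,
                           summaryFunction = twoClassSummary,
                           classProbs = TRUE)

# Simple bootstrap (100 resamples is an assumption)
ctrl_boot <- trainControl(method = "boot",
                          number = 100,
                          summaryFunction = twoClassSummary,
                          classProbs = TRUE)
```

Any of these objects can be passed as trControl to the train calls shown earlier in place of the optimism_boot control.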
