Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

This is the third post on prediction the deaths. The first one used randomforest, the second boosting (gbm). The aim of the third post was to use bagging. In contrast to the former posts I abandoned dplyr in this post. It gave some now you see now you don’t errors.

### Data

The data is supposed to be the same as previous.
library(ipred)
library(rpart)
library(lattice)

train$status <- 'train' test <- read.csv('test.csv') test$status <- 'test'
test$Survived <- NA tt <- rbind(test,train) # generate variables tt$Pclass <- factor(tt$Pclass) tt$Survived <- factor(tt$Survived) tt$age <- tt$Age tt$age[is.na(tt$age)] <- 35 tt$age <- cut(tt$age,c(0,2,5,9,12,15,21,55,65,100)) tt$Title <- sapply(tt$Name,function(x) strsplit(as.character(x),'[.,]')[[1]][2]) tt$Title <- gsub(' ','',tt$Title) tt$Title[tt$Title %in% c(‘Capt’,’Col’,’Don’,’Sir’,’Jonkheer’,’Major’)] <- 'Mr' tt$Title[tt$Title %in% c(‘Lady’,’Ms’,’theCountess’,’Mlle’,’Mme’,’Ms’,’Dona’)] <- 'Miss' tt$Title <- factor(tt$Title) tt$A <- factor(grepl('A',tt$Cabin)) tt$B <- factor(grepl('B',tt$Cabin)) tt$C <- factor(grepl('C',tt$Cabin)) tt$D <- factor(grepl('D',tt$Cabin)) tt$E <- factor(grepl('E',tt$Cabin)) tt$F <- factor(grepl('F',tt$Cabin)) tt$ncabin <- nchar(as.character(tt$Cabin)) tt$PC <- factor(grepl('PC',tt$Ticket)) tt$STON <- factor(grepl('STON',tt$Ticket)) tt$cn <- as.numeric(gsub('[[:space:][:alpha:]]','',tt$Cabin)) tt$oe <- factor(ifelse(!is.na(tt$cn),tt$cn%%2,-1))
tt$Fare[is.na(tt$Fare)]<- median(tt$Fare,na.rm=TRUE) ### Age The first step is again to predict the missing ages. Even though we have I have all data available in one data.frame, I still think the correct approach is to create the age model using only the training data. Note that I am not too impressed with the age model. Perhaps this should also be optimized. forage <- tt[!is.na(tt$Age) & tt$status=='train',names(tt) %in% c(‘Age’,’Sex’,’Pclass’,’SibSP’, ‘Parch’,’Fare’,’Title’,’Embarked’,’A’,’B’,’C’,’D’,’E’,’F’, ‘ncabin’,’PC’,’STON’,’oe’)] ipbag1 <- bagging(Age ~.,data=forage) ipbag1 Bagging regression trees with 25 bootstrap replications Call: bagging.data.frame(formula = Age ~ ., data = forage) plot(tt$Age~predict(ipbag1,tt))
tt$AGE <- tt$Age
tt$AGE[is.na(tt$AGE)] <- predict(ipbag1,tt[is.na(tt$AGE),]) ### Selecting the survival model ipred, the package in which bagging resides, comes with a nice general purpose cross validation utility. In the end, I decided the two parameters to be optimized are ns; the size of the bags and minsplit: the minimum number of observations that must exist in a node in order for a split to be attempted. Nbagg, the number of bootstrap evaluations, just needs to be big enough. Regarding nbagg, I have the feeling that this particular problem, with relatively few records, it may be needed to have relatively high nbagg in order to have reproducible models. di1 <- subset(titanic,select=c( age,SibSp,Parch,Fare,Sex,Pclass, Title,Embarked,A,B,C,D,E,F,ncabin,PC,STON,oe,AGE,Survived)) dso <- expand.grid(ns=seq(100,300,25),nbagg=c(500),minsplit=1:6) la <- lapply(1:nrow(dso),function(ii) { ee <- errorest(Survived ~ ., ns=dso$ns[ii],
control=rpart.control(minsplit=dso$minsplit[ii], cp=0, xval=0,maxsurrogate=0), nbagg=dso$nbagg[ii],
model=bagging,
data=di1,
est.para=control.errorest(k=20)
)
cc <- c(ns=dso$ns[ii],minsplit=dso$minsplit[ii],nbagg=dso$nbagg[ii],error=ee$error)
print(cc)
cc
})
las <- do.call(rbind,la)
las <- as.data.frame(las)
xyplot(error ~ ns, groups= minsplit, data=las,auto.key=TRUE,type=’l’)

#### Predictions

Based on the plot I have chosen for ns=275 and minsplit=5. But, to be honest, in a previous run I had chosen ns=150 and minsplit=2. Obviously from this plot a silly choice. But, given the high variability in this plot between parameters which are relatively similar and the totally different result, I actually think there is relatively much noise in the validation. Thus what is actually seen is that there is relatively little difference between the settings.
Having said that, these new settings got me just over 0.8 in the Kaggle score, while the previous settings were just below.
bagmod <- bagging(Survived ~.,ns=275,nbagg=500,
control=rpart.control(minsplit=5, cp=0, xval=0,maxsurrogate=0),
data=di1)

pp <- predict(bagmod,test)

out <- data.frame(
PassengerId=test\$PassengerId,
Survived=pp,row.names=NULL)
write.csv(x=out,
file=’bag8aug.csv’,
row.names=FALSE,
quote=FALSE)