Predicting Titanic deaths on Kaggle VI: Stan

September 19, 2015
By

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

It is a bit a contradiction. Kaggle provides competitions on data science, while Stan is clearly part of the (Bayesian) statistics. Yet after using random forests, boosting and bagging, I also think this problem has a suitable size for Stan, which I understand can handle larger problems than older Bayesian software such as JAGS.

What I aim to do is enter a load of variables in the Stan model. Aliasing will be ignored, and I hope the hierarchical model will provide suitable shrinkage for terms which are not relevant.

Data

The data has been mildly adapted from previously. Biggest change is that I have decided to make age into a factor, based on the value when present, with a special level for missing. The alternative in context of a Bayesian model would be to make fitting age part of the model, but that seems more complex than I am willing to go at this point.

Model

The idea of the model is pretty simple. it will be logistic regression, with only factors in the independent variables. All levels in all factors are used. So when Sex is entered in the model, it will add two parameter values, one for male, one for female (see code below, variable fsex). When passenger class is entered in the model, it has three values (variabe fpclass). Upon the parameter values there is a prior distribution. I have chosen a normal prior distribution, around zero with standard deviation sdsex and sdpclass respectively. These have a common prior (sd1). I have used both half normal with standard deviation 3 as below and uniform (0,3) for the prior.
parameters {
real fsex[2];
real intercept;
real fpclass[3];
real sdsex;
real sdpclass;
real sd1;
}
transformed parameters {
real expect[ntrain];
for (i in 1:ntrain) {
expect[i] <- inv_logit(
intercept+
fsex[sex[i]]+
fpclass[pclass[i]]
);
}
}
model {
fsex ~ normal(0,sdsex);
fpclass ~ normal(0,sdpclass);

sdsex ~  normal(0,sd1);
sdpclass ~ normal(0,sd1);
sd1 ~ normal(0,3);
intercept ~ normal(0,1);
survived ~ bernoulli(expect);
}

Predictions

In this phase I have decided to make the predictions within the Stan program. The way which seemed to work is to duplicate all independent variables and do the predictions in the generated quantities section of the program. These predictions are actually probabilities of survival for each passenger for each MCMC sample. Hence a bit of post processing is used, the mean probability is calculated and a cut off of 0.5 is used to decide the final classification.
Since the Stan code runs pretty quickly after it has been compiled it is feasible to run this as a two stage process. A first run to examine the model parameters. A second run just to obtain the predictions.

Model 1

This is the parameter output of a model with just age and passenger class. It has been added here mostly to show a simple full coded example what the code looks like. With sdsex bigger than sdpclass it follows that sex had a bigger influence than class on the chance of survival.
Inference for Stan model: 8fb625a6ccf29aab919e1dcd494247aa.
4 chains, each with iter=1000; warmup=500; thin=1;
post-warmup draws per chain=500, total post-warmup draws=2000.

mean se_mean   sd    2.5%     25%     50%     75%   97.5% n_eff Rhat
intercept    -0.07    0.03 0.83   -1.68   -0.65   -0.07    0.49    1.61   670 1.00
fsex[1]       1.40    0.04 0.89   -0.46    0.83    1.40    1.98    3.07   533 1.01
fsex[2]      -1.24    0.04 0.89   -3.11   -1.81   -1.23   -0.65    0.45   529 1.01
fpclass[1]    0.94    0.03 0.70   -0.40    0.50    0.93    1.33    2.34   536 1.01
fpclass[2]    0.12    0.03 0.69   -1.24   -0.31    0.11    0.52    1.55   527 1.01
fpclass[3]   -0.93    0.03 0.69   -2.25   -1.35   -0.94   -0.50    0.47   499 1.01
sdsex         2.06    0.05 1.20    0.76    1.23    1.74    2.49    5.36   648 1.00
sdpclass      1.40    0.04 0.90    0.50    0.84    1.15    1.67    3.86   661 1.00
sd1           2.41    0.04 1.27    0.71    1.45    2.18    3.11    5.48  1071 1.00
lp__       -421.36    0.09 2.12 -426.25 -422.56 -421.00 -419.82 -418.19   616 1.00

Samples were drawn using NUTS(diag_e) at Sun Sep 13 12:28:49 2015.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at
convergence, Rhat=1).

Model 2

Model 2 has all linear effects in it. Predictions using this model were submitted, it gave a score of 0.78947 which was not an improvement over previous scores. I have one submission using bagging which did better, got same result with boosting and worse results with random forest.

Note that while this model looks pretty complex, this score was obtained without any interactions.
4 chains, each with iter=1000; warmup=500; thin=1;
post-warmup draws per chain=500, total post-warmup draws=2000.

mean se_mean    sd    2.5%     25%     50%     75%   97.5% n_eff Rhat
intercept       0.28    0.05  0.74   -1.20   -0.16    0.28    0.77    1.68   229 1.01
sd1             0.80    0.03  0.23    0.43    0.63    0.77    0.97    1.31    63 1.05
fsex[1]         0.24    0.03  0.56   -0.64   -0.01    0.13    0.39    1.85   353 1.01
fsex[2]        -0.05    0.04  0.50   -1.29   -0.24    0.00    0.22    0.89   131 1.02
fpclass[1]      0.54    0.07  0.57   -0.39    0.15    0.53    0.90    1.77    60 1.03
fpclass[2]      0.15    0.06  0.50   -0.68   -0.27    0.16    0.50    1.15    76 1.01
fpclass[3]     -0.78    0.05  0.48   -1.66   -1.10   -0.79   -0.47    0.18   102 1.01
fembarked[1]    0.26    0.02  0.33   -0.29    0.06    0.22    0.43    1.02   298 1.01
fembarked[2]    0.08    0.02  0.32   -0.55   -0.07    0.06    0.20    0.78   456 1.00
fembarked[3]   -0.18    0.02  0.31   -0.80   -0.33   -0.18   -0.01    0.45   378 1.01
foe[1]         -0.32    0.08  0.53   -1.63   -0.65   -0.28    0.04    0.54    41 1.07
foe[2]          0.04    0.02  0.43   -0.73   -0.11    0.01    0.19    1.02   433 1.01
foe[3]          0.34    0.03  0.48   -0.41    0.04    0.26    0.57    1.54   227 1.03
fcabchar[1]    -0.25    0.18  0.46   -1.16   -0.49   -0.13    0.03    0.45     7 1.17
fcabchar[2]    -0.04    0.03  0.35   -0.86   -0.16   -0.04    0.10    0.62   143 1.03
fcabchar[3]     0.07    0.01  0.27   -0.47   -0.06    0.05    0.18    0.75   325 1.03
fcabchar[4]    -0.09    0.04  0.31   -0.85   -0.23   -0.03    0.08    0.48    55 1.05
fcabchar[5]     0.22    0.02  0.33   -0.32    0.00    0.17    0.40    1.03   219 1.02
fcabchar[6]     0.31    0.04  0.36   -0.25    0.02    0.25    0.59    1.12    76 1.03
fcabchar[7]    -0.21    0.06  0.40   -1.03   -0.43   -0.11    0.04    0.45    44 1.05
fage[1]         0.14    0.06  0.32   -0.38   -0.04    0.04    0.28    0.78    27 1.10
fage[2]         0.45    0.14  0.59   -0.12    0.02    0.17    0.77    1.68    19 1.17
fage[3]         0.10    0.15  0.53   -0.74   -0.15   -0.01    0.12    1.37    13 1.24
fage[4]        -0.10    0.01  0.31   -0.86   -0.25   -0.05    0.03    0.48   558 1.01
fage[5]         0.17    0.10  0.43   -0.52   -0.04    0.04    0.25    1.07    20 1.14
fage[6]         0.00    0.01  0.21   -0.42   -0.09   -0.01    0.09    0.47   250 1.01
fage[7]         0.09    0.03  0.20   -0.32   -0.01    0.05    0.22    0.47    50 1.05
fage[8]        -0.22    0.03  0.31   -1.06   -0.34   -0.15   -0.01    0.17   129 1.03
fage[9]        -0.02    0.07  0.39   -0.98   -0.16   -0.01    0.12    0.60    27 1.09
fage[10]        0.00    0.01  0.19   -0.43   -0.06    0.01    0.06    0.40   813 1.00
fticket[1]      0.07    0.03  0.21   -0.33   -0.06    0.04    0.22    0.51    54 1.05
fticket[2]     -0.04    0.03  0.34   -0.87   -0.17   -0.01    0.17    0.59   108 1.02
fticket[3]     -0.17    0.05  0.43   -1.23   -0.38   -0.09    0.06    0.60    69 1.04
fticket[4]      0.01    0.05  0.31   -0.48   -0.19    0.01    0.19    0.69    40 1.06
fticket[5]      0.02    0.02  0.36   -0.64   -0.13   -0.02    0.15    0.95   210 1.01
fticket[6]      0.03    0.05  0.32   -0.63   -0.15    0.02    0.21    0.55    37 1.07
fticket[7]      0.14    0.02  0.38   -0.40   -0.05    0.05    0.26    1.18   245 1.01
fticket[8]      0.01    0.01  0.35   -0.84   -0.12    0.02    0.20    0.73   929 1.01
fticket[9]      0.01    0.01  0.32   -0.69   -0.12    0.03    0.14    0.65   766 1.00
fticket[10]    -0.23    0.04  0.46   -1.53   -0.39   -0.08    0.03    0.38   149 1.02
fticket[11]    -0.02    0.01  0.32   -0.66   -0.16   -0.04    0.11    0.70  1001 1.01
fticket[12]     0.37    0.05  0.47   -0.16    0.03    0.20    0.57    1.58    82 1.04
fticket[13]    -0.20    0.02  0.34   -1.05   -0.36   -0.14    0.01    0.36   221 1.03
ftitle[1]       1.26    0.08  0.73   -0.03    0.75    1.20    1.74    2.85    79 1.03
ftitle[2]       0.47    0.04  0.70   -1.05    0.13    0.45    0.87    1.79   372 1.01
ftitle[3]      -2.27    0.03  0.66   -3.59   -2.59   -2.34   -1.88   -0.86   486 1.01
ftitle[4]       0.78    0.06  0.73   -0.86    0.35    0.89    1.25    2.10   144 1.01
fsibsp[1]       0.70    0.06  0.51   -0.33    0.36    0.69    1.09    1.69    77 1.04
fsibsp[2]       0.48    0.08  0.53   -0.62    0.14    0.49    0.93    1.46    49 1.06
fsibsp[3]       0.67    0.11  0.65   -0.60    0.23    0.61    1.16    1.85    35 1.07
fsibsp[4]      -1.58    0.04  0.64   -2.89   -2.03   -1.52   -1.16   -0.38   292 1.02
fparch[1]       0.25    0.02  0.33   -0.25    0.03    0.22    0.39    1.02   450 1.01
fparch[2]       0.04    0.01  0.33   -0.53   -0.13    0.00    0.17    0.83   521 1.01
fparch[3]      -0.03    0.06  0.38   -0.64   -0.22   -0.01    0.14    0.78    35 1.07
fparch[4]      -0.47    0.18  0.54   -1.48   -0.81   -0.31   -0.02    0.22     9 1.12
ffare[1]       -0.01    0.01  0.15   -0.39   -0.04    0.00    0.03    0.29   421 1.01
ffare[2]       -0.02    0.01  0.15   -0.41   -0.06    0.00    0.02    0.26   419 1.01
ffare[3]       -0.02    0.01  0.15   -0.35   -0.06    0.00    0.02    0.26   671 1.00
ffare[4]        0.00    0.01  0.16   -0.31   -0.05    0.00    0.04    0.36   586 1.01
ffare[5]        0.08    0.01  0.20   -0.15   -0.01    0.01    0.12    0.66   247 1.02
sdsex           0.51    0.04  0.44    0.03    0.21    0.35    0.68    1.64   123 1.01
sdpclass        0.75    0.04  0.36    0.32    0.48    0.68    0.94    1.59    64 1.04
sdembarked      0.43    0.02  0.32    0.06    0.21    0.39    0.53    1.24   328 1.01
sdoe            0.56    0.05  0.38    0.06    0.32    0.47    0.73    1.51    48 1.05
sdcabchar       0.39    0.03  0.26    0.03    0.18    0.36    0.59    0.97    61 1.04
sdage           0.33    0.06  0.29    0.02    0.09    0.23    0.53    0.89    23 1.15
sdticket        0.35    0.03  0.24    0.04    0.19    0.30    0.45    0.99    68 1.05
sdtitle         1.29    0.02  0.35    0.75    1.07    1.19    1.47    2.15   320 1.01
sdsibsp         1.00    0.02  0.37    0.49    0.74    0.95    1.16    1.95   222 1.01
sdparch         0.47    0.12  0.37    0.01    0.16    0.38    0.73    1.25    10 1.12
sdfare          0.14    0.02  0.17    0.00    0.02    0.09    0.19    0.60    92 1.03
lp__         -342.28    2.74 15.87 -372.33 -353.23 -344.26 -330.95 -310.12    34 1.09

Samples were drawn using NUTS(diag_e) at Sun Aug 30 13:00:19 2015.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at
convergence, Rhat=1).

Code

# preparation and data reading section
library(rstan)
rstan_options(auto_write = TRUE)
options(mc.cores = parallel::detectCores())

train\$status <- ‘train’
test\$status <- ‘test’
test\$Survived <- NA
tt <- rbind(test,train)

# generate variables
tt\$Embarked[tt\$Embarked==”] <- ‘S’
tt\$Embarked <- factor(tt\$Embarked)
tt\$Pclass <- factor(tt\$Pclass)
tt\$Survived <- factor(tt\$Survived)
tt\$age <- tt\$Age
tt\$age[is.na(tt\$age)] <- 999
tt\$age <- cut(tt\$age,c(0,2,5,9,12,15,21,55,65,100,1000))

tt\$Title <- sapply(tt\$Name,function(x) strsplit(as.character(x),'[.,]’)[[1]][2])
tt\$Title <- gsub(‘ ‘,”,tt\$Title)
tt\$Title[tt\$Title==’Dr’ & tt\$Sex==’female’] <- ‘Miss’
tt\$Title[tt\$Title %in% c(‘Capt’,’Col’,’Don’,’Sir’,’Jonkheer’,’Major’,’Rev’,’Dr’)] <- ‘Mr’
tt\$Title <- factor(tt\$Title)
# changed cabin character
tt\$cabchar <- substr(tt\$Cabin,1,1)
tt\$cabchar[tt\$cabchar %in% c(‘F’,’G’,’T’)] <- ‘X’;
tt\$cabchar <- factor(tt\$cabchar)
tt\$ncabin <- nchar(as.character(tt\$Cabin))
tt\$cn <- as.numeric(gsub(‘[[:space:][:alpha:]]’,”,tt\$Cabin))
tt\$oe <- factor(ifelse(!is.na(tt\$cn),tt\$cn%%2,-1))
tt\$Fare[is.na(tt\$Fare)]<- median(tt\$Fare,na.rm=TRUE)
tt\$ticket <- sub(‘[[:digit:]]+\$’,”,tt\$Ticket)
tt\$ticket <- toupper(gsub(‘(\\.)|( )|(/)’,”,tt\$ticket))
tt\$ticket[tt\$ticket %in% c(‘A2′,’A4′,’AQ3′,’AQ4′,’AS’)] <- ‘An’
tt\$ticket[tt\$ticket %in% c(‘SCA3′,’SCA4′,’SCAH’,’SC’,’SCAHBASLE’,’SCOW’)] <- ‘SC’
tt\$ticket[tt\$ticket %in% c(‘CASOTON’,’SOTONO2′,’SOTONOQ’)] <- ‘SOTON’
tt\$ticket[tt\$ticket %in% c(‘STONO2′,’STONOQ’)] <- ‘STON’
tt\$ticket[tt\$ticket %in% c(‘C’)] <- ‘CA’
tt\$ticket[tt\$ticket %in% c(‘SOC’,’SOP’,’SOPP’)] <- ‘SOP’
tt\$ticket[tt\$ticket %in% c(‘SWPP’,’WC’,’WEP’)] <- ‘W’
tt\$ticket[tt\$ticket %in% c(‘FA’,’FC’,’FCC’)] <- ‘F’
tt\$ticket[tt\$ticket %in% c(‘PP’,’PPP’,’LINE’,’LP’,’SP’)] <- ‘PPPP’
tt\$ticket <- factor(tt\$ticket)
tt\$fare <- cut(tt\$Fare,breaks=c(min(tt\$Fare)-1,quantile(tt\$Fare,seq(.2,.8,.2)),max(tt\$Fare)+1))

train <- tt[tt\$status==’train’,]
test <- tt[tt\$status==’test’,]

#end of preparation and data reading

options(width=90)

First model

datain <- list(

survived = c(0,1)[train\$Survived],

ntrain = nrow(train),

ntest=nrow(test),
sex=c(1:2)[train\$Sex],
psex=c(1:2)[test\$Sex],
pclass=c(1:3)[train\$Pclass],
ppclass=c(1:3)[test\$Pclass]
)

parameters=c(‘intercept’,’fsex’,’fpclass’,’sdsex’,’sdpclass’,’sd1′)
my_code <- ‘
data {
int ntrain;
int ntest;
int survived[ntrain];
int sex[ntrain];
int psex[ntest];
int pclass[ntrain];
int ppclass[ntest];

}
parameters {
real fsex[2];
real intercept;
real fpclass[3];
real sdsex;
real sdpclass;
real sd1;
}
transformed parameters {
real expect[ntrain];
for (i in 1:ntrain) {
expect[i] <- inv_logit(
intercept+
fsex[sex[i]]+
fpclass[pclass[i]]
);
}

}
model {
fsex ~ normal(0,sdsex);
fpclass ~ normal(0,sdpclass);

sdsex ~  normal(0,sd1);
sdpclass ~ normal(0,sd1);
sd1 ~ normal(0,3);
intercept ~ normal(0,1);
survived ~ bernoulli(expect);
}
generated quantities {
real pred[ntest];
for (i in 1:ntest) {
pred[i] <- inv_logit(
intercept+
fsex[psex[i]]+
fpclass[ppclass[i]]

);
}
}
‘

fit1 <- stan(model_code = my_code,
data = datain,
pars=parameters,
iter = 1000,
chains = 4,
open_progress=FALSE)

fit1

Second model

datain <- list(
survived = c(0,1)[train\$Survived],
ntrain = nrow(train),
ntest=nrow(test),
sex=c(1:2)[train\$Sex],
psex=c(1:2)[test\$Sex],
pclass=c(1:3)[train\$Pclass],
ppclass=c(1:3)[test\$Pclass],
embarked=c(1:3)[train\$Embarked],
pembarked=c(1:3)[test\$Embarked],
oe=c(1:3)[train\$oe],
poe=c(1:3)[test\$oe],
cabchar=c(1:7)[train\$cabchar],
pcabchar=c(1:7)[test\$cabchar],
age=c(1:10)[train\$age],
page=c(1:10)[test\$age],
ticket=c(1:13)[train\$ticket],
pticket=c(1:13)[test\$ticket],
title=c(1:4)[train\$Title],
ptitle=c(1:4)[test\$Title],
sibsp=c(1:4,rep(4,6))[train\$SibSp+1],
psibsp=c(1:4,rep(4,6))[test\$SibSp+1],
parch=c(1:4,rep(4,6))[train\$Parch+1],
pparch=c(1:4,rep(4,6))[test\$Parch+1],
fare=c(1:5)[train\$fare],
pfare=c(1:5)[test\$fare]
)

parameters=c(‘intercept’,’sd1′,
‘fsex’,’fpclass’,’fembarked’,
‘foe’,’fcabchar’,’fage’,
‘fticket’,’ftitle’,
‘fsibsp’,’fparch’,
‘ffare’,
‘sdsex’,’sdpclass’,’sdembarked’,
‘sdoe’,’sdcabchar’,’sdage’,
‘sdticket’,’sdtitle’,
‘sdsibsp’,’sdparch’,
‘sdfare’)
my_code <- ‘
data {
int ntrain;
int ntest;
int survived[ntrain];
int sex[ntrain];
int psex[ntest];
int pclass[ntrain];
int ppclass[ntest];
int embarked[ntrain];
int pembarked[ntest];
int oe[ntrain];
int poe[ntest];
int cabchar[ntrain];
int pcabchar[ntest];
int age[ntrain];
int page[ntest];
int ticket[ntrain];
int pticket[ntest];
int title[ntrain];
int ptitle[ntest];
int sibsp[ntrain];
int psibsp[ntest];
int parch[ntrain];
int pparch[ntest];
int fare[ntrain];
int pfare[ntest];

}
parameters {
real fsex[2];
real intercept;
real fpclass[3];
real fembarked[3];
real foe[3];
real fcabchar[7];
real fage[10];
real fticket[13];
real ftitle[4];
real fparch[4];
real fsibsp[4];
real ffare[5];
real sdsex;
real sdpclass;
real sdembarked;
real sdoe;
real sdcabchar;
real sdage;
real sdticket;
real sdtitle;
real sdparch;
real sdsibsp;
real sdfare;

real sd1;
}
transformed parameters {
real expect[ntrain];
for (i in 1:ntrain) {
expect[i] <- inv_logit(
intercept+
fsex[sex[i]]+
fpclass[pclass[i]]+
fembarked[embarked[i]]+
foe[oe[i]]+
fcabchar[cabchar[i]]+
fage[age[i]]+
fticket[ticket[i]]+
ftitle[title[i]]+
fsibsp[sibsp[i]]+
fparch[parch[i]]+
ffare[fare[i]]
);
}

}
model {
fsex ~ normal(0,sdsex);
fpclass ~ normal(0,sdpclass);
fembarked ~ normal(0,sdembarked);
foe ~ normal(0,sdoe);
fcabchar ~ normal(0,sdcabchar);
fage ~ normal(0,sdage);
fticket ~ normal(0,sdticket);
ftitle ~ normal(0,sdtitle);
fsibsp ~ normal(0,sdsibsp);
fparch ~ normal(0,sdparch);
ffare ~ normal(0,sdfare);

sdsex ~  normal(0,sd1);
sdpclass ~ normal(0,sd1);
sdembarked ~ normal(0,sd1);
sdoe ~ normal(0,sd1);
sdcabchar ~ normal(0,sd1);
sdage ~ normal(0,sd1);
sdticket ~ normal(0,sd1);
sdtitle ~ normal(0,sd1);
sdsibsp ~ normal(0,sd1);
sdparch ~ normal(0,sd1);
sdfare ~ normal(0,sd1);
sd1 ~ normal(0,1);
intercept ~ normal(0,1);

survived ~ bernoulli(expect);
}
generated quantities {
real pred[ntest];
for (i in 1:ntest) {
pred[i] <- inv_logit(
intercept+
fsex[psex[i]]+
fpclass[ppclass[i]]+
fembarked[pembarked[i]]+
foe[poe[i]]+
fcabchar[pcabchar[i]]+
fage[page[i]]+
fticket[pticket[i]]+
ftitle[ptitle[i]]+
fsibsp[psibsp[i]]+
fparch[pparch[i]]+
ffare[pfare[i]]

);
}
}
‘

fit1 <- stan(model_code = my_code,
data = datain,
pars=parameters,
iter = 1000,
chains = 4,
open_progress=FALSE)

fit1
fit2 <- stan(model_code = my_code,
data = datain,
fit=fit1,
pars=c(‘pred’),
iter = 2000,
warmup =200,
chains = 4,
open_progress=FALSE)

fit3 <- as.matrix(fit2)[,-419]
#plots of individual passengers
#plot(density(fit3[,1]))

#plot(density(fit3[,18]))
decide1 <- apply(fit3,2,function(x) mean(x)>.5)
decide2 <- apply(fit3,2,function(x) median(x)>.5)
#table(decide1,decide2)

out <- data.frame(
PassengerId=test\$PassengerId,
Survived=as.numeric(decide1),
row.names=NULL)
write.csv(x=out,
file=’stanlin.csv’,
row.names=FALSE,
quote=FALSE)

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.