Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

## Introduction

Decision tree1 is a model that recursively splits the input space into regions and defines local model for each resulted region. However, fitting decision tree model to complex data would not yield to accurate prediction in most cases, which can be termed as weak learner. But combining multiple decision trees together (called also ensemble models) using techniques such as aggregating and boosting can largely improve the model accuracy. Xgboost (short for Extreme gradient boosting) model is a tree-based algorithm that uses these types of techniques. It can be used for both classification and regression. In this paper we learn how to implement this model to predict the well known titanic data as we did in the previous papers using different kind of models.

## Data preparation

First we start by calling the packages needed and the titanic data

suppressPackageStartupMessages(library(tidyverse))
suppressPackageStartupMessages(library(caret))
## Parsed with column specification:
## cols(
##   PassengerId = col_double(),
##   Survived = col_double(),
##   Pclass = col_double(),
##   Name = col_character(),
##   Sex = col_character(),
##   Age = col_double(),
##   SibSp = col_double(),
##   Parch = col_double(),
##   Ticket = col_character(),
##   Fare = col_double(),
##   Cabin = col_character(),
##   Embarked = col_character()
## )

Let’s take a look at this data using the dplyr function glimpse.

glimpse(data)
## Observations: 891
## Variables: 12
## $PassengerId 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, ... ##$ Survived     0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0...
## $Pclass 3, 1, 3, 1, 3, 3, 1, 3, 3, 2, 3, 1, 3, 3, 3, 2, 3, 2, 3... ##$ Name         "Braund, Mr. Owen Harris", "Cumings, Mrs. John Bradley ...
## $Sex "male", "female", "female", "female", "male", "male", "... ##$ Age          22, 38, 26, 35, 35, NA, 54, 2, 27, 14, 4, 58, 20, 39, 1...
## $SibSp 1, 1, 0, 1, 0, 0, 0, 3, 0, 1, 1, 0, 0, 1, 0, 0, 4, 0, 1... ##$ Parch        0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 1, 0, 0, 5, 0, 0, 1, 0, 0...
## $Ticket "A/5 21171", "PC 17599", "STON/O2. 3101282", "113803", ... ##$ Fare         7.2500, 71.2833, 7.9250, 53.1000, 8.0500, 8.4583, 51.86...
## $Cabin NA, "C85", NA, "C123", NA, NA, "E46", NA, NA, NA, "G6",... ##$ Embarked     "S", "C", "S", "S", "S", "Q", "S", "S", "S", "C", "S", ...

For prediction purposes some variables should be removed such as PassengerId, Name, Ticket, and Cabin. While some others should be converted to another suitable type. the following script performs these transformations but for more detail you can refer to my previous paper of logistic regression.

mydata<-data[,-c(1,4,9,11)]
mydata$Survived<-as.integer(mydata$Survived)
mydata<-modify_at(mydata,c("Pclass","Sex","Embarked","SibSp","Parch"), as.factor)

Now let’s check the summary of the transformed data.

summary(mydata)
##     Survived      Pclass      Sex           Age        SibSp   Parch
##  Min.   :0.0000   1:216   female:314   Min.   : 0.42   0:608   0:678
##  1st Qu.:0.0000   2:184   male  :577   1st Qu.:20.12   1:209   1:118
##  Median :0.0000   3:491                Median :28.00   2: 28   2: 80
##  Mean   :0.3838                        Mean   :29.70   3: 16   3:  5
##  3rd Qu.:1.0000                        3rd Qu.:38.00   4: 18   4:  4
##  Max.   :1.0000                        Max.   :80.00   5:  5   5:  5
##                                        NA's   :177     8:  7   6:  1
##       Fare        Embarked
##  Min.   :  0.00   C   :168
##  1st Qu.:  7.91   Q   : 77
##  Median : 14.45   S   :644
##  Mean   : 32.20   NA's:  2
##  3rd Qu.: 31.00
##  Max.   :512.33
## 

As we see, we have 177 missing values from age variable and 2 values from Embarked. For missing values we have two strategies, removing completely the missing values from the analysis, but doing so we will lose many data, or imputing them by one of the available imputation method to fix these values. Since we have large number of missing values compared to the total examples in the data it would be better to follow the latter strategy. Thankfully to mice package that is a very powerfull for this purpose and it provides many imputation methods for all variable types. We will opt for random forest method since in most cases can be the best choice. However, in order to respect the most important rule in machine learning, never touch the test data during the training process , we will apply this imputation after splitting the data.

## Data visualization

We have many tools outside modelization to investigate some relationships between variables like visualization tools. So we can visualize the relationship between each predictor and the target variable using the ggplot2 package.

library(ggplot2)
ggplot(mydata,aes(Sex,Survived,color=Sex))+
geom_point()+
geom_jitter()

The left side of the plot shows that higher fraction of females survived, whereas the right side shows the reverse situation for males where most of them died. We can induce from this plot that, ceteris paribus, this predictor is likely to be relevant for prediction.

ggplot(mydata,aes(Pclass,Survived,color=Pclass))+
geom_point()+
geom_jitter()

in this plot most of the first class passengers survived in contrast with the third class passengers where most of them died. However, for the second class, it seems equally balanced. Again this predictor also can be relevant.

ggplot(mydata,aes(SibSp,Survived,color=SibSp))+
geom_point()+
geom_jitter()

This predictor refers to the number of siblings a passenger has. It seems to be equally distributed given the target variable, and hence can be highly irrelevant. In other words, knowing the number of siblings of a particular passenger does not help to predict if this passenger survived or died.

ggplot(mydata,aes(Parch,Survived,color=Parch))+
geom_point()+
geom_jitter()

This predictor refers to the number of parents and children a passenger has. It seems that this predictor is slightly discriminative if we look closely at the level 0, passengers with no parents or children.

ggplot(mydata,aes(Embarked,Survived,color=Embarked))+
geom_point()+
geom_jitter()

We see that a passenger who is embarked from the port S is slightly highly to be died, while the other ports seem to be equally distributed.

For numeric variables we use the empirical densitiy givan the target variable as follows.

ggplot(mydata[complete.cases(mydata),], aes(Age,fill=as.factor(Survived)))+
geom_density(alpha=.5)

We see that some significant overlapping between the two conditional distribution may indicating less relevance related to this variable.

ggplot(mydata, aes(Fare,fill=as.factor(Survived)))+
geom_density(alpha=.5)

For this variables the conditional distribution are different, we see a spike close to zero reflecting the more death among third class.

we can also plot two predictors against each other. For instance let’s try with the two predictors, Sex and Pclass:

ggplot(mydata,aes(Sex,Pclass,color=as.factor(Survived)))+
geom_point(col="green",pch=16,cex=7)+
geom_jitter()

The majority of the survived females (blue points on the left) came from the first and the second class, while the majority of died males (red points on the right) came from the third class.

## Data partition

we take out 80% of the data as training set and the remaining will be served as testing set.

set.seed(1234)
index<-createDataPartition(mydata$Survived,p=0.8,list=FALSE) train<-mydata[index,] test<-mydata[-index,] Now we are ready to impute the missing values. suppressPackageStartupMessages(library(mice)) imput_train<-mice(train,m=3,seed=111, method = 'rf') ## Warning: Number of logged events: 30 train2<-complete(imput_train,1) summary(train2) From this output we see that we do not have missing values any more. ## Model training The xgboost model expects the predictors to be of numeric type, so we convert the factors to dummy variables by the help of the Matrix package suppressPackageStartupMessages(library(Matrix)) train_data<-sparse.model.matrix(Survived ~. -1, data=train2) Note that the -1 value added to the formula is to avoid adding a column as intercept with ones to our data. we can take a look at the structure of the data by the following str(train_data) ## Formal class 'dgCMatrix' [package "Matrix"] with 6 slots ## [email protected] i : int [1:3570] 1 3 5 8 17 20 23 24 27 28 ... ## [email protected] p : int [1:21] 0 178 329 713 1173 1886 2062 2086 2100 2114 ... ## [email protected] Dim : int [1:2] 713 20 ## [email protected] Dimnames:List of 2 ## .. ..$ : chr [1:713] "1" "2" "3" "4" ...
##   .. ..$: chr [1:20] "Pclass1" "Pclass2" "Pclass3" "Sexmale" ... ## [email protected] x : num [1:3570] 1 1 1 1 1 1 1 1 1 1 ... ## [email protected] factors : list() We know that many machine learning algorithms require the inputs to be in a specific type. The input types supported by xgboost algorithm are: matrix, dgCMatrix object rendered from the above package Matrix, or the xgboost class xgb.DMatrix. suppressPackageStartupMessages(library(xgboost)) We should first store the dependent variable in a separate vector, let’s call it train_label train_label<-train$Survived
dim(train_data)
## [1] 713  20
length(train$Survived) ## [1] 713 Now we bind the predictors, contained in the train_data , with the train_label vector as xgb.DMatrix object as follows train_final<-xgb.DMatrix(data = train_data,label=train_label) To train the model you must provide the inputs and specify the argument values if we do not want to keep the following values: • objective: for binary classification we use binary:logistic • eta (default=0.3): The learning rate. • gamma (default=0): also called min_split_loss, the minimum loss required for splitting further a particular node. • max_depth(default=6): the maximum depth of the tree. • min_child_weight(default=1): the minimum number of instances required in a node under which the node will be leaf. • subsample (default=1): with the default the model uses all the data at each tree, if 0.7 for instance, then the model randomly sample 70% of the data at each iteration, doing so we fight the overfiting problem. • colsample_bytree (default=1, select all columns): subsample ratio of columns at each iteration. • nthreads (default=2): number of cpu’s used in parallel processing. • nrounds : the number of boosting iterations. You can check the whole parameters by typing ?xgboost. It should be noted that the input data can feed into the model by two ways: It the data is of class xgb.DMatrix that contain both the predictors and the label, as we did, then we do not use the label argument. Otherwise, with any other class we provide both argument data and label. Let’s our first attempt will be made with 40 iterations and the default values for the other arguments. mymodel <- xgboost(data=train_final, objective = "binary:logistic", nrounds = 40) ## [1] train-error:0.148668 ## [2] train-error:0.133240 ## [3] train-error:0.130435 ## [4] train-error:0.137447 ## [5] train-error:0.127630 ## [6] train-error:0.117812 ## [7] train-error:0.115007 ## [8] train-error:0.109397 ## [9] train-error:0.102384 ## [10] train-error:0.103787 ## [11] train-error:0.103787 ## [12] train-error:0.102384 ## [13] train-error:0.100982 ## [14] train-error:0.098177 ## [15] train-error:0.098177 ## [16] train-error:0.096774 ## [17] train-error:0.096774 ## [18] train-error:0.098177 ## [19] train-error:0.093969 ## [20] train-error:0.091164 ## [21] train-error:0.086957 ## [22] train-error:0.085554 ## [23] train-error:0.085554 ## [24] train-error:0.082749 ## [25] train-error:0.082749 ## [26] train-error:0.082749 ## [27] train-error:0.079944 ## [28] train-error:0.075736 ## [29] train-error:0.074334 ## [30] train-error:0.074334 ## [31] train-error:0.072931 ## [32] train-error:0.072931 ## [33] train-error:0.070126 ## [34] train-error:0.070126 ## [35] train-error:0.070126 ## [36] train-error:0.068724 ## [37] train-error:0.067321 ## [38] train-error:0.061711 ## [39] train-error:0.061711 ## [40] train-error:0.063114 We can plot the error rates as follows  mymodel$evaluation_log %>%
ggplot(aes(iter, train_error))+
geom_point()

To evaluate the model we will use the test data that should follow all the above steps as the training data except for the missing values. since the test set is only used to evaluate the model so we will remove all the missing values.

test1 <- test[complete.cases(test),]
test2<-sparse.model.matrix(Survived ~. -1,data=test1)
test_label<-test1$Survived test_final<-xgb.DMatrix(data = test2, label=test_label) Then we use the predict function and confusionMatrix function from caret package, and since the predicted values are probabbilities we convert them to predicted classes using the threshold of 0.5 as follows: pred <- predict(mymodel, test_final) pred<-ifelse(pred>.5,1,0) confusionMatrix(as.factor(pred),as.factor(test_label)) ## Confusion Matrix and Statistics ## ## Reference ## Prediction 0 1 ## 0 81 13 ## 1 11 36 ## ## Accuracy : 0.8298 ## 95% CI : (0.7574, 0.8878) ## No Information Rate : 0.6525 ## P-Value [Acc > NIR] : 2.379e-06 ## ## Kappa : 0.6211 ## ## Mcnemar's Test P-Value : 0.8383 ## ## Sensitivity : 0.8804 ## Specificity : 0.7347 ## Pos Pred Value : 0.8617 ## Neg Pred Value : 0.7660 ## Prevalence : 0.6525 ## Detection Rate : 0.5745 ## Detection Prevalence : 0.6667 ## Balanced Accuracy : 0.8076 ## ## 'Positive' Class : 0 ##  with the default values we obtain a pretty good accuracy rate. The next step we fine tune the hyperparameters sing cross validation with the help of caret package. ## Fine tune the hyperparameters for the hyperparameters we try different grid values for the above arguments as follows: • eta: seq(0.2,1,0.2) • max_depth: seq(2,6,1) • min_child_weight: c(1,5,10) • colsample_bytree : seq(0.6,1,0.1) • nrounds : c(50,200 ,50) This requires training the model 375 times. grid_tune <- expand.grid( nrounds = c(50,200,50), max_depth = seq(2,6,1), eta = seq(0.2,1,0.2), gamma = 0, min_child_weight = 1, colsample_bytree = seq(0.6,1,0.1), subsample = 1 ) Then we use 5 folds cross validation as follows. control <- trainControl( method = "repeatedcv", number = 5, allowParallel = TRUE ) Now instead we use the train function from caret to train the model and we specify the method as xgbtree. train_data1 <- as.matrix(train_data) train_label1 <- as.factor(train_label) #mymodel2 <- train( # x = train_data1, # y = train_label1, # trControl = control, # tuneGrid = grid_tune, # method = "xgbTree") Note: This model took several minutes so we do not the model to be rerun again when rendering this document that is why i have commented the above script and have saved the results in csv file, then i have reloaded it again to continue our analysis. If you would like to run this model you can just uncomment the script. # results <- mymodel2$results
# write_csv(results, "xgb_results.csv")
## Parsed with column specification:
## cols(
##   eta = col_double(),
##   max_depth = col_double(),
##   gamma = col_double(),
##   colsample_bytree = col_double(),
##   min_child_weight = col_double(),
##   subsample = col_double(),
##   nrounds = col_double(),
##   Accuracy = col_double(),
##   Kappa = col_double(),
##   AccuracySD = col_double(),
##   KappaSD = col_double()
## )

Let’s now check the best hyperparameter values:

results %>%
arrange(-Accuracy) %>%
## # A tibble: 5 x 11
##     eta max_depth gamma colsample_bytree min_child_weight subsample nrounds
##
## 1   0.2         4     0              0.6                1         1      50
## 2   0.2         6     0              0.6                1         1      50
## 3   0.8         2     0              0.8                1         1      50
## 4   0.4         3     0              0.6                1         1      50
## 5   0.2         3     0              1                  1         1     200
## # ... with 4 more variables: Accuracy , Kappa , AccuracySD ,
## #   KappaSD 

As we see the highest accuracy rate is about 81.34% with the related hyperparameter values as follows.

results %>%
arrange(-Accuracy) %>%
## # A tibble: 1 x 11
##     eta max_depth gamma colsample_bytree min_child_weight subsample nrounds
##
## 1   0.2         4     0              0.6                1         1      50
## # ... with 4 more variables: Accuracy , Kappa , AccuracySD ,
## #   KappaSD 

Now we apply these values for the final model using the whole data uploadded at the beginning from the train.csv file, and then we call the file test.csv file for titanic data to submit our prediction to the kaggle competition.

imput_mydata<-mice(mydata,m=3,seed=111, method = 'rf')
## Warning: Number of logged events: 15
mydata_imp<-complete(imput_mydata,1)
my_data<-sparse.model.matrix(Survived ~. -1, data = mydata_imp)
mydata_label<-mydata$Survived data_final<-xgb.DMatrix(data = my_data,label=mydata_label) final_model <- xgboost(data=data_final, objective = "binary:logistic", nrounds = 50, max_depth = 4, eta = 0.2, gamma = 0, colsample_bytree = 0.6, min_child_weight = 1) and we get the following result pred <- predict(mymodel, data_final) pred<-ifelse(pred>.5,1,0) confusionMatrix(as.factor(pred),as.factor(mydata_label)) ## Confusion Matrix and Statistics ## ## Reference ## Prediction 0 1 ## 0 518 60 ## 1 31 282 ## ## Accuracy : 0.8979 ## 95% CI : (0.8761, 0.917) ## No Information Rate : 0.6162 ## P-Value [Acc > NIR] : < 2.2e-16 ## ## Kappa : 0.7806 ## ## Mcnemar's Test P-Value : 0.003333 ## ## Sensitivity : 0.9435 ## Specificity : 0.8246 ## Pos Pred Value : 0.8962 ## Neg Pred Value : 0.9010 ## Prevalence : 0.6162 ## Detection Rate : 0.5814 ## Detection Prevalence : 0.6487 ## Balanced Accuracy : 0.8840 ## ## 'Positive' Class : 0 ##  The accuracy rate with these values is about 90% . Now lets fit this model to the test.csv file. kag<-read_csv("../test.csv") ## Parsed with column specification: ## cols( ## PassengerId = col_double(), ## Pclass = col_double(), ## Name = col_character(), ## Sex = col_character(), ## Age = col_double(), ## SibSp = col_double(), ## Parch = col_double(), ## Ticket = col_character(), ## Fare = col_double(), ## Cabin = col_character(), ## Embarked = col_character() ## ) kag1<-kag[,-c(3,8,10)] kag1 <- modify_at(kag1,c("Pclass", "Sex", "Embarked", "SibSp", "Parch"), as.factor) summary(kag1) ## PassengerId Pclass Sex Age SibSp Parch ## Min. : 892.0 1:107 female:152 Min. : 0.17 0:283 0 :324 ## 1st Qu.: 996.2 2: 93 male :266 1st Qu.:21.00 1:110 1 : 52 ## Median :1100.5 3:218 Median :27.00 2: 14 2 : 33 ## Mean :1100.5 Mean :30.27 3: 4 3 : 3 ## 3rd Qu.:1204.8 3rd Qu.:39.00 4: 4 4 : 2 ## Max. :1309.0 Max. :76.00 5: 1 9 : 2 ## NA's :86 8: 2 (Other): 2 ## Fare Embarked ## Min. : 0.000 C:102 ## 1st Qu.: 7.896 Q: 46 ## Median : 14.454 S:270 ## Mean : 35.627 ## 3rd Qu.: 31.500 ## Max. :512.329 ## NA's :1 we have 86 missing values for Age and one for Far, using a good idea from a kaggler named Harrison Tietze who suggested to treat the persons with missing values as likely to be died. For instance he replaced the missing ages by the mean age of died persons from the train data. But for us we go even further and we consider all rows with missing values as died persons. Additionally, when inspecting the summary above we notice that we have an extra level (9) in the factor Parch that is not existed in the traind data, and hence the model does not allow such extra information. However, since this level has only two cases we can approximate this level by the closest one which is 6, then we drop the level 9 from this factor. kag1$Parch[kag1$Parch==9]<-6 kag1$Parch <- kag1$Parch %>% forcats::fct_drop() kag_died <- kag1[!complete.cases(kag1),] kag2 <- kag1[complete.cases(kag1),] So we only use the kag2 data for the prediction. DP<-sparse.model.matrix(PassengerId~.-1,data=kag2) head(DP) ## 6 x 20 sparse Matrix of class "dgCMatrix" ## [[ suppressing 20 column names 'Pclass1', 'Pclass2', 'Pclass3' ... ]] ## ## 1 . . 1 1 34.5 . . . . . . . . . . . . 7.8292 1 . ## 2 . . 1 . 47.0 1 . . . . . . . . . . . 7.0000 . 1 ## 3 . 1 . 1 62.0 . . . . . . . . . . . . 9.6875 1 . ## 4 . . 1 1 27.0 . . . . . . . . . . . . 8.6625 . 1 ## 5 . . 1 . 22.0 1 . . . . . 1 . . . . . 12.2875 . 1 ## 6 . . 1 1 14.0 . . . . . . . . . . . . 9.2250 . 1 predkag<-predict(final_model,DP) head(predkag) ## [1] 0.08700940 0.21760842 0.09857274 0.17517737 0.56304359 0.09141546 As we see the output is the probability of each instance, so we should convert this probabbilitis to classe labels: predkag<-ifelse(predkag>.5,1,0) Now first we cbined passengerId with the fitted values named as Survived, next we rbind with the first set kag1 : predkag2K<-cbind(kag2[,1],Survived=predkag) kag_died$Survived <- 0
predtestk <- rbind(predkag2K,kag_died[, c(1,9)])

Finally, we save the file as csv file to submit it to kaggle then check our rank :

write_csv(predtestk,"predxgbkag.csv")

## Conclusion:

Xgboost is the best machine learning algorithm nowadays due to its powerful capability to predict wide range of data from various domaines. Several win competitions in kaggle and elsewhere are achieved by this model. It can handle large and complex data with ease. The large number of Hyperparameters that has give the modeler a large possibilities to tune the model with respect to the data at their hand as well as to fight other problems such as overfitting, feature selection…ect.

1. Kevin P.Murphy 2012↩︎