Understanding Titanic Dataset with H2O’s AutoML, DALEX, and lares library

Posted on August 1, 2018 by Bernardo Lares in R bloggers | 0 Comments

[This article was first published on R Programming – DataScience+, and kindly contributed to R-bloggers].

If you have been studying or working with Machine Learning for at least a week, I am sure you have already played with the Titanic dataset! Today I bring some fun DALEX (Descriptive mAchine Learning EXplanations) functions to study the whole set’s response to the Survival feature and some individual explanation examples.

Before we start, I invite you all to install my personal library lares so you can follow the examples below step by step and use it to speed up your daily ML and Analytics tasks:

devtools::install_github("laresbernardo/lares")

A quick good model with h2o_automl

Let’s run the lares::h2o_automl function to generate a quick, good model on the Titanic dataset. I can’t remember where I originally downloaded the files, but I will re-share them with you so you can reproduce the following examples: Titanic Train Dataset (used as the only dataset because we need the Survival data to train and test our supervised model).

Having the data in our working directory, we can now load it, label it, do some quick cleaning, and run h2o_automl to create a fast model. The outcome is an object (results) with all the model’s results: name, metrics, predictions, datasets, variable importances, etc.

NOTE: when using the lares::h2o_automl function with our data frame as it is, with no ‘train_test’ parameter, it will automatically split the data 70/30 into training and testing sets (use ‘split’ in the function if you want to change this ratio).
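To make the idea of a seeded 70/30 split concrete, here is a minimal base-R sketch. It is only an illustration and not how lares::h2o_automl works internally; `split_data` and the toy data frame are hypothetical names I made up for this example.

```r
# Hypothetical helper (NOT part of lares): seeded random train/test split
split_data <- function(df, split = 0.7, seed = 123) {
  set.seed(seed)                                   # same seed, same split
  n <- nrow(df)
  train_idx <- sample(seq_len(n), size = floor(split * n))
  test_idx  <- setdiff(seq_len(n), train_idx)      # everything not in train
  list(train = df[train_idx, , drop = FALSE],
       test  = df[test_idx, , drop = FALSE])
}

# Toy data standing in for the real passengers
toy <- data.frame(id = 1:10, tag = rep(c(0, 1), 5))
parts <- split_data(toy, split = 0.7, seed = 123)

nrow(parts$train)
nrow(parts$test)
```

Calling the helper again with the same seed returns exactly the same rows, which is why fixing a seed matters when you want others to reproduce your results.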

library(dplyr)
library(lares)

seed <- 123

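# Load the Titanic training file from the working directory
train <- read.csv("train.csv")

# Drop identifier-like columns, rename the target to "tag", and set factor types
dfm <- train %>%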
  select(-Name, -Ticket, -PassengerId, -Cabin) %>%
  rename("tag" = "Survived") %>% 
  mutate(tag = as.factor(tag), 
         Pclass = as.factor(Pclass))

results <- lares::h2o_automl(df = dfm, seed = seed, max_time = 60)

Let’s quickly check our model’s performance:

lares::mplot_full(tag = results$scores$tag, 
                  score = results$scores$score,
                  subtitle = "Titanic dataset")

Gives this plot:

Basically, we got a nicely performing model, with an AUC of 88.9%, which separates the Titanic’s survivors from non-survivors quite well. Note that among the top 25% of scored passengers, only 3% did not survive (the Captain? Sorry…) and 61% are true survivors.

Once we have this model, we can study which features it chose as the most relevant and why. There are probably more than a million ways to understand this, but today we are going to check the DALEX results.
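If you want to sanity-check an AUC figure by hand, the following base-R sketch computes it from labels and scores using the rank (Mann-Whitney) formulation. The helper `auc_rank` and the toy vectors are hypothetical and only for illustration; this is not the code that mplot_full runs.

```r
# Hypothetical helper (NOT a lares function): rank-based AUC.
# AUC = probability that a random positive scores higher than a random negative.
auc_rank <- function(tag, score) {
  pos   <- tag == 1
  n_pos <- sum(pos)
  n_neg <- sum(!pos)
  r <- rank(score)                       # ties receive average ranks
  (sum(r[pos]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Toy labels and scores (made up, not from the Titanic model)
tag   <- c(1, 1, 0, 1, 0, 0)
score <- c(0.9, 0.8, 0.7, 0.2, 0.3, 0.1)

auc <- auc_rank(tag, score)
auc  # 7 of the 9 positive/negative pairs are ordered correctly
```

An AUC of 1 means every survivor scored above every non-survivor, while 0.5 is what random scores would give.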

The most important variables

Every dataset has relevant and irrelevant features. Sometimes it is our work as data scientists or analysts to detect which ones are which, how they affect our dependent variable and, most importantly, why. I happen to have the following function to help us see these results quickly:

lares::mplot_importance(var = results$importance$variable, 
                        imp = results$importance$percentage,
                        subtitle = "Titanic dataset")

Which will plot:

You can also plot something similar with the DALEX library but personally, I do like mine better! :$

Partial Dependency Plots (PDP)

Now that we know that Sex, Age, Fare, and Pclass are the most relevant features, we should check how the model detects the relationship between the target (Survival) and these features. Besides, these plots are not only incredibly powerful for communicating our insights to non-technical users, but will also help us put our models into production with more confidence (fewer black boxes).
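In case the idea behind a partial dependence curve is new to you, here is a rough base-R sketch of the computation: fix one feature at a grid value for every row, predict, and average. It uses synthetic data and a plain glm instead of the h2o model, and `partial_dependence` is a hypothetical helper, so take it as an illustration of the concept rather than what DALEX does under the hood.

```r
set.seed(123)

# Synthetic toy data (NOT the Titanic set): survival odds fall with age
n <- 200
toy <- data.frame(age = runif(n, 1, 80), fare = runif(n, 5, 150))
toy$tag <- rbinom(n, 1, plogis(1 - 0.04 * toy$age + 0.01 * toy$fare))

# A simple stand-in model
fit <- glm(tag ~ age + fare, data = toy, family = binomial())

# Hypothetical helper: average prediction when `variable` is forced to each grid value
partial_dependence <- function(model, data, variable, grid) {
  sapply(grid, function(value) {
    data[[variable]] <- value                      # overwrite the feature for all rows
    mean(predict(model, newdata = data, type = "response"))
  })
}

grid   <- seq(5, 75, by = 10)
pd_age <- partial_dependence(fit, toy, "age", grid)
data.frame(age = grid, mean_prediction = round(pd_age, 3))
```

Plotting mean_prediction against age gives the line-plot version of a PDP for a numerical feature.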

To start plotting with DALEX, we have to create an explainer. If you are using h2o or my functions above, this is all you need to do:

explainer <- lares::dalex_explainer(df = results$datasets$test, 
                                    model = results$model)

So, let’s check our PDPs on our main features. Note that we will get different outputs depending on each variable’s class: if it’s a numerical value, then we’ll have a Partial Dependency Plot (line-plot); if we have a categorical or factor variable, then we will get a Merging Path Plot (dendrogram-plot).

lares::dalex_variable(explainer, "Sex")

We get this plot:

How chivalrous of us! Basically, if you were a man and survived, you were quite lucky (spoiler: and rich!).

lares::dalex_variable(explainer, "Age")

We get this plot:

If you were a child, you shouldn’t even have been scared when it all went ‘down’. Regardless of class and sex, children were the ones who had the highest scores, thus the highest probability of surviving. We get some peaks around 33 years old and to hell with the elders.

lares::dalex_variable(explainer, "Fare")

Another plot:

Basically, if you paid less than 50 bucks, you didn’t pay for the boats or the life jackets. So rude!

lares::dalex_variable(explainer, "Pclass", force_class = "factor")

The plot:

To emphasize once more the link between economic situation and survival, we see how the 1st and 2nd class passengers were “luckier” than the 3rd class passengers.

With these plots (and with a little help from the movie to put us in perspective) we can now better understand the macro situation of the Titanic’s survivors. Now, let’s check some particular cases, as individuals, to go further into our analysis.

Individual Interpretation

DALEX’s local interpretation functions are awesome! If you have used LIME before, in my opinion these are quite similar but better. Note that several predictors have zero contribution, while others have positive or negative contributions.

Before you start, let me tell you that each example we run takes almost a minute to plot… so be patient when using this function!
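To give a feeling for what a “contribution” means, here is a small base-R sketch for the simplest possible case, a linear model, where one prediction splits exactly into an average prediction plus one additive term per feature. This is my own simplification on synthetic data and not the algorithm DALEX applies to the h2o model; all object names are hypothetical.

```r
set.seed(42)

# Synthetic toy data (NOT the Titanic set)
n <- 300
toy <- data.frame(age = runif(n, 1, 80), fare = runif(n, 5, 150))
toy$y <- 0.9 - 0.005 * toy$age + 0.002 * toy$fare + rnorm(n, sd = 0.05)

fit  <- lm(y ~ age + fare, data = toy)
vars <- c("age", "fare")

obs        <- toy[1, ]                               # the individual to explain
baseline   <- mean(predict(fit, newdata = toy))      # average prediction
prediction <- unname(predict(fit, newdata = obs))    # this individual's prediction

# Contribution of each feature: coefficient times distance from the feature's mean
contrib <- coef(fit)[vars] * (unlist(obs[vars]) - colMeans(toy[vars]))

contrib
baseline + sum(contrib)   # adds back up to `prediction`
```

A positive term pushes this individual above the average prediction and a negative term pulls it below, which is the same reading we give to the bars in the local plots.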

Subject #1 (Randomly chosen: #23):

local23 <- lares::dalex_local(explainer, 
                              row = results$datasets$test["23",], 
                              plot = TRUE)

Gives this plot:

Here we can see a specific woman who scored pretty OK: 0.699 (can’t be seen in the image). The predicted value for this individual observation was positively and strongly influenced by Sex = Female and Age = 15. On the other hand, the Pclass = 3 variable reduced this person’s probability of surviving.

Subject #2 (Worst score):

results$datasets$test[results$scores$score == min(results$scores$score),]
   tag Pclass  Sex Age SibSp Parch Fare Embarked
60   0      3 male  11     5     2 46.9        S

We have 1 person (out of 268) in our test set who scored 0.01. (I can’t help mentioning that he really didn’t survive.) Let’s now study this guy with our DALEX function:

local60 <- lares::dalex_local(explainer,
                              row = results$datasets$test["60",], 
                              plot = TRUE)

Gives this plot:

This poor boy sailed with a 3rd class ticket, aged 11, traveling with 5 siblings and both parents. The most important features were his Sex = male and his low Fare/Pclass. Even though we noticed that children were more likely to survive, this little man might be one of the exceptions because of (maybe) the number of family members he had onboard. If you think about it, the story behind it might be that he was on the ship with all his siblings and parents, who were 3rd class as well, and sank with them instead of being saved alone. Sad (but true) story.

Subject #3 (Best score):

If we repeat the prior example but with the highest score, we get 4 women, all 1st class, all embarked at C (Cherbourg), and all 100% survivors. Let’s take a look at one:

local196 <- lares::dalex_local(explainer,
                               row = results$datasets$test["196",], 
                               plot = TRUE)

The plot:

This handsome old lady, 58, traveled in first class with no family members and paid a $147 fare. Our model detects that this person had a very high probability of surviving, mainly because she paid a lot and was a woman!
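For reference, the worst-score and best-score lookups used for Subjects #2 and #3 boil down to a logical filter on the scores. Here is a tiny base-R sketch on made-up data; the `test` and `scores` objects are hypothetical stand-ins for results$datasets$test and results$scores.

```r
# Made-up stand-ins for the test set and its scores (same row order in both)
test <- data.frame(Sex = c("male", "female", "female", "male"),
                   Pclass = c(3, 1, 1, 2),
                   row.names = c("60", "196", "201", "23"))
scores <- data.frame(score = c(0.01, 0.99, 0.99, 0.40))

# Keep every row tied for the lowest / highest score
worst <- test[scores$score == min(scores$score), ]
best  <- test[scores$score == max(scores$score), ]

rownames(worst)  # "60"
rownames(best)   # "196" "201"

# Indexing with a character string selects by row name, not by position
test["196", ]
```

That last line is why calls like test["60", ] above pick the passenger whose row name is 60, and not necessarily the 60th row of the test set.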

Conclusions

One way to understand a dataset is running a model and analyzing the Machine Learning intelligence behind it.
Studying the important variables on a macro view and some particular cases on a micro view will give us confidence and a global understanding.
Partial dependence plots are a great way to extract insights from complex models. They can be very useful when sharing our insights and results with other people.
With DALEX we can stop showing our Machine Learning models to the world as plain black boxes.
With the lares library we can automate and speed up our daily tasks for Analytics and Machine Learning jobs.

Hope this article was fun to read and that you learnt something! Don’t hesitate to comment below if you have any further questions, comments or insights, and I’ll be delighted to answer back as soon as possible. Please, keep in touch and feel free to contact me via Linkedin or email.