Simplify your model: Surrogate Assisted Feature Extraction for Machine Learning


This is joint work with Anna Kozak and Przemyslaw Biecek.

Automated Machine Learning tools make it possible to quickly create accurate but complex predictive models. The opacity of such models is often a major obstacle to deployment, especially in high-stakes decision areas such as medicine or finance. Therefore, methods that automatically build models which are both accurate and interpretable are important.

In this post, we present a framework for Surrogate Assisted Feature Extraction for Machine Learning (SAFE ML). SAFE ML uses a flexible black-box model as a supervisor to create an interpretable, yet still accurate, glass-box model. The main idea is to train a new interpretable model on features engineered from the supervisor model.

The diagram of the SAFE ML framework. The dotted line marks the automated steps of the framework. The yellow area covers the feature engineering steps, and the green area covers the human-model interaction steps. Source: https://doi.org/10.1016/j.dss.2021.113556.

The method can be described in 6 steps:
Step 1 Provide a raw tabular data set.

Step 2 Train a complex supervisor machine learning model on the provided data. This model does not need to be interpretable and is treated as a black box.

Step 3 Use SAFE to find variable transformations. (A) For continuous variables, use Partial Dependence Profiles to find changepoints that yield the best binning of the variable of interest. (B) For categorical variables, use clustering to merge some of the levels.

Step 4 Optionally, perform feature selection on the new set of features, which includes the original variables from the raw data and the variables transformed with the SAFE method.

Step 5 Fit a fully interpretable model on the selected features. Suitable models are, for example, logistic regression for classification problems or linear models for regression problems.

Step 6 Enjoy your fully interpretable and accurate model!
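In R, the whole pipeline can be put together with the DALEX and rSAFE packages. Below is a minimal sketch, not the exact code from the use case: the data frame, the 0/1 target column is_canceled, and the train/test split are placeholder names, and the exact arguments of the rSAFE functions (safe_extraction, safely_transform_data, safely_select_variables) should be checked against the package documentation.

library(DALEX)
library(randomForest)
library(rSAFE)

# Step 2: train a flexible supervisor model (here a random forest) on the raw data.
rf_model <- randomForest(factor(is_canceled) ~ ., data = train)

# Wrap the model in a DALEX explainer so that rSAFE can query its predictions.
explainer_rf <- explain(rf_model,
                        data = train[, setdiff(names(train), "is_canceled")],
                        y = train$is_canceled)

# Step 3: extract variable transformations from the supervisor
# (PDP-based binning for continuous variables, level merging for categorical ones).
safe_extractor <- safe_extraction(explainer_rf, verbose = FALSE)

# Step 4: transform the data and (optionally) select features.
train_safe <- safely_transform_data(safe_extractor, train, verbose = FALSE)
selected   <- safely_select_variables(safe_extractor, train_safe,
                                      which_y = "is_canceled", verbose = FALSE)

# Step 5: fit a fully interpretable glass-box model on the new features.
glm_safe <- glm(is_canceled ~ ., family = binomial(),
                data = train_safe[, c("is_canceled", selected)])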

Use Case on Hotel Bookings Data

We demonstrate SAFE on data published in the paper Hotel booking demand datasets in the Data in Brief journal. The data set contains information about hotel room reservations. We aim to predict whether customers cancel their reservations. We are interested in developing an interpretable model that helps hotel staff better understand what affects cancellations.

The interpretable vanilla logistic regression achieved an AUC of 0.6 on the train set and 0.5 on the test set. These results show that the model is overfitted and cannot be used (by the way, this is a good moment to recommend this song by Rafael Moral about overfitting). The linear model explains cancellations poorly, which may be due to strong non-linearities in the data. Therefore, we will use the SAFE method to transform the variables and help the linear model capture these non-linearities.
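For reference, such a baseline can be reproduced with a plain glm and an AUC computed with the pROC package. This is a minimal sketch with assumed object names: train, test, and the 0/1 target column is_canceled.

library(pROC)

# Baseline glass-box model: logistic regression on the raw, untransformed features.
glm_baseline <- glm(is_canceled ~ ., data = train, family = binomial())

# Evaluate on the held-out test set; in this use case the test AUC is around 0.5.
pred_test <- predict(glm_baseline, newdata = test, type = "response")
auc(roc(test$is_canceled, pred_test))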

Let us use a random forest as the flexible supervisor model. Below is an example SAFE transformation of the categorical variable deposit type, which indicates whether the customer made a deposit to guarantee the booking. The transformation is derived from the predictions of the random forest. The variable, originally composed of three levels, was merged into two levels that divide customers into those who made a deposit and those who did not.

SAFE transformation of the categorical variable deposit type. The tree shows the merged levels of the variable: no deposit forms one level, and both deposit types are merged into the other.
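In effect, the transformation amounts to collapsing the factor into two groups. A minimal sketch of the resulting recoding, assuming the level names used in the hotel bookings data ("No Deposit", "Non Refund", "Refundable"):

# Collapse the three original levels into a binary deposit / no-deposit indicator.
hotels$deposit_type_safe <- factor(
  ifelse(hotels$deposit_type == "No Deposit", "no_deposit", "deposit")
)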

The SAFE transformations of continuous variables are based on Partial Dependence Profiles. Below is the transformation of the variable arrival date week number. The transformation reflects the seasons of the year, which affect the chance of cancellation differently. One continuous variable is transformed into five new binary variables that capture the seasonality of cancellations. According to the random forest model, the lowest chance of cancellation is between the 22nd and 39th week of the year, which corresponds to the beginning of June and the end of September. This range coincides with the holiday season, when entire families go on vacation.

SAFE transformation of continuous variable arrival date week number. The blue line is a PDP profile for the variable under consideration and the segments between the red dashed lines are new features identified by the SAFE method.
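The profile itself can be computed with DALEX (reusing explainer_rf from the sketch above), and the binning amounts to cutting the variable at the detected changepoints and expanding the resulting factor into indicator columns. A sketch; the changepoints below are illustrative (only weeks 22 and 39 are mentioned above), not the exact output of the changepoint search.

library(DALEX)

# Partial Dependence Profile of the variable, computed from the random forest explainer.
pdp_week <- model_profile(explainer_rf, variables = "arrival_date_week_number")
plot(pdp_week)

# Cut the continuous variable at the changepoints; cut() returns a factor with
# one level per segment, and model.matrix() expands it into binary indicators.
breaks <- c(-Inf, 15, 22, 39, 46, Inf)   # illustrative changepoints (5 segments)
hotels$week_segment <- cut(hotels$arrival_date_week_number, breaks = breaks)
week_indicators <- model.matrix(~ week_segment - 1, data = hotels)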

The two variables above are only examples; we started with 26 variables from the original data set and, after the SAFE transformations, ended up with 22 variables. The logistic regression built on the transformed variables achieved an AUC of 0.68, so it is an interpretable model that can be used to predict whether a reservation will be canceled.

Software

For the use case, we used the rSAFE R package. The code is available in the repository on GitHub. The SAFE method is also implemented in Python in the SafeTransformer package.
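If you want to try it yourself, a typical installation looks like the following (assuming the release version is available on CRAN; the development version lives in the ModelOriented organization on GitHub):

# Release version (assuming the package is on CRAN):
install.packages("rSAFE")

# Development version from GitHub:
# install.packages("devtools")
devtools::install_github("ModelOriented/rSAFE")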

Reference

To learn more about SAFE ML, its applications, and the benchmark, we encourage you to read the paper:
Alicja Gosiewska, Anna Kozak, Przemyslaw Biecek, Simpler is better: Lifting interpretability-performance trade-off via automated feature engineering, Decision Support Systems, 2021, 113556, ISSN 0167-9236, https://doi.org/10.1016/j.dss.2021.113556.

Special thanks to Katarzyna Pękala for the useful comments.

If you are interested in other posts about explainable, fair, and responsible ML, follow #ResponsibleML on Medium.
