Tools for Automated Machine Learning allow us to quickly create accurate but complex predictive models. The opacity of such models is often a major obstacle to deploying them, especially in high-stakes decision areas such as medicine or finance. Therefore, methods for the automated building of interpretable machine learning models are important.
In this post, we present a framework for Surrogate Assisted Feature Extraction for Model Learning (SAFE ML). SAFE ML uses a flexible black-box model as a supervisor to create an interpretable, yet still accurate, glass-box model. The main idea is to train a new interpretable model on features newly engineered from the supervisor model.
The method can be described in 6 steps:
Step 1 Provide a raw tabular data set.
Step 2 Train a complex supervisor machine learning model on the provided data. This model does not need to be interpretable and is treated as a black box.
Step 3 Use SAFE to find variable transformations. (A) For continuous variables, use Partial Dependence Profiles to find changepoints that yield the best binning of the variable of interest. (B) For categorical variables, use clustering to merge some of the levels.
Step 4 Optionally, perform a feature selection on the new set of features that includes original variables from the raw data and variables transformed with the SAFE method.
Step 5 Fit a fully interpretable model on selected features. Models that can be used are, for example, logistic regression for classification problems or linear models for regression problems.
Step 6 Enjoy your fully interpretable and accurate model!
Use Case on Hotel Bookings Data
We demonstrate SAFE on the data published in the paper Hotel booking demand datasets in the Data in Brief journal. The data set contains information about hotel room reservations. We aim to predict whether customers cancel their reservations. We are interested in developing an interpretable model that helps hotel staff better understand what affects cancellations.
The interpretable vanilla logistic regression achieved an AUC of 0.6 on the train set and 0.5 on the test set. According to these results, the model is overfitted and cannot be used (by the way, this is a good moment to recommend the song by Rafael Moral about overfitting). The linear model explains cancellations poorly, which may be due to large non-linearities in the data. Therefore, we will use the SAFE method to transform the variables and help the linear model capture these non-linearities.
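The baseline check described here can be reproduced in a few lines. This sketch uses synthetic data in place of the hotel bookings set, so the exact AUC values will differ; the point is comparing train and test AUC to spot overfitting.

```python
# Baseline sanity check: vanilla logistic regression, train vs. test AUC.
# Synthetic data stands in for the hotel bookings set (illustration only).
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=3000, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
auc_train = roc_auc_score(y_train, model.predict_proba(X_train)[:, 1])
auc_test = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
# a large gap between the two scores signals overfitting
```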
Let us use a random forest as the flexible supervisor model. Below is an example SAFE transformation of the categorical variable deposit type, which indicates whether the customer made a deposit to guarantee the booking. The transformation is produced based on the predictions of the random forest. The variable's three levels were merged into two, dividing customers into those who made a deposit and those who did not.
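The idea of merging levels can be sketched as follows: score each level by the supervisor model's mean prediction, then group levels whose scores are similar. The greedy threshold rule and the numeric values below are simplifying assumptions, not the actual fitted outputs.

```python
# Sketch of merging categorical levels by similarity of model predictions.
# The threshold rule and the mean predictions are hypothetical.
import numpy as np

def merge_levels(levels, mean_preds, tol=0.05):
    """Group levels whose mean predictions differ by less than `tol`."""
    order = np.argsort(mean_preds)
    groups, current = [], [levels[order[0]]]
    for prev, nxt in zip(order, order[1:]):
        if mean_preds[nxt] - mean_preds[prev] < tol:
            current.append(levels[nxt])
        else:
            groups.append(current)
            current = [levels[nxt]]
    groups.append(current)
    return groups

# e.g. three deposit-type levels collapse into two merged groups
levels = ["No Deposit", "Non Refund", "Refundable"]
mean_preds = np.array([0.28, 0.94, 0.31])   # hypothetical model outputs
print(merge_levels(levels, mean_preds))
# -> [['No Deposit', 'Refundable'], ['Non Refund']]
```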
The SAFE transformations of continuous variables are based on PDP profiles. Below is the transformation of the variable arrival date week number. The transformation reflects the seasons of the year, which affect the chance of cancellation differently. One continuous variable is transformed into five new binary variables that capture the seasonality of cancellations. According to the random forest model, the chance of cancellation is lowest between the 22nd and 39th weeks of the year, which corresponds to the period from the beginning of June to the end of September. This range coincides with the holiday season, when entire families go on vacation.
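Turning PDP changepoints into binary interval features can be sketched as below. The cut weeks 22 and 39 come from the description above; the other two cuts are hypothetical placeholders, since the actual changepoints come from the fitted profile.

```python
# Sketch: encode a continuous variable as one-hot interval indicators,
# with intervals defined by PDP changepoints. Cuts 12 and 45 are hypothetical.
import numpy as np

def interval_features(x, cuts):
    """One-hot encode a continuous variable into len(cuts)+1 interval bins."""
    bins = np.digitize(x, cuts)          # which interval each value falls in
    return np.eye(len(cuts) + 1)[bins]   # one binary column per interval

weeks = np.array([5, 23, 30, 40, 52])
cuts = [12, 22, 39, 45]                  # four cuts -> five binary features
encoded = interval_features(weeks, cuts)
print(encoded.shape)                     # (5, 5)
```

Each row of `encoded` has exactly one 1, marking the interval the week falls into; the five columns are the five new binary variables.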
The two variables above are only examples: we started with 26 variables from the original data set and, after transforming them with the SAFE method, obtained 22 variables. The logistic regression built on the transformed variables achieved AUC = 0.68, so it is an interpretable model that can be used to predict whether a reservation will be canceled.
To learn more about SAFE ML, its applications, and the benchmarks, we encourage you to read the paper:
Alicja Gosiewska, Anna Kozak, Przemyslaw Biecek, Simpler is better: Lifting interpretability-performance trade-off via automated feature engineering, Decision Support Systems, 2021, 113556, ISSN 0167–9236, https://doi.org/10.1016/j.dss.2021.113556.
Special thanks to Katarzyna Pękala for the useful comments.
If you are interested in other posts about explainable, fair, and responsible ML, follow #ResponsibleML on Medium.
In order to see more R-related content visit https://www.r-bloggers.com.
Simplify your model: Supervised Assisted Feature Extraction for Machine Learning was originally published in ResponsibleML on Medium.