Could AutoML win in the ‘Sliced’ Data Science Competition? The answer may shock you!
In this post I’ll be taking a break from my normal explorations in the medical device domain to talk about Sliced. Sliced is a 2-hour data science competition streamed on Twitch and hosted by Meg Risdal and Nick Wan. Four competitors tackle a prediction problem in real time using whatever coding language or tools they prefer, grabbing bonus points along the way for Data Visualization and/or stumbling onto Golden Features (hint: always calculate the air density when training on weather data). Viewers can simply kick back and watch the contestants ply their trade, or they can actively participate by submitting their own solutions and seeing how they stack up on the competition leaderboard!
Here are my observations after watching a few episodes:
* Participants do not typically implement more than 2 different model types, preferring to spend their time on Feature Engineering and tuning the hyperparameters of their preferred model
* Gradient boosting (XGBoost, CatBoost, etc.) is the dominant technique for tabular data
To clarify the first point above – the tuning is not totally manual; grid search functions are typically employed to identify the best hyperparameters from a superset of options. But the time pressure of the competition means that players can’t set up massive grids that lock up compute resources for too long. So it’s generally an iterative process over small grids that are expanded and contracted as needed, based on intermediate results from predicting on a test set.
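To make that loop concrete, here is a minimal sketch (my own illustration, not something a contestant actually ran) of a small-grid tuning pass using tidymodels with an xgboost spec. The commented-out resampling step assumes the dataset tibble that gets loaded later in this post.

library(tidymodels)

# A boosted tree spec with two tunable hyperparameters
xgb_spec <- boost_tree(
  trees = 500,
  tree_depth = tune(),
  learn_rate = tune()
) %>%
  set_engine("xgboost") %>%
  set_mode("classification")

# Start small: a 3 x 3 grid; widen or narrow the ranges after each pass
small_grid <- grid_regular(
  tree_depth(range = c(3L, 8L)),
  learn_rate(range = c(-2, -1)),  # dials expresses this on the log10 scale
  levels = 3
)

xgb_wf <- workflow() %>%
  add_formula(churned ~ .) %>%
  add_model(xgb_spec)

# res <- tune_grid(
#   xgb_wf,
#   resamples = vfold_cv(dataset, v = 5),
#   grid = small_grid,
#   metrics = metric_set(mn_log_loss)
# )
# show_best(res, "mn_log_loss")  # inspect, adjust the grid, and repeat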
All this led me to wonder: given the somewhat manual process of hyperparameter optimization and the restricted number of model types… how would AutoML fare in Sliced? The rest of this post will attempt to answer that question, at least for an arbitrary Sliced episode.
- Setup
- Understanding the Ensemble
- Final Thoughts
- Session Info
- Appendix – Full Details of the Ensemble Model
Setup
For this exercise we’ll use the dataset and metric from Episode 7, in which we are asked to predict whether or not a bank customer churned. The scoring metric is LogLoss. I’ll be using the free version of the h2o.ai framework and taking the following approach to feature engineering and variable selection:
- All variables will be used (churn explained by everything) and there will be no feature engineering except imputing means for missing values, converting nominal predictors to dummy variables, and removing the ID column. This should give a fair look at how h2o does given the bare minimum of attention to pre-processing and no attention to model selection or hyperparameter ranges.
Let’s get to it.
Load libraries
library(tidymodels)
library(tidyverse)
library(h2o)
library(here)
library(gt)
Load dataset
Read the data in as a csv and rename the attrition column
dataset <- read_csv(here("ep7/train.csv")) %>%
  mutate(churned = case_when(
    attrition_flag == 1 ~ "yes",
    TRUE ~ "no"
  ) %>% as_factor()) %>%
  select(-attrition_flag)

holdout <- read_csv(here("ep7/test.csv"))

dataset %>%
  gt_preview() %>%
  cols_align("center")
 | id | customer_age | gender | education_level | income_category | total_relationship_count | months_inactive_12_mon | credit_limit | total_revolving_bal | total_amt_chng_q4_q1 | total_trans_amt | total_trans_ct | total_ct_chng_q4_q1 | avg_utilization_ratio | churned |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 8805 | 27 | F | Post-Graduate | Less than $40K | 3 | 2 | 1438.3 | 990 | 0.715 | 3855 | 73 | 1.147 | 0.688 | no |
2 | 4231 | 42 | F | College | Less than $40K | 6 | 4 | 3050.0 | 1824 | 0.771 | 1973 | 50 | 1.381 | 0.598 | no |
3 | 5263 | 47 | F | Unknown | Less than $40K | 3 | 3 | 1561.0 | 0 | 0.502 | 1947 | 28 | 0.556 | 0.000 | yes |
4 | 2072 | 44 | M | Uneducated | $80K – $120K | 1 | 3 | 25428.0 | 1528 | 0.725 | 13360 | 97 | 0.796 | 0.060 | no |
5 | 7412 | 54 | M | Graduate | $60K – $80K | 3 | 3 | 2947.0 | 2216 | 0.760 | 1744 | 53 | 0.606 | 0.752 | no |
6..7087 | |||||||||||||||
7088 | 7932 | 57 | F | High School | Less than $40K | 5 | 3 | 3191.0 | 2517 | 0.719 | 1501 | 35 | 0.591 | 0.789 | no |
Set Up Basic Recipe
This recipe starts with the model formula, imputes the mean for numeric predictors, and converts nominal variables to dummy variables. The id column is also removed. Note that churned is described by ., which means “all other variables”.
basic_rec <- recipe(churned ~ ., data = dataset) %>%
  step_impute_mean(all_numeric_predictors()) %>%
  step_dummy(all_nominal_predictors()) %>%
  step_rm(id)
Prep and Bake the Dataset
The prep and bake functions apply the recipe to the supplied dataset and return a tibble that can be passed to h2o.
baked_dataset_tbl <- basic_rec %>%
  prep() %>%
  bake(dataset)

baked_dataset_tbl %>%
  gt_preview() %>%
  cols_align("center")
 | customer_age | total_relationship_count | months_inactive_12_mon | credit_limit | total_revolving_bal | total_amt_chng_q4_q1 | total_trans_amt | total_trans_ct | total_ct_chng_q4_q1 | avg_utilization_ratio | churned | gender_M | education_level_Doctorate | education_level_Graduate | education_level_High.School | education_level_Post.Graduate | education_level_Uneducated | education_level_Unknown | income_category_X.40K....60K | income_category_X.60K....80K | income_category_X.80K....120K | income_category_Less.than..40K | income_category_Unknown |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 27 | 3 | 2 | 1438.3 | 990 | 0.715 | 3855 | 73 | 1.147 | 0.688 | no | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
2 | 42 | 6 | 4 | 3050.0 | 1824 | 0.771 | 1973 | 50 | 1.381 | 0.598 | no | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
3 | 47 | 3 | 3 | 1561.0 | 0 | 0.502 | 1947 | 28 | 0.556 | 0.000 | yes | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 |
4 | 44 | 1 | 3 | 25428.0 | 1528 | 0.725 | 13360 | 97 | 0.796 | 0.060 | no | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 |
5 | 54 | 3 | 3 | 2947.0 | 2216 | 0.760 | 1744 | 53 | 0.606 | 0.752 | no | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
6..7087 | |||||||||||||||||||||||
7088 | 57 | 5 | 3 | 3191.0 | 2517 | 0.719 | 1501 | 35 | 0.591 | 0.789 | no | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
Working in h2o
Convert Dataset to h2o
h2o must first be initialized and then the data can be coerced to h2o type. Using the h2o.describe() function shows a nice summary of the dataset and verifies that it was imported correctly.
h2o.init()

## Connection successful!
##
## R is connected to the H2O cluster:
##     H2O cluster uptime:           8 hours 4 minutes
##     H2O cluster timezone:         America/Los_Angeles
##     H2O data parsing timezone:    UTC
##     H2O cluster version:          3.32.1.3
##     H2O cluster version age:      2 months and 5 days
##     H2O cluster name:             H2O_started_from_R_kingr17_afp345
##     H2O cluster total nodes:      1
##     H2O cluster total memory:     0.85 GB
##     H2O cluster total cores:      4
##     H2O cluster allowed cores:    4
##     H2O cluster healthy:          TRUE
##     H2O Connection ip:            localhost
##     H2O Connection port:          54321
##     H2O Connection proxy:         NA
##     H2O Internal Security:        FALSE
##     H2O API Extensions:           Amazon S3, Algos, AutoML, Core V3, TargetEncoder, Core V4
##     R Version:                    R version 4.0.3 (2020-10-10)

train_h2_tbl <- as.h2o(baked_dataset_tbl)

## |======================================================================| 100%

h2o.describe(train_h2_tbl) %>%
  gt() %>%
  cols_align("center")
Label | Type | Missing | Zeros | PosInf | NegInf | Min | Max | Mean | Sigma | Cardinality |
---|---|---|---|---|---|---|---|---|---|---|
customer_age | int | 0 | 0 | 0 | 0 | 26.0 | 73.000 | 4.630742e+01 | 7.9929340 | NA |
total_relationship_count | int | 0 | 0 | 0 | 0 | 1.0 | 6.000 | 3.818426e+00 | 1.5516740 | NA |
months_inactive_12_mon | int | 0 | 22 | 0 | 0 | 0.0 | 6.000 | 2.331546e+00 | 1.0103784 | NA |
credit_limit | real | 0 | 0 | 0 | 0 | 1438.3 | 34516.000 | 8.703875e+03 | 9152.3072697 | NA |
total_revolving_bal | int | 0 | 1705 | 0 | 0 | 0.0 | 2517.000 | 1.169674e+03 | 815.4734902 | NA |
total_amt_chng_q4_q1 | real | 0 | 5 | 0 | 0 | 0.0 | 3.355 | 7.601713e-01 | 0.2213876 | NA |
total_trans_amt | int | 0 | 0 | 0 | 0 | 563.0 | 18484.000 | 4.360548e+03 | 3339.1008390 | NA |
total_trans_ct | int | 0 | 0 | 0 | 0 | 10.0 | 139.000 | 6.465378e+01 | 23.3431055 | NA |
total_ct_chng_q4_q1 | real | 0 | 7 | 0 | 0 | 0.0 | 3.500 | 7.126414e-01 | 0.2363882 | NA |
avg_utilization_ratio | real | 0 | 1705 | 0 | 0 | 0.0 | 0.999 | 2.753678e-01 | 0.2755652 | NA |
churned | enum | 0 | 5956 | 0 | 0 | 0.0 | 1.000 | 1.597065e-01 | 0.3663595 | 2 |
gender_M | int | 0 | 3714 | 0 | 0 | 0.0 | 1.000 | 4.760158e-01 | 0.4994597 | NA |
education_level_Doctorate | int | 0 | 6762 | 0 | 0 | 0.0 | 1.000 | 4.599323e-02 | 0.2094852 | NA |
education_level_Graduate | int | 0 | 4876 | 0 | 0 | 0.0 | 1.000 | 3.120767e-01 | 0.4633737 | NA |
education_level_High.School | int | 0 | 5710 | 0 | 0 | 0.0 | 1.000 | 1.944131e-01 | 0.3957761 | NA |
education_level_Post.Graduate | int | 0 | 6745 | 0 | 0 | 0.0 | 1.000 | 4.839165e-02 | 0.2146075 | NA |
education_level_Uneducated | int | 0 | 6050 | 0 | 0 | 0.0 | 1.000 | 1.464447e-01 | 0.3535764 | NA |
education_level_Unknown | int | 0 | 6005 | 0 | 0 | 0.0 | 1.000 | 1.527935e-01 | 0.3598137 | NA |
income_category_X.40K....60K | int | 0 | 5865 | 0 | 0 | 0.0 | 1.000 | 1.725451e-01 | 0.3778802 | NA |
income_category_X.60K....80K | int | 0 | 6064 | 0 | 0 | 0.0 | 1.000 | 1.444695e-01 | 0.3515900 | NA |
income_category_X.80K....120K | int | 0 | 6007 | 0 | 0 | 0.0 | 1.000 | 1.525113e-01 | 0.3595411 | NA |
income_category_Less.than..40K | int | 0 | 4621 | 0 | 0 | 0.0 | 1.000 | 3.480530e-01 | 0.4763865 | NA |
income_category_Unknown | int | 0 | 6316 | 0 | 0 | 0.0 | 1.000 | 1.089165e-01 | 0.3115564 | NA |
Specify the Response and Predictors
In h2o we must identify the response column and the predictor columns, which we do here. Unfortunately, I don’t believe tidyselect helpers work here, so we use base R instead.
y <- "churned"
x <- setdiff(names(train_h2_tbl), y)
AutoML Search and Optimization
Now we start the AutoML session. You can specify the stopping rule as either a total number of models or a total duration in seconds. Since we’re simulating a timed Sliced competition, we’ll use max time. The longest I observed competitors training for was about 20 minutes, so we’ll use that here and grab a coffee while it chugs. Notice that in this API we are not specifying any particular type of model or any hyperparameter ranges to optimize over.
# aml <- h2o.automl(
#   y = y,
#   x = x,
#   training_frame = train_h2_tbl,
#   project_name = "sliced_ep7_refactored_25bjuly2021",
#   max_runtime_secs = 1200,
#   seed = 07252021
# )
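For reference, if we wanted to cap the search by model count rather than wall-clock time, the call would swap max_runtime_secs for h2o’s max_models argument. This is a hypothetical alternative that was not used for this run:

# aml <- h2o.automl(
#   y = y,
#   x = x,
#   training_frame = train_h2_tbl,
#   max_models = 20,  # stop after 20 models instead of after 20 minutes
#   seed = 07252021
# )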
Leaderboard of Best Model
Not to be confused with the competition leaderboard, h2o will produce a “leaderboard” of models that it evaluated and ranked as part of its session. Here we access and observe the leaderboard and its best models.
# leaderboard_tbl <- aml@leaderboard %>% as_tibble()
# a <- leaderboard_tbl %>% gt_preview()
# a
 | model_id | auc | logloss | aucpr | mean_per_class_error | rmse | mse |
---|---|---|---|---|---|---|---|
1 | StackedEnsemble_AllModels_AutoML_20210725_011103 | 0.9918194 | 0.08285245 | 0.9636402 | 0.06007334 | 0.1543060 | 0.02381035 |
2 | GBM_2_AutoML_20210725_011103 | 0.9914487 | 0.08636589 | 0.9626320 | 0.05620012 | 0.1569434 | 0.02463124 |
3 | StackedEnsemble_BestOfFamily_AutoML_20210725_011103 | 0.9914280 | 0.08451327 | 0.9625496 | 0.05620012 | 0.1556844 | 0.02423764 |
4 | GBM_grid__1_AutoML_20210725_011103_model_2 | 0.9913165 | 0.09009359 | 0.9616467 | 0.06286946 | 0.1600684 | 0.02562190 |
5 | GBM_4_AutoML_20210725_011103 | 0.9912436 | 0.09035254 | 0.9602742 | 0.07244469 | 0.1599096 | 0.02557107 |
6..37 | |||||||
38 | GLM_1_AutoML_20210725_011103 | 0.9152440 | 0.24794654 | 0.7267223 | 0.18132441 | 0.2727786 | 0.07440815 |
Extract Top Model
As expected, the top slot is an ensemble (in my limited experience it usually is). This is the one we’ll use to predict churn on the holdout set and submit to Kaggle for our competition results. The ensemble model is extracted and stored as follows:
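A minimal way to do that extraction (one approach, assuming the aml object from the AutoML run above) is to grab the @leader slot, which holds the top model on the leaderboard:

# Store the leaderboard's best model for predicting on the holdout set
top_model <- aml@leader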
To prepare the holdout data predictions, we apply the basic recipe and convert to h2o, just as before with the training data.
Pre-Process the Holdout Set
We want the basic recipe applied to the holdout set so the model sees the same type of predictor variables when it goes to make predictions.
holdout_h2o_tbl <- basic_rec %>%
  prep() %>%
  bake(holdout) %>%
  as.h2o()

## |======================================================================| 100%
Make Predictions
Predictions are made using h2o.predict().
top_model_basic_preds <- h2o.predict(top_model, newdata = holdout_h2o_tbl) %>%
  as_tibble() %>%
  bind_cols(holdout) %>%
  select(id, attrition_flag = yes)

## |======================================================================| 100%

top_model_basic_preds %>%
  gt_preview()
 | id | attrition_flag |
---|---|---|
1 | 3005 | 0.0020805123 |
2 | 143 | 0.8462792238 |
3 | 5508 | 0.0003131364 |
4 | 6474 | 0.0001977184 |
5 | 9784 | 0.0017571210 |
6..3038 | ||
3039 | 9822 | 0.9895064264 |
Nice! A list of predictions for each id in the holdout set. Let’s write it to file and submit.
Export Results
top_model_basic_preds %>% write_csv(here("run1_basic_ep7.csv"))
Scoring the Submission
To submit the file for assessment, simply upload the csv to the Kaggle interface by browsing and selecting or by dragging. In a few seconds our LogLoss on the holdout set is revealed. The one we care about is the private score.
A private score of 0.06921! This is honestly way better than I expected for such minimal processing. This submission would have slotted me into 3rd place out of 31 entries, just behind eventual Episode 7 winner Ethan Douglas. And remember, I didn’t need to specify any modeling engine or range of hyperparameters to tune!
Understanding the Ensemble
Just so we aren’t totally naive, let’s dig into this model a bit and see what h2o built. For an ensemble, we want to interrogate the “metalearner”, which can be thought of as the model that combines the component models. These are S4 objects, which require the @ operator to dig into the different slots/items (I just learned this about 90 seconds ago).
Extract the Ensemble Metalearner Model
metalearner_model <- h2o.getModel(top_model@model$metalearner$name)

metalearner_model@model$model_summary %>%
  as_tibble() %>%
  gt_preview() %>%
  cols_align("center")
 | family | link | regularization | number_of_predictors_total | number_of_active_predictors | number_of_iterations | training_frame |
---|---|---|---|---|---|---|---|
1 | binomial | logit | Elastic Net (alpha = 0.5, lambda = 0.00311 ) | 36 | 12 | 7 | levelone_training_StackedEnsemble_AllModels_AutoML_20210725_011103 |
Looks like we have 36 different models combined in a GLM using Elastic Net regularization. The component models are either GBM or Deep Learning, and each appears to have been trained over a different hyperparameter grid. I’ll put the full model output in the Appendix. Here is a list of the models that compose the ensemble and their relative importance to it; the top-performing models take a greater weight.
Importance of Each Contributing Model in the Ensemble
h2o.varimp(metalearner_model) %>%
  as_tibble() %>%
  gt_preview() %>%
  cols_align("center")
 | variable | relative_importance | scaled_importance | percentage |
---|---|---|---|---|
1 | GBM_2_AutoML_20210725_011103 | 0.7014266 | 1.0000000 | 0.16199734 |
2 | GBM_grid__1_AutoML_20210725_011103_model_18 | 0.6565727 | 0.9360534 | 0.15163815 |
3 | GBM_grid__1_AutoML_20210725_011103_model_9 | 0.6336205 | 0.9033312 | 0.14633725 |
4 | GBM_grid__1_AutoML_20210725_011103_model_19 | 0.4374311 | 0.6236307 | 0.10102651 |
5 | GBM_grid__1_AutoML_20210725_011103_model_3 | 0.4047249 | 0.5770026 | 0.09347288 |
6..35 | ||||
36 | GLM_1_AutoML_20210725_011103 | 0.0000000 | 0.0000000 | 0.00000000 |
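As a quick check on the mix of algorithm families feeding the ensemble, we could also tally the component models by the prefix of their names. This is a small sketch of my own, not part of the original analysis; it reuses the metalearner’s variable-importance table:

# Count the ensemble's component models by algorithm prefix (GBM, DeepLearning, GLM, ...)
h2o.varimp(metalearner_model) %>%
  as_tibble() %>%
  mutate(algo = str_extract(variable, "^[A-Za-z]+")) %>%
  count(algo, sort = TRUE)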
We could also visualize the data above for scaled importance of each model within the ensemble:
h2o.varimp_plot(metalearner_model)
Importance of Features in the Original Dataset
Now let’s dig into the best individual model a bit to understand the parameters and feature importance of the original dataset. The top individual model is extracted and the variable importance can be displayed just like we did for the ensemble components.
top_individual_model <- h2o.getModel(metalearner_model@model$names[1])

metalearner_model@model$names[1]
## [1] "GBM_2_AutoML_20210725_011103"

h2o.varimp(top_individual_model) %>%
  as_tibble() %>%
  gt() %>%
  cols_align("center")
variable | relative_importance | scaled_importance | percentage |
---|---|---|---|
total_trans_ct | 972.9781494 | 1.0000000000 | 0.2535628117 |
total_trans_amt | 773.8638306 | 0.7953558166 | 0.2016726572 |
total_ct_chng_q4_q1 | 567.8966675 | 0.5836684697 | 0.1479966183 |
total_revolving_bal | 500.5438232 | 0.5144450814 | 0.1304441413 |
total_relationship_count | 341.5622864 | 0.3510482600 | 0.0890127838 |
total_amt_chng_q4_q1 | 175.8822327 | 0.1807668885 | 0.0458357605 |
customer_age | 160.4104614 | 0.1648654305 | 0.0418037421 |
avg_utilization_ratio | 132.2019196 | 0.1358734722 | 0.0344524597 |
credit_limit | 72.8332520 | 0.0748559996 | 0.0189806977 |
months_inactive_12_mon | 70.7272568 | 0.0726915161 | 0.0184318652 |
gender_M | 37.5323524 | 0.0385747126 | 0.0097811126 |
income_category_X.80K....120K | 6.2846680 | 0.0064592077 | 0.0016378149 |
income_category_X.40K....60K | 6.1469312 | 0.0063176456 | 0.0016019200 |
education_level_High.School | 3.8736489 | 0.0039812291 | 0.0010094916 |
education_level_Graduate | 3.1741629 | 0.0032623167 | 0.0008272022 |
education_level_Unknown | 2.5144935 | 0.0025843268 | 0.0006552892 |
income_category_Less.than..40K | 1.8054181 | 0.0018555588 | 0.0004705007 |
education_level_Uneducated | 1.7907501 | 0.0018404834 | 0.0004666781 |
income_category_Unknown | 1.4599973 | 0.0015005448 | 0.0003804824 |
income_category_X.60K....80K | 1.4241761 | 0.0014637288 | 0.0003711472 |
education_level_Post.Graduate | 1.4217629 | 0.0014612486 | 0.0003705183 |
education_level_Doctorate | 0.8990833 | 0.0009240529 | 0.0002343054 |
h2o.varimp_plot(top_individual_model)
If we were interested in the hyperparameters of this individual GBM model, we could look at them like this:
top_individual_model@parameters ## $model_id ## [1] "GBM_2_AutoML_20210725_011103" ## ## $training_frame ## [1] "automl_training_baked_dataset_tbl_sid_914f_1" ## ## $nfolds ## [1] 5 ## ## $keep_cross_validation_models ## [1] FALSE ## ## $keep_cross_validation_predictions ## [1] TRUE ## ## $score_tree_interval ## [1] 5 ## ## $fold_assignment ## [1] "Modulo" ## ## $ntrees ## [1] 101 ## ## $max_depth ## [1] 7 ## ## $stopping_metric ## [1] "logloss" ## ## $stopping_tolerance ## [1] 0.01187786 ## ## $seed ## [1] 7252024 ## ## $distribution ## [1] "bernoulli" ## ## $sample_rate ## [1] 0.8 ## ## $col_sample_rate ## [1] 0.8 ## ## $col_sample_rate_per_tree ## [1] 0.8 ## ## $histogram_type ## [1] "UniformAdaptive" ## ## $categorical_encoding ## [1] "Enum" ## ## $x ## [1] "customer_age" "total_relationship_count" ## [3] "months_inactive_12_mon" "credit_limit" ## [5] "total_revolving_bal" "total_amt_chng_q4_q1" ## [7] "total_trans_amt" "total_trans_ct" ## [9] "total_ct_chng_q4_q1" "avg_utilization_ratio" ## [11] "gender_M" "education_level_Doctorate" ## [13] "education_level_Graduate" "education_level_High.School" ## [15] "education_level_Post.Graduate" "education_level_Uneducated" ## [17] "education_level_Unknown" "income_category_X.40K....60K" ## [19] "income_category_X.60K....80K" "income_category_X.80K....120K" ## [21] "income_category_Less.than..40K" "income_category_Unknown" ## ## $y ## [1] "churned"
Final Thoughts
My takeaway is that h2o AutoML is really quite powerful for purely predictive tasks and is even feasible for a timed competition like Sliced. I look forward to the upcoming playoffs and will be interested to try h2o on some new datasets to see if we just got lucky here or if it’s really that good.
TLDR
AutoML wouldn’t have won this competition (at least with the minimal feature engineering I did), but it sure got way closer than I expected!
Thank you for your attention!
Session Info
sessionInfo() ## R version 4.0.3 (2020-10-10) ## Platform: x86_64-w64-mingw32/x64 (64-bit) ## Running under: Windows 10 x64 (build 18363) ## ## Matrix products: default ## ## locale: ## [1] LC_COLLATE=English_United States.1252 ## [2] LC_CTYPE=English_United States.1252 ## [3] LC_MONETARY=English_United States.1252 ## [4] LC_NUMERIC=C ## [5] LC_TIME=English_United States.1252 ## ## attached base packages: ## [1] stats graphics grDevices utils datasets methods base ## ## other attached packages: ## [1] gt_0.2.2 here_1.0.0 h2o_3.32.1.3 forcats_0.5.0 ## [5] stringr_1.4.0 readr_1.4.0 tidyverse_1.3.0 yardstick_0.0.8 ## [9] workflowsets_0.0.2 workflows_0.2.3 tune_0.1.5 tidyr_1.1.3 ## [13] tibble_3.1.2 rsample_0.1.0 recipes_0.1.16 purrr_0.3.4 ## [17] parsnip_0.1.6.9000 modeldata_0.1.0 infer_0.5.4 ggplot2_3.3.5 ## [21] dplyr_1.0.7 dials_0.0.9 scales_1.1.1 broom_0.7.8 ## [25] tidymodels_0.1.3 ## ## loaded via a namespace (and not attached): ## [1] colorspace_2.0-0 ellipsis_0.3.2 class_7.3-17 rprojroot_2.0.2 ## [5] fs_1.5.0 rstudioapi_0.13 listenv_0.8.0 furrr_0.2.1 ## [9] bit64_4.0.5 prodlim_2019.11.13 fansi_0.5.0 lubridate_1.7.9.2 ## [13] xml2_1.3.2 codetools_0.2-18 splines_4.0.3 knitr_1.30 ## [17] jsonlite_1.7.1 pROC_1.16.2 dbplyr_2.0.0 compiler_4.0.3 ## [21] httr_1.4.2 backports_1.2.0 assertthat_0.2.1 Matrix_1.2-18 ## [25] cli_3.0.1 htmltools_0.5.0 tools_4.0.3 gtable_0.3.0 ## [29] glue_1.4.2 Rcpp_1.0.5 cellranger_1.1.0 DiceDesign_1.8-1 ## [33] vctrs_0.3.8 blogdown_0.15 iterators_1.0.13 timeDate_3043.102 ## [37] gower_0.2.2 xfun_0.19 globals_0.14.0 rvest_0.3.6 ## [41] lifecycle_1.0.0 future_1.20.1 MASS_7.3-53 ipred_0.9-9 ## [45] hms_0.5.3 parallel_4.0.3 yaml_2.2.1 sass_0.3.1 ## [49] rpart_4.1-15 stringi_1.5.3 foreach_1.5.1 checkmate_2.0.0 ## [53] lhs_1.1.1 hardhat_0.1.6 lava_1.6.8.1 rlang_0.4.11 ## [57] pkgconfig_2.0.3 bitops_1.0-6 evaluate_0.14 lattice_0.20-41 ## [61] bit_4.0.4 tidyselect_1.1.1 parallelly_1.21.0 plyr_1.8.6 ## [65] magrittr_2.0.1 bookdown_0.21 R6_2.5.0 generics_0.1.0 ## [69] DBI_1.1.0 pillar_1.6.1 haven_2.3.1 withr_2.3.0 ## [73] survival_3.2-7 RCurl_1.98-1.2 nnet_7.3-14 modelr_0.1.8 ## [77] crayon_1.4.1 utf8_1.2.1 rmarkdown_2.5 grid_4.0.3 ## [81] readxl_1.3.1 data.table_1.14.0 reprex_0.3.0 digest_0.6.27 ## [85] GPfit_1.0-8 munsell_0.5.0
Appendix - Full Details of the Ensemble Model
str(metalearner_model@model) ## List of 35 ## $ names : chr [1:37] "GBM_2_AutoML_20210725_011103" "GBM_grid__1_AutoML_20210725_011103_model_2" "GBM_4_AutoML_20210725_011103" "GBM_grid__1_AutoML_20210725_011103_model_11" ... ## $ original_names : NULL ## $ column_types : chr [1:37] "Numeric" "Numeric" "Numeric" "Numeric" ... ## $ domains :List of 37 ## ..$ : NULL ## ..$ : NULL ## ..$ : NULL ## ..$ : NULL ## ..$ : NULL ## ..$ : NULL ## ..$ : NULL ## ..$ : NULL ## ..$ : NULL ## ..$ : NULL ## ..$ : NULL ## ..$ : NULL ## ..$ : NULL ## ..$ : NULL ## ..$ : NULL ## ..$ : NULL ## ..$ : NULL ## ..$ : NULL ## ..$ : NULL ## ..$ : NULL ## ..$ : NULL ## ..$ : NULL ## ..$ : NULL ## ..$ : NULL ## ..$ : NULL ## ..$ : NULL ## ..$ : NULL ## ..$ : NULL ## ..$ : NULL ## ..$ : NULL ## ..$ : NULL ## ..$ : NULL ## ..$ : NULL ## ..$ : NULL ## ..$ : NULL ## ..$ : NULL ## ..$ : chr [1:2] "no" "yes" ## $ cross_validation_models : NULL ## $ cross_validation_predictions : NULL ## $ cross_validation_holdout_predictions_frame_id : NULL ## $ cross_validation_fold_assignment_frame_id : NULL ## $ model_summary :Classes 'H2OTable' and 'data.frame': 1 obs. of 7 variables: ## ..$ family : chr "binomial" ## ..$ link : chr "logit" ## ..$ regularization : chr "Elastic Net (alpha = 0.5, lambda = 0.00311 )" ## ..$ number_of_predictors_total : int 36 ## ..$ number_of_active_predictors: int 12 ## ..$ number_of_iterations : int 7 ## ..$ training_frame : chr "levelone_training_StackedEnsemble_AllModels_AutoML_20210725_011103" ## ..- attr(*, "header")= chr "GLM Model" ## ..- attr(*, "formats")= chr [1:7] "%s" "%s" "%s" "%d" ... ## ..- attr(*, "description")= chr "summary" ## $ scoring_history :Classes 'H2OTable' and 'data.frame': 2 obs. of 17 variables: ## ..$ timestamp : chr [1:2] "2021-07-25 01:26:33" "2021-07-25 01:26:33" ## ..$ duration : chr [1:2] " 0.000 sec" " 0.172 sec" ## ..$ iterations : int [1:2] 5 7 ## ..$ negative_log_likelihood : num [1:2] 571 571 ## ..$ objective : num [1:2] 0.0822 0.0837 ## ..$ alpha : num [1:2] 0.5 1 ## ..$ lambda : num [1:2] 0.00311 0.00311 ## ..$ deviance_train : num [1:2] 0.161 0.161 ## ..$ deviance_xval : num [1:2] 0.166 0.166 ## ..$ deviance_se : num [1:2] 0.075 0.0751 ## ..$ training_rmse : num [1:2] 0.152 0.152 ## ..$ training_logloss : num [1:2] 0.0806 0.0805 ## ..$ training_r2 : num [1:2] 0.828 0.828 ## ..$ training_auc : num [1:2] 0.992 0.992 ## ..$ training_pr_auc : num [1:2] 0.966 0.965 ## ..$ training_lift : num [1:2] 6.26 6.26 ## ..$ training_classification_error: num [1:2] 0.0295 0.0298 ## ..- attr(*, "header")= chr "Scoring History" ## ..- attr(*, "formats")= chr [1:17] "%s" "%s" "%d" "%.5f" ... ## ..- attr(*, "description")= chr "" ## $ cv_scoring_history :List of 5 ## ..$ :Classes 'H2OTable' and 'data.frame': 2 obs. of 23 variables: ## .. ..$ timestamp : chr [1:2] "2021-07-25 01:26:31" "2021-07-25 01:26:31" ## .. ..$ duration : chr [1:2] " 0.000 sec" " 0.205 sec" ## .. ..$ iterations : int [1:2] 5 7 ## .. ..$ negative_log_likelihood : num [1:2] 473 473 ## .. ..$ objective : num [1:2] 0.0684 0.0698 ## .. ..$ alpha : num [1:2] 0.5 1 ## .. ..$ lambda : num [1:2] 0.00311 0.00311 ## .. ..$ deviance_train : num [1:2] 0.166 0.166 ## .. ..$ deviance_test : num [1:2] 0.144 0.144 ## .. ..$ training_rmse : num [1:2] 0.154 0.154 ## .. ..$ training_logloss : num [1:2] 0.083 0.083 ## .. ..$ training_r2 : num [1:2] 0.826 0.826 ## .. ..$ training_auc : num [1:2] 0.992 0.992 ## .. ..$ training_pr_auc : num [1:2] 0.964 0.964 ## .. ..$ training_lift : num [1:2] 6.16 6.16 ## .. 
..$ training_classification_error : num [1:2] 0.0297 0.0295 ## .. ..$ validation_rmse : num [1:2] 0.147 0.147 ## .. ..$ validation_logloss : num [1:2] 0.072 0.0721 ## .. ..$ validation_r2 : num [1:2] 0.83 0.829 ## .. ..$ validation_auc : num [1:2] 0.994 0.994 ## .. ..$ validation_pr_auc : num [1:2] 0.97 0.97 ## .. ..$ validation_lift : num [1:2] 6.71 6.71 ## .. ..$ validation_classification_error: num [1:2] 0.0288 0.0295 ## .. ..- attr(*, "header")= chr "Scoring History" ## .. ..- attr(*, "formats")= chr [1:23] "%s" "%s" "%d" "%.5f" ... ## .. ..- attr(*, "description")= chr "" ## ..$ :Classes 'H2OTable' and 'data.frame': 2 obs. of 23 variables: ## .. ..$ timestamp : chr [1:2] "2021-07-25 01:26:31" "2021-07-25 01:26:32" ## .. ..$ duration : chr [1:2] " 0.000 sec" " 0.192 sec" ## .. ..$ iterations : int [1:2] 5 7 ## .. ..$ negative_log_likelihood : num [1:2] 448 448 ## .. ..$ objective : num [1:2] 0.065 0.0664 ## .. ..$ alpha : num [1:2] 0.5 1 ## .. ..$ lambda : num [1:2] 0.00311 0.00311 ## .. ..$ deviance_train : num [1:2] 0.157 0.157 ## .. ..$ deviance_test : num [1:2] 0.179 0.179 ## .. ..$ training_rmse : num [1:2] 0.151 0.151 ## .. ..$ training_logloss : num [1:2] 0.0785 0.0785 ## .. ..$ training_r2 : num [1:2] 0.831 0.831 ## .. ..$ training_auc : num [1:2] 0.993 0.993 ## .. ..$ training_pr_auc : num [1:2] 0.968 0.968 ## .. ..$ training_lift : num [1:2] 6.21 6.21 ## .. ..$ training_classification_error : num [1:2] 0.0293 0.0293 ## .. ..$ validation_rmse : num [1:2] 0.157 0.157 ## .. ..$ validation_logloss : num [1:2] 0.0894 0.0894 ## .. ..$ validation_r2 : num [1:2] 0.811 0.812 ## .. ..$ validation_auc : num [1:2] 0.99 0.99 ## .. ..$ validation_pr_auc : num [1:2] 0.954 0.954 ## .. ..$ validation_lift : num [1:2] 6.5 6.5 ## .. ..$ validation_classification_error: num [1:2] 0.0305 0.0305 ## .. ..- attr(*, "header")= chr "Scoring History" ## .. ..- attr(*, "formats")= chr [1:23] "%s" "%s" "%d" "%.5f" ... ## .. ..- attr(*, "description")= chr "" ## ..$ :Classes 'H2OTable' and 'data.frame': 2 obs. of 23 variables: ## .. ..$ timestamp : chr [1:2] "2021-07-25 01:26:32" "2021-07-25 01:26:32" ## .. ..$ duration : chr [1:2] " 0.000 sec" " 0.208 sec" ## .. ..$ iterations : int [1:2] 5 7 ## .. ..$ negative_log_likelihood : num [1:2] 440 440 ## .. ..$ objective : num [1:2] 0.0638 0.0653 ## .. ..$ alpha : num [1:2] 0.5 1 ## .. ..$ lambda : num [1:2] 0.00311 0.00311 ## .. ..$ deviance_train : num [1:2] 0.155 0.155 ## .. ..$ deviance_test : num [1:2] 0.19 0.191 ## .. ..$ training_rmse : num [1:2] 0.149 0.149 ## .. ..$ training_logloss : num [1:2] 0.0774 0.0773 ## .. ..$ training_r2 : num [1:2] 0.832 0.832 ## .. ..$ training_auc : num [1:2] 0.993 0.993 ## .. ..$ training_pr_auc : num [1:2] 0.967 0.967 ## .. ..$ training_lift : num [1:2] 6.37 6.37 ## .. ..$ training_classification_error : num [1:2] 0.0294 0.0294 ## .. ..$ validation_rmse : num [1:2] 0.166 0.166 ## .. ..$ validation_logloss : num [1:2] 0.0948 0.0953 ## .. ..$ validation_r2 : num [1:2] 0.806 0.805 ## .. ..$ validation_auc : num [1:2] 0.99 0.99 ## .. ..$ validation_pr_auc : num [1:2] 0.959 0.959 ## .. ..$ validation_lift : num [1:2] 5.86 5.86 ## .. ..$ validation_classification_error: num [1:2] 0.0314 0.0321 ## .. ..- attr(*, "header")= chr "Scoring History" ## .. ..- attr(*, "formats")= chr [1:23] "%s" "%s" "%d" "%.5f" ... ## .. ..- attr(*, "description")= chr "" ## ..$ :Classes 'H2OTable' and 'data.frame': 2 obs. of 23 variables: ## .. ..$ timestamp : chr [1:2] "2021-07-25 01:26:32" "2021-07-25 01:26:32" ## .. 
..$ duration : chr [1:2] " 0.000 sec" " 0.173 sec" ## .. ..$ iterations : int [1:2] 5 7 ## .. ..$ negative_log_likelihood : num [1:2] 470 470 ## .. ..$ objective : num [1:2] 0.0679 0.0694 ## .. ..$ alpha : num [1:2] 0.5 1 ## .. ..$ lambda : num [1:2] 0.00311 0.00311 ## .. ..$ deviance_train : num [1:2] 0.168 0.169 ## .. ..$ deviance_test : num [1:2] 0.137 0.136 ## .. ..$ training_rmse : num [1:2] 0.156 0.156 ## .. ..$ training_logloss : num [1:2] 0.0842 0.0842 ## .. ..$ training_r2 : num [1:2] 0.82 0.82 ## .. ..$ training_auc : num [1:2] 0.992 0.992 ## .. ..$ training_pr_auc : num [1:2] 0.963 0.963 ## .. ..$ training_lift : num [1:2] 6.17 6.17 ## .. ..$ training_classification_error : num [1:2] 0.0317 0.0314 ## .. ..$ validation_rmse : num [1:2] 0.136 0.136 ## .. ..$ validation_logloss : num [1:2] 0.0683 0.0682 ## .. ..$ validation_r2 : num [1:2] 0.856 0.856 ## .. ..$ validation_auc : num [1:2] 0.994 0.994 ## .. ..$ validation_pr_auc : num [1:2] 0.972 0.972 ## .. ..$ validation_lift : num [1:2] 6.61 6.61 ## .. ..$ validation_classification_error: num [1:2] 0.0206 0.0206 ## .. ..- attr(*, "header")= chr "Scoring History" ## .. ..- attr(*, "formats")= chr [1:23] "%s" "%s" "%d" "%.5f" ... ## .. ..- attr(*, "description")= chr "" ## ..$ :Classes 'H2OTable' and 'data.frame': 2 obs. of 23 variables: ## .. ..$ timestamp : chr [1:2] "2021-07-25 01:26:32" "2021-07-25 01:26:33" ## .. ..$ duration : chr [1:2] " 0.000 sec" " 0.179 sec" ## .. ..$ iterations : int [1:2] 5 7 ## .. ..$ negative_log_likelihood : num [1:2] 445 445 ## .. ..$ objective : num [1:2] 0.0645 0.0659 ## .. ..$ alpha : num [1:2] 0.5 1 ## .. ..$ lambda : num [1:2] 0.00311 0.00311 ## .. ..$ deviance_train : num [1:2] 0.157 0.157 ## .. ..$ deviance_test : num [1:2] 0.181 0.181 ## .. ..$ training_rmse : num [1:2] 0.149 0.149 ## .. ..$ training_logloss : num [1:2] 0.0784 0.0784 ## .. ..$ training_r2 : num [1:2] 0.832 0.832 ## .. ..$ training_auc : num [1:2] 0.992 0.992 ## .. ..$ training_pr_auc : num [1:2] 0.966 0.966 ## .. ..$ training_lift : num [1:2] 6.41 6.41 ## .. ..$ training_classification_error : num [1:2] 0.0275 0.0275 ## .. ..$ validation_rmse : num [1:2] 0.165 0.165 ## .. ..$ validation_logloss : num [1:2] 0.0905 0.0903 ## .. ..$ validation_r2 : num [1:2] 0.811 0.811 ## .. ..$ validation_auc : num [1:2] 0.991 0.992 ## .. ..$ validation_pr_auc : num [1:2] 0.964 0.964 ## .. ..$ validation_lift : num [1:2] 5.74 5.74 ## .. ..$ validation_classification_error: num [1:2] 0.0361 0.0369 ## .. ..- attr(*, "header")= chr "Scoring History" ## .. ..- attr(*, "formats")= chr [1:23] "%s" "%s" "%d" "%.5f" ... ## .. ..- attr(*, "description")= chr "" ## $ reproducibility_information_table :List of 3 ## ..$ :Classes 'H2OTable' and 'data.frame': 1 obs. of 26 variables: ## .. ..$ node : int 0 ## .. ..$ h2o : chr "127.0.0.1:54321" ## .. ..$ healthy : chr "true" ## .. ..$ last_ping : chr "1627201591256" ## .. ..$ num_cpus : int 4 ## .. ..$ sys_load : num 0.564 ## .. ..$ mem_value_size : num 70732923 ## .. ..$ free_mem : num 7.76e+08 ## .. ..$ pojo_mem : num 1.91e+08 ## .. ..$ swap_mem : num 0 ## .. ..$ free_disc : num 3.6e+10 ## .. ..$ max_disc : num 2.55e+11 ## .. ..$ pid : int 7916 ## .. ..$ num_keys : int 9133 ## .. ..$ tcps_active : chr "" ## .. ..$ open_fds : int -1 ## .. ..$ rpcs_active : chr "" ## .. ..$ nthreads : int 4 ## .. ..$ is_leader : chr "true" ## .. ..$ total_mem : num 6.27e+08 ## .. ..$ max_mem : num 1.04e+09 ## .. ..$ java_version : chr "Java 1.8.0_77 (from Oracle Corporation)" ## .. 
..$ jvm_launch_parameters: chr "[-Xmx1g, -ea]" ## .. ..$ os_version : chr "Windows 10 10.0 (x86)" ## .. ..$ machine_physical_mem : num 1.7e+10 ## .. ..$ machine_locale : chr "en_US" ## .. ..- attr(*, "header")= chr "Node Information" ## .. ..- attr(*, "formats")= chr [1:26] "%d" "%s" "%s" "%s" ... ## .. ..- attr(*, "description")= chr "" ## ..$ :Classes 'H2OTable' and 'data.frame': 1 obs. of 13 variables: ## .. ..$ h2o_cluster_uptime : num 1394453 ## .. ..$ h2o_cluster_timezone : chr "America/Los_Angeles" ## .. ..$ h2o_data_parsing_timezone: chr "UTC" ## .. ..$ h2o_cluster_version : chr "3.32.1.3" ## .. ..$ h2o_cluster_version_age : chr "2 months and 5 days" ## .. ..$ h2o_cluster_name : chr "H2O_started_from_R_kingr17_afp345" ## .. ..$ h2o_cluster_total_nodes : int 1 ## .. ..$ h2o_cluster_free_memory : num 7.76e+08 ## .. ..$ h2o_cluster_total_cores : int 4 ## .. ..$ h2o_cluster_allowed_cores: int 4 ## .. ..$ h2o_cluster_status : chr "locked, healthly" ## .. ..$ h2o_internal_security : chr "false" ## .. ..$ h2o_api_extensions : chr "Amazon S3, Algos, AutoML, Core V3, TargetEncoder, Core V4" ## .. ..- attr(*, "header")= chr "Cluster Configuration" ## .. ..- attr(*, "formats")= chr [1:13] "%d" "%s" "%s" "%s" ... ## .. ..- attr(*, "description")= chr "" ## ..$ :Classes 'H2OTable' and 'data.frame': 2 obs. of 3 variables: ## .. ..$ input_frame: chr [1:2] "training_frame" "validation_frame" ## .. ..$ checksum : num [1:2] 3.75e+18 -1.00 ## .. ..$ espc : chr [1:2] "[0, 1772, 3544, 5316, 7088]" "-1" ## .. ..- attr(*, "header")= chr "Input Frames Information" ## .. ..- attr(*, "formats")= chr [1:3] "%s" "%d" "%d" ## .. ..- attr(*, "description")= chr "" ## $ training_metrics :Formal class 'H2OBinomialMetrics' [package "h2o"] with 5 slots ## .. ..@ algorithm: chr "glm" ## .. ..@ on_train : logi TRUE ## .. ..@ on_valid : logi FALSE ## .. ..@ on_xval : logi FALSE ## .. ..@ metrics :List of 30 ## .. .. ..$ __meta :List of 3 ## .. .. .. ..$ schema_version: int 3 ## .. .. .. ..$ schema_name : chr "ModelMetricsBinomialGLMV3" ## .. .. .. ..$ schema_type : chr "ModelMetricsBinomialGLM" ## .. .. ..$ model :List of 4 ## .. .. .. ..$ __meta:List of 3 ## .. .. .. .. ..$ schema_version: int 3 ## .. .. .. .. ..$ schema_name : chr "ModelKeyV3" ## .. .. .. .. ..$ schema_type : chr "Key<Model>" ## .. .. .. ..$ name : chr "metalearner_AUTO_StackedEnsemble_AllModels_AutoML_20210725_011103" ## .. .. .. ..$ type : chr "Key<Model>" ## .. .. .. ..$ URL : chr "/3/Models/metalearner_AUTO_StackedEnsemble_AllModels_AutoML_20210725_011103" ## .. .. ..$ model_checksum : chr "1672530778380092240" ## .. .. ..$ frame :List of 4 ## .. .. .. ..$ __meta:List of 3 ## .. .. .. .. ..$ schema_version: int 3 ## .. .. .. .. ..$ schema_name : chr "FrameKeyV3" ## .. .. .. .. ..$ schema_type : chr "Key<Frame>" ## .. .. .. ..$ name : chr "levelone_training_StackedEnsemble_AllModels_AutoML_20210725_011103" ## .. .. .. ..$ type : chr "Key<Frame>" ## .. .. .. ..$ URL : chr "...