Could AutoML win in the ‘Sliced’ Data Science Competition? The answer may shock you!


In this post I’ll be taking a break from my normal explorations in the medical device domain to talk about Sliced. Sliced is a 2-hour data science competition streamed on Twitch and hosted by Meg Risdal and Nick Wan. Four competitors tackle a prediction problem in real time using whatever coding language or tools they prefer, grabbing bonus points along the way for Data Visualization and/or stumbling onto Golden Features (hint: always calculate the air density when training on weather data). Viewers can simply kick back and watch the contestants ply their trade, or they can actively participate by submitting their own solutions and seeing how they stack up on the competition leaderboard!

Here are my observations after watching a few episodes:

* Participants typically implement no more than 2 different model types, preferring to spend their time on Feature Engineering and tuning the hyperparameters of their preferred model

* Gradient boosting (XGBoost, CatBoost, etc.) is the dominant technique for tabular data

To clarify the first point above: the tuning is not totally manual; grid search functions are typically employed to identify the best hyperparameters from a superset of options. But the time pressure of the competition means that players can’t set up massive grids that lock up compute resources for too long. So it’s generally an iterative process over small grids that are expanded and contracted as needed, based on how the intermediate models perform on a test set.
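
For a concrete picture of that loop, here’s a minimal sketch in tidymodels; the model spec, grid ranges, and object names are illustrative rather than taken from any particular episode:

library(tidymodels)

# Illustrative: a coarse XGBoost grid to seed the iterative tune/inspect loop
xgb_spec <- boost_tree(trees = 500, tree_depth = tune(), learn_rate = tune()) %>%
  set_engine("xgboost") %>%
  set_mode("classification")

small_grid <- grid_regular(
  tree_depth(range = c(3L, 8L)),
  learn_rate(range = c(-2, -1)), # dials treats learn_rate on the log10 scale
  levels = 3                     # 3 x 3 = 9 candidates, quick to evaluate
)

# with a workflow and resamples in hand (not shown), each iteration is:
# tune_res <- tune_grid(xgb_workflow, resamples = folds, grid = small_grid)
# show_best(tune_res, metric = "mn_log_loss") # then widen or narrow the ranges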

All this led me to wonder: given the somewhat manual process of hyperparameter optimization and the restricted number of model types… how would AutoML fare in Sliced? The rest of this post will attempt to answer that question, at least for an arbitrary Sliced episode.

Setup

For this exercise we’ll use the dataset and metrics from Episode 7, in which we are asked to predict whether or not a bank customer churned. The scoring metric is LogLoss (sketched below for reference). I’ll be using the free version of the h2o.ai framework and will take the following approach to feature engineering and variable selection:

  • All variables will be used (churn explained by everything), with no feature engineering except imputing means for missing values, converting nominal predictors to dummy variables, and removing the ID column. This should give a fair look at how h2o does given the bare minimum of attention to pre-processing and no attention to model selection or hyperparameter ranges.
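
Since LogLoss is the yardstick here, it’s worth recalling what it measures: it rewards well-calibrated probabilities and punishes confident mistakes severely. A minimal hand-rolled R version (my own helper, not from any package) looks like this:

# binary LogLoss: y is the 0/1 truth, p the predicted probability of class 1
log_loss <- function(y, p, eps = 1e-15) {
  p <- pmin(pmax(p, eps), 1 - eps) # clip so log() never sees exactly 0 or 1
  -mean(y * log(p) + (1 - y) * log(1 - p))
}

log_loss(y = c(1, 0, 1), p = c(0.9, 0.2, 0.6)) # lower is better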

Let’s get to it.

Load libraries

library(tidymodels)
library(tidyverse)
library(h2o)
library(here)
library(gt)

Load dataset

Read the data in from CSV and recode the attrition flag as a yes/no factor named churned.

dataset <- read_csv(here("ep7/train.csv")) %>%
  mutate(churned = case_when(
    attrition_flag == 1 ~ "yes",
    TRUE ~ "no"
  ) %>% as_factor()) %>%
  select(-attrition_flag) # drop the original 0/1 flag

holdout <- read_csv(here("ep7/test.csv"))

dataset %>%
  gt_preview() %>% 
  cols_align("center")
id customer_age gender education_level income_category total_relationship_count months_inactive_12_mon credit_limit total_revolving_bal total_amt_chng_q4_q1 total_trans_amt total_trans_ct total_ct_chng_q4_q1 avg_utilization_ratio churned
1 8805 27 F Post-Graduate Less than $40K 3 2 1438.3 990 0.715 3855 73 1.147 0.688 no
2 4231 42 F College Less than $40K 6 4 3050.0 1824 0.771 1973 50 1.381 0.598 no
3 5263 47 F Unknown Less than $40K 3 3 1561.0 0 0.502 1947 28 0.556 0.000 yes
4 2072 44 M Uneducated $80K – $120K 1 3 25428.0 1528 0.725 13360 97 0.796 0.060 no
5 7412 54 M Graduate $60K – $80K 3 3 2947.0 2216 0.760 1744 53 0.606 0.752 no
6..7087
7088 7932 57 F High School Less than $40K 5 3 3191.0 2517 0.719 1501 35 0.591 0.789 no

Set Up Basic Recipe

This recipe starts with the model formula, imputes the mean for numeric predictors, and converts nominal variables to dummies. The id column is also removed. Note that churned ~ . means churned is modeled by all remaining variables.

basic_rec <- recipe(churned ~ .,
  data = dataset
) %>%
  step_impute_mean(all_numeric_predictors()) %>%
  step_dummy(all_nominal_predictors()) %>%
  step_rm(id)

Prep and Bake the Dataset

Prepping and baking apply the recipe to the supplied dataset, producing a tibble we can pass along to h2o.

baked_dataset_tbl <- basic_rec %>%
  prep() %>%
  bake(new_data = dataset)

baked_dataset_tbl %>%
  gt_preview() %>% 
  cols_align("center")
customer_age total_relationship_count months_inactive_12_mon credit_limit total_revolving_bal total_amt_chng_q4_q1 total_trans_amt total_trans_ct total_ct_chng_q4_q1 avg_utilization_ratio churned gender_M education_level_Doctorate education_level_Graduate education_level_High.School education_level_Post.Graduate education_level_Uneducated education_level_Unknown income_category_X.40K….60K income_category_X.60K….80K income_category_X.80K….120K income_category_Less.than..40K income_category_Unknown
1 27 3 2 1438.3 990 0.715 3855 73 1.147 0.688 no 0 0 0 0 1 0 0 0 0 0 1 0
2 42 6 4 3050.0 1824 0.771 1973 50 1.381 0.598 no 0 0 0 0 0 0 0 0 0 0 1 0
3 47 3 3 1561.0 0 0.502 1947 28 0.556 0.000 yes 0 0 0 0 0 0 1 0 0 0 1 0
4 44 1 3 25428.0 1528 0.725 13360 97 0.796 0.060 no 1 0 0 0 0 1 0 0 0 1 0 0
5 54 3 3 2947.0 2216 0.760 1744 53 0.606 0.752 no 1 0 1 0 0 0 0 0 1 0 0 0
6..7087
7088 57 5 3 3191.0 2517 0.719 1501 35 0.591 0.789 no 0 0 0 1 0 0 0 0 0 0 1 0

Working in h2o

Convert Dataset to h2o

h2o must first be initialized, and then the data can be coerced to an h2o frame. The h2o.describe() function shows a nice summary of the dataset and verifies that it was imported correctly.

h2o.init()
##  Connection successful!
## 
## R is connected to the H2O cluster: 
##     H2O cluster uptime:         8 hours 4 minutes 
##     H2O cluster timezone:       America/Los_Angeles 
##     H2O data parsing timezone:  UTC 
##     H2O cluster version:        3.32.1.3 
##     H2O cluster version age:    2 months and 5 days  
##     H2O cluster name:           H2O_started_from_R_kingr17_afp345 
##     H2O cluster total nodes:    1 
##     H2O cluster total memory:   0.85 GB 
##     H2O cluster total cores:    4 
##     H2O cluster allowed cores:  4 
##     H2O cluster healthy:        TRUE 
##     H2O Connection ip:          localhost 
##     H2O Connection port:        54321 
##     H2O Connection proxy:       NA 
##     H2O Internal Security:      FALSE 
##     H2O API Extensions:         Amazon S3, Algos, AutoML, Core V3, TargetEncoder, Core V4 
##     R Version:                  R version 4.0.3 (2020-10-10)
train_h2_tbl <- as.h2o(baked_dataset_tbl)
h2o.describe(train_h2_tbl) %>% 
  gt() %>% 
  cols_align("center")
Label Type Missing Zeros PosInf NegInf Min Max Mean Sigma Cardinality
customer_age int 0 0 0 0 26.0 73.000 4.630742e+01 7.9929340 NA
total_relationship_count int 0 0 0 0 1.0 6.000 3.818426e+00 1.5516740 NA
months_inactive_12_mon int 0 22 0 0 0.0 6.000 2.331546e+00 1.0103784 NA
credit_limit real 0 0 0 0 1438.3 34516.000 8.703875e+03 9152.3072697 NA
total_revolving_bal int 0 1705 0 0 0.0 2517.000 1.169674e+03 815.4734902 NA
total_amt_chng_q4_q1 real 0 5 0 0 0.0 3.355 7.601713e-01 0.2213876 NA
total_trans_amt int 0 0 0 0 563.0 18484.000 4.360548e+03 3339.1008390 NA
total_trans_ct int 0 0 0 0 10.0 139.000 6.465378e+01 23.3431055 NA
total_ct_chng_q4_q1 real 0 7 0 0 0.0 3.500 7.126414e-01 0.2363882 NA
avg_utilization_ratio real 0 1705 0 0 0.0 0.999 2.753678e-01 0.2755652 NA
churned enum 0 5956 0 0 0.0 1.000 1.597065e-01 0.3663595 2
gender_M int 0 3714 0 0 0.0 1.000 4.760158e-01 0.4994597 NA
education_level_Doctorate int 0 6762 0 0 0.0 1.000 4.599323e-02 0.2094852 NA
education_level_Graduate int 0 4876 0 0 0.0 1.000 3.120767e-01 0.4633737 NA
education_level_High.School int 0 5710 0 0 0.0 1.000 1.944131e-01 0.3957761 NA
education_level_Post.Graduate int 0 6745 0 0 0.0 1.000 4.839165e-02 0.2146075 NA
education_level_Uneducated int 0 6050 0 0 0.0 1.000 1.464447e-01 0.3535764 NA
education_level_Unknown int 0 6005 0 0 0.0 1.000 1.527935e-01 0.3598137 NA
income_category_X.40K….60K int 0 5865 0 0 0.0 1.000 1.725451e-01 0.3778802 NA
income_category_X.60K….80K int 0 6064 0 0 0.0 1.000 1.444695e-01 0.3515900 NA
income_category_X.80K….120K int 0 6007 0 0 0.0 1.000 1.525113e-01 0.3595411 NA
income_category_Less.than..40K int 0 4621 0 0 0.0 1.000 3.480530e-01 0.4763865 NA
income_category_Unknown int 0 6316 0 0 0.0 1.000 1.089165e-01 0.3115564 NA

Specify the Response and Predictors

In h2o we must identify the response column and the predictors by name, which we do here. Unfortunately, tidyselect helpers don’t work on h2o frames (as far as I know), so base R’s setdiff() does the job.

y <- "churned"
x <- setdiff(names(train_h2_tbl), y)

AutoML Search and Optimization

Now we start the AutoML session. You can specify the stopping rule by either a total number of models or a total duration in seconds. Since we’re simulating a timed Sliced competition, we’ll use max time. The longest I observed competitors training for was about 20 minutes, so we’ll use that here and then grab a coffee while it chugs. Notice that in this API we are not specifying any particular type of model or any hyperparameter ranges to optimize over.

aml <- h2o.automl(
  y = y,
  x = x,
  training_frame = train_h2_tbl,
  project_name = "sliced_ep7_refactored_25bjuly2021",
  max_runtime_secs = 1200, # 20 minutes, matching the competition pace
  seed = 07252021
)
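
As an aside, if you ever need to steer the search, say to skip the slower Deep Learning models, h2o.automl() also accepts an exclude_algos argument. A variation on the call above (not run here):

# not run: the same search, minus algorithm families we want to skip
# aml_fast <- h2o.automl(
#   y = y, x = x,
#   training_frame = train_h2_tbl,
#   max_runtime_secs = 1200,
#   exclude_algos = c("DeepLearning", "StackedEnsemble")
# )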

Leaderboard of Best Models

Not to be confused with the competition leaderboard, h2o produces a “leaderboard” of the models it evaluated and ranked during the session. Here we access it and look at the best models.

leaderboard_tbl <- aml@leaderboard %>% as_tibble()

leaderboard_tbl %>% gt_preview()
model_id auc logloss aucpr mean_per_class_error rmse mse
1 StackedEnsemble_AllModels_AutoML_20210725_011103 0.9918194 0.08285245 0.9636402 0.06007334 0.1543060 0.02381035
2 GBM_2_AutoML_20210725_011103 0.9914487 0.08636589 0.9626320 0.05620012 0.1569434 0.02463124
3 StackedEnsemble_BestOfFamily_AutoML_20210725_011103 0.9914280 0.08451327 0.9625496 0.05620012 0.1556844 0.02423764
4 GBM_grid__1_AutoML_20210725_011103_model_2 0.9913165 0.09009359 0.9616467 0.06286946 0.1600684 0.02562190
5 GBM_4_AutoML_20210725_011103 0.9912436 0.09035254 0.9602742 0.07244469 0.1599096 0.02557107
6..37
38 GLM_1_AutoML_20210725_011103 0.9152440 0.24794654 0.7267223 0.18132441 0.2727786 0.07440815
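
If the default leaderboard columns aren’t enough, h2o also provides h2o.get_leaderboard(), which can append extra diagnostics such as per-model training time:

# optional: the leaderboard again, with all extra columns appended
h2o.get_leaderboard(aml, extra_columns = "ALL") %>%
  as_tibble() %>%
  gt_preview()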

Extract Top Model

As expected, the top slot is an ensemble (in my limited experience it usually is). This is the one we’ll use to predict churn on the holdout set and submit to kaggle for our competition results. The ensemble model is extracted and stored as follows:
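
# the best model from the run lives in the @leader slot of the AutoML object
top_model <- aml@leader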

To prepare the holdout predictions, we apply the basic recipe and convert to h2o, just as we did with the training data.

Pre-Process the Holdout Set

We want the basic recipe applied to the holdout set so the model sees the same type of predictor variables when it goes to make predictions.

holdout_h2o_tbl <- basic_rec %>%
  prep() %>%
  bake(holdout) %>%
  as.h2o()

Make Predictions

Predictions are made using h2o.predict().

top_model_basic_preds <- h2o.predict(top_model, newdata = holdout_h2o_tbl) %>%
  as_tibble() %>%
  bind_cols(holdout) %>%
  # the "yes" column holds the predicted probability of churning
  select(id, attrition_flag = yes)
top_model_basic_preds %>%
  gt_preview() 
id attrition_flag
1 3005 0.0020805123
2 143 0.8462792238
3 5508 0.0003131364
4 6474 0.0001977184
5 9784 0.0017571210
6..3038
3039 9822 0.9895064264

Nice! A list of predictions for each id in the holdout set. Let’s write it to file and submit.

Export Results

top_model_basic_preds %>%
  write_csv(here("run1_basic_ep7.csv"))

Scoring the Submission

To submit the file for assessment, simply upload the CSV to the interface on kaggle by browsing and selecting or dragging. In a few seconds our LogLoss on the holdout set is revealed. The one we care about is the private score.

A private score of 0.06921! This is honestly way better than I expected for such minimal processing. This submission would have slotted me into 3rd place out of 31 entries, just behind eventual Episode 7 winner Ethan Douglas. And remember, I didn’t need to specify any modeling engine or range of hyperparameters to tune!
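
For future runs, there’s also a way to sanity-check before burning a kaggle submission: assuming cross-validation metrics were kept for the leader (AutoML keeps them by default), we can pull the cross-validated LogLoss locally:

# cross-validated performance of the leader model, no kaggle round-trip needed
cv_perf <- h2o.performance(top_model, xval = TRUE)
h2o.logloss(cv_perf)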

Understanding the Ensemble

Just so we aren’t totally naive, let’s dig into this model a bit and see what h2o built. For an ensemble, we want to interrogate the “metalearner”, which can be thought of as the model made up of the other models. These are S4 objects, which require the @ operator to dig into the different slots (I just learned this about 90 seconds ago).
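
(If you’re as new to S4 as I am: base R’s slotNames() is a quick way to see which slots an object exposes before you start poking around with @.)

# list the S4 slots available on the leader model object
slotNames(top_model)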

Extract the Ensemble Metalearner Model

metalearner_model <- h2o.getModel(top_model@model$metalearner$name)

metalearner_model@model$model_summary %>% 
  as_tibble() %>%
  gt_preview() %>% 
  cols_align("center")
family link regularization number_of_predictors_total number_of_active_predictors number_of_iterations training_frame
1 binomial logit Elastic Net (alpha = 0.5, lambda = 0.00311 ) 36 12 7 levelone_training_StackedEnsemble_AllModels_AutoML_20210725_011103

Looks like we have 36 component models combined in a GLM using Elastic Net regularization (only 12 of them ended up as active predictors). The component models are mostly GBMs, with some Deep Learning models and a GLM mixed in, and it looks like they each had different hyperparameter grids searched across. I’ll put the full model output in the Appendix. Here is a list of the models that compose the ensemble and their relative importance to it; the top performing models take a greater weight.

Importance of Each Contributing Model in the Ensemble

h2o.varimp(metalearner_model) %>%
  as_tibble() %>%
  gt_preview() %>%
  cols_align("center")
variable relative_importance scaled_importance percentage
1 GBM_2_AutoML_20210725_011103 0.7014266 1.0000000 0.16199734
2 GBM_grid__1_AutoML_20210725_011103_model_18 0.6565727 0.9360534 0.15163815
3 GBM_grid__1_AutoML_20210725_011103_model_9 0.6336205 0.9033312 0.14633725
4 GBM_grid__1_AutoML_20210725_011103_model_19 0.4374311 0.6236307 0.10102651
5 GBM_grid__1_AutoML_20210725_011103_model_3 0.4047249 0.5770026 0.09347288
6..35
36 GLM_1_AutoML_20210725_011103 0.0000000 0.0000000 0.00000000

We can also visualize the scaled importance of each model within the ensemble:

h2o.varimp_plot(metalearner_model)

Importance of Features in the Original Dataset

Now let’s dig into the best individual model a bit to understand its parameters and the feature importance relative to the original dataset. The top individual model is extracted, and its variable importance can be displayed just as we did for the ensemble components.

top_individual_model <- h2o.getModel(metalearner_model@model$names[1])
metalearner_model@model$names[1]
## [1] "GBM_2_AutoML_20210725_011103"
h2o.varimp(top_individual_model) %>%
  as_tibble() %>%
  gt() %>% 
  cols_align("center")
variable relative_importance scaled_importance percentage
total_trans_ct 972.9781494 1.0000000000 0.2535628117
total_trans_amt 773.8638306 0.7953558166 0.2016726572
total_ct_chng_q4_q1 567.8966675 0.5836684697 0.1479966183
total_revolving_bal 500.5438232 0.5144450814 0.1304441413
total_relationship_count 341.5622864 0.3510482600 0.0890127838
total_amt_chng_q4_q1 175.8822327 0.1807668885 0.0458357605
customer_age 160.4104614 0.1648654305 0.0418037421
avg_utilization_ratio 132.2019196 0.1358734722 0.0344524597
credit_limit 72.8332520 0.0748559996 0.0189806977
months_inactive_12_mon 70.7272568 0.0726915161 0.0184318652
gender_M 37.5323524 0.0385747126 0.0097811126
income_category_X.80K....120K 6.2846680 0.0064592077 0.0016378149
income_category_X.40K....60K 6.1469312 0.0063176456 0.0016019200
education_level_High.School 3.8736489 0.0039812291 0.0010094916
education_level_Graduate 3.1741629 0.0032623167 0.0008272022
education_level_Unknown 2.5144935 0.0025843268 0.0006552892
income_category_Less.than..40K 1.8054181 0.0018555588 0.0004705007
education_level_Uneducated 1.7907501 0.0018404834 0.0004666781
income_category_Unknown 1.4599973 0.0015005448 0.0003804824
income_category_X.60K....80K 1.4241761 0.0014637288 0.0003711472
education_level_Post.Graduate 1.4217629 0.0014612486 0.0003705183
education_level_Doctorate 0.8990833 0.0009240529 0.0002343054
h2o.varimp_plot(top_individual_model)

If we were interested in the hyperparameters of this individual GBM model, we could look at them like this:

top_individual_model@parameters
## $model_id
## [1] "GBM_2_AutoML_20210725_011103"
## 
## $training_frame
## [1] "automl_training_baked_dataset_tbl_sid_914f_1"
## 
## $nfolds
## [1] 5
## 
## $keep_cross_validation_models
## [1] FALSE
## 
## $keep_cross_validation_predictions
## [1] TRUE
## 
## $score_tree_interval
## [1] 5
## 
## $fold_assignment
## [1] "Modulo"
## 
## $ntrees
## [1] 101
## 
## $max_depth
## [1] 7
## 
## $stopping_metric
## [1] "logloss"
## 
## $stopping_tolerance
## [1] 0.01187786
## 
## $seed
## [1] 7252024
## 
## $distribution
## [1] "bernoulli"
## 
## $sample_rate
## [1] 0.8
## 
## $col_sample_rate
## [1] 0.8
## 
## $col_sample_rate_per_tree
## [1] 0.8
## 
## $histogram_type
## [1] "UniformAdaptive"
## 
## $categorical_encoding
## [1] "Enum"
## 
## $x
##  [1] "customer_age"                   "total_relationship_count"      
##  [3] "months_inactive_12_mon"         "credit_limit"                  
##  [5] "total_revolving_bal"            "total_amt_chng_q4_q1"          
##  [7] "total_trans_amt"                "total_trans_ct"                
##  [9] "total_ct_chng_q4_q1"            "avg_utilization_ratio"         
## [11] "gender_M"                       "education_level_Doctorate"     
## [13] "education_level_Graduate"       "education_level_High.School"   
## [15] "education_level_Post.Graduate"  "education_level_Uneducated"    
## [17] "education_level_Unknown"        "income_category_X.40K....60K"  
## [19] "income_category_X.60K....80K"   "income_category_X.80K....120K" 
## [21] "income_category_Less.than..40K" "income_category_Unknown"       
## 
## $y
## [1] "churned"
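
If you wanted to reproduce just this model outside of AutoML, h2o.gbm() accepts these same parameters. A sketch using the values from the dump above (anything not listed stays at its default):

# illustrative refit of the top GBM with the AutoML-chosen parameters
gbm_refit <- h2o.gbm(
  x = x, y = y,
  training_frame = train_h2_tbl,
  ntrees = 101, max_depth = 7,
  sample_rate = 0.8, col_sample_rate = 0.8, col_sample_rate_per_tree = 0.8,
  distribution = "bernoulli",
  nfolds = 5, seed = 7252024
)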

Final Thoughts

My takeaway is that h2o AutoML is really quite powerful for purely predictive tasks and is even feasible for a timed competition like Sliced. I look forward to the upcoming playoffs and will be interested to try h2o on some new datasets to see if we just got lucky here, or if it’s really that good.

TLDR

AutoML wouldn’t have won this competition (at least with the minimal feature engineering I did), but it sure got way closer than I expected!

Thank you for your attention!

Session Info

sessionInfo()
## R version 4.0.3 (2020-10-10)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 18363)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=English_United States.1252 
## [2] LC_CTYPE=English_United States.1252   
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.1252    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] gt_0.2.2           here_1.0.0         h2o_3.32.1.3       forcats_0.5.0     
##  [5] stringr_1.4.0      readr_1.4.0        tidyverse_1.3.0    yardstick_0.0.8   
##  [9] workflowsets_0.0.2 workflows_0.2.3    tune_0.1.5         tidyr_1.1.3       
## [13] tibble_3.1.2       rsample_0.1.0      recipes_0.1.16     purrr_0.3.4       
## [17] parsnip_0.1.6.9000 modeldata_0.1.0    infer_0.5.4        ggplot2_3.3.5     
## [21] dplyr_1.0.7        dials_0.0.9        scales_1.1.1       broom_0.7.8       
## [25] tidymodels_0.1.3  
## 
## loaded via a namespace (and not attached):
##  [1] colorspace_2.0-0   ellipsis_0.3.2     class_7.3-17       rprojroot_2.0.2   
##  [5] fs_1.5.0           rstudioapi_0.13    listenv_0.8.0      furrr_0.2.1       
##  [9] bit64_4.0.5        prodlim_2019.11.13 fansi_0.5.0        lubridate_1.7.9.2 
## [13] xml2_1.3.2         codetools_0.2-18   splines_4.0.3      knitr_1.30        
## [17] jsonlite_1.7.1     pROC_1.16.2        dbplyr_2.0.0       compiler_4.0.3    
## [21] httr_1.4.2         backports_1.2.0    assertthat_0.2.1   Matrix_1.2-18     
## [25] cli_3.0.1          htmltools_0.5.0    tools_4.0.3        gtable_0.3.0      
## [29] glue_1.4.2         Rcpp_1.0.5         cellranger_1.1.0   DiceDesign_1.8-1  
## [33] vctrs_0.3.8        blogdown_0.15      iterators_1.0.13   timeDate_3043.102 
## [37] gower_0.2.2        xfun_0.19          globals_0.14.0     rvest_0.3.6       
## [41] lifecycle_1.0.0    future_1.20.1      MASS_7.3-53        ipred_0.9-9       
## [45] hms_0.5.3          parallel_4.0.3     yaml_2.2.1         sass_0.3.1        
## [49] rpart_4.1-15       stringi_1.5.3      foreach_1.5.1      checkmate_2.0.0   
## [53] lhs_1.1.1          hardhat_0.1.6      lava_1.6.8.1       rlang_0.4.11      
## [57] pkgconfig_2.0.3    bitops_1.0-6       evaluate_0.14      lattice_0.20-41   
## [61] bit_4.0.4          tidyselect_1.1.1   parallelly_1.21.0  plyr_1.8.6        
## [65] magrittr_2.0.1     bookdown_0.21      R6_2.5.0           generics_0.1.0    
## [69] DBI_1.1.0          pillar_1.6.1       haven_2.3.1        withr_2.3.0       
## [73] survival_3.2-7     RCurl_1.98-1.2     nnet_7.3-14        modelr_0.1.8      
## [77] crayon_1.4.1       utf8_1.2.1         rmarkdown_2.5      grid_4.0.3        
## [81] readxl_1.3.1       data.table_1.14.0  reprex_0.3.0       digest_0.6.27     
## [85] GPfit_1.0-8        munsell_0.5.0

Appendix - Full Details of the Ensemble Model

str(metalearner_model@model)
## List of 35
##  $ names                                           : chr [1:37] "GBM_2_AutoML_20210725_011103" "GBM_grid__1_AutoML_20210725_011103_model_2" "GBM_4_AutoML_20210725_011103" "GBM_grid__1_AutoML_20210725_011103_model_11" ...
##  $ original_names                                  : NULL
##  $ column_types                                    : chr [1:37] "Numeric" "Numeric" "Numeric" "Numeric" ...
##  $ domains                                         :List of 37
##   ..$ : NULL
##   ..$ : NULL
##   ..$ : NULL
##   ..$ : NULL
##   ..$ : NULL
##   ..$ : NULL
##   ..$ : NULL
##   ..$ : NULL
##   ..$ : NULL
##   ..$ : NULL
##   ..$ : NULL
##   ..$ : NULL
##   ..$ : NULL
##   ..$ : NULL
##   ..$ : NULL
##   ..$ : NULL
##   ..$ : NULL
##   ..$ : NULL
##   ..$ : NULL
##   ..$ : NULL
##   ..$ : NULL
##   ..$ : NULL
##   ..$ : NULL
##   ..$ : NULL
##   ..$ : NULL
##   ..$ : NULL
##   ..$ : NULL
##   ..$ : NULL
##   ..$ : NULL
##   ..$ : NULL
##   ..$ : NULL
##   ..$ : NULL
##   ..$ : NULL
##   ..$ : NULL
##   ..$ : NULL
##   ..$ : NULL
##   ..$ : chr [1:2] "no" "yes"
##  $ cross_validation_models                         : NULL
##  $ cross_validation_predictions                    : NULL
##  $ cross_validation_holdout_predictions_frame_id   : NULL
##  $ cross_validation_fold_assignment_frame_id       : NULL
##  $ model_summary                                   :Classes 'H2OTable' and 'data.frame': 1 obs. of  7 variables:
##   ..$ family                     : chr "binomial"
##   ..$ link                       : chr "logit"
##   ..$ regularization             : chr "Elastic Net (alpha = 0.5, lambda = 0.00311 )"
##   ..$ number_of_predictors_total : int 36
##   ..$ number_of_active_predictors: int 12
##   ..$ number_of_iterations       : int 7
##   ..$ training_frame             : chr "levelone_training_StackedEnsemble_AllModels_AutoML_20210725_011103"
##   ..- attr(*, "header")= chr "GLM Model"
##   ..- attr(*, "formats")= chr [1:7] "%s" "%s" "%s" "%d" ...
##   ..- attr(*, "description")= chr "summary"
##  $ scoring_history                                 :Classes 'H2OTable' and 'data.frame': 2 obs. of  17 variables:
##   ..$ timestamp                    : chr [1:2] "2021-07-25 01:26:33" "2021-07-25 01:26:33"
##   ..$ duration                     : chr [1:2] " 0.000 sec" " 0.172 sec"
##   ..$ iterations                   : int [1:2] 5 7
##   ..$ negative_log_likelihood      : num [1:2] 571 571
##   ..$ objective                    : num [1:2] 0.0822 0.0837
##   ..$ alpha                        : num [1:2] 0.5 1
##   ..$ lambda                       : num [1:2] 0.00311 0.00311
##   ..$ deviance_train               : num [1:2] 0.161 0.161
##   ..$ deviance_xval                : num [1:2] 0.166 0.166
##   ..$ deviance_se                  : num [1:2] 0.075 0.0751
##   ..$ training_rmse                : num [1:2] 0.152 0.152
##   ..$ training_logloss             : num [1:2] 0.0806 0.0805
##   ..$ training_r2                  : num [1:2] 0.828 0.828
##   ..$ training_auc                 : num [1:2] 0.992 0.992
##   ..$ training_pr_auc              : num [1:2] 0.966 0.965
##   ..$ training_lift                : num [1:2] 6.26 6.26
##   ..$ training_classification_error: num [1:2] 0.0295 0.0298
##   ..- attr(*, "header")= chr "Scoring History"
##   ..- attr(*, "formats")= chr [1:17] "%s" "%s" "%d" "%.5f" ...
##   ..- attr(*, "description")= chr ""
##  $ cv_scoring_history                              :List of 5
##   ..$ :Classes 'H2OTable' and 'data.frame':  2 obs. of  23 variables:
##   .. ..$ timestamp                      : chr [1:2] "2021-07-25 01:26:31" "2021-07-25 01:26:31"
##   .. ..$ duration                       : chr [1:2] " 0.000 sec" " 0.205 sec"
##   .. ..$ iterations                     : int [1:2] 5 7
##   .. ..$ negative_log_likelihood        : num [1:2] 473 473
##   .. ..$ objective                      : num [1:2] 0.0684 0.0698
##   .. ..$ alpha                          : num [1:2] 0.5 1
##   .. ..$ lambda                         : num [1:2] 0.00311 0.00311
##   .. ..$ deviance_train                 : num [1:2] 0.166 0.166
##   .. ..$ deviance_test                  : num [1:2] 0.144 0.144
##   .. ..$ training_rmse                  : num [1:2] 0.154 0.154
##   .. ..$ training_logloss               : num [1:2] 0.083 0.083
##   .. ..$ training_r2                    : num [1:2] 0.826 0.826
##   .. ..$ training_auc                   : num [1:2] 0.992 0.992
##   .. ..$ training_pr_auc                : num [1:2] 0.964 0.964
##   .. ..$ training_lift                  : num [1:2] 6.16 6.16
##   .. ..$ training_classification_error  : num [1:2] 0.0297 0.0295
##   .. ..$ validation_rmse                : num [1:2] 0.147 0.147
##   .. ..$ validation_logloss             : num [1:2] 0.072 0.0721
##   .. ..$ validation_r2                  : num [1:2] 0.83 0.829
##   .. ..$ validation_auc                 : num [1:2] 0.994 0.994
##   .. ..$ validation_pr_auc              : num [1:2] 0.97 0.97
##   .. ..$ validation_lift                : num [1:2] 6.71 6.71
##   .. ..$ validation_classification_error: num [1:2] 0.0288 0.0295
##   .. ..- attr(*, "header")= chr "Scoring History"
##   .. ..- attr(*, "formats")= chr [1:23] "%s" "%s" "%d" "%.5f" ...
##   .. ..- attr(*, "description")= chr ""
##   ..$ :Classes 'H2OTable' and 'data.frame':  2 obs. of  23 variables:
##   .. ..$ timestamp                      : chr [1:2] "2021-07-25 01:26:31" "2021-07-25 01:26:32"
##   .. ..$ duration                       : chr [1:2] " 0.000 sec" " 0.192 sec"
##   .. ..$ iterations                     : int [1:2] 5 7
##   .. ..$ negative_log_likelihood        : num [1:2] 448 448
##   .. ..$ objective                      : num [1:2] 0.065 0.0664
##   .. ..$ alpha                          : num [1:2] 0.5 1
##   .. ..$ lambda                         : num [1:2] 0.00311 0.00311
##   .. ..$ deviance_train                 : num [1:2] 0.157 0.157
##   .. ..$ deviance_test                  : num [1:2] 0.179 0.179
##   .. ..$ training_rmse                  : num [1:2] 0.151 0.151
##   .. ..$ training_logloss               : num [1:2] 0.0785 0.0785
##   .. ..$ training_r2                    : num [1:2] 0.831 0.831
##   .. ..$ training_auc                   : num [1:2] 0.993 0.993
##   .. ..$ training_pr_auc                : num [1:2] 0.968 0.968
##   .. ..$ training_lift                  : num [1:2] 6.21 6.21
##   .. ..$ training_classification_error  : num [1:2] 0.0293 0.0293
##   .. ..$ validation_rmse                : num [1:2] 0.157 0.157
##   .. ..$ validation_logloss             : num [1:2] 0.0894 0.0894
##   .. ..$ validation_r2                  : num [1:2] 0.811 0.812
##   .. ..$ validation_auc                 : num [1:2] 0.99 0.99
##   .. ..$ validation_pr_auc              : num [1:2] 0.954 0.954
##   .. ..$ validation_lift                : num [1:2] 6.5 6.5
##   .. ..$ validation_classification_error: num [1:2] 0.0305 0.0305
##   .. ..- attr(*, "header")= chr "Scoring History"
##   .. ..- attr(*, "formats")= chr [1:23] "%s" "%s" "%d" "%.5f" ...
##   .. ..- attr(*, "description")= chr ""
##   ..$ :Classes 'H2OTable' and 'data.frame':  2 obs. of  23 variables:
##   .. ..$ timestamp                      : chr [1:2] "2021-07-25 01:26:32" "2021-07-25 01:26:32"
##   .. ..$ duration                       : chr [1:2] " 0.000 sec" " 0.208 sec"
##   .. ..$ iterations                     : int [1:2] 5 7
##   .. ..$ negative_log_likelihood        : num [1:2] 440 440
##   .. ..$ objective                      : num [1:2] 0.0638 0.0653
##   .. ..$ alpha                          : num [1:2] 0.5 1
##   .. ..$ lambda                         : num [1:2] 0.00311 0.00311
##   .. ..$ deviance_train                 : num [1:2] 0.155 0.155
##   .. ..$ deviance_test                  : num [1:2] 0.19 0.191
##   .. ..$ training_rmse                  : num [1:2] 0.149 0.149
##   .. ..$ training_logloss               : num [1:2] 0.0774 0.0773
##   .. ..$ training_r2                    : num [1:2] 0.832 0.832
##   .. ..$ training_auc                   : num [1:2] 0.993 0.993
##   .. ..$ training_pr_auc                : num [1:2] 0.967 0.967
##   .. ..$ training_lift                  : num [1:2] 6.37 6.37
##   .. ..$ training_classification_error  : num [1:2] 0.0294 0.0294
##   .. ..$ validation_rmse                : num [1:2] 0.166 0.166
##   .. ..$ validation_logloss             : num [1:2] 0.0948 0.0953
##   .. ..$ validation_r2                  : num [1:2] 0.806 0.805
##   .. ..$ validation_auc                 : num [1:2] 0.99 0.99
##   .. ..$ validation_pr_auc              : num [1:2] 0.959 0.959
##   .. ..$ validation_lift                : num [1:2] 5.86 5.86
##   .. ..$ validation_classification_error: num [1:2] 0.0314 0.0321
##   .. ..- attr(*, "header")= chr "Scoring History"
##   .. ..- attr(*, "formats")= chr [1:23] "%s" "%s" "%d" "%.5f" ...
##   .. ..- attr(*, "description")= chr ""
##   ..$ :Classes 'H2OTable' and 'data.frame':  2 obs. of  23 variables:
##   .. ..$ timestamp                      : chr [1:2] "2021-07-25 01:26:32" "2021-07-25 01:26:32"
##   .. ..$ duration                       : chr [1:2] " 0.000 sec" " 0.173 sec"
##   .. ..$ iterations                     : int [1:2] 5 7
##   .. ..$ negative_log_likelihood        : num [1:2] 470 470
##   .. ..$ objective                      : num [1:2] 0.0679 0.0694
##   .. ..$ alpha                          : num [1:2] 0.5 1
##   .. ..$ lambda                         : num [1:2] 0.00311 0.00311
##   .. ..$ deviance_train                 : num [1:2] 0.168 0.169
##   .. ..$ deviance_test                  : num [1:2] 0.137 0.136
##   .. ..$ training_rmse                  : num [1:2] 0.156 0.156
##   .. ..$ training_logloss               : num [1:2] 0.0842 0.0842
##   .. ..$ training_r2                    : num [1:2] 0.82 0.82
##   .. ..$ training_auc                   : num [1:2] 0.992 0.992
##   .. ..$ training_pr_auc                : num [1:2] 0.963 0.963
##   .. ..$ training_lift                  : num [1:2] 6.17 6.17
##   .. ..$ training_classification_error  : num [1:2] 0.0317 0.0314
##   .. ..$ validation_rmse                : num [1:2] 0.136 0.136
##   .. ..$ validation_logloss             : num [1:2] 0.0683 0.0682
##   .. ..$ validation_r2                  : num [1:2] 0.856 0.856
##   .. ..$ validation_auc                 : num [1:2] 0.994 0.994
##   .. ..$ validation_pr_auc              : num [1:2] 0.972 0.972
##   .. ..$ validation_lift                : num [1:2] 6.61 6.61
##   .. ..$ validation_classification_error: num [1:2] 0.0206 0.0206
##   .. ..- attr(*, "header")= chr "Scoring History"
##   .. ..- attr(*, "formats")= chr [1:23] "%s" "%s" "%d" "%.5f" ...
##   .. ..- attr(*, "description")= chr ""
##   ..$ :Classes 'H2OTable' and 'data.frame':  2 obs. of  23 variables:
##   .. ..$ timestamp                      : chr [1:2] "2021-07-25 01:26:32" "2021-07-25 01:26:33"
##   .. ..$ duration                       : chr [1:2] " 0.000 sec" " 0.179 sec"
##   .. ..$ iterations                     : int [1:2] 5 7
##   .. ..$ negative_log_likelihood        : num [1:2] 445 445
##   .. ..$ objective                      : num [1:2] 0.0645 0.0659
##   .. ..$ alpha                          : num [1:2] 0.5 1
##   .. ..$ lambda                         : num [1:2] 0.00311 0.00311
##   .. ..$ deviance_train                 : num [1:2] 0.157 0.157
##   .. ..$ deviance_test                  : num [1:2] 0.181 0.181
##   .. ..$ training_rmse                  : num [1:2] 0.149 0.149
##   .. ..$ training_logloss               : num [1:2] 0.0784 0.0784
##   .. ..$ training_r2                    : num [1:2] 0.832 0.832
##   .. ..$ training_auc                   : num [1:2] 0.992 0.992
##   .. ..$ training_pr_auc                : num [1:2] 0.966 0.966
##   .. ..$ training_lift                  : num [1:2] 6.41 6.41
##   .. ..$ training_classification_error  : num [1:2] 0.0275 0.0275
##   .. ..$ validation_rmse                : num [1:2] 0.165 0.165
##   .. ..$ validation_logloss             : num [1:2] 0.0905 0.0903
##   .. ..$ validation_r2                  : num [1:2] 0.811 0.811
##   .. ..$ validation_auc                 : num [1:2] 0.991 0.992
##   .. ..$ validation_pr_auc              : num [1:2] 0.964 0.964
##   .. ..$ validation_lift                : num [1:2] 5.74 5.74
##   .. ..$ validation_classification_error: num [1:2] 0.0361 0.0369
##   .. ..- attr(*, "header")= chr "Scoring History"
##   .. ..- attr(*, "formats")= chr [1:23] "%s" "%s" "%d" "%.5f" ...
##   .. ..- attr(*, "description")= chr ""
##  $ reproducibility_information_table               :List of 3
##   ..$ :Classes 'H2OTable' and 'data.frame':  1 obs. of  26 variables:
##   .. ..$ node                 : int 0
##   .. ..$ h2o                  : chr "127.0.0.1:54321"
##   .. ..$ healthy              : chr "true"
##   .. ..$ last_ping            : chr "1627201591256"
##   .. ..$ num_cpus             : int 4
##   .. ..$ sys_load             : num 0.564
##   .. ..$ mem_value_size       : num 70732923
##   .. ..$ free_mem             : num 7.76e+08
##   .. ..$ pojo_mem             : num 1.91e+08
##   .. ..$ swap_mem             : num 0
##   .. ..$ free_disc            : num 3.6e+10
##   .. ..$ max_disc             : num 2.55e+11
##   .. ..$ pid                  : int 7916
##   .. ..$ num_keys             : int 9133
##   .. ..$ tcps_active          : chr ""
##   .. ..$ open_fds             : int -1
##   .. ..$ rpcs_active          : chr ""
##   .. ..$ nthreads             : int 4
##   .. ..$ is_leader            : chr "true"
##   .. ..$ total_mem            : num 6.27e+08
##   .. ..$ max_mem              : num 1.04e+09
##   .. ..$ java_version         : chr "Java 1.8.0_77 (from Oracle Corporation)"
##   .. ..$ jvm_launch_parameters: chr "[-Xmx1g, -ea]"
##   .. ..$ os_version           : chr "Windows 10 10.0 (x86)"
##   .. ..$ machine_physical_mem : num 1.7e+10
##   .. ..$ machine_locale       : chr "en_US"
##   .. ..- attr(*, "header")= chr "Node Information"
##   .. ..- attr(*, "formats")= chr [1:26] "%d" "%s" "%s" "%s" ...
##   .. ..- attr(*, "description")= chr ""
##   ..$ :Classes 'H2OTable' and 'data.frame':  1 obs. of  13 variables:
##   .. ..$ h2o_cluster_uptime       : num 1394453
##   .. ..$ h2o_cluster_timezone     : chr "America/Los_Angeles"
##   .. ..$ h2o_data_parsing_timezone: chr "UTC"
##   .. ..$ h2o_cluster_version      : chr "3.32.1.3"
##   .. ..$ h2o_cluster_version_age  : chr "2 months and 5 days"
##   .. ..$ h2o_cluster_name         : chr "H2O_started_from_R_kingr17_afp345"
##   .. ..$ h2o_cluster_total_nodes  : int 1
##   .. ..$ h2o_cluster_free_memory  : num 7.76e+08
##   .. ..$ h2o_cluster_total_cores  : int 4
##   .. ..$ h2o_cluster_allowed_cores: int 4
##   .. ..$ h2o_cluster_status       : chr "locked, healthly"
##   .. ..$ h2o_internal_security    : chr "false"
##   .. ..$ h2o_api_extensions       : chr "Amazon S3, Algos, AutoML, Core V3, TargetEncoder, Core V4"
##   .. ..- attr(*, "header")= chr "Cluster Configuration"
##   .. ..- attr(*, "formats")= chr [1:13] "%d" "%s" "%s" "%s" ...
##   .. ..- attr(*, "description")= chr ""
##   ..$ :Classes 'H2OTable' and 'data.frame':  2 obs. of  3 variables:
##   .. ..$ input_frame: chr [1:2] "training_frame" "validation_frame"
##   .. ..$ checksum   : num [1:2] 3.75e+18 -1.00
##   .. ..$ espc       : chr [1:2] "[0, 1772, 3544, 5316, 7088]" "-1"
##   .. ..- attr(*, "header")= chr "Input Frames Information"
##   .. ..- attr(*, "formats")= chr [1:3] "%s" "%d" "%d"
##   .. ..- attr(*, "description")= chr ""
##  $ training_metrics                                :Formal class 'H2OBinomialMetrics' [package "h2o"] with 5 slots
##   .. ..@ algorithm: chr "glm"
##   .. ..@ on_train : logi TRUE
##   .. ..@ on_valid : logi FALSE
##   .. ..@ on_xval  : logi FALSE
##   .. ..@ metrics  :List of 30
##   .. .. ..$ __meta                        :List of 3
##   .. .. .. ..$ schema_version: int 3
##   .. .. .. ..$ schema_name   : chr "ModelMetricsBinomialGLMV3"
##   .. .. .. ..$ schema_type   : chr "ModelMetricsBinomialGLM"
##   .. .. ..$ model                         :List of 4
##   .. .. .. ..$ __meta:List of 3
##   .. .. .. .. ..$ schema_version: int 3
##   .. .. .. .. ..$ schema_name   : chr "ModelKeyV3"
##   .. .. .. .. ..$ schema_type   : chr "Key<Model>"
##   .. .. .. ..$ name  : chr "metalearner_AUTO_StackedEnsemble_AllModels_AutoML_20210725_011103"
##   .. .. .. ..$ type  : chr "Key<Model>"
##   .. .. .. ..$ URL   : chr "/3/Models/metalearner_AUTO_StackedEnsemble_AllModels_AutoML_20210725_011103"
##   .. .. ..$ model_checksum                : chr "1672530778380092240"
##   .. .. ..$ frame                         :List of 4
##   .. .. .. ..$ __meta:List of 3
##   .. .. .. .. ..$ schema_version: int 3
##   .. .. .. .. ..$ schema_name   : chr "FrameKeyV3"
##   .. .. .. .. ..$ schema_type   : chr "Key<Frame>"
##   .. .. .. ..$ name  : chr "levelone_training_StackedEnsemble_AllModels_AutoML_20210725_011103"
##   .. .. .. ..$ type  : chr "Key<Frame>"
##   .. .. .. ..$ URL   : chr "...
