GLM with H2O in R

June 27, 2017
By

(This article was first published on S+/R – Yet Another Blog in Statistical Computing, and kindly contributed to R-bloggers)

Below is an example showing how to fit a Generalized Linear Model with H2O in R. The output is much more comprehensive than the one generated by the generic R glm().

> library(h2o)

> h2o.init(max_mem_size = "12g")

> df1 <- h2o.uploadFile("Documents/credit_count.txt", header = TRUE, sep = ",", parse_type = "CSV")

> df2 <- h2o.assign(df1[df1$CARDHLDR == 1, ], "glm_df")

> h2o.colnames(df2)
 [1] "CARDHLDR" "DEFAULT"  "AGE"      "ACADMOS"  "ADEPCNT"  "MAJORDRG"
 [7] "MINORDRG" "OWNRENT"  "INCOME"   "SELFEMPL" "INCPER"   "EXP_INC"
[13] "SPENDING" "LOGSPEND"

> Y <- "DEFAULT"

> X <- c("MAJORDRG", "MINORDRG", "INCOME", "OWNRENT")

> dist <- "binomial"

> link <- "logit"

> id <- "h2o_mdl01"

> mdl <- h2o.glm(X, Y, training_frame = h2o.getFrame("glm_df"), model_id = id, family = dist, link = link, lambda = 0, compute_p_values = TRUE, standardize = FALSE)

> show(h2o.getModel(id)@model$coefficients_table)
Coefficients: glm coefficients
      names coefficients std_error    z_value  p_value
1 Intercept    -1.204439  0.090811 -13.263121 0.000000
2  MAJORDRG     0.203135  0.069250   2.933370 0.003353
3  MINORDRG     0.202727  0.047971   4.226014 0.000024
4   OWNRENT    -0.201223  0.071619  -2.809636 0.004960
5    INCOME    -0.000442  0.000040 -10.942350 0.000000

> h2o.performance(h2o.getModel(id))
H2OBinomialMetrics: glm
** Reported on training data. **

MSE:  0.08414496
RMSE:  0.2900775
LogLoss:  0.3036585
Mean Per-Class Error:  0.410972
AUC:  0.6432189
Gini:  0.2864378
R^2:  0.02005004
Residual Deviance:  6376.221
AIC:  6386.221

Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
          0    1    Error         Rate
0      7703 1800 0.189414   =1800/9503
1       630  366 0.632530     =630/996
Totals 8333 2166 0.231451  =2430/10499

Maximum Metrics: Maximum metrics at their respective thresholds
                        metric threshold    value idx
1                       max f1  0.126755 0.231499 142
2                       max f2  0.075073 0.376556 272
3                 max f0point5  0.138125 0.191828 115
4                 max accuracy  0.368431 0.905039   0
5                max precision  0.314224 0.250000   3
6                   max recall  0.006115 1.000000 399
7              max specificity  0.368431 0.999895   0
8             max absolute_mcc  0.126755 0.128940 142
9   max min_per_class_accuracy  0.106204 0.604546 196
10 max mean_per_class_accuracy  0.103730 0.605663 202

To leave a comment for the author, please follow the link and comment on their blog: S+/R – Yet Another Blog in Statistical Computing.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...



If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Comments are closed.

Search R-bloggers


Sponsors

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)