The one function call you need to know as a data scientist: h2o.automl

August 30, 2017
By

(This article was first published on R – Longhow Lam's Blog, and kindly contributed to R-bloggers)

forest

Introduction

Two things that recently came to my attention were AutoML (Automatic Machine Learning) by h2o.ai and the fashion MNIST by Zalando Research. So as a test, I ran AutoML on the fashion mnist data set.

H2o AutoML

As you all know a large part of the work in predictive modeling is in preparing the data. But once you have done that, ideally you don’t want to spend too much work in trying many different machine learning models.  That’s were AutoML from h2o.ai comes in. With one function call you automate the process of training a large, diverse, selection of candidate models.

AutoML trains and cross-validates a Random Forest, an Extremely-Randomized Forest, GLM’s, Gradient Boosting Machines (GBMs) and Neural Nets. And then as “bonus” it trains a Stacked Ensemble using all of the models. The function to use in the h2o R interface is: h2o.automl. (There is also a python interface)

FashionMNIST_Benchmark = h2o.automl(
  x = 1:784,
  y = 785,
  training_frame = fashionmnist_train,
  validation_frame = fashionmninst_test
)

So the first 784 columns in the data set are used as inputs and column 785 is the column with labels. There are more input arguments that you can use. For example, maximum running time or maximum number of models to use, a stopping metric.

It can take some time to run all these models, so I have spun up a so-called high CPU droplet on Digital Ocean: 32 dedicated cores ($0.92 /h).

h2o_perf

h2o utilizing all 32 cores to create models

The output in R is an object containing the models and a ‘leaderboard‘ ranking the different models. I have the following accuracies on the fashion mnist test set.

  1. Gradient Boosting (0.90)
  2. Deep learning (0.89)
  3. Random forests (0.89)
  4. Extremely randomized forests (0.88)
  5. GLM (0.86)

There is no ensemble model, because it’s not supported yet for multi label classifiers. The deeplearning in h2o are fully connected hidden layers, for this specific Zalando images data set, you’re better of pursuing more fancy convolutional neural networks. As a comparison I just ran a simple 2 layer CNN with keras, resulting in an test accuracy of 0.92. It outperforms all the models here!

Conclusion

If you have prepared your modeling data set, the first thing you can always do now is to run h2o.automl.

Cheers, Longhow.

To leave a comment for the author, please follow the link and comment on their blog: R – Longhow Lam's Blog.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...



If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Comments are closed.

Search R-bloggers


Sponsors

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)