A Performance Benchmark of Different AutoML Frameworks

This article was first published on r-bloggers – STATWORX and kindly contributed to R-bloggers.

In a recent blog post, our CEO Sebastian Heinz wrote about Google's newest stroke of genius: AutoML Vision, a cloud service “that is able to build deep learning models for image recognition completely fully automated and from scratch”. AutoML Vision is part of the current trend towards the automation of machine learning tasks. This trend started with the automation of hyperparameter optimization for single models (including tools and services like SigOpt, Hyperopt, and SMAC), continued with automated feature engineering and selection (see my colleague Lukas' blog post about our bounceR package), and is now moving towards the full automation of complete data pipelines, including automated model stacking (a common model ensembling technique).

One company at the frontier of this development is certainly h2o.ai. They developed both a free Python/R library (H2O AutoML) and an enterprise-ready software solution called Driverless AI. But H2O is far from the only player in the field. This blog post provides a short comparison of two freely available AutoML solutions, H2O AutoML and auto-sklearn, in terms of predictive performance and general usability.

H2O AutoML

H2O AutoML is an extension of H2O's popular Java-based open source machine learning framework with APIs for Python and R. It automatically trains, tunes, and cross-validates models (including Generalized Linear Models [GLM], Gradient Boosting Machines [GBM], Random Forest [RF], Extremely Randomized Forest [XRF], and Neural Networks). Hyperparameter optimization is done using a random search over a list of reasonable parameters (RF and XRF are currently not tuned). In the end, H2O produces a leaderboard of models and builds two types of stacked ensembles from the base models: one including all base models, the other including only the best base model of each family.

Model training can be controlled by either the number of models to be trained or the total training time. Especially the latter makes model training quite transparent. One of the big advantages of H2O is that all models are parallelized out of the box.
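To make the idea of a budgeted random search more concrete, here is a minimal sketch in plain Python. It is not H2O's implementation: the function `random_search`, the search space, and the toy scoring function are all hypothetical stand-ins, and only the two budget types (number of models, total runtime) and the sorted leaderboard mirror the description above.

```python
import random
import time


def random_search(train_and_score, search_space, max_models=None,
                  max_runtime_secs=None, seed=0):
    """Sample configurations at random until one of the budgets is exhausted.

    train_and_score: callable mapping a config dict to a validation error.
    search_space: dict mapping each parameter name to a list of candidates.
    Returns a leaderboard: (error, config) pairs sorted by ascending error.
    """
    if max_models is None and max_runtime_secs is None:
        raise ValueError("Set max_models, max_runtime_secs, or both.")
    rng = random.Random(seed)
    start = time.monotonic()
    leaderboard = []
    while True:
        # Budget 1: number of models trained so far
        if max_models is not None and len(leaderboard) >= max_models:
            break
        # Budget 2: total elapsed time
        if (max_runtime_secs is not None
                and time.monotonic() - start >= max_runtime_secs):
            break
        config = {name: rng.choice(values)
                  for name, values in search_space.items()}
        leaderboard.append((train_and_score(config), config))
    return sorted(leaderboard, key=lambda entry: entry[0])


# Hypothetical search space; the toy score stands in for the
# cross-validated error of a real model.
space = {"max_depth": [3, 5, 7, 9], "learn_rate": [0.01, 0.05, 0.1]}


def toy_score(config):
    return abs(config["max_depth"] - 5) + abs(config["learn_rate"] - 0.05)


leaderboard = random_search(toy_score, space, max_models=20)
best_error, best_config = leaderboard[0]
print(best_error, best_config)
```

Swapping `max_models=20` for `max_runtime_secs=...` gives the time-based budget, which is the setting used later in this post.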


auto-sklearn

auto-sklearn is an automated machine learning toolkit based on Python's scikit-learn library. A detailed explanation of auto-sklearn can be found in Feurer et al. (2015). In H2O AutoML, each model is tuned independently and added to a leaderboard. In auto-sklearn, the authors combine model selection and hyperparameter optimization in what they call “Combined Algorithm Selection and Hyperparameter optimization” (CASH). This joint optimization problem is then solved using a tree-based Bayesian optimization method called “Sequential Model-based Algorithm Configuration” (SMAC) (see Bergstra 2011).
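For readers who prefer a formula, the CASH problem can be written roughly as follows. The notation is my reconstruction of the formulation in Feurer et al. (2015) and does not appear elsewhere in this post: $\mathcal{A}$ is the set of candidate algorithms, $\Lambda^{(j)}$ the hyperparameter space of algorithm $A^{(j)}$, $\mathcal{L}$ the validation loss, and $D^{(i)}_{\text{train}}$, $D^{(i)}_{\text{valid}}$ the $i$-th of $K$ cross-validation splits.

```latex
A^{\star}, \lambda^{\star} \in
  \operatorname*{argmin}_{A^{(j)} \in \mathcal{A},\; \lambda \in \Lambda^{(j)}}
  \frac{1}{K} \sum_{i=1}^{K}
  \mathcal{L}\left( A^{(j)}_{\lambda},\, D^{(i)}_{\text{train}},\, D^{(i)}_{\text{valid}} \right)
```

In words: the algorithm and its hyperparameters are searched jointly, and the pair with the lowest average cross-validated loss wins.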

So, contrary to H2O AutoML, auto-sklearn optimizes a complete modeling pipeline, including various data and feature preprocessing steps as well as model selection and hyperparameter optimization. Data preprocessing includes one-hot encoding, scaling, imputation, and balancing. Feature preprocessing includes, among others, feature agglomeration, ICA, and PCA. The algorithms included in auto-sklearn are similar to those in H2O AutoML, but also include more traditional methods like k-Nearest-Neighbors (kNN), Naive Bayes, and Support Vector Machines (SVM).

Similar to H2O AutoML, auto-sklearn includes a final model ensemble step. Whereas H2O AutoML uses simple but efficient model stacking, auto-sklearn uses ensemble selection, a greedy method that adds individual models iteratively to the ensemble if and only if they increase the validation performance. Like H2O, auto-sklearn allows model training to be controlled by the total training time.
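The greedy idea behind ensemble selection can be sketched in a few lines of plain Python. This is an illustration under simplifying assumptions, not auto-sklearn's actual code: the function names, the model names, and the validation predictions are made up, models may be picked more than once, and the ensemble prediction is the plain average of the picked models.

```python
def mse(y_true, y_pred):
    """Mean squared error of two equally long sequences."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def ensemble_selection(predictions, y_valid, max_size=50):
    """Greedy forward selection with replacement.

    predictions: dict mapping a model name to its validation predictions.
    Returns a dict mapping each selected model name to its ensemble weight.
    """
    selected = []
    ensemble_sum = [0.0] * len(y_valid)
    best_error = float("inf")
    while len(selected) < max_size:
        candidate_name, candidate_error = None, best_error
        new_size = len(selected) + 1
        for name, preds in predictions.items():
            # Validation error of the ensemble if this model were added
            averaged = [(s + p) / new_size
                        for s, p in zip(ensemble_sum, preds)]
            error = mse(y_valid, averaged)
            if error < candidate_error:
                candidate_name, candidate_error = name, error
        if candidate_name is None:
            break  # no model improves the validation error any further
        selected.append(candidate_name)
        ensemble_sum = [s + p for s, p
                        in zip(ensemble_sum, predictions[candidate_name])]
        best_error = candidate_error
    # dict.fromkeys keeps the order in which models were first selected
    return {name: selected.count(name) / len(selected)
            for name in dict.fromkeys(selected)}


# Made-up validation data: two models with opposite bias, one poor model
y_valid = [1.0, 2.0, 3.0, 4.0]
predictions = {
    "gbm": [1.5, 2.5, 3.5, 4.5],
    "rf": [0.5, 1.5, 2.5, 3.5],
    "glm": [2.0, 2.0, 2.0, 2.0],
}
weights = ensemble_selection(predictions, y_valid)
print(weights)
```

In this toy case the two models with opposite bias cancel each other out, so they end up with equal weight while the poor model is never added.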


In order to compare the predictive performance of H2O's AutoML with auto-sklearn, one can conduct a small simulation study. My colleague André's R package Xy offers a straightforward way to simulate regression datasets with linear, non-linear, and noisy relationships. Using multiple (ten in total) simulation runs makes the whole simulation a bit more robust. The following R code was used to simulate the data:


# Packages (dplyr is only called via dplyr::rename)
library(Xy)          # data simulation
library(caret)       # createDataPartition()
library(data.table)  # fwrite()

# Number of datasets
n_data_set <- 10

for (i in seq(n_data_set)) {
  # Sim settings
  n <- floor(runif(1, 1000, 5000))
  n_num_vars <- c(sample(2:10, 1), sample(2:10, 1))
  n_cat_vars <- c(0, 0)
  n_noise_vars <- sample(1:5, 1)
  inter_degree <- sample(2:3, 1)
  # Simulate data
  sim <- Xy(n = n,
            numvars = n_num_vars,
            catvars = n_cat_vars,
            noisevars = n_noise_vars,
            task = Xy_task(),
            nlfun = function(x) {x^2},
            interactions = 1,
            sig = c(1,4),
            cor = c(0),
            weights = c(-10,10),
            intercept = TRUE,
            stn = 4)
  # Get data and DGP
  df <- sim$data
  dgp <- sim$dgp
  # Remove Intercept
  df[, "(Intercept)"] <- NULL
  # Rename columns (strip leading zeros from the numeric suffixes)
  names(df) <- gsub("(?<![0-9])0+", "", names(df), perl = TRUE)
  # Create test/train split
  df <- dplyr::rename(df, label = y)
  in_train <- createDataPartition(y = df$label, p = 0.7, list = FALSE)
  df_train <- df[in_train, ]
  df_test <- df[-in_train, ]
  # Path names
  path_train <- paste0("../data/Xy/", i, "_train.csv")
  path_test <- paste0("../data/Xy/", i, "_test.csv")
  # Export
  fwrite(df_train, file = path_train)
  fwrite(df_test, file = path_test)
}

Since auto-sklearn is only available in Python, switching languages is necessary. Therefore, loading the raw data in Python is the next step:

import pandas as pd

# Load data
df_train = pd.read_csv("../data/Xy/1_train.csv")
df_test = pd.read_csv("../data/Xy/1_test.csv")

# Columns
cols_train = df_train.columns.tolist()
cols_test = df_test.columns.tolist()

# Target and features
y_train = df_train.loc[:, "label"]
X_train = df_train.drop("label", axis=1)

y_test = df_test.loc[:, "label"]
X_test = df_test.drop("label", axis=1)

With the data loaded in Python, the training procedure can start. To make the results comparable, both frameworks were run with similar settings where possible. This included 60 minutes of training for each dataset, 5-fold cross-validation for model evaluation and ensemble building, no preprocessing (not available in H2O AutoML and therefore deactivated in auto-sklearn), and a restriction to similar algorithms (namely GLM, RF, XRF, and GBM).
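As a reminder of what 5-fold cross-validation means for model evaluation, the following sketch splits row indices into five folds so that every row is used for validation exactly once. Both frameworks handle this internally. The helper `k_fold_indices` is a hypothetical name of mine and not part of either library.

```python
import random


def k_fold_indices(n_rows, n_folds=5, seed=0):
    """Yield (train_idx, valid_idx) pairs, one pair per fold."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into n_folds roughly equal folds
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for i, valid_idx in enumerate(folds):
        train_idx = [idx for j, fold in enumerate(folds) if j != i
                     for idx in fold]
        yield train_idx, valid_idx


splits = list(k_fold_indices(n_rows=1000, n_folds=5))
for train_idx, valid_idx in splits:
    # A model would be fitted on train_idx and scored on valid_idx here
    print(len(train_idx), len(valid_idx))
```

The cross-validated error of a model is then the average of its five validation errors.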

As previously noted, H2O supports out-of-the-box parallelization. By default, auto-sklearn only uses two cores, although it supports more, at least in theory. While there is a manual on how to do that, I was not able to get it working on my system (OS X 10.13, Python 3.6.2 Anaconda). Therefore, H2O was also limited to two cores.

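from autosklearn.regression import AutoSklearnRegressor
from autosklearn.metrics import mean_squared_error

# Settings: restrict the search to the algorithms compared in this post
# and switch off feature preprocessing
estimators_to_use = ["random_forest", "extra_trees", "gradient_boosting", "ridge_regression"]
preprocessing_to_use = ["no_preprocessing"]

# Init auto-sklearn
# (include_estimators / include_preprocessors are the argument names of the
# auto-sklearn releases available when this post was written; newer releases
# may name them differently)
auto_sklearn = AutoSklearnRegressor(time_left_for_this_task=60*60,
                                    include_estimators=estimators_to_use,
                                    include_preprocessors=preprocessing_to_use,
                                    resampling_strategy="cv",
                                    resampling_strategy_arguments={"folds": 5})

# Train models
auto_sklearn.fit(X=X_train.copy(), y=y_train.copy(), metric=mean_squared_error)
# With cross-validation, refit the selected models on the full training data
it_fits = auto_sklearn.refit(X=X_train.copy(), y=y_train.copy())

# Predict
y_hat = auto_sklearn.predict(X_test)

# Show results
print(auto_sklearn.sprint_statistics())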

import h2o
from h2o.automl import H2OAutoML

# Start h2o cluster
h2o.init(max_mem_size="8G", nthreads=2)

# Upload to h2o
df_train_h2o = h2o.H2OFrame(pd.concat([X_train, pd.DataFrame({"target": y_train})], axis=1))
df_test_h2o = h2o.H2OFrame(X_test)

features = X_train.columns.values.tolist()
target = "target"

# Training
auto_h2o = H2OAutoML(max_runtime_secs=60*60)
auto_h2o.train(x=features, y=target, training_frame=df_train_h2o)

# Leaderboard
print(auto_h2o.leaderboard)
auto_h2o = auto_h2o.leader

# Testing
df_test_hat = auto_h2o.predict(df_test_h2o)
y_hat = h2o.as_list(df_test_hat["predict"])

# Close cluster
h2o.cluster().shutdown()
The complete code, including all simulation runs and the visualization of results, can be found in my GitHub repo.


First, some words of caution: the results presented in the next sections are by no means representative. Both H2O and the authors of auto-sklearn recommend running their frameworks for hours, if not days. Given ten different datasets, this was beyond the scope of a blog post. For the same reason of feasibility, the datasets are restricted to a rather small size. For a more elaborate performance comparison, see for example Balaji and Allen (2018).

Figure 1 shows the Mean Squared Error of both frameworks produced on the test sample. The horizontal line, indicating the result from a vanilla Random Forest (from scikit-learn), serves as a benchmark. As one can see, the results are pretty similar for both frameworks and all data sets. Actually, it is a tie, with five wins for H2O and five wins for auto-sklearn.

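[Figure 1: Test-sample Mean Squared Error of H2O AutoML and auto-sklearn for each dataset; the horizontal line marks the vanilla Random Forest benchmark.]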

The percentage difference between the average errors is 1.04% in favor of auto-sklearn. Thus, auto-sklearn is on average about 1% better than H2O. Compared with the vanilla RF, H2O's AutoML is on average 23.4% better than the benchmark, while auto-sklearn is 24.6% better.
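Comparisons of this kind come down to a few lines of arithmetic. The sketch below uses made-up MSE values, not the results of this benchmark, and it assumes that the percentage is computed from the errors averaged over all datasets, relative to the reference framework. The helper names are mine.

```python
def mean(values):
    return sum(values) / len(values)


def pct_improvement(error, reference_error):
    """Percentage by which `error` is lower than `reference_error`."""
    return 100.0 * (reference_error - error) / reference_error


# Hypothetical per-dataset test MSEs (NOT the values from this benchmark)
mse_h2o = [10.0, 12.0, 8.0]
mse_autosklearn = [9.0, 12.5, 8.2]
mse_rf = [13.0, 15.0, 11.0]

# Number of datasets on which each framework has the lower test error
wins_h2o = sum(h < a for h, a in zip(mse_h2o, mse_autosklearn))
wins_autosklearn = sum(a < h for h, a in zip(mse_h2o, mse_autosklearn))

print("wins:", wins_h2o, wins_autosklearn)
print("auto-sklearn vs. H2O:",
      round(pct_improvement(mean(mse_autosklearn), mean(mse_h2o)), 2))
print("H2O vs. RF:",
      round(pct_improvement(mean(mse_h2o), mean(mse_rf)), 2))
print("auto-sklearn vs. RF:",
      round(pct_improvement(mean(mse_autosklearn), mean(mse_rf)), 2))
```

Averaging the per-dataset percentage differences instead of the errors themselves would be an equally defensible choice and can give slightly different numbers.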

The sheer closeness of the results becomes even clearer when looking at the predicted values. As an example, Figure 2 shows the predicted values for one particular dataset against all feature values (linear, non-linear, and noise features). As one can see, the estimated effects for both frameworks are almost identical and pretty close to the actual relationship.

[Figure 2: Predicted values of both frameworks for one dataset, plotted against the linear, non-linear, and noise features.]


Automatic machine learning frameworks can provide promising results for standard machine learning tasks while keeping the manual effort to a minimum. This blog post compared two popular frameworks, namely H2O's AutoML and auto-sklearn. Both reached comparable results on ten simulated datasets while clearly outperforming a vanilla Random Forest. Besides predictive performance, H2O's AutoML offers some additional features, like native parallelization, an API for R, support for XGBoost, and GPU training, which make it even more attractive.


About the author

Fabian Müller

Fabian is our team lead for Data Science and, together with his team, looks after our major corporate clients. In his free time he does a lot of sports and is a big car enthusiast.

The post A Performance Benchmark of Different AutoML Frameworks first appeared on STATWORX.
