AutoML Frameworks in R & Python

[This article was first published on R – Hi! I am Nagdev, and kindly contributed to R-bloggers.]

In the last few years, AutoML, or automated machine learning, has become widely popular in the data science community. Big tech companies like Google, Amazon and Microsoft have started offering AutoML tools. Data scientists are still split on AutoML. Some fear it threatens their jobs; others believe the bigger risk is not to jobs but to the company itself, which could pay dearly for a badly built model. Still others see it as a tool for non-critical tasks or for presenting proofs of concept. Either way, it has undeniably made its mark on the data science community.

If you don’t know what AutoML is, a quick Google search will give you a good introduction. According to Wikipedia, “Automated machine learning (AutoML) is the process of automating the process of applying machine learning to real-world problems. AutoML covers the complete pipeline from the raw dataset to the deployable machine learning model”.

In this blog post, I will give my take on AutoML and introduce a few frameworks in R and Python.

Pros

  • Time saving: It’s a quick and dirty prototyping tool. If a task is not critical, you can let AutoML do the job while you focus on more critical work.
  • Benchmarking: Building an ML/DL model is fun. But, how do you know the model you have is the best? You either have to spend a lot of time in building iterative models or ask your colleague to build one and compare it. The other option is to use AutoML to benchmark yours.

Cons

  • Most AI models we come across are black boxes, and the same is true of these AutoML frameworks. If you don’t understand what you are doing, the result could be catastrophic.
  • Following from the previous point, AutoML is being marketed as a tool for non-data scientists. This is a bad move. Blindly using a model to make decisions without understanding how it works could be disastrous.

Personally, I do use AutoML frameworks for day-to-day tasks. It helps me save time and understand the techniques and tuning parameters behind these frameworks.

Now, let me introduce you to some of the top open source AutoML frameworks I have come across.

H2O

H2O definitely goes at the top of the list. The framework offers ML, deep learning and stacked ensemble models. Although it is written in Java, it offers R and Python connectors through APIs. Its best feature, one I have rarely seen elsewhere, is the “stopping time”: I can set how long I want to train my model. Below is the code for running it in R and Python on the Iris data set.

R

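# Load libraries
library(h2o)
library(magrittr)  # provides the %>% pipe used below

# NOTE: `train` and `test` are assumed to be data frames holding a split of iris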

# start h2o cluster
invisible(h2o.init())

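# set label and predictor names
y = 'Species'
pred = setdiff(names(train), y)

# convert the label to a factor before creating the h2o frames,
# otherwise the conversion never reaches the data h2o trains on
train[,y] = as.factor(train[,y])
test[,y] = as.factor(test[,y])

# convert data to h2o frames
train_h = as.h2o(train)
test_h = as.h2o(test)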

# Run AutoML: stop after 20 base models or 20 seconds, whichever comes first
aml = h2o.automl(x = pred, y = y,
                  training_frame = train_h,
                  max_models = 20,
                  seed = 1,
                  max_runtime_secs = 20
                 )

# AutoML Leaderboard
lb = aml@leaderboard
lb

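# prediction result on test data (drop the label column)
prediction = h2o.predict(aml@leader, test_h[,-5]) %>%
                         as.data.frame()

# create a confusion matrix (data = predictions, reference = actual labels)
caret::confusionMatrix(factor(prediction$predict, levels = levels(test$Species)),
                       test$Species)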

# close h2o connection
h2o.shutdown(prompt = FALSE)

Python

# load python libraries
import h2o
from h2o.automl import H2OAutoML
import pandas as pd

# start cluster
h2o.init()

# convert to h2o frames
# (r.train and r.test are the R data frames, accessed through reticulate)
traindf = h2o.H2OFrame(r.train)
testdf = h2o.H2OFrame(r.test)

y = "Species"
x = list(traindf.columns)
x.remove(y)

# convert the label column to a factor
traindf[y] = traindf[y].asfactor()
testdf[y] = testdf[y].asfactor()

#run automl
aml = H2OAutoML(max_runtime_secs = 60)
aml.train(x = x, y = y, training_frame = traindf)

# view leader board
aml.leaderboard

# predict on the test set and convert the result to a pandas data frame
predict = aml.predict(testdf)
p = predict.as_data_frame()

# combine actual and predicted labels in one data frame
data = {'actual': r.test.Species, 'Ypredict': p['predict'].tolist()}

df = pd.DataFrame(data, columns = ['actual','Ypredict'])

# create a confusion matrix and print results
confusion_matrix = pd.crosstab(df['actual'], df['Ypredict'], rownames=['Actual'], colnames=['Predicted'])
print(confusion_matrix)

# shut down the h2o cluster (h2o.shutdown() is deprecated in newer releases)
h2o.cluster().shutdown(prompt = False)
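
The “stopping time” idea is easy to picture without H2O. Below is a minimal sketch in plain Python, not H2O's actual implementation: a random search that stops when either a time budget or a model-count budget is used up, then returns a leaderboard sorted by score. The function names (`budgeted_search`, `toy_score`, `toy_sample`) and the `depth` parameter are hypothetical and exist only for this illustration; the argument names simply mirror `max_runtime_secs` and `max_models` from the code above.

```python
import random
import time


def budgeted_search(score_fn, sample_fn, max_runtime_secs=1.0, max_models=20, seed=1):
    """Evaluate random candidates until the time or model budget runs out."""
    rng = random.Random(seed)
    deadline = time.monotonic() + max_runtime_secs
    leaderboard = []
    while len(leaderboard) < max_models and time.monotonic() < deadline:
        params = sample_fn(rng)
        leaderboard.append((score_fn(params), params))
    # Highest score first, like a leaderboard
    leaderboard.sort(key=lambda row: row[0], reverse=True)
    return leaderboard


def toy_score(params):
    """Toy objective: the score peaks when the hypothetical depth is 5."""
    return -abs(params["depth"] - 5)


def toy_sample(rng):
    """Draw one hypothetical hyperparameter setting."""
    return {"depth": rng.randint(1, 10)}


leaderboard = budgeted_search(toy_score, toy_sample, max_runtime_secs=0.5, max_models=20)
best_score, best_params = leaderboard[0]
print("best:", best_params, "score:", best_score)
```

Whichever budget runs out first ends the search, which is why a small `max_runtime_secs` can return fewer models than `max_models` asks for.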

automl Package in R

The automl package is available on CRAN. It fits anything from a simple regression to highly customizable deep neural networks, using either gradient descent or a metaheuristic, with automatic hyperparameter tuning and custom cost functions. It is a mix inspired by common tricks from deep learning and particle swarm optimization. Below is sample code showing how to use it in R.

library(automl)

# train on the four features; the label is passed as numeric class codes (1, 2, 3)
amlmodel = automl_train_manual(Xref = subset(train, select = -c(Species)),
                               Yref = as.numeric(train$Species),
                               hpar = list(learningrate = 0.01,
                                           minibatchsize = 2^2,
                                           numiterations = 60))

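prediction = automl_predict(model = amlmodel, X = test[,1:4])

# round the numeric output to a class code, then map the codes back to species labels
prediction = ifelse(prediction > 2.5, 3, ifelse(prediction > 1.5, 2, 1))
prediction = factor(levels(test$Species)[as.vector(prediction)], levels = levels(test$Species))

# confusion matrix (data = predictions, reference = actual labels)
caret::confusionMatrix(prediction, test$Species)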
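
Because the model above is trained as a regression on class codes, its raw output is a continuous number that has to be cut back into classes. Here is a small Python sketch of that thresholding step, using the same cut points (1.5 and 2.5) as the R code. The raw values are made up for illustration, and the label order assumes the default alphabetical factor levels of iris.

```python
def to_class_code(value):
    """Map a continuous prediction to class code 1, 2 or 3."""
    if value > 2.5:
        return 3
    if value > 1.5:
        return 2
    return 1


# Assumed label order: default factor levels of iris$Species
species = ["setosa", "versicolor", "virginica"]

# Made-up raw regression outputs
raw_predictions = [0.9, 1.6, 2.4, 2.7, 3.4]

codes = [to_class_code(v) for v in raw_predictions]
labels = [species[code - 1] for code in codes]
print(codes)
print(labels)
```

Values outside the 1 to 3 range, such as 0.9 or 3.4, are clamped to the nearest class, which is the same behaviour as the nested `ifelse` call.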

Remix AutoML

Remix AutoML was developed by Remyx Institute. According to the developers, “This is a collection of functions that I have made to speed up machine learning and to ensure high quality modeling results and output are generated. They are great at establishing solid baselines that are extremely challenging to beat using alternative methods (if at all). They are intended to make the development cycle fast and robust, along with making operationalizing quick and easy, with low latency model scoring.” Below is sample code showing how to use it in R.

library(RemixAutoML)
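# work on a copy so the factor label in `train` stays intact for later examples
train_num = train
train_num$Species = as.integer(train_num$Species)
remixml = AutoCatBoostRegression(data = train_num %>% data.matrix()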
                                 , TargetColumnName = "Species"
                                 , FeatureColNames = c(1:4)
                                 , MaxModelsInGrid = 1
                                 , ModelID = "ModelTest"
                                 , ReturnModelObjects = F
                                 , Trees = 150
                                 , task_type = "CPU"
                                 , GridTune = FALSE
                                 )
predictions = AutoCatBoostScoring(TargetType = 'regression'
                                  , ScoringData = test %>% data.table::data.table()
                                  , FeatureColumnNames = c(1:4)
                                  , ModelObject = remixml$Model
                                   )

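# round the numeric output to a class code, then map the codes back to species labels
prediction = ifelse(predictions$Predictions > 2.5, 3, ifelse(predictions$Predictions > 1.5, 2, 1))
prediction = factor(levels(test$Species)[prediction], levels = levels(test$Species))

# confusion matrix (data = predictions, reference = actual labels)
caret::confusionMatrix(prediction, test$Species)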

AutoXGboost

The autoxgboost package aims to find an optimal xgboost model automatically, using the machine learning framework mlr and the Bayesian optimization framework mlrMBO. The development version of the package is available on GitHub. Below is sample code showing how to use it in R.

# load libraries
library(autoxgboost)
library(mlr)     # makeClassifTask()
library(mlrMBO)  # makeMBOControl(), setMBOControlTermination()

# create a classification task
trainTask = makeClassifTask(data = train, target = "Species")

# create a control object for optimizer
ctrl = makeMBOControl()
ctrl = setMBOControlTermination(ctrl, iters = 5L) 

# fit the model
res = autoxgboost(trainTask, control = ctrl, tune.threshold = FALSE)

# predict and print the confusion matrix (data = predictions, reference = actual labels)
prediction = predict(res, test[,1:4])
caret::confusionMatrix(prediction$data$response, test$Species)

Auto-sklearn

Auto-sklearn is an automated machine learning toolkit and a drop-in replacement for a scikit-learn estimator. According to the Auto-sklearn team, “auto-sklearn frees a machine learning user from algorithm selection and hyperparameter tuning. It leverages recent advantages in Bayesian optimization, meta-learning and ensemble construction. Learn more about the technology behind auto-sklearn by reading our paper published at NIPS 2015.” Note that this framework is possibly the slowest of all the frameworks presented in this post. Below is sample code showing how to use it in Python.

import autosklearn.classification
import sklearn.model_selection
import sklearn.metrics
import pandas as pd

train = pd.DataFrame(r.train)
test = pd.DataFrame(r.test)

# first four columns are the features (iloc's end index is exclusive)
x_train = train.iloc[:, 0:4]
y_train = train['Species']
print(y_train.head())
x_test = test.iloc[:, 0:4]
y_test = test['Species']
print(y_test.head())

# fit the classifier with default settings and predict on the test set
automl = autosklearn.classification.AutoSklearnClassifier()
print("fitting")
automl.fit(x_train, y_train)
y_hat = automl.predict(x_test)

# combine actual and predicted labels in one data frame
data = {'actual': r.test.Species, 'Ypredict': y_hat.tolist()}

df = pd.DataFrame(data, columns = ['actual','Ypredict'])

# create a confusion matrix and print results
confusion_matrix = pd.crosstab(df['actual'], df['Ypredict'], rownames=['Actual'], colnames=['Predicted'])
print(confusion_matrix)
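
Every example in this post ends with a confusion matrix, so it helps to see what that table actually counts. The sketch below rebuilds it in plain Python, with rows as actual labels and columns as predicted labels, the same layout as the `pd.crosstab` call above. The label lists are made up for illustration and the `confusion_matrix` helper is a hypothetical stand-in, not a pandas or caret function.

```python
from collections import Counter


def confusion_matrix(actual, predicted):
    """Count (actual, predicted) pairs; rows are actual, columns are predicted."""
    labels = sorted(set(actual) | set(predicted))
    counts = Counter(zip(actual, predicted))
    matrix = [[counts[(a, p)] for p in labels] for a in labels]
    return labels, matrix


# Made-up labels for illustration
actual = ["setosa", "setosa", "versicolor", "virginica", "virginica"]
predicted = ["setosa", "setosa", "virginica", "virginica", "virginica"]

labels, matrix = confusion_matrix(actual, predicted)
for label, row in zip(labels, matrix):
    print(label, row)

# Accuracy is the share of pairs that fall on the diagonal
accuracy = sum(matrix[i][i] for i in range(len(labels))) / len(actual)
print("accuracy:", accuracy)
```

Off-diagonal cells show which classes get confused with each other, which a single accuracy figure hides.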

Autogluon

AutoGluon is the latest offering from AWS Labs. According to the developers, “AutoGluon enables easy-to-use and easy-to-extend AutoML with a focus on deep learning and real-world applications spanning image, text, or tabular data. Intended for both ML beginners and experts, AutoGluon enables you to:

  • Quickly prototype deep learning solutions for your data with few lines of code.
  • Leverage automatic hyperparameter tuning, model selection / architecture search, and data processing.
  • Automatically utilize state-of-the-art deep learning techniques without expert knowledge.
  • Easily improve existing bespoke models and data pipelines, or customize AutoGluon for your use-case.”

Below is sample code showing how to use it in Python.

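# NOTE: this is the early AutoGluon API; newer releases expose
# TabularDataset and TabularPredictor from autogluon.tabular instead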
from autogluon import TabularPrediction as task
import pandas as pd

train_data = task.Dataset(file_path = "TRAIN_DATA.csv")
test_data = task.Dataset(file_path = "TEST_DATA.csv")

label_column = 'Species'
print("Summary of class variable: \n", train_data[label_column].describe())

predictor = task.fit(train_data = train_data, label = label_column)

y_test = test_data[label_column]  # values to predict

y_pred = predictor.predict(test_data)
print("Predictions:  ", y_pred)
perf = predictor.evaluate_predictions(y_true=y_test, y_pred=y_pred, auxiliary_metrics=True)
print(perf)

The frameworks above only scratch the surface. Honorable mentions include Auto-Keras, Deep Learning Studio, Auto-WEKA and TPOT. There are also paid tools from Dataiku, DataRobot, RapidMiner and others. As you can see, there are many open source tools you can use today, and many more open source AutoML projects are being worked on right now.

I hope you enjoyed this post. Comment below to let me know if I missed any frameworks worth mentioning. Do subscribe to this blog and check out my other posts.
