Kannada MNIST Prediction Classification using H2O AutoML in R

[This article was first published on r-bloggers on Programming with R, and kindly contributed to R-bloggers.]

The Kannada MNIST dataset is another MNIST-type digits dataset, this time for the Kannada (Indian) language. All details of the dataset curation have been captured in the paper titled “Kannada-MNIST: A new handwritten digits dataset for the Kannada language” by Vinay Uday Prabhu. The GitHub repo of the author can be found here.

The objective of this post is to demonstrate how to use h2o.ai’s automl function to quickly get a (better) baseline. This also makes the point that such AutoML tools help democratize the machine learning model-building process.

Loading required libraries

  • h2o – for Machine Learning
  • tidyverse – for Data Manipulation

library(h2o)
library(tidyverse)

Initializing H2O Cluster

h2o::h2o.init()

Reading Input Files (Data)

train <- read_csv("~/Documents/R Codes/Kannada-MNIST/train.csv")
test <- read_csv("~/Documents/R Codes/Kannada-MNIST/test.csv")
valid <- read_csv("~/Documents/R Codes/Kannada-MNIST/Dig-MNIST.csv")
submission <- read_csv("~/Documents/R Codes/Kannada-MNIST/sample_submission.csv")

Checking the shape / dimension of the dataframe

dim(train)

Each row holds 784 pixel values (a 28 x 28 image) plus 1 label denoting which digit it is.
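As a quick sanity check on that layout, here is a minimal sketch that uses a made-up stand-in for train rather than the real file. The column names label and pixel0 to pixel783 are assumptions for illustration only.

```r
# Toy stand-in for `train`: 3 rows, 1 label column + 784 pixel columns.
# (Made-up data; the real frame comes from train.csv.)
toy <- as.data.frame(matrix(0L, nrow = 3, ncol = 785))
names(toy) <- c("label", paste0("pixel", 0:783))

n_pixels <- ncol(toy) - 1   # every column except the label
side <- sqrt(n_pixels)      # side length if the images are square

cat("pixels per image:", n_pixels, "| image side:", side, "\n")
```

Running the same two lines against the real train frame should report the same numbers if the file has the shape described above.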

Label Count

train  %>% count(label)

Visualizing the Kannada MNIST Digits

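# Visualize the first 36 digits in a 6 x 6 grid
par(mfcol = c(6, 6))
par(mar = c(0, 0, 3, 0), xaxs = 'i', yaxs = 'i')

for (idx in 1:36) {
  # Drop the label column and reshape the 784 pixel values into a 28 x 28 matrix
  im <- matrix(unlist(train[idx, -1]), nrow = 28, ncol = 28)

  # image() draws with y increasing upwards, so reverse the columns
  # to show the digit upright (assumes pixels are stored row by row)
  image(1:28, 1:28, im[, 28:1], col = gray((0:255) / 255), main = paste(train$label[idx]))
}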
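The reshaping step is easy to get subtly wrong, so here is a small sketch on a toy 3 x 3 "image". It assumes the pixels are stored row by row, as in other MNIST-style CSVs; the numbers are made up.

```r
# Toy 3 x 3 image stored row by row (assumed layout):
#   1 2 3
#   4 5 6
#   7 8 9
pixels <- 1:9

# matrix() fills column by column, so byrow = TRUE recovers the original rows
img <- matrix(pixels, nrow = 3, ncol = 3, byrow = TRUE)

# image() draws z[i, j] at x = i, y = j with y increasing upwards, so the
# rows are reversed and the matrix transposed to display the picture upright
upright <- t(img[nrow(img):1, ])

image(1:3, 1:3, upright, col = gray((0:8) / 8))
```

In the plot, the darkest cell (value 1) should land in the top-left corner and the brightest (value 9) in the bottom-right, matching the layout in the comment.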

Converting R dataframe to H2O object which is required by H2O functions

train_h <- as.h2o(train)
test_h <- as.h2o(test)
valid_h <- as.h2o(valid)

Converting our numeric target variable into a factor for the algorithm to perform Classification

train_h$label <- as.factor(train_h$label)
valid_h$label <- as.factor(valid_h$label)
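The conversion matters because, with a numeric response, H2O would treat the task as regression rather than classification. A minimal sketch of a helper that converts the response and verifies it is shown below. The helper name as_classification_target is hypothetical and not part of h2o; it only wraps h2o.asfactor and h2o.isfactor.

```r
# Hypothetical helper (not part of h2o): convert a response column of an
# H2OFrame to a factor and confirm the conversion took effect.
as_classification_target <- function(frame, y) {
  frame[[y]] <- h2o::h2o.asfactor(frame[[y]])
  stopifnot(h2o::h2o.isfactor(frame[[y]]))
  frame
}

# Usage, assuming an H2O cluster is running and train_h exists:
# train_h <- as_classification_target(train_h, "label")
# h2o::h2o.levels(train_h$label)   # lists the digit classes
```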

Explanatory and Response Variables

x <- names(train)[-1]
y <- 'label'

AutoML in Action

aml <- h2o::h2o.automl(x = x, 
                       y = y,
                       training_frame = train_h,
                       nfolds = 3,
                       leaderboard_frame = valid_h,
                       max_runtime_secs = 1000)

nfolds denotes the number of folds for cross-validation, and max_runtime_secs is the maximum time, in seconds, that the AutoML process is allowed to run.
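To make the idea of folds concrete, here is a small base R sketch of 3-fold cross-validation on a toy set of 12 rows. It illustrates k-fold splitting in general and is not a description of how H2O assigns folds internally; the seed and the row count are arbitrary.

```r
set.seed(42)   # arbitrary seed, only so the toy split is reproducible
n <- 12        # toy number of rows
nfolds <- 3

# Randomly assign each row to one of `nfolds` equally sized folds
fold_id <- sample(rep(seq_len(nfolds), length.out = n))

for (k in seq_len(nfolds)) {
  holdout <- which(fold_id == k)   # rows scored in this round
  fit_on  <- which(fold_id != k)   # rows used for fitting in this round
  cat("fold", k, ": fit on", length(fit_on), "rows, score on", length(holdout), "rows\n")
}
```

Every row is held out exactly once, so each model is scored on data it did not see while being fitted.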

AutoML Leaderboard

The leaderboard is where AutoML lists the top-performing models.

aml@leaderboard
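Beyond printing the leaderboard, it can help to pull out the leading model and score it on the held-out frame. The sketch below wraps that in a helper; the name inspect_automl is hypothetical and not part of h2o, and it assumes aml and valid_h were created as above.

```r
# Hypothetical helper (not part of h2o): summarise an AutoML run.
inspect_automl <- function(aml, valid_h) {
  lb   <- as.data.frame(aml@leaderboard)                  # leaderboard as a plain data frame
  best <- aml@leader                                      # top-ranked model
  perf <- h2o::h2o.performance(best, newdata = valid_h)   # metrics on the held-out frame
  list(
    leaderboard = lb,
    leader_id   = best@model_id,
    confusion   = h2o::h2o.confusionMatrix(perf)          # per-digit errors
  )
}

# Usage, assuming a running cluster and a finished AutoML run:
# res <- inspect_automl(aml, valid_h)
# head(res$leaderboard)
# res$confusion
```

The confusion matrix is often more informative than a single score, since it shows which digits get mistaken for one another.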

Prediction and Submission

pred <- h2o.predict(aml, test_h)  

submission$label <- as.vector(pred$predict)

Submission (for Kaggle)

write_csv(submission, "submission_automl.csv")
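Before uploading, a quick check of the submission frame can catch obvious slips. The following is a minimal sketch; check_submission is a hypothetical helper, and the id column in the toy data is an assumption about the sample submission layout.

```r
# Hypothetical helper (not part of any package): basic checks on a submission.
check_submission <- function(submission, n_expected) {
  stopifnot(
    nrow(submission) == n_expected,                               # one row per test image
    "label" %in% names(submission),                               # required column exists
    !anyNA(submission$label),                                     # no missing predictions
    all(as.character(submission$label) %in% as.character(0:9))    # only digits 0-9
  )
  invisible(TRUE)
}

# Toy example with made-up predictions
toy <- data.frame(id = 0:2, label = c("3", "0", "9"))
check_submission(toy, n_expected = 3)
```

With the real data, the call would be check_submission(submission, nrow(test)).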

This is currently a Playground competition on Kaggle, so this submission file can be submitted to it. With the above parameters, the submission scored 0.90720 on the public leaderboard. A score of 0.90 on an MNIST-style classification task is nothing special, but I hope this code snippet can serve as a quick starter template for anyone beginning with AutoML.
