Scalable Machine Learning for Big Data Using R and H2O

February 28, 2015

(This article was first published on Data Science Las Vegas (DSLV) » R, and kindly contributed to R-bloggers)

Part I

Part II

H2O is an open-source parallel processing engine for machine learning on Big Data. This prediction engine is designed by H2O, a Mountain View-based startup that has implemented a number of impressive statistical and machine learning algorithms to run on data stored in HDFS, S3, SQL and NoSQL.

We were honored to have Tom Kraljevic (Vice President of Engineering at H2O) demonstrate how this prediction engine is suited for machine learning on Big Data from within R. Yes, that’s right, from within R. Most R users will attest to running into memory issues when crunching millions or billions of data records. That is the problem H2O is designed to address. So it was no surprise that most of the R users in attendance, myself included, were impressed when Tom said:

“R tells H2O to perform a task… and then H2O returns the result back to R, which is a tiny result… but you never actually transfer the data to R… That’s the magic behind the scalability of H2O with R.”

This feature appealed to me: the data never flows through R! R only needs a reference object to the H2O instance, because it uses a REST API to send commands to H2O. Data sets are not transmitted through the REST API. Instead, the user sends a command containing the location of the data (for example, an HDFS path to the data set), either through the browser or via the REST API, and H2O ingests the data from disk itself.
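The pattern can be sketched in a few lines of R. This is only an illustration under some assumptions: it uses the current h2o package from CRAN (argument names in the 2015 release differed), a local H2O instance, and the small prostate.csv sample file that ships with the package in place of an HDFS path. The frame key "prostate.hex" is an arbitrary name chosen for this example.

```r
library(h2o)

# Connect to a local H2O instance (one is started if none is running).
h2o.init()

# Assumption: use the sample file bundled with the h2o package.
# On a real cluster this would be a path H2O can read, such as an HDFS path.
path <- system.file("extdata", "prostate.csv", package = "h2o")

# R sends only the path; H2O reads and parses the file itself.
prostate <- h2o.importFile(path = path, destination_frame = "prostate.hex")

# The R object is a handle to the frame held in H2O, not a copy of the data.
class(prostate)
dim(prostate)

# The mean is computed inside H2O; only the single number comes back to R.
age_mean <- h2o.mean(prostate$AGE)
age_mean
```

Here `prostate` stays a lightweight reference no matter how large the underlying file is, and only small results such as `age_mean` are returned to the R session.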

Slides for this presentation are available from the original post. You can also watch Tom’s presentation in the two videos shown above.

Another takeaway from this meetup was that H2O combines extraordinary math with the backing of some of the most knowledgeable experts in machine learning: Stanford professors Trevor Hastie, Rob Tibshirani and Stephen Boyd. It is also easy to use within R. The h2o package is available on CRAN. You can get started by launching and initializing H2O from within R using a few lines of code.
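A minimal sketch of those few lines, assuming the current h2o package from CRAN and a Java runtime on the machine. The thread and memory settings are illustrative choices, not values from the talk, and the 2015 release of the package used slightly different arguments.

```r
# One-time setup: install the package from CRAN.
# install.packages("h2o")

library(h2o)

# Start a local H2O instance, or connect to one that is already running.
# nthreads = -1 uses all available cores; max_mem_size caps the Java heap.
h2o.init(nthreads = -1, max_mem_size = "2g")

# Print basic information about the cluster R is now connected to.
h2o.clusterInfo()

# When finished, the instance can be stopped with:
# h2o.shutdown(prompt = FALSE)
```

After `h2o.init()` returns, later h2o calls in the same R session are sent to that instance.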



