H2O is an open-source parallel processing engine for machine learning on Big Data. This prediction engine was designed by H2O, a Mountain View-based startup that has implemented a number of impressive statistical and machine learning algorithms to run on HDFS, S3, SQL and NoSQL.
We were honored to have Tom Kraljevic (Vice President of Engineering at H2O) demonstrate how this prediction engine is suited for machine learning on Big Data from within R. Yes, that’s right, from within R. Most R users will attest to running into memory issues when crunching millions or billions of data records. That’s what H2O is designed to address. So it was no surprise that most of the R users in attendance, myself included, were impressed when Tom said:
“R tells H2O to perform a task…and then H2O returns the result back to R, which is a tiny result….but you never actually transfer the data to R…That’s the magic behind the scalability of H2O with R.”
This feature appealed to me: the data never flows through R. R only needs a reference object to the H2O instance, because it uses a REST API to send commands to H2O. Data sets are not transmitted through the REST API. Instead, the user sends a command (for example, one containing an HDFS path to the data set), either through the browser or via the REST API, and H2O ingests the data directly from disk.
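As a rough illustration of this hand-off, the sketch below assumes a recent release of the h2o package from CRAN and a local H2O instance; it is not the code shown at the meetup. The file is the small prostate.csv sample bundled with the package, used only so the example runs anywhere; an HDFS or S3 path would be passed to h2o.importFile() in the same way. The variable names are illustrative.

```r
library(h2o)

# Start (or connect to) a local H2O instance; R keeps a connection to it.
h2o.init()

# Sample file shipped with the h2o package (a stand-in for an HDFS/S3 path).
path <- system.file("extdata", "prostate.csv", package = "h2o")

# R sends only the path; H2O reads and parses the file itself.
prostate <- h2o.importFile(path = path)

# `prostate` is a handle to a frame held by H2O, not an R data.frame.
class(prostate)

# Each call runs inside H2O and returns only a small result to R.
dim(prostate)
summary(prostate)
```

Calling as.data.frame(prostate) would pull the whole data set into R's memory, which is exactly the step this design lets you skip for large data.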
You can find the slides for this presentation by clicking here, or by copying and pasting the following URL into your web browser (https://github.com/h2oai/h2o-
Another takeaway from this meetup was that H2O combines extraordinary math with the backing of some of the most knowledgeable experts in machine learning: Stanford professors Trevor Hastie, Rob Tibshirani and Stephen Boyd. It is also easy to use within R. The h2o package is available on CRAN. You can get started by launching and initializing H2O from within R using a few lines of code.
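A minimal sketch of those few lines follows, assuming a recent h2o release and a Java runtime on the machine; it may differ from the snippet shown at the meetup, and the argument values are illustrative choices rather than recommendations from the talk.

```r
# Install once from CRAN if needed:
# install.packages("h2o")
library(h2o)

# Launch a local H2O instance (or attach to one already running).
# nthreads = -1 uses all available cores; max_mem_size caps the Java heap.
h2o.init(nthreads = -1, max_mem_size = "2g")

# Confirm the cluster is reachable and print its basic details.
h2o.clusterIsUp()
h2o.clusterInfo()
```

When you are finished, h2o.shutdown(prompt = FALSE) stops the local instance.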
The post Scalable Machine Learning for Big Data Using R and H2O appeared first on Data Science Las Vegas (DSLV).