Diving into H2O


by Joseph Rickert

One of the remarkable features of the R language is its adaptability. Motivated by R’s popularity, and helped by R’s expressive power and transparency, developers working on other platforms display what looks like inexhaustible creativity in providing seamless interfaces to software that complements R’s strengths. The H2O R package that connects to 0xdata’s H2O software (Apache 2.0 License) is an example of this kind of creativity.

According to the 0xdata website, H2O is “The Open Source In-Memory, Prediction Engine for Big Data Science”. Indeed, H2O offers an impressive array of machine learning algorithms. The H2O R package provides functions for building GLM, GBM, Kmeans, Naive Bayes, Principal Components Analysis, Principal Components Regression, Random Forests and Deep Learning (multi-layer neural net) models. Examples, with timing information, of running all of these models on fairly large data sets are available on the 0xdata website. Execution speeds are very impressive. In this post, I thought I would start a little slower and look at H2O from an R point of view.

H2O is a Java Virtual Machine that is optimized for doing “in memory” processing of distributed, parallel machine learning algorithms on clusters. A “cluster” is a software construct that can be fired up on your laptop, on a server, or across the multiple nodes of a cluster of real machines, including computers that form a Hadoop cluster. According to the documentation, a cluster’s “memory capacity is the sum across all H2O nodes in the cluster”. So, as I understand it, if you were to build a 16-node cluster of machines, each having 64GB of DRAM, and you installed H2O on every node, then you could run the H2O machine learning algorithms using a terabyte of memory.

Underneath the covers, the H2O JVM sits on an in-memory, non-persistent key-value (KV) store that uses a distributed Java memory model. The KV store holds state information, all results and the big data itself. H2O keeps the data in a heap. When the heap gets full, i.e. when you are working with more data than physical DRAM, H2O swaps to disk. (See Cliff Click’s blog for the details.) The main point here is that the data is not in R. R only has a pointer to the data: an S4 object containing the IP address, port and key name for the data sitting in H2O.

The H2O R package communicates with the H2O JVM over a REST API. R sends RCurl commands and H2O sends back JSON responses. Data ingestion, however, does not happen via the REST API. Rather, an R user calls a function that causes the data to be directly parsed into the H2O KV store. The H2O R package provides several functions for doing this, including: h2o.importFile(), which imports and parses files from a local directory; h2o.importURL(), which imports and parses files from a website; and h2o.importHDFS(), which imports and parses HDFS files sitting on a Hadoop cluster.
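
As a concrete sketch, ingestion calls might look something like the lines below. The argument names are assumptions patterned on the h2o.uploadFile() call used later in this post; check help(package = "h2o") for the exact signatures in your release.

# Hypothetical ingestion sketch -- argument names are assumptions,
# patterned on h2o.uploadFile(); consult the package documentation.
library(h2o)
localH2O <- h2o.init()
# Parse a file from a local directory straight into the H2O KV store
local.hex <- h2o.importFile(localH2O, path = "C:/DATA/mydata.csv", key = "mydata.hex")
# Parse a file hosted on a website
url.hex <- h2o.importURL(localH2O, path = "http://example.com/mydata.csv", key = "urldata.hex")
# Parse HDFS files sitting on a Hadoop cluster
hdfs.hex <- h2o.importHDFS(localH2O, path = "hdfs://namenode/datasets/mydata.csv")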

So much for the background: let's get started with H2O. The first thing you need to do is to get Java running on your machine. If you don't already have Java, the default download ought to be just fine. Then fetch and install the H2O R package. Note that the h2o.jar executable is currently shipped with the h2o R package. The following code from the 0xdata website ran just fine from RStudio on my PC:

# The following two commands remove any previously installed H2O packages for R.
if ("package:h2o" %in% search()) { detach("package:h2o", unload=TRUE) }
if ("h2o" %in% rownames(installed.packages())) { remove.packages("h2o") }
 
# Next, we download, install and initialize the H2O package for R.
install.packages("h2o", repos=(c("http://s3.amazonaws.com/h2o-release/h2o/rel-kahan/5/R", getOption("repos"))))
 
library(h2o)
localH2O = h2o.init()
 
# Finally, let's run a demo to see H2O at work.
demo(h2o.glm)

Note that the function h2o.init() uses the defaults to start up H2O on your local machine. Users can also provide parameters to specify an IP address and port number in order to connect to a remote instance of H2O running on a cluster, and h2o.init(Xmx="10g") will start up the H2O KV store with 10GB of RAM. demo(h2o.glm) runs the glm demo to let you know that everything is working just fine. I will save examining the model for another time. Instead, let's look at some other H2O functionality.
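
For example, connecting to a remote instance, or asking for a bigger local heap, might look like the lines below. The ip and port argument names are my assumption; see ?h2o.init for the signature in your package version.

# Connect to an H2O instance already running on a remote machine
# (ip/port argument names assumed; consult ?h2o.init)
remoteH2O <- h2o.init(ip = "192.168.1.100", port = 54321)
# Start a local instance with 10GB of RAM for the KV store
localH2O <- h2o.init(Xmx = "10g")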

The first thing to get straight with H2O is when you are working in R and when you are working in the H2O JVM. The H2O R package implements several R functions that are wrappers to H2O native functions. “H2O supports an R-like language” (see A Note on R), but sometimes things behave differently than an R programmer might expect.

For example, the R code:

y <- apply(iris[,1:4],2,sum)
y

produces the following result:

Sepal.Length Sepal.Width Petal.Length Petal.Width 
876.5        458.6       563.7        179.9

Now, let's see how things work in H2O. The following code loads the H2O package, starts a local instance of H2O, uploads the iris data set shipped with the H2O R package into that instance, and produces a very R-like summary.

library(h2o)                # Load H2O library  
localH2O = h2o.init()       # Initialize local H2O instance
# Upload iris file from the H2O package into the H2O local instance
iris.hex <-  h2o.uploadFile(localH2O, path = system.file("extdata", "iris.csv", package="h2o"), key = "iris.hex")
summary(iris.hex)
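
Before going on, it is worth convincing yourself that iris.hex really is just a handle to data living in the JVM, not a copy of the data in R. A quick look with base R (your exact output will vary):

class(iris.hex)        # "H2OParsedData" -- an S4 reference, not a data frame
slotNames(iris.hex)    # "h2o" (client with IP and port), "key", "logic"
object.size(iris.hex)  # tiny: only the pointer lives in R's memory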

However, the apply() function from the H2O R package behaves a bit differently:

x <- apply(iris.hex[,1:4],2,sum)
x
IP Address: 127.0.0.1
Port : 54321
Parsed Data Key: Last.value.17

Instead of returning the results, it returns the attributes of the H2O object in which the results are stored. You can see this by looking at the structure of x.

str(x)
Formal class 'H2OParsedData' [package "h2o"] with 3 slots
..@ h2o :Formal class 'H2OClient' [package "h2o"] with 2 slots
.. .. ..@ ip : chr "127.0.0.1"
.. .. ..@ port: num 54321
..@ key : chr "Last.value.17"
..@ logic: logi FALSE

H2O dataset 'Last.value.17': 4 obs. of 1 variable:
$ C1: num 876.5 458.1 563.8 179.8

You can get the data out by coercing x into a data frame.

df <- as.data.frame(x)
df
C1
1 876.5
2 458.1
3 563.8
4 179.8

So, as one might expect, there are some differences that take a little getting used to. However, the focus ought not to be on the differences from R but on the potential of having some capabilities for manipulating huge data sets from within R. In combination, the H2O R package functions h2o.ddply() and h2o.addFunction(), the latter of which permits users to push a new function into the H2O JVM, do a fine job of providing some ddply() features for H2O data sets.

The following code loads one year of the airlines data set from my hard drive into the H2O instance, gives me the dimensions of the data, and lets me know what variables I have.

path <- "C:/DATA/Airlines_87_08/2008.csv"
air2008.hex <- h2o.uploadFile(localH2O, path = path,key="air2008")
dim(air2008.hex)
[1] 7009728 29

colnames(air2008.hex)

Then, using h2o.addFunction(), define a function to compute the average departure delay, and create a new H2O data set without the missing DepDelay values that would otherwise blow up the added function.

# Define a function to compute the average of column 16 (DepDelay)
fun = function(df) { sum(df[,16])/nrow(df) }
h2o.addFunction(localH2O, fun)  # Push the function to H2O
# Filter out missing values
air2008.filt = air2008.hex[!is.na(air2008.hex$DepDelay),]
head(air2008.filt)

Finally, run h2o.ddply() to get average departure delay by day of the week and pull down the results from H2O.

airlines.ddply = h2o.ddply(air2008.filt, "DayOfWeek", fun)
as.data.frame(airlines.ddply)

  DayOfWeek C1
1 2         8.976897
2 6         8.645681
3 7        11.568973
4 4         9.772897
5 1        10.269990
6 5        12.158036
7 3         8.289761

Exactly what you would expect!
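
As a sanity check, and only if the filtered data were small enough to fit comfortably in R's memory, you could pull it down and recompute the averages with base R. This is just a sketch; with seven million rows you would normally leave the data in H2O:

# Base R cross-check -- practical only when the data fits in memory
air2008.df <- as.data.frame(air2008.filt)
tapply(air2008.df$DepDelay, air2008.df$DayOfWeek, mean)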

Having h2o.ddply() limited to functions that can be pushed to H2O may seem restrictive to some. However, in the context of working with huge data sets, I don't see this as a problem. Presumably the real data cleaning and preparation will be accomplished by other tools that are appropriate for the environment (e.g. Hadoop) where the data resides. In a future post, I hope to look more closely at H2O's machine learning algorithms. As it stands, from an R perspective, H2O appears to be an impressive accomplishment and a welcome addition to the open source world.
