by Joseph Rickert
The second, annual H2O World conference finished up yesterday. More than 700 people from all over the US attended the three-day event that was held at the Computer History Museum in Mountain View, California; a venue that pretty much sits well within the blast radius of ground zero for Data Science in the Silicon Valley. This was definitely a conference for practitioners and I recognized quite a few accomplished data scientists in the crowd. Unlike many other single-vendor productions, this was a genuine Data Science event and not merely a vendor showcase. H2O is a relatively small company, but they took a big league approach to the conference with an emphasis on cultivating the community of data scientists and delivering presentations and panel discussions that focused on programming, algorithms and good Data Science practice.
The R based sessions I attended on the tutorial day were all very well done. Each was designed around a carefully crafted R script performing a non-trivial model building exercise and showcasing one or more of the various algorithms in the H2O repertoire including GLMs, Gradient Boosting Machines, Random Forests and Deep Learning Neural Nets. The presentations were targeted to a sophisticated audience with considerable discussion of pros and cons. Deep Learning is probably H2O's signature algorithm, but despite its extremely impressive performance in many applications nobody here was selling it as the answer to everything.
The following code fragment from a script (Download Deeplearning)that uses deep learning to identify a spiral pattern in a data set illustrates the current look and feel of H2O's R interface. Any function that begins with h2o. runs in the JVM not in the R environment. (Also note that if you want to run the code you must first install Java on your machine, the Java Runtime Environment will do. Then, download the H2O R package Version 22.214.171.124 from the company's website. The scripts will not run with the older version of the package on CRAN.)
### Cover Type Dataset #We important the full cover type dataset (581k rows, 13 columns, 10 numerical, 3 categorical). #We also split the data 3 ways: 60% for training, 20% for validation (hyper parameter tuning) and 20% for final testing. # df <- h2o.importFile(path = normalizePath("../data/covtype.full.csv")) dim(df) df splits <- h2o.splitFrame(df, c(0.6,0.2), seed=1234) train <- h2o.assign(splits[], "train.hex") # 60% valid <- h2o.assign(splits[], "valid.hex") # 20% test <- h2o.assign(splits[], "test.hex") # 20% # #Here's a scalable way to do scatter plots via binning (works for categorical and numeric columns) to get more familiar with the dataset. # #dev.new(noRStudioGD=FALSE) #direct plotting output to a new window par(mfrow=c(1,1)) # reset canvas plot(h2o.tabulate(df, "Elevation", "Cover_Type")) plot(h2o.tabulate(df, "Horizontal_Distance_To_Roadways", "Cover_Type")) plot(h2o.tabulate(df, "Soil_Type", "Cover_Type")) plot(h2o.tabulate(df, "Horizontal_Distance_To_Roadways", "Elevation" )) # #### First Run of H2O Deep Learning #Let's run our first Deep Learning model on the covtype dataset. #We want to predict the `Cover_Type` column, a categorical feature with 7 levels, and the Deep Learning model will be tasked to perform (multi-class) classification. It uses the other 12 predictors of the dataset, of which 10 are numerical, and 2 are categorical with a total of 44 levels. We can expect the Deep Learning model to have 56 input neurons (after automatic one-hot encoding). # response <- "Cover_Type" predictors <- setdiff(names(df), response) predictors # #To keep it fast, we only run for one epoch (one pass over the training data). # m1 <- h2o.deeplearning( model_id="dl_model_first", training_frame=train, validation_frame=valid, ## validation dataset: used for scoring and early stopping x=predictors, y=response, #activation="Rectifier", ## default #hidden=c(200,200), ## default: 2 hidden layers with 200 neurons each epochs=1, variable_importances=T ## not enabled by default ) summary(m1) # #Inspect the model in [Flow](http://localhost:54321/) for more information about model building etc. by issuing a cell with the content `getModel "dl_model_first"`, and pressing Ctrl-Enter. # #### Variable Importances #Variable importances for Neural Network models are notoriously difficult to compute, and there are many [pitfalls](ftp://ftp.sas.com/pub/neural/importance.html). H2O Deep Learning has implemented the method of [Gedeon](http://cs.anu.edu.au/~./Tom.Gedeon/pdfs/ContribDataMinv2.pdf), and returns relative variable importances in descending order of importance. # head(as.data.frame(h2o.varimp(m1))) # #### Early Stopping #Now we run another, smaller network, and we let it stop automatically once the misclassification rate converges (specifically, if the moving average of length 2 does not improve by at least 1% for 2 consecutive scoring events). We also sample the validation set to 10,000 rows for faster scoring. # m2 <- h2o.deeplearning( model_id="dl_model_faster", training_frame=train, validation_frame=valid, x=predictors, y=response, hidden=c(32,32,32), ## small network, runs faster epochs=1000000, ## hopefully converges earlier... score_validation_samples=10000, ## sample the validation dataset (faster) stopping_rounds=2, stopping_metric="misclassification", ## could be "MSE","logloss","r2" stopping_tolerance=0.01 ) summary(m2) plot(m2)
First notice that it all looks pretty much like R code. The script mixes standard R functions and H2O functions in a natural way. For example, h20.tabulate() produces an object of class "list" and h20.deeplearning() yields a model object that plot can deal with. This is just really baseline stuff that has to happen to provide make H2O coding feel like R. But note that the H2O code goes beyond this baseline requirement. The functions h2o.splitFrame() and h2o.assign() manipulate data residing in the JVM in a way that will probably seem natural to most R users, and the function signatures also seem to be close enough "R like" to go unnoticed. All of this reflects the conscious intent of the H2O designers not only to provide tools to facilitate the manipulation of H2O data from the R environment, but also to try and replicate the R experience.
An innovative new feature of the h20.deeplearning() function itself is the ability to specify a stopping metric. The parameter setting: (stopping_metric="misclassification", ## could be "MSE","logloss","r2" ) in the specification of model m2 means that the neural net will continue to learn until the specified performance threshold is achieved. In most cases, this will produce a useful model in much less time than it would take to have the learner run to completion. The following plot, generated in the script referenced above, shows the kind of problem for which the Deep Learning algorithm excels.
Highlights of the conference for me included the presentations listed below. The videos and slides (when available) from all of these presentations will be posted on the H2O conference website. Some have been posted already and the rest should follow soon. (I have listed the dates and presentation times to help you locate the slides when they become available)
Madeleine Udell (11-11: 10:30AM) presented the mathematics underlying the new algorithm, Generalized Low Rank Models (GLRM), she developed as part of her PhD work under Stephen Boyd, professor at Stanford University and adviser to H2O. This algorithm which generalizes PCA to deal with heterogeneous data types shows great promise for a variety of data science applications. Among other things, it offers a scalable way to impute missing data. This was possibly the best presentation of the conference. Madeleine is an astonishingly good speaker; she makes the math exciting.
Anqi Fu (11-9: 3PM) presented her H2O implementation of the GLRM. Anqi not only does a great job of presenting the algorithm, she also offers some real insight into the challenges of turning the mathematics into production level code. You can download one of Anqi's demo R scripts here: Download Glrm.census.labor.violations. To my knowledge, Anqi's code is the only scalable implementation of the GLRM. (Madeleine wrote the prototype code in Julia.)
Matt Dowle (11-10), of data.table fame, demonstrated his port of data.table's lightning fast radix sorting algorithm to H2O. Matt showed a 1B row X 1B row table join that runs in about 1.45 minutes on a 4 node 128 core H2O cluster. This is very impressive result, but Matt says he can already do 10B x 10B row joins, and is shooting for 100B x 100B rows.
Professor Rob Tibshirani (11-11: 11AM) presented work he is doing that may lead to lasso based models capable of detecting the presence of cancer in tissue extracted from patients while they are on the operating table! He described "Customized Learning", a method of building individual models for each patient. The basic technique is to pool the data from all of the patients and run a clustering algorithm. Then, for each patient fit a model using only the data in the patient's cluster. This is exciting work with the real potential to save lives.
Professor Stephen Boyd (11-10: 11AM) delivered a tutorial on optimization starting with basic convex optimization problems and then went on to describe Consensus Optimization, an algorithm for building machine learning models from data stored at different locations without sharing the data among the locations. Professor Boyd is a lucid and entertaining speaker, the kind of professor you will wish you had had.
Arno Candel (11-9: 1:30PM) presented the Deep Learning model which he developed at H2O. Arno is an accomplished speaker who presents the details with great clarity and balance. Be sure to have a look at his slide showing the strengths and weaknesses of Deep Learning.
Erin LeDell (11-9: 3PM) de-mystified ensembles and described how to build an ensemble learner from scratch. Anyone who wants to compete in a Kaggle competition should find this talk to be of value.
Szilard Pafka (11-11:3PM), in a devastatingly effective, low key presentation, described his efforts to benchmark the open source, machine learning platforms R, Python scikit, Vopal Wabbit, H2O, xgboost and Spark MLlib. Szilard downplayed his results, pointing out that they are in no way meant to be either complete nor conclusive. Nevertheless, Szilard put some considerable effort into the benchmarks. (He worked directly with all of the development teams for the various platforms.) Szilard did not offer any conclusions, but things are not looking all that good for Spark. The following slide plots AUC vs file size up to 10M rows.
Szilard's presentation should be available on the H2O site soon, but it is also available here.
I also found the Wednesday morning panel discussion on the "Culture of Data Driven Decision Making" and the Wednesday afternoon panel on "Algorithms -Design and Application" to be informative and well worth watching. Both panels included a great group of articulate and knowledgeable people.
If you have not checked in with H2O since the post I wrote last year, here' on one slide, is some of what they have been up to since then.
Congratulations to H2O for putting on a top notch event!