by Joseph Rickert
H2O.ai held its first H2O World conference over two days at the Computer History Museum in Mountain View, CA. Although the main purpose of the conference was to promote the company's rich set of Java-based machine learning algorithms and to announce their new products, Flow and Play, there were quite a few sessions devoted to R and to statistics in general.
Before I describe some of these, a few words about the conference itself. H2O World was exceptionally well run, especially for a first try with over 500 people attending (my estimate). The venue is an interesting, accommodating space with plenty of parking that played well with what, I think, must have been an underlying theme of the conference: acknowledging the contributions of past generations of computer scientists and statisticians. There were two stages offering simultaneous talks for at least part of the conference: the Paul Erdős stage and the John Tukey stage. Tukey I got, but why put such an eccentric mathematician front and center? I was puzzled until Sri Ambati, H2O.ai's CEO and co-founder, remarked that he admired Erdős because of his great generosity with collaboration. To a greater extent than most similar events, H2O World itself felt like a collaboration. There was plenty of opportunity to interact with other attendees, speakers and H2O technical staff (the whole company must have been there). Data scientists, developers and marketing staff were accessible and gracious with their time. Well done!
R was center stage for a good bit of the hands-on training that occupied the first day of the conference. There were several sessions (Exploratory Data Analysis, Regression, Deep Learning, Clustering and Dimensionality Reduction) on accessing various H2O algorithms through the h2o R package and the H2O API. All of these moved quickly from R to running the custom H2O algorithms on the JVM. However, the message that came through is that R is the right environment for sophisticated machine learning.
Two great pleasures from the second day of the conference were Trevor Hastie's tutorial on the Gradient Boosting Machine and John Chambers's personal remembrances of John Tukey. It is unusual for a speaker to announce that he has been asked to condense a two-hour talk into something just under an hour and then go on to speak slowly and with great clarity, each sentence beguiling you into imagining that you are really following the details. (It would be very nice if the video of this talk were made available.)
Two notable points from Trevor's lecture were understanding gradient boosting as minimizing the exponential loss function and the openness of the gbm algorithm to “tinkering”. For the former point, see Chapter 10 of The Elements of Statistical Learning or the more extended discussion in Schapire and Freund's Boosting: Foundations and Algorithms.
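As a rough sketch of the first point, here is the standard textbook presentation from Chapter 10 of The Elements of Statistical Learning (not a transcription of Trevor's slides). For a two-class response coded as -1 or +1, AdaBoost can be read as forward stagewise additive modeling under exponential loss. The notation (f_m for the fit after m steps, G_m for the m-th base classifier, β_m for its coefficient, w_i^(m) for the observation weights, err_m for the weighted error) is the textbook's and is assumed here purely for illustration.

```latex
\begin{align*}
L\bigl(y, f(x)\bigr) &= \exp\bigl(-y\, f(x)\bigr), \qquad y \in \{-1, +1\} \\
(\beta_m, G_m) &= \arg\min_{\beta,\, G} \sum_{i=1}^{N} w_i^{(m)} \exp\bigl(-\beta\, y_i\, G(x_i)\bigr),
  \qquad w_i^{(m)} = \exp\bigl(-y_i\, f_{m-1}(x_i)\bigr) \\
\beta_m &= \tfrac{1}{2} \log \frac{1 - \mathrm{err}_m}{\mathrm{err}_m},
  \qquad \mathrm{err}_m = \frac{\sum_{i} w_i^{(m)}\, I\bigl(y_i \neq G_m(x_i)\bigr)}{\sum_{i} w_i^{(m)}} \\
f_m(x) &= f_{m-1}(x) + \beta_m\, G_m(x)
\end{align*}
```

Gradient boosting generalizes this stagewise fit to other differentiable loss functions by fitting each new base learner to the negative gradient of the loss, which may be part of what makes the algorithm so open to tinkering.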
John Tukey spent 40 years at Bell Labs (1945 – 1985), and John Chambers's tenure there overlapped the last 20 years of Tukey's stay. Chambers, who had the opportunity to observe Tukey over this extended period of time, painted a moving and lifelike portrait of the man. According to Chambers, Tukey could be patient and gracious with customers and staff, provocative with his statistician colleagues and “intellectually intimidating”. John remembered Richard Hamming saying: “John (Tukey) was a genius. I was not.” Tukey apparently delighted in making up new terms when talking with fellow statisticians. For example, he called the top and bottom lines that identify the interquartile range on a box plot “hinges”, not quartiles. I found it particularly interesting that Tukey would describe a statistic in terms of the process used to compute it, and not in terms of any underlying theory. Very unusual, I would think, for someone who earned a PhD in topology under Solomon Lefschetz. For more memories of John Tukey, including more from John Chambers, look here.
Other R-related highlights were talks by Matt Dowle and Erin LeDell. Matt reprised the update on new features in data.table that he recently gave to the Bay Area useR Group and also presented interesting applications using data.table from the UK insurance company Landmark and from KatRisk (look here for the KatRisk part of Matt's presentation).
Erin, author of the h2oEnsemble package available on GitHub, delivered an exciting and informative talk on using ensembles of learners (combining gbm models and logistic regression models, for example) to create “superlearners”.
Finally, I gave a short talk on Revolution Analytics' recent work towards achieving reproducibility in R. The presentation motivates the need for reproducibility by examining the use of R in industry and science, and describes how the checkpoint package and Revolution R Open, an open source distribution of R that points to a static repository, can be helpful.