Annual R User Conference 2014
The useR! 2014 conference was a mind-blowing experience. With hundreds of R enthusiasts and the beautiful UCLA campus, I am really glad that I had the chance to attend! The only problem is that, after a few days of non-stop R talks, I was (and still am) completely overwhelmed by all the cool new packages and ideas.
“The Open Source In-Memory, Prediction Engine for Big Data Science” – that’s how 0xdata, the creator of H2O, describes it. Joseph Rickert’s blog post is a very good introduction to H2O, so please read that if you want to find out more. I am going straight into the deep learning part.
Deep Learning in R
But first, let’s play with the ‘h2o’ package and get familiar with it.
The H2O Experiment
- How to set up and connect to a local H2O cluster from R.
- How to train a deep neural network model.
- How to use the model for predictions.
- Out-of-bag performance of non-regularized and regularized models.
- How memory usage varies over time.
OK, enough of the background and experiment setup. Instead of writing this blog post like a boring lab report, let’s go through what I have found out so far. (If you want to find out more, all code is available here so you can modify it and try it out on your clusters.)
Setting Up and Connecting to an H2O Cluster
Smoooooth! – if I have to explain it in one word. 0xdata made this really easy for R users. Below is the code to start a local cluster with a 1GB or 2GB memory allowance. However, if you want to start the local cluster from a terminal (which is also useful if you want to see the messages during model training), you can run: java -Xmx1g -jar h2o.jar (see the original H2O documentation here).
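As a minimal sketch of the cluster start-up described above, assuming a recent release of the 'h2o' package (where the memory cap is passed as max_mem_size; older releases may name this argument differently):

```r
library(h2o)

# Start (or connect to) a local H2O cluster with a 1 GB memory cap.
# Swap in max_mem_size = "2g" for the 2 GB allowance.
localH2O <- h2o.init(ip = "localhost", port = 54321,
                     startH2O = TRUE, max_mem_size = "1g")

# Print the cluster status (version, memory, cores in use).
h2o.clusterInfo()
```

The ip and port values shown are the usual local defaults and can be left out.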
By default, H2O starts a cluster using all available threads (8 in my case). The h2o.init(…) function has no argument for limiting the number of threads yet (well, sometimes you do want to leave one thread idle for other important tasks like Facebook). But it is not really a problem.
In order to train models with the H2O engine, I need to link the datasets to the H2O cluster first. There are many ways to do it. In this case, I linked a data frame (Breast Cancer) and imported CSVs (MNIST) using the following code.
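The two routes mentioned above might look roughly like this. It is a sketch under assumptions: the Breast Cancer data is taken from the 'mlbench' package, the argument names follow a recent 'h2o' release, and "mnist_train.csv" is a hypothetical path standing in for your own copy of MNIST.

```r
library(h2o)
library(mlbench)   # assumed source of the Breast Cancer data
h2o.init(max_mem_size = "1g")

# Route 1: push an in-memory R data frame to the cluster.
data(BreastCancer)
dat <- BreastCancer[, -1]               # drop the sample Id column
dat_h2o <- as.h2o(dat, destination_frame = "breast_cancer")

# Route 2: let H2O parse a CSV file directly.
mnist_path <- "mnist_train.csv"         # hypothetical path
if (file.exists(mnist_path)) {
  mnist_h2o <- h2o.importFile(path = mnist_path,
                              destination_frame = "mnist_train")
}

# Check the dimensions of the frame now held by H2O.
dim(dat_h2o)
```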
Training a Deep Neural Network Model
The syntax is very similar to other machine learning algorithms in R. The key difference is that the inputs x and y are specified as column numbers.
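Here is a hedged sketch of such a call, again assuming the 'mlbench' Breast Cancer data and a recent 'h2o' release. The layer sizes, dropout ratios and epoch count are illustrative values, not tuned settings.

```r
library(h2o)
library(mlbench)   # assumed source of the Breast Cancer data
h2o.init(max_mem_size = "1g")

data(BreastCancer)
dat_h2o <- as.h2o(BreastCancer[, -1])   # drop Id: 9 predictors + Class

# x and y are column numbers: predictors in columns 1-9, response in 10.
model <- h2o.deeplearning(
  x = 1:9,
  y = 10,
  training_frame = dat_h2o,
  activation = "TanhWithDropout",       # dropout acts as a regularizer
  input_dropout_ratio = 0.2,
  hidden_dropout_ratios = c(0.5, 0.5, 0.5),
  hidden = c(50, 50, 50),               # three hidden layers of 50 units
  epochs = 100
)

# Print a summary of the fitted model.
model
```

Switching activation to "Tanh" and removing the two dropout arguments gives a non-regularized counterpart for comparison.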
Using the Model for Prediction
Again, the code should look very familiar to R users.
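A minimal sketch of the prediction step, under the same assumptions as before ('mlbench' data, recent 'h2o' release). The 60/40 split and the seed are arbitrary choices for illustration.

```r
library(h2o)
library(mlbench)   # assumed source of the Breast Cancer data
h2o.init(max_mem_size = "1g")

data(BreastCancer)
dat <- BreastCancer[, -1]               # drop the sample Id column

# Hold out 40% of the rows for testing.
set.seed(1234)
train_idx <- sample(nrow(dat), size = round(0.6 * nrow(dat)))
train_h2o <- as.h2o(dat[train_idx, ])
test_h2o  <- as.h2o(dat[-train_idx, ])

model <- h2o.deeplearning(x = 1:9, y = 10, training_frame = train_h2o,
                          hidden = c(50, 50, 50), epochs = 100)

# Predict on the held-out rows and pull the results back into R.
pred <- as.data.frame(h2o.predict(model, newdata = test_h2o))

# Confusion table: predicted label versus true label.
table(predicted = pred$predict, actual = dat$Class[-train_idx])
```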
Out-of-Bag Performance (Breast Cancer Dataset)
No surprise here. As I expected, the non-regularized model overfitted the training set and performed poorly on the test set. Also as expected, the regularized models did give consistent out-of-bag performance. Of course, more tests on different datasets are needed. But this is definitely a good start for using deep learning techniques in R!
Memory Usage (MNIST Dataset)
This is awesome and really encouraging! In near idle mode, my laptop uses about 1GB of memory (Ubuntu 14.04). During the MNIST model training, H2O successfully kept the memory usage below the capped 2GB allowance over time with all 8 threads working like a steam train! OK, this is based on just one simple test but I already feel comfortable and confident to move on and use H2O for much bigger datasets.
Having said that, the included algorithms (Deep Neural Networks, Random Forest, GBM, K-Means, PCA etc) are solid for most of the common data mining tasks. Discovering and experimenting with the deep learning functions in H2O really made me happy. With the superb memory management and the full integration with multi-node big data platforms, I am sure this H2O engine will become more and more popular among data scientists. I am already thinking about the Parallella project but I will leave it until I finish my thesis.
I can now understand why John Chambers recommended H2O. It has already become one of my essential R tools for data mining. The deep learning algorithm in H2O is very interesting, and I will continue to explore and experiment with the rest of the regularization parameters such as ‘L1’ and ‘L2’, as well as the ‘Maxout’ activation function.
As usual, code is available at my GitHub repo for this blog.
Personal Highlights of useR! 2014
There are more cool things that I am trying at the moment. I will try to blog about them in the near future. If I have to name a few right now … they would be:
- Embedding Shiny Apps in R Markdown by RStudio
- subsemble: Ensemble learning in R with the Subsemble algorithm by Erin LeDell
- OpenCPU by Jeroen Ooms
- dendextend: an R package for easier manipulation and visualization of dendrograms by Tal Galili
- Adaptive Resampling in a Parallel World by Max Kuhn
- Packrat – A Dependency Management System for R by J.J. Allaire