(This article was first published on **Blend it like a Bayesian!**, and kindly contributed to R-bloggers)

### Annual R User Conference 2014

The useR! 2014 conference was a mind-blowing experience. With hundreds of R enthusiasts and the beautiful UCLA campus, I am really glad that I had the chance to attend! The only problem is that, after a few days of non-stop R talks, I was (and still am) completely overwhelmed by all the cool new packages and ideas.

*Let me start with H2O* – one of the three promising projects that John Chambers highlighted during his keynote (the other two were Rcpp/Rcpp11 and RLLVM/RLLVMCompile).

### What’s H2O?

*“The Open Source In-Memory, Prediction Engine for Big Data Science”* – that’s how 0xdata, the creator of H2O, describes it. Joseph Rickert’s blog post is a very good introduction to H2O, so please read that if you want to find out more. I am going straight into the deep learning part.

### Deep Learning in R

But first, let’s play with the ‘h2o’ package and get familiar with it.

### The H2O Experiment

1. How to set up and connect to a local H2O cluster from R.
2. How to train a deep neural network model.
3. How to use the model for predictions.
4. Out-of-bag performance of non-regularized and regularized models.
5. How memory usage varies over time.

#### Experiment 1:

*h2o.deeplearning(…)* function (or basically the objectives 1 to 4 mentioned above).

#### Experiment 2:

### Findings

OK, enough for the background and experiment setup. Instead of writing this blog post like a boring lab report, let’s go through what I have found out so far. (*If you want to find out more, all code is available here so you can modify it and try it out on your clusters.*)

#### Setting Up and Connecting to an H2O Cluster

*Smoooooth!* – if I have to explain it in one word. 0xdata made this really easy for R users. Below is the code to start a local cluster with 1GB or 2GB memory allowance. However, if you want to start the local cluster from a terminal (which is also useful if you want to see the messages during model training), you can run `java -Xmx1g -jar h2o.jar` (see the original H2O documentation here).
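A minimal sketch of the cluster start-up, assuming a recent release of the ‘h2o’ package. The argument name `max_mem_size` comes from current releases and may not match the version used for this post, so treat it as an assumption and check `?h2o.init` for your installed version.

```r
# Sketch only: argument names follow recent releases of the 'h2o' package
# and may differ from the 2014 version used for this post.
library(h2o)

# Start (or connect to) a local H2O cluster with a 1GB memory allowance.
# For a 2GB allowance, use max_mem_size = "2g" instead.
localH2O <- h2o.init(ip = "localhost", port = 54321,
                     startH2O = TRUE, max_mem_size = "1g")

# Confirm that the cluster is reachable from this R session.
h2o.clusterInfo()
```

If a cluster was already started from the terminal on the same port, `h2o.init()` should connect to it rather than launching a new one.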

By default, H2O starts a cluster using all available threads (8 in my case). The *h2o.init(…)* function has no argument for limiting the number of threads yet (*well, sometimes you do want to leave one thread idle for other important tasks like Facebook*). But it is not really a problem.

#### Loading Data

In order to train models with the H2O engine, I need to link the datasets to the H2O cluster first. There are many ways to do it. In this case, I linked a data frame (Breast Cancer) and imported CSVs (MNIST) using the following code.
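The two linking routes can be sketched roughly as follows, again assuming a recent ‘h2o’ release. The built-in `iris` data frame is only a stand-in for the Breast Cancer data frame, and a temporary CSV stands in for the MNIST files; neither is the data used in the post.

```r
# Sketch only: 'iris' and the temporary CSV are stand-ins, not the
# Breast Cancer and MNIST data used in the post.
library(h2o)
h2o.init()

# Route 1: link an in-memory R data frame to the H2O cluster.
iris_hex <- as.h2o(iris)

# Route 2: import a CSV file from disk into the H2O cluster.
# A temporary file is written first so the example is self-contained.
csv_path <- tempfile(fileext = ".csv")
write.csv(iris, csv_path, row.names = FALSE)
csv_hex <- h2o.importFile(path = csv_path)

# Both objects are now references to data held by the cluster.
dim(iris_hex)
dim(csv_hex)
```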

#### Training a Deep Neural Network Model

The syntax is very similar to other machine learning algorithms in R. The key difference is in the inputs for x and y, where you need to use column numbers as identifiers.
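As a rough illustration of the x and y convention, here is a small sketch assuming a recent ‘h2o’ release. The `iris` data, the network size and the number of epochs are illustrative choices, not the settings used in the experiments.

```r
# Sketch only: dataset and hyperparameters are illustrative assumptions.
library(h2o)
h2o.init()

train_hex <- as.h2o(iris)  # stand-in for the training set

model <- h2o.deeplearning(
  x = 1:4,                 # predictors, identified by column number
  y = 5,                   # response, identified by column number
  training_frame = train_hex,
  activation = "Tanh",
  hidden = c(50, 50),      # two hidden layers with 50 neurons each
  epochs = 100
)

# Printing the model shows a summary of the network and training metrics.
model
```

Because column 5 (`Species`) is a factor, H2O treats this as a classification problem; a numeric response would give a regression model instead.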

#### Using the Model for Prediction

Again, the code should look very familiar to R users.

The *h2o.predict(…)* function will return the predicted label with the probabilities of all possible outcomes (or numeric outputs for regression problems) – very useful if you want to train more models and build an ensemble.
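A hedged sketch of the prediction step, assuming a recent ‘h2o’ release. The train/test split, the `iris` stand-in data and the model settings are assumptions made for illustration.

```r
# Sketch only: data, split and model settings are illustrative assumptions.
library(h2o)
h2o.init()

# Split a stand-in dataset into training and test frames.
data_hex <- as.h2o(iris)
splits <- h2o.splitFrame(data_hex, ratios = 0.8, seed = 1234)
train_hex <- splits[[1]]
test_hex <- splits[[2]]

model <- h2o.deeplearning(x = 1:4, y = 5, training_frame = train_hex,
                          hidden = c(20, 20), epochs = 50)

# Predict on the test frame; the result lives on the cluster.
pred_hex <- h2o.predict(model, test_hex)

# Pull the predictions back into R: a 'predict' column with the label,
# followed by one probability column per class.
pred_df <- as.data.frame(pred_hex)
head(pred_df)
```

Keeping the per-class probabilities in a regular data frame makes it straightforward to average them with the outputs of other models later.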

#### Out-of-Bag Performance (Breast Cancer Dataset)

No surprise here. As I expected, the non-regularized model overfitted the training set and performed poorly on the test set. Also as expected, the regularized models gave consistent out-of-bag performance. Of course, more tests on different datasets are needed. But this is definitely a good start for using deep learning techniques in R!
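This kind of comparison can be sketched as below, assuming a recent ‘h2o’ release. The choice of dropout plus a small L1 penalty as the regularization, the `iris` stand-in data and the helper `error_rate()` are all assumptions for illustration; they are not the settings or the results reported here.

```r
# Sketch only: regularization settings, data and the error_rate() helper
# are illustrative assumptions, not the post's actual experiment.
library(h2o)
h2o.init()

data_hex <- as.h2o(iris)
splits <- h2o.splitFrame(data_hex, ratios = 0.6, seed = 1234)
train_hex <- splits[[1]]
test_hex <- splits[[2]]

# Model without regularization.
plain <- h2o.deeplearning(x = 1:4, y = 5, training_frame = train_hex,
                          activation = "Tanh",
                          hidden = c(50, 50), epochs = 200)

# Model with dropout on the input and hidden layers plus an L1 penalty.
regularized <- h2o.deeplearning(x = 1:4, y = 5, training_frame = train_hex,
                                activation = "TanhWithDropout",
                                input_dropout_ratio = 0.2,
                                hidden_dropout_ratios = c(0.5, 0.5),
                                l1 = 1e-5,
                                hidden = c(50, 50), epochs = 200)

# Hypothetical helper: misclassification rate of a model on a frame.
error_rate <- function(model, frame) {
  predicted <- as.data.frame(h2o.predict(model, frame))$predict
  actual <- as.data.frame(frame)[[5]]
  mean(as.character(predicted) != as.character(actual))
}

# Compare training and test error for both models side by side.
results <- data.frame(
  model = c("plain", "regularized"),
  train = c(error_rate(plain, train_hex), error_rate(regularized, train_hex)),
  test  = c(error_rate(plain, test_hex),  error_rate(regularized, test_hex))
)
results
```

A large gap between the `train` and `test` columns for one model is the overfitting signal described above; on a dataset as small and easy as `iris` the gap may well be negligible.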

#### Memory Usage (MNIST Dataset)

*This is awesome and really encouraging!* In near idle mode, my laptop uses about 1GB of memory (Ubuntu 14.04). During the MNIST model training, H2O successfully kept the memory usage below the capped 2GB allowance over time with all 8 threads working like a steam train! OK, this is based on just one simple test but I already feel comfortable and confident to move on and use H2O for much bigger datasets.

### Conclusions

Having said that, the included algorithms (Deep Neural Networks, Random Forest, GBM, K-Means, PCA etc) are solid for most of the common data mining tasks. Discovering and experimenting with the deep learning functions in H2O really made me happy. With the superb memory management and the full integration with multi-node big data platforms, I am sure this H2O engine will become more and more popular among data scientists. I am already thinking about the Parallella project but I will leave it until I finish my thesis.

I can now understand why John Chambers recommended H2O. It has already become one of my essential R tools for data mining. The deep learning algorithm in H2O is very interesting, and I will continue to explore and experiment with the rest of its options, such as the ‘L1’ and ‘L2’ regularization parameters and the ‘Maxout’ activation function.

### Code

As usual, code is available at my GitHub repo for this blog.

### Personal Highlight of useR! 2014

`#User2014 trended thx to: @LouBajuk @guneetc79 @earino @pilatesbuff @matlabulous @timtriche http://t.co/auoFM1xWIw pic.twitter.com/l952WD5ejz`

— Ajay Gopal (@aj2z) July 7, 2014

**… which means I successfully made Matlab trending with R!!!**

There are more cool things that I am trying at the moment. I will try to blog about them in the near future. If I have to name a few right now … they would be:

- Embedding Shiny Apps in R Markdown by RStudio
- subsemble: Ensemble learning in R with the Subsemble algorithm by Erin LeDell
- OpenCPU by Jeroen Ooms
- dendextend: an R package for easier manipulation and visualization of dendrograms by Tal Galili
- Adaptive Resampling in a Parallel World by Max Kuhn
- Packrat – A Dependency Management System for R by J.J. Allaire

*(Pheeew! So here is my first blog post related to machine learning – the very purpose of starting this blog. Not bad it finally happened after a whole year!)*
