
## Introduction

Deep learning is a recent trend in machine learning that models highly non-linear representations of data. In recent years, deep learning has gained tremendous momentum and prevalence in a variety of applications (Wikipedia 2016a). Among these are image and speech recognition, driverless cars, natural language processing and many more. Interestingly, the majority of mathematical concepts for deep learning have been known for decades. However, it is only through several recent developments that the full potential of deep learning has been unleashed (Nair and Hinton 2010; Srivastava et al. 2014).

Previously, it was hard to train artificial neural networks due to vanishing gradients and overfitting problems. Both problems are now largely mitigated by using different activation functions, dropout regularization and massive amounts of training data. For instance, the Internet can nowadays be utilized to retrieve large volumes of both labeled and unlabeled data. In addition, the availability of GPUs and GPGPUs has made computations much cheaper and faster.

Today, deep learning has shown itself to be very effective for almost any task which requires machine learning. However, it is particularly suited to complex, hierarchical data. Its underlying artificial neural network models highly non-linear representations; these are usually composed of multiple layers together with non-linear transformations and tailored architectures. A typical representation of a deep neural network is depicted in Figure 1.

Figure 1. Model of a deep neural network.
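As a rough illustration of the layered structure described above, the following base R sketch passes a small batch through a stack of layers with non-linear transformations. The layer sizes, the random weights and the helper names (relu, softmax, forward) are assumptions made for this example only; none of them is taken from the packages discussed below.

```r
# Minimal forward pass through a small feed-forward network (illustrative only).
relu <- function(z) pmax(z, 0)

softmax <- function(z) {
  # Subtract the row maxima for numerical stability before exponentiating.
  e <- exp(z - apply(z, 1, max))
  e / rowSums(e)
}

forward <- function(x, weights, biases) {
  a <- x
  n_layers <- length(weights)
  for (l in seq_len(n_layers)) {
    # Affine transformation followed by a non-linearity.
    z <- sweep(a %*% weights[[l]], 2, biases[[l]], "+")
    a <- if (l < n_layers) relu(z) else softmax(z)
  }
  a
}

set.seed(1)
sizes <- c(4, 8, 8, 3)  # hypothetical: 4 inputs, two hidden layers, 3 classes
weights <- lapply(seq_len(length(sizes) - 1), function(l) {
  matrix(rnorm(sizes[l] * sizes[l + 1], sd = 0.1), sizes[l], sizes[l + 1])
})
biases <- lapply(sizes[-1], function(n) rep(0, n))

x <- matrix(rnorm(5 * 4), nrow = 5)   # 5 samples with 4 features each
probs <- forward(x, weights, biases)  # one row of class probabilities per sample
```

Each hidden layer applies a linear map followed by a rectifier, and the output layer turns the final scores into class probabilities.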

The success of deep learning has led to a wide range of frameworks and libraries for various programming languages. Examples include Caffe, Theano, Torch and TensorFlow, amongst others. This blog entry aims to provide an overview and comparison of different deep learning packages available for the programming language R. We compare performance and ease-of-use across different datasets.

## Packages for deep learning in R

The R programming language has gained considerable popularity among statisticians and data miners for its ease-of-use, as well as its sophisticated visualizations and analyses. With the advent of the deep learning era, support for deep learning in R has grown steadily, with an increasing number of packages becoming available. This section presents an overview of deep learning in R as provided by the following packages: MXNetR, darch, deepnet, H2O and deepr.

First of all, we note that the underlying learning algorithms greatly vary from one package to another. As such, Table 1 shows a list of the available methods/architectures in each of the packages.

Table 1. List of available deep learning methods across the R packages.

| Package | Available architectures of neural networks |
|---|---|
| MXNetR | Feed-forward neural network, convolutional neural network (CNN) |
| darch | Restricted Boltzmann machine, deep belief network |
| deepnet | Feed-forward neural network, restricted Boltzmann machine, deep belief network, stacked autoencoders |
| H2O | Feed-forward neural network, deep autoencoders |
| deepr | Simplifies some functions from the H2O and deepnet packages |

### Package “MXNetR”

The MXNetR package is an interface of the MXNet library written in C++. It contains feed-forward neural networks and convolutional neural networks (CNN) (MXNetR 2016a). It also allows one to construct customized models. This package is distributed in two versions: CPU only or GPU version. The former CPU version can be easily installed directly from inside R, whereas the latter GPU version depends on 3rd party libraries like cuDNN and requires building the library from its source code (MXNetR 2016b).

A feed-forward neural network (multi-layer perceptron) can be built in MXNetR with the function call:

mx.mlp(data, label, hidden_node=1, dropout=NULL, activation="tanh", out_activation="softmax", device=mx.ctx.default(), ...)

The parameters are as follows:

• data – input matrix
• label – training labels
• hidden_node – a vector containing the number of hidden nodes in each hidden layer
• dropout – a number in [0,1) containing the dropout ratio from the last hidden layer to the output layer
• activation – either a single string or a vector containing the names of activation functions. Valid values are {'relu', 'sigmoid', 'softrelu', 'tanh'}
• out_activation – a single string containing the name of the output activation function. Valid values are {'rmse', 'softmax', 'logistic'}
• device – whether to train on mx.cpu (default) or mx.gpu
• ... – other parameters passing to mx.model.FeedForward.create

The function mx.model.FeedForward.create is used internally in mx.mlp and takes the following parameters:

• symbol – the symbolic configuration of the neural network
• y – array of labels
• x – training data
• ctx – context, i.e. a device (CPU/GPU) or list of devices (multiple CPUs or GPUs)
• num.round – number of iterations to train the model
• optimizer – string (default is 'sgd')
• initializer – initialization scheme for parameters
• eval.data – validation set used during the process
• eval.metric – evaluation function on the results
• epoch.end.callback – callback when iteration ends
• batch.end.callback – callback when one mini-batch iteration ends
• array.batch.size – batch size used for array training
• array.layout – can be {'auto', 'colmajor', 'rowmajor'}
• kvstore – synchronization scheme for multiple devices

Sample call:

model <- mx.mlp(train.x, train.y,
                hidden_node=c(128,64), out_node=2,
                activation="relu", out_activation="softmax",
                num.round=100, array.batch.size=15,
                learning.rate=0.07, momentum=0.9,
                device=mx.cpu())

To use the trained model afterwards, we simply need to invoke the function predict() specifying the model as the first parameter and testset as the second:

preds = predict(model, testset)
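For a softmax output, the predictions are class probabilities rather than class labels. The sketch below shows one way to turn such probabilities into labels and an accuracy value. It rests on two assumptions that should be checked against the installed version: that the prediction matrix has one row per class and one column per sample, and that the labels are 0-based. The matrix preds_example and the vector true_labels are made up for illustration.

```r
# Hypothetical prediction matrix: one row per class, one column per sample.
preds_example <- matrix(c(0.9, 0.1,
                          0.2, 0.8,
                          0.6, 0.4),
                        nrow = 2)

# max.col() returns the index of the largest entry in each row,
# so we transpose first; subtracting 1 yields 0-based class labels.
pred_labels <- max.col(t(preds_example)) - 1

true_labels <- c(0, 1, 1)                      # hypothetical ground truth
accuracy <- mean(pred_labels == true_labels)   # share of correct predictions
```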

The function mx.mlp() is essentially a proxy to the more flexible but lengthy process of defining a neural network by using the ‘Symbol’ system of MXNetR. The equivalent of the previous network in symbolic definition is:

data <- mx.symbol.Variable("data")
fc1 <- mx.symbol.FullyConnected(data, num_hidden=128)
act1 <- mx.symbol.Activation(fc1, name="relu1", act_type="relu")
fc2 <- mx.symbol.FullyConnected(act1, name="fc2", num_hidden=64)
act2 <- mx.symbol.Activation(fc2, name="relu2", act_type="relu")
fc3 <- mx.symbol.FullyConnected(act2, name="fc3", num_hidden=2)
lro <- mx.symbol.SoftmaxOutput(fc3, name="sm")
model2 <- mx.model.FeedForward.create(lro, X=train.x, y=train.y,
                                      ctx=mx.cpu(), num.round=100,
                                      array.batch.size=15,
                                      learning.rate=0.07, momentum=0.9)

When the architecture of the network is finally created, MXNetR provides a simple way to graphically inspect it using the following function call:

graph.viz(model$symbol$as.json())

graph.viz(model2$symbol$as.json())

Here, the parameter is the trained model represented by the symbol. The first network is constructed by mx.mlp() and the second using the symbol system.

The definition goes layer-by-layer from input to output, while also allowing for a different number of neurons and specific activation functions for each layer separately. Additional options are available via mx.symbol: mx.symbol.Convolution, which applies convolution to the input and then adds a bias. It can create convolutional neural networks. The reverse is mx.symbol.Deconvolution, which is usually used in segmentation networks along with mx.symbol.UpSampling in order to reconstruct the pixel-wise classification of an image. Another type of layer used in CNNs is mx.symbol.Pooling; this essentially reduces the data by usually picking signals with the highest response. The layer mx.symbol.Flatten is needed to link convolutional and pooling layers to a fully connected network. Additionally, mx.symbol.Dropout can be used to cope with the overfitting problem. It takes as a parameter previous_layer and a float value fraction of the input that is dropped.
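To make the idea of pooling more concrete, here is a minimal base R sketch of max pooling on a single 2D feature map. It is not how MXNet implements the operation; the function name max_pool and the 4x4 input are assumptions chosen for illustration.

```r
# Max pooling over k x k windows with a given stride (illustrative only).
max_pool <- function(x, k = 2, stride = 2) {
  out_rows <- (nrow(x) - k) %/% stride + 1
  out_cols <- (ncol(x) - k) %/% stride + 1
  out <- matrix(NA_real_, out_rows, out_cols)
  for (i in seq_len(out_rows)) {
    for (j in seq_len(out_cols)) {
      r0 <- (i - 1) * stride + 1   # top row of the current window
      c0 <- (j - 1) * stride + 1   # left column of the current window
      # Keep only the strongest response inside the window.
      out[i, j] <- max(x[r0:(r0 + k - 1), c0:(c0 + k - 1)])
    }
  }
  out
}

feature_map <- matrix(1:16, nrow = 4)  # hypothetical 4 x 4 feature map
pooled <- max_pool(feature_map)        # reduced to 2 x 2
```

With a 2x2 window and stride 2, each spatial dimension is halved, which is why pooling layers reduce the amount of data passed on to later layers.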

As we can see, MXNetR can be used for quick design of standard multi-layer perceptrons with the function mx.mlp() or for more extensive experiments regarding symbolic representation.

Example of LeNet network:

data <- mx.symbol.Variable('data')
# first convolutional block
conv1 <- mx.symbol.Convolution(data=data, kernel=c(5,5), num_filter=20)
tanh1 <- mx.symbol.Activation(data=conv1, act_type="tanh")
pool1 <- mx.symbol.Pooling(data=tanh1, pool_type="max", kernel=c(2,2), stride=c(2,2))
# second convolutional block
conv2 <- mx.symbol.Convolution(data=pool1, kernel=c(5,5), num_filter=50)
tanh2 <- mx.symbol.Activation(data=conv2, act_type="tanh")
pool2 <- mx.symbol.Pooling(data=tanh2, pool_type="max", kernel=c(2,2), stride=c(2,2))
# fully connected layers and softmax output
flatten <- mx.symbol.Flatten(data=pool2)
fc1 <- mx.symbol.FullyConnected(data=flatten, num_hidden=500)
tanh3 <- mx.symbol.Activation(data=fc1, act_type="tanh")
fc2 <- mx.symbol.FullyConnected(data=tanh3, num_hidden=10)
lenet <- mx.symbol.SoftmaxOutput(data=fc2)
model <- mx.model.FeedForward.create(lenet, X=train.array, y=train.y,
                                     ctx=mx.cpu(), num.round=5,
                                     array.batch.size=100,
                                     learning.rate=0.05, momentum=0.9)
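It can help to trace how the spatial size changes through such a network, since this determines how many values the flatten layer hands to the first fully connected layer. The arithmetic below is a sketch that assumes 28x28 input images (as in ‘MNIST’), stride 1 and no padding for the convolutions; the helper out_size is hypothetical.

```r
# Output size of a convolution or pooling window along one spatial dimension.
out_size <- function(n, kernel, stride = 1, pad = 0) {
  (n + 2 * pad - kernel) %/% stride + 1
}

size <- 28                               # assumed input width and height
size <- out_size(size, 5)                # conv1, 5x5 kernel
size <- out_size(size, 2, stride = 2)    # pool1, 2x2 window, stride 2
size <- out_size(size, 5)                # conv2, 5x5 kernel
size <- out_size(size, 2, stride = 2)    # pool2, 2x2 window, stride 2

# pool2 has 50 feature maps, so flattening yields size * size * 50 values.
flat_length <- size * size * 50
```

Under these assumptions the feature maps shrink from 28 to 24, 12, 8 and finally 4 pixels per side, so the flatten layer produces 800 values per image.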

Altogether, the MXNetR package is highly flexible, while supporting both multiple CPUs and multiple GPUs. It has a shortcut to build standard feed-forward networks, but also grants flexible functionality to build more complex, customized networks such as CNN LeNet.

### Package “darch”

The darch package (darch 2015) implements the training of deep architectures, such as deep belief networks, which consist of layer-wise pre-trained restricted Boltzmann machines. The package also entails backpropagation for fine-tuning and, in the latest version, makes pre-training optional.

Training of a deep belief network is performed via the darch() function.

Sample call:

darch <- darch(train.x, train.y,
               rbm.numEpochs = 0,
               rbm.batchSize = 100,
               rbm.trainOutputLayer = FALSE,
               layers = c(784,100,10),
               darch.batchSize = 100,
               darch.learnRate = 2,
               darch.retainData = FALSE,
               darch.numEpochs = 20)

This function takes several parameters with the most important ones as follows:

• x – input data
• y – target data
• layers – vector containing one integer for the number of neurons in each layer (including input and output layers)
• rbm.batchSize – pre-training batch size
• rbm.trainOutputLayer – boolean used in pre-training. If true, the output layer of RBM is trained as well
• rbm.numCD – number of full steps for which contrastive divergence is performed
• rbm.numEpochs – number of epochs for pre-training
• darch.batchSize – fine-tuning batch size
• darch.fineTuneFunction – fine-tuning function
• darch.dropoutInput – dropout rate on the network input
• darch.dropoutHidden – dropout rate on the hidden layers
• darch.layerFunctionDefault – default activation function for DBN, available options are {'sigmoidUnitDerivative', 'binSigmoidUnit', 'linearUnitDerivative', 'linearUnit', 'maxoutUnitDerivative', 'sigmoidUnit', 'softmaxUnitDerivative', 'softmaxUnit', 'tanSigmoidUnitDerivative', 'tanSigmoidUnit' }
• darch.stopErr – stops training if the error is less than or equal to a threshold
• darch.numEpochs – number of epochs for fine-tuning
• darch.retainData – boolean, indicates whether to store the training data in the darch instance after training

Based on the previous parameters, we can train our model resulting in an object darch. We can later apply this to a test dataset test.x to make predictions. In that case, an additional parameter type specifies the output type of the prediction. For example, it can be ‘raw’ to give probabilities, ‘bin’ for binary vectors and ‘class’ for class labels. Finally, the prediction is made when calling predict() as follows:

predictions <- predict(darch, test.x, type="bin")

Overall, the basic usage of darch is very simple. It requires only one function to train the network. But on the other hand, the package is limited to deep belief networks, which usually require much more extensive training.

### Package “deepnet”

deepnet (deepnet 2015) is a relatively small, yet quite powerful package with a variety of architectures to pick from. It can train a feed-forward network using the function nn.train() or initialize weights for the deep belief network with dbn.dnn.train(). This function internally uses rbm.train() to train a restricted Boltzmann machine (which can also be used individually). Furthermore, deepnet can also handle stacked autoencoders via sae.dnn.train().

Sample call (for nn.train()):

nn.train(x, y, initW=NULL, initB=NULL, hidden=c(50,20), activationfun="sigm", learningrate=0.8, momentum=0.5, learningrate_scale=1, output="sigm", numepochs=3, batchsize=100, hidden_dropout=0, visible_dropout=0)

One can set initial weights initW and initial biases initB, which are otherwise randomly generated. In addition, hidden controls the number of units in the hidden layers, whereas activationfun specifies the activation function of the hidden layers (can be ‘sigm’, ‘linear’ or ‘tanh’) and output specifies that of the output layer (can be ‘sigm’, ‘linear’ or ‘softmax’).

As an alternative, the following example trains a neural network where the weights are initialized by a deep belief network (via dbn.dnn.train()). The difference is mainly in the contrastive divergence algorithm that trains the restricted Boltzmann machines. It is set via cd, giving the number of iterations for Gibbs sampling inside the learning algorithm.

dbn.dnn.train(x, y, hidden=c(1), activationfun="sigm", learningrate=0.8, momentum=0.5, learningrate_scale=1, output="sigm", numepochs=3, batchsize=100, hidden_dropout=0, visible_dropout=0, cd=1)
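To illustrate what the number of Gibbs sampling iterations means, the following base R sketch performs a single contrastive divergence update for a binary restricted Boltzmann machine. It is a textbook-style sketch, not the deepnet implementation; the function names (sigm, sample_binary, cd_update), the learning rate and the toy data are all assumptions for this example.

```r
sigm <- function(z) 1 / (1 + exp(-z))

# Draw binary states from a matrix of Bernoulli probabilities.
sample_binary <- function(p) {
  (matrix(runif(length(p)), nrow = nrow(p)) < p) * 1
}

# One contrastive divergence update with `cd` Gibbs sampling steps.
# v0: batch of visible vectors (rows), W: visible x hidden weight matrix.
cd_update <- function(v0, W, b_vis, b_hid, cd = 1, lr = 0.1) {
  h0_prob <- sigm(sweep(v0 %*% W, 2, b_hid, "+"))   # positive phase
  hk_prob <- h0_prob
  vk <- v0
  for (step in seq_len(cd)) {
    hk <- sample_binary(hk_prob)                                  # sample hidden
    vk <- sample_binary(sigm(sweep(hk %*% t(W), 2, b_vis, "+")))  # reconstruct visible
    hk_prob <- sigm(sweep(vk %*% W, 2, b_hid, "+"))               # hidden given reconstruction
  }
  n <- nrow(v0)
  list(
    W = W + lr * (t(v0) %*% h0_prob - t(vk) %*% hk_prob) / n,
    b_vis = b_vis + lr * colMeans(v0 - vk),
    b_hid = b_hid + lr * colMeans(h0_prob - hk_prob)
  )
}

set.seed(42)
v0 <- matrix(rbinom(20 * 6, 1, 0.5), nrow = 20)    # 20 toy samples, 6 visible units
W <- matrix(rnorm(6 * 3, sd = 0.01), nrow = 6)     # 3 hidden units
updated <- cd_update(v0, W, b_vis = rep(0, 6), b_hid = rep(0, 3), cd = 1)
```

Larger values of cd run the sampling chain for longer before the update is computed, which costs more time per update.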

Similarly, it is possible to initialize weights from stacked autoencoders. In addition to the parameter output, this example uses sae_output, which sets the activation function of the autoencoder output units; otherwise the call works the same as before.

sae.dnn.train(x, y, hidden=c(1), activationfun="sigm", learningrate=0.8, momentum=0.5, learningrate_scale=1, output="sigm", sae_output="linear", numepochs=3, batchsize=100, hidden_dropout=0, visible_dropout=0)

Finally, we can use a trained network to predict results via nn.predict(). Subsequently, we can transform the predictions with the help of nn.test() into an error rate. The first call requires a neural network and corresponding observations as inputs. The second call additionally needs the correct labels and a threshold when making predictions (default is 0.5).

predictions <- nn.predict(nn, test.x)
error_rate <- nn.test(nn, test.x, test.y, t=0.5)

Altogether, deepnet represents a lightweight package with a restricted set of parameters; however, it offers a variety of architectures.

### Package “H2O”

H2O is an open-source software platform with the ability to exploit distributed computer systems (H2O 2015). Its core is coded in Java and requires the latest version of JVM and JDK, which can be found at https://www.java.com/en/download/. The package provides interfaces for many languages and was originally designed to serve as a cloud-based platform (Candel et al. 2015). Accordingly, one starts H2O by calling h2o.init():

h2o.init(nthreads = -1)

The parameter nthreads specifies how many cores will be used for computation. A value of -1 means that H2O will try to use all available cores on the system, though the default is 2. This routine can also work with parameters ip and port in case H2O is installed on a different machine. By default, it uses the IP address 127.0.0.1 together with port 54321. Thus, it is possible to open the address ‘localhost:54321’ in a browser in order to access a web-based interface. Once your work with the current H2O instance is finished, you need to shut it down via:

h2o.shutdown()

Sample call:

All training operations are performed by h2o.deeplearning() as follows:

model <- h2o.deeplearning(
  x=x,
  y=y,
  training_frame=train,
  validation_frame=test,
  distribution="multinomial",
  activation="RectifierWithDropout",
  hidden=c(32,32,32),
  input_dropout_ratio=0.2,
  sparse=TRUE,
  l1=1e-5,
  epochs=100)

The interface for passing data in H2O is slightly different from other packages: x is a vector containing the names of the columns with training data and y is the name of the column with the labels. The next two parameters, training_frame and validation_frame, are H2O frame objects. They can be created by calling h2o.uploadFile(), which takes a file path as an argument and loads a csv file into the environment. The use of a specific data class is motivated by the distributed environment, since the data should be available across the whole cluster. The parameter distribution is a string and can take the values ‘bernoulli’, ‘multinomial’, ‘poisson’, ‘gamma’, ‘tweedie’, ‘laplace’, ‘huber’ or ‘gaussian’, while ‘AUTO’ automatically picks a distribution based on the data. The following parameter specifies the activation function (possible values are ‘Tanh’, ‘TanhWithDropout’, ‘Rectifier’, ‘RectifierWithDropout’, ‘Maxout’ or ‘MaxoutWithDropout’). The parameter sparse is a boolean value indicating that the data contains a high proportion of zeros, which allows H2O to handle it more efficiently. The remaining parameters are intuitive and do not differ much from other packages. Many more parameters are available for fine-tuning, but it will probably not be necessary to change them since they come with recommended, pre-defined values.

Finally, we can make predictions using h2o.predict() with the following signature:

predictions <- h2o.predict(model, newdata=test_data)

Another powerful tool that H2O offers is the grid search for optimizing the hyperparameters. It is possible to specify sets of values for each parameter and then find the best combination via h2o.grid().

Hyperparameter optimization

hidden_par <- list(c(50,20,50), c(32,32,32))
l1_par <- c(1e-3,1e-8)
hyperp <- list(hidden=hidden_par, l1=l1_par)
model_grid <- h2o.grid("deeplearning",
                       hyper_params=hyperp,
                       x=x,
                       y=y,
                       distribution="multinomial",
                       training_frame=train,
                       validation_frame=test)

The H2O package will train four different models, one for each combination of the two architectures and the two L1-regularization weights. Therefore, it is possible to easily try a number of combinations of hyperparameters and see which one performs better:

for (model_id in model_grid@model_ids) {
  model <- h2o.getModel(model_id)
  mse <- h2o.mse(model, valid=TRUE)
  print(sprintf("MSE on the test set %f", mse))
}
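The number of models in such a grid is simply the product of the number of candidate values per hyperparameter. The base R sketch below enumerates the combinations for the two hidden-layer layouts and the two L1 weights used above. It only illustrates the counting and does not call H2O; the column names hidden_idx and hidden are made up for this example.

```r
hidden_par <- list(c(50, 20, 50), c(32, 32, 32))
l1_par <- c(1e-3, 1e-8)

# Enumerate every pairing of an architecture (by index) with an L1 weight.
combos <- expand.grid(hidden_idx = seq_along(hidden_par), l1 = l1_par)

# Add a readable label for each architecture, e.g. "50-20-50".
combos$hidden <- vapply(hidden_par[combos$hidden_idx], paste,
                        character(1), collapse = "-")

n_models <- nrow(combos)   # 2 architectures x 2 L1 weights
```

Because the grid grows multiplicatively with each additional hyperparameter, it is worth keeping the candidate lists short.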

Deep autoencoders

H2O can also exploit deep autoencoders. To train such a model, the same function h2o.deeplearning() is used, but the set of parameters is slightly different:

anomaly_model <- h2o.deeplearning(
  x = names(train),
  training_frame = train,
  activation = "Tanh",
  autoencoder = TRUE,
  hidden = c(50,20,50),
  sparse = TRUE,
  l1 = 1e-4,
  epochs = 100)

Here, we use only the training data, without the test set and labels. The fact that we need a deep autoencoder instead of a feed-forward network is specified by the autoencoder parameter. As before, we can choose how many hidden units should be in different layers. If we use one integer value, we will get a naive autoencoder with a single hidden layer.

After training, we can study the reconstruction error. We compute it by the specific h2o.anomaly() function.

recon_error <- h2o.anomaly(anomaly_model, test)  # reconstruction error (MSE between output and input layers)
recon_error <- as.data.frame(recon_error)        # convert reconstruction error data into an R data frame
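The reconstruction error itself is easy to reproduce by hand. The following base R sketch computes a per-row mean squared error between inputs and their reconstructions and flags the rows with the largest error. The data is simulated, and the 99% quantile threshold is an arbitrary assumption for illustration rather than a recommendation.

```r
set.seed(7)
inputs <- matrix(rnorm(100 * 5), nrow = 100)   # 100 simulated samples, 5 features

# Simulated reconstructions: close to the inputs, except for the first row.
reconstructed <- inputs + matrix(rnorm(100 * 5, sd = 0.1), nrow = 100)
reconstructed[1, ] <- reconstructed[1, ] + 3   # deliberately poor reconstruction

# Per-row mean squared error between input and reconstruction.
row_mse <- rowMeans((inputs - reconstructed)^2)

# Flag rows whose error exceeds the 99% quantile (threshold chosen arbitrarily).
threshold <- unname(quantile(row_mse, 0.99))
anomalies <- which(row_mse > threshold)
```

Rows that the autoencoder cannot reconstruct well are candidates for anomalies, which is the typical use of this error measure.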

Overall, H2O is a highly user-friendly package that can be used to train feed-forward networks or deep autoencoders. It supports distributed computations and provides a web interface.

### Package “deepr”

The package deepr (deepr 2015) doesn’t implement any deep learning algorithms itself but forwards its tasks to H2O. The package was originally designed at a time when the H2O package was not yet available on CRAN. As this is no longer the case, we will exclude it from our comparison. We also note that its function train_rbm() uses the deepnet implementation of a restricted Boltzmann machine to train a model with some additional output.

## Comparison of Packages

This section compares the aforementioned packages across different metrics. Among these are ease-of-use, flexibility, ease-of-installation, support of parallel computations and assistance in choosing hyperparameters. In addition, we measure the performance across three common datasets ‘Iris’, ‘MNIST’ and ‘Forest Cover Type’. We hope that our comparison aids practitioners and researchers in choosing their preferred package for deep learning.

#### Installation

Installing packages that are available via CRAN is usually very simple and smooth. However, some packages depend on third party libraries. For example, H2O requires the latest version of Java, as well as Java Development Kit. The darch and MXNetR packages allow the use of GPU. For that purpose, darch depends on R package gputools, which is only supported on Linux and MacOS systems. MXNetR is by default shipped without GPU support due to its dependence on cuDNN, which cannot be included in the package because of licensing restrictions. Thus, the GPU version of MXNetR requires Rtools and a modern compiler with C++11 support to compile MXNet from source with CUDA SDK and cuDNN.

#### Flexibility

With respect to flexibility, MXNetR is most likely at the top of the list. It allows one to experiment with different architectures due to its layer-wise approach of defining the network, not to mention the rich variety of parameters. In our opinion, both H2O and darch score second place. H2O predominantly addresses feed-forward networks and deep autoencoders, while darch focuses on restricted Boltzmann machines and deep belief networks. Both packages offer a broad range of tuning parameters. Last but not least, deepnet is a rather lightweight package but it might be beneficial when one wants to play around with different architectures. However, we do not recommend it for day-to-day use with huge datasets as its current version lacks GPU support and the relatively small set of parameters does not allow fine-tuning to the fullest.

#### Ease-of-use

H2O and MXNetR stand out for their speed and ease of use. MXNetR requires little to no preparation of data to start training and H2O offers a very intuitive wrapper by using the as.h2o() function, which converts data to the H2OFrame object. Both packages provide additional tools to examine models. deepnet takes labels in the form of a one-hot encoded matrix. This usually requires some pre-processing since most datasets have their classes in a vector format. Moreover, it does not report very detailed information regarding the progress during training. The package also lacks additional tools for examining models. darch, on the other hand, has a very nice and verbose output.
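The pre-processing step for one-hot encoding is short in base R. The sketch below converts a vector of class labels into a 0/1 indicator matrix with one column per class; the example labels are made up, and this is only one of several ways to build such a matrix.

```r
labels <- c("setosa", "virginica", "versicolor", "setosa")  # example class vector
classes <- sort(unique(labels))

# Compare every label with every class; TRUE/FALSE becomes 1/0.
one_hot <- outer(labels, classes, "==") * 1
colnames(one_hot) <- classes
```

Each row contains exactly one 1, located in the column of the sample's class.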

Overall, we see H2O or MXNetR as the winners in this category, since both are fast and provide feedback during training. This allows one to quickly adjust parameters and improve the predictive performance.

#### Parallelization

Deep learning is common when dealing with massive datasets. As such, it can be of tremendous help when the packages allow for some degree of parallelization. Table 2 compares the support of parallelization. It shows only explicitly stated information from the documentation.

Table 2. Comparison of parallelization.

| Package | Multiple CPU | [Multiple] GPU | Cluster | Platforms |
|---|---|---|---|---|
| MXNetR | X | X | | Linux\MacOS\Windows |
| darch | | X | | Linux\MacOS |
| H2O | X | | X | Linux\MacOS\Windows |
| deepnet | No information | | | |

#### Choice of parameters

Another crucial aspect is the choice of hyperparameters. The H2O package uses a fully-automated per-neuron adaptive learning rate for fast convergence. It also has an option to use n-fold cross-validation and offers the function h2o.grid() for grid search in order to optimize hyperparameters and select models.
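For readers unfamiliar with n-fold cross-validation, the base R sketch below shows how samples can be assigned to folds so that each fold serves once as the validation set. The sample size, the number of folds and the variable names are assumptions for illustration; the sketch is independent of how H2O performs the split internally.

```r
set.seed(123)
n <- 150   # assumed number of samples
k <- 5     # assumed number of folds

# Assign each sample to one of k folds in random order.
fold_id <- sample(rep(seq_len(k), length.out = n))

# For each fold, the held-out samples validate a model trained on the rest.
split_sizes <- t(sapply(seq_len(k), function(fold) {
  c(train = sum(fold_id != fold), valid = sum(fold_id == fold))
}))
```

Every sample is used for validation exactly once, and the validation scores of the k models are then averaged.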

MXNetR displays the training accuracy after each iteration. darch shows the error after each epoch. Both allow for manually experimenting with different hyperparameters without waiting for the convergence, since the training phase can be terminated earlier in case the accuracy doesn’t improve. In contrast, deepnet doesn’t display any information until training is completed, which makes tweaking the hyperparameters very challenging.

#### Performance and runtime

We prepared a very simple comparison of performance in order to provide our readers with information on the efficiency. All subsequent measurements were made on a system with CPU Intel Core i7 and GPU NVidia GeForce 750M, Windows OS. The comparison is carried out on three datasets: ‘MNIST’ (LeCun et al. 2012), ‘Iris’ (Fisher 1936) and ‘Forest Cover Type’ (Blackard and Dean 1998). Details are provided in the appendix.

As a baseline, we use the random forest algorithm as implemented in the H2O package. The random forest is an ensemble learning method that works by constructing multiple decision trees (Wikipedia 2016b). Interestingly, it has proved its ability to achieve a high performance while working out-of-the-box without parameter tuning to a large extent.

Results

The results of the measurements are presented in Table 3 and also visualized in Figures 2, 3, and 4 for the ‘MNIST’, ‘Iris’ and ‘Forest Cover Type’ datasets, respectively.

• ‘MNIST’ dataset. According to Table 3 and Figure 2, MXNetR and H2O achieve a superior trade-off between runtime and predictive performance on the ‘MNIST’ dataset. darch and deepnet take a relatively long time to train the networks while simultaneously achieving a lower accuracy.
• ‘Iris’ dataset. Here, we see again that MXNetR and H2O perform best. As can be seen from Figure 3, deepnet has the lowest accuracy, probably because it is such a tiny dataset where the pre-training is misleading. Because of this, darch 100 and darch 500/300 were trained through backpropagation, omitting a pre-training phase. This is marked by the * symbol in the table.
• ‘Forest Cover Type’ dataset. H2O and MXNetR reach an accuracy of only around 67%, but this is still better than the remaining deep learning packages. We note that the training of darch 100 and darch 500/300 didn’t converge, and the models have thus been excluded from this comparison.

We hope that even this simple performance comparison can provide valuable insights for practitioners when choosing their preferred R package.

Note: It can be seen from Figures 3 and 4 that the random forest can perform better than the deep learning packages. There are several valid reasons for this. First, the datasets are too small, as deep learning usually requires big data or the use of data augmentation to function properly. Second, the data in these datasets consists of hand-made features, which negates the advantage of deep architectures to learn those features from raw data, and, therefore, traditional methods might be sufficient. Finally, we chose very similar (and probably not the most efficient) architectures in order to compare the different implementations.

Table 3. Comparison of accuracy and runtime across different deep learning packages in R.
* Models that were trained with backpropagation only (no pre-training).

| Model | MNIST accuracy (%) | MNIST runtime (sec) | Iris accuracy (%) | Iris runtime (sec) | Forest Cover Type accuracy (%) | Forest Cover Type runtime (sec) |
|---|---|---|---|---|---|---|
| MXNetR (CPU) | 98.33 | 147.78 | 83.04 | 1.46 | 66.80 | 30.24 |
| MXNetR (GPU) | 98.27 | 336.94 | 84.77 | 3.09 | 67.75 | 80.89 |
| darch 100 | 92.09 | 1368.31 | 69.12 * | 1.71 | | |
| darch 500/300 | 95.88 | 4706.23 | 54.78 * | 2.10 | | |
| deepnet DBN | 97.85 | 6775.40 | 30.43 | 0.89 | 14.06 | 67.97 |
| deepnet DNN | 97.05 | 2183.92 | 78.26 | 0.42 | 26.01 | 25.67 |
| H2O | 98.08 | 543.14 | 89.56 | 0.53 | 67.36 | 5.78 |
| Random Forest | 96.77 | 125.28 | 91.30 | 2.89 | 86.25 | 9.41 |

Figure 2. Comparison of runtime and accuracy for the ‘MNIST’ dataset.

Figure 3. Comparison of runtime and accuracy for the ‘Iris’ dataset.

Figure 4. Comparison of runtime and accuracy for the ‘Forest Cover Type’ dataset.

## Conclusion

As part of this article, we have compared five different packages in R for the purpose of deep learning: (1) the current version of deepnet might represent the most differentiated package in terms of available architectures. However, due to its implementation, it might not be the fastest nor the most user-friendly option. Furthermore, it might not offer as many tuning parameters as some of the other packages. (2) H2O and MXNetR, on the contrary, offer a highly user-friendly experience. Both also provide output of additional information, perform training quickly and achieve decent results. H2O might be more suited for cluster environments, where data scientists can use it for data mining and exploration within a straightforward pipeline. When flexibility and prototyping are more of a concern, then MXNetR might be the most suitable choice. It provides an intuitive symbolic tool that is used to build custom network architectures from scratch. Additionally, it is well optimized to run on a personal computer by exploiting multi CPU/GPU capabilities. (3) darch offers a limited but targeted functionality focusing on deep belief networks.

Altogether, we see that R support for deep learning is well on its way. Initially, the offered capabilities of R were lagging behind other programming languages. However, this is no longer the case. With H2O and MXNetR, R users have two powerful tools at their fingertips. In the future, it would be desirable to see further interfaces, e.g. for Caffe or Torch.

## Appendix

‘MNIST’ is a well-known digit recognition dataset. It contains 60,000 training samples and 10,000 test samples with labels and can be downloaded in csv format from http://pjreddie.com/projects/mnist-in-csv/. The ‘Forest Cover Type’ dataset originates from a Kaggle challenge and can be found at https://www.kaggle.com/c/forest-cover-type-prediction/data. It contains 15,120 labeled observations that we divide into 70% training set and 30% test set. It has 54 features and 7 output classes of cover type. The ‘Iris’ dataset is also very popular in machine learning. It is a tiny dataset with 3 classes and 150 samples, and we also subdivide it in a 70/30 ratio for training and testing. We immediately observe that practical applications require far larger datasets to unleash the full potential of deep learning. We are aware of this issue but, nevertheless, want to provide a very plain comparison. However, our experiments indicate that not all packages might be suitable for big data and can thus still provide decision support to practitioners.
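The 70/30 division can be reproduced with a few lines of base R. The sketch below uses the iris data frame that ships with R; the seed and the variable names are arbitrary choices for this example, and the exact split used in our experiments is not reproduced here.

```r
set.seed(2016)                           # arbitrary seed for reproducibility
n <- nrow(iris)                          # 150 samples

# Randomly pick 70% of the row indices for training.
train_idx <- sample(n, size = round(0.7 * n))

train_set <- iris[train_idx, ]           # 70% training set
test_set <- iris[-train_idx, ]           # remaining 30% test set
```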

For the ‘MNIST’ dataset, all networks were designed to have 2 hidden layers with 500 and 300 units, respectively. One exception is darch 100, which has one hidden layer with 100 elements. For other datasets the number of hidden units was reduced by the factor of ten and, hence, architectures have 2 hidden layers with 50 and 30 units, respectively. Where possible, the array batch size was set to 500 elements, momentum to 0.9, learning rate to 0.07 and dropout ratio to 0.2. Number of rounds (in MXNetR) or epochs (in other packages) was set to 50. darch architectures used pre-training with 15 epochs and batch size 100.

The ‘Iris’ dataset is tiny compared to the others. It has only 150 samples that were randomly shuffled and divided for training and test sets. Therefore, all numbers in tables referring to it were averaged across 5 runs. The batch size parameter was reduced to 5 and the learning rate to 0.007.

The third dataset is the ‘Forest Cover Type’, which has 15,120 samples. The architecture of the networks was the same as for the ‘Iris’ dataset. As this dataset is more challenging, the number of epochs was increased from 50 to 100.

## References

Blackard, J. A., and Dean, D. J. 1998. “Comparative accuracies of neural networks and discriminant analysis in predicting forest cover types from cartographic variables,” in Proc. Second Southern Forestry GIS Conf., pp. 189–199.

Candel, A., Parmar, V., LeDell, E., and Arora, A. 2015. “Deep learning with H2O,”

darch. 2015. “Package darch,” (available at https://cran.r-project.org/web/packages/darch/darch.pdf).

deepnet. 2015. “Package deepnet,” (available at https://cran.r-project.org/web/packages/deepnet/deepnet.pdf).

deepr. 2015. “Deepr,” (available at https://github.com/woobe/deepr; retrieved January 9, 2016).

Fisher, R. A. 1936. “The use of multiple measurements in taxonomic problems,” Annals of Eugenics (7:2), pp. 179–188.

H2O. 2015. “Package h2o,” (available at https://cran.r-project.org/web/packages/h2o/h2o.pdf).

LeCun, Y. A., Bottou, L., Orr, G. B., and Müller, K.-R. 2012. “Efficient backprop,” in Neural networks: Tricks of the trade, Springer, pp. 9–48.

MXNetR. 2016a. “MXNet R package: MXNet 0.5.0 documentation,” (available at https://mxnet.readthedocs.org/en/latest/R-package/index.html#tutorials; retrieved January 9, 2016).

MXNetR. 2016b. “Installation guide: MXNet 0.5.0 documentation,” (available at https://mxnet.readthedocs.org/en/latest/build.html; retrieved January 9, 2016).

Nair, V., and Hinton, G. E. 2010. “Rectified linear units improve restricted Boltzmann machines,” in Proceedings of the 27th International Conference on Machine Learning (ICML-10), pp. 807–814.

Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. 2014. “Dropout: A simple way to prevent neural networks from overfitting,” Journal of Machine Learning Research (15:1), pp. 1929–1958.

Wikipedia. 2016a. “Wikipedia: Deep learning,” (available at https://en.wikipedia.org/wiki/Deep_learning; retrieved March 17, 2016).

Wikipedia. 2016b. “Wikipedia: Random forest,” (available at https://en.wikipedia.org/wiki/Random_forest; retrieved February 3, 2016).