This is a detailed tutorial on image recognition in R using a deep convolutional neural network provided by the MXNet package. After a short post I wrote some times ago I received a lot of requests and emails for a much more detailed explanation, therefore I decided to write this tutorial. This post will show a reproducible example on how to get 97.5% accuracy score on a faces recognition task in R.
I feel some premises are necessary. I’m writing this tutorial with two main objectives.
The first one is to provide a fully reproducible example to everyone; the second objective is to provide a, hopefully, clear explanation to the most frequent questions I received via email on the old post. Please be aware that this is just how I decided to tackle this particular problem, this is certainly not the only way to solve it, and most certainly it is not the best. If you have any suggestion to improve the model or the process, or should you spot any mistake, please do leave a comment.
I’m going to use both Python 3.x (for getting the data and some preprocessing) and R (for the core of the tutorial), so make sure to have both installed. The package requirements for R are the following:
- MXNet. This package will provide you with the model we are going to be using in this tutorial, namely a deep convolutional neural network. You don’t need the GPU version, CPU will suffice for this task although it could be slower. If you feel like it, then go for the GPU version.
- EBImage. This package provides a wide variety of tools to work with images. It is a pleasure to work on images using this package, the documentation is crystal clear and pretty straightforward.
As for Python 3.x please make sure to install both Numpy and Scikit-learn. I suggest to install the Anaconda distribution which already has installed some commonly used packages for data science and machine learning.
Once you have all these things installed and running, you are set to go.
I am going to use the Olivetti faces dataset1. This dataset is a collection of 64×64 pixel 0-256 greyscale images.
The dataset contains a total of 400 images of 40 subjects. With just 10 samples for each subject it is usually used for unsupervised or semi-supervised algorithms, but I’m going to try my best with the selected supervised method.
First of all, you need to scale the images in the 0-1 range. This is automatically done by the function we are going to use to download the dataset, so you do not need to worry about it that much, just be aware that it has already been done for you. Should you use your own images be sure to scale them down in the 0-1 range (or –1,1 range, although the first seems work better with neural networks in my experience). This below is the Python script you need to run in order to download the dataset. Just change the paths to your selected paths and then run it either from your IDE or the terminal.
What this piece of code does is basically download the data, reshape the X datapoints (the images) and then save the numpy arrays to a .csv file.
As you can see, the x array is a tensor (tensor is a fancy name for a multidimensional matrix) of shape (400, 64, 64): this means that the x array contains 400 samples of 64×64 matrices (read images). When in doubt about these things, just print out the first elements of the tensor and try to figure out the structure of the data with what you know. For instance, we know from the dataset description that we have 400 samples and that each image is 64×64 pixel. We flatten the x tensor into a matrix of shape 400×4096. That is, each 64×64 matrix (image) is now converted (flattened) to a row vector of length 4096.
As for y, y is already a simple column vector of size 400. No need to reshape it.
Take a look at the generated .csv and check that everything makes sense to you.
Some more preprocessing with R
Now we are going to use EBImage to resize each image to 28×28 pixel and generate the training and testing datasets. You may ask why I am resizing the images, well, for some reason my machine does not like 64×64 pixel images and my PC crash whenever I run the model with the data. Not good. But that’s ok since we can get good results with smaller images too (you can try running the model with the 64×64 size images if your PC does not crash though). Let’s go on now:
This part of the tutorial should be pretty self explanatory, if you are not sure on what is the output, I suggest taking a look at the rs_df dataframe. It should be a 400×785 dataframe as follows:
label, pixel1, pixel2, … , pixel784
0, 0.2, 0.3, … ,0.1
Building the model
Now for the fun part, let’s build the model. Below you can find the script I used to train and test the model. After the script you can find my comments and explanations on the code.
After having loaded the training and testing dataset, I convert each dataframe to a numeric matrix using the data.matrix function. Remember that the first column of the data represents the labels associated with each image. Be sure to remove the labels from train_array and test_array. Then, after having separated the predictors from the labels, you need to tell MXNet the shape of the data. This is what I did at line 19 with the following piece of code “dim(train_array) <- c(28, 28, 1, ncol(train_x))” for the training array and at line 24 for the test arrray. By doing this, we are essentially telling the model that the training data is made of ncol(train_x) samples (360 images) of shape 28x28. The number 1 signals that the images are greyscale, ie they have only 1 channel. If the images were RGB then the 1 should have been replaced by a 3 since in that case each image would have 3 channels.
As far as the model structure is concerned, this is a variation of the LeNet model using hyperbolic tangent instead of “Relu” as activation function, 2 convolutional layers, 2 pooling layers, 2 fully connected layer, and a typical softmax output.
Each convolutional layer is using a 5×5 kernel and applying a fixed number of filters, check this awesome video for some qualitative insight into convolutional layers. The pooling layers apply a classical “max pooling” approach.
During my tests tanh seemd to work far better than sigmoid and Relu but you can try and use the other activation functions if you want.
As for the model hyperparameters, the learning rate is a little higher than usual but it turns out that it works fine, while the selected number of epochs is 480. Batch size of 40 seems to work fine. These hyperparameters were found out from many trials and error. As far as I know I could have performed a grid search at best, but I did not want to over complicate the code and therefore I just used the classical trial and error approach.
At the end you should get 0.975 accuracy.
All in all, this model was fairly easy to set up and run. When running on CPU it takes about 4-5 minutes to train which is a bit long if you want to make some experiments but it is reasonable to work with.
Considering the fact that we did not do any feature engineering and just made some simple and ordinary preprocessing steps, I believe the result achieved is quite good. Of course, in case we would like to get a better grasp on the “true” accuracy, we would need to perform some cross validation procedures (that will inevitably eat up a lot of computing time ).
Thank you for reading this far, I hope this tutorial helped you understanding how to set up and run this particular model.
Dataset source and credits:
1Olivetti faces dataset was made by AT&T Laboratories Cambridge.