# Deep Learning from first principles in Python, R and Octave – Part 4

**R – Giga thoughts …**, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

In this 4th post of my series on Deep Learning from first principles in Python, R and Octave – Part 4, I explore the details of creating a multi-class classifier using the Softmax activation unit in a neural network. The earlier posts in this series were

1. Deep Learning from first principles in Python, R and Octave – Part 1. In this post I implemented logistic regression as a simple Neural Network in vectorized Python, R and Octave

2. Deep Learning from first principles in Python, R and Octave – Part 2. This 2nd part implemented the most elementary neural network with 1 hidden layer and any number of activation units in the hidden layer with sigmoid activation at the output layer

3. Deep Learning from first principles in Python, R and Octave – Part 3. The 3rd implemented a multi-layer Deep Learning network with an arbitrary number if hidden layers and activation units per hidden layer. The output layer was for binary classification which was based on the sigmoid unit. This multi-layer deep network was implemented in vectorized Python, R and Octave

This 4th part takes a swing at multi-class classification and uses the Softmax as the activation unit in the output layer. Inclusion of the Softmax activation unit in the activation layer requires us to compute the derivative of Softmax, or rather the “Jacobian” of the Softmax function, besides also computing the log loss for this Softmax activation during back propagation. Since the derivation of the Jacobian of a Softmax and the computation of the Cross Entropy/log loss is very involved, I have implemented a basic neural network with just 1 hidden layer with the Softmax activation at the output layer. I also perform multi-class classification based on the ‘spiral’ data set from CS231n Convolutional Neural Networks Stanford course, to test the performance and correctness of the implementations in Python, R and Octave. You can clone download the code for the Python, R and Octave implementations from Github at Deep Learning – Part 4

The Softmax function takes an N dimensional vector as input and generates a N dimensional vector as output.

The Softmax function is given by

There is a probabilistic interpretation of the Softmax, since the sum of the Softmax values of a set of vectors will always add up to 1, given that each Softmax value is divided by the total of all values.

As mentioned earlier, the Softmax takes a vector input and returns a vector of outputs. For e.g. the Softmax of a vector a=[1, 3, 6] is another vector S=[0.0063,0.0471,0.9464]. Notice that vector output is proportional to the input vector. Also, taking the derivative of a vector by another vector, is known as the Jacobian. By the way, The Matrix Calculus You Need For Deep Learning by Terence Parr and Jeremy Howard, is very good paper that distills all the main mathematical concepts for Deep Learning in one place.

Let us take a simple 2 layered neural network with just 2 activation units in the hidden layer is shown below

and

where g'() is the activation unit in the hidden layer which can be a relu, sigmoid or a

tanh function

**Note**: The superscript denotes the layer. The above denotes the equation for layer 1

of the neural network. For layer 2 with the Softmax activation, the equations are

and

where S() is the Softmax activation function

The Jacobian of the softmax ‘S’ is given by

– (A)

Now the ‘division-rule’ of derivatives is as follows. If u and v are functions of x, then

Using this to compute each element of the above Jacobian matrix, we see that

when i=j we have

and when

This is of the general form

when i=j

and

when

**Note**: Since the Softmax essentially gives the probability the following

notation is also used

when i=j

and

If you throw the “Kronecker delta” into the equation, then the above equations can be expressed even more concisely as

where when i=j and 0 when

This reduces the Jacobian of the simple 2 output softmax vectors equation (A) as

The loss of Softmax is given by

For the 2 valued Softmax output this is

Using the chain rule we can write

(1)

and

(2)

In expanded form this is

Also

Therefore

Since

when i=j

and

when

which simplifies to

Since

Similarly

In general this is of the form

For e.g if the probabilities computed were p=[0.1, 0.7, 0.2] then this implies that the class with probability 0.7 is the likely class. This would imply that the ‘One hot encoding’ for yi would be yi=[0,1,0] therefore the gradient pi-yi = [0.1,-0.3,0.2]

**Note: Further, we could extend this derivation for a Softmax activation output that outputs 3 classes
**

To

R-bloggers.com offers

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

We could derive

which similarly reduces to

Interestingly, despite the lengthy derivations the final result is simple and intuitive!

As seen in my post ‘Deep Learning from first principles with Python, R and Octave – Part 3 the key equations for forward and backward propagation are

**Forward propagation equations layer 1**

and

**Forward propagation equations layer 1**

and

**Backward propagation equations layer 2**

**Backward propagation equations layer 1**

#### 2.0 Spiral data set

As I mentioned earlier, I will be using the ‘spiral’ data from CS231n Convolutional Neural Networks to ensure that my vectorized implementations in Python, R and Octave are correct. Here is the ‘spiral’ data set.

import numpy as np import matplotlib.pyplot as plt import os os.chdir("C:/junk/dl-4/dl-4") exec(open("././DLfunctions41.py").read()) # Create an input data set - Taken from CS231n Convolutional Neural networks # http://cs231n.github.io/neural-networks-case-study/ N = 100 # number of points per class D = 2 # dimensionality K = 3 # number of classes X = np.zeros((N*K,D)) # data matrix (each row = single example) y = np.zeros(N*K, dtype='uint8') # class labels for j in range(K): ix = range(N*j,N*(j+1)) r = np.linspace(0.0,1,N) # radius t = np.linspace(j*4,(j+1)*4,N) + np.random.randn(N)*0.2 # theta X[ix] = np.c_[r*np.sin(t), r*np.cos(t)] y[ix] = j # Plot the data plt.scatter(X[:, 0], X[:, 1], c=y, s=40, cmap=plt.cm.Spectral) plt.savefig("fig1.png", bbox_inches='tight')

The implementations of the vectorized Python, R and Octave code are shown diagrammatically below

#### 2.1 Multi-class classification with Softmax – Python code

A simple 2 layer Neural network with a single hidden layer , with 100 Relu activation units in the hidden layer and the Softmax activation unit in the output layer is used for multi-class classification. This Deep Learning Network, plots the non-linear boundary of the 3 classes as shown below

import numpy as np import matplotlib.pyplot as plt import os os.chdir("C:/junk/dl-4/dl-4") exec(open("././DLfunctions41.py").read()) # Read the input data N = 100 # number of points per class D = 2 # dimensionality K = 3 # number of classes X = np.zeros((N*K,D)) # data matrix (each row = single example) y = np.zeros(N*K, dtype='uint8') # class labels for j in range(K): ix = range(N*j,N*(j+1)) r = np.linspace(0.0,1,N) # radius t = np.linspace(j*4,(j+1)*4,N) + np.random.randn(N)*0.2 # theta X[ix] = np.c_[r*np.sin(t), r*np.cos(t)] y[ix] = j # Set the number of features, hidden units in hidden layer and number of classess numHidden=100 # No of hidden units in hidden layer numFeats= 2 # dimensionality numOutput = 3 # number of classes # Initialize the model parameters=initializeModel(numFeats,numHidden,numOutput) W1= parameters['W1'] b1= parameters['b1'] W2= parameters['W2'] b2= parameters['b2'] # Set the learning rate learningRate=0.6 # Initialize losses losses=[] # Perform Gradient descent for i in range(10000): # Forward propagation through hidden layer with Relu units A1,cache1= layerActivationForward(X.T,W1,b1,'relu') # Forward propagation through output layer with Softmax A2,cache2 = layerActivationForward(A1,W2,b2,'softmax') # No of training examples numTraining = X.shape[0] # Compute log probs. Take the log prob of correct class based on output y correct_logprobs = -np.log(A2[range(numTraining),y]) # Conpute loss loss = np.sum(correct_logprobs)/numTraining # Print the loss if i % 1000 == 0: print("iteration %d: loss %f" % (i, loss)) losses.append(loss) dA=0 # Backward propagation through output layer with Softmax dA1,dW2,db2 = layerActivationBackward(dA, cache2, y, activationFunc='softmax') # Backward propagation through hidden layer with Relu unit dA0,dW1,db1 = layerActivationBackward(dA1.T, cache1, y, activationFunc='relu') #Update paramaters with the learning rate W1 += -learningRate * dW1 b1 += -learningRate * db1 W2 += -learningRate * dW2.T b2 += -learningRate * db2.T #Plot losses vs iterations i=np.arange(0,10000,1000) plt.plot(i,losses) plt.xlabel('Iterations') plt.ylabel('Loss') plt.title('Losses vs Iterations') plt.savefig("fig2.png", bbox="tight") #Compute the multi-class Confusion Matrix from sklearn.metrics import confusion_matrix from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score # We need to determine the predicted values from the learnt data # Forward propagation through hidden layer with Relu units A1,cache1= layerActivationForward(X.T,W1,b1,'relu') # Forward propagation through output layer with Softmax A2,cache2 = layerActivationForward(A1,W2,b2,'softmax') #Compute predicted values from weights and biases yhat=np.argmax(A2, axis=1) a=confusion_matrix(y.T,yhat.T) print("Multi-class Confusion Matrix") print(a)

## iteration 0: loss 1.098507 ## iteration 1000: loss 0.214611 ## iteration 2000: loss 0.043622 ## iteration 3000: loss 0.032525 ## iteration 4000: loss 0.025108 ## iteration 5000: loss 0.021365 ## iteration 6000: loss 0.019046 ## iteration 7000: loss 0.017475 ## iteration 8000: loss 0.016359 ## iteration 9000: loss 0.015703 ## Multi-class Confusion Matrix ## [[ 99 1 0] ## [ 0 100 0] ## [ 0 1 99]]

To know more details about Confusion Matrix and other related measures check out my book My book ‘Practical Machine Learning with R and Python’ on Amazon

#### 2.2 Multi-class classification with Softmax – R code

The spiral data set created with Python was saved, and is used as the input with R code. The R Neural Network seems to perform much,much slower than both Python and Octave. Not sure why! Incidentally the computation of loss and the softmax derivative are identical for both R and Octave. yet R is much slower. To compute the softmax derivative I create matrices for the One Hot Encoded yi and then stack them before subtracting pi-yi. I am sure there is a more elegant and more efficient way to do this, much like Python. Any suggestions?

library(ggplot2) library(dplyr) library(RColorBrewer)

</code><code class="hljs"><span class="hljs-keyword">source</span>(<span class="hljs-string">"DLfunctions41.R"</span>) <span class="hljs-comment"># Read the spiral dataset</span> Z <- as.matrix(read.csv(<span class="hljs-string">"spiral.csv"</span>,header=<span class="hljs-literal">FALSE</span>)) Z1=data.frame(Z) <span class="hljs-comment">#Plot the dataset</span> ggplot(Z1,aes(x=V1,y=V2,col=V3)) +geom_point() + scale_colour_gradientn(colours = brewer.pal(<span class="hljs-number">10</span>, <span class="hljs-string">"Spectral"</span>))

</code><code class="hljs"><span class="hljs-string"># Setup the data X <- Z[,<span class="hljs-number">1</span>:<span class="hljs-number">2</span>] y <- Z[,<span class="hljs-number">3</span>] X1 <- t(X) Y1 <- t(y) <span class="hljs-comment"># Initialize number of features, number of hidden units in hidden layer and</span> <span class="hljs-comment"># number of classes</span> numFeats<-<span class="hljs-number">2</span> <span class="hljs-comment"># No features</span> numHidden<-<span class="hljs-number">100</span> <span class="hljs-comment"># No of hidden units</span> numOutput<-<span class="hljs-number">3</span> <span class="hljs-comment"># No of classes</span> <span class="hljs-comment"># Initialize model</span> parameters <-initializeModel(numFeats, numHidden,numOutput) W1 <-parameters[[<span class="hljs-string">'W1'</span>]] b1 <-parameters[[<span class="hljs-string">'b1'</span>]] W2 <-parameters[[<span class="hljs-string">'W2'</span>]] b2 <-parameters[[<span class="hljs-string">'b2'</span>]] <span class="hljs-comment"># Set the learning rate</span> learningRate <- <span class="hljs-number">0.5</span> <span class="hljs-comment"># Initialize losses</span> losses <- <span class="hljs-literal">NULL</span> <span class="hljs-comment"># Perform gradient descent</span> <span class="hljs-keyword">for</span>(i <span class="hljs-keyword">in</span> <span class="hljs-number">0</span>:<span class="hljs-number">9000</span>){ <span class="hljs-comment"># Forward propagation through hidden layer with Relu units</span> retvals <- layerActivationForward(X1,W1,b1,<span class="hljs-string">'relu'</span>) A1 <- retvals[[<span class="hljs-string">'A'</span>]] cache1 <- retvals[[<span class="hljs-string">'cache'</span>]] forward_cache1 <- cache1[[<span class="hljs-string">'forward_cache1'</span>]] activation_cache <- cache1[[<span class="hljs-string">'activation_cache'</span>]] <span class="hljs-comment"># Forward propagation through output layer with Softmax units</span> retvals = layerActivationForward(A1,W2,b2,<span class="hljs-string">'softmax'</span>) A2 <- retvals[[<span class="hljs-string">'A'</span>]] cache2 <- retvals[[<span class="hljs-string">'cache'</span>]] forward_cache2 <- cache2[[<span class="hljs-string">'forward_cache1'</span>]] activation_cache2 <- cache2[[<span class="hljs-string">'activation_cache'</span>]] <span class="hljs-comment"># No oftraining examples</span> numTraining <- dim(X)[<span class="hljs-number">1</span>] dA <-<span class="hljs-number">0</span> <span class="hljs-comment"># Select the elements where the y values are 0, 1 or 2 and make a vector</span> a=c(A2[y==<span class="hljs-number">0</span>,<span class="hljs-number">1</span>],A2[y==<span class="hljs-number">1</span>,<span class="hljs-number">2</span>],A2[y==<span class="hljs-number">2</span>,<span class="hljs-number">3</span>]) <span class="hljs-comment"># Take log</span> correct_probs = -log(a) <span class="hljs-comment"># Compute loss</span> loss= sum(correct_probs)/numTraining <span class="hljs-keyword">if</span>(i %% <span class="hljs-number">1000</span> == <span class="hljs-number">0</span>){ sprintf(<span class="hljs-string">"iteration %d: loss %f"</span>,i, loss) print(loss) } <span class="hljs-comment"># Backward propagation through output layer with Softmax units</span> retvals = layerActivationBackward(dA, cache2, y, activationFunc=<span class="hljs-string">'softmax'</span>) dA1 = retvals[[<span class="hljs-string">'dA_prev'</span>]] dW2= retvals[[<span class="hljs-string">'dW'</span>]] db2= retvals[[<span class="hljs-string">'db'</span>]] <span class="hljs-comment"># Backward propagation through hidden layer with Relu units</span> retvals = layerActivationBackward(t(dA1), cache1, y, activationFunc=<span class="hljs-string">'relu'</span>) dA0 = retvals[[<span class="hljs-string">'dA_prev'</span>]] dW1= retvals[[<span class="hljs-string">'dW'</span>]] db1= retvals[[<span class="hljs-string">'db'</span>]] <span class="hljs-comment"># Update parameters</span> W1 <- W1 - learningRate * dW1 b1 <- b1 - learningRate * db1 W2 <- W2 - learningRate * t(dW2) b2 <- b2 - learningRate * t(db2) }</span>

## [1] 1.212487 ## [1] 0.5740867 ## [1] 0.4048824 ## [1] 0.3561941 ## [1] 0.2509576 ## [1] 0.7351063 ## [1] 0.2066114 ## [1] 0.2065875 ## [1] 0.2151943 ## [1] 0.1318807

#Create iterations iterations <- seq(0,10) #df=data.frame(iterations,losses) ggplot(df,aes(x=iterations,y=losses)) + geom_point() + geom_line(color="blue") + ggtitle("Losses vs iterations") + xlab("Iterations") + ylab("Loss") plotDecisionBoundary(Z,W1,b1,W2,b2)

**Multi-class Confusion Matrix**

library(caret) library(e1071) # Forward propagation through hidden layer with Relu units retvals <- layerActivationForward(X1,W1,b1,'relu') A1 <- retvals[['A']] # Forward propagation through output layer with Softmax units retvals = layerActivationForward(A1,W2,b2,'softmax') A2 <- retvals[['A']] yhat <- apply(A2, 1,which.max) -1 Confusion Matrix and Statistics Reference Prediction 0 1 2 0 97 0 1 1 2 96 4 2 1 4 95 Overall Statistics Accuracy : 0.96 95% CI : (0.9312, 0.9792) No Information Rate : 0.3333 P-Value [Acc > NIR] : <2e-16 Kappa : 0.94 Mcnemar's Test P-Value : 0.5724 Statistics by Class: Class: 0 Class: 1 Class: 2 Sensitivity 0.9700 0.9600 0.9500 Specificity 0.9950 0.9700 0.9750 Pos Pred Value 0.9898 0.9412 0.9500 Neg Pred Value 0.9851 0.9798 0.9750 Prevalence 0.3333 0.3333 0.3333 Detection Rate 0.3233 0.3200 0.3167 Detection Prevalence 0.3267 0.3400 0.3333 Balanced Accuracy 0.9825 0.9650 0.9625

My book “Practical Machine Learning with R and Python” includes the implementation for many Machine Learning algorithms and associated metrics. Pick up your copy today!

#### 2.3 Multi-class classification with Softmax – Octave code

A 2 layer Neural network with the Softmax activation unit in the output layer is constructed in Octave. The same spiral data set is used for Octave also

```
```source("DL41functions.m")

# Read the spiral data

data=csvread("spiral.csv");

# Setup the data

X=data(:,1:2);

Y=data(:,3);

# Set the number of features, number of hidden units in hidden layer and number of classes

numFeats=2; #No features

numHidden=100; # No of hidden units

numOutput=3; # No of classes

# Initialize model

[W1 b1 W2 b2] = initializeModel(numFeats,numHidden,numOutput);

# Initialize losses

losses=[]

#Initialize learningRate

learningRate=0.5;

for k =1:10000

# Forward propagation through hidden layer with Relu units

[A1,cache1 activation_cache1]= layerActivationForward(X',W1,b1,activationFunc ='relu');

# Forward propagation through output layer with Softmax units

[A2,cache2 activation_cache2] =

layerActivationForward(A1,W2,b2,activationFunc='softmax');

# No of training examples

numTraining = size(X)(1);

# Select rows where Y=0,1,and 2 and concatenate to a long vector

a=[A2(Y==0,1) ;A2(Y==1,2) ;A2(Y==2,3)];

#Select the correct column for log prob

correct_probs = -log(a);

#Compute log loss

loss= sum(correct_probs)/numTraining;

if(mod(k,1000) == 0)

disp(loss);

losses=[losses loss];

endif

dA=0;

# Backward propagation through output layer with Softmax units

[dA1 dW2 db2] = layerActivationBackward(dA, cache2, activation_cache2,Y,activationFunc='softmax');

# Backward propagation through hidden layer with Relu units

[dA0,dW1,db1] = layerActivationBackward(dA1', cache1, activation_cache1, Y, activationFunc='relu');

#Update parameters

W1 += -learningRate * dW1;

b1 += -learningRate * db1;

W2 += -learningRate * dW2';

b2 += -learningRate * db2';

endfor

# Plot Losses vs Iterations

iterations=0:1000:9000

plotCostVsIterations(iterations,losses)

# Plot the decision boundary

`plotDecisionBoundary( X,Y,W1,b1,W2,b2)`

The code for the Python, R and Octave implementations can be downloaded from Github at Deep Learning – Part 4

#### Conclusion

In this post I have implemented a 2 layer Neural Network with the Softmax classifier. In Part 3, I implemented a multi-layer Deep Learning Network. I intend to include the Softmax activation unit into the generalized multi-layer Deep Network along with the other activation units of sigmoid,tanh and relu.

Stick around, I’ll be back!!

Watch this space!

**References**

1. Deep Learning Specialization

2. Neural Networks for Machine Learning

3. CS231 Convolutional Neural Networks for Visual Recognition

4. Eli Bendersky’s Website – The Softmax function and its derivative

5. Cross Validated – Backpropagation with Softmax / Cross Entropy

6. Stackoverflow – CS231n: How to calculate gradient for Softmax loss function?

7. Math Stack Exchange – Derivative of Softmax

8. The Matrix Calculus for Deep Learning

**You may like**

1.My book ‘Practical Machine Learning with R and Python’ on Amazon

2. My travels through the realms of Data Science, Machine Learning, Deep Learning and (AI)

3. Deblurring with OpenCV: Weiner filter reloaded

4. A method to crowd source pothole marking on (Indian) roads

5. Rock N’ Roll with Bluemix, Cloudant & NodeExpress

6. Sea shells on the seashore

7. Design Principles of Scalable, Distributed Systems

To see all post click Index of posts

**leave a comment**for the author, please follow the link and comment on their blog:

**R – Giga thoughts …**.

R-bloggers.com offers

**daily e-mail updates**about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.