How To Build an Artificial Neural Network From Scratch


A Simple Artificial Neural Network With Python

The idea behind this article is to build a neural network from scratch. There is a plethora of tutorials and excellent articles on the internet and in textbooks. The slight difference with that other documentation lies in the methodology: I will detail the calculations as much as possible by connecting the algebra, the matrix formulas and the Python code. This helped me, at least, to understand how a simple ANN works under the hood.

Let’s begin with a simple network with 2 inputs, 2 hidden nodes and 2 outputs. Our task is to train this neural network to complete a classification task. The ultimate goal is to find the weights, i.e. the strengths of the links between nodes, that minimise the total error. For that, the learning method is split into two propagations:

1. Forward propagation
  • Compute the output of the hidden nodes
  • Compute the output of the third layer
  • Compute the error
2. Back propagation
  • Update the weights between the output layer and the hidden layer
  • Update the weights between the hidden layer and the input layer

An artificial neural network is similar to a biological one. Our brain passes a signal through connected neurons until a final decision is made. Each neuron receives an electrical signal from all of the neurons transmitting to it; if a signal is too weak, it is not passed on to the next neuron. The brain uses only the most relevant signals to propagate through the network. For an ANN, the logic is similar: the network takes an input, sends it to all connected nodes and computes the signal with an activation function, and only the signals above a threshold carry on to the next layer.

[Figure 1: a neuron split into its weighted-input sum (left) and its activation function (right)]

Figure 1 illustrates this idea. The neuron is decomposed into an input part and an activation function. The left part receives all the inputs from the previous layer; the right part passes the sum of these inputs through an activation function. In this article, we use the sigmoid function because of its convenient mathematical properties.

[Figure 2: the 2-input, 2-hidden-node, 2-output network]

Our network looks like Figure 2. The first layer holds the input values i_{i}. The second layer, called the hidden layer, receives the weighted inputs from the previous layer; we use the w_{ij} notation for the links between layer 1 and layer 2. For instance, the node h_{1} receives the signals weighted by w_{11} and w_{21}. The third layer links the hidden layer to the output layer, and these weights are written w_{jk}. From Figure 3, we can see that the weight connecting the first node of the hidden layer to the first node of the output layer is w_{11} = 0.40.

For simplicity, let’s write down all the weights in a table. Note that we deliberately do not include bias terms.

Input layer (i_i)    Input → hidden weights (w_ij)    Hidden → output weights (w_jk)    Target output (o_k)
i_1 = 0.05           w_11 = 0.15, w_12 = 0.25         w_11 = 0.40, w_12 = 0.50          o_1 = 0.01
i_2 = 0.10           w_21 = 0.20, w_22 = 0.30         w_21 = 0.45, w_22 = 0.55          o_2 = 0.99
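To keep the worked example concrete, here is a minimal sketch of these values as numpy arrays (the variable names inputs, targets, wih and who are my own, chosen to match the class we build below):

import numpy as np

# worked-example values from the table above
inputs  = np.array([[0.05], [0.10]])   # input column vector, i
targets = np.array([[0.01], [0.99]])   # target outputs, o_k
wih = np.array([[0.15, 0.20],          # input -> hidden weights
                [0.25, 0.30]])
who = np.array([[0.40, 0.45],          # hidden -> output weights
                [0.50, 0.55]])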

To train our neural network with Python, we follow the methodology above and write three functions, following the book by Tariq Rashid:

  • initialisation : To set the number of input, hidden and output nodes

  • train : refine the weights after being given a training set example to learn from

  • query : give an answer from the output nodes after being given an input

# neural network class definition
class neuralNetwork:

    # initialise the neural network
    def __init__(self):
        pass

    # train the neural network
    def train(self):
        pass

    # query the neural network
    def query(self):
        pass
We can initialise the network with the following code. We need to import the numpy library and scipy.special, which provides the sigmoid function. We deliberately build the neural network without Keras or TensorFlow because we aim to understand what is going on behind the code.

import numpy as np
import scipy.special

class neuralNetwork:

    def __init__(self, inputnodes, hiddennodes, outputnodes, learningrate):
        # set number of nodes in each input, hidden, output layer
        self.inodes = inputnodes
        self.hnodes = hiddennodes
        self.onodes = outputnodes

        # link weight matrices, wih (input -> hidden) and who (hidden -> output),
        # fixed to the values of our worked example instead of being drawn at random
        self.wih = np.matrix([[0.15, 0.20], [0.25, 0.30]])
        self.who = np.matrix([[0.40, 0.45], [0.50, 0.55]])

        # learning rate
        self.lr = learningrate

        # activation function is the sigmoid function
        self.activation_function = lambda x: scipy.special.expit(x)
        pass

Our neural network now looks like Figure 3. We are ready to move on to the forward propagation.

[Figure 3: the network with its initial weight values]

Forward propagation

The first step of training the network is to compute its outputs. We begin with the outputs of the hidden layer, o_{j}. We then calculate the outputs generated by the network, which will be compared with the target outputs. The third layer's outputs are obtained by taking the hidden layer's outputs, weighting them, and plugging the result into the activation function, i.e. the sigmoid function.

Hidden layer

Algebra:
\[ net_{j} = \sum_{i} w_{ij}\, i_{i} \qquad net_{1} = 0.15 \times 0.05 + 0.20 \times 0.10 = 0.0275 \]
Matrix formula:
\[ X_{hidden} = W_{ih} \cdot I \]
Matrix:
\[ \begin{bmatrix} 0.15 & 0.20 \\ 0.25 & 0.30 \end{bmatrix} \cdot \begin{bmatrix} 0.05 \\ 0.10 \end{bmatrix} = \begin{bmatrix} 0.0275 \\ 0.0425 \end{bmatrix} \]

We split the computation into three parts. The net_{j} variable represents the left part of the node, which is simply the sum of the weighted inputs. With two nodes the computation is light and can easily be done by hand; with hundreds of connected nodes, however, it becomes time-consuming and error-prone. Instead of algebra, we can rewrite the equation in a matrix form that any computer program can handle. In our case, we use Python's numpy library. The matrix formula is simply X_{hidden} = W_{ih} \cdot I, which multiplies the weight matrix by the input vector; the dot denotes matrix multiplication. The third column spells out the inputs and weights in matrix form.
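As a quick sanity check, here is a minimal numpy sketch of this matrix product (variable names as above):

import numpy as np

wih = np.array([[0.15, 0.20], [0.25, 0.30]])
inputs = np.array([[0.05], [0.10]])

# X_hidden = W_ih . I
hidden_inputs = np.dot(wih, inputs)
print(hidden_inputs)   # [[0.0275]
                       #  [0.0425]]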

Activation function

We can now plug the sum of the weighted inputs into the activation function.
\[ o(x) = \frac{1}{1+e^{-x}} \qquad o_{1} = \frac{1}{1+e^{-0.0275}} \approx 0.5069 \qquad o_{2} = \frac{1}{1+e^{-0.0425}} \approx 0.5106 \]
The outputs of the hidden layer, o_{j}, are:
  • o_{1} ≈ 0.5069
  • o_{2} ≈ 0.5106
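We can verify these two values with scipy.special.expit, the same sigmoid implementation used in the class above (a minimal sketch):

import scipy.special

# sigmoid of the weighted sums computed above
print(scipy.special.expit(0.0275))   # ~0.5069
print(scipy.special.expit(0.0425))   # ~0.5106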

The Python code that produces these outputs comes directly from the query function below. We can do that because we have initialised the weights beforehand.

We need to transpose the input matrix, which is easily done with the .T attribute. Numpy also offers a convenient way to compute a dot product: np.dot takes two arguments, corresponding to matrix A and matrix B. We wrote the matrix formula in the second column above; the Python equivalent is simply np.dot(self.wih, inputs). The final output is calculated with the activation_function declared earlier.

    def query(self, inputs_list):
        inputs = np.array(inputs_list, ndmin=2).T
        # calculate signals into hidden layer
        hidden_inputs = np.dot(self.wih, inputs)
        # calculate the signals emerging from hidden layer
        hidden_outputs = self.activation_function(hidden_inputs)
        # calculate signals into final output layer
        final_inputs = np.dot(self.who, hidden_outputs)
        # calculate the signals emerging from final output layer 
        final_outputs = self.activation_function(final_inputs)
        return final_outputs

Output layer

We can now turn to the final step of the forward propagation. We know the weights of the connected nodes and the outputs of the previous layer.

Algebra:
\[ net_{k} = \sum_{j} w_{jk}\, o_{j} \qquad net_{1} = 0.40 \times 0.5069 + 0.45 \times 0.5106 \approx 0.4325 \]
Matrix formula:
\[ X_{output} = W_{ho} \cdot O_{hidden} \]
Matrix:
\[ \begin{bmatrix} 0.40 & 0.45 \\ 0.50 & 0.55 \end{bmatrix} \cdot \begin{bmatrix} 0.5069 \\ 0.5106 \end{bmatrix} = \begin{bmatrix} 0.4325 \\ 0.5343 \end{bmatrix} \]
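The same check in numpy for the output layer's weighted sums (a minimal sketch; hidden_outputs holds the values computed above):

import numpy as np

who = np.array([[0.40, 0.45], [0.50, 0.55]])
hidden_outputs = np.array([[0.5069], [0.5106]])

# X_output = W_ho . O_hidden
final_inputs = np.dot(who, hidden_outputs)
print(final_inputs)   # approximately [[0.4325]
                      #                [0.5343]]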

Activation function

Let’s plug everything into the activation function. The final outputs, o_{k}, are:

  • o_{1} = o(0.4325) ≈ 0.6065
  • o_{2} = o(0.5343) ≈ 0.6305

We can check that it works with the following commands:

# number of input, hidden and output nodes 
input_nodes = 2
hidden_nodes = 2
output_nodes = 2
# learning rate is 1
learning_rate = 1

# create instance of neural network
n = neuralNetwork(input_nodes,hidden_nodes,output_nodes, learning_rate)
n.query([0.05, 0.1]) 

You should see a matrix with values close to 0.6065 and 0.6305, as in our worked example.

That’s it. Now that we have computed the outputs, we can use the error to update the weights of our network.

Backward propagation

Backward propagation aims to minimise the error obtained at each layer by updating the weights. We are interested in how much the error changes when a weight changes. As in the forward propagation, we treat the connections between each pair of layers separately. We begin by updating the weights between the hidden layer and the output layer so as to minimise the total error of our model. Our tool is gradient descent: we take the derivative of the error function to find the sign of the slope. A positive slope means we need to move in the opposite direction, i.e. decrease the weight, to approach the minimum of the function. The error of the network is (target-actual)^{2}.

Output layer

The error function we want to minimise is the squared difference between the target and actual values, summed over the output nodes. Each weight w_{jk} only affects the output o_{k} it is linked to, so we can compute the derivative of the error with respect to w_{jk} term by term. For the details of the derivative, you can check this website.

\begin{equation*} \begin{split} \frac{\partial E}{\partial w_{jk}} & = \frac{\partial }{\partial w_{jk}}(t_{k}-o_{k})^{2} \\ &= \frac{\partial E}{\partial o_{k}} \cdot \frac{\partial o_{k}}{\partial w_{jk}} \\ &= -2\,(t_{k}-o_{k}) \cdot \frac{\partial o_{k}}{\partial w_{jk}} \\ &= -2\,(t_{k}-o_{k}) \cdot sigmoid\Big(\sum_{j}w_{jk}o_{j}\Big)\Big(1-sigmoid\Big(\sum_{j}w_{jk}o_{j}\Big)\Big)\,o_{j} \\ &= -(t_{k}-o_{k})\,o_{k}(1-o_{k})\,o_{j} \qquad \text{(dropping the constant factor 2, which only rescales the step)} \\ &= -e_{k}\,o_{k}(1-o_{k})\,o_{j} \end{split} \end{equation*}

The last line describes the slope of the error function, where:
  • e_{k} = t_{k}-o_{k} is the error at output node k
  • o_{k}(1-o_{k}) is the squashed signal, i.e. the derivative of the activation function
  • o_{j} is the output from the previous (hidden) layer

We can now evaluate this expression to see how much each weight should change to reduce the error.
The output errors, e_{k}, are:

  • e_{1} = 0.01 - 0.6065 ≈ -0.5965
  • e_{2} = 0.99 - 0.6305 ≈ 0.3595

Algebra:
\begin{equation*} \begin{split} \frac{\partial E}{\partial w_{11}} & = -(t_{1}-o_{1})\,o_{1}(1-o_{1})\,o_{j1} \\ &= -(-0.5965) \times 0.6065 \times (1-0.6065) \times 0.5069 \\ &\approx 0.0722 \end{split} \end{equation*}
Matrix formula (the update we add to the weights, i.e. the negative of the gradient scaled by the learning rate):
\[ \Delta W_{ho} = \alpha \, E_{k}\,o_{k}(1-o_{k}) \cdot o^{T}_{j} \]
Matrix:
\[ \begin{bmatrix} -0.1424 \\ 0.0838 \end{bmatrix} \cdot \begin{bmatrix} 0.5069 & 0.5106 \end{bmatrix} = \begin{bmatrix} -0.0722 & -0.0727 \\ 0.0425 & 0.0428 \end{bmatrix} \]

Update weight

The last step is to update the weights. This is done by taking the old weight and subtracting the gradient, usually multiplied by a learning rate that moderates the strength of the change and helps the algorithm settle into a minimum. For the purpose of this article, suppose the learning rate is equal to 1.

Algebra:
\[ w_{jk}^{+} = w_{jk} - \alpha \, \frac{\partial E}{\partial w_{jk}} \]
Matrix:
\[ \begin{bmatrix} 0.40-0.0722 & 0.45-0.0727 \\ 0.50+0.0425 & 0.55+0.0428 \end{bmatrix} = \begin{bmatrix} 0.328 & 0.377 \\ 0.542 & 0.593 \end{bmatrix} \]

The updated hidden-to-output weights are:

  • w_{11}^{+} ≈ 0.328
  • w_{21}^{+} ≈ 0.377
  • w_{12}^{+} ≈ 0.542
  • w_{22}^{+} ≈ 0.593
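We can reproduce the whole output-layer update with a few lines of numpy (a minimal sketch using the rounded values from the example; the variable names mirror those in the train method below):

import numpy as np

# values carried over from the forward pass (rounded)
hidden_outputs = np.array([[0.5069], [0.5106]])
final_outputs  = np.array([[0.6065], [0.6305]])
targets        = np.array([[0.01], [0.99]])
who            = np.array([[0.40, 0.45], [0.50, 0.55]])
learning_rate  = 1.0

output_errors = targets - final_outputs                         # e_k
delta = output_errors * final_outputs * (1.0 - final_outputs)   # e_k * o_k * (1 - o_k)
who += learning_rate * np.dot(delta, hidden_outputs.T)          # add the negative gradient
print(who)
# approximately [[0.328  0.377]
#                [0.542  0.593]]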

We now need to write our last function to train the network. Let's write the code in two steps: first we update the output layer's weights, then we update the hidden layer's weights. We will see how to update the latter shortly.

We need three matrices:
  • the output error: t_{k}-o_{k}
  • the final output: o_{k}
  • the hidden output: o_{j}

    def train(self, inputs_list, targets_list):
        # convert inputs list to 2d array
        inputs = np.array(inputs_list, ndmin=2).T
        targets = np.array(targets_list, ndmin=2).T
        # calculate signals into hidden layer
        hidden_inputs = np.dot(self.wih, inputs)
        # calculate the signals emerging from hidden layer
        hidden_outputs = np.asarray(self.activation_function(hidden_inputs))
        # calculate signals into final output layer
        final_inputs = np.dot(self.who, hidden_outputs)
        # calculate the signals emerging from final output layer
        final_outputs = np.asarray(self.activation_function(final_inputs))
        # output layer error is (target - actual)
        output_errors = np.asarray(targets - final_outputs)

        # update the weights for the links between the hidden and output layers
        self.who += self.lr * np.dot((output_errors * final_outputs * (1.0 - final_outputs)),
                                     np.transpose(hidden_outputs))
        pass

Hidden layer

Next, we tackle the rest of the backward pass by calculating the new weights of the hidden layer, i.e. the error slope between the input and hidden layers. This stage is slightly different from the previous one: the output of each hidden node contributes to the output (and therefore to the error) of several output nodes. The error at a hidden node therefore becomes e_{j}= \sum_{k}w_{jk}e_{k}, recombining the errors of all the outputs it feeds, in proportion to the connecting weights.

\begin{equation*} \begin{split} \frac{\partial E}{\partial w_{ij}} & = -e_{j}\, sigmoid\Big(\sum_{i}w_{ij}o_{i}\Big)\Big(1-sigmoid\Big(\sum_{i}w_{ij}o_{i}\Big)\Big)\,o_{i} \\ &= -e_{j}\,o_{j}(1-o_{j})\,o_{i} \end{split} \end{equation*}
with $e_{j}= \sum_{k}w_{jk}e_{k}$, where:
  • e_{j} is the recombined back-propagated error
  • o_{j}(1-o_{j}) is the squashed signal, i.e. the derivative of the activation function at the hidden layer
  • o_{i} is the input signal

We can now work out how the error changes with a change in the weight w_{11}.

Algebra:
\begin{equation*} \begin{split} e_{1} &= 0.40 \times (-0.5965) + 0.50 \times 0.3595 \approx -0.0588 \\ e_{2} &= 0.45 \times (-0.5965) + 0.55 \times 0.3595 \approx -0.0707 \\ \frac{\partial E}{\partial w_{11}} &= -(-0.0588) \times 0.5069 \times (1-0.5069) \times 0.05 \approx 0.00074 \end{split} \end{equation*}
Matrix formula:
\[ \Delta W_{ih} = \alpha \, E_{j}\,o_{j}(1-o_{j}) \cdot o^{T}_{i} \]
Matrix:
\[ \begin{bmatrix} -0.0147 \\ -0.0177 \end{bmatrix} \cdot \begin{bmatrix} 0.05 & 0.10 \end{bmatrix} = \begin{bmatrix} -0.00074 & -0.00147 \\ -0.00088 & -0.00177 \end{bmatrix} \]

Update weight

We now have our updated input-to-hidden weights:

  • w_{11}^{+} ≈ 0.149
  • w_{21}^{+} ≈ 0.199
  • w_{12}^{+} ≈ 0.249
  • w_{22}^{+} ≈ 0.298
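The corresponding numpy sketch for the hidden layer, again using the rounded values from the example (note that the hidden errors are computed with the original hidden-to-output weights, as in the train method below):

import numpy as np

inputs         = np.array([[0.05], [0.10]])
hidden_outputs = np.array([[0.5069], [0.5106]])
output_errors  = np.array([[-0.5965], [0.3595]])          # t_k - o_k
who            = np.array([[0.40, 0.45], [0.50, 0.55]])   # original hidden -> output weights
wih            = np.array([[0.15, 0.20], [0.25, 0.30]])
learning_rate  = 1.0

hidden_errors = np.dot(who.T, output_errors)                      # e_j = sum_k w_jk * e_k
delta = hidden_errors * hidden_outputs * (1.0 - hidden_outputs)   # e_j * o_j * (1 - o_j)
wih += learning_rate * np.dot(delta, inputs.T)
print(wih)
# approximately [[0.149  0.199]
#                [0.249  0.298]]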

The Python code is straightforward. We add the extra matrices required to compute the updated weights for the hidden layer.

    def train(self, inputs_list, targets_list):
        # convert inputs list to 2d array
        inputs = np.array(inputs_list, ndmin=2).T
        targets = np.array(targets_list, ndmin=2).T
        # calculate signals into hidden layer
        hidden_inputs = np.dot(self.wih, inputs)
        # calculate the signals emerging from hidden layer
        hidden_outputs = np.asarray(self.activation_function(hidden_inputs))
        # calculate signals into final output layer
        final_inputs = np.dot(self.who, hidden_outputs)
        # calculate the signals emerging from final output layer
        final_outputs = np.asarray(self.activation_function(final_inputs))
        # output layer error is (target - actual)
        output_errors = np.asarray(targets - final_outputs)

        # hidden layer error is the output_errors, split by weights, recombined at hidden nodes
        hidden_errors = np.asarray(np.dot(self.who.T, output_errors))
        # update the weights for the links between the hidden and output layers
        self.who += self.lr * np.dot((output_errors * final_outputs * (1.0 - final_outputs)),
                                     np.transpose(hidden_outputs))
        # update the weights for the links between the input and hidden layers
        self.wih += self.lr * np.dot((hidden_errors * hidden_outputs * (1.0 - hidden_outputs)),
                                     np.transpose(inputs))
        pass
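Assuming this train method has replaced the earlier placeholder in the neuralNetwork class, we can check that everything hangs together by training once on the worked example (a minimal sketch):

# create the network and train it once on our single example
n = neuralNetwork(2, 2, 2, 1)
n.train([0.05, 0.10], [0.01, 0.99])
print(n.wih)   # moves to roughly [[0.149, 0.199], [0.249, 0.298]]
print(n.who)   # moves to roughly [[0.328, 0.377], [0.542, 0.593]]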

Our model is ready. In the next article, we will see this code in action on the famous handwritten digits dataset.
