How to train a neural network in Python
You probably have questions like these: what is a neural network, what can it do, where is it used, and how do you train one in Python? Neural networks are the basis of deep learning, a subfield of machine learning that has made great progress. Among its practical successes are programs that play Go and poker at a high level, as well as accelerating drug discovery and powering self-driving cars.
If these kinds of advanced applications excite you, you are probably interested in deep learning. However, that requires being familiar with how neural networks work. Our goal is to teach neural networks in a deep, practical, and fast way, using simple language and concepts. This tutorial presents the concepts, code, and mathematics that will let you build and understand a simple neural network hands-on.
Some tutorials focus only on the code and skip the math, but that gets in the way of a proper understanding. This article works through the details exactly. To follow along, you should be familiar with mathematical concepts such as matrices and derivatives. The code in the article is written in Python, so a basic understanding of how Python works will be helpful.
What is an artificial neural network?
Artificial Neural Networks (ANNs) are software implementations of the neural structure of our brain. We don't need the complex biology of the brain here; it is enough to know that the brain contains nerve cells (neurons) that act somewhat like organic switches, sitting between a set of inputs and outputs. These neurons can change their output state depending on the strength of their electrical or chemical input.
The neural network in a person's brain is a highly interconnected network of neurons, where the output of any given neuron may be the input of thousands of other neurons. Learning occurs by repeatedly activating certain neural connections over others, and these repeated activations strengthen those connections, making it possible to produce the desired result for a given input. This learning requires feedback: when the desired outcome occurs, the neural connections that produced it are strengthened.
Artificial neural networks try to simplify and mimic this brain behaviour. They can be trained in two ways: supervised and unsupervised. In a supervised ANN, the network is trained by feeding it matched samples of input and output data, with the aim that the ANN learns to produce the desired output for a given input. An example is an email spam filter: the input training data could be counts of various words in the body of the email, and the output training data would be the classification of whether the email was actually spam or not.
If many samples of emails are passed through the neural network, the network can learn which inputs make an email spam or not. This learning happens by adjusting the weights of the ANN connections, which will be discussed further in the next section. Unsupervised learning is where the ANN tries to "understand" the structure of the input data "by itself". This type of ANN will not be covered in this article.
Structure of an artificial neural network (ANN)
In this section, we will look at the components that make up the structure of an artificial neural network.
Artificial neuron
The biological neuron is simulated in an ANN by an activation function. In classification tasks (e.g. spam email detection), this activation function should have a "switch on" characteristic: when the input exceeds a certain value, the output should change state, for example from 0 to 1, from -1 to 1, or from 0 to a value greater than zero. In effect, this simulates the switching on or activation of a biological neuron. A commonly used activation function is the sigmoid function:

f(x) = \frac{1}{1 + e^{-x}}
The code of this function is as follows:
import matplotlib.pyplot as plt
import numpy as np
x = np.arange(-8, 8, 0.1)
f = 1 / (1 + np.exp(-x))
plt.plot(x, f)
plt.xlabel('x')
plt.ylabel('f(x)')
plt.show()
The output is as follows:
[Figure: plot of the sigmoid function f(x), rising smoothly from 0 to 1 around x = 0]
As the plot shows, the function is "activated" when the input x exceeds a certain value: the output moves from 0 to 1. The sigmoid is not a step function, however; the edge is "smooth" and the output does not change suddenly. This means the function has a derivative, which is important for gradient-based training algorithms.
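As a minimal sketch (reusing the x and f arrays from the code above), the derivative has the convenient closed form f'(x) = f(x)(1 - f(x)) and can be plotted directly:

# A minimal sketch: the sigmoid derivative f'(x) = f(x) * (1 - f(x)),
# computed element-wise from the f array produced above
df = f * (1 - f)
plt.plot(x, df)
plt.xlabel('x')
plt.ylabel("f'(x)")
plt.show()

The resulting curve is a smooth bump peaking at x = 0: the derivative is defined everywhere, which is exactly what gradient-based training needs.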
Nodes
As mentioned before, biological neurons are connected in hierarchical networks, with the outputs of some neurons being the inputs of others. We can represent these networks as connected layers of nodes. Each node takes multiple weighted inputs, applies an activation function to the sum of those inputs, and produces an output. Let's break this down. Consider the following diagram:
[Figure: a single node, drawn as a circle, receiving the weighted inputs x_1 w_1, x_2 w_2, x_3 w_3 plus a bias b and producing the output h]
In the image above, the circle represents the node. The node takes the weighted inputs, sums them, and passes the result through the activation function. The output of the activation function is denoted h in the diagram.
Note: in some articles, the node shown above is also called a perceptron.
What is the "weight" of a node? The weights are real numbers (i.e., not binary 1s or 0s) that are multiplied by the inputs (each input by its own weight), and the products are then summed at the node's input. In other words, the weighted input to the node above is:

x_1 w_1 + x_2 w_2 + x_3 w_3 + b
Here the w_i are the weights (ignore b for now). What are these weights for? They are the variables that change during the learning process and, together with the input, determine the output of the node. b is the weight of a bias element whose input is fixed at +1. Including this bias increases the flexibility of the node, which is best shown with an example.
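As a quick sketch (the input and weight values here are made up purely for illustration), the weighted input of a three-input node can be computed like this:

# Illustrative sketch with made-up values: the weighted input to a node
x_in = [1.0, 2.0, 3.0]    # example inputs x1, x2, x3
w_in = [0.5, -0.2, 0.1]   # example weights w1, w2, w3
b = 1.0                   # bias weight (multiplies a constant +1 input)
z = sum(xi * wi for xi, wi in zip(x_in, w_in)) + b
print(z)  # 0.5 - 0.4 + 0.3 + 1.0 = 1.4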
Bias
Let’s consider a very simple node, a node with only one input and output:
[Figure: a single node with one input x_1, weight w_1, and output h]
The input to the node's activation function in this case is x_1 w_1. What does changing w_1 do in this simple network? Consider the following code:
w1 = 0.5
w2 = 1.0
w3 = 2.0
l1 = 'w = 0.5'
l2 = 'w = 1.0'
l3 = 'w = 2.0'
for w, l in [(w1, l1), (w2, l2), (w3, l3)]:
    f = 1 / (1 + np.exp(-x*w))
    plt.plot(x, f, label=l)
plt.xlabel('x')
plt.ylabel('h_w(x)')
plt.legend(loc=2)
plt.show()
The output of the code is as follows:
[Figure: sigmoid curves for w = 0.5, 1.0, and 2.0; larger weights give a steeper slope]
Here we can see that changing the weight changes the slope of the sigmoid output, which is useful when we want to model different strengths of relationship between the input and output variables. However, what if we only want the output to change when x is greater than 1? This is where the bias comes in. Let's consider the same network with a bias input:
[Figure: a single node with input x_1, weight w_1, and a bias element (+1) with weight b]
See the following code, where a bias term is included:
w = 5.0
b1 = -8.0
b2 = 0.0
b3 = 8.0
l1 = 'b = -8.0'
l2 = 'b = 0.0'
l3 = 'b = 8.0'
for b, l in [(b1, l1), (b2, l2), (b3, l3)]:
    f = 1 / (1 + np.exp(-(x*w+b)))
    plt.plot(x, f, label=l)
plt.xlabel('x')
plt.ylabel('h_wb(x)')
plt.legend(loc=2)
plt.show()
The output of the code is as follows, which shows the effect of setting the bias values:
[Figure: sigmoid curves for b = -8.0, 0.0, and 8.0 with w = 5.0; the bias shifts the activation point along the x axis]
In this case, the weight w has been increased to 5.0 to simulate a more defined "switch on" behaviour. As you can see, by changing the bias "weight" b, you can change the point at which the node activates. Adding a bias therefore lets the node emulate a conditional, such as: if (x > z) then 1 else 0.
Without a bias, you cannot vary the threshold z in that if statement; it always stays around zero. The bias is therefore very useful when you need to simulate conditional relationships.
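To make this concrete: the sigmoid output crosses 0.5 exactly where x*w + b = 0, that is, at the threshold z = -b/w. With w = 5.0 and b = -8.0 from the code above, a quick check shows the node "switches on" around x = 1.6:

# The activation threshold of the sigmoid node sits at z = -b/w
w, b = 5.0, -8.0
print(-b / w)                            # 1.6
print(1 / (1 + np.exp(-(1.6 * w + b))))  # 0.5, the midpoint of the switch

This matches the b = -8.0 curve in the plot above, which rises around x = 1.6.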
Putting all network structures together
Hopefully, the explanation so far has given you a good overview of how a single node/neuron/perceptron in a neural network works. However, as you probably know, a neural network consists of many such interconnected nodes. These structures come in thousands of different forms, but the most common simple structure consists of an input layer, a hidden layer, and an output layer. An example of such a structure can be seen in the figure below:
[Figure: a three-layer network with three input nodes (layer 1), three hidden nodes (layer 2), and one output node (layer 3); a +1 bias node feeds each node of the following layer]
The three layers of the network are visible in the figure above. Layer 1 is the input layer, where external input data enters the network. Layer 2 is called the hidden layer because it is part of neither the input nor the output.
Note: neural networks can have many hidden layers, but for simplicity we consider only one here. Finally, layer 3 is the output layer. There are many connections between the layers, especially between layer 1 (L1) and layer 2 (L2). As can be seen, each node in L1 is connected to every node in L2; likewise, the nodes in L2 are all connected to the single node in L3. Each of these connections has an associated weight.
Notation
The mathematics that follows requires fairly precise notation, so we always know what we are talking about. The notation used here is similar to that of the Stanford Deep Learning tutorial. Each weight is identified as follows:
w_{ij}^{(l)}, where i refers to the node number of the connection in layer l+1 and j refers to the node number of the connection in layer l.
Pay special attention to this: the connection between node 1 in layer 1 and node 2 in layer 2 is written w_{21}^{(1)}. This ordering may seem strange, since the indices run destination-then-source (the layer l+1 node first, then the layer l node), rather than the input-to-output order you might expect. However, once the bias is added, the notation becomes more natural. As you can see in the figure above, the bias (+1) is connected to each of the nodes in the following layer; so the bias in layer 1 is connected to all the nodes in layer 2. Because the bias is not a real node with an activation function, it has no inputs (it always outputs the value +1).
The symbol for a bias weight is b_i^{(l)}, where i is the node number in layer l+1, the same role i plays in the weight notation w_{21}^{(1)}. So the weight on the connection between the bias in layer 1 and the second node in layer 2 is written b_2^{(1)}. Remember, all the values w_{ij}^{(l)} and b_i^{(l)} have to be calculated during the ANN training stage.
Finally, the output of a node is written h_j^{(l)}, where j is the node number in layer l of the network. So in the three-layer network above, the output of node 2 in layer 2 is denoted h_2^{(2)}. Now that the notation is sorted out, it is time to look at how the output of the network is calculated when the input and the weights are known. Computing the network's output from these values is called the feed-forward method.
The feed-forward method
To show how the output is computed from the input in a neural network, let's start with the specific case of the three-layer network presented above. Here it is expressed as a set of equations; afterwards we will work through an example with Python code.

h_1^{(2)} = f(w_{11}^{(1)} x_1 + w_{12}^{(1)} x_2 + w_{13}^{(1)} x_3 + b_1^{(1)})
h_2^{(2)} = f(w_{21}^{(1)} x_1 + w_{22}^{(1)} x_2 + w_{23}^{(1)} x_3 + b_2^{(1)})
h_3^{(2)} = f(w_{31}^{(1)} x_1 + w_{32}^{(1)} x_2 + w_{33}^{(1)} x_3 + b_3^{(1)})
h_{W,b}(x) = h_1^{(3)} = f(w_{11}^{(2)} h_1^{(2)} + w_{12}^{(2)} h_2^{(2)} + w_{13}^{(2)} h_3^{(2)} + b_1^{(2)})
In the equations above, f refers to the node activation function, in this case the sigmoid function. The first line gives h_1^{(2)}, the output of node 1 in layer 2, whose inputs are:

w_{11}^{(1)} x_1, \quad w_{12}^{(1)} x_2, \quad w_{13}^{(1)} x_3, \quad b_1^{(1)}
These inputs can be traced in the three-layer connection diagram above. They are simply summed and then passed through the activation function to give the output of the first node. The same is done for the other two nodes in the second layer. The final line computes the output of the only node in the third and final layer, which is the final output of the neural network.
As can be seen, rather than taking the weighted input variables (x_1, x_2, x_3), the final node takes the weighted outputs of the second-layer nodes (h_1^{(2)}, h_2^{(2)}, h_3^{(2)}), plus a bias weight, as its input. The equation form thus shows the hierarchical nature of artificial neural networks.
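Substituting the layer 2 outputs into the final-node equation makes this hierarchy explicit (this is just the equations above composed, nothing new):

h_{W,b}(x) = f\left( \sum_{i=1}^{3} w_{1i}^{(2)} \, f\left( \sum_{j=1}^{3} w_{ij}^{(1)} x_j + b_i^{(1)} \right) + b_1^{(2)} \right)

Each additional layer simply wraps another application of f around the previous one.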
Example for feed-forward
Now let's work through a simple example of computing this network's output in Python. First, notice that the weights between layers 1 and 2 (w_{11}^{(1)}, w_{12}^{(1)}, ...) fit naturally into a matrix representation. Consider the following matrix:

W^{(1)} = \begin{pmatrix} 0.2 & 0.2 & 0.2 \\ 0.4 & 0.4 & 0.4 \\ 0.6 & 0.6 & 0.6 \end{pmatrix}
This matrix can be easily represented using numpy arrays:
import numpy as np
w1 = np.array([[0.2, 0.2, 0.2], [0.4, 0.4, 0.4], [0.6, 0.6, 0.6]])

Here we have filled the layer 1 weight array with some sample weights. Note how the array lines up with the earlier notation: row i-1, column j-1 of w1 holds w_{ij}^{(1)}, so for example w1[1, 0] is w_{21}^{(1)} = 0.4. We can do the same for the layer 2 weight array:
w2 = np.zeros((1, 3))
w2[0,:] = np.array([0.5, 0.5, 0.5])
We can also set some dummy values for the layer 1 bias weight array/vector and the layer 2 bias weight (which is just a single value in this network structure, i.e. a scalar):
b1 = np.array([0.8, 0.8, 0.8])
b2 = np.array([0.2])
Finally, before writing the main program to compute the network's output, it's a good idea to define a separate Python function for the activation function:
def f(x):
    return 1 / (1 + np.exp(-x))
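Incidentally, because np.exp operates element-wise, this same f also works directly on NumPy arrays, which the vectorized implementation later in the article relies on. A quick sanity check:

# f applies element-wise to arrays, since np.exp is vectorized
print(f(np.array([-1.0, 0.0, 1.0])))  # [0.26894142 0.5        0.73105858]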
Writing the feed-forward function
Below is a simple way to calculate the output of a neural network using nested loops in Python. We will soon explore more efficient methods of calculating the output.
def simple_looped_nn_calc(n_layers, x, w, b):
    for l in range(n_layers-1):
        #Setup the input array which the weights will be multiplied by for each layer
        #If it's the first layer, the input array will be the x input vector
        #If it's not the first layer, the input to the next layer will be the
        #output of the previous layer
        if l == 0:
            node_in = x
        else:
            node_in = h
        #Setup the output array for the nodes in layer l + 1
        h = np.zeros((w[l].shape[0],))
        #loop through the rows of the weight array
        for i in range(w[l].shape[0]):
            #setup the sum inside the activation function
            f_sum = 0
            #loop through the columns of the weight array
            for j in range(w[l].shape[1]):
                f_sum += w[l][i][j] * node_in[j]
            #add the bias
            f_sum += b[l][i]
            #finally use the activation function to calculate the
            #i-th output i.e. h1, h2, h3
            h[i] = f(f_sum)
    return h
This function takes as input the number of layers in the network, the input array/vector x, and a tuple or list of the network's weight and bias arrays, where each element of the list corresponds to a layer l of the network. In other words, the inputs are set up as follows:
w = [w1, w2]
b = [b1, b2]
#a dummy x input vector
x = [1.5, 2.0, 3.0]
The function first sets up the input to the current layer of nodes: for the first layer, the input to the layer 2 nodes is the input vector x multiplied by the corresponding weights; after the first layer, the input to each layer is the output of the previous one. It then loops over the i and j indices of the weight and bias arrays, using the dimensions of each layer's weight array to determine the number of nodes and hence the structure of the network. The function is called as follows:
simple_looped_nn_calc(3, x, w, b)
This function produces an output of 0.8354. We can confirm this result by performing the calculation manually with the original equations:

h_1^{(2)} = f(0.2 \times 1.5 + 0.2 \times 2.0 + 0.2 \times 3.0 + 0.8) = f(2.1) = 0.8909
h_2^{(2)} = f(0.4 \times 1.5 + 0.4 \times 2.0 + 0.4 \times 3.0 + 0.8) = f(3.4) = 0.9677
h_3^{(2)} = f(0.6 \times 1.5 + 0.6 \times 2.0 + 0.6 \times 3.0 + 0.8) = f(4.7) = 0.9910
h_{W,b}(x) = f(0.5 \times 0.8909 + 0.5 \times 0.9677 + 0.5 \times 0.9910 + 0.2) = f(1.6248) = 0.8354
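The same numbers fall out of a quick NumPy check using the w1, w2, b1, b2, x, and f objects defined above:

# Verify the manual calculation with the arrays defined earlier
h2 = f(w1.dot(np.array(x)) + b1)  # layer 2 outputs: [0.8909 0.9677 0.9910]
out = f(w2.dot(h2) + b2)          # final output: [0.8354]
print(h2, out)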
A more efficient implementation of the feed-forward neural network
As previously stated, using loops is not the most efficient way to compute the feed-forward pass in Python, because Python loops are slow. Next, we will look at an alternative, more efficient way to perform the calculation using Python and NumPy. We can measure the performance of the algorithm with the %timeit magic function (available in IPython/Jupyter), which runs the code several times and reports the average execution time:
%timeit simple_looped_nn_calc(3, x, w, b)
Running this code tells us that the feed-forward calculation takes about 40 microseconds. A result in the tens of microseconds sounds very fast, but when this method is applied to very large networks with several hundred nodes per layer, and especially while training the network, this speed becomes unacceptable. If we try a four-layer network with the same code, the performance gets considerably worse (about 70 µs).
Vectorization in neural networks
In the feed-forward pass, the equations can be written more compactly and the calculations implemented more efficiently. First, we define a new variable z_i^{(l)}, the sum of the inputs to node i of layer l, including the bias term. For the first node in layer 2:

z_1^{(2)} = \sum_{j=1}^{n} w_{1j}^{(1)} x_j + b_1^{(1)}
where n is the number of nodes in layer 1. Using this notation, the unwieldy set of equations from before reduces, for a three-layer network, to:

z^{(2)} = W^{(1)} x + b^{(1)}
h^{(2)} = f(z^{(2)})
z^{(3)} = W^{(2)} h^{(2)} + b^{(2)}
h_{W,b}(x) = h^{(3)} = f(z^{(3)})
Note the use of the capital letter W for the weight matrix; all the elements in the equations above are now matrices or vectors. If you are not familiar with these concepts, they are explained briefly in the next section.
Can the equations be simplified further? Yes. By generalizing, we can extend the computation through any number of layers in the network:

z^{(l+1)} = W^{(l)} h^{(l)} + b^{(l)}
h^{(l+1)} = f(z^{(l+1)})
This is the general feed-forward process, in which the output of layer l becomes the input to layer l+1. Note that h^{(1)} is simply the input layer x, and h^{(n_l)} is the output of the output layer.
Note that in the above equations, we have removed the reference to node numbers i and j. But how can we do this? Don’t we have to loop through the code and calculate all the different inputs and outputs of the node? The answer is that we can use matrix multiplication to do this more easily. This process is called “vectorization” and has two advantages:
- Its first advantage is that it reduces the complexity of the code.
- A second advantage is that we can use fast linear algebra programs in Python (and other languages) instead of using loops, which speeds up our programs.
Numpy can easily handle these calculations. For those who are not familiar with matrix operations, the following section provides a brief explanation of matrix operations.
Matrix multiplication
Let’s expand the following equation into matrix/vector form for the input layer:

z^{(2)} = W^{(1)} h^{(1)} + b^{(1)}
Note that in this equation h^{(1)} = x. Expanded for the input layer, it reads:

z^{(2)} = \begin{pmatrix} w_{11}^{(1)} & w_{12}^{(1)} & w_{13}^{(1)} \\ w_{21}^{(1)} & w_{22}^{(1)} & w_{23}^{(1)} \\ w_{31}^{(1)} & w_{32}^{(1)} & w_{33}^{(1)} \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \\ x_3 \end{pmatrix} + \begin{pmatrix} b_1^{(1)} \\ b_2^{(1)} \\ b_3^{(1)} \end{pmatrix} = \begin{pmatrix} w_{11}^{(1)} x_1 + w_{12}^{(1)} x_2 + w_{13}^{(1)} x_3 + b_1^{(1)} \\ w_{21}^{(1)} x_1 + w_{22}^{(1)} x_2 + w_{23}^{(1)} x_3 + b_2^{(1)} \\ w_{31}^{(1)} x_1 + w_{32}^{(1)} x_2 + w_{33}^{(1)} x_3 + b_3^{(1)} \end{pmatrix}
If you don't know how matrix multiplication works, it is worth reading up on matrix operations; many websites cover the topic well. Briefly, though, it goes like this:
When the weight matrix is multiplied by the input vector, each element in a row of the weight matrix is multiplied by the corresponding element in the single column of the input vector (a vector has only one column), and the products are summed to produce a new 3×1 vector. The bias weight vector is then added to give the final result.
If you look carefully, each row of the result above corresponds to the argument of the activation function in the original, non-matrix equations. Provided the activation function can be applied element-wise (that is, to each row of the vector individually), we can do all of our calculations with matrices and vectors instead of core Python loops.
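As a small sanity check, using the w1, b1, and x values from the earlier example, the single matrix-vector product reproduces the explicit row-by-row sums:

# The vectorized product matches the explicit per-node loop
x_vec = np.array(x)
z_loop = np.array([sum(w1[i, j] * x_vec[j] for j in range(3)) + b1[i]
                   for i in range(3)])
z_vec = w1.dot(x_vec) + b1
print(z_loop, z_vec)               # both give [2.1 3.4 4.7]
print(np.allclose(z_loop, z_vec))  # True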
Let’s take a look at a simpler (and faster) version of simple_looped_nn_calc:
def matrix_feed_forward_calc(n_layers, x, w, b):
    for l in range(n_layers-1):
        if l == 0:
            node_in = x
        else:
            node_in = h
        z = w[l].dot(node_in) + b[l]
        h = f(z)
    return h
Pay attention to the seventh line: to multiply the weight matrix by the node input vector in numpy, we use a.dot(b) rather than the * operator (which would multiply element-wise).
If we re-run %timeit with this new function on the simple four-layer network, the improvement is only 24 µs (from 70 µs down to 46 µs). However, if we grow the four-layer network to layers of 100-100-50-10 nodes, the results are much more impressive.
The Python loop-based method then takes 41 milliseconds (note the unit: milliseconds), while the vectorized implementation feeds forward through the network in only 84 µs. By using vectorized calculations instead of Python loops we have made the computation roughly 500 times faster, which is excellent progress. Matrix operations can be made faster still with deep learning packages such as TensorFlow and Theano, which can use your computer's GPU rather than the CPU, an architecture better suited to fast matrix computations.
Conclusion
In this article, we introduced neural networks and the feed-forward method for computing their output. Neural networks are one of the important algorithm families in machine learning, and familiarity with them is a prerequisite for learning newer techniques such as deep learning. Alongside the mathematical formulas, we also walked through the corresponding Python implementations to give the reader a better understanding. We hope this article was useful; we would be happy to hear your comments and experiences.