Artificial neural networks

6

In which we consider how our brains work and how to build and train artificial neural networks.

6.1 Introduction, or how the brain works

'The computer hasn't proved anything yet,' an angry Garry Kasparov, the world chess champion, said after his defeat in New York in May 1997. 'If we were playing a real competitive match, I would tear down Deep Blue into pieces.' But Kasparov's efforts to downplay the significance of his defeat in the six-game match were futile. The fact that Kasparov – probably the greatest chess player the world has seen – was beaten by a computer marked a turning point in the quest for intelligent machines. The IBM supercomputer called Deep Blue was capable of analysing 200 million positions a second, and it appeared to be displaying intelligent thoughts. At one stage Kasparov even accused the machine of cheating! 'There were many, many discoveries in this match, and one of them was that sometimes the computer plays very, very human moves. It deeply understands positional factors. And that is an outstanding scientific achievement.'

Traditionally, it has been assumed that to beat an expert in a chess game, a computer would have to formulate a strategy that goes beyond simply doing a great number of 'look-ahead' moves per second. Chess-playing programs must be able to improve their performance with experience or, in other words, a machine must be capable of learning.

What is machine learning?

In general, machine learning involves adaptive mechanisms that enable computers to learn from experience, learn by example and learn by analogy. Learning capabilities can improve the performance of an intelligent system over time. Machine learning mechanisms form the basis for adaptive systems. The most popular approaches to machine learning are artificial neural networks and genetic algorithms. This chapter is dedicated to neural networks.

What is a neural network?
A neural network can be defined as a model of reasoning based on the human brain. The brain consists of a densely interconnected set of nerve cells, or basic information-processing units, called neurons. The human brain incorporates nearly 10 billion neurons and 60 trillion connections, synapses, between them (Shepherd and Koch, 1990). By using multiple neurons simultaneously, the brain can perform its functions much faster than the fastest computers in existence today.

Although each neuron has a very simple structure, an army of such elements constitutes a tremendous processing power. A neuron consists of a cell body, soma, a number of fibres called dendrites, and a single long fibre called the axon. While dendrites branch into a network around the soma, the axon stretches out to the dendrites and somas of other neurons. Figure 6.1 is a schematic drawing of a neural network.

Signals are propagated from one neuron to another by complex electrochemical reactions. Chemical substances released from the synapses cause a change in the electrical potential of the cell body. When the potential reaches its threshold, an electrical pulse, action potential, is sent down through the axon. The pulse spreads out and eventually reaches synapses, causing them to increase or decrease their potential. However, the most interesting finding is that a neural network exhibits plasticity. In response to the stimulation pattern, neurons demonstrate long-term changes in the strength of their connections. Neurons also can form new connections with other neurons. Even entire collections of neurons may sometimes migrate from one place to another. These mechanisms form the basis for learning in the brain.

Our brain can be considered as a highly complex, nonlinear and parallel information-processing system. Information is stored and processed in a neural network simultaneously throughout the whole network, rather than at specific locations.
In other words, in neural networks, both data and its processing are global rather than local. Owing to the plasticity, connections between neurons leading to the 'right answer' are strengthened while those leading to the 'wrong answer' weaken. As a result, neural networks have the ability to learn through experience. Learning is a fundamental and essential characteristic of biological neural networks. The ease and naturalness with which they can learn led to attempts to emulate a biological neural network in a computer.

Figure 6.1 Biological neural network

Although a present-day artificial neural network (ANN) resembles the human brain much as a paper plane resembles a supersonic jet, it is a big step forward. ANNs are capable of 'learning'; that is, they use experience to improve their performance. When exposed to a sufficient number of samples, ANNs can generalise to others they have not yet encountered. They can recognise handwritten characters, identify words in human speech, and detect explosives at airports. Moreover, ANNs can observe patterns that human experts fail to recognise. For example, Chase Manhattan Bank used a neural network to examine an array of information about the use of stolen credit cards – and discovered that the most suspicious sales were for women's shoes costing between $40 and $80.

How do artificial neural nets model the brain?

An artificial neural network consists of a number of very simple and highly interconnected processors, also called neurons, which are analogous to the biological neurons in the brain. The neurons are connected by weighted links passing signals from one neuron to another. Each neuron receives a number of input signals through its connections; however, it never produces more than a single output signal. The output signal is transmitted through the neuron's outgoing connection (corresponding to the biological axon).
The outgoing connection, in turn, splits into a number of branches that transmit the same signal (the signal is not divided among these branches in any way). The outgoing branches terminate at the incoming connections of other neurons in the network. Figure 6.2 represents connections of a typical ANN, and Table 6.1 shows the analogy between biological and artificial neural networks (Medsker and Liebowitz, 1994).

How does an artificial neural network 'learn'?

The neurons are connected by links, and each link has a numerical weight associated with it. Weights are the basic means of long-term memory in ANNs. They express the strength, or in other words importance, of each neuron input. A neural network 'learns' through repeated adjustments of these weights.

Figure 6.2 Architecture of a typical artificial neural network

Table 6.1 Analogy between biological and artificial neural networks

  Biological neural network    Artificial neural network
  Soma                         Neuron
  Dendrite                     Input
  Axon                         Output
  Synapse                      Weight

But does the neural network know how to adjust the weights?

As shown in Figure 6.2, a typical ANN is made up of a hierarchy of layers, and the neurons in the network are arranged along these layers. The neurons connected to the external environment form input and output layers. The weights are modified to bring the network input/output behaviour into line with that of the environment. Each neuron is an elementary information-processing unit. It has a means of computing its activation level given the inputs and numerical weights.

To build an artificial neural network, we must first decide how many neurons are to be used and how the neurons are to be connected to form a network. In other words, we must first choose the network architecture. Then we decide which learning algorithm to use. And finally we train the neural network; that is, we initialise the weights of the network and update the weights from a set of training examples.
Let us begin with a neuron, the basic building element of an ANN.

6.2 The neuron as a simple computing element

A neuron receives several signals from its input links, computes a new activation level and sends it as an output signal through the output links. The input signal can be raw data or outputs of other neurons. The output signal can be either a final solution to the problem or an input to other neurons. Figure 6.3 shows a typical neuron.

Figure 6.3 Diagram of a neuron

How does the neuron determine its output?

In 1943, Warren McCulloch and Walter Pitts proposed a very simple idea that is still the basis for most artificial neural networks. The neuron computes the weighted sum of the input signals and compares the result with a threshold value, \theta. If the net input is less than the threshold, the neuron output is -1. But if the net input is greater than or equal to the threshold, the neuron becomes activated and its output attains a value +1 (McCulloch and Pitts, 1943). In other words, the neuron uses the following transfer or activation function:

    X = \sum_{i=1}^{n} x_i w_i                                              (6.1)

    Y = +1 if X \geq \theta;    Y = -1 if X < \theta

where X is the net weighted input to the neuron, x_i is the value of input i, w_i is the weight of input i, n is the number of neuron inputs, and Y is the output of the neuron. This type of activation function is called a sign function. Thus the actual output of the neuron with a sign activation function can be represented as

    Y = sign \left[ \sum_{i=1}^{n} x_i w_i - \theta \right]                  (6.2)

Is the sign function the only activation function used by neurons?

Many activation functions have been tested, but only a few have found practical applications. Four common choices – the step, sign, linear and sigmoid functions – are illustrated in Figure 6.4. The step and sign activation functions, also called hard limit functions, are often used in decision-making neurons for classification and pattern recognition tasks.
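These activation functions, and the McCulloch and Pitts weighted-sum rule of Eqs (6.1) and (6.2), are simple enough to sketch directly in code. The sketch below is illustrative only: the function names, and the two-input weights and threshold in the demonstration, are my own choices, not values from the text.

```python
import math

def step(x):     # hard limiter: 1 if x >= 0, else 0
    return 1 if x >= 0 else 0

def sign(x):     # hard limiter of Eq (6.2): +1 if x >= 0, else -1
    return 1 if x >= 0 else -1

def linear(x):   # output equals the net weighted input
    return x

def sigmoid(x):  # squashes any real input into the range (0, 1)
    return 1.0 / (1.0 + math.exp(-x))

def mcp_neuron(inputs, weights, theta):
    """McCulloch-Pitts neuron: sign of the net weighted input minus threshold."""
    net = sum(x * w for x, w in zip(inputs, weights))   # Eq (6.1)
    return sign(net - theta)                            # Eq (6.2)

# Illustrative two-input neuron that activates only when both inputs are on
print(mcp_neuron([1, 1], [0.5, 0.5], 0.7))   # prints 1
print(mcp_neuron([1, 0], [0.5, 0.5], 0.7))   # prints -1
```

With the threshold at 0.7 and each weight at 0.5, only the input pattern (1, 1) pushes the net input over the threshold, so the neuron behaves as a crude AND detector with outputs in {-1, +1}.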
Figure 6.4 Activation functions of a neuron

The sigmoid function transforms the input, which can have any value between plus and minus infinity, into a reasonable value in the range between 0 and 1. Neurons with this function are used in the back-propagation networks. The linear activation function provides an output equal to the neuron weighted input. Neurons with the linear function are often used for linear approximation.

Can a single neuron learn a task?

In 1958, Frank Rosenblatt introduced a training algorithm that provided the first procedure for training a simple ANN: a perceptron (Rosenblatt, 1958). The perceptron is the simplest form of a neural network. It consists of a single neuron with adjustable synaptic weights and a hard limiter. A single-layer two-input perceptron is shown in Figure 6.5.

Figure 6.5 Single-layer two-input perceptron

6.3 The perceptron

The operation of Rosenblatt's perceptron is based on the McCulloch and Pitts neuron model. The model consists of a linear combiner followed by a hard limiter. The weighted sum of the inputs is applied to the hard limiter, which produces an output equal to +1 if its input is positive and -1 if it is negative. The aim of the perceptron is to classify inputs, or in other words externally applied stimuli x_1, x_2, ..., x_n, into one of two classes, say A_1 and A_2. Thus, in the case of an elementary perceptron, the n-dimensional space is divided by a hyperplane into two decision regions. The hyperplane is defined by the linearly separable function

    \sum_{i=1}^{n} x_i w_i - \theta = 0                                      (6.3)

For the case of two inputs, x_1 and x_2, the decision boundary takes the form of a straight line shown in bold in Figure 6.6(a). Point 1, which lies above the boundary line, belongs to class A_1; and point 2, which lies below the line, belongs to class A_2. The threshold \theta can be used to shift the decision boundary.
Figure 6.6 Linear separability in the perceptrons: (a) two-input perceptron; (b) three-input perceptron

With three inputs the hyperplane can still be visualised. Figure 6.6(b) shows three dimensions for the three-input perceptron. The separating plane here is defined by the equation

    x_1 w_1 + x_2 w_2 + x_3 w_3 - \theta = 0

But how does the perceptron learn its classification tasks?

This is done by making small adjustments in the weights to reduce the difference between the actual and desired outputs of the perceptron. The initial weights are randomly assigned, usually in the range [-0.5, 0.5], and then updated to obtain the output consistent with the training examples. For a perceptron, the process of weight updating is particularly simple. If at iteration p, the actual output is Y(p) and the desired output is Y_d(p), then the error is given by

    e(p) = Y_d(p) - Y(p),    where p = 1, 2, 3, ...                          (6.4)

Iteration p here refers to the pth training example presented to the perceptron. If the error, e(p), is positive, we need to increase perceptron output Y(p), but if it is negative, we need to decrease Y(p). Taking into account that each perceptron input contributes x_i(p) w_i(p) to the total input X(p), we find that if input value x_i(p) is positive, an increase in its weight w_i(p) tends to increase perceptron output Y(p), whereas if x_i(p) is negative, an increase in w_i(p) tends to decrease Y(p). Thus, the following perceptron learning rule can be established:

    w_i(p + 1) = w_i(p) + \alpha \cdot x_i(p) \cdot e(p),                    (6.5)

where \alpha is the learning rate, a positive constant less than unity. The perceptron learning rule was first proposed by Rosenblatt in 1960 (Rosenblatt, 1960). Using this rule we can derive the perceptron training algorithm for classification tasks.

Step 1: Initialisation
Set initial weights w_1, w_2, ..., w_n and threshold \theta to random numbers in the range [-0.5, 0.5].
Step 2: Activation
Activate the perceptron by applying inputs x_1(p), x_2(p), ..., x_n(p) and desired output Y_d(p). Calculate the actual output at iteration p = 1:

    Y(p) = step \left[ \sum_{i=1}^{n} x_i(p) w_i(p) - \theta \right],        (6.6)

where n is the number of the perceptron inputs, and step is a step activation function.

Step 3: Weight training
Update the weights of the perceptron

    w_i(p + 1) = w_i(p) + \Delta w_i(p),                                     (6.7)

where \Delta w_i(p) is the weight correction at iteration p. The weight correction is computed by the delta rule:

    \Delta w_i(p) = \alpha \cdot x_i(p) \cdot e(p)                           (6.8)

Step 4: Iteration
Increase iteration p by one, go back to Step 2 and repeat the process until convergence.

Can we train a perceptron to perform basic logical operations such as AND, OR or Exclusive-OR?

The truth tables for the operations AND, OR and Exclusive-OR are shown in Table 6.2. The table presents all possible combinations of values for two variables, x_1 and x_2, and the results of the operations. The perceptron must be trained to classify the input patterns.

Let us first consider the operation AND. After completing the initialisation step, the perceptron is activated by the sequence of four input patterns representing an epoch. The perceptron weights are updated after each activation. This process is repeated until all the weights converge to a uniform set of values. The results are shown in Table 6.3.
Table 6.2 Truth tables for the basic logical operations

  Input variables      AND          OR           Exclusive-OR
  x1    x2             x1 AND x2    x1 OR x2     x1 XOR x2
  0     0              0            0            0
  0     1              0            1            1
  1     0              0            1            1
  1     1              1            1            0

Table 6.3 Example of perceptron learning: the logical operation AND

          Inputs   Desired    Initial        Actual            Final
  Epoch   x1  x2   output Yd  weights w1 w2  output Y  Error e  weights w1 w2
  1       0   0    0          0.3   -0.1     0          0       0.3   -0.1
          0   1    0          0.3   -0.1     0          0       0.3   -0.1
          1   0    0          0.3   -0.1     1         -1       0.2   -0.1
          1   1    1          0.2   -0.1     0          1       0.3    0.0
  2       0   0    0          0.3    0.0     0          0       0.3    0.0
          0   1    0          0.3    0.0     0          0       0.3    0.0
          1   0    0          0.3    0.0     1         -1       0.2    0.0
          1   1    1          0.2    0.0     1          0       0.2    0.0
  3       0   0    0          0.2    0.0     0          0       0.2    0.0
          0   1    0          0.2    0.0     0          0       0.2    0.0
          1   0    0          0.2    0.0     1         -1       0.1    0.0
          1   1    1          0.1    0.0     0          1       0.2    0.1
  4       0   0    0          0.2    0.1     0          0       0.2    0.1
          0   1    0          0.2    0.1     0          0       0.2    0.1
          1   0    0          0.2    0.1     1         -1       0.1    0.1
          1   1    1          0.1    0.1     1          0       0.1    0.1
  5       0   0    0          0.1    0.1     0          0       0.1    0.1
          0   1    0          0.1    0.1     0          0       0.1    0.1
          1   0    0          0.1    0.1     0          0       0.1    0.1
          1   1    1          0.1    0.1     1          0       0.1    0.1

  Threshold: \theta = 0.2; learning rate: \alpha = 0.1.

In a similar manner, the perceptron can learn the operation OR. However, a single-layer perceptron cannot be trained to perform the operation Exclusive-OR. A little geometry can help us to understand why this is. Figure 6.7 represents the AND, OR and Exclusive-OR functions as two-dimensional plots based on the values of the two inputs. Points in the input space where the function output is 1 are indicated by black dots, and points where the output is 0 are indicated by white dots.

Figure 6.7 Two-dimensional plots of basic logical operations

In Figures 6.7(a) and (b), we can draw a line so that black dots are on one side and white dots on the other, but the dots shown in Figure 6.7(c) are not separable by a single line. A perceptron is able to represent a function only if there is some line that separates all the black dots from all the white dots. Such functions are called linearly separable. Therefore, a perceptron can learn the operations AND and OR, but not Exclusive-OR.
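The training run of Table 6.3 can be reproduced with a short script. This is a sketch of Steps 1-4 above with the table's fixed starting weights rather than random ones; the function name is my own, and updates are rounded so the 0.1-step arithmetic stays exact in floating point.

```python
def train_perceptron(examples, weights, theta, alpha=0.1, epochs=50):
    """Perceptron training (Steps 1-4) with a step activation function."""
    for _ in range(epochs):
        converged = True
        for inputs, y_d in examples:
            net = sum(x * w for x, w in zip(inputs, weights)) - theta
            y = 1 if net >= 0 else 0                       # Eq (6.6)
            e = y_d - y                                    # Eq (6.4)
            if e != 0:
                converged = False
                # delta rule, Eq (6.8); rounding keeps the 0.1-step
                # arithmetic exact, as in Table 6.3
                weights = [round(w + alpha * x * e, 10)
                           for x, w in zip(inputs, weights)]
        if converged:
            break
    return weights

AND = [([0, 0], 0), ([0, 1], 0), ([1, 0], 0), ([1, 1], 1)]
w = train_perceptron(AND, weights=[0.3, -0.1], theta=0.2)
print(w)   # prints [0.1, 0.1], the final row of Table 6.3
```

Run on the Exclusive-OR patterns instead, the same function exhausts its epoch budget without ever converging, which previews the discussion that follows.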
But why can a perceptron learn only linearly separable functions?

The fact that a perceptron can learn only linearly separable functions directly follows from Eq. (6.1). The perceptron output Y is 1 only if the total weighted input X is greater than or equal to the threshold value \theta. This means that the entire input space is divided in two along a boundary defined by X = \theta. For example, a separating line for the operation AND is defined by the equation

    x_1 w_1 + x_2 w_2 = \theta

If we substitute the values for weights w_1 and w_2 and threshold \theta given in Table 6.3, we obtain one of the possible separating lines as

    0.1 x_1 + 0.1 x_2 = 0.2

or

    x_1 + x_2 = 2

Thus, the region below the boundary line, where the output is 0, is given by

    x_1 + x_2 - 2 < 0,

and the region above this line, where the output is 1, is given by

    x_1 + x_2 - 2 \geq 0

The fact that a perceptron can learn only linearly separable functions is rather bad news, because there are not many such functions.

Can we do better by using a sigmoidal or linear element in place of the hard limiter?

Single-layer perceptrons make decisions in the same way, regardless of the activation function used by the perceptron (Shynk, 1990; Shynk and Bershad, 1992). It means that a single-layer perceptron can classify only linearly separable patterns, regardless of whether we use a hard-limit or soft-limit activation function. The computational limitations of a perceptron were mathematically analysed in Minsky and Papert's famous book Perceptrons (Minsky and Papert, 1969). They proved that Rosenblatt's perceptron cannot make global generalisations on the basis of examples learned locally. Moreover, Minsky and Papert concluded that the limitations of a single-layer perceptron would also hold true for multilayer neural networks. This conclusion certainly did not encourage further research on artificial neural networks.

How do we cope with problems which are not linearly separable?
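The separating line just derived can be checked at the four corners of the input space. With w_1 = w_2 = 0.1 and \theta = 0.2 from Table 6.3, only the point (1, 1) falls on the 'output 1' side of the boundary x_1 + x_2 = 2:

```python
# Perceptron output for the AND weights of Table 6.3: the boundary
# 0.1*x1 + 0.1*x2 = 0.2 (i.e. x1 + x2 = 2) separates (1, 1) from the rest.
def and_output(x1, x2, w1=0.1, w2=0.1, theta=0.2):
    return 1 if x1 * w1 + x2 * w2 - theta >= 0 else 0

for x1 in (0, 1):
    for x2 in (0, 1):
        print((x1, x2), and_output(x1, x2))
```

No such pair of weights exists for Exclusive-OR, since no single line can put (0, 1) and (1, 0) on one side and (0, 0) and (1, 1) on the other.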
To cope with such problems we need multilayer neural networks. In fact, history has proved that the limitations of Rosenblatt's perceptron can be overcome by advanced forms of neural networks, for example multilayer perceptrons trained with the back-propagation algorithm.

6.4 Multilayer neural networks

A multilayer perceptron is a feedforward neural network with one or more hidden layers. Typically, the network consists of an input layer of source neurons, at least one middle or hidden layer of computational neurons, and an output layer of computational neurons. The input signals are propagated in a forward direction on a layer-by-layer basis. A multilayer perceptron with two hidden layers is shown in Figure 6.8.

But why do we need a hidden layer?

Each layer in a multilayer neural network has its own specific function. The input layer accepts input signals from the outside world and redistributes these signals to all neurons in the hidden layer. Actually, the input layer rarely includes computing neurons, and thus does not process input patterns. The output layer accepts output signals, or in other words a stimulus pattern, from the hidden layer and establishes the output pattern of the entire network. Neurons in the hidden layer detect the features; the weights of the neurons represent the features hidden in the input patterns. These features are then used by the output layer in determining the output pattern. With one hidden layer, we can represent any continuous function of the input signals, and with two hidden layers even discontinuous functions can be represented.

Figure 6.8 Multilayer perceptron with two hidden layers

Why is a middle layer in a multilayer network called a 'hidden' layer? What does this layer hide?

A hidden layer 'hides' its desired output. Neurons in the hidden layer cannot be observed through the input/output behaviour of the network.
There is no obvious way to know what the desired output of the hidden layer should be. In other words, the desired output of the hidden layer is determined by the layer itself.

Can a neural network include more than two hidden layers?

Commercial ANNs incorporate three and sometimes four layers, including one or two hidden layers. Each layer can contain from 10 to 1000 neurons. Experimental neural networks may have five or even six layers, including three or four hidden layers, and utilise millions of neurons, but most practical applications use only three layers, because each additional layer increases the computational burden exponentially.

How do multilayer neural networks learn?

More than a hundred different learning algorithms are available, but the most popular method is back-propagation. This method was first proposed in 1969 (Bryson and Ho, 1969), but was ignored because of its demanding computations. Only in the mid-1980s was the back-propagation learning algorithm rediscovered.

Learning in a multilayer network proceeds the same way as for a perceptron. A training set of input patterns is presented to the network. The network computes its output pattern, and if there is an error – or in other words a difference between actual and desired output patterns – the weights are adjusted to reduce this error. In a perceptron, there is only one weight for each input and only one output. But in the multilayer network, there are many weights, each of which contributes to more than one output.

How can we assess the blame for an error and divide it among the contributing weights?

In a back-propagation neural network, the learning algorithm has two phases. First, a training input pattern is presented to the network input layer. The network then propagates the input pattern from layer to layer until the output pattern is generated by the output layer.
If this pattern is different from the desired output, an error is calculated and then propagated backwards through the network from the output layer to the input layer. The weights are modified as the error is propagated.

As with any other neural network, a back-propagation one is determined by the connections between neurons (the network's architecture), the activation function used by the neurons, and the learning algorithm (or the learning law) that specifies the procedure for adjusting weights. Typically, a back-propagation network is a multilayer network that has three or four layers. The layers are fully connected; that is, every neuron in each layer is connected to every other neuron in the adjacent forward layer.

A neuron determines its output in a manner similar to Rosenblatt's perceptron. First, it computes the net weighted input as before:

    X = \sum_{i=1}^{n} x_i w_i - \theta,

where n is the number of inputs, and \theta is the threshold applied to the neuron. Next, this input value is passed through the activation function. However, unlike a perceptron, neurons in the back-propagation network use a sigmoid activation function:

    Y^{sigmoid} = \frac{1}{1 + e^{-X}}                                       (6.9)

The derivative of this function is easy to compute. It also guarantees that the neuron output is bounded between 0 and 1.

What about the learning law used in the back-propagation networks?

To derive the back-propagation learning law, let us consider the three-layer network shown in Figure 6.9. The indices i, j and k here refer to neurons in the input, hidden and output layers, respectively. Input signals, x_1, x_2, ..., x_n, are propagated through the network from left to right, and error signals, e_1, e_2, ..., e_l, from right to left. The symbol w_{ij} denotes the weight for the connection between neuron i in the input layer and neuron j in the hidden layer, and the symbol w_{jk} the weight between neuron j in the hidden layer and neuron k in the output layer.
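The 'easy to compute' derivative of the sigmoid in Eq. (6.9) is dY/dX = Y(1 - Y), the identity the back-propagation error gradients rely on. A quick numerical check, comparing a central-difference estimate against the closed form at an arbitrary point:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Compare a central-difference estimate of the derivative with y * (1 - y)
x = 0.7          # an arbitrary test point
y = sigmoid(x)
h = 1e-6
numeric = (sigmoid(x + h) - sigmoid(x - h)) / (2 * h)
analytic = y * (1 - y)
print(round(numeric, 6), round(analytic, 6))   # the two values agree
```

The agreement holds at any x; the derivative is largest (0.25) at x = 0 and shrinks towards zero as the neuron saturates at either extreme.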
Figure 6.9 Three-layer back-propagation neural network

To propagate error signals, we start at the output layer and work backward to the hidden layer. The error signal at the output of neuron k at iteration p is defined by

    e_k(p) = y_{d,k}(p) - y_k(p),                                            (6.10)

where y_{d,k}(p) is the desired output of neuron k at iteration p.

Neuron k, which is located in the output layer, is supplied with a desired output of its own. Hence, we may use a straightforward procedure to update weight w_{jk}. In fact, the rule for updating weights at the output layer is similar to the perceptron learning rule of Eq. (6.7):

    w_{jk}(p + 1) = w_{jk}(p) + \Delta w_{jk}(p),                            (6.11)

where \Delta w_{jk}(p) is the weight correction. When we determined the weight correction for the perceptron, we used input signal x_i. But in the multilayer network, the inputs of neurons in the output layer are different from the inputs of neurons in the input layer.

As we cannot apply input signal x_i, what should we use instead?

We use the output of neuron j in the hidden layer, y_j, instead of input x_i. The weight correction in the multilayer network is computed by (Fu, 1994):

    \Delta w_{jk}(p) = \alpha \cdot y_j(p) \cdot \delta_k(p),                (6.12)

where \delta_k(p) is the error gradient at neuron k in the output layer at iteration p.

What is the error gradient?

The error gradient is determined as the derivative of the activation function multiplied by the error at the neuron output. Thus, for neuron k in the output layer, we have

    \delta_k(p) = \frac{\partial y_k(p)}{\partial X_k(p)} \cdot e_k(p),      (6.13)

where y_k(p) is the output of neuron k at iteration p, and X_k(p) is the net weighted input to neuron k at the same iteration. For a sigmoid activation function, Eq.
(6.13) can be represented as

    \delta_k(p) = \frac{\partial}{\partial X_k(p)} \left\{ \frac{1}{1 + \exp[-X_k(p)]} \right\} \cdot e_k(p)
                = \frac{\exp[-X_k(p)]}{\{1 + \exp[-X_k(p)]\}^2} \cdot e_k(p)

Thus, we obtain:

    \delta_k(p) = y_k(p) \cdot [1 - y_k(p)] \cdot e_k(p),                    (6.14)

where

    y_k(p) = \frac{1}{1 + \exp[-X_k(p)]}

How can we determine the weight correction for a neuron in the hidden layer?

To calculate the weight correction for the hidden layer, we can apply the same equation as for the output layer:

    \Delta w_{ij}(p) = \alpha \cdot x_i(p) \cdot \delta_j(p),                (6.15)

where \delta_j(p) represents the error gradient at neuron j in the hidden layer:

    \delta_j(p) = y_j(p) \cdot [1 - y_j(p)] \cdot \sum_{k=1}^{l} \delta_k(p) w_{jk}(p),

where l is the number of neurons in the output layer;

    y_j(p) = \frac{1}{1 + \exp[-X_j(p)]},
    X_j(p) = \sum_{i=1}^{n} x_i(p) \cdot w_{ij}(p) - \theta_j,

and n is the number of neurons in the input layer.

Now we can derive the back-propagation training algorithm.

Step 1: Initialisation
Set all the weights and threshold levels of the network to random numbers uniformly distributed inside a small range (Haykin, 1999):

    \left( -\frac{2.4}{F_i}, +\frac{2.4}{F_i} \right),

where F_i is the total number of inputs of neuron i in the network. The weight initialisation is done on a neuron-by-neuron basis.

Step 2: Activation
Activate the back-propagation neural network by applying inputs x_1(p), x_2(p), ..., x_n(p) and desired outputs y_{d,1}(p), y_{d,2}(p), ..., y_{d,n}(p).

(a) Calculate the actual outputs of the neurons in the hidden layer:

    y_j(p) = sigmoid \left[ \sum_{i=1}^{n} x_i(p) \cdot w_{ij}(p) - \theta_j \right],

where n is the number of inputs of neuron j in the hidden layer, and sigmoid is the sigmoid activation function.

(b) Calculate the actual outputs of the neurons in the output layer:

    y_k(p) = sigmoid \left[ \sum_{j=1}^{m} x_{jk}(p) \cdot w_{jk}(p) - \theta_k \right],

where m is the number of inputs of neuron k in the output layer.

Step 3: Weight training
Update the weights in the back-propagation network propagating backward the errors associated with output neurons.
(a) Calculate the error gradient for the neurons in the output layer:

    \delta_k(p) = y_k(p) \cdot [1 - y_k(p)] \cdot e_k(p),

where

    e_k(p) = y_{d,k}(p) - y_k(p)

Calculate the weight corrections:

    \Delta w_{jk}(p) = \alpha \cdot y_j(p) \cdot \delta_k(p)

Update the weights at the output neurons:

    w_{jk}(p + 1) = w_{jk}(p) + \Delta w_{jk}(p)

(b) Calculate the error gradient for the neurons in the hidden layer:

    \delta_j(p) = y_j(p) \cdot [1 - y_j(p)] \cdot \sum_{k=1}^{l} \delta_k(p) \cdot w_{jk}(p)

Calculate the weight corrections:

    \Delta w_{ij}(p) = \alpha \cdot x_i(p) \cdot \delta_j(p)

Update the weights at the hidden neurons:

    w_{ij}(p + 1) = w_{ij}(p) + \Delta w_{ij}(p)

Step 4: Iteration
Increase iteration p by one, go back to Step 2 and repeat the process until the selected error criterion is satisfied.

As an example, we may consider the three-layer back-propagation network shown in Figure 6.10. Suppose that the network is required to perform the logical operation Exclusive-OR. Recall that a single-layer perceptron could not do this operation. Now we will apply the three-layer net. Neurons 1 and 2 in the input layer accept inputs x_1 and x_2, respectively, and redistribute these inputs to the neurons in the hidden layer without any processing: x_13 = x_14 = x_1 and x_23 = x_24 = x_2.

Figure 6.10 Three-layer network for solving the Exclusive-OR operation

The effect of the threshold applied to a neuron in the hidden or output layer is represented by its weight, \theta, connected to a fixed input equal to -1. The initial weights and threshold levels are set randomly as follows: w_13 = 0.5, w_14 = 0.9, w_23 = 0.4, w_24 = 1.0, w_35 = -1.2, w_45 = 1.1, \theta_3 = 0.8, \theta_4 = -0.1 and \theta_5 = 0.3. Consider a training set where inputs x_1 and x_2 are equal to 1 and desired output y_{d,5} is 0.
The actual outputs of neurons 3 and 4 in the hidden layer are calculated as

    y_3 = sigmoid(x_1 w_13 + x_2 w_23 - \theta_3) = 1 / [1 + e^{-(1 \cdot 0.5 + 1 \cdot 0.4 - 1 \cdot 0.8)}] = 0.5250
    y_4 = sigmoid(x_1 w_14 + x_2 w_24 - \theta_4) = 1 / [1 + e^{-(1 \cdot 0.9 + 1 \cdot 1.0 + 1 \cdot 0.1)}] = 0.8808

Now the actual output of neuron 5 in the output layer is determined as

    y_5 = sigmoid(y_3 w_35 + y_4 w_45 - \theta_5) = 1 / [1 + e^{-(-0.5250 \cdot 1.2 + 0.8808 \cdot 1.1 - 1 \cdot 0.3)}] = 0.5097

Thus, the following error is obtained:

    e = y_{d,5} - y_5 = 0 - 0.5097 = -0.5097

The next step is weight training. To update the weights and threshold levels in our network, we propagate the error, e, from the output layer backward to the input layer. First, we calculate the error gradient for neuron 5 in the output layer:

    \delta_5 = y_5 (1 - y_5) e = 0.5097 \cdot (1 - 0.5097) \cdot (-0.5097) = -0.1274

Then we determine the weight corrections assuming that the learning rate parameter, \alpha, is equal to 0.1:

    \Delta w_35 = \alpha \cdot y_3 \cdot \delta_5 = 0.1 \cdot 0.5250 \cdot (-0.1274) = -0.0067
    \Delta w_45 = \alpha \cdot y_4 \cdot \delta_5 = 0.1 \cdot 0.8808 \cdot (-0.1274) = -0.0112
    \Delta \theta_5 = \alpha \cdot (-1) \cdot \delta_5 = 0.1 \cdot (-1) \cdot (-0.1274) = 0.0127

Next we calculate the error gradients for neurons 3 and 4 in the hidden layer:

    \delta_3 = y_3 (1 - y_3) \cdot \delta_5 \cdot w_35 = 0.5250 \cdot (1 - 0.5250) \cdot (-0.1274) \cdot (-1.2) = 0.0381
    \delta_4 = y_4 (1 - y_4) \cdot \delta_5 \cdot w_45 = 0.8808 \cdot (1 - 0.8808) \cdot (-0.1274) \cdot 1.1 = -0.0147

We then determine the weight corrections:

    \Delta w_13 = \alpha \cdot x_1 \cdot \delta_3 = 0.1 \cdot 1 \cdot 0.0381 = 0.0038
    \Delta w_23 = \alpha \cdot x_2 \cdot \delta_3 = 0.1 \cdot 1 \cdot 0.0381 = 0.0038
    \Delta \theta_3 = \alpha \cdot (-1) \cdot \delta_3 = 0.1 \cdot (-1) \cdot 0.0381 = -0.0038
    \Delta w_14 = \alpha \cdot x_1 \cdot \delta_4 = 0.1 \cdot 1 \cdot (-0.0147) = -0.0015
    \Delta w_24 = \alpha \cdot x_2 \cdot \delta_4 = 0.1 \cdot 1 \cdot (-0.0147) = -0.0015
    \Delta \theta_4 = \alpha \cdot (-1) \cdot \delta_4 = 0.1 \cdot (-1) \cdot (-0.0147) = 0.0015

At last, we update all weights and threshold levels in our network:

    w_13 = w_13 + \Delta w_13 = 0.5 + 0.0038 = 0.5038
    w_14 = w_14 + \Delta w_14 = 0.9 - 0.0015 = 0.8985
    w_23 = w_23 + \Delta w_23 = 0.4 + 0.0038 = 0.4038
    w_24 = w_24 + \Delta w_24 = 1.0 - 0.0015 = 0.9985
    w_35 = w_35 + \Delta w_35 = -1.2 - 0.0067 = -1.2067
    w_45 = w_45 + \Delta w_45 = 1.1 - 0.0112 = 1.0888
    \theta_3 = \theta_3 + \Delta \theta_3 = 0.8 - 0.0038 = 0.7962
    \theta_4 = \theta_4 + \Delta \theta_4 = -0.1 + 0.0015 = -0.0985
    \theta_5 = \theta_5 + \Delta \theta_5 = 0.3 + 0.0127 = 0.3127

The training process is repeated until the sum of squared errors is less than 0.001.

Why do we need to sum the squared errors?
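The arithmetic of this single forward-and-backward pass can be reproduced in a few lines. The sketch below uses the initial weights, thresholds and learning rate given above; variable names such as `d5` for \delta_5 are my own shorthand.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Initial weights, thresholds and learning rate from the worked example
w13, w14, w23, w24, w35, w45 = 0.5, 0.9, 0.4, 1.0, -1.2, 1.1
t3, t4, t5 = 0.8, -0.1, 0.3
alpha, x1, x2, yd5 = 0.1, 1, 1, 0

# Forward pass
y3 = sigmoid(x1 * w13 + x2 * w23 - t3)   # 0.5250
y4 = sigmoid(x1 * w14 + x2 * w24 - t4)   # 0.8808
y5 = sigmoid(y3 * w35 + y4 * w45 - t5)   # 0.5097
e = yd5 - y5                             # -0.5097

# Backward pass: error gradients, Eq (6.14) and the hidden-layer rule
d5 = y5 * (1 - y5) * e                   # -0.1274
d3 = y3 * (1 - y3) * d5 * w35            #  0.0381
d4 = y4 * (1 - y4) * d5 * w45            # -0.0147

# Updates; each threshold acts as a weight on a fixed input of -1
w35 += alpha * y3 * d5;  w45 += alpha * y4 * d5;  t5 += alpha * (-1) * d5
w13 += alpha * x1 * d3;  w23 += alpha * x2 * d3;  t3 += alpha * (-1) * d3
w14 += alpha * x1 * d4;  w24 += alpha * x2 * d4;  t4 += alpha * (-1) * d4

print(round(y5, 4), round(d5, 4), round(w35, 4), round(t5, 4))
# prints 0.5097 -0.1274 -1.2067 0.3127
```

The printed values match the hand calculation above to four decimal places, as do the remaining updated weights and thresholds.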
The sum of the squared errors is a useful indicator of the network's performance. The back-propagation training algorithm attempts to minimise this criterion. When the value of the sum of squared errors over an entire pass through all training sets, or epoch, is sufficiently small, a network is considered to have converged. In our example, the sum of squared errors must fall below 0.001.

Figure 6.11 Learning curve for operation Exclusive-OR

Figure 6.11 represents a learning curve: the sum of squared errors plotted versus the number of epochs used in training. The learning curve shows how fast a network is learning. It took 224 epochs, or 896 iterations, to train our network to perform the Exclusive-OR operation. The following set of final weights and threshold levels satisfied the chosen error criterion:

w13 = 4.7621, w14 = 6.3917, w23 = 4.7618, w24 = 6.3917, w35 = -10.3788, w45 = 9.7691, θ3 = 7.3061, θ4 = 2.8441 and θ5 = 4.5589.

The network has solved the problem! We may now test our network by presenting all training sets and calculating the network's output. The results are shown in Table 6.4.

Table 6.4 Final results of three-layer network learning: the logical operation Exclusive-OR

    Inputs         Desired output   Actual output   Error     Sum of squared
    x1    x2       yd               y5              e         errors
    1     1        0                0.0155          -0.0155   0.0010
    0     1        1                0.9849           0.0151
    1     0        1                0.9849           0.0151
    0     0        0                0.0175          -0.0175

The initial weights and thresholds are set randomly. Does this mean that the same network may find different solutions?

The network obtains different weights and threshold values when it starts from different initial conditions. However, it will always solve the problem, although possibly in a different number of iterations. For instance, when the network was trained again, we obtained the following solution:

w13 = -6.3041, w14 = -5.7896, w23 = 6.2288, w24 = 6.0088, w35 = 9.6657, w45 = -9.4242, θ3 = 3.3858, θ4 = -2.8976 and θ5 = -4.4859.
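Plugging the first set of final weights into the forward pass confirms the outputs of Table 6.4. This is a small sketch of our own (the function name is ours), assuming the same sigmoid activation used throughout the example:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# final weights and thresholds that satisfied the error criterion
w13, w14, w23, w24 = 4.7621, 6.3917, 4.7618, 6.3917
w35, w45 = -10.3788, 9.7691
t3, t4, t5 = 7.3061, 2.8441, 4.5589

def xor_net(x1, x2):
    """Forward pass of the trained three-layer Exclusive-OR network."""
    y3 = sigmoid(x1 * w13 + x2 * w23 - t3)   # hidden neuron 3
    y4 = sigmoid(x1 * w14 + x2 * w24 - t4)   # hidden neuron 4
    return sigmoid(y3 * w35 + y4 * w45 - t5) # output neuron 5
```

xor_net(1, 1) and xor_net(0, 0) come out close to 0, while xor_net(0, 1) and xor_net(1, 0) come out close to 1, matching the table.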
Can we now draw decision boundaries constructed by the multilayer network for operation Exclusive-OR?

It may be rather difficult to draw decision boundaries constructed by neurons with a sigmoid activation function. However, we can represent each neuron in the hidden and output layers by a McCulloch and Pitts model, using a sign function. The network in Figure 6.12 is also trained to perform the Exclusive-OR operation (Touretzky and Pomerleau, 1989; Haykin, 1999). The positions of the decision boundaries constructed by neurons 3 and 4 in the hidden layer are shown in Figures 6.13(a) and (b), respectively. Neuron 5 in the output layer performs a linear combination of the decision boundaries formed by the two hidden neurons, as shown in Figure 6.13(c). The network in Figure 6.12 does indeed separate black and white dots and thus solves the Exclusive-OR problem.

Figure 6.12 Network represented by McCulloch–Pitts model for solving the Exclusive-OR operation

Figure 6.13 (a) Decision boundary constructed by hidden neuron 3 of the network in Figure 6.12; (b) decision boundary constructed by hidden neuron 4; (c) decision boundaries constructed by the complete three-layer network

Is back-propagation learning a good method for machine learning?

Although widely used, back-propagation learning is not immune from problems. For example, the back-propagation learning algorithm does not seem to function in the biological world (Stork, 1989). Biological neurons do not work backward to adjust the strengths of their interconnections, synapses, and thus back-propagation learning cannot be viewed as a process that emulates brain-like learning.

Another apparent problem is that the calculations are extensive and, as a result, training is slow. In fact, a pure back-propagation algorithm is rarely used in practical applications.
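A two-hidden-neuron network of hard-threshold McCulloch–Pitts units really can compute Exclusive-OR, as Figure 6.12 illustrates. The sketch below uses illustrative hand-picked weights of our own choosing (not necessarily those shown in the figure): one hidden neuron acts as AND, the other as OR, and the output neuron fires when OR is on but AND is not.

```python
def step(x):
    """Hard-limit (McCulloch-Pitts) activation: 1 if the net input is positive."""
    return 1 if x > 0 else 0

def xor_mcp(x1, x2):
    # hidden neuron 3 acts as AND, hidden neuron 4 as OR (hypothetical weights)
    n3 = step(1.0 * x1 + 1.0 * x2 - 1.5)
    n4 = step(1.0 * x1 + 1.0 * x2 - 0.5)
    # output neuron 5: OR minus a strong veto from AND yields XOR
    return step(-2.0 * n3 + 1.0 * n4 - 0.5)
```

The two hidden thresholds (1.5 and 0.5) are the two parallel decision lines of Figures 6.13(a) and (b); the output neuron combines the half-planes they define.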
There are several possible ways to improve the computational efficiency of the back-propagation algorithm (Caudill, 1991; Jacobs, 1988; Stubbs, 1990). Some of them are discussed below.

6.5 Accelerated learning in multilayer neural networks

A multilayer network, in general, learns much faster when the sigmoidal activation function is represented by a hyperbolic tangent,

\[ Y^{tanh} = \frac{2a}{1 + e^{-bX}} - a \tag{6.16} \]

where a and b are constants. Suitable values for a and b are: a = 1.716 and b = 0.667 (Guyon, 1991).

We also can accelerate training by including a momentum term in the delta rule of Eq. (6.12) (Rumelhart et al., 1986):

\[ \Delta w_{jk}(p) = \beta \, \Delta w_{jk}(p-1) + \alpha \, y_j(p) \, \delta_k(p) \tag{6.17} \]

where β is a positive number (0 ≤ β < 1) called the momentum constant. Typically, the momentum constant is set to 0.95. Equation (6.17) is called the generalised delta rule. In the special case when β = 0, we obtain the delta rule of Eq. (6.12).

Why do we need the momentum constant?

According to the observations made in Watrous (1987) and Jacobs (1988), the inclusion of momentum in the back-propagation algorithm has a stabilising effect on training. In other words, momentum tends to accelerate descent in the steady downhill direction, and to slow down the process when the learning surface exhibits peaks and valleys.

Figure 6.14 Learning with momentum

Figure 6.14 represents learning with momentum for operation Exclusive-OR. A comparison with the pure back-propagation algorithm shows that the number of epochs is reduced from 224 to 126.

In the delta and generalised delta rules, we use a constant and rather small value for the learning rate parameter, α. Can we increase this value to speed up training?

One of the most effective means to accelerate the convergence of back-propagation learning is to adjust the learning rate parameter during training.
A small learning rate parameter, α, causes small changes to the weights from one iteration to the next, and thus leads to a smooth learning curve. On the other hand, if the learning rate parameter, α, is made larger to speed up the training process, the resulting larger changes in the weights may cause instability and, as a result, the network may become oscillatory. To accelerate the convergence and yet avoid the danger of instability, we can apply two heuristics (Jacobs, 1988):

Heuristic 1. If the change of the sum of squared errors has the same algebraic sign for several consecutive epochs, then the learning rate parameter, α, should be increased.

Heuristic 2. If the algebraic sign of the change of the sum of squared errors alternates for several consecutive epochs, then the learning rate parameter, α, should be decreased.

Adapting the learning rate requires some changes in the back-propagation algorithm. First, the network outputs and errors are calculated with the initial learning rate parameter. If the sum of squared errors at the current epoch exceeds the previous value by more than a predefined ratio (typically 1.04), the learning rate parameter is decreased (typically by multiplying by 0.7) and new weights and thresholds are calculated. However, if the error is less than the previous one, the learning rate is increased (typically by multiplying by 1.05).

Figure 6.15 represents an example of back-propagation training with adaptive learning rate. It demonstrates that adapting the learning rate can indeed decrease the number of iterations.

Learning rate adaptation can be used together with learning with momentum. Figure 6.16 shows the benefits of applying both techniques simultaneously. The use of momentum and adaptive learning rate significantly improves the performance of a multilayer back-propagation neural network and minimises the chance that the network becomes oscillatory.
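Both acceleration techniques reduce to a few lines each. Below is a hedged sketch of our own, using the constants quoted above (β = 0.95, ratio 1.04, factors 0.7 and 1.05). It is a simplification: in the full algorithm, the epoch whose error grew would also be recomputed with the reduced rate, which this version omits.

```python
def generalised_delta(dw_prev, alpha, y_j, delta_k, beta=0.95):
    # Eq. (6.17): dw_jk(p) = beta * dw_jk(p-1) + alpha * y_j(p) * delta_k(p)
    return beta * dw_prev + alpha * y_j * delta_k

def adapt_learning_rate(lr, sse_now, sse_prev,
                        ratio=1.04, decrease=0.7, increase=1.05):
    """Adjust the learning rate from the change in the sum of squared errors."""
    if sse_now > sse_prev * ratio:   # error grew by more than 4%: slow down
        return lr * decrease
    if sse_now < sse_prev:           # error fell: speed up
        return lr * increase
    return lr                        # otherwise leave the rate alone
```

With beta = 0, the first function reduces to the plain delta rule of Eq. (6.12); with beta close to 1, successive corrections of the same sign reinforce each other, which is what accelerates descent in a steady downhill direction.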
Neural networks were designed on an analogy with the brain. The brain's memory, however, works by association. For example, we can recognise a familiar face even in an unfamiliar environment within 100–200 ms. We can also recall a complete sensory experience, including sounds and scenes, when we hear only a few bars of music. The brain routinely associates one thing with another.

Figure 6.15 Learning with adaptive learning rate

Figure 6.16 Learning with momentum and adaptive learning rate

Can a neural network simulate associative characteristics of the human memory?

Multilayer neural networks trained with the back-propagation algorithm are used for pattern recognition problems. But, as we noted, such networks are not intrinsically intelligent. To emulate the human memory's associative characteristics we need a different type of network: a recurrent neural network.

6.6 The Hopfield network

A recurrent neural network has feedback loops from its outputs to its inputs. The presence of such loops has a profound impact on the learning capability of the network.

How does the recurrent network learn?

After applying a new input, the network output is calculated and fed back to adjust the input. Then the output is calculated again, and the process is repeated until the output becomes constant.

Does the output always become constant?

Successive iterations do not always produce smaller and smaller output changes; on the contrary, they may lead to chaotic behaviour. In such a case, the network output never becomes constant, and the network is said to be unstable. The stability of recurrent networks intrigued several researchers in the 1960s and 1970s. However, none was able to predict which network would be stable, and some researchers were pessimistic about finding a solution at all.

Figure 6.17 Single-layer n-neuron Hopfield network
The problem was solved only in 1982, when John Hopfield formulated the physical principle of storing information in a dynamically stable network (Hopfield, 1982).

Figure 6.17 shows a single-layer Hopfield network consisting of n neurons. The output of each neuron is fed back to the inputs of all other neurons (there is no self-feedback in the Hopfield network). The Hopfield network usually uses McCulloch and Pitts neurons with the sign activation function as its computing element.

How does this function work here?

It works in a similar way to the sign function represented in Figure 6.4. If the neuron's weighted input is less than zero, the output is −1; if the input is greater than zero, the output is +1. However, if the neuron's weighted input is exactly zero, its output remains unchanged – in other words, a neuron remains in its previous state, regardless of whether it is +1 or −1.

\[ Y^{sign} = \begin{cases} +1, & \text{if } X > 0 \\ -1, & \text{if } X < 0 \\ \;\,Y, & \text{if } X = 0 \end{cases} \tag{6.18} \]

The sign activation function may be replaced with a saturated linear function, which acts as a pure linear function within the region [−1, 1] and as a sign function outside this region. The saturated linear function is shown in Figure 6.18.

Figure 6.18 The saturated linear activation function

The current state of the network is determined by the current outputs of all neurons, y1, y2, ..., yn. Thus, for a single-layer n-neuron network, the state can be defined by the state vector as

\[ \mathbf{Y} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix} \tag{6.19} \]

In the Hopfield network, synaptic weights between neurons are usually represented in matrix form as follows:

\[ \mathbf{W} = \sum_{m=1}^{M} \mathbf{Y}_m \mathbf{Y}_m^T - M\,\mathbf{I} \tag{6.20} \]

where M is the number of states to be memorised by the network, Ym is the n-dimensional binary vector, I is the n × n identity matrix, and superscript T denotes matrix transposition.

An operation of the Hopfield network can be represented geometrically.
Figure 6.19 shows a three-neuron network represented as a cube in three-dimensional space. In general, a network with n neurons has 2^n possible states and is associated with an n-dimensional hypercube. In Figure 6.19, each state is represented by a vertex. When a new input vector is applied, the network moves from one state-vertex to another until it becomes stable.

Figure 6.19 Cube representation of the possible states for the three-neuron Hopfield network

What determines a stable state-vertex?

The stable state-vertex is determined by the weight matrix W, the current input vector X, and the threshold matrix θ. If the input vector is partially incorrect or incomplete, the initial state will converge into the stable state-vertex after a few iterations.

Suppose, for instance, that our network is required to memorise two opposite states, (1, 1, 1) and (−1, −1, −1). Thus,

\[ \mathbf{Y}_1 = \begin{bmatrix} 1 \\ 1 \\ 1 \end{bmatrix} \quad \text{and} \quad \mathbf{Y}_2 = \begin{bmatrix} -1 \\ -1 \\ -1 \end{bmatrix} \]

where Y1 and Y2 are three-dimensional vectors. We can also represent these vectors in the row, or transposed, form

\[ \mathbf{Y}_1^T = \begin{bmatrix} 1 & 1 & 1 \end{bmatrix} \quad \text{and} \quad \mathbf{Y}_2^T = \begin{bmatrix} -1 & -1 & -1 \end{bmatrix} \]

The 3 × 3 identity matrix I is

\[ \mathbf{I} = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix} \]

Thus, we can now determine the weight matrix as follows:

\[ \mathbf{W} = \mathbf{Y}_1 \mathbf{Y}_1^T + \mathbf{Y}_2 \mathbf{Y}_2^T - 2\,\mathbf{I} \]

or

\[ \mathbf{W} = \begin{bmatrix} 1 \\ 1 \\ 1 \end{bmatrix} \begin{bmatrix} 1 & 1 & 1 \end{bmatrix} + \begin{bmatrix} -1 \\ -1 \\ -1 \end{bmatrix} \begin{bmatrix} -1 & -1 & -1 \end{bmatrix} - 2 \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix} = \begin{bmatrix} 0 & 2 & 2 \\ 2 & 0 & 2 \\ 2 & 2 & 0 \end{bmatrix} \]

Next, the network is tested by the sequence of input vectors, X1 and X2, which are equal to the output (or target) vectors Y1 and Y2, respectively. We want to see whether our network is capable of recognising familiar patterns.

How is the Hopfield network tested?

First, we activate it by applying the input vector X. Then, we calculate the actual output vector Y, and finally, we compare the result with the initial input vector X:

\[ \mathbf{Y}_m = \mathrm{sign}\,(\mathbf{W} \mathbf{X}_m - \boldsymbol{\theta}), \quad m = 1, 2, \ldots, M \tag{6.21} \]

where θ is the threshold matrix. In our example, we may assume all thresholds to be zero.
Thus,

\[ \mathbf{Y}_1 = \mathrm{sign} \left( \begin{bmatrix} 0 & 2 & 2 \\ 2 & 0 & 2 \\ 2 & 2 & 0 \end{bmatrix} \begin{bmatrix} 1 \\ 1 \\ 1 \end{bmatrix} - \begin{bmatrix} 0 \\ 0 \\ 0 \end{bmatrix} \right) = \begin{bmatrix} 1 \\ 1 \\ 1 \end{bmatrix} \]

and

\[ \mathbf{Y}_2 = \mathrm{sign} \left( \begin{bmatrix} 0 & 2 & 2 \\ 2 & 0 & 2 \\ 2 & 2 & 0 \end{bmatrix} \begin{bmatrix} -1 \\ -1 \\ -1 \end{bmatrix} - \begin{bmatrix} 0 \\ 0 \\ 0 \end{bmatrix} \right) = \begin{bmatrix} -1 \\ -1 \\ -1 \end{bmatrix} \]

As we see, Y1 = X1 and Y2 = X2. Thus, both states, (1, 1, 1) and (−1, −1, −1), are said to be stable.

How about other states?

With three neurons in the network, there are eight possible states. The remaining six states are all unstable. However, stable states (also called fundamental memories) are capable of attracting states that are close to them. As shown in Table 6.5, the fundamental memory (1, 1, 1) attracts the unstable states (−1, 1, 1), (1, −1, 1) and (1, 1, −1). Each of these unstable states represents a single error, compared to the fundamental memory (1, 1, 1). On the other hand, the fundamental memory (−1, −1, −1) attracts the unstable states (−1, −1, 1), (−1, 1, −1) and (1, −1, −1). Here again, each of the unstable states represents a single error, compared to the fundamental memory. Thus, the Hopfield network can indeed act as an error correction network.

Table 6.5 Operation of the three-neuron Hopfield network

    Possible state   Iteration   Inputs x1 x2 x3   Outputs y1 y2 y3   Fundamental memory
    ( 1,  1,  1)     0            1  1  1            1  1  1          (1, 1, 1)
    (-1,  1,  1)     0           -1  1  1            1  1  1          (1, 1, 1)
    ( 1, -1,  1)     0            1 -1  1            1  1  1          (1, 1, 1)
    ( 1,  1, -1)     0            1  1 -1            1  1  1          (1, 1, 1)
    (-1, -1, -1)     0           -1 -1 -1           -1 -1 -1          (-1, -1, -1)
    (-1, -1,  1)     0           -1 -1  1           -1 -1 -1          (-1, -1, -1)
    (-1,  1, -1)     0           -1  1 -1           -1 -1 -1          (-1, -1, -1)
    ( 1, -1, -1)     0            1 -1 -1           -1 -1 -1          (-1, -1, -1)

Let us now summarise the Hopfield network training algorithm.

Step 1: Storage
The n-neuron Hopfield network is required to store a set of M fundamental memories, Y1, Y2, ..., YM. The synaptic weight from neuron i to neuron j is calculated as

\[ w_{ij} = \begin{cases} \displaystyle\sum_{m=1}^{M} y_{m,i}\, y_{m,j}, & i \neq j \\ 0, & i = j \end{cases} \tag{6.22} \]

where y_{m,i} and y_{m,j} are the ith and the jth elements of the fundamental memory Ym, respectively.
In matrix form, the synaptic weights between neurons are represented as

\[ \mathbf{W} = \sum_{m=1}^{M} \mathbf{Y}_m \mathbf{Y}_m^T - M\,\mathbf{I} \]

The Hopfield network can store a set of fundamental memories if the weight matrix is symmetrical, with zeros in its main diagonal (Cohen and Grossberg, 1983). That is,

\[ \mathbf{W} = \begin{bmatrix} 0 & w_{12} & \cdots & w_{1i} & \cdots & w_{1n} \\ w_{21} & 0 & \cdots & w_{2i} & \cdots & w_{2n} \\ \vdots & \vdots & & \vdots & & \vdots \\ w_{i1} & w_{i2} & \cdots & 0 & \cdots & w_{in} \\ \vdots & \vdots & & \vdots & & \vdots \\ w_{n1} & w_{n2} & \cdots & w_{ni} & \cdots & 0 \end{bmatrix} \tag{6.23} \]

where w_{ij} = w_{ji}. Once the weights are calculated, they remain fixed.

Step 2: Testing
We need to confirm that the Hopfield network is capable of recalling all fundamental memories. In other words, the network must recall any fundamental memory Ym when presented with it as an input. That is,

\[ x_{m,i} = y_{m,i}, \quad i = 1, 2, \ldots, n; \;\; m = 1, 2, \ldots, M \]

\[ y_{m,i} = \mathrm{sign} \left( \sum_{j=1}^{n} w_{ij}\, x_{m,j} - \theta_i \right) \]

where y_{m,i} is the ith element of the actual output vector Ym, and x_{m,j} is the jth element of the input vector Xm. In matrix form,

\[ \mathbf{X}_m = \mathbf{Y}_m, \quad m = 1, 2, \ldots, M \]
\[ \mathbf{Y}_m = \mathrm{sign}\,(\mathbf{W} \mathbf{X}_m - \boldsymbol{\theta}) \]

If all fundamental memories are recalled perfectly, we may proceed to the next step.

Step 3: Retrieval
Present an unknown n-dimensional vector (probe), X, to the network and retrieve a stable state. Typically, the probe represents a corrupted or incomplete version of a fundamental memory, that is,

\[ \mathbf{X} \neq \mathbf{Y}_m, \quad m = 1, 2, \ldots, M \]

(a) Initialise the retrieval algorithm of the Hopfield network by setting

\[ x_j(0) = x_j, \quad j = 1, 2, \ldots, n \]

and calculate the initial state for each neuron

\[ y_i(0) = \mathrm{sign} \left( \sum_{j=1}^{n} w_{ij}\, x_j(0) - \theta_i \right), \quad i = 1, 2, \ldots, n \]

where x_j(0) is the jth element of the probe vector X at iteration p = 0, and y_i(0) is the state of neuron i at iteration p = 0. In matrix form, the state vector at iteration p = 0 is presented as

\[ \mathbf{Y}(0) = \mathrm{sign}\,[\mathbf{W} \mathbf{X}(0) - \boldsymbol{\theta}] \]

(b) Update the elements of the state vector, Y(p), according to the following rule:

\[ y_i(p+1) = \mathrm{sign} \left( \sum_{j=1}^{n} w_{ij}\, x_j(p) - \theta_i \right) \]

Neurons for updating are selected asynchronously, that is, randomly and one at a time.
Repeat the iteration until the state vector becomes unchanged, or in other words, a stable state is achieved. The condition for stability can be defined as

\[ y_i(p+1) = \mathrm{sign} \left( \sum_{j=1}^{n} w_{ij}\, y_j(p) - \theta_i \right), \quad i = 1, 2, \ldots, n \tag{6.24} \]

or, in matrix form,

\[ \mathbf{Y}(p+1) = \mathrm{sign}\,[\mathbf{W}\,\mathbf{Y}(p) - \boldsymbol{\theta}] \tag{6.25} \]
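The three steps of the algorithm can be sketched end to end in pure Python. This is our own illustration of the method, under the assumptions of the worked example (bipolar vectors, zero thresholds, previous state kept when the weighted input is exactly zero); the function names are ours, and the outer products of Eq. (6.20) are written out by hand to keep the sketch dependency-free.

```python
import random

def hopfield_weights(memories):
    """Step 1: W = sum_m Ym Ym^T - M I (Eq. 6.20) for bipolar memory vectors."""
    n, M = len(memories[0]), len(memories)
    W = [[0] * n for _ in range(n)]
    for Y in memories:
        for i in range(n):
            for j in range(n):
                W[i][j] += Y[i] * Y[j]   # accumulate the outer product Ym Ym^T
    for i in range(n):
        W[i][i] -= M                     # subtract M I: zero the main diagonal
    return W

def retrieve(W, probe, theta=None, seed=0):
    """Step 3: asynchronous updates until a full sweep changes nothing."""
    n = len(probe)
    theta = theta if theta is not None else [0] * n
    rng = random.Random(seed)
    y = list(probe)
    while True:
        changed = False
        order = list(range(n))
        rng.shuffle(order)               # neurons picked randomly, one at a time
        for i in order:
            net = sum(W[i][j] * y[j] for j in range(n)) - theta[i]
            new = 1 if net > 0 else -1 if net < 0 else y[i]  # Eq. (6.18)
            if new != y[i]:
                y[i], changed = new, True
        if not changed:                  # stability condition, Eqs (6.24)-(6.25)
            return y

# the worked example: memorise (1, 1, 1) and (-1, -1, -1)
W = hopfield_weights([[1, 1, 1], [-1, -1, -1]])
```

W comes out as [[0, 2, 2], [2, 0, 2], [2, 2, 0]], matching the matrix computed earlier; Step 2 holds, since retrieve(W, m) returns each fundamental memory m unchanged; and a single-error probe such as [-1, 1, 1] is corrected to [1, 1, 1], as in Table 6.5.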