Single Layer Perceptron PDF
New Mansoura University
Summary
This document covers different aspects of neural networks, including single-layer perceptrons, decision boundaries for two-class prototypes, multilayer perceptrons, and forward and backward propagation. It demonstrates the calculations and provides worked examples.
Full Transcript
Single layer perceptron

A perceptron must learn to classify points in a 2D space into one of two categories from the following training data:

Input vector (x1, x2) | Desired output d
(1, 1)    |  1
(1, -1)   | -1
(-1, 1)   | -1
(-1, -1)  | -1

The perceptron has an initial weight vector w = [0.5, -0.5] and bias b = 0. The activation function is the sign function, φ(v) = sign(v).

Task:
1. Use the Perceptron Learning Rule to update the weights and bias for one iteration through the dataset. The rule is Δw = ηex and Δb = ηe, where:
   ▪ η = 0.1 (learning rate),
   ▪ e = d - y (error signal),
   ▪ y = φ(v) = sign(w · x + b).
2. Report the updated weights and bias after the iteration.

Solution:

For input (1, 1), desired output d = 1:
1- Compute v = w · x + b = (0.5)(1) + (-0.5)(1) + 0 = 0.
2- Apply the activation function: y = sign(0) = 1 (taking sign(0) = 1).
3- Compute the error: e = d - y = 1 - 1 = 0.
4- No update needed since e = 0: w = [0.5, -0.5], b = 0.0.

For input (1, -1), desired output d = -1:
1- Compute v = w · x + b = (0.5)(1) + (-0.5)(-1) + 0 = 1.0.
2- Apply the activation function: y = sign(1.0) = 1.
3- Compute the error: e = d - y = -1 - 1 = -2.
4- Update the weights and bias:
   w = w + ηex = [0.5, -0.5] + 0.1(-2)[1, -1] = [0.3, -0.3],
   b = b + ηe = 0.0 + 0.1(-2) = -0.2.

For input (-1, 1), desired output d = -1:
1- Compute v = w · x + b = (0.3)(-1) + (-0.3)(1) + (-0.2) = -0.8.
2- Apply the activation function: y = sign(-0.8) = -1.
3- Compute the error: e = d - y = -1 - (-1) = 0.
4- No update needed since e = 0: w = [0.3, -0.3], b = -0.2.

For input (-1, -1), desired output d = -1:
1- Compute v = w · x + b = (0.3)(-1) + (-0.3)(-1) + (-0.2) = -0.2.
2- Apply the activation function: y = sign(-0.2) = -1.
3- Compute the error: e = d - y = -1 - (-1) = 0.
4- No update needed since e = 0: w = [0.3, -0.3], b = -0.2.

Final result: weights w = [0.3, -0.3], bias b = -0.2.
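The epoch above can be checked with a minimal sketch in pure Python (an illustration, not part of the original solution; sign(0) is taken as +1, matching the convention used in step 2 of the first sample):

```python
# Sketch of one epoch of the perceptron learning rule on the worked example.
def sign(v):
    # sign(0) is treated as +1, as in the worked solution
    return 1 if v >= 0 else -1

def train_epoch(samples, w, b, eta=0.1):
    """One pass over the data, updating w and b on each misclassification."""
    for x, d in samples:
        v = w[0] * x[0] + w[1] * x[1] + b                     # v = w.x + b
        e = d - sign(v)                                       # e = d - y
        w = [w[0] + eta * e * x[0], w[1] + eta * e * x[1]]    # Δw = η e x
        b = b + eta * e                                       # Δb = η e
    return w, b

data = [((1, 1), 1), ((1, -1), -1), ((-1, 1), -1), ((-1, -1), -1)]
w, b = train_epoch(data, [0.5, -0.5], 0)
print(w, b)   # reproduces w = [0.3, -0.3], b = -0.2
```

Only the second sample triggers an update, exactly as in the hand calculation.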
Single layer perceptron: decision boundary

Two class prototypes are given as:
Class 1 prototype: x1 = [2, 5]
Class 2 prototype: x2 = [-1, -3]

The decision boundary between these classes is the hyperplane that equidistantly separates them.

Task:
1. Compute the equation of the decision boundary in the form a·x1 + b·x2 + c = 0.
2. Verify whether the point p = [0, 0] lies on the boundary.

Solution:
1- Calculate the difference vector: x1 - x2 = [2 - (-1), 5 - (-3)] = [3, 8]. This vector [3, 8] is perpendicular (normal) to the decision boundary.
2- Compute the midpoint of the line segment connecting the two prototypes:
   midpoint = (x1 + x2)/2 = [2 + (-1), 5 + (-3)]/2 = [0.5, 1].
   The decision boundary passes through this midpoint.
3- The decision boundary has the form vᵀx + c = 0, with v = x1 - x2. To calculate c:
   c = (1/2)(‖x2‖² - ‖x1‖²)
   ‖x1‖² = 2² + 5² = 4 + 25 = 29
   ‖x2‖² = (-1)² + (-3)² = 1 + 9 = 10
   Substituting: c = (1/2)(10 - 29) = -19/2 = -9.5.
   The linear equation is 3x1 + 8x2 - 9.5 = 0.
   (Check with the midpoint: 3(0.5) + 8(1) - 9.5 = 1.5 + 8 - 9.5 = 0, as required.)
4- Verify whether [0, 0] lies on the boundary: 3(0) + 8(0) - 9.5 = -9.5 ≠ 0, so [0, 0] is not on the decision boundary.

Multilayer perceptron

A neural network has an input layer with two neurons (x1, x2), one hidden layer with two neurons (h1, h2), and an output layer with a single neuron (y). The weights and biases are initialized as follows:

Weights from input to hidden layer: w11 = 0.5, w12 = -0.3, w21 = 0.8, w22 = 0.2
Biases for the hidden layer: bh1 = 0.1, bh2 = -0.1
Weights from hidden to output layer: wh1y = 0.7, wh2y = -0.6
Bias for the output layer: by = 0.05

The activation function is the sigmoid σ(z) = 1/(1 + e^(-z)). The learning rate is η = 0.1, and the input-output pair for training is: input x = [0.6, 0.9], target output t = 1.

Tasks:
1. Perform forward propagation to calculate the network output y.
2. Compute the error at the output layer.
3. Perform backpropagation to update the weights and biases.
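The forward-propagation task can be checked with a short pure-Python sketch (the weights, biases, and input are the ones given above; w11 and w21 are taken to feed h1, w12 and w22 to feed h2):

```python
# Sketch of the forward pass for the 2-2-1 network defined above.
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

x = [0.6, 0.9]
# input -> hidden
z_h1 = 0.5 * x[0] + 0.8 * x[1] + 0.1       # w11, w21, bh1  -> 1.12
z_h2 = -0.3 * x[0] + 0.2 * x[1] - 0.1      # w12, w22, bh2  -> -0.10
a_h1, a_h2 = sigmoid(z_h1), sigmoid(z_h2)

# hidden -> output
z_y = 0.7 * a_h1 - 0.6 * a_h2 + 0.05       # wh1y, wh2y, by
y = sigmoid(z_y)

print(round(a_h1, 3), round(a_h2, 3))      # 0.754 0.475
print(round(y, 4))                         # 0.5727
```

This reproduces the hidden activations and output quoted in the solution that follows.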
Solution:
1- Compute the hidden layer activations:
   zh1 = w11·x1 + w21·x2 + bh1 = (0.5)(0.6) + (0.8)(0.9) + 0.1 = 1.12, so ah1 = σ(1.12) ≈ 0.754
   zh2 = w12·x1 + w22·x2 + bh2 = (-0.3)(0.6) + (0.2)(0.9) - 0.1 = -0.1, so ah2 = σ(-0.1) ≈ 0.475
   giving ah = [0.754, 0.475].
2- Compute the output layer pre-activation: zy = (0.7)(0.754) + (-0.6)(0.475) + 0.05 = 0.2928.
3- Apply the sigmoid: y = σ(0.2928) ≈ 0.5727.
4- Compute the error: t - y = 1 - 0.5727 = 0.4273.

Self-organizing map

Consider a Self-Organizing Map (SOM) with two inputs x1 and x2 and three output nodes A, B, and C. The weight vectors for each output node are as follows:
Node A: wA = (2, -1)
Node B: wB = (-1, 2)
Node C: wC = (1, 1)

If the input vector is x = (3, -2), determine:
1. Which output node is the "winner" based on Euclidean distance.
2. The updated weight vector of the winning node, using a learning rate α = 0.3.

Solution:
1. Calculate the Euclidean distance between x and the weight vector of each node, dA = √((x1 - wA1)² + (x2 - wA2)²), and likewise for B and C:
   dA = √((3 - 2)² + ((-2) - (-1))²) = √2 ≈ 1.41
   dB = √((3 - (-1))² + ((-2) - 2)²) = √32 ≈ 5.66
   dC = √((3 - 1)² + ((-2) - 1)²) = √13 ≈ 3.61
   Node A is the winner.
2. The weight update is w_new = w_old + α(x - w_old), where w_old = (2, -1), α = 0.3, and x = (3, -2):
   w_new = (2, -1) + 0.3 · [(3, -2) - (2, -1)]
   w_new = (2, -1) + 0.3 · (1, -1)
   w_new = (2, -1) + (0.3, -0.3) = (2.3, -1.3)

Copyright Notice: the following slides are distributed under the Creative Commons License. DeepLearning.AI makes these slides available for educational purposes; you may not use or distribute them for commercial purposes, and you may make copies and use or distribute them for educational purposes as long as you cite DeepLearning.AI as the source. For the details of the license, see https://creativecommons.org/licenses/by-sa/2.0/legalcode

Recurrent Neural Networks: Why sequence models? (slides: deeplearning.ai, Andrew Ng)

Examples of sequence data:
- Speech recognition: audio → "The quick brown fox jumped over the lazy dog."
- Music generation: ∅ → music
- Sentiment classification: "There is nothing to like in this movie." → rating
- DNA sequence analysis: AGCCCCTGTGAGGAACTAG → labeled subsequence
- Machine translation: "Voulez-vous chanter avec moi?" → "Do you want to sing with me?"
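Before turning to sequence models, the SOM worked example above can be checked with a short sketch (pure Python; `math.dist` computes the Euclidean distance):

```python
# Sketch of SOM winner selection and weight update from the worked example.
import math

weights = {"A": (2, -1), "B": (-1, 2), "C": (1, 1)}
x = (3, -2)
alpha = 0.3

# 1. winner = node whose weight vector is closest to x (Euclidean distance)
dist = {k: math.dist(w, x) for k, w in weights.items()}
winner = min(dist, key=dist.get)
print(winner)                     # A  (d_A = sqrt(2) ≈ 1.41)

# 2. move the winner's weights toward the input: w_new = w_old + α (x − w_old)
w_old = weights[winner]
w_new = tuple(wi + alpha * (xi - wi) for wi, xi in zip(w_old, x))
print(w_new)                      # (2.3, -1.3)
```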
- Video activity recognition: frames → "Running"
- Named entity recognition: "Yesterday, Harry Potter met Hermione Granger." → names marked

Recurrent Neural Networks: Notation

Motivating example. x: "Harry Potter and Hermione Granger invented a new spell." The input positions are indexed x<1>, x<2>, x<3>, …, x<9>.

Representing words: each word is represented by its index in a vocabulary, e.g. a = 1, and = 367, invented = 4700, new = 5976, spell = 8376, Harry = 4075, Potter = 6830, Hermione = 4200, Gran… = 4000 (each index then becomes a one-hot vector).

Recurrent Neural Network Model

Why not a standard network? Problems:
- Inputs and outputs can be different lengths in different examples.
- A standard network doesn't share features learned across different positions of the text.

Context matters for a single word's label, e.g.:
He said, "Teddy Roosevelt was a great President."
He said, "Teddy bears are on sale!"

Forward propagation: the RNN reads x<1>, …, x<Tx> left to right, producing activations a<1>, …, a<Tx> (with a<0> = 0) and outputs ŷ<1>, …, ŷ<Ty>.

Simplified RNN notation:
a<t> = g(Waa a<t-1> + Wax x<t> + ba)
ŷ<t> = g(Wya a<t> + by)

Backpropagation through time

The loss at a single time step is ℒ<t>(ŷ<t>, y<t>), and the overall loss is ℒ = Σt ℒ<t>(ŷ<t>, y<t>). Backpropagation runs through the unrolled computation graph in the reverse direction — hence "backpropagation through time."

Different types of RNNs
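The simplified RNN notation above can be sketched in pure Python; the tiny matrices below are made-up examples for illustration, not values from the slides:

```python
# Minimal sketch of one RNN time step:
#   a<t> = tanh(Waa @ a<t-1> + Wax @ x<t> + ba),  yhat<t> = softmax(Wya @ a<t> + by)
import math

def matvec(M, v):
    return [sum(m * vi for m, vi in zip(row, v)) for row in M]

def softmax(z):
    m = max(z)
    e = [math.exp(zi - m) for zi in z]
    s = sum(e)
    return [ei / s for ei in e]

def rnn_step(a_prev, x, Waa, Wax, ba, Wya, by):
    v = [wa + wx + b for wa, wx, b in zip(matvec(Waa, a_prev), matvec(Wax, x), ba)]
    a = [math.tanh(vi) for vi in v]                               # hidden state a<t>
    y = softmax([o + b for o, b in zip(matvec(Wya, a), by)])      # output yhat<t>
    return a, y

# hypothetical 2-unit hidden state, 2-dim input, 2-class output
Waa = [[0.1, 0.0], [0.0, 0.1]]
Wax = [[0.5, -0.2], [0.3, 0.4]]
Wya = [[1.0, -1.0], [-1.0, 1.0]]
a, y = rnn_step([0.0, 0.0], [1.0, 0.0], Waa, Wax, [0.0, 0.0], Wya, [0.0, 0.0])
print(len(a), abs(sum(y) - 1.0) < 1e-9)   # 2 True
```

The same Waa, Wax, Wya are reused at every time step, which is how the RNN shares features across positions.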
Summary of RNN types (examples of RNN architectures):
- One to one: a standard network, single input x, single output ŷ.
- One to many: one input, a sequence of outputs ŷ<1>, …, ŷ<Ty> (e.g. music generation).
- Many to one: a sequence of inputs x<1>, …, x<Tx>, one output (e.g. sentiment classification).
- Many to many (Tx = Ty): an output at every time step (e.g. named entity recognition).
- Many to many (Tx ≠ Ty): an encoder reads the whole input, then a decoder produces the output (e.g. machine translation).

Recurrent Neural Networks: Language model and sequence generation

What is language modelling? In speech recognition, a language model assigns probabilities to candidate sentences — e.g. P(The apple and pair salad) vs. P(The apple and pear salad) — so the recognizer can pick the likelier one.

Language modelling with an RNN. Training set: a large corpus of English text, e.g.:
"Cats average 15 hours of sleep a day."
"The Egyptian Mau is a breed of cat."

RNN model: the loss at a single time step is
ℒ(ŷ<t>, y<t>) = − Σ_i y_i<t> log ŷ_i<t>
and the overall loss is ℒ = Σ_t ℒ<t>(ŷ<t>, y<t>).

Recurrent Neural Networks: Sampling novel sequences

Sampling a sequence from a trained RNN: start from a<0> = 0 and x<1> = 0; sample ŷ<1> from the output distribution, feed the sampled word back in as x<2>, and repeat until an end-of-sentence token (or a length limit) is reached.

Character-level language model: instead of a word vocabulary such as [a, aaron, …, zulu, <UNK>], the vocabulary is the set of individual characters; sampling works the same way, one character at a time.

Sequence generation examples (samples from trained models):
News: "President enrique peña nieto, announced sench's sulk former coming football langston paring. 'I was not at all surprised,' said hich langston. 'Concussion epidemic', to be examined. The gray football the told some and this has on the uefa icon, should money as."
Shakespeare: "The mortal moon hath her eclipse in love. And subject of this thou art another this fold. When besser be my love to me see sabl's. For whose are ruse of mine eyes heaves."
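The sampling loop above can be sketched as follows; the `fake_model` stand-in and its fixed distribution are assumptions for illustration (a trained RNN producing ŷ<t> would go in its place):

```python
# Sketch of sampling a novel sequence: at each step, draw the next token from
# the output distribution and feed it back in as the next input.
import random

random.seed(0)
vocab = ["a", "b", "<EOS>"]

def fake_model(prev_token):
    # stand-in for yhat<t>; a trained RNN would compute this from a<t>
    return [0.45, 0.45, 0.10]

def sample_sequence(max_len=20):
    seq, token = [], None
    for _ in range(max_len):
        probs = fake_model(token)
        token = random.choices(vocab, weights=probs)[0]  # draw from yhat<t>
        if token == "<EOS>":                              # stop at end-of-sequence
            break
        seq.append(token)
    return "".join(seq)

print(sample_sequence())
```

Because tokens are drawn at random rather than taken greedily, repeated runs yield different ("novel") sequences, like the news and Shakespeare samples above.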
Recurrent Neural Networks: Vanishing gradients with RNNs

In a network unrolled over many time steps, gradients can vanish: errors at late time steps have difficulty influencing computations at much earlier steps, so a basic RNN struggles with long-range dependencies. Gradients can also explode.

Recurrent Neural Networks: Gated Recurrent Unit (GRU)

A standard RNN unit computes a<t> = g(Wa[a<t-1>, x<t>] + ba).

GRU (simplified): a memory cell c<t> (with c<t> = a<t> in a GRU) is carried forward and only overwritten when an update gate opens, so a sentence such as "The cat, which already ate …, was full." can keep the singular subject in memory.
[Cho et al., 2014. On the properties of neural machine translation: Encoder-decoder approaches]
[Chung et al., 2014. Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling]

Full GRU:
c̃<t> = tanh(Wc[Γr ∗ c<t-1>, x<t>] + bc)
Γr = σ(Wr[c<t-1>, x<t>] + br)
Γu = σ(Wu[c<t-1>, x<t>] + bu)
c<t> = Γu ∗ c̃<t> + (1 − Γu) ∗ c<t-1>
a<t> = c<t>

Recurrent Neural Networks: LSTM (long short term memory) unit

GRU and LSTM compared:

GRU:
c̃<t> = tanh(Wc[Γr ∗ c<t-1>, x<t>] + bc)
Γu = σ(Wu[c<t-1>, x<t>] + bu)
Γr = σ(Wr[c<t-1>, x<t>] + br)
c<t> = Γu ∗ c̃<t> + (1 − Γu) ∗ c<t-1>
a<t> = c<t>

LSTM:
c̃<t> = tanh(Wc[a<t-1>, x<t>] + bc)
Γu = σ(Wu[a<t-1>, x<t>] + bu)    (update gate)
Γf = σ(Wf[a<t-1>, x<t>] + bf)    (forget gate)
Γo = σ(Wo[a<t-1>, x<t>] + bo)    (output gate)
c<t> = Γu ∗ c̃<t> + Γf ∗ c<t-1>
a<t> = Γo ∗ c<t>
[Hochreiter & Schmidhuber 1997. Long short-term memory]

LSTM in pictures: each unit takes a<t-1>, c<t-1>, and x<t>, computes the three gates and c̃<t>, and produces c<t> = Γu ∗ c̃<t> + Γf ∗ c<t-1>; in the diagram a tanh is applied to c<t> before the output gate, and a softmax on a<t> yields ŷ<t>.
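A minimal, element-wise sketch of the full GRU equations above, for a single hidden unit and scalar input (all weight values are made up for illustration; the concatenation W[c<t-1>, x<t>] becomes two scalar weights per gate):

```python
# Element-wise GRU step for one hidden unit and a scalar input.
import math

def sigma(z):
    return 1.0 / (1.0 + math.exp(-z))

def gru_step(c_prev, x, W):
    gamma_r = sigma(W["rc"] * c_prev + W["rx"] * x + W["rb"])      # relevance gate
    c_tilde = math.tanh(W["cc"] * (gamma_r * c_prev) + W["cx"] * x + W["cb"])
    gamma_u = sigma(W["uc"] * c_prev + W["ux"] * x + W["ub"])      # update gate
    return gamma_u * c_tilde + (1 - gamma_u) * c_prev              # c<t>

W = {"rc": 0.5, "rx": 0.5, "rb": 0.0,
     "cc": 1.0, "cx": 1.0, "cb": 0.0,
     "uc": 0.0, "ux": 0.0, "ub": -5.0}   # gamma_u ≈ 0: update gate nearly closed
c = gru_step(c_prev=0.9, x=1.0, W=W)
print(abs(c - 0.9) < 0.01)   # True: a closed update gate preserves c<t-1>
```

With the update gate nearly closed (Γu ≈ 0), c<t> ≈ c<t-1>: the cell carries its value across time steps, which is exactly the mechanism that mitigates vanishing gradients.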
(LSTM units are chained: c<t-1> flows through each unit, gated by the forget and update gates, so information can be carried across many time steps.)

Recurrent Neural Networks: Bidirectional RNN

Getting information from the future:
He said, "Teddy bears are on sale!"
He said, "Teddy Roosevelt was a great President!"
Whether "Teddy" is part of a person's name cannot be decided from the preceding words alone. A bidirectional RNN (BRNN) runs a forward recurrence over x<1>, …, x<Tx> and a backward recurrence over x<Tx>, …, x<1>; each prediction ŷ<t> uses both the forward and the backward activation at time t.

Recurrent Neural Networks: Deep RNNs

Deep RNN example: recurrent layers are stacked, with activations a[l]<t>; each a[l]<t> depends on a[l]<t-1> (same layer, previous time step) and a[l-1]<t> (previous layer, same time step).

Neural Networks AIE231 (slide deck: neural_network_AIE231)

References:
- Simon Haykin, Neural Networks and Learning Machines, Third Edition, Pearson. McMaster University, Hamilton, Ontario, Canada.
- Charu C. Aggarwal, Neural Networks and Deep Learning, Springer, 2018. IBM T. J. Watson Research Center, Yorktown Heights, NY.

Instructor: Dr. Sara Sweidan, Assistant Professor, Artificial Intelligence Department, Faculty of Computers and Artificial Intelligence, Benha University. Email: [email protected]

Assessment:
Activity               | Score
Final Exam             | 40
Midterm Exam           | 20
Quizzes (2)            | 10
Attending Lect. & Lab  | 5
Project                | 15
Assignments            | 10
Total                  | 100

Course objectives:
- Understand the basic concepts of neural networks and artificial neural computation.
- Learn the history and evolution of neural networks in the context of machine learning and artificial intelligence.
- Grasp the foundational elements of artificial neurons and activation functions.
- Understand how to build and train simple feedforward neural networks.
- Explore deep learning architectures, including convolutional neural networks (CNNs) for image data and recurrent neural networks (RNNs) for sequential data.
- Learn about advanced architectures such as recurrent neural networks (RNNs), long short-term memory networks (LSTMs), and transformer models.

Course outline:
- Introduction to neural networks
- Multilayer neural networks
- Exploding gradient problems
- Common neural architectures
- Neural architecture for binary classification
- Neural architecture for multiclass models
- Training a network to generalize

Introduction to neural networks

Agenda: history of the artificial neural network; ANN definition; benefits of ANNs; the biological neuron; activation functions; decision boundaries; artificial neuron learning; network architecture/topology; neural processing.

History of the Artificial Neural Networks

Artificial neural networks originated as algorithms that try to mimic the brain. They were widely used in the 1980s and early 1990s, fell out of favor in the late 1990s, and resurged from around 2005, driven by applications to speech, images, and text (NLP).

Definition: a neural network is a massively parallel distributed processor made up of simple processing units that has a natural propensity for storing experiential knowledge and making it available for use. It resembles the brain in two respects:
1. Knowledge is acquired by the network from its environment through a learning process.
2. Interneuron connection strengths, known as synaptic weights, are used to store the acquired knowledge.

Benefits of neural networks: nonlinearity, input-output mapping, adaptivity, fault tolerance, and use of contextual information.

A parallel distributed model includes a set of major aspects:
▪ A set of processing units (cells): the computational elements of the model.
▪ A state of activation for every unit, representing the current activity content of the unit.
▪ Connections between the units: generally, each connection is defined by a weight, which influences the flow of information between units.
▪ A propagation rule, which determines the effective input of a unit from its external inputs; it specifies how information is passed between units.
▪ An activation function, which determines the new level of activation based on the effective input and the current activation. Standard activation functions include sigmoid, tanh, and ReLU.
▪ An external input for each unit: each unit can receive external input, representing information from the environment or other sources affecting its activation.
▪ A method for gathering information (the learning rule): learning rules can be supervised or unsupervised, depending on the specific task and training approach.
▪ An environment within which the system must operate, providing input signals and, if necessary, error signals.

Why artificial neural networks? There are two basic reasons why we are interested in building ANNs:
- Technical viewpoint: some problems, such as character recognition or the prediction of future states of a system, require massively parallel and adaptive processing.
- Biological viewpoint: ANNs can be used to replicate and simulate components of the human (or animal) brain, thereby giving us insight into natural information processing.

The "building blocks" of neural networks are the neurons; in technical systems, we refer to them as units or nodes. Basically, each neuron:
- receives input from many other neurons;
- changes its internal state (activation) based on the current input;
- sends one output signal to many other neurons, possibly including its input neurons (recurrent network).

How do ANNs work?
An artificial neural network (ANN) is a computer program that strives to simulate the information processing capabilities of its biological exemplar. ANNs are typically composed of a great number of interconnected artificial neurons, which are simplified models of their biological counterparts. An ANN is a technique for solving problems by constructing software that works like our brains.

The biological neuron

The human brain is a massively parallel information processing system: a huge network of processing elements. A typical brain contains a network of about 10 billion neurons.

As a processing element:
- Dendrites: input
- Cell body: processor
- Synapse: link
- Axon: output

The structure of neurons: a neuron has a cell body, a branching input structure (the dendrite), and a branching output structure (the axon). Axons connect to dendrites via synapses. Electro-chemical signals are propagated from the dendritic input, through the cell body, and down the axon to other neurons.

A neuron only fires if its input signal exceeds a certain amount (the threshold) within a short time period. Synapses vary in strength: good connections allow a large signal, slight connections allow only a weak signal, and synapses can be either excitatory or inhibitory.

The brain can alter its neural pathways, which supports recovery from brain damage:
- Dead neurons are not replaced, but branches of the axons of healthy neurons can grow into the pathways and take over the functions of damaged neurons.
- Equipotentiality: more than one area of the brain may be able to control a given function.
- The younger the person, the better the recovery (e.g. recovery from left hemispherectomy).
The nervous system

How do ANNs work? An artificial neuron is an imitation of a human neuron.

Learning in biological vs. artificial networks: in living organisms, synaptic weights change in response to external stimuli — an unpleasant experience will change the synaptic weights of an organism, which will train the organism to behave differently. In artificial neural networks, the weights are learned using training data, which are input-output pairs (e.g., images and their labels). An error in predicting an image's label is the unpleasant "stimulus" that changes the network's weights; when trained over many images, the network learns to classify images correctly.

Model of a neuron. A neuron is an information processing unit consisting of:
◼ a set of synapses, or connecting links, each characterized by a weight or strength;
◼ an adder, summing the input signals weighted by the synapses — a linear combiner;
◼ an activation function, also called a squashing function, which squashes (limits) the output to some finite values.

Nonlinear model of a neuron (with bias bk): the inputs x1, …, xm are multiplied by the synaptic weights wk1, …, wkm, summed at the summing junction together with the bias, and passed through the activation function φ(·) to give the output yk:
vk = Σ_{j=1}^{m} wkj xj + bk
yk = φ(vk)

How do ANNs work? The signal is not passed down to the next neuron verbatim: the inputs x1, …, xm are weighted (w1, …, wm), summed (Σ_{j=1}^{m} wkj xj), and passed through the transfer (activation) function φ(vk) to produce the output y.

An equivalent formulation absorbs the bias as a weight wk0 = bk on a fixed input x0 = +1:
vk = Σ_{j=0}^{m} wkj xj
yk = φ(vk)

In summary, with uk denoting the linear combiner output:
uk = Σ_{j=1}^{m} wkj xj    (1)
vk = uk + bk               (2)
yk = φ(uk + bk)            (3)
yk = φ(vk)                 (4)

In this model, the inputs represent synapses; the weights represent the strengths of the synaptic links; the summation block represents the addition of the inputs; and the output represents the axon voltage.

The artificial neuron: activation functions

Types of activation function:
- Hard-limiting threshold function (McCulloch & Pitts model): O(in) = 1 if in ≥ t, 0 if in < t. Corresponds to the biological paradigm: the neuron either fires or not.
- Linear function: neurons using a linear activation function are called ADALINEs (Widrow, 1960).
- Sigmoid function ('S'-shaped curve, differentiable): φ(v) = 1/(1 + exp(−a v)), where a is the slope parameter. Sigmoidal functions more exactly describe the non-linear behavior of biological neurons. As v → ∞, φ(v) → 1; as v → −∞, φ(v) → 0.
- Hyperbolic tangent function, φ(v) = tanh(v), and the signum (sign) function: both range over −1 to +1.

Stochastic model of a neuron. So far we have introduced only deterministic models of ANNs; a stochastic (probabilistic) model can also be defined. If x denotes the state of the neuron, then P(v) denotes the probability of firing, where v is the induced activation potential (bias + linear combination):
P(v) = 1 / (1 + e^(−v/T))

where T is a pseudo-temperature used to control the noise level (and therefore the uncertainty in firing). As T → 0 the stochastic model reduces to the deterministic model:
x = +1 if v ≥ 0; x = −1 if v < 0.

Decision boundaries

In simple cases, the feature space can be divided by drawing a hyperplane across it, known as a decision boundary. A discriminant function returns different values on opposite sides of the boundary (a straight line in 2D). Problems which can be thus classified are linearly separable.

Decision surface of a perceptron: a perceptron is able to represent some useful linearly separable functions — e.g. AND(x1, x2) with weights w0 = −1.5, w1 = 1, w2 = 1 — but functions that are not linearly separable (e.g. XOR) are not representable.

Decision boundary and bias: the bias produces an affine transformation of the induced local field; note that vk = bk at uk = 0.

Linear separability: the decision boundary is the line
x2 = −(w1/w2) x1 + t/w2
separating the region containing the class A points from the region containing the class B points.

Example: rugby players vs. ballet dancers, plotted by height (m) against weight (kg) — the two groups form separable clusters in the feature space.

Training the neuron: a unit with fixed input x0 = +1 and weights W0, W1, W2 to be learned, using the signum activation f(v) = +1 if v > 0, 0 if v = 0, −1 if v < 0.
The decision boundary is x0 w0 + x1 w1 + x2 w2 = 0, with x0 = 1. It is clear that (x1, x2) ∈ A iff x1 w1 + x2 w2 > t, and (x1, x2) ∈ B iff x1 w1 + x2 w2 < t. Finding the weights wi is called learning.

The artificial neuron: learning

Supervised learning: the desired response of the system is provided by a teacher, e.g. the distance ρ[d, o] as an error measure. Estimate the negative error gradient direction and reduce the error accordingly; modify the synaptic weights to reduce the error — a stochastic minimization of the error in the multidimensional weight space.

Unsupervised learning (learning without a teacher): the desired response is unknown, so no explicit error information can be used to improve network behavior, e.g. finding the cluster boundaries of input patterns. Suitable weight self-adaptation mechanisms have to be embedded in the trained network.

Training: a linear threshold unit is used:
Output = 1 if Σ_i wi xi > t, otherwise 0
where W is a weight value and t is the threshold value.

Simple network — AND with a biased input: inputs −1 (bias input), X, and Y, with weights W1 = 1.5, W2 = 1, W3 = 1 and threshold t = 0.0.

Learning algorithm:
While epoch produces an error
    Present network with next inputs from epoch
    Error = T − O
    If Error ≠ 0 then
        Wj = Wj + LR * Ij * Error
    End If
End While

Definitions:
- Epoch: presentation of the entire training set to the neural network. In the case of the AND function, an epoch consists of four sets of inputs being presented to the network (i.e. [0,0], [0,1], [1,0], [1,1]).
- Error: the amount by which the value output by the network differs from the target value. For example, if we required the network to output 0 and it output a 1, then Error = −1.
- Target value, T: when training a network we present it not only with the input but also with the value we require the network to produce.
For example, if we present the network with [1,1] for the AND function, the target value will be 1.
- Output, O: the output value from the neuron.
- Ij: the inputs being presented to the neuron.
- Wj: the weight from input neuron Ij to the output neuron.
- LR: the learning rate, which dictates how quickly the network converges. It is set by a matter of experimentation, typically 0.1.

Training the neuron for AND. Truth table: (0,0) → 0, (0,1) → 0, (1,0) → 0, (1,1) → 1. What are the weight values? Initialize with random weight values, e.g. W1 = 0.3 (on the −1 bias input), W2 = 0.5, W3 = −0.4, with t = 0.0:

I1  I2  I3  Summation                                Output
-1  0   0   (-1*0.3) + (0*0.5) + (0*-0.4) = -0.3     0
-1  0   1   (-1*0.3) + (0*0.5) + (1*-0.4) = -0.7     0
-1  1   0   (-1*0.3) + (1*0.5) + (0*-0.4) =  0.2     1
-1  1   1   (-1*0.3) + (1*0.5) + (1*-0.4) = -0.2     0

These outputs do not yet match the AND truth table (the rows for (1,0) and (1,1) are wrong), so the weights must be updated by the learning algorithm.

How it works:
Set the initial values of the weights randomly. Input: the truth table (e.g. of XOR). Then:
Do
- read an input (e.g. 0 and 0),
- compute an output (e.g. 0.60543),
- compare it to the expected output (diff = 0.60543),
- modify the weights accordingly,
Loop until a condition is met (a certain number of iterations, or an error threshold).

Design issues:
- initial weights (small random values ∈ [−1, 1]),
- transfer function (how are the inputs and weights combined to produce the output?),
- error estimation,
- weight adjustment,
- number of neurons,
- data representation,
- size of the training set.

Learning in neural networks:
◼ Learn the values of the weights from I/O pairs.
◼ Start with random weights.
◼ Load a training example's input.
◼ Observe the computed output.
◼ Modify the weights to reduce the difference.
◼ Iterate over all training examples.
◼ Terminate when the weights stop changing OR when the error is very small.

Network architecture / topology

◼ Single-layer feedforward networks: an input layer and an output layer — a single (computation) layer; feedforward, acyclic.
◼ Multilayer feedforward networks: hidden layers of hidden neurons (hidden units), which enable the network to extract higher-order statistics; e.g. a 10-4-2 network or a 100-30-10-3 network; a fully connected layered network.
◼ Recurrent networks: at least one feedback loop, with or without hidden neurons.

(Architectures: single layer; multiple layers, fully connected; a recurrent network without hidden units, using a unit-delay operator; a recurrent network with hidden units.)

Feedforward networks (static): an input layer, hidden layers, and an output layer. One input and one output layer; one or more hidden layers, each built from artificial neurons. Each element of the preceding layer is connected with each element of the next layer, and there is no interconnection between artificial neurons of the same layer. Finding the weights is a task which has to be done depending on which problem is to be solved by the specific network.

Feedback networks (recurrent, or dynamic, systems): an input layer, hidden layers, and an output layer, where the interconnections go in two directions between units, or with feedback.
The Boltzmann machine is an example of a recurrent net and is a generalization of the Hopfield net. Other examples of recurrent nets: Adaptive Resonance Theory (ART) nets.

Neural network as a directed graph: the neuron (x0 = +1 with wk0 = bk, inputs x1, …, xm with weights wk1, …, wkm, summed into vk, activation φ(·), output yk) can be drawn as a signal-flow graph:
◼ the block diagram can be simplified by the idea of a signal-flow graph;
◼ a node is associated with a signal;
◼ a directed link is associated with a transfer function:
– synaptic links are governed by a linear input-output relation: the signal xj is multiplied by the synaptic weight wkj;
– activation links are governed by a nonlinear input-output relation: the nonlinear activation function.

Feedback
◼ The output determines in part its own input via feedback. With a weight w and a unit-delay operator z⁻¹ in the loop,
yk(n) = Σ_{l=0}^{∞} w^(l+1) xj(n − l).
◼ Depending on w, the system is stable, linearly divergent, or exponentially divergent; we are interested in the stable case |w| < 1.

Bayesian decision theory

Using prior probabilities alone, the decision rule is: decide ω1 if P(ω1) > P(ω2); ω2 otherwise.

◼ In general, we will have some features and more information.
◼ Feature: a lightness measurement x. Different fish yield different lightness readings (x is a random variable).

◼ Define p(x|ω1), the class-conditional probability density: the probability density function for x given that the state of nature is ω1.
◼ The difference between p(x|ω1) and p(x|ω2) describes the difference in lightness between sea bass and salmon.

Class-conditional probability densities p(x|ω): hypothetical class-conditional density functions, normalized so that the area under each curve is 1.0.

◼ Suppose that we know the prior probabilities P(ω1) and P(ω2) and the conditional densities p(x|ω1) and p(x|ω2), and we measure the lightness of a fish, x.
◼ What is the category of the fish? We want the posterior
◼ What is the category of the fish p( j | x) Bayes Formula ◼ Given – Prior probabilities P(j) – Conditional probabilities p(x| j) ◼ Measurement of particular item – Feature value x p(x | j )P( j ) ◼ Bayes formula: P( j x) = p(x) Likelihood Prior Posterior = Evidence (from p( j , x) = p(x | j )P( j ) = P( j | x) p(x)) P(i | x) = 1 where p(x) = p(x | i )P( i ) i so i Bayes' formula... p(x|j ) is called the likelihood of j with respect to x. (the j category for which p(x|j ) is large is more "likely" to be the true category) p(x) is the evidence how frequently we will measure a pattern with feature value x. Scale factor that guarantees that the posterior probabilities sum to 1. Posterior Probability Posterior probabilities for the particular priors P(1)=2/3 and P(2)=1/3. At every x the posteriors sum to 1. Error If we decide 2 P(1 | x) P(error | x) = If we decide 1 P(2 | x) For a given x, we can minimize the probability of error by deciding 1 if P(1|x) > P(2|x) and 2 otherwise. Bayes' Decision Rule (Minimizes the probability of error) 1 1 : if P(1|x) > P(2|x) i.e. P(1 x) P(2 x) 2 : otherwise 2 or 1 : if P ( x |1) P(1) > P(x|2) P(2) 2 : otherwise 1 1 p ( x | 1 ) P (1 ) p ( x | 1 ) P ( 1 ) p ( x | 2 ) P ( 2 ) p ( x | 2 ) P ( 2 ) 2 2 Likelihood ratio Threshold and P(Error|x) = min [P(1|x) , P(2|x)] Decision Boundaries ◼ Classification as division of feature space into non-overlapping regions X 1 , … , X R such that x X k x assigned to k ◼ Boundaries between these regions are known as decision surfaces or decision boundaries Discriminant functions ◼ Discriminant functions determine classification by comparison of their values: Classify x Xk if j k g k (x) g j (x) ◼ Optimum classification: based on posterior probability P(k x) ◼ Any monotone function g may be applied without changing the decision boundaries g k (x) = g(P(k x)) e.g. g k (x) = ln(P(k x)) The Two-Category Case ◼ Use 2 discriminant functions g1 and g2, and assigning x to 1 if g1>g2. 
◼ Alternative: define a single discriminant function g(x) = g1(x) − g2(x); decide ω1 if g(x) > 0, otherwise decide ω2.
◼ Two-category case:
g(x) = P(ω1|x) − P(ω2|x)
g(x) = ln [p(x|ω1) / p(x|ω2)] + ln [P(ω1) / P(ω2)]

Summary
◼ Bayes approach:
– Estimate the class-conditional probability density
– Combine with the prior class probability
– Determine the posterior class probability
– Derive the decision boundaries
◼ Alternate approach implemented by NN:
– Estimate the posterior probability directly
– i.e. determine the decision boundaries directly
neural_network_AIE231 63

Sara Sweidan, PhD, Artificial Intelligence, Assistant Professor, Faculty of Computers & Artificial Intelligence, Benha University, Egypt. E-mail: [email protected] Thank you

Neural Networks AIE231: Single layer perceptron

Outline
Discriminant function and linear machine
Training and classification using the discrete perceptron
Single-layer continuous perceptron

DISCRIMINANT FUNCTIONS

Discriminant Functions
▪ Membership in a category is determined by the classifier based on the comparison of R discriminant functions g1(x), g2(x), …, gR(x)
– x is within region Xk if gk(x) has the largest value
Do not mix between n = dimension of each input vector (dimension of feature space), P = number of input vectors, and R = number of classes.
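The two-category discriminant g(x) = g1(x) − g2(x) above can be evaluated numerically. This is a minimal sketch: the Gaussian class-conditional densities and their parameters are illustrative assumptions, not values from the slides; only the priors 2/3 and 1/3 come from the earlier example.

```python
import math

# g(x) = ln[p(x|w1)/p(x|w2)] + ln[P(w1)/P(w2)]; decide w1 iff g(x) > 0.

def gaussian_pdf(x, mean, std):
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2 * math.pi))

def g(x, prior1=2 / 3, prior2=1 / 3):
    # Assumed class-conditional densities: N(4, 1) for w1 and N(8, 1) for w2.
    ratio = gaussian_pdf(x, 4.0, 1.0) / gaussian_pdf(x, 8.0, 1.0)
    return math.log(ratio) + math.log(prior1 / prior2)

decision = 1 if g(5.0) > 0 else 2   # x = 5 lies nearer the class-1 mean
```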
Discriminant Functions…

Linear Machine and Minimum Distance Classification
Find the linear-form discriminant function for two-class classification when the class prototypes are known.
Example 3.1: Select the decision hyperplane that contains the midpoint of the line segment connecting the center points of the two classes.

Linear Machine and Minimum Distance Classification… (dichotomizer)
[Figure: normal vector and midpoint of the segment joining the two class centers x1 and x2]
The dichotomizer's discriminant function is g(x) = (x1 − x2)ᵀx + ½(‖x2‖² − ‖x1‖²).

Linear Machine and Minimum Distance Classification… (multiclass classification)
The linear-form discriminant functions for multiclass classification:
– There are up to R(R − 1)/2 decision hyperplanes for R pairwise separable classes (i.e. classes next to or touching one another)

Linear Machine and Minimum Distance Classification… (multiclass classification)
Linear machine or minimum-distance classifier:
– Assume the class prototypes xi are known for all classes. The classifier assigns x to the class whose center has the smallest Euclidean distance ‖x − xi‖ from the input pattern x, which yields the linear discriminants gi(x) = xiᵀx − ½‖xi‖².

Linear Machine and Minimum Distance Classification…
P1, P2, P3 are the centres of gravity of the prototype points; we need to design a minimum-distance classifier.
Using the formulas from the previous slide we obtain the weights wi, and the decision boundaries follow from equating discriminants:

S12 ⇒ g₁(x) = g₂(x):
10x₁ + 2x₂ − 52 = 2x₁ − 5x₂ − 14.5
(10x₁ − 2x₁) + (2x₂ + 5x₂) − (52 − 14.5) = 0
8x₁ + 7x₂ − 37.5 = 0

S13 ⇒ g₁(x) = g₃(x):
10x₁ + 2x₂ − 52 = −5x₁ + 5x₂ − 25
(10x₁ + 5x₁) + (2x₂ − 5x₂) − (52 − 25) = 0
15x₁ − 3x₂ − 27 = 0, or equivalently −15x₁ + 3x₂ + 27 = 0

Linear Machine and Minimum Distance Classification…
If R linear discriminant functions exist for a set of patterns such that gi(x) > gj(x) for x ∈ class i (i = 1, 2, …, R; j = 1, 2, …, R; i ≠ j), the classes are linearly separable.

Linear Machine and Minimum Distance Classification… Example
1. First, identify what we have: x₁ = [2, 5] and x₂ = [−1, −3].
2. Calculate the components of (x₁ − x₂)ᵀx: x₁ − x₂ = [2 − (−1), 5 − (−3)] = [3, 8], so (x₁ − x₂)ᵀx = 3x₁ + 8x₂.
3. Calculate ‖x₁‖² and ‖x₂‖²: ‖x₁‖² = 2² + 5² = 29; ‖x₂‖² = (−1)² + (−3)² = 10.
4. Then ½(‖x₂‖² − ‖x₁‖²) = ½(10 − 29) = −19/2.
5. Putting it all together, the decision boundary (x₁ − x₂)ᵀx + ½(‖x₂‖² − ‖x₁‖²) = 0 becomes 3x₁ + 8x₂ − 19/2 = 0.

The Discrete Perceptron

Discrete Perceptron Training Algorithm
So far, we have shown that the coefficients of linear discriminant functions, called weights, can be determined based on a priori information about sets of patterns and their class membership. In what follows, we will begin to examine neural network classifiers that derive their weights during the learning cycle. The sample pattern vectors x1, x2, …, xP, called the training sequence, are presented to the machine along with the correct response.
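Before moving to learned weights, the minimum-distance machine above can be sketched in code. From gi(x) = piᵀx − ½‖pi‖², the discriminants 10x₁ + 2x₂ − 52, 2x₁ − 5x₂ − 14.5 and −5x₁ + 5x₂ − 25 in the worked example correspond to prototype centres p1 = (10, 2), p2 = (2, −5), p3 = (−5, 5); the function names are my own:

```python
# Minimum-distance (linear machine) classifier for three known class centres.

def g(p, x):
    """Linear discriminant g_i(x) = p_i^T x - 0.5 * ||p_i||^2."""
    return p[0] * x[0] + p[1] * x[1] - 0.5 * (p[0] ** 2 + p[1] ** 2)

def classify(x, prototypes):
    """Assign x to the class whose discriminant is largest (classes 1..R)."""
    scores = [g(p, x) for p in prototypes]
    return scores.index(max(scores)) + 1

prototypes = [(10, 2), (2, -5), (-5, 5)]

# A point on the boundary S12 (8x1 + 7x2 - 37.5 = 0) gives g1 = g2.
x_on_s12 = (37.5 / 8, 0.0)
diff = g(prototypes[0], x_on_s12) - g(prototypes[1], x_on_s12)   # ~0
winner = classify((10, 2), prototypes)   # the centre p1 itself falls in class 1
```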
Discrete Perceptron Training Algorithm - Geometrical Representations

The Continuous Perceptron

Continuous Perceptron Training Algorithm
Replace the TLU (Threshold Logic Unit) with the sigmoid activation function for two reasons:
– Gain finer control over the training procedure
– Facilitate the differential characteristics that enable computation of the error gradient (of the current error function)
The factor ½ does not affect the location of the error minimum.

Continuous Perceptron Training Algorithm…
The new weights are obtained by moving in the direction of the negative gradient along the multidimensional error surface. By definition of the steepest-descent concept, each elementary move should be perpendicular to the current error contour.

Neural Networks AIE231: Multi-layer perceptron

Outline
What is a Multi-layer perceptron (MLP)
What is backpropagation training

What is a perceptron and what is a Multi-Layer Perceptron (MLP)?

What is a perceptron?
[Figure: perceptron k with bias bk, input signals x1, …, xm, synaptic weights wk1, …, wkm, summing junction vk = Σ_{j=1}^{m} wkj xj + bk, and activation function output yk = φ(vk)]
Discrete perceptron: φ(·) = sign(·)
Continuous perceptron: φ(·) = S-shape

Activation Function of a perceptron
[Figure: signum function (discrete perceptron, φ(v) = sign(v)) versus S-shaped function (continuous perceptron)]

MLP Architecture
The Multi-Layer Perceptron was first introduced by M. Minsky and S.
Papert in 1969.
Type: Feedforward
Neuron layers: 1 input layer, 1 or more hidden layers, 1 output layer
Learning Method: Supervised

Terminology/Conventions
◼ Arrows indicate the direction of data flow.
◼ The first layer, termed the input layer, just contains the input vector and does not perform any computations.
◼ The second layer, termed the hidden layer, receives input from the input layer and sends its output to the output layer.
◼ After applying their activation function, the neurons in the output layer contain the output vector.

Why the MLP?
◼ The single-layer perceptron classifiers discussed previously can only deal with linearly separable sets of patterns.
◼ The multilayer networks to be introduced here are the most widespread neural network architecture
– They were not made useful until the 1980s, because of the lack of efficient training algorithms
– This changed with the introduction of the backpropagation training algorithm (McClelland and Rumelhart 1986)

What is backpropagation training and how does it work?

What is Backpropagation?
◼ Supervised error back-propagation training
– The mechanism of backward error transmission (the delta learning rule) is used to modify the synaptic weights of the internal (hidden) and output layers: the mapping error can be propagated into hidden layers
– Can implement arbitrarily complex input/output mappings or decision surfaces to separate pattern classes, for which the explicit derivation of mappings and discovery of relationships is almost impossible
– Produces surprising results and generalizations

Architecture: Backpropagation Network
Type: Feedforward
Neuron layers: 1 input layer, 1 or more hidden layers, 1 output layer
Learning Method: Supervised

Backpropagation Preparation
◼ Training Set: a collection of input-output patterns that are used to train the network
◼ Testing Set: a collection of input-output patterns that are used to assess network performance
◼ Learning Rate: a scalar parameter, analogous to step size in numerical integration, used to set the rate of adjustments

Backpropagation training cycle

BP NN With Single Hidden Layer
[Figure: output layer with weights wj,k, hidden layer with weights vi,j, input layer]

Notation
◼ x = input training vector
◼ t = output target vector
◼ δk = the portion of the error-correction weight adjustment for wjk that is due to an error at output unit Yk; also the information about the error at unit Yk that is propagated back to the hidden units that feed into unit Yk
◼ δj = the portion of the error-correction weight adjustment for vij that is due to the backpropagation of error information from the output layer to hidden unit Zj
◼ α = learning rate
◼ v0j = bias on hidden unit j
◼ w0k = bias on output unit k

Activation Functions
An activation function should be continuous, differentiable, and non-decreasing. Plus, its derivative should be easy to compute.
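The logistic sigmoid is a standard choice that meets all of these requirements: it is continuous, differentiable, non-decreasing, and its derivative is cheap because f′(v) = f(v)(1 − f(v)). A small sketch; the delta formula is the standard squared-error form for an output unit in the δk notation above, not an equation copied from these slides:

```python
import math

def sigmoid(v):
    """Logistic sigmoid f(v) = 1 / (1 + e^-v)."""
    return 1.0 / (1.0 + math.exp(-v))

def sigmoid_prime(v):
    """Derivative via f'(v) = f(v) * (1 - f(v)): no extra exp needed."""
    y = sigmoid(v)
    return y * (1.0 - y)

def output_delta(target, v_k):
    """delta_k = (t_k - y_k) * f'(v_k) for an output unit with net input v_k."""
    return (target - sigmoid(v_k)) * sigmoid_prime(v_k)
```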
Backpropagation training Algorithm
[Algorithm steps given on slides]

How MLP neural network works
⦿ The inputs to the network correspond to the attributes measured for each training tuple.
⦿ Inputs are fed simultaneously into the units making up the input layer.
⦿ They are then weighted and fed simultaneously to a hidden layer.
⦿ The number of hidden layers is arbitrary, although usually only one is used.
⦿ The weighted outputs of the last hidden layer are input to the units making up the output layer, which emits the network's prediction.
⦿ The network is feed-forward in that none of the weights cycles back to an input unit or to an output unit of a previous layer.

How MLP neural network works
⦿ Iteratively process a set of training tuples and compare the network's prediction with the actual known target value.
⦿ For each training tuple, the weights are modified to minimize the mean squared error between the network's prediction and the actual target value.
⦿ From a statistical point of view, networks perform nonlinear regression.
⦿ Modifications are made in the "backwards" direction: from the output layer, through each hidden layer, down to the first hidden layer, hence "backpropagation".

How MLP neural network works: Steps
Initialize weights (to small random numbers) and biases in the network
Propagate the inputs forward (by applying the activation function)
Backpropagate the error (by updating weights and biases)
Terminating condition (when the error is very small, etc.)

Generalisation
◼ Once trained, weights are held constant, and input patterns are applied in feedforward mode (commonly called "recall mode").
◼ We wish the network to "generalize", i.e.
to make sensible choices about input vectors which are not in the training set.
◼ Commonly we check the generalization of a network by dividing the known patterns into a training set, used to adjust the weights, and a test set, used to evaluate the performance of the trained network.

Generalisation…
◼ Generalisation can be improved by
– Using a smaller number of hidden units (the network must learn the rule, not just the examples)
– Not overtraining (occasionally check that the error on the test set is not increasing)
– Ensuring the training set includes a good mixture of examples
◼ There is no good rule for deciding upon a good network size (number of layers, number of units per layer).

Training algorithm
⦿ The training algorithm of back propagation involves four stages:
⚫ Initialization of weights: some small random values are assigned.
⚫ Feed forward: each input unit (X) receives an input signal and transmits this signal to each of the hidden units Z1, Z2, …, Zn. Each hidden unit then calculates the activation function and sends its signal Zi to each output unit. The output unit calculates the activation function to form the response to the given input pattern.
⚫ Back propagation of errors: each output unit compares its activation Yk with its target value Tk to determine the associated error for that unit. Based on the error, the factor δO (O = 1, …, m) is computed and is used to distribute the error at output unit Yk back to all units in the previous layer. Similarly, the factor δH (H = 1, …, p) is computed for each hidden unit Hj.
⦿ Updating of the weights and biases

[Figure: a single neuron with bias, inputs x0, …, xn with weights w0, …, wn, a weighted sum, and an activation function, computing y = sign(Σ_{i=0}^{n} wi xi − θk)]
⦿ An n-dimensional input vector x is mapped into variable y by means of the scalar product and a nonlinear function mapping.
⦿ The inputs to a unit are the outputs of the previous layer.
They are multiplied by their corresponding weights to form a weighted sum, which is added to the bias associated with the unit. Then a nonlinear activation function is applied to it.

Backpropagation Example
[Worked example given on slides]

Neural Networks AIE231: Neural network based on competition

Outline
Introduction
Fixed weight competitive nets
Self-Organizing Maps

Introduction… competitive nets
◼ The most extreme form of competition among a group of neurons is called "Winner Take All".
◼ As the name suggests, only one neuron in the competing group will have a nonzero output signal when the competition is completed.
◼ A specific competitive net that performs Winner Take All (WTA) competition is the Maxnet.
◼ A more general form of competition, the "Mexican Hat", will also be described in this lecture (instead of a nonzero output for the winner and zeros for all other competing nodes, we have a bubble around the winner).
◼ With the exception of the fixed-weight competitive nets (namely Maxnet, Mexican Hat, and Hamming net), all of the other nets combine competition with some form of learning to adjust the weights of the net (i.e., the weights that are not part of any interconnections in the competitive layer).

Introduction… competitive nets
◼ The form of learning depends on the purpose for which the net is being trained:
– LVQ and the counterpropagation net are trained to perform mappings. The learning, in this case, is supervised.
– SOM (used for clustering of input data): a common use of unsupervised learning.
– ART nets are also clustering nets: also unsupervised.
◼ Several of the nets discussed use the same learning algorithm, known as "Kohonen learning", where the units that update their weights do so by forming a new weight vector that is a linear combination of the old weight vector and the current input vector.
◼ Typically, the unit whose weight vector was closest to the input vector is allowed to learn.
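The Kohonen learning step just described can be sketched in a few lines; the function name and example vectors are my own:

```python
# Kohonen learning: the updated weight vector is a linear combination of the
# old weight vector and the current input,
#   w_new = w_old + alpha * (x - w_old) = alpha * x + (1 - alpha) * w_old.

def kohonen_update(w, x, alpha):
    return [wi + alpha * (xi - wi) for wi, xi in zip(w, x)]

w_old = [0.0, 1.0]
w_new = kohonen_update(w_old, [1.0, 0.0], alpha=0.5)   # moves halfway toward x
```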
The learning, in this case, is supervised. – SOM (used for clustering of input data): a common use of unsupervised learning. – ART are also clustering nets: also unsupervised. ◼ Several of the nets discussed use the same learning algorithm known as “Kohonen learning”: where the units that update their weights do so by forming a new weight vector that is a linear combination of the old weight vector and the current input vector. ◼ Typically, the unit whose weight vector was closest to the input vector is allowed to learn. neural_network_AIE231 5 Introduction… competitive nets ◼ The weight update for output (or cluster) unit j is given as: 𝒘𝒋 (new) = 𝒘𝒋 (old) + x − 𝒘𝒋 (old) = x + (1−)w. j (old) ◼ where x is the input vector, 𝒘𝒋 is the weight vector for unit j, and the learning rate, decreases as learning proceeds. neural_network_AIE231 6 Introduction… competitive nets ◼ Two methods of determining the closest weight vector to a pattern vector are commonly used for self-organizing nets. ◼ Both are based on the assumption that the weight vector for each cluster (output) unit serves as an exemplar for the input vectors that have been assigned to that unit during learning. – The first method of determining the winner uses the squared Euclidean distance between the I/P vector and the weight vector and chooses the unit whose weight vector has the smallest Euclidean distance from the I/P vector. – The second method uses the dot product of the I/P vector and the weight vector. The dot product can be interpreted as giving the correlation between the I/P and weight vector. neural_network_AIE231 7 Introduction… competitive nets ◼ In general and for consistency, we will use Euclidean distance squared. ◼ Many NNs use the idea of competition among neurons to enhance the contrast in activations of the neurons ◼ In the most extreme situation, the case of the Winner-Take- All, only the neuron with the largest activation is allowed to remain “on”. 
Introduction… competitive nets
Two common similarity measures:
1. Euclidean distance between two vectors
2. Cosine of the angle between two vectors

Self-Organising Maps (SOM)

HISTORICAL BACKGROUND
◼ 1960s: Vector quantisation problems studied by mathematicians (Glienn, 1964; Stratonowitch, 1966).
◼ 1973: von der Malsburg did the first computer simulation demonstrating self-organisation.
◼ 1976: Willshaw and von der Malsburg suggested the idea of SOM.
◼ 1980s: Kohonen further developed and studied computational algorithms for SOM.

EUCLIDEAN SPACE
◼ Points in Euclidean space have coordinates (e.g. x, y, z) represented by real numbers R. We denote n-dimensional space by Rⁿ.
◼ Every point in Rⁿ is defined by n coordinates {x1, …, xn}, or by an n-dimensional vector x = (x1, …, xn).

EXAMPLES
◼ Example 1: In R² (two-dimensional space, a plane) points are represented by two numbers, such as a = (2, 3) or b = (−1, 1).
◼ Example 2: In R³ (three-dimensional space) points are represented by three coordinates x, y and z (or x1, x2 and x3), such as a = (2, −1, 3).

EUCLIDEAN DISTANCE
◼ The distance between two points a = (a1, …, an) and b = (b1, …, bn) in Euclidean space Rⁿ is calculated as d(a, b) = √(Σᵢ (aᵢ − bᵢ)²).

MULTIDIMENSIONAL DATA IN BUSINESS
◼ A bank gathered information about its customers. We may consider each entry as a coordinate xi and all the information about one customer as a point in Rⁿ (n-dimensional space).
◼ How do we analyse such data?

CLUSTERS
◼ Multivariate analysis offers a variety of methods to analyse multidimensional data (e.g. NN). SOM is one such technique. One of the main goals is to find clusters of points.
◼ Clusters are groups of points close to each other.
◼ “Similar” customers would have small Euclidean distance between them and would belong to the same group (cluster). AIE231 Neural Network 18 SOM ARCHITECTURE ◼ SOM uses neural networks without hidden layer and with neurons in the output layer competing with each other, so that only one neuron (the winner) can fire at a time. AIE231 Neural Network 19 SOM ARCHITECTURE (CONT.) ◼ Input layer has n nodes. We can represent an input pattern by n– dimensional vector x = (x1,... , xn) ∈ Rn. ◼ Each neuron j on the output layer is connected to all input nodes, so each neuron has n weights. We represent them by n dimensional vector wj = (w1j,... ,wnj) ∈ Rn. ◼ Usually neurons in the output layer are arranged in a line (one– dimensional lattice) or in a plane (two–dimensional). ◼ SOM uses unsupervised learning algorithm, which organises weights wj in the output lattice so that they “mimic” the characteristics of the input patterns. AIE231 Neural Network 20 HOW DOES AN SOM WORK ◼ The algorithm consists of three processes: competition, cooperation and adaptation. ◼ Competition Input pattern x = (x1,... , xn) is compared with the weight vector wj = (w1j,... ,wnj) of every neuron in the output layer. The winner is the neuron whose weight wj is the closest to the input x in terms of Euclidean distance: AIE231 Neural Network 21 Example ◼ Consider SOM with three inputs and two output nodes (A and B). Let wA = (2,−1, 3) and wB = (−2, 0, 1). ◼ Find which node wins if the input x = (1,−2, 2) ◼ Solution: What if x = (−1,−2, 0)? AIE231 Neural Network 22 Cooperation ◼ The winner helps its neighbours in the output lattice. ◼ Those nodes which are closer to the winner in the lattice get more help, those which are further away get less. ◼ If the winner is node i, then the amount of help to node j is calculated using the neighbourhood function hij(dij), where dij is the distance between i and j in the lattice. 
A good example of hij(d) is the Gaussian function hij(d) = exp(−d²/(2σ²)).
◼ Note that the winner also helps itself more than the others (hij is largest for dii = 0).

Adaptation
◼ After the input x has been presented to the SOM, the weights wj of the nodes are adjusted so that they become "closer" to the input. The exact formula for the adaptation of weights is:
w′j = wj + α hij [x − wj],
where α is the learning-rate coefficient.
◼ One can see that the amount of change depends on the neighbourhood hij of the winner. So, the winner helps itself and its neighbours to adapt.
◼ Finally, the neighbourhood hij is also a function of time, such that the neighbourhood shrinks with time (e.g. σ decreases with t).

Example
◼ Let us adapt the winning node from the earlier example (wA = (2, −1, 3) for x = (1, −2, 2)) if α = 0.5 and h = 1:
w′A = wA + 0.5 · 1 · (x − wA) = (2, −1, 3) + 0.5(−1, −1, −1) = (1.5, −1.5, 2.5)

TRAINING PROCEDURE
1. Initially set all the weights to some random values
2. Feed a set of data into the network
3. Find the winner
4. Adjust the weights of the winner and its neighbours to be more like the input
5. Repeat from step 2 until the network stabilises

APPLICATIONS OF SOM IN BUSINESS
◼ SOM can be very useful during the intelligence phase of decision making. It helps to analyse and understand rather complex and large amounts of information (data).
◼ The ability to visualise multi-dimensional data can be used for presentations and reports.
◼ Identifying clusters in the data (e.g. typical groups of customers) can help optimise the distribution of resources (e.g. advertising, product selection, etc.).
◼ It can be used to identify credit-card fraud, errors in data, etc.

USEFUL PROPERTIES OF SOM
◼ Reducing dimensions (indeed, SOM is a map f : Rⁿ → Zᵐ)
◼ Visualisation of clusters
◼ Ordered display
◼ Handles missing data
◼ The learning algorithm is unsupervised.
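The competition and adaptation examples above (wA = (2, −1, 3), wB = (−2, 0, 1), x = (1, −2, 2), α = 0.5, h = 1) can be checked numerically:

```python
# SOM competition (squared Euclidean distance) followed by one adaptation step
# w' = w + alpha * h * (x - w), using the values from the worked examples.

def sq_dist(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

wA, wB = (2.0, -1.0, 3.0), (-2.0, 0.0, 1.0)
x = (1.0, -2.0, 2.0)

# Competition: d^2(x, wA) = 3, d^2(x, wB) = 14, so node A wins.
winner = 'A' if sq_dist(x, wA) < sq_dist(x, wB) else 'B'

# Adaptation of the winner with alpha = 0.5, h = 1.
alpha, h = 0.5, 1.0
w_new = tuple(wi + alpha * h * (xi - wi) for wi, xi in zip(wA, x))
# w_new = (1.5, -1.5, 2.5), halfway between wA and x
```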
Example
◼ Consider the self-organising map whose output layer consists of six nodes, A, B, C, D, E and F, organised into a two-dimensional lattice with neighbours connected by lines.
◼ Each of the output nodes has two inputs x1 and x2. Thus, each node has two weights corresponding to these inputs: w1 and w2. The values of the weights for all output nodes in the SOM are given in the table below:
[Weight table given on the slide]
Calculate which of the six output nodes is the winner if the input pattern is x = (2, −4).

Solution
◼ First, we calculate the distance for each node. The winner is the node with the smallest distance from x. Thus, in this case the winner is node C (because 5 is the smallest distance here).

Neural Networks AIE231: Convolutional Neural network CNN

Outline
Introduction to CNN
What is a convolution neural network?
How does CNN recognize images?
Layers in convolution neural network
CNN architectures
CNN algorithm

How does image recognition work?
Introduction to CNN (History)

Convolution Neural Network: layers & functionality
A Convolutional Neural Network consists of multiple layers: the input layer, convolutional layers, pooling layers, and fully connected layers. The convolutional layer applies filters to the input image to extract features, the pooling layer downsamples the image to reduce computation, and the fully connected layer makes the final prediction. The network learns the optimal filters through backpropagation and gradient descent.

Convolution Neural Network: layers & functionality
Convolutional neural networks (convnets) are neural networks that share their parameters. Imagine you have an image. It can be represented as a cuboid having a length and width (the dimensions of the image) and a height (the channels, as images generally have red, green, and blue channels).

Convolution Neural Network: layers & functionality
Now imagine taking a small patch of this image and running a small neural network, called a filter or kernel, on it, with say K outputs, and representing them vertically. Now slide that neural network across th