Artificial Neural Networks
A. D. Patel Institute of Technology
Dr. N C CHAUHAN
Summary
This document provides an overview of Artificial Neural Networks (ANNs). It covers concepts such as biological neurons, artificial neurons, models of artificial neurons, activation functions, neural network based algorithms, and backpropagation learning.
Full Transcript
Artificial Neural Networks
Dr. N C CHAUHAN, Professor & Head, Department of Information Technology, A. D. Patel Institute of Technology

What is Artificial Neural Network?
Neural Networks:
– Extremely simplified models of the brain (biological nervous system)
– Massively parallel distributed processing systems
– Ability to learn (acquire knowledge and make it available for use)
Learning from experience:
– Look at the DOG !!!
– Child learning
The role model is the human mind (the aim is to mimic human behaviour).
[Figure: machine-learning view, training data is used to build a model (a "magic black box"), which is then applied to testing data to make predictions.]

Characteristics of NN
NN exhibit mapping capabilities (pattern association).
NN learn by example (labeled or unlabeled).
Capability to generalize (can give answers for unknown patterns).
Adaptive (change the connection strengths to learn new things).
Fault tolerant (i.e. if one of the neurons or connections is damaged, the whole network still works quite well).
NN can process information in parallel, at high speed and in a distributed manner.

Biological Neuron
A large number of neurons are interconnected with each other; a neuron fires when its input crosses a threshold.
[Figure: biological neuron, with dendrites (carry signals in), nucleus, axon (carries the signal away), terminal branches of the axon, and synapses (connections to other neurons whose strength changes in response to learning).]

Artificial Neuron (The Perceptron)
[Figure: perceptron, inputs x1 … xn are weighted by w1 … wn, summed, and passed through a threshold activation function, by analogy with dendrites, soma and axon.]

Model of Artificial Neuron (the McCulloch-Pitts model)
Weighted summation followed by an activation (threshold) function:
    z = Σ_{i=1}^{n} w_i x_i ,    y = f(z)
where x1 … xn are the inputs, w1 … wn the weights, and y the output.

Activation Functions
Step (threshold) function:
    f(net) = a if net ≥ θ, and b otherwise.
Signum function.
Piecewise linear function:
    f(net) = a if net < c, b if net > d, and a + ((net − c)(b − a)) / (d − c) otherwise.

Neural Network based Algorithms
Supervised learning networks:
– Perceptron Learning Algorithm
– Adaline, Madaline
– Back-propagation Learning Algorithm
– Radial Basis Function (RBF) Networks
– Convolutional Neural Networks
– …
Unsupervised learning:
– Associative Memory (auto-association and hetero-association)
– Self-Organizing Map
– Adaptive Resonance Theory
– Hopfield Network
– Learning Vector Quantization
– Competitive Learning Networks
– Recurrent Neural Networks
– …

How does Perceptron Learn?
Perceptron Learning Rule (supervised learning):
If the output is correct: wnew = wold (no adjustment of the weights).
If the output is not correct: wnew = wold + η · (desired − output) · input
Effect of the parameter η (learning rate).
Limitation of the perceptron: NONLINEAR SEPARABILITY (a single perceptron can only separate linearly separable classes).
[Figure: effect of η and the linear-separability limitation, illustrated graphically.]
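A minimal sketch of the threshold neuron and the perceptron learning rule from the slides above, assuming a step activation with θ = 0, a bias treated as an extra learned weight, and an illustrative AND-gate data set with η = 0.1 (these specifics are assumptions, not from the slides):

import numpy as np

def step(net, theta=0.0):
    """Step (threshold) activation: returns 1 if net >= theta, else 0."""
    return 1 if net >= theta else 0

def train_perceptron(X, targets, eta=0.1, epochs=20):
    """Perceptron learning rule: w_new = w_old + eta * (desired - output) * input."""
    w = np.zeros(X.shape[1])          # weights
    b = 0.0                           # bias (threshold learned as an extra weight)
    for _ in range(epochs):
        for x, desired in zip(X, targets):
            output = step(np.dot(w, x) + b)
            error = desired - output  # zero when the output is correct
            w = w + eta * error * x   # no adjustment when the output is correct
            b = b + eta * error
    return w, b

# Illustrative data: the linearly separable AND function
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
targets = np.array([0, 0, 0, 1])
w, b = train_perceptron(X, targets)
print([step(np.dot(w, x) + b) for x in X])   # [0, 0, 0, 1]

The XOR function, in contrast, is not linearly separable, which is exactly the limitation the slide refers to.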
NN Learning Concept (Learning as Optimization)
An input is presented to the neural network, its output is compared with the target, and the error (= target − output) is used to adjust the weights (the memory of the NN). This is empirical risk minimization.

Different NN Structures
Single-layer feed-forward and multi-layer feed-forward networks.

Different NN Structures
Competitive networks and recurrent neural networks.

A Most popular NN structure for Supervised learning
Structure: layered, feed-forward, MLP.
[Figure: inputs enter the input layer, pass through a hidden layer, and reach the output layer, which produces the output.]

Multilayer Perceptron with Backpropagation Learning
[Slides 16–31 cover Terminology, Backpropagation Learning, Gradient Descent based Optimization, Sigmoidal Neurons and the Backpropagation Learning Algorithm; their derivations and figures are not captured in this transcript.]

Flow of error in back propagation
Error terms are calculated in the direction opposite to most of the node output calculations.
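Since the backpropagation derivation itself is not captured in this transcript, the following is only a generic sketch of one forward and backward pass for a single-hidden-layer MLP with sigmoidal neurons and squared error; the layer sizes (2-3-1), learning rate, weight ranges and training pattern are illustrative assumptions:

import numpy as np

def sigmoid(z):
    """Sigmoidal neuron: f(z) = 1 / (1 + exp(-z)), with f'(z) = f(z) * (1 - f(z))."""
    return 1.0 / (1.0 + np.exp(-z))

def backprop_step(x, target, W1, W2, eta=0.5):
    """One per-pattern gradient-descent update for a 1-hidden-layer MLP (squared error)."""
    # Forward pass
    h = sigmoid(W1 @ x)                              # hidden-layer outputs
    y = sigmoid(W2 @ h)                              # output-layer outputs
    # Backward pass: error terms flow opposite to the forward computation
    delta_out = (y - target) * y * (1 - y)           # output-layer error terms
    delta_hid = (W2.T @ delta_out) * h * (1 - h)     # hidden-layer error terms
    # Gradient descent: w <- w - eta * dE/dw
    W2 -= eta * np.outer(delta_out, h)
    W1 -= eta * np.outer(delta_hid, x)
    return 0.5 * np.sum((y - target) ** 2)           # squared error, for monitoring

# Illustrative use: assumed 2-3-1 network, small random initial weights, one pattern
rng = np.random.default_rng(0)
W1 = rng.uniform(-0.5, 0.5, (3, 2))
W2 = rng.uniform(-0.5, 0.5, (1, 3))
for _ in range(1000):
    err = backprop_step(np.array([1.0, 0.0]), np.array([1.0]), W1, W2)
print(round(float(err), 6))   # the error decreases toward 0 as training proceeds

Note that the error terms (delta_out first, then delta_hid) are computed in the direction opposite to the forward output calculation, as the flow-of-error slide states.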
Parameters of Training Neural Networks

Setting up the Parameters
Criteria for weight initialization
Frequency of weight updates (per pattern vs. per epoch)
Selection of learning rate
Effect of the momentum coefficient
Termination criteria
NN structure optimization
No. of training sets
Generalization ability (confirmation of learning)

Setting up the Parameters: Weight Initialization
Generally weights are generated randomly between -1 and 1 or between -0.5 and 0.5, since larger weights may drive the outputs of layer-1 nodes to saturation, requiring a large amount of training time. This is due to the saturation property of the sigmoid function.
If the magnitudes of some inputs are much larger than others, random weights may bias the network to give more importance to the inputs with large magnitudes. For such cases the weights can be initialized accordingly. [Initialization formula not captured in this transcript.]

Setting up the Parameters: Frequency of Weight Updates
Two approaches to weight updates (learning):
– Per pattern: weights are changed after every sample presentation.
– Per epoch: weights are updated only after all samples are presented; the weight changes suggested by different training samples are accumulated into a single change applied at the end of each epoch.
Both methods are in wide use, and each has advantages and disadvantages (a sketch follows the learning-rate discussion below).
– In some applications the input-output patterns are available only online, so batch (per-epoch) mode is not possible.
– Per-pattern training is more expensive than per-epoch training.
– For per-epoch training, training time can be reduced by parallelism; per-pattern training is not parallelizable.

Setting up the Parameters: Selection of Learning Rate
The magnitude of the weight change in BP is proportional to the negative gradient of the error; this change is only relative. The exact magnitude of the weight change depends on an appropriate value of the learning rate η.
A large value of η leads to rapid learning, but the weights may then oscillate; a low value implies slow learning. The right value of η depends on the application. Values between 0.1 and 0.9 are used in many applications (in deep learning, typically 1e-5 to 1e-1).

Setting up the Parameters: Selection of Learning Rate
Study of η in the literature:
– In some formulations, each weight in the network is associated with its own learning rate.
Methods of adapting the learning rate:
– Begin with a large value of η in the early iterations and steadily decrease it (the idea is that, as training proceeds, changes to the weight vector must be small to reduce the likelihood of divergence or weight oscillation).
– Increase η at every iteration that improves performance by some significant amount, and decrease η at every iteration that worsens performance by some significant amount.
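As referenced above, here is a small sketch contrasting per-pattern and per-epoch weight updates, combined with a start-large-then-decrease learning-rate schedule. The grad_error interface, the linear example model and all numeric values are illustrative assumptions, not from the slides:

import numpy as np

def train(w, samples, grad_error, eta0=0.5, epochs=50, per_pattern=True, decay=0.99):
    """Per-pattern vs. per-epoch weight updates with a decaying learning rate.
    grad_error(w, x, t) is assumed to return dE/dw for a single training pair."""
    eta = eta0
    for _ in range(epochs):
        if per_pattern:
            # Per pattern: change the weights after every sample presentation
            for x, t in samples:
                w -= eta * grad_error(w, x, t)
        else:
            # Per epoch: accumulate the suggested changes and apply one update per epoch
            total = np.zeros_like(w)
            for x, t in samples:
                total += grad_error(w, x, t)
            w -= eta * total
        eta *= decay   # begin with a large eta and steadily decrease it
    return w

# Illustrative use on a linear unit with squared error, E = 1/2 (w.x - t)^2
grad = lambda w, x, t: (w @ x - t) * x
data = [(np.array([1.0, 1.0]), 2.0), (np.array([1.0, -1.0]), 0.0)]
print(train(np.zeros(2), data, grad, per_pattern=False))   # converges to [1., 1.]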
Setting up the Parameters: Momentum
BP leads the weights in a NN to a local minimum of the MSE, which is generally different from the global minimum that corresponds to the best choice of weights.
We may prevent the network from getting stuck in a local minimum by making the weight changes depend on the average gradient ∂E/∂w of the MSE in a small region, rather than on the precise gradient at a point. Averaging ∂E/∂w over a small neighborhood can allow the network weights to be modified in the general direction of MSE decrease, without getting stuck in some local minima.

Setting up the Parameters: Momentum
Calculating averages can be an expensive task. As a shortcut, make the weight change in the l-th iteration of BP depend on the immediately preceding weight change, of the (l-1)-th iteration. This is implemented by adding a momentum term to the weight update rule.
The momentum term introduces yet another parameter, α, whose optimum value depends on the application and is not easy to determine a priori; α can also be chosen adaptively. A well chosen value of α can significantly reduce the number of iterations needed for convergence.
– A value of α close to 0: past history has little effect on the weight change.
– A value of α close to 1: the current error has little effect on the weight change.

Setting up the Parameters: Generalizability
Given a network, it is possible that repeated training iterations successively improve its performance on the training data (by memorizing the training samples), while the resulting network performs poorly on test data. This is overfitting.
One solution is to constantly monitor the performance of the network on test data. Weights should be adjusted only on the basis of the training set, but the error should be monitored on the test set. Training continues as long as the error on the test set keeps decreasing, and is terminated if the error on the test set increases.
[Figure: training curves contrasting overfitting with good fitting.]
A network with a large number of nodes may memorize the training set and not generalize well; hence networks of small size are preferred. Injecting noise into the training set has been found to be a useful technique for improving the generalization ability of the network.

Setting up the Parameters: No. of Hidden Layers and Nodes
How large should a NN be for a given problem? In practice this is solved by trial and error.
With too few nodes, the network may not be powerful enough for a given learning task. With a large number of nodes (and connections), computation is too expensive, and the NN may "memorize" the input training samples; such networks tend to perform poorly on test data. NN learning is considered successful only if the system performs well on test data: we want the network to generalize from the training samples, not to memorize them.
Adaptive algorithms:
– Begin from a large network and successively remove nodes and links until performance degrades to an unacceptable level, OR
– Begin from a very small network and introduce new nodes and weights until performance is satisfactory.
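Referring back to the momentum slides above, a minimal sketch of a momentum-augmented weight update; the slides' exact update equation is not captured here, so the common form Δw(l) = −η·∂E/∂w + α·Δw(l−1) is assumed, along with an illustrative quadratic error and parameter values:

def momentum_update(w, grad, prev_delta, eta=0.2, alpha=0.9):
    """Weight update with a momentum term: the change in iteration l also depends
    on the change made in iteration (l - 1):
        delta_w(l) = -eta * dE/dw + alpha * delta_w(l - 1)
    alpha near 0: past history has little effect on the change;
    alpha near 1: the current gradient has little effect on the change."""
    delta = -eta * grad + alpha * prev_delta
    return w + delta, delta

# Illustrative use on E(w) = w^2 (so dE/dw = 2w), starting from w = 5
w, prev = 5.0, 0.0
for _ in range(200):
    w, prev = momentum_update(w, 2 * w, prev)
print(round(w, 4))   # ends near the error minimum at w = 0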
Setting up the Parameters: No. of Samples
How many samples are needed for good training?
A rule of thumb (from related statistical problems): use at least 5 to 10 times as many training samples as there are weights to be trained.
Baum and Haussler (1989), based on the desired accuracy: P > |w| / (1 − a), where
– P = required number of patterns (size of the training set),
– |w| = number of weights to be trained,
– a = expected accuracy on test data.
(For example, if a network has |w| = 27 and the desired accuracy is 95% (a = 0.95), then P should be at least P > 27/0.05 = 540.)
This is a necessary condition. The sufficient condition to ensure the desired performance also involves n, the number of nodes. [The slide's sufficient-condition formula is not captured in this transcript.]
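A small worked sketch of these sample-size estimates. The necessary condition P > |w|/(1 − a) is recovered from the slide's own example (27/0.05 = 540); the sufficient-condition form used below, involving the number of nodes n, is the commonly cited Baum-Haussler bound and is an assumption, since the slide's formula is not in the transcript:

import math

def patterns_necessary(num_weights, accuracy):
    """Necessary condition, recovered from the slide's example: P > |w| / (1 - a)."""
    return num_weights / (1.0 - accuracy)

def patterns_sufficient(num_weights, num_nodes, accuracy):
    """Assumed Baum-Haussler sufficient condition (the slide's exact formula is not
    in the transcript): P >= (|w| / (1 - a)) * ln(n / (1 - a))."""
    return (num_weights / (1.0 - accuracy)) * math.log(num_nodes / (1.0 - accuracy))

# Worked example from the slide: |w| = 27 weights, desired accuracy a = 0.95
print(round(patterns_necessary(27, 0.95)))          # 540, i.e. P > 27 / 0.05 = 540
print(round(patterns_sufficient(27, 10, 0.95)))     # with an assumed n = 10 nodes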