ML_Ch7_Neural Networks.pdf
Machine Learning
Chapter 7: Neural Networks (Classification Part 2)
Dr. Aryaf Aladwan

Outline
1. Neural network and learning machines: Perceptron
2. Multi-Layer Perceptron
3. FFNN
4. RNN
5. CNN

The cell body, or soma, processes the information received from the dendrites.
The axon is a cable that neurons use to send information.
A synapse is the connection between an axon and the dendrites of another neuron.

Types of Perceptron
There are two types of perceptrons:
1. Single layer
2. Multilayer
Single-layer perceptrons can learn only linearly separable patterns. Multilayer perceptrons, or feedforward neural networks with two or more layers, have greater processing power.
The perceptron algorithm learns the weights for the input signals in order to draw a linear decision boundary. This makes it possible to distinguish between two linearly separable classes, +1 and -1.

Perceptron
Perceptrons were developed in the 1950s and 1960s by the scientist Frank Rosenblatt.
A perceptron (neuron) is a mathematical function modeled on the working of biological neurons. It is an elementary unit in an Artificial Neural Network (ANN).
A perceptron is an algorithm for supervised learning of binary classifiers.

How do perceptrons work?
A perceptron takes several binary inputs, x1, x2, ..., and produces a single binary output.

The biological neuron is analogous to artificial neurons in the following terms:
[Table: correspondence between biological and artificial neuron terms]

[Figure: Perceptron. Labels: inputs, weights, net input function, activation function, scalar value x0]

Properties of a neuron
1. Weights: w1, w2, ..., wn are real numbers expressing the importance of the respective inputs to the output.
2. Bias: the negative of the threshold above which the neuron fires.
3. Activation function: defines the output of the node given an input or set of inputs.
The neuron's output, 0 or 1, is determined by whether the weighted sum of the products wj*xj is less than or greater than some threshold value.
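
The thresholding described above can be sketched in a few lines of Python. This is only an illustration: the function names, the example weights, and the threshold value are hypothetical and not taken from the slides, and the choice to output 0 when the weighted sum equals the threshold exactly is an assumption.

```python
def perceptron_output(inputs, weights, threshold):
    """Return 1 if the weighted sum of the inputs exceeds the threshold, else 0."""
    weighted_sum = sum(w * x for w, x in zip(weights, inputs))
    # Assumption: a sum exactly equal to the threshold gives 0.
    return 1 if weighted_sum > threshold else 0


def perceptron_output_with_bias(inputs, weights, bias):
    """Same decision, written with bias = -threshold so the comparison is against 0."""
    weighted_sum = sum(w * x for w, x in zip(weights, inputs))
    return 1 if weighted_sum + bias > 0 else 0


# Hypothetical values for illustration: both inputs must be on for the neuron to fire.
example_weights = [0.6, 0.6]
example_threshold = 1.0

for example_inputs in ([0, 0], [0, 1], [1, 0], [1, 1]):
    by_threshold = perceptron_output(example_inputs, example_weights, example_threshold)
    by_bias = perceptron_output_with_bias(example_inputs, example_weights, -example_threshold)
    print(example_inputs, by_threshold, by_bias)
```

Both functions make the same decision; moving the threshold to the other side of the inequality as a bias only changes how the rule is written.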
The basic mathematical model is:
output = 0 if sum_j (wj*xj) <= threshold
output = 1 if sum_j (wj*xj) > threshold
But bias = -threshold, so:
output = 0 if sum_j (wj*xj) + bias <= 0
output = 1 if sum_j (wj*xj) + bias > 0

Types of Neural Networks Activation Functions (Transfer Function)

The linear activation function is used in the output layer of a network when we have a regression problem. It does not make sense to use it in all layers, because such a multi-layer network can be reduced to a single-layer network. Also, networks with only linear activation functions cannot model non-linear relationships between input and output.

The binary step activation function is used in the perceptron. It cannot be used in multi-layer networks, because they use the backpropagation learning algorithm, which changes the network weights and biases based on the derivative of the activation function. For the binary step function that derivative is zero, so backpropagation would make no weight or bias updates.

The sigmoid activation function can be used both at the output layer and in the hidden layers of a multilayer network. It allows the network to model non-linear relationships between input and output. The problem with the sigmoid function is that its derivative is very small away from the origin and quickly approaches zero, so the weight updates are minimal and learning is very slow. This is known as the vanishing gradient problem. In networks with many hidden layers (so-called deep networks), we generally avoid sigmoid and use the ReLU activation function.

Hyperbolic tangent (tanh), like the sigmoid function, is a soft step function, but its range is between -1 and 1 (instead of 0 and 1). One benefit of tanh over sigmoid is that its derivative values are larger, so it suffers less from the vanishing gradient problem.

ReLU (Rectified Linear Unit) is an activation function popular in deep neural networks. Its derivative is 1 for every positive input, so it suffers far less from the vanishing gradient problem and is preferred to sigmoid or tanh in hidden layers. Sigmoid or tanh can still be used in the output layer of deep networks.

The Perceptron Learning
How is a simple perceptron trained, and how are its weights updated?
A perceptron can produce two values, +1 or -1, where +1 means that the input example belongs to the + class and -1 means that it belongs to the - class.
The perceptron must learn the weight vector in such a way that, for every training example, it produces the correct +1 or -1.
So, in summary:
What are we learning? The weights of the perceptron.
Is any weight acceptable? No! We want a weight vector that makes the perceptron produce +1 for the + class and -1 for the - class.

Weights updating Process
1. We are discussing supervised learning, which means that we know the true class label for every training example in our training set. In the perceptron training rule, we therefore initialize the weights at random, feed the training examples into the perceptron, and look at the produced output, which can be either +1 or -1.
2. After observing the output for a given training example, we do NOT modify the weights unless the produced output was wrong. For example, if we want +1 for the + class and -1 for the - class, and the perceptron returns +1 for an instance of the - class, then we need to modify the parameters of the network, i.e., the weights.
3. We keep iterating through the training set for as long as necessary, until the perceptron classifies all the training examples correctly.

How do we update the weights? Using the Delta Rule.

Delta Rule for SLP
w(new) = w(old) + learning rate * error * x
where error = t - y (target output minus actual output), x is the input attached to the weight, and the learning rate is between 0 and 1.

Example (OR Gate)
Learning rate = 0.1

First Epoch
An epoch means training the neural network with all the training data for one cycle.
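
The training procedure for the OR gate can be sketched as below. This is a minimal sketch under stated assumptions, not the slides' own code: it assumes the usual form of the delta rule, w <- w + learning_rate * (t - y) * x, a 0/1 step output with a threshold of 0.1, and a constant bias input of 1. The starting weights are hypothetical values picked for illustration (the slides initialize them randomly), and `train_perceptron` and `step` are names introduced here.

```python
def step(net, threshold):
    """Binary step activation: 1 if net reaches the threshold, else 0."""
    return 1 if net >= threshold else 0


def train_perceptron(samples, weights, learning_rate=0.1, threshold=0.1, max_epochs=100):
    """Train a two-input perceptron with the delta rule until an epoch has no errors.

    Returns the final (w1, w2, wb) and the number of the first error-free epoch.
    """
    w1, w2, wb = weights
    bias_input = 1  # assumption: the bias weight wb is attached to a constant input of 1
    for epoch in range(1, max_epochs + 1):
        errors = 0
        for (x1, x2), target in samples:
            net = x1 * w1 + x2 * w2 + bias_input * wb
            y = step(net, threshold)
            error = target - y
            if error != 0:
                # Delta rule: move each weight by learning_rate * error * its input.
                w1 += learning_rate * error * x1
                w2 += learning_rate * error * x2
                wb += learning_rate * error * bias_input
                errors += 1
        if errors == 0:
            return (w1, w2, wb), epoch
    return (w1, w2, wb), max_epochs


# OR gate truth table: ((x1, x2), target)
or_gate = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]

# Hypothetical starting weights (w1, w2, wb).
final_weights, epochs_needed = train_perceptron(or_gate, (0.1, 0.2, -0.2))
print(final_weights, epochs_needed)
```

With these particular starting values the first epoch makes two corrections and the second epoch makes none, so training stops after two epochs with weights close to (0.2, 0.3, 0). Other starting weights generally lead to different final weights.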
w1, w2, and wb are initialized randomly.
Calculate net = x1*w1 + x2*w2 + bias*wb
Calculate y (activation function):
y = 1 if net >= threshold
y = 0 if net < threshold
where threshold = 0.1
Calculate error = t - y
If the error = 0, there is no need to update the weights before the next input.

But error = (t - y) = 1, so update the weights.
Weights updating using the delta rule [figure values: 0.1, 0.3, -0.1]

Calculate net = x1*w1 + x2*w2 + bias*wb
Calculate y:
y = 1 if net >= threshold
y = 0 if net < threshold
where threshold = 0.1
But error = (t - y) = 1, so update the weights.

Weights updating using the delta rule

Calculate net = x1*w1 + x2*w2 + bias*wb
Calculate y:
y = 1 if net >= threshold
y = 0 if net < threshold
where threshold = 0.1
But error = (t - y) = 0, so there is no need to update the weights.

But we have 2 errors, so the total error is not equal to zero. Therefore we need another epoch.

Second Epoch
The second epoch has no errors, so training stops.
Optimal weights: W1 = 0.2, W2 = 0.3, Wb = 0

Why do we need MLP?
[Figure: MLP]

Multilayer Perceptron (MLP)
A multilayer perceptron is a special case of a feedforward neural network where every layer is a fully connected layer, and in some definitions the number of nodes in each layer is the same.

Feed Forward Neural Network (FFNN)
Why Use Neural Networks?
Classical machine learning techniques such as SVMs, random forests, and KNN work well for prediction with structured data, where the inputs have a clear meaning. However, for unstructured data such as images, raw speech waveforms, or wearable sensor data, plugging the raw data into classical models tends not to work very well.

Layers of FFNN
1) Input Layer: The neurons of this layer receive the input and pass it on to the other layers of the network. The number of features (attributes) in the dataset must match the number of neurons in the input layer.
2) Hidden Layer: A layer located between the input and the output, in which weights are applied to the inputs and the result is passed through an activation function to give the layer's output. The type of hidden layer distinguishes the different types of neural networks, such as CNNs and RNNs. The number of hidden layers is called the depth of the neural network.
3) Output Layer: This is the layer that gives out the predictions.

[Figure: FFNN]

Forward Propagation
Net1 = A*w1 + B*w3
Oh1 = Activation(Net1)
Net2 = A*w2 + B*w4
Oh2 = Activation(Net2)
O = w5*Oh1 + w6*Oh2
Final Output = Activation(O)

Inside a unit, two operations happen: 1) computation of the weighted sum and 2) squashing of the weighted sum using an activation function. The result from the activation function becomes an input to the next layer, and this continues until the output layer is reached.

FFNN Training Steps
1. Weight initialization
2. Inputs application
3. Sum of inputs-weights products (SOP)
4. Activation functions
5. Weight adaptation (if error)
6. Back to step 2

Weight Adaptation
In machine learning, backpropagation is a widely used algorithm for training feedforward artificial neural networks. Backpropagation computes the gradient, in weight space, of a feedforward neural network with respect to a loss function.
The aim of backpropagation (the backward pass) is to propagate the total error back through the network so as to update the weights in order to minimize the cost function (loss).

Chain Rule in Calculus
If we have y = f(u) and u = g(x), then we can write the derivative of y as:
dy/dx = (dy/du) * (du/dx)

Forward Phase: Input weights -> Sum of Products (SOP) -> Prediction Output -> Prediction Error
Backward Phase: Prediction Error -> Prediction Output -> Sum of Products (SOP) -> Input weights

Backward Pass
What is the change of the predicted error?
The answer is to get the partial derivative of the error with respect to the weights.

What is the change of the predicted error?
Weight derivative = (d Error / d Prediction) x (d Prediction / d SOP) x (d SOP / d Weight)

Cont.
Update the weights using gradient descent:
Weight(new) = Weight(old) - learning rate * (d Error / d Weight)

Solved Example
https://www.ic.unicamp.br/~sandra/pdf/class/2019-2/mc886/2019-09-16-MC886-Neural-Networks.pdf

What are neural networks used for?
1. Medical diagnosis by medical image classification.
2. Financial predictions.
3. Electrical load and energy demand forecasting.
4. Computer vision: the ability of computers to extract information and insights from images and videos.
5. Speech recognition: neural networks can analyze human speech despite varying speech patterns, pitch, tone, language, and accent.
6. Natural language processing (NLP): the ability to process natural, human-created text. Neural networks help computers gather insights and meaning from text data and documents.
7. More

[Figure: Recap]

Deep Learning Neural Networks
Deep neural networks employ deep architectures in neural networks. "Deep" refers to functions with higher complexity in the number of layers and in the number of units in a single layer. They may need millions of training examples, rather than the hundreds or thousands that a simpler network might need.
Types of deep neural networks:
1. Recurrent Neural Networks (RNN)
2. Convolutional Neural Networks (CNN)

Convolutional Neural Networks (CNN)

CNN
A Convolutional Neural Network (ConvNet/CNN) is a deep learning neural network that takes an input image and assigns importance (learnable weights and biases) to various aspects or objects in the image.
An image is nothing but a matrix of pixel values.

Layers in a Convolutional Neural Network
A convolutional neural network has multiple hidden layers that help in extracting information from an image. The four important layers in a CNN are:
1. Convolution layer
2. ReLU layer
3. Pooling layer
4. Fully connected layer

1. Convolution Layer
The purpose of the convolution is to extract the features of the object in the image locally. This means the network learns specific patterns within the picture and is able to recognize them anywhere in the picture.
The computer scans one part of the image at a time, usually a 3×3 patch, and multiplies it element by element with a filter. The products are summed, and the resulting values form a feature map. This step is repeated until the whole image has been scanned.
Note that, after the convolution, the size of the image is reduced (unless padding is used).

The convolved features are controlled by three parameters:
1. Depth: defines the number of filters to apply during the convolution.
2. Stride: defines the number of pixels the filter jumps between two slices.
3. Padding: the operation of adding a corresponding number of rows and columns on each side of the input feature maps.

[Figure: Convolution Layer]

2. ReLU layer
In this layer we remove every negative value from the filtered image and replace it with zero.
The usage of ReLU helps to prevent exponential growth in the computation required to operate the neural network. If the CNN scales in size, the computational cost of adding extra ReLUs increases linearly.
ReLUs also help prevent the so-called "vanishing gradient" problem, which is common when using sigmoidal functions.
You can think of this as the desire for an image to be as close to gray-and-white as possible. By removing negative values from the neurons' input signals, the rectifier function is effectively removing black pixels from the image and replacing them with gray pixels.

3. Pooling layer
Pooling is a down-sampling operation that reduces the dimensionality of the feature map.
The purpose of max pooling is to help the convolutional neural network detect a feature in an image regardless of how the feature is presented.
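
The three operations described above (convolution, ReLU, and max pooling) can be sketched in plain Python on a tiny image. This is an illustrative sketch only: the 5×5 image, the 3×3 filter, the function names, and the pooling window of 2 with stride 1 are all hypothetical choices, and no padding is applied.

```python
def convolve2d(image, kernel, stride=1):
    """Slide the kernel over the image (no padding) and sum the element-wise products."""
    k = len(kernel)
    out_rows = (len(image) - k) // stride + 1
    out_cols = (len(image[0]) - k) // stride + 1
    feature_map = []
    for r in range(out_rows):
        row = []
        for c in range(out_cols):
            total = 0
            for i in range(k):
                for j in range(k):
                    total += image[r * stride + i][c * stride + j] * kernel[i][j]
            row.append(total)
        feature_map.append(row)
    return feature_map


def relu(feature_map):
    """Replace every negative value with zero."""
    return [[max(0, v) for v in row] for row in feature_map]


def max_pool(feature_map, size=2, stride=2):
    """Keep the largest value in each size x size window."""
    out_rows = (len(feature_map) - size) // stride + 1
    out_cols = (len(feature_map[0]) - size) // stride + 1
    return [
        [
            max(
                feature_map[r * stride + i][c * stride + j]
                for i in range(size)
                for j in range(size)
            )
            for c in range(out_cols)
        ]
        for r in range(out_rows)
    ]


# Hypothetical 5x5 binary image and a 3x3 vertical-edge filter.
image = [
    [1, 1, 1, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 0],
    [0, 1, 1, 0, 0],
]
kernel = [
    [1, 0, -1],
    [1, 0, -1],
    [1, 0, -1],
]

convolved = convolve2d(image, kernel)          # 3x3: smaller than the 5x5 input
rectified = relu(convolved)                    # negatives become 0
pooled = max_pool(rectified, size=2, stride=1) # 2x2 after down-sampling
print(convolved)  # [[-2, 0, 2], [-3, -2, 2], [-3, -1, 2]]
print(rectified)  # [[0, 0, 2], [0, 0, 2], [0, 0, 2]]
print(pooled)     # [[0, 2], [0, 2]]
```

The 5×5 input shrinks to 3×3 after the unpadded convolution and to 2×2 after pooling, which matches the size reduction described for both layers.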
A few examples of what max pooling helps with:
Recognizing cats whether they are standing or lying down
Recognizing eyes regardless of their color
Recognizing a face whether it is smiling or frowning

4. Fully connected layer
[Figure: Fully connected layer]

Recurrent Neural Networks (RNN)

Recurrent Neural Networks (RNNs)
A recurrent neural network (RNN) is a type of deep learning neural network that uses sequential data or time series data.
These deep learning algorithms are commonly used for ordinal or temporal problems, such as language translation, natural language processing (NLP), speech recognition, and image captioning; they are incorporated into popular applications such as Siri, voice search, and Google Translate.
RNNs are distinguished by their "memory": they take information from prior inputs to influence the current input and output.
While traditional deep neural networks assume that inputs and outputs are independent of each other, the output of a recurrent neural network depends on the prior elements within the sequence.

[Figure: Types of Input and Output to RNN]

[Figure: What is for dinner?]

[Figure: Types of RNN based on different lengths of inputs and outputs]

Types of RNN
The idea behind RNNs is to make use of sequential information. If you want to predict the next word in a sentence, you had better know which words came before it.
RNNs are called recurrent because they perform the same task for every element of a sequence, with the output being dependent on the previous computations.
Another way to think about RNNs is that they have a "memory" which captures information about what has been calculated so far.

[Figure: RNN Structure]

SoftMax Activation Function
The activation function is an integral part of a neural network. Without an activation function, a neural network is a simple linear regression model. This means the activation function gives non-linearity to the neural network.
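
The claim that a network without activation functions is just a linear model can be checked numerically. The sketch below is only an illustration under stated assumptions: the weight matrices and the input are hypothetical, biases are omitted for brevity, and the helper names are introduced here. It shows that two stacked linear layers give the same output as one layer whose weights are the product of the two matrices, and that inserting a ReLU between them breaks this equivalence.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def matmul(left, right):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(left[i][k] * right[k][j] for k in range(len(right))) for j in range(len(right[0]))]
        for i in range(len(left))
    ]


def relu_vector(vector):
    """Apply ReLU to each element."""
    return [max(0, v) for v in vector]


# Hypothetical weights: 2 inputs -> 3 hidden units -> 1 output, no biases.
hidden_weights = [[1, 2], [0, 1], [3, -1]]
output_weights = [[1, -1, 2]]
x = [1, 5]

# Two linear layers applied one after the other.
two_linear_layers = matvec(output_weights, matvec(hidden_weights, x))

# One linear layer with the combined weight matrix.
combined_weights = matmul(output_weights, hidden_weights)
single_linear_layer = matvec(combined_weights, x)

# Same two layers with a ReLU between them.
with_relu = matvec(output_weights, relu_vector(matvec(hidden_weights, x)))

print(two_linear_layers, single_linear_layer, with_relu)  # [2] [2] [6]
```

Because the purely linear stack collapses to `combined_weights`, extra layers add no expressive power until a non-linear activation is placed between them.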
Softmax is used for multi-class classification problems; it calculates a probability for each class.
The softmax activation function can be mathematically expressed as:
softmax(z_i) = exp(z_i) / sum_j exp(z_j)

SoftMax Example
Suppose the values of Z21, Z22, and Z23 come out to be 2.33, -1.46, and 0.56 respectively. The softmax activation function is applied to each of these neurons and the following values are generated: approximately 0.838, 0.019, and 0.143.
In this case it is clear that the input belongs to class 1.

Vanishing Gradient Problem in RNN
The gradient calculated deep in the network is "diluted" as it moves back through the many time steps (layers) of the RNN. This is called the vanishing gradient problem.
The gradients shrink because the derivative is very small along the layers, particularly with the sigmoid function, whose derivative ranges from 0 to 0.25.
The lower the gradient is, the harder it is for the network to update the weights and the longer it takes to reach the final result; the network may never converge to the global minimum.
Solution: Long Short-Term Memory (LSTM)
LSTMs enable RNNs to remember inputs over a long period of time.
This is an advanced topic in deep learning.

Neural Networks and Overfitting
Large neural nets trained on relatively small datasets can overfit the training data.
Dropout: a regularization technique to prevent neural networks from overfitting.
The term "dropout" refers to dropping out some nodes in a neural network.
Each node is dropped with a dropout probability p.

Dropout
[Figure: Dropout]
In the image above, consider the middle layer: in the first forward pass, dropout may disable the 1st and 3rd nodes, while in the second forward pass the 2nd and 4th nodes may be disabled instead.
In each iteration, during backpropagation, the weights corresponding to the disabled nodes are not updated.
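
Dropout on a single layer can be sketched as follows. This is a minimal illustration, not a reference implementation: the function name, the activation values, and the seeds are hypothetical, and the sketch shows only the plain "disable each node with probability p" behaviour described above.

```python
import random


def dropout(activations, p, rng):
    """Disable each node independently with probability p.

    Returns the masked activations and the 0/1 mask that was applied.
    """
    mask = [0 if rng.random() < p else 1 for _ in activations]
    dropped = [a * m for a, m in zip(activations, mask)]
    return dropped, mask


# Hypothetical activations of a four-node middle layer.
layer_output = [0.7, 1.2, 0.4, 0.9]

# Two forward passes draw two independent masks, so different nodes may be disabled.
first_pass, first_mask = dropout(layer_output, p=0.5, rng=random.Random(1))
second_pass, second_mask = dropout(layer_output, p=0.5, rng=random.Random(2))
print(first_mask, first_pass)
print(second_mask, second_pass)
```

A disabled node outputs zero, so no error flows back through it and its weights receive no update in that iteration. Many libraries additionally rescale the surviving activations by 1/(1 - p) during training ("inverted dropout") so that the expected activation stays the same; that step is left out here.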