Artificial Neural Network (ANN) Lecture Notes
Summary
These lecture notes cover the fundamental concepts of artificial neural networks (ANNs). The document explores the relationship between biological neurons and artificial neural networks, including the perceptron model and different activation functions. It also discusses the training process, gradient descent optimization algorithms, and backpropagation.
Full Transcript
ARTIFICIAL NEURAL NETWORK (ANN)

If the whole idea behind neural networks is to have computers artificially mimic biological intelligence, we should probably build a general understanding of how biological neurons work!

[Figures: stained neurons in the cerebral cortex; an illustration of biological neurons; a simplified biological neuron model showing the dendrites, nucleus, and axon.]

The perceptron was a form of neural network introduced in 1958 by Frank Rosenblatt. Amazingly, even back then he saw huge potential:

"...perceptron may eventually be able to learn, make decisions, and translate languages."

However, in 1969 Marvin Minsky and Seymour Papert published their book Perceptrons, which suggested that there were severe limitations to what perceptrons could do. This marked the beginning of what is known as the AI Winter, with little funding for AI and neural networks in the 1970s.

Fortunately for us, we now know the amazing power of neural networks, which all stem from the simple perceptron model, so let's head back and convert our simple biological neuron model into the perceptron model. We can expand this to a generalization: each input x_i is multiplied by a weight w_i, a bias term b_i is added, and the combined sum is passed through a function f to produce the output y. Every connection between neurons in the network has a weight; these weights determine the strength of the connection, and initially they are set randomly. Mathematically, our generalization was:

y = f(x_1*w_1 + b_1 + x_2*w_2 + b_2 + ... + x_n*w_n + b_n)

We've been able to model a biological neuron as a simple perceptron! A single perceptron won't be enough to learn complicated systems. Fortunately, we can expand on the idea of a single perceptron to create a multi-layer perceptron model, built by connecting layers of perceptrons. Hidden layers are difficult to interpret, due to their high interconnectivity and their distance from known input or output values.

Terminology:
- Input layer: the first layer, which directly accepts real data values.
- Hidden layer: any layer between the input and output layers.
- Output layer: the final estimate of the output.

What is incredible about the neural network framework is that it can be used to approximate any function. Zhou Lu and later Boris Hanin proved mathematically that neural networks can approximate any continuous function. For more details, check out the Wikipedia page for "Universal Approximation Theorem".

How Does an ANN Work?

Activation Functions

An activation function is the function used to get the output of a node; it is also known as a transfer function. It determines the output of the neural network for a classification problem, mapping the resulting values into a range such as 0 to 1 or -1 to 1 (depending on the function). Activation functions can be divided into two basic types:
1. Linear activation functions
2. Non-linear activation functions

LINEAR ACTIVATION FUNCTION

A linear activation function has the form f(x) = ax: the output is simply proportional to the input.

NONLINEAR ACTIVATION FUNCTIONS: SIGMOID

The sigmoid (logistic) function, sigmoid(x) = 1/(1 + exp(-x)), maps any input into the range (0, 1).

NONLINEAR ACTIVATION FUNCTIONS: TANH

Tanh is also like the logistic sigmoid, but better. The range of the tanh function is (-1, 1). Tanh is also sigmoidal (s-shaped). The advantage is that negative inputs are mapped strongly negative and zero inputs are mapped near zero in the tanh graph. The function is differentiable, and it is monotonic while its derivative is not monotonic.

NONLINEAR ACTIVATION FUNCTIONS: ReLU

ReLU (Rectified Linear Unit), f(x) = max(0, x), outputs zero for negative inputs and the input itself otherwise; it is cheap to compute and widely used in deep networks.

NONLINEAR ACTIVATION FUNCTIONS: SOFTMAX

Softmax is an activation function that scales numbers/logits into probabilities. The output of a softmax is a vector v with the probability of each possible outcome, and these probabilities sum to one over all possible outcomes or classes. Mathematically, softmax is defined as:

softmax(z_i) = exp(z_i) / Σ_j exp(z_j)
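To make these definitions concrete, here is a minimal NumPy sketch (not from the original slides) of the perceptron computation and the activation functions above; for simplicity the per-input bias terms b_i are folded into a single bias b, and the input and weight values are arbitrary illustrations.

```python
import numpy as np

# Activation functions described above.
def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))    # squashes any input into (0, 1)

def tanh(x):
    return np.tanh(x)                  # squashes any input into (-1, 1)

def relu(x):
    return np.maximum(0.0, x)          # zero for negative inputs, identity otherwise

def softmax(z):
    e = np.exp(z - np.max(z))          # subtract the max for numerical stability
    return e / e.sum()                 # probabilities that sum to one

# The perceptron generalization: y = f(x1*w1 + x2*w2 + ... + xn*wn + b)
def perceptron(x, w, b, f=sigmoid):
    return f(np.dot(x, w) + b)

x = np.array([2.0, -1.0, 0.5])         # arbitrary example inputs
w = np.array([0.3, 0.1, -0.4])         # weights (set randomly in practice)
print(perceptron(x, w, b=0.05))        # sigmoid(0.35) ≈ 0.587
print(softmax(np.array([2.0, 1.0, 0.1])))  # ≈ [0.66, 0.24, 0.10]
```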
Training Neural Networks

The key to an ANN's success is its ability to learn from data. During training, the network is fed a large dataset for which the correct answers are known. The network compares its predictions to the actual answers and adjusts the weights of its connections to reduce the error. This process is typically done with optimization algorithms such as gradient descent, with the gradients computed by backpropagation.

When we start off with our neural network, we initialize the weights randomly. Obviously, this won't give very good results. In the process of training, we want to start with a badly performing neural network and wind up with a network with high accuracy; in terms of the loss function, we want the loss to be much lower at the end of training. Improving the network is possible because we can change its function by adjusting the weights: we want to find a function that performs better than the initial one. The problem of training is therefore equivalent to the problem of minimizing the loss function. Why minimize a loss instead of maximizing something like accuracy? It turns out that a loss is a much easier function to optimize. There are many algorithms that optimize functions; they can be gradient-based or not, in the sense that gradient-based algorithms use not only the information provided by the function's values but also by its gradient.

Gradient Descent

A gradient measures the change in the error with regard to a change in the weights. You can also think of a gradient as the slope of a function: the higher the gradient, the steeper the slope and the faster a model can learn. But if the slope is zero, the model stops learning. In mathematical terms, the gradient is the vector of partial derivatives of the function with respect to its inputs.

Smaller step sizes take longer to find the minimum; larger steps are faster, but we risk overshooting the minimum! This step size is known as the learning rate:
- Learning rate too large: the next point will perpetually bounce haphazardly across the bottom of the well.
- Learning rate too small: learning will take too long.
- Learning rate just right: the steps settle into the minimum efficiently.
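To make the role of the learning rate concrete, here is a minimal sketch (not from the original notes) of gradient descent on a one-dimensional quadratic loss L(w) = (w - 3)^2; the quadratic is just an illustrative stand-in for the "well" in the figure.

```python
# Gradient descent on L(w) = (w - 3)^2, whose gradient is dL/dw = 2*(w - 3).
def gradient(w):
    return 2.0 * (w - 3.0)

def descend(learning_rate, steps=25, w=0.0):
    for _ in range(steps):
        w = w - learning_rate * gradient(w)  # step against the slope
    return w

print(descend(0.01))  # too small: after 25 steps still far from the minimum at w = 3
print(descend(0.1))   # about right: converges close to 3
print(descend(1.1))   # too large: overshoots and diverges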
Types of Gradient Descent

Batch gradient descent, also called vanilla gradient descent, calculates the error for each example within the training dataset, but the model is only updated after all training examples have been evaluated. This whole process is like a cycle, and it is called a training epoch. Among the advantages of batch gradient descent is its computational efficiency: it produces a stable error gradient and a stable convergence. A disadvantage is that the stable error gradient can sometimes result in a state of convergence that isn't the best the model can achieve. It also requires the entire training dataset to be in memory and available to the algorithm.

Stochastic gradient descent (SGD) instead does this for each training example within the dataset, meaning it updates the parameters one training example at a time. Depending on the problem, this can make SGD faster than batch gradient descent. One advantage is that the frequent updates give us a pretty detailed rate of improvement. The frequent updates, however, are more computationally expensive than the batch gradient descent approach. Additionally, the frequency of those updates can produce noisy gradients, which may cause the error rate to jump around instead of slowly decreasing.

Mini-batch gradient descent is the go-to method, since it combines the concepts of SGD and batch gradient descent: it splits the training dataset into small batches and performs an update for each of those batches. This creates a balance between the robustness of stochastic gradient descent and the efficiency of batch gradient descent. Common mini-batch sizes range between 50 and 256, but as with any other machine learning technique, there is no clear rule, because it varies across applications. Mini-batch is the go-to algorithm when training a neural network, and it is the most common type of gradient descent within deep learning.

Gradient Descent Optimization Algorithms

The learning rate shown in the figure was constant (each step size was equal), but we can be clever and adapt our step size as we go along: we could start with larger steps, then take smaller ones as the slope gets closer to zero. This can be done using gradient descent optimization algorithms. There are lots of optimization algorithms, such as Momentum, NAG, Adagrad, Adadelta, RMSprop, Adam, and Nadam.

ADAM (Adaptive Moment Estimation)

In 2015, Kingma and Ba published their paper "Adam: A Method for Stochastic Optimization". Adam is a much more efficient way of searching for these minima, so you will see it used all the time in Python code; it is the most popular optimization algorithm today. Adam computes adaptive learning rates for each parameter. In addition to storing an exponentially decaying average of past squared gradients v_t, like Adadelta and RMSprop, Adam also keeps an exponentially decaying average of past gradients m_t, similar to momentum.
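As a sketch of how Adam combines these two moving averages, here is a minimal NumPy implementation (not from the original notes) of the update rule from Kingma and Ba's paper, with the paper's default hyperparameters; the quadratic test function at the end is just a hypothetical example.

```python
import numpy as np

def adam_step(w, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update (Kingma & Ba, 2015).

    m: exponentially decaying average of past gradients (like momentum)
    v: exponentially decaying average of past squared gradients
    t: 1-based timestep, used for bias correction
    """
    m = beta1 * m + (1 - beta1) * grad           # update first moment m_t
    v = beta2 * v + (1 - beta2) * grad ** 2      # update second moment v_t
    m_hat = m / (1 - beta1 ** t)                 # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)                 # bias-corrected second moment
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)  # adaptive step per parameter
    return w, m, v

# Minimize the same kind of quadratic bowl as before, L(w) = (w - 3)^2.
w, m, v = 0.0, 0.0, 0.0
for t in range(1, 2001):
    grad = 2.0 * (w - 3.0)
    w, m, v = adam_step(w, grad, m, v, t, lr=0.05)
print(w)  # approaches the minimum at w = 3
```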
Cost Function

A cost function is a measure of the error between the value your model predicts and the actual value.

Loss Function

A loss function is a method of evaluating how well your algorithm models your dataset. If your predictions are totally off, the loss function will output a higher number; if they're pretty good, it will output a lower number. As you tune your algorithm to improve your model, the loss function tells you whether you're making progress. "Loss" helps us understand how much the predicted value differs from the actual value. In Python, we use binary_crossentropy (for 2 categories only) and categorical_crossentropy (for more than 2 categories) for classification problems, while for regression we use mean_squared_error or mean_absolute_error.

Difference between Cost Function and Loss Function

Cost and loss functions are often used synonymously (some people also call them the error function). The more general scenario is to first define an objective function that you want to optimize. This objective function could be to:
- minimize a mean squared error cost (or loss)
- minimize a cross-entropy loss (or cost) function

Strictly speaking, the loss function (or error) is for a single training example, while the cost function is over the entire training set (or a mini-batch, for mini-batch gradient descent).

Backpropagation

The backpropagation algorithm is probably the most fundamental building block of a neural network. It was first introduced in the 1960s and was popularized almost 30 years later by Rumelhart, Hinton and Williams in their 1986 paper "Learning representations by back-propagating errors".

[Figure: illustration of back-propagation.]

According to that paper, backpropagation "repeatedly adjusts the weights of the connections in the network so as to minimize a measure of the difference between the actual output vector of the net and the desired output vector", and "the ability to create useful new features distinguishes back-propagation from earlier, simpler methods...". In other words, backpropagation aims to minimize the cost function by adjusting the network's weights and biases. The level of adjustment is determined by the gradients of the cost function with respect to those parameters.

The gradient of a function C(x_1, x_2, ..., x_m) at a point x is the vector of the partial derivatives of C at x. The derivative of a function C measures the sensitivity of the function's value (output) to a change in its argument x (input); in other words, the derivative tells us which direction C is going. The gradient shows how much the parameter x needs to change (in the positive or negative direction) to minimize C.

These gradients can be computed using the chain rule. For a single weight w_jk^l (the weight from node k in layer l-1 to node j in layer l), the gradient is:

∂C/∂w_jk^l = (∂C/∂z_j^l) * (∂z_j^l/∂w_jk^l) = δ_j^l * a_k^(l-1)

The main idea is that we can use the gradient to go back through the network and adjust the weights and biases to minimize the error vector produced at the output layer. Using some calculus notation, we can expand this idea to networks with multiple neurons per layer. To summarize, here are the steps you need to know for training neural networks:

Step 1: Using the input x, set the activation a for the input layer.
Step 2: Compute z and a for the current layer; the resulting a is then fed into the next layer (and so on).
Step 3: Compute the error vector.
Step 4: Backpropagate the error for each layer.

Now our model has all the standard neural network components of what people usually mean when they say "neural network":
- A set of nodes, analogous to neurons, organized in layers.
- A set of weights representing the connections between each layer and the layer beneath it. The layer beneath may be another neural network layer, or some other kind of layer.
- A set of biases, one for each node.
- An activation function that transforms the output of each node in a layer. Different layers may have different activation functions.

Math behind Artificial Neural Networks

Below is a sample dataset with inputs A, B, C and an output labelled "Target". [Table omitted in transcript.] First, normalize the dataset. A simple feed-forward back-propagation artificial neural network is shown below, with one input layer, one hidden layer, and one output layer. Let's say all the weights are 0.1.

The training consists of two steps:
1. Forward pass: the inputs pass through the network into the output layer, producing the output.
2. Back-propagation: the error is propagated backwards through the network, adjusting the weights.

A simple neuron does two things:
1. Summation of the input values with their respective weights at each node.
2. Activation of the summed input signal using an activation function F(X). Let's say we use the sigmoid function in this example.

Let us consider only the calculations at hidden node 1, H1. Value at the hidden layer:

H1 = (1 * 0.1) + (1 * 0.1) + (1 * 0.1) = 0.3

Applying the sigmoid activation function: sigmoid(0.3) = 1/(1 + exp(-0.3)) ≈ 0.57

Similar calculations are done at each hidden node, and the values are passed on to the output layer.
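Here is a minimal NumPy sketch (not from the original notes) that reproduces this hidden-node calculation; it assumes, as the worked example implies, one data row with A = B = C = 1, all weights initialized to 0.1, and no bias terms.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Assumed from the worked example: inputs A = B = C = 1 and all weights 0.1.
x = np.array([1.0, 1.0, 1.0])
w_h1 = np.array([0.1, 0.1, 0.1])      # weights into hidden node H1

z_h1 = np.dot(x, w_h1)                # (1*0.1) + (1*0.1) + (1*0.1) = 0.3
h1 = sigmoid(z_h1)                    # 1/(1 + exp(-0.3)) ≈ 0.574
print(z_h1, h1)
```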
Value at the node in the output layer:

Output node input = (0.57 * 0.1) + (0.57 * 0.1) + (0.57 * 0.1) + (0.57 * 0.1) = 0.228

Applying the sigmoid activation function: sigmoid(0.228) = 1/(1 + exp(-0.228)) ≈ 0.56

The error is the difference between the value predicted by the network (Output) and the original value (Target):

Error = 0.5 * (1 - 0.56)² = 0.0968

The error value here is calculated for only one data point; the error is normally calculated once over all the data points, or in batches.

Error propagation at the output node: the change of the error with respect to only one weight is shown below, the weight W connecting node H1 in the hidden layer to the output layer. The term on the left-hand side of the equation below is the amount by which the weights connecting the hidden and output layers have to be updated to minimize the error so that the output value matches the target. Applying the chain rule, we get:

∂Error/∂W = (∂Error/∂Y) * (∂Y/∂Z) * (∂Z/∂W)

Splitting and calculating each term in the above equation uses the error formula:

Error = 0.5 * (Target - Output)²

The weights get updated using the error information at each node in the hidden and output layers. The updated weights after the first iteration, for the weights connecting the input and hidden layers (using a learning rate of 0.01):

0.1 - (0.01 * -0.1084) = 0.1011

(Since the inputs and all the weights have the same value, the updated weights are the same for all connections.)

With the newly updated weights, the feed-forward pass and back-propagation iterate until the weights settle at values that minimize the error, making the output value match the target value. The algorithm iterates over all the rows in the data table, and finally the ANN settles on particular weights, making it ready for prediction. The final step after training is to feed input data into the trained neural network to obtain predictions.

Now try this!

A FIRST NEURAL NETWORK

In this exercise, we will train our first little neural net. Neural nets give us a way to learn nonlinear models without the use of explicit feature crosses.

Task 1: The model as given combines our two input features into a single neuron. Will this model learn any nonlinearities? Run it to confirm your guess.

Task 2: Try increasing the number of neurons in the hidden layer from 1 to 2, and try changing from a linear activation to a nonlinear activation like ReLU. Can you create a model that can learn nonlinearities? Can it model the data effectively?

Task 3: Try increasing the number of neurons in the hidden layer from 2 to 3, using a nonlinear activation like ReLU. Can it model the data effectively? How does model quality vary from run to run?

Task 4: Continue experimenting by adding or removing hidden layers and neurons per layer. Also feel free to change learning rates, regularization, and other learning settings. What is the smallest number of neurons and layers you can use that gives a test loss of 0.177 or lower? Does increasing the model size improve the fit, or how quickly it converges? Does it change how often the model converges to a good result? For example, try the following architecture:
- First hidden layer with 3 neurons.
- Second hidden layer with 3 neurons.
- Third hidden layer with 2 neurons.
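To tie the worked example together, here is a minimal NumPy sketch (not from the original notes) that reproduces the forward pass, the error, and the first hidden-to-output weight update. It assumes, as the example implies, three inputs all equal to 1, four hidden nodes (the output sum has four 0.57 * 0.1 terms), a target of 1, and a learning rate of 0.01; the variable names are illustrative, and because it does not round intermediate values its numbers differ slightly from the rounded ones in the notes.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Assumptions inferred from the worked example: inputs A = B = C = 1,
# four hidden nodes, every weight 0.1, target 1, learning rate 0.01,
# and no bias terms.
x = np.array([1.0, 1.0, 1.0])
target, lr = 1.0, 0.01

h = sigmoid(np.full(4, np.dot(x, np.full(3, 0.1))))  # each hidden node ≈ 0.574
out = sigmoid(np.dot(h, np.full(4, 0.1)))            # sigmoid(0.230) ≈ 0.557

error = 0.5 * (target - out) ** 2                    # ≈ 0.098 (notes round to 0.0968)

# Delta at the output node: dError/dY * dY/dZ.
delta_out = -(target - out) * out * (1.0 - out)      # ≈ -0.109 (notes round to -0.1084)

# Chain rule for the weight from H1 to the output node:
# dError/dW = dError/dY * dY/dZ * dZ/dW = delta_out * H1.
grad_w = delta_out * h[0]                            # ≈ -0.063
w_new = 0.1 - lr * grad_w                            # ≈ 0.1006
print(error, delta_out, w_new)
```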