Feed Forward and Back-Propagation Neural Networks

Feed forward and Back-Propagation Neural networks
G.Karthikeyan, AP/AI

Feed forward Neural networks
▪ It is a type of artificial neural network in which the connections between nodes do not form a loop.
▪ It is so named because all information flows in the forward direction only.
▪ Information moves forward from the input nodes, through the hidden nodes (if any), to the output nodes.
▪ It was the first and simplest type of artificial neural network invented.
▪ For a classifier, the network assigns the input x to a category y.
▪ The feedforward network defines a mapping y = f(x; θ).
▪ It then learns the value of θ that most closely approximates the target function.

Feedforward Neural Network's Layers
1. Input layer
It contains the neurons that receive the input. The data is then passed on to the next layer. The number of neurons in the input layer is equal to the number of variables in the dataset.
2. Hidden layer
This is the intermediate layer, which sits between the input and output layers. It can contain a large number of neurons that apply transformations to the inputs and pass the results on toward the output layer.
3. Output layer
It is the last layer, and its form depends on how the model is constructed. The output layer produces the predicted feature, i.e. the quantity whose desired value is known during training.
4. Neuron weights
Weights describe the strength of a connection between neurons. A weight is a real number that may be positive or negative; it is not limited to the range 0 to 1.

Backpropagation in Neural Network
Backpropagation is a method of training neural networks to perform tasks more accurately. The idea was first described for this purpose by Werbos in 1974 and was popularized in 1986 by Rumelhart, Hinton, and Williams. The term backpropagation is short for "backward propagation of errors". It works especially well for feed forward neural networks (networks without any loops) and problems that require supervised learning.

Why We Need Backpropagation?
When designing a neural network, we begin by initializing the weights with some random values. There is no reason to expect that the weight values we selected are correct or fit our model best. We have selected some weight values in the beginning, but our model output is very different from the actual output, i.e. the error value is huge.

Now, how will you reduce the error? We need some way to make the model change its parameters (weights) such that the error becomes minimum. In other words, we need to train our model. One way to train our model is called backpropagation.

Backpropagation process
Summarizing the steps:
▪ Calculate the error: how far the model output is from the actual output.
▪ Minimum error: check whether the error is minimized or not.
▪ Update the parameters: if the error is huge, update the parameters (weights and biases), then check the error again. Repeat the process until the error becomes minimum.
▪ Model is ready to make a prediction: once the error becomes minimum, we can feed some inputs to the model and it will produce the output.

Feed forward vs Back-Propagation
▪ Direction of information
  Feed forward: Information flows in only one direction, i.e. from input layer to output layer.
  Back-Propagation: Information passes from input layer to output layer to produce a result. The error in the result is then communicated back to the previous layers.
▪ Weights
  Feed forward: Once the weights are decided, they are not usually changed.
  Back-Propagation: Weights are re-adjusted.
▪ Awareness of error
  Feed forward: The nodes do their job without being aware whether the results produced are accurate or not.
  Back-Propagation: Nodes get to know how much they contributed to the answer being wrong.

Perceptron
Frank Rosenblatt, an American psychologist, proposed the classical perceptron model in 1958. It was further refined and carefully analyzed by Minsky and Papert (1969); their model is referred to as the perceptron model.
The perceptron model, proposed by Minsky-Papert, is a more general computational model than the McCulloch-Pitts neuron. It overcomes some of the limitations of the M-P neuron by introducing the concept of numerical weights (a measure of importance) for inputs, and a mechanism for learning those weights. Inputs are no longer limited to boolean values as in the case of an M-P neuron; real inputs are supported as well, which makes the model more useful and general.

This is very similar to an M-P neuron, but we take a weighted sum of the inputs and set the output to one only when the sum reaches a threshold (theta): y = 1 if w_1 x_1 + ... + w_n x_n ≥ theta, else y = 0. By convention, instead of hand coding the thresholding parameter theta, we add a constant input x_0 = 1 with the weight w_0 = -theta, so the condition becomes w_0 x_0 + w_1 x_1 + ... + w_n x_n ≥ 0. This makes the threshold learnable like any other weight.

Consider the task of predicting whether I would watch a random game of football on TV or not (the same example from my M-P neuron post) using the behavioral data available. Let's assume my decision depends solely on 3 binary inputs (binary for simplicity). Here, w_0 is called the bias because it represents the prior (prejudice). A football freak may have a very low threshold and may watch any football game irrespective of the league, club or importance of the game [theta = 0]. On the other hand, a selective viewer like me may only watch a football game that is a Premier League game, features Man United and is not a friendly [theta = 2]. The point is, the weights and the bias will depend on the data (my viewing history in this case). Based on the data, the model may have to give a lot of importance (high weight) to the isManUnitedPlaying input and penalize the weights of other inputs.

Perceptron vs McCulloch-Pitts Neuron
What kind of functions can be implemented using a perceptron? How different is it from McCulloch-Pitts neurons?
From the equations, it is clear that even a perceptron separates the input space into two halves, positive and negative. All the inputs that produce an output 1 lie on one side (positive half space) and all the inputs that produce an output 0 lie on the other side (negative half space). In other words, a single perceptron can only be used to implement linearly separable functions, just like the M-P neuron.

Then what is the difference? Why do we claim that the perceptron is an updated version of an M-P neuron? Here, the weights, including the threshold, can be learned and the inputs can be real values.

In a perceptron, weights and bias are parameters that determine the behavior and decision-making process of the neuron. They are essential components that allow the perceptron to learn and make predictions.

Weights: Each input to a perceptron is associated with a weight. The weights represent the strength or importance of the respective input in influencing the neuron's decision. In other words, they determine how much each input contributes to the overall activation of the perceptron. The perceptron multiplies each input by its corresponding weight and sums up these weighted inputs.

Bias: The purpose of the bias term is to introduce an additional degree of freedom in the decision-making process of the perceptron. It allows the perceptron to have a non-zero output even when all the input values are 0. The bias term shifts the decision boundary of the perceptron, influencing its overall activation and decision.

During the learning process, the weights and bias of a perceptron are adjusted iteratively based on the perceptron learning algorithm. This algorithm aims to find values for the weights and bias that minimize the error between the perceptron's predictions and the desired outputs.

Boolean Functions Using Perceptron
OR Function: Can Do!
Just revisiting the good old OR function the perceptron way. With the bias weight w_0 and two inputs, the four rows of the OR truth table give four conditions:
(0,0) → 0: w_0 < 0
(0,1) → 1: w_0 + w_2 ≥ 0
(1,0) → 1: w_0 + w_1 ≥ 0
(1,1) → 1: w_0 + w_1 + w_2 ≥ 0
One possible solution, for example w_0 = -1, w_1 = 1.1, w_2 = 1.1, is obtained by solving this system of inequalities. It is clear that the solution separates the input space into two spaces, negative and positive half spaces. I encourage you to try it out for AND and other boolean functions.

Now if you actually try to solve the inequalities above, you will realize that there can be multiple solutions. But which solution is the best? To define the 'best' solution more formally, we need to understand errors and error surfaces, which we will do in my next post on the Perceptron Learning Algorithm.

XOR Function: Can't Do!
Now let's look at a non-linear boolean function, i.e. one where you cannot draw a line to separate positive inputs from the negative ones. For XOR the four conditions are:
(0,0) → 0: w_0 < 0
(0,1) → 1: w_0 + w_2 ≥ 0
(1,0) → 1: w_0 + w_1 ≥ 0
(1,1) → 0: w_0 + w_1 + w_2 < 0
Notice that the fourth condition contradicts the second and the third: adding the second and third gives 2 w_0 + w_1 + w_2 ≥ 0, and since w_0 < 0 this forces w_0 + w_1 + w_2 > 0. The point is, there are no perceptron solutions for data that is not linearly separable. So the key takeaway is that a single perceptron cannot learn to separate data that is non-linear in nature.

The XOR Affair
In the book published by Minsky and Papert in 1969, the authors implied that, since a single artificial neuron is incapable of implementing some functions such as the XOR logical function, larger networks also have similar limitations, and therefore should be dropped. Later research on three-layered perceptrons showed how to implement such functions, saving the technique from obliteration.

Link: https://towardsdatascience.com/perceptron-the-artificial-neuron-4d8c70d5cc8d

Perceptron Learning Algorithm
The perceptron was originally proposed by Frank Rosenblatt in 1958, and later refined and carefully analyzed by Minsky and Papert in 1969.

Perceptron
A perceptron is not the Sigmoid neuron we use in ANNs or any deep learning networks today. The perceptron model is a more general computational model than the McCulloch-Pitts neuron.
It takes an input, aggregates it (weighted sum) and returns 1 only if the aggregated sum is more than some threshold else returns 0. Rewriting the threshold as shown above and making it a constant input with a variable weight, we would end up with something like the following: A single perceptron can only be used to implement linearly separable functions. It takes both real and boolean inputs and associates a set of weights to them, along with a bias (the threshold thing I mentioned above). We learn the weights, we get the function. Let's use a perceptron to learn an OR function. OR Function Using A Perceptron What’s going on above is that we defined a few conditions (the weighted sum has to be more than or equal to 0 when the output is 1) based on the OR function output for various sets of inputs, we solved for weights based on those conditions and we got a line that perfectly separates positive inputs from those of negative. Doesn’t make any sense? Maybe now is the time you go through that post I was talking about. Minsky and Papert also proposed a more principled way of learning these weights using a set of examples (data). Mind you that this is NOT a Sigmoid neuron and we’re not going to do any Gradient Descent Warming Up — Basics Of Linear Algebra Vector A vector can be defined in more than one way. For a physicist, a vector is anything that sits anywhere in space, has a magnitude and a direction. For a CS guy, a vector is just a data structure used to store some data — integers, strings etc. For this tutorial, I would like you to imagine a vector the Mathematician way, where a vector is an arrow spanning in space with its tail at the origin. This is not the best mathematical way to describe a vector but as long as you get the intuition, you’re good to go. 
Vector Representations
A 2-dimensional vector can be drawn as an arrow on a 2D plane, with its tail at the origin. Carrying the idea forward to 3 dimensions, we get an arrow in 3D space.

Dot Product Of Two Vectors
At the cost of making this tutorial even more boring than it already is, let's look at what a dot product is. Imagine you have two vectors of size n+1, w and x. The dot product of these vectors (w.x) is computed as follows:
w.x = w_0 x_0 + w_1 x_1 + ... + w_n x_n
Here, w and x are just two lonely arrows in an n+1 dimensional space (and intuitively, their dot product quantifies how much one vector is going in the direction of the other). So technically, the perceptron was only computing a lame dot product (before checking if it's greater or lesser than 0). The decision boundary which a perceptron gives out, separating positive examples from the negative ones, is really just w.x = 0.

Angle Between Two Vectors
Now the same old dot product can be computed differently if only you knew the angle alpha between the vectors and their individual magnitudes. Here's how:
w.x = |w| |x| cos(alpha)
The other way around, you can get the angle between two vectors if you know the vectors, given you know how to calculate vector magnitudes and their vanilla dot product:
cos(alpha) = w.x / (|w| |x|)
When I say that the cosine of the angle between w and x is 0, what do you see? I see arrow w being perpendicular to arrow x in an n+1 dimensional space (in 2-dimensional space to be honest). So basically, when the dot product of two vectors is 0, they are perpendicular to each other.

Setting Up The Problem
We are going to use a perceptron to estimate if I will be watching a movie based on historical data with the above-mentioned inputs. The data has positive and negative examples, positive being the movies I watched, i.e. 1. Based on the data, we are going to learn the weights using the perceptron learning algorithm. For visual simplicity, we will only assume two-dimensional input.
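Before moving on to the learning algorithm, the dot product and angle relations from the warm-up can be checked numerically. The following is only an illustrative sketch: the helper names (dot, norm, angle_between) and the example vectors are made up for this purpose and do not come from the original post.

```python
import math

def dot(w, x):
    """Dot product of two equal-length vectors."""
    return sum(wi * xi for wi, xi in zip(w, x))

def norm(v):
    """Magnitude (Euclidean length) of a vector."""
    return math.sqrt(dot(v, v))

def angle_between(w, x):
    """Angle in degrees, from cos(alpha) = w.x / (|w| |x|)."""
    cos_alpha = dot(w, x) / (norm(w) * norm(x))
    # Clamp to [-1, 1] to guard against floating-point rounding.
    cos_alpha = max(-1.0, min(1.0, cos_alpha))
    return math.degrees(math.acos(cos_alpha))

# Hypothetical 2-dimensional vectors chosen for illustration.
w = [1.0, 0.0]
x_perp = [0.0, 2.0]     # perpendicular to w: dot product is 0
x_acute = [1.0, 1.0]    # angle below 90 degrees: dot product is positive
x_obtuse = [-1.0, 1.0]  # angle above 90 degrees: dot product is negative

for x in (x_perp, x_acute, x_obtuse):
    print(x, dot(w, x), angle_between(w, x))
```

The sign of w.x tells on which side of the boundary w.x = 0 a point lies, which is exactly the quantity the perceptron thresholds.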
Perceptron Learning Algorithm
Our goal is to find the w vector that can perfectly classify positive inputs and negative inputs in our data. I will get straight to the algorithm. Here goes:

We initialize w with some random vector. We then iterate over all the examples in the data (P U N), both positive and negative. Now if an input x belongs to P, ideally what should the dot product w.x be? I'd say greater than or equal to 0, because that's the only thing our perceptron wants at the end of the day, so let's give it that. And if x belongs to N, the dot product MUST be less than 0. So if you look at the if conditions in the while loop:
Case 1: When x belongs to P and its dot product w.x < 0
Case 2: When x belongs to N and its dot product w.x ≥ 0
Only for these cases do we update our randomly initialized w, because Case 1 and Case 2 violate the very rule of a perceptron. Otherwise, we don't touch w at all. So we add x to w (ahem vector addition ahem) in Case 1 and subtract x from w in Case 2.

The equation w = w + x is a specific update rule used in the perceptron learning algorithm to adjust the weights of the perceptron. This update rule is applied when the perceptron misclassifies an input vector x that belongs to the positive class P.

Update equation (general form, with learning rate alpha, target y_i and prediction y_i'):
w = w + del w, where del w = alpha * (y_i - y_i') * x_i
b = b + del b, where del b = alpha * (y_i - y_i')
With 0/1 outputs and alpha = 1, the factor (y_i - y_i') is +1 for a misclassified positive example and -1 for a misclassified negative example, which gives back w = w + x and w = w - x.

The perceptron learning algorithm aims to find a decision boundary that separates two classes. In this case, the positive class P is represented by input vectors that should be classified as positive (y = +1) by the perceptron. However, if the perceptron misclassifies an input vector x from the positive class (i.e., w·x < 0), it means that the current weights w do not properly separate the classes. To correct this misclassification, the update rule w = w + x is used. Adding the misclassified input vector x to the weights w has the effect of adjusting the decision boundary.
By adding x to w, the perceptron aims to shift the decision boundary so that x falls on the positive side, thereby increasing the likelihood of correctly classifying similar input vectors in the future. The addition of x to w has a geometric interpretation. The weights w determine the direction of the decision boundary, and adding x to w changes that direction, effectively rotating w towards x.

It's important to note that this update rule assumes that the positive class P can be separated from the negative class by a hyperplane. If the classes are not linearly separable, the perceptron learning algorithm may not converge. In such cases, alternative algorithms or modifications to the perceptron, such as using a multi-layer perceptron or non-linear activation functions, may be necessary.

Let's assume we have two inputs, x1 and x2, both of which can take binary values (0 or 1). We also have corresponding weights w1 and w2, and a bias term b.

1. Initialize the weights and bias to small random values:

    w1 = random_small_value
    w2 = random_small_value
    b = random_small_value

2. Define the activation function. In the perceptron learning algorithm, the activation function is a step function that returns 1 if the weighted sum of inputs plus the bias is greater than or equal to 0, and 0 otherwise. We can define it as follows:

    def step_function(x):
        return 1 if x >= 0 else 0

3. Train the perceptron using the training examples of the AND function until convergence or a predefined number of epochs:

    for x1, x2, y in training_set:
        y_pred = step_function(w1 * x1 + w2 * x2 + b)
        error = y - y_pred
        w1 = w1 + learning_rate * error * x1
        w2 = w2 + learning_rate * error * x2
        b = b + learning_rate * error

Here, y is the target output for the given inputs (x1, x2), y_pred is the predicted output using the current weights and bias, and learning_rate is a hyperparameter that controls the rate at which the weights and bias are updated.
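The training steps above can be put together into one runnable sketch. It makes a few assumptions that are not in the original text: the weights start at zero rather than at small random values (so the run is reproducible), the learning rate is 1.0, training stops after the first epoch with no mistakes, and the names train_perceptron and AND_EXAMPLES are made up for illustration.

```python
def step_function(z):
    """Step activation: 1 if the weighted sum is >= 0, else 0."""
    return 1 if z >= 0 else 0

def train_perceptron(examples, learning_rate=1.0, max_epochs=100):
    """Train a two-input perceptron with the update rule from the text.

    Assumption: weights start at zero instead of small random values.
    """
    w1, w2, b = 0.0, 0.0, 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for x1, x2, y in examples:
            y_pred = step_function(w1 * x1 + w2 * x2 + b)
            error = y - y_pred  # +1, 0 or -1
            if error != 0:
                w1 += learning_rate * error * x1
                w2 += learning_rate * error * x2
                b += learning_rate * error
                mistakes += 1
        if mistakes == 0:  # converged: every example is classified correctly
            break
    return w1, w2, b

# Truth table of the AND function as (x1, x2, target) triples.
AND_EXAMPLES = [(0, 0, 0), (0, 1, 0), (1, 0, 0), (1, 1, 1)]

w1, w2, b = train_perceptron(AND_EXAMPLES)
predictions = [step_function(w1 * x1 + w2 * x2 + b) for x1, x2, _ in AND_EXAMPLES]
print(w1, w2, b, predictions)
```

Because AND is linearly separable, the loop stops well before max_epochs; swapping in the XOR truth table instead would exhaust max_epochs without ever reaching zero mistakes.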
Once the weights and biases have been updated, you can use the perceptron to predict the output for new inputs: y_pred = step_function(w1 * x1 + w2 * x2 + b) By repeating steps 3 and 4, the perceptron learning algorithm learns the weights and biases Multi-layer perceptron Biological Neuron Deep learning Dendrite: Receives signals from other neurons Soma: Processes the information (central processing unit) Axon: Transmits the output of this neuron Synapse: Point of connection to other neurons Basically, a neuron takes an input signal (dendrite), processes it like the CPU (soma), passes the output through a cable like structure to other connected neurons (axon to synapse to other neuron’s dendrite). Now, this might be biologically inaccurate as there is a lot more going on out there but on a higher level, this is what is going on with a neuron in our brain — takes an input, processes it, throws out an output. Our sense organs interact with the outer world and send the visual and sound information to the neurons. Let's say you are watching Friends. Now the information your brain receives is taken in by the “laugh or not” set of neurons that will help you make a decision on whether to laugh or not. Each neuron gets fired/activated only when its respective criteria (more on this later) is met like shown below. Of course, this is not entirely true. In reality, it is not just a couple of neurons which would do the decision making. There is a massively parallel interconnected network of 10¹¹ neurons (100 billion) in our brain and their connections are not as simple as I showed you above. It might look something like this: Now the sense organs pass the information to the first/lowest layer of neurons to process it. And the output of the processes is passed on to the next layers in a hierarchical manner, some of the neurons will fire and some won’t and this process goes on until it results in a final response — in this case, laughter. 
This massively parallel network also ensures that there is a division of work. Each neuron fires only when its intended criteria are met, i.e. a neuron may perform a certain role in response to a certain stimulus. It is believed that neurons are arranged in a hierarchical fashion (however, many credible alternatives with experimental support have been proposed by scientists) and each layer has its own role and responsibility. To detect a face, the brain could be relying on the entire network and not on a single layer.

Now that we have established how a biological neuron works, let's look at what McCulloch and Pitts had to offer.

McCulloch-Pitts Neuron
The first computational model of a neuron was proposed by Warren McCulloch (neuroscientist) and Walter Pitts (logician) in 1943. It may be divided into 2 parts. The first part, g, takes an input (ahem dendrite ahem) and performs an aggregation; based on the aggregated value, the second part, f, makes a decision.

Let's suppose that I want to predict my own decision, whether or not to watch a random football game on TV. The inputs are all boolean, i.e. {0,1}, and my output variable is also boolean {0: Won't watch it, 1: Will watch it}. So:
x_1 could be isPremierLeagueOn (I like Premier League more)
x_2 could be isItAFriendlyGame (I tend to care less about the friendlies)
x_3 could be isNotHome (Can't watch it when I'm running errands. Can I?)
x_4 could be isManUnitedPlaying (I am a big Man United fan. GGMU!)
and so on.

These inputs can either be excitatory or inhibitory. Inhibitory inputs are those that have maximum effect on the decision making irrespective of other inputs: y = 0 if any inhibitory x_i is 1. For example, if x_3 is 1 (not home) then my output will always be 0, i.e. the neuron will never fire, so x_3 is an inhibitory input. Excitatory inputs are NOT the ones that will make the neuron fire on their own, but they might fire it when combined together.
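As a rough sketch of the behavior just described, an M-P unit can be written as a function that outputs 0 whenever any inhibitory input is on, and otherwise fires when enough excitatory inputs are on. The function name mp_neuron, the choice of which football inputs count as excitatory, and the threshold value theta = 2 are all assumptions made for illustration.

```python
def mp_neuron(excitatory, inhibitory, theta):
    """McCulloch-Pitts unit with boolean inputs and a hand-set threshold.

    excitatory, inhibitory: lists of 0/1 inputs.
    theta: threshold on the sum of the excitatory inputs (set by hand).
    """
    if any(inhibitory):  # a single active inhibitory input vetoes firing
        return 0
    return 1 if sum(excitatory) >= theta else 0

# Football example. Assumed roles: isPremierLeagueOn and isManUnitedPlaying
# are excitatory, isNotHome is inhibitory, and theta = 2 is a made-up threshold.
at_home_big_game = mp_neuron(excitatory=[1, 1], inhibitory=[0], theta=2)
away_big_game = mp_neuron(excitatory=[1, 1], inhibitory=[1], theta=2)
at_home_small_game = mp_neuron(excitatory=[1, 0], inhibitory=[0], theta=2)

print(at_home_big_game, away_big_game, at_home_small_game)
```

Note that nothing here is learned: both the threshold and the excitatory/inhibitory roles are fixed by hand, which is one of the limitations the perceptron later addresses.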
Formally, this is what is going on:
g(x_1, x_2, ..., x_n) = x_1 + x_2 + ... + x_n
y = f(g(x)) = 1 if g(x) ≥ theta, 0 if g(x) < theta
We can see that g(x) is just doing a sum of the inputs, a simple aggregation. And theta here is called the thresholding parameter. For example, if I always watch the game when the sum turns out to be 2 or more, theta is 2 here. This is called the Thresholding Logic.

Boolean Functions Using M-P Neuron
So far we have seen how the M-P neuron works. Now let's look at how this very neuron can be used to represent a few boolean functions. Mind you that our inputs are all boolean and the output is also boolean, so essentially the neuron is just trying to represent a boolean function. A lot of boolean decision problems, given appropriate input variables, can be cast into this form and represented by the M-P neuron: whether to continue reading this post, whether to watch Friends after reading this post, etc.

M-P Neuron: A Concise Representation
This representation just denotes that, for the boolean inputs x_1, x_2 and x_3, if g(x), i.e. the sum, is ≥ theta, the neuron will fire; otherwise, it won't. The table below is the truth table of the 3-input OR function:

x1  x2  x3  y
0   0   0   0
0   0   1   1
0   1   0   1
0   1   1   1
1   0   0   1
1   0   1   1
1   1   0   1
1   1   1   1

AND Function
An AND function neuron would only fire when ALL the inputs are ON, i.e. g(x) ≥ 3 here.

OR Function
I believe this is self explanatory, as we know that an OR function neuron would fire if ANY of the inputs is ON, i.e. g(x) ≥ 1 here.

A Function With An Inhibitory Input
Now this might look like a tricky one but it's really not. Here, we have an inhibitory input, x_2, so whenever x_2 is 1 the output will be 0. Keeping that in mind, we know that x_1 AND !x_2 would output 1 only when x_1 is 1 and x_2 is 0, so the threshold parameter should be 1. Let's verify that. g(x), i.e. x_1 + x_2, would be ≥ 1 in only 3 cases:
Case 1: when x_1 is 1 and x_2 is 0
Case 2: when x_1 is 1 and x_2 is 1
Case 3: when x_1 is 0 and x_2 is 1
But in both Case 2 and Case 3, we know that the output will be 0 because x_2 is 1 in both of them, thanks to the inhibition.
And we also know that x_1 AND !x_2 would output 1 for Case 1 (above) so our thresholding parameter holds good for the given function. NOR Function For a NOR neuron to fire, we want ALL the inputs to be 0 so the thresholding parameter should also be 0 and we take them all as inhibitory input. NOT Function For a NOT neuron, 1 outputs 0 and 0 outputs 1. So we take the input as an inhibitory input and set the thresholding parameter to 0. It works! Can any boolean function be represented using the M-P neuron? Before you answer that, lets understand what M-P neuron is doing geometrically. Geometric Interpretation Of M-P Neuron This is the best part of the post according to me. Lets start with the OR function. OR Function We already discussed that the OR function’s thresholding parameter theta is 1, for obvious reasons. The inputs are obviously boolean, so only 4 combinations are possible — (0,0), (0,1), (1,0) and (1,1). Now plotting them on a 2D graph and making use of the OR function’s aggregation equation i.e., x_1 + x_2 ≥ 1 using which we can draw the decision boundary as shown in the graph below. Mind you again, this is not a real number graph We just used the aggregation equation i.e., x_1 + x_2 =1 to graphically show that all those inputs whose output when passed through the OR function M-P neuron lie ON or ABOVE that line and all the input points that lie BELOW that line are going to output 0. Voila!! The M-P neuron just learnt a linear decision boundary! The M-P neuron is splitting the input sets into two classes — positive and negative. Positive ones (which output 1) are those that lie ON or ABOVE the decision boundary and negative ones (which output 0) are those that lie BELOW the decision boundary. Lets convince ourselves that the M-P unit is doing the same for all the boolean functions by looking at more examples (if it is not already clear from the math). AND Function In this case, the decision boundary equation is x_1 + x_2 =2. 
Here, all the input points that lie ON or ABOVE the line, which is just (1,1), output 1 when passed through the AND function M-P neuron. It fits! The decision boundary works!

Tautology

OR Function With 3 Inputs
Let's generalize this by looking at a 3-input OR function M-P unit. In this case, the possible inputs are 8 points: (0,0,0), (0,0,1), (0,1,0), (1,0,0), (1,0,1),… you got the point(s). We can map these on a 3D graph, and this time we draw a decision boundary in 3 dimensions.

“Is it a bird? Is it a plane?” Yes, it is a PLANE!

The decision boundary is the plane that satisfies the equation x_1 + x_2 + x_3 = 1. Take your time and convince yourself that all the points that lie ON or ABOVE that plane (positive half space) will result in output 1 when passed through the OR function M-P unit, and all the points that lie BELOW that plane (negative half space) will result in output 0.

Just by hand coding a thresholding parameter, the M-P neuron is able to conveniently represent the boolean functions which are linearly separable.

Linear separability (for boolean functions): There exists a line (plane) such that all inputs which produce a 1 lie on one side of the line (plane) and all inputs which produce a 0 lie on the other side of the line (plane).

Limitations Of M-P Neuron
▪ What about non-boolean (say, real) inputs?
▪ Do we always need to hand code the threshold?
▪ Are all inputs equal? What if we want to assign more importance to some inputs?
▪ What about functions which are not linearly separable? Say, the XOR function.

Blog: https://towardsdatascience.com/mcculloch-pitts-model-5fdf65ac5dd1

Neural Networks
Neural networks reflect the behavior of the human brain, allowing computer programs to recognize patterns and solve common problems in the fields of AI, machine learning, and deep learning.

What are neural networks?
Neural networks, also known as artificial neural networks (ANNs) or simulated neural networks (SNNs), are a subset of machine learning and are at the heart of deep learning algorithms. Their name and structure are inspired by the human brain, mimicking the way that biological neurons signal to one another.

Artificial neural networks (ANNs) are composed of node layers: an input layer, one or more hidden layers, and an output layer. Each node, or artificial neuron, connects to others and has an associated weight and threshold. If the output of any individual node is above the specified threshold value, that node is activated, sending data to the next layer of the network. Otherwise, no data is passed along to the next layer of the network.

Neural networks rely on training data to learn and improve their accuracy over time. Once these learning algorithms are fine-tuned for accuracy, they are powerful tools in computer science and artificial intelligence, allowing us to classify and cluster data at a high velocity. Tasks in speech recognition or image recognition can take minutes versus hours when compared to manual identification by human experts. One of the most well-known neural networks is Google's search algorithm.

How do neural networks work?
Think of each individual node as its own linear regression model, composed of input data, weights, a bias (or threshold), and an output. The formula would look something like this:
∑ w_i x_i + bias = w_1 x_1 + w_2 x_2 + w_3 x_3 + bias
output = f(x) = 1 if ∑ w_i x_i + bias ≥ 0; 0 if ∑ w_i x_i + bias < 0
Once an input layer is determined, weights are assigned. These weights help determine the importance of any given variable, with larger ones contributing more significantly to the output compared to other inputs. All inputs are then multiplied by their respective weights and summed. Afterward, the weighted sum is passed through an activation function, which determines the output.
If that output exceeds a given threshold, it “fires” (or activates) the node, passing data to the next layer in the network. This results in the output of one node becoming the input of the next node. This process of passing data from one layer to the next layer defines this neural network as a feedforward network.

Let's break down what one single node might look like using binary values. We can apply this concept to a more tangible example, like whether you should go surfing (Yes: 1, No: 0). The decision to go or not to go is our predicted outcome, or y-hat. Let's assume that there are three factors influencing your decision-making:
1. Are the waves good? (Yes: 1, No: 0)
2. Is the line-up empty? (Yes: 1, No: 0)
3. Has there been a recent shark attack? (Yes: 0, No: 1)

Then, let's assume the following, giving us the following inputs:
X1 = 1, since the waves are pumping
X2 = 0, since the crowds are out
X3 = 1, since there hasn't been a recent shark attack

Now, we need to assign some weights to determine importance. Larger weights signify that particular variables are of greater importance to the decision or outcome.
W1 = 5, since large swells don't come around often
W2 = 2, since you're used to the crowds
W3 = 4, since you have a fear of sharks

Finally, we'll also assume a threshold value of 3, which would translate to a bias value of –3. With all the various inputs, we can start to plug values into the formula to get the desired output.
Y-hat = (1*5) + (0*2) + (1*4) – 3 = 6

If we use the activation function from the beginning of this section, we can determine that the output of this node would be 1, since 6 is greater than 0. In this instance, you would go surfing; but if we adjust the weights or the threshold, we can achieve different outcomes from the model. When we observe one decision, like in the above example, we can see how a neural network could make increasingly complex decisions depending on the output of previous decisions or layers.
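The surfing arithmetic above can be reproduced in a few lines. This is a minimal sketch: the function name node_output is made up, and the second call with a threshold of 10 is a hypothetical variation added only to show how changing the threshold flips the decision.

```python
def node_output(inputs, weights, bias):
    """Weighted sum plus bias, followed by the step activation (1 if >= 0)."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return z, (1 if z >= 0 else 0)

inputs = [1, 0, 1]   # X1 waves good, X2 line-up empty, X3 no recent shark attack
weights = [5, 2, 4]  # W1, W2, W3 from the example
bias = -3            # threshold of 3

z, y_hat = node_output(inputs, weights, bias)
print(z, y_hat)      # weighted sum 6, decision 1 (go surfing)

# Hypothetical variation: a stricter threshold of 10 (bias -10).
z_strict, y_hat_strict = node_output(inputs, weights, bias=-10)
print(z_strict, y_hat_strict)  # weighted sum -1, decision 0 (stay home)
```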
In the example above, we used perceptrons to illustrate some of the mathematics at play here, but neural networks leverage sigmoid neurons, which are distinguished by having output values between 0 and 1. Since neural networks behave similarly to decision trees, cascading data from one node to another, having values between 0 and 1 reduces the impact of any given change of a single variable on the output of any given node, and subsequently on the output of the neural network.

As we start to think about more practical use cases for neural networks, like image recognition or classification, we'll leverage supervised learning, or labeled datasets, to train the algorithm. As we train the model, we'll want to evaluate its accuracy using a cost (or loss) function. A commonly used cost function is the mean squared error (MSE). In the equation below, i represents the index of the sample, y-hat is the predicted outcome, y is the actual value, and m is the number of samples.

Cost Function = MSE = \frac{1}{2m} \sum_{i=1}^{m} \left( \hat{y}^{(i)} - y^{(i)} \right)^2

Ultimately, the goal is to minimize our cost function to ensure correctness of fit for any given observation. As the model adjusts its weights and bias, it uses the cost function to reach the point of convergence, or the local minimum. The algorithm adjusts its weights through gradient descent, which lets the model determine the direction to take to reduce errors (or minimize the cost function). With each training example, the parameters of the model adjust to gradually converge at the minimum.

See this IBM Developer article for a deeper explanation of the quantitative concepts involved in neural networks.

Most deep neural networks are feedforward, meaning they flow in one direction only, from input to output. However, you can also train your model through backpropagation; that is, move in the opposite direction from output to input.
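The cost function, gradient descent, and the backward flow of error just described can be illustrated on a single sigmoid neuron with one input. This is a hedged sketch rather than the article's own method: the tiny dataset, the learning rate, the number of epochs, and the function names are assumptions chosen for illustration. The gradient is obtained with the chain rule, which is the single-neuron case of what backpropagation does layer by layer.

```python
import math

def sigmoid(z):
    """Squashes any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def mse_cost(predictions, targets):
    """Cost = 1/(2m) * sum of squared differences, as in the formula above."""
    m = len(targets)
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / (2 * m)

def train_sigmoid_neuron(xs, ys, learning_rate=0.5, epochs=2000):
    """Batch gradient descent on one sigmoid neuron: y_hat = sigmoid(w*x + b)."""
    w, b = 0.0, 0.0
    m = len(xs)
    costs = []
    for _ in range(epochs):
        preds = [sigmoid(w * x + b) for x in xs]
        costs.append(mse_cost(preds, ys))
        # Chain rule: dCost/dw = (1/m) * sum((p - y) * p * (1 - p) * x)
        grad_w = sum((p - y) * p * (1 - p) * x for p, y, x in zip(preds, ys, xs)) / m
        grad_b = sum((p - y) * p * (1 - p) for p, y in zip(preds, ys)) / m
        # Step against the gradient to reduce the cost.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b, costs

# Made-up labeled data: small x values are class 0, large x values are class 1.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [0.0, 0.0, 1.0, 1.0]

w, b, costs = train_sigmoid_neuron(xs, ys)
print(costs[0], costs[-1])  # the cost at the end is lower than at the start
```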
Backpropagation allows us to calculate and attribute the error associated with each neuron, allowing us to adjust and fit the parameters of the model(s) appropriately.

Types of neural networks

Neural networks can be classified into different types, which are used for different purposes. While this isn’t a comprehensive list, the types below are the ones you’ll most commonly come across, along with their common use cases:

The perceptron is the oldest neural network, created by Frank Rosenblatt in 1958. It has a single neuron and is the simplest form of a neural network.

Feedforward neural networks, or multi-layer perceptrons (MLPs), are what we’ve primarily been focusing on within this article. They are composed of an input layer, a hidden layer or layers, and an output layer. While these neural networks are also commonly referred to as MLPs, it’s important to note that they are actually composed of sigmoid neurons, not perceptrons, as most real-world problems are nonlinear. Data usually is fed into these models to train them, and they are the foundation for computer vision, natural language processing, and other neural networks.

Convolutional neural networks (CNNs) are similar to feedforward networks, but they’re usually utilized for image recognition, pattern recognition, and/or computer vision. These networks harness principles from linear algebra, particularly matrix multiplication, to identify patterns within an image.

Recurrent neural networks (RNNs) are identified by their feedback loops. These learning algorithms are primarily leveraged when using time-series data to make predictions about future outcomes, such as stock market predictions or sales forecasting.

Neural networks vs. deep learning

Deep Learning and neural networks tend to be used interchangeably in conversation, which can be confusing.
As a result, it’s worth noting that the “deep” in deep learning is just referring to the depth of layers in a neural network. A neural network that consists of more than three layers, which would be inclusive of the inputs and the output, can be considered a deep learning algorithm. A neural network that only has two or three layers is just a basic neural network. To learn more about the differences between neural networks and other forms of artificial intelligence, like machine learning, please read the blog post “AI vs. Machine Learning vs. Deep Learning vs. Neural Networks: What’s the Difference?”

History of neural networks

The history of neural networks is longer than most people think. While the idea of “a machine that thinks” can be traced to the Ancient Greeks, we’ll focus on the key events that led to the evolution of thinking around neural networks, which has ebbed and flowed in popularity over the years:

1943: Warren S. McCulloch and Walter Pitts published “A logical calculus of the ideas immanent in nervous activity”. This research sought to understand how the human brain could produce complex patterns through connected brain cells, or neurons. One of the main ideas that came out of this work was the comparison of neurons with a binary threshold to Boolean logic (i.e., 0/1 or true/false statements).

1958: Frank Rosenblatt is credited with the development of the perceptron, documented in his research, “The Perceptron: A Probabilistic Model for Information Storage and Organization in the Brain”. He took McCulloch and Pitts’s work a step further by introducing weights to the equation. Leveraging an IBM 704, Rosenblatt was able to get a computer to learn how to distinguish cards marked on the left vs. cards marked on the right.
1974: While numerous researchers contributed to the idea of backpropagation, Paul Werbos was the first person in the US to note its application within neural networks within his PhD thesis.

1989: Yann LeCun published a paper illustrating how the use of constraints in backpropagation and its integration into the neural network architecture can be used to train algorithms. This research successfully leveraged a neural network to recognize hand-written zip code digits provided by the U.S. Postal Service.

Introduction to DEEP LEARNING
G.Karthikeyan ME.,(Ph.D)

CONTENTS
I. Introduction II. History III. Principle IV. Technology V. Working VI. Formulations VII. Advantage VIII. Disadvantage IX. Real Time Applications

INTRODUCTION
What is Deep Learning? Deep learning is a branch of machine learning that uses data, loads and loads of data, to teach computers how to do things only humans were capable of before. For example, how do machines solve the problems of perception?

HISTORY
1958: Frank Rosenblatt creates the perceptron, an algorithm for pattern recognition.
1989: Scientists were able to create algorithms that used deep neural networks.
2000s: The term “deep learning” begins to gain popularity after a paper by Geoffrey Hinton.
2012: Artificial pattern-recognition algorithms achieve human-level performance on certain tasks.

PRINCIPLE
Deep learning is based on the concept of artificial neural networks, or computational systems that mimic the way the human brain functions.

TECHNOLOGY
Deep learning is a fast-growing field, and new architectures and variants appear every few weeks. We'll discuss the major three:

1. Convolutional Neural Network (CNN)
Convolutional neural networks are designed to process data through multiple layers of arrays. This type of neural network is used in applications like image recognition or face recognition.
The primary difference between a CNN and an ordinary neural network is that a CNN takes its input as a two-dimensional array and operates directly on the images, rather than relying on a separate feature extraction step as other neural networks do.

2. Recurrent Neural Network (RNN)
RNNs are called recurrent because they perform the same task for every element of a sequence, with the output being dependent on the previous computations. RNNs have a “memory” which captures information about what has been calculated so far.

3. Long-Short Term Memory
LSTM can learn "Very Deep Learning" tasks that require memories of events that happened thousands or even millions of discrete time steps ago. LSTM works even when there are long delays, and it can handle signals that have a mix of low and high frequency components.

WORKING
Consider the following handwritten sequence: Most people effortlessly recognize those digits as 504192. The difficulty of visual pattern recognition becomes apparent if you attempt to write a computer program to recognize digits like those above.

WORKING
The idea of a neural network is to develop a system that can learn from a large number of training examples. Each neuron assigns a weighting to its input: how correct or incorrect it is relative to the task being performed. The final output is then determined by the total of those weightings.
[Figure labels: a training sample; a very basic approach: binary classifier]

FORMULATIONS
The basis of deep learning is classification, which can be further used for detection, ranking, regression, etc.

ADVANTAGES
Deep learning can give good results with unstructured or unlabelled data. Deep learning facilitates the process of automatically detecting features and does not require feature extraction in advance. It is a robust system and one neural network-based approach can be adapted and used for multiple data types and applications.
Deep learning has boosted the possibility of integration with other existing technologies, including big data, brain-computer interface, the Internet of Things (IoT) and drones.

CHALLENGES
1. Deep learning works only with large amounts of data.
2. Training it with large and complex data models can be expensive.
3. It also needs extensive hardware to do complex mathematical calculations.
4. There is no single or standard theory for selecting deep learning tools.
5. Deep learning algorithms are sometimes unable to provide conclusions in cross-disciplinary problems.

DL Applications
1. customer experience
2. automatic speech recognition
3. autonomous vehicles
4. image colourisation
5. computer vision
6. video colourisation
7. customer service
8. deep-learning robots
9. drug discovery and toxicology
10. farming
11. financial services
12. healthcare
13. image caption generation
14. image recognition
15. language recognition
16. law enforcement
17. mobile advertising
18. recommendation systems
19. text generation
20. translation engines

FUTURE SCOPE
1. For Astronauts, Next Steps on Journey to Space Will Be Virtual
2. Droughts and Deep Learning: Measuring Water Where It’s shortage

Back Propagation
Backpropagation is the essence of neural network training. It is the method of fine-tuning the weights of a neural network based on the error rate obtained in the previous epoch (i.e., iteration). Proper tuning of the weights allows you to reduce error rates and make the model reliable by increasing its generalization. Backpropagation in a neural network is short for “backward propagation of errors.” It is a standard method of training artificial neural networks. This method helps calculate the gradient of a loss function with respect to all the weights in the network.

How Backpropagation Algorithm Works
The backpropagation algorithm in a neural network computes the gradient of the loss function with respect to each single weight using the chain rule.
It efficiently computes the gradient one layer at a time, unlike a naive direct computation.
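To make the layer-by-layer chain rule concrete, here is a minimal sketch for a very small network with one hidden sigmoid neuron and one output sigmoid neuron. The network shape, the squared-error loss, the parameter values, and all function names are assumptions chosen for illustration. The finite-difference function stands in for a naive direct computation: it re-runs the whole network twice for every weight, whereas backpropagation reuses the output layer's error term when it moves back to the hidden layer.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(x, params):
    """Forward pass: x -> hidden neuron h -> output neuron y_hat."""
    w1, b1, w2, b2 = params
    h = sigmoid(w1 * x + b1)
    y_hat = sigmoid(w2 * h + b2)
    return h, y_hat


def loss(x, y, params):
    """Squared error for one sample (assumed loss for this sketch)."""
    _, y_hat = forward(x, params)
    return 0.5 * (y_hat - y) ** 2


def backprop(x, y, params):
    """Gradients of the loss for [w1, b1, w2, b2], computed output layer first."""
    w1, b1, w2, b2 = params
    h, y_hat = forward(x, params)
    # Output layer: dL/dz2 = (y_hat - y) * sigmoid'(z2).
    delta2 = (y_hat - y) * y_hat * (1.0 - y_hat)
    grad_w2 = delta2 * h
    grad_b2 = delta2
    # Hidden layer reuses delta2: dL/dz1 = delta2 * w2 * sigmoid'(z1).
    delta1 = delta2 * w2 * h * (1.0 - h)
    grad_w1 = delta1 * x
    grad_b1 = delta1
    return [grad_w1, grad_b1, grad_w2, grad_b2]


def numerical_gradient(x, y, params, eps=1e-6):
    """Naive check: perturb each parameter and re-run the whole network."""
    grads = []
    for i in range(len(params)):
        up = list(params)
        down = list(params)
        up[i] += eps
        down[i] -= eps
        grads.append((loss(x, y, up) - loss(x, y, down)) / (2 * eps))
    return grads


# Hypothetical sample and starting parameters [w1, b1, w2, b2].
x, y = 1.5, 1.0
params = [0.4, -0.2, 0.7, 0.1]

analytic = backprop(x, y, params)
numeric = numerical_gradient(x, y, params)
max_diff = max(abs(a - n) for a, n in zip(analytic, numeric))
print(max_diff)
```

If the chain rule has been applied correctly, the two gradient lists agree to within the error of the finite-difference approximation, so `max_diff` is very small.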
