Artificial Intelligence - Neural Networks PDF

Document Details


Uploaded by LawAbidingConsciousness2055

Ajou University

2024

Sang-Hoon Lee

Tags

neural networks, artificial intelligence, machine learning, deep learning

Summary

Ajou University lecture notes on Artificial Intelligence, covering Neural Networks, including Logic Gates, Perceptrons, Multi-layer Perceptrons, Binary Classification, Optimization, and more. The lecture notes also cover different classification methods, as well as deep learning concepts, including ReLU activation functions.

Full Transcript


Artificial Intelligence 02-2. Neural Networks
11 Sep. 2024, Sang-Hoon Lee, Ajou University

Logic Gates with Perceptron
▪ Logic Gates
  AND
  ✓ The output of an AND gate is 1 only if both inputs are 1.
  ✓ Otherwise, the output of an AND gate is 0.
  OR
  ✓ The output of an OR gate is 0 only if both inputs are 0.
▪ Perceptron with inputs $x_1, x_2$, weights $w_1, w_2$, and bias $b$:
  ✓ $w_1 x_1 + w_2 x_2 + b \le 0 \;\Rightarrow\; y = 0$
  ✓ $w_1 x_1 + w_2 x_2 + b > 0 \;\Rightarrow\; y = 1$
  ✓ The decision boundary is the line $w_1 x_1 + w_2 x_2 + b = 0$.

Practice
▪ Jupyter Notebook: https://jupyter.ajou.ac.kr
▪ Tutorial: http://ajoupyterhub.ajousw.kr/user-guide-ajoupyterhub

Multi-layer Perceptron
▪ A multilayer perceptron (MLP) is a modern feedforward artificial neural network consisting of fully connected neurons with nonlinear activation functions, organized in at least three layers, and notable for being able to distinguish data that is not linearly separable.
▪ Single-layer Perceptron
  ✓ Can only distinguish linearly separable data.

Binary Classification with MLP
▪ Sigmoid Activation
  To create a probability, we pass $z = w \cdot x + b$ through the sigmoid function $\sigma(z)$.
  ✓ It takes a real-valued number and maps it into the range (0, 1), which is just what we want for a probability.
  ✓ Because it is nearly linear around 0 but flattens toward the ends, it tends to squash outlier values toward 0 or 1.
  ✓ The sigmoid function is also called the logistic function.

Binary Classification with MLP
▪ Classification with logistic regression
  ✓ Predict class 1 if $w \cdot x + b > 0$.
  ✓ Predict class 0 if $w \cdot x + b \le 0$.

Optimizing the Neural Networks
▪ Loss function for binary classification: cross-entropy loss.
▪ Optimization algorithm: stochastic gradient descent.

Optimizing the Neural Networks
▪ Stochastic gradient descent is an online algorithm that minimizes the loss function by computing its gradient after each training example and nudging $\theta$ in the right direction (the opposite direction of the gradient).
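Since the loss (cross-entropy) and the optimizer (stochastic gradient descent) have now both been named, a minimal NumPy sketch of binary logistic regression trained with per-example SGD may help tie the pieces together. The toy AND-style data, the learning rate, and the epoch count are illustrative assumptions, not values from the lecture.

```python
import numpy as np

def sigmoid(z):
    # Maps a real value into (0, 1); also called the logistic function.
    return 1.0 / (1.0 + np.exp(-z))

# Toy AND-like data: assumed for illustration only.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
t = np.array([0, 0, 0, 1], dtype=float)

rng = np.random.default_rng(0)
w = rng.normal(scale=0.1, size=2)   # weights
b = 0.0                             # bias
eta = 0.5                           # learning rate (assumed)

for epoch in range(200):
    for x_i, t_i in zip(X, t):      # SGD: update after each training example
        y_i = sigmoid(w @ x_i + b)
        # For cross-entropy with a sigmoid output, dL/dw = (y - t) * x and dL/db = (y - t).
        grad = y_i - t_i
        w -= eta * grad * x_i       # step in the opposite direction of the gradient
        b -= eta * grad

print([int(sigmoid(w @ x_i + b) > 0.5) for x_i in X])  # expected: [0, 0, 0, 1]
```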
Index
▪ Neural Networks
  ✓ Multi-class classification, Regression
  ✓ Backpropagation
  ✓ Learning Rate
  ✓ Deep Learning
  ✓ Dropout

Multi-class Classification
▪ Often we need more than 2 classes
  ✓ Positive/negative/neutral
  ✓ Parts of speech (noun, verb, adjective, adverb, preposition, etc.)
  ✓ Classify emergency SMSs into different actionable classes
▪ If there are more than 2 classes, we use multinomial logistic regression (= softmax regression = multinomial logit).
▪ The probability of everything must still sum to 1:
  ✓ P(positive|doc) + P(negative|doc) + P(neutral|doc) = 1
▪ We need a generalization of the sigmoid called the softmax.
  ✓ It takes a vector $z = [z_1, z_2, \ldots, z_k]$ of $k$ arbitrary values.
  ✓ It outputs a probability distribution: each value is in the range [0, 1] and all the values sum to 1.

Multi-class Classification
▪ The softmax function takes as input a vector $z$ of $K$ real numbers and normalizes it into a probability distribution consisting of $K$ probabilities proportional to the exponentials of the input numbers:
  $\mathrm{softmax}(z_i) = \dfrac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}$
▪ Example with activation results $z = [1.3, 5.1, 2.2, 0.7, 1.1]$: the first two probabilities are
  $\dfrac{e^{1.3}}{e^{1.3}+e^{5.1}+e^{2.2}+e^{0.7}+e^{1.1}}$ and $\dfrac{e^{5.1}}{e^{1.3}+e^{5.1}+e^{2.2}+e^{0.7}+e^{1.1}}$,
  where $e^{5.1} = 164.021\ldots$, $e^{1.3} = 3.669\ldots$, $e^{0.7} = 2.013\ldots$ (see the sketch at the end of this subsection).

Multi-class Classification
▪ Class candidates = {cat, dog, horse}. For input images I1, I2, I3, forward propagation computes the feedforward output $y = f(\sum_{i=1}^{n} w_i x_i + b)$ (results of the activation function), and the softmax function turns it into probabilities S(y).
  [Figure: the three input images are shown as small grids of pixel values.]

  Feedforward output, y:        Softmax output, S(y):
           I1  I2  I3                   I1   I2   I3
  cat       5   4   1          cat     0.7  0.0  0.0
  dog       4   2   4          dog     0.2  0.0  0.4
  horse     2   8   4          horse   0.0  0.9  0.4

Multi-class Classification
▪ Due to the random initialization of weights and biases, the neural network probably makes errors in giving the correct output.
▪ Mean squared error: $\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$, where $\hat{y}$ is the softmax output S(y) and $y$ is the target value; training minimizes this loss.

  Softmax output, ŷ:             Target value, y:
           I1   I2   I3                 I1  I2  I3
  cat     0.7  0.0  0.0         cat      1   0   0
  dog     0.2  0.0  0.4         dog      0   0   0
  horse   0.0  0.9  0.4         horse    0   1   1
  (I1 is a cat; I2 and I3 are horses. The slide reports per-image error values of 0.487, 0.000, and 0.250.)
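The softmax numbers above can be reproduced directly; below is a short NumPy sketch. Subtracting max(z) before exponentiating is a standard numerical-stability trick, not something stated on the slides.

```python
import numpy as np

def softmax(z):
    # Exponentiate and normalize so the outputs are in [0, 1] and sum to 1.
    # Subtracting max(z) does not change the result but avoids overflow.
    e = np.exp(z - np.max(z))
    return e / e.sum()

z = np.array([1.3, 5.1, 2.2, 0.7, 1.1])       # activation results from the slide
print(np.exp(5.1), np.exp(1.3), np.exp(0.7))  # ~164.02, ~3.67, ~2.01
print(softmax(z))                             # probability distribution over the 5 classes
print(softmax(z).sum())                       # 1.0
```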
Regression
▪ Estimating the continuous values of target data.
  ✓ Input x: features
  ✓ Target y: e.g., a stock price
  ✓ Loss: minimize the distance between the target $y$ and the prediction $\hat{y}$
  ✓ L1 distance $= |y - \hat{y}|$, or MSE: mean squared error loss

MLP Learning: Error Backpropagation
▪ Backward error propagation is the neural network training process of feeding error rates back through the network to make it more accurate.
▪ MSE (mean squared error): $E_{total} = \frac{1}{n} \sum_{i}^{n} (y_i - \hat{y}_i)^2$, where $n$ is the number of outputs.
▪ Weight update (gradient descent): update each weight in the opposite direction of the error gradient, propagating the error backward toward the input layer. The scale of the weight update depends on the learning rate.

MLP Learning: Error Backpropagation (2)
▪ Propagate the errors from the outputs back to the inputs. For a two-layer MLP with hidden units $z_h$ and outputs $y_i$:
  $y_i^t = \sum_{h=1}^{H} V_{ih} z_h^t + V_{i0}$
  $z_h^t = \mathrm{sigmoid}\!\left(\sum_{j=1}^{D} W_{hj} x_j^t + W_{h0}\right)$

MLP Learning: Error Backpropagation (3)
▪ With the error $E(W, V \mid X) = \frac{1}{2} \sum_t \sum_i (r_i^t - y_i^t)^2$, the update for the output-layer weights is
  $\Delta V_{ih} = -\eta \dfrac{\partial E}{\partial V_{ih}} = \eta \sum_t (r_i^t - y_i^t)\, z_h^t$

MLP Learning: Error Backpropagation (4)
▪ For the hidden-layer weights, apply the chain rule:
  $\Delta W_{hj} = -\eta \dfrac{\partial E}{\partial W_{hj}} = -\eta \sum_t \sum_i \dfrac{\partial E}{\partial y_i^t} \dfrac{\partial y_i^t}{\partial z_h^t} \dfrac{\partial z_h^t}{\partial W_{hj}}$
  $= -\eta \sum_t \sum_i -(r_i^t - y_i^t) \cdot V_{ih} \cdot z_h^t (1 - z_h^t)\, x_j^t$
  $= \eta \sum_t \sum_i (r_i^t - y_i^t) \cdot V_{ih} \cdot z_h^t (1 - z_h^t)\, x_j^t$

MLP Learning: Error Backpropagation (5)
▪ $\Delta W_{hj} = \eta \sum_t \sum_i (r_i^t - y_i^t) \cdot V_{ih} \cdot z_h^t (1 - z_h^t)\, x_j^t$
▪ The sum $\sum_i (r_i^t - y_i^t)\, V_{ih}$ acts like the error term for hidden unit $h$:
  ✓ $(r_i^t - y_i^t)$: error in output unit $i$
  ✓ $V_{ih}$: "responsibility" of hidden unit $h$ for that error

MLP Learning: Error Backpropagation (6)
▪ The remaining factor $\dfrac{\partial z_h^t}{\partial W_{hj}}$ is the derivative of the sigmoidal activation function in the hidden layer.

MLP Learning: Error Backpropagation (7)
▪ Derivative of the sigmoidal activation function in the hidden layer:
  $\dfrac{d}{dW} \mathrm{sigmoid}(W^T X) = \dfrac{d}{dW} \dfrac{1}{1 + \exp(-W^T X)}$
  $= -\dfrac{1}{\left(1 + \exp(-W^T X)\right)^2} \cdot \dfrac{d}{dW}\!\left(1 + \exp(-W^T X)\right)$
  $= -\dfrac{1}{\left(1 + \exp(-W^T X)\right)^2} \cdot \left(-\exp(-W^T X)\, X\right)$
  $= \dfrac{1}{1 + \exp(-W^T X)} \cdot \dfrac{\exp(-W^T X)}{1 + \exp(-W^T X)}\, X$
  $= \mathrm{sigmoid}(W^T X)\,\left(1 - \mathrm{sigmoid}(W^T X)\right) X$

MLP Learning: Error Backpropagation (8)
▪ Update:
  $V_{ih}^{(new)} = V_{ih} + \Delta V_{ih}$, where $\Delta V_{ih} = \eta \sum_t (r_i^t - y_i^t)\, z_h^t$
  $W_{hj}^{(new)} = W_{hj} + \Delta W_{hj}$, where $\Delta W_{hj} = \eta \sum_t \sum_i (r_i^t - y_i^t) \cdot V_{ih} \cdot z_h^t (1 - z_h^t)\, x_j^t$

MLP Learning: Error Backpropagation (tanh)
▪ Same network but with a tanh hidden layer:
  $y_i^t = \sum_{h=1}^{H} V_{ih} z_h^t + V_{i0}$, $\quad z_h^t = \tanh\!\left(\sum_{j=1}^{D} W_{hj} x_j^t + W_{h0}\right)$, $\quad E(W, V \mid X) = \frac{1}{2} \sum_t \sum_i (r_i^t - y_i^t)^2$
▪ What are $\Delta V_{ih} = -\eta\, \partial E / \partial V_{ih}$ and $\Delta W_{hj} = -\eta\, \partial E / \partial W_{hj}$?
▪ The output-layer update is unchanged: $\Delta V_{ih} = \eta \sum_t (r_i^t - y_i^t)\, z_h^t$
▪ For the hidden layer, the derivative of tanh replaces that of the sigmoid (see http://taewan.kim/post/tanh_diff/):
  $\Delta W_{hj} = -\eta \sum_t \sum_i \dfrac{\partial E}{\partial y_i^t} \dfrac{\partial y_i^t}{\partial z_h^t} \dfrac{\partial z_h^t}{\partial W_{hj}} = \eta \sum_t \sum_i (r_i^t - y_i^t) \cdot V_{ih} \cdot (1 - z_h^t)(1 + z_h^t)\, x_j^t$

MLP Learning: Error Backpropagation (ReLU)
▪ With a ReLU hidden layer, $z_h^t = \mathrm{ReLU}\!\left(\sum_{j=1}^{D} W_{hj} x_j^t + W_{h0}\right)$, and the hidden-layer update again follows the chain rule:
  $\Delta W_{hj} = -\eta \sum_t \sum_i \dfrac{\partial E}{\partial y_i^t} \dfrac{\partial y_i^t}{\partial z_h^t} \dfrac{\partial z_h^t}{\partial W_{hj}}$

Learning Rate
▪ The learning rate represents the speed at which a machine learning model "learns": it scales the size of each weight update.
  (Reference: https://medium.com/@ompramod9921/mastering-gradient-descent-optimizing-neural-networks-with-precision-e461e996633e)

Learning Rate
▪ Initial Phase: At the start of training we can afford to make larger updates to the parameters, as the initial parameters are usually far from the optimal ones, so we start with a larger learning rate.
▪ Middle Phase: As training progresses, the parameters get closer to the optimal ones. Large updates can now overshoot the minimum, so we gradually reduce the learning rate; this is where learning rate decay comes into play.
▪ Final Phase: Toward the end of training we want to converge to the minimum, so we continue to reduce the learning rate to make smaller and smaller updates.

Learning Rate Decay
▪ Step Decay: Reduce the learning rate by some factor every few epochs; for example, halve the learning rate every 5 epochs (see the sketch below).
▪ Exponential Decay: The learning rate is decayed exponentially over time, following an exponential decay function.
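As a concrete illustration of the two decay schedules just described, here is a small sketch. The initial rate of 0.1 and the exponential rate k = 0.1 are assumed values; only the "halve every 5 epochs" rule comes from the slide's example.

```python
import math

eta0 = 0.1          # initial learning rate (assumed)

def step_decay(epoch, drop=0.5, epochs_per_drop=5):
    # Halve the learning rate every 5 epochs, as in the slide's example.
    return eta0 * (drop ** (epoch // epochs_per_drop))

def exponential_decay(epoch, k=0.1):
    # Smoothly decay the learning rate as eta0 * exp(-k * epoch); k is assumed.
    return eta0 * math.exp(-k * epoch)

for epoch in range(0, 21, 5):
    print(epoch, round(step_decay(epoch), 5), round(exponential_decay(epoch), 5))
```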
Saddle Point
▪ A specific point in the optimization landscape of a cost function where the gradient is zero, but the point is neither a minimum nor a maximum.
▪ It is a point where the surface of the cost function resembles a saddle, with some dimensions curving upward and others downward.
  [Figure: the red and green curves intersect at a generic saddle point in two dimensions. Along the green curve the saddle point looks like a local minimum, while it looks like a local maximum along the red curve.]
▪ When gradient descent encounters a saddle point, the gradients become very small, so the updates to the parameters also become very small. As a result, the algorithm makes little progress and appears to be stuck, or proceeds extremely slowly.
  [Figure: gradient descent proceeding extremely slowly near a saddle point.]
▪ Example: for $f(x) = x^3$, $x = 0$ is the only point where the derivative is zero, and it is neither a local maximum nor a local minimum. Therefore $x = 0$ is a saddle point of $f(x) = x^3$. It can cause problems because the gradients near it are very small.
▪ Saturation: when the input to the activation function is very large or very small, the function saturates at these extremes, causing the gradient to be nearly zero. This leads to a plateau effect during backpropagation, where the weights and biases do not get updated effectively.

Deep Learning
▪ The gradient vanishing problem is encountered when training neural networks with gradient-based learning methods and backpropagation.
  ✓ As the network depth or sequence length increases, the gradient magnitude typically decreases (or grows uncontrollably), slowing the training process.
  ✓ Because backpropagation computes gradients by the chain rule, using sigmoid and tanh activations causes the gradient vanishing problem.
  (Reference: https://medium.com/@amanatulla1606/vanishing-gradient-problem-in-deep-learning-understanding-intuition-and-solutions-da90ef4ecb54)

Deep Learning
▪ AlexNet: convolution layers (5) + fully-connected layers (2) + classification.

CNN
▪ Convolutional Neural Network.
▪ A CNN slides a filter from the top left to the bottom right of the input.
  ✓ The filter is a feature identifier, sometimes called a kernel.
  ✓ The stride is how far the filter moves in every step along one direction.
  ✓ The area where the filter stays is called a receptive field.
  ✓ Example (detecting a "connector" pattern, stride = 1): the input image holds pixel values of 0 and 255, the filter is convolved with each receptive field, and the resulting dot products form the feature map; e.g., one position gives $0 \cdot 255 + 255 \cdot 255 + 255 \cdot 0 + 0 \cdot 0 = 65025$.
  (References: https://www.youtube.com/watch?v=RLlI9q6Uojk, https://github.com/minsuk-heo/deeplearning/blob/master/src/CNN_Tensorflow.ipynb)

Deep Learning
▪ Gradient Vanishing Problem
  ✓ Sigmoid and tanh cause the gradient to be nearly zero.
  ✓ To solve this issue, AlexNet proposed the ReLU activation function.
  [Figure: plots of the sigmoid, tanh, and ReLU activation functions.]

Deep Learning
▪ ReLU
  ✓ ReLU helps mitigate the problem of small gradients and the resulting slow learning on the plateaus of the loss landscape.
  ✓ ReLU is a popular activation function defined as $f(x) = \max(0, x)$; the derivative of the ReLU function is 1 when $x > 0$.
  ✓ Limitation: the dying ReLU problem, when a unit starts outputting 0 for all inputs.
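A minimal NumPy sketch of ReLU and its gradient (illustrative only) makes the dying-ReLU remark concrete: wherever the pre-activation is negative the gradient is exactly 0, so a unit that only ever sees negative inputs stops learning.

```python
import numpy as np

def relu(x):
    # f(x) = max(0, x)
    return np.maximum(0.0, x)

def relu_grad(x):
    # Derivative: 1 where x > 0, 0 elsewhere (undefined at exactly 0; 0 is used here).
    return (x > 0).astype(float)

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(z))       # [0.  0.  0.  0.5 2. ]
print(relu_grad(z))  # [0. 0. 0. 1. 1.]
# A "dead" unit: if its pre-activations are always negative, the gradient is always 0
# and its incoming weights never receive an update.
```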
Deep Learning
▪ Leaky ReLU (Leaky Rectified Linear Unit)
  ✓ A type of activation function based on ReLU, but with a small slope for negative values instead of a flat slope.
  ✓ The slope coefficient is determined before training, i.e., it is not learnt during training.
  ✓ This type of activation function is popular in tasks where we may suffer from sparse gradients.

Deep Learning
▪ Overfitting
  ✓ Deep neural nets with a large number of parameters are very powerful machine learning systems; however, overfitting is a serious problem in such networks.
  ✓ Large networks are also slow to use, making it difficult to deal with overfitting by combining the predictions of many different large neural nets at test time.
▪ Dropout
  ✓ The key idea is to randomly drop units (along with their connections) from the neural network during training.
  ✓ This significantly reduces overfitting and gives major improvements over other regularization methods.

Epoch, Batch, Input Size, and GPU Memory
▪ An epoch means training the neural network with all the training data for one cycle (e.g., a training dataset of 1000 records); we can use multiple epochs in training.
▪ An epoch is made up of one or more batches.
▪ Passing through the training examples in one batch is called an iteration.
▪ The backpropagation process is done per batch: the error is calculated per input sample, and the mean squared error or cross-entropy over the batch is used for the backpropagation process.
▪ For the gradient descent algorithm, all activation results of the neurons must be kept available during the backpropagation process, which is what drives GPU memory usage.
  (Reference: https://www.baeldung.com/cs/epoch-neural-networks)

ANN Learning Process
▪ Input: the samples of one batch; Output: the predicted values for that batch.
▪ Error: the sum or average of the errors over the batch samples, e.g. $E_{total} = \frac{1}{n} \sum_{i}^{n} (y_i - \hat{y}_i)^2$, comparing the network output with the true values (training data).
▪ Backpropagation: average the errors of the batch samples and propagate them back through the network.

Next Class (23 Sept.)
▪ Convolutional Neural Network
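Before next class moves on to CNNs, the epoch/batch/iteration terminology and (inverted) dropout can be tied together in one toy training-loop skeleton. Everything here (the 1000-sample random dataset, batch size 100, dropout rate 0.5, learning rate, and the single-layer NumPy stand-in for a network) is an illustrative assumption, not the course's actual code.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20))          # 1000 records, 20 features (assumed)
y = rng.integers(0, 2, size=1000)        # binary targets (assumed)

batch_size = 100
n_batches = len(X) // batch_size         # iterations per epoch = 1000 / 100 = 10
p_drop = 0.5                             # dropout probability (assumed)

W = rng.normal(scale=0.1, size=(20, 1))  # a single weight layer as a stand-in network

for epoch in range(3):                   # one epoch = one full pass over the data
    perm = rng.permutation(len(X))
    for it in range(n_batches):          # one iteration = one batch
        idx = perm[it * batch_size:(it + 1) * batch_size]
        xb, yb = X[idx], y[idx]
        h = xb @ W                       # forward pass (activations kept for backprop)
        # Inverted dropout (normally applied to hidden units): randomly zero units
        # during training and rescale the survivors by 1 / (1 - p).
        mask = (rng.random(h.shape) > p_drop) / (1.0 - p_drop)
        h = h * mask
        pred = 1.0 / (1.0 + np.exp(-h))  # sigmoid output
        err = pred - yb.reshape(-1, 1)   # error per input sample
        grad = xb.T @ (err * mask) / batch_size  # gradient averaged over the batch
        W -= 0.1 * grad                  # gradient descent step (learning rate assumed)
    print("epoch", epoch, "mean squared error", float(np.mean(err ** 2)))
```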
