Perceptron and Neural Networks Explained
Debajyoti Biswas
Summary
This document explains perceptrons and neural networks, covering gradient descent, backpropagation, convolutional neural networks (CNNs), and related concepts in deep learning and machine learning, including how the backpropagation algorithm is used in training.
Full Transcript
Perceptron

The perceptron is an algorithm for learning a binary classifier called a threshold function: a function that maps its input x (a real-valued vector) to an output value f(x) (a single binary value):

    f(x) = h(w · x + b)

Here h is the Heaviside step function (an input > 0 outputs 1; otherwise the output is 0), w is a vector of real-valued weights, w · x = w1x1 + ... + wmxm is the dot product, m is the number of inputs to the perceptron, and b is the bias. The appropriate weights are applied to the inputs, and the resulting weighted sum is passed to the function h, which produces the output o.

In the context of neural networks, a perceptron is an artificial neuron using the Heaviside step function as the activation function. The perceptron algorithm is also termed the single-layer perceptron, to distinguish it from a multilayer perceptron, which is a misnomer for a more complicated neural network. As a linear classifier, the single-layer perceptron is the simplest feedforward neural network. In a feedforward network, information always moves in one direction; it never goes backwards.

[Figure: a perceptron updating its linear boundary as more training examples are added.]

(A minimal code sketch of the perceptron appears after the gradient descent overview below.)

Gradient Descent

People are stuck in the mountains and are trying to get down (i.e., trying to find the global minimum). There is heavy fog, so visibility is extremely low and the path down the mountain is not visible. They can use the method of gradient descent: look at the steepness of the hill at their current position, then proceed in the direction with the steepest descent (i.e., downhill). Using this method, they would eventually find their way down the mountain, or possibly get stuck in some hole (i.e., a local minimum or saddle point), like a mountain lake.

Assume also that the steepness of the hill is not immediately obvious from simple observation; it requires a sophisticated instrument to measure, which the persons happen to have with them. It takes quite some time to measure the steepness with the instrument, so they should minimize its use if they want to get down the mountain before sunset. The difficulty is then choosing how frequently to measure the steepness of the hill so as not to go off track.

In this analogy, the persons represent the algorithm, and the path taken down the mountain represents the sequence of parameter settings that the algorithm explores. The steepness of the hill represents the slope of the function at that point, and the instrument used to measure the steepness is differentiation. The direction they choose to travel aligns with the negative gradient of the function at that point, and the amount of time they travel before taking another measurement is the step size.

Gradient Descent

Step 1. Training Machine Learning Models

Neural networks are trained using gradient descent (or its variants) in combination with backpropagation. Backpropagation computes the gradients of the loss function with respect to each parameter (weights and biases) in the network by applying the chain rule. The process involves:

Forward Propagation: Computes the output for a given input by passing data through the layers.
Backward Propagation: Uses the chain rule to calculate gradients of the loss with respect to each parameter (weights and biases) across all layers.

The gradients are then used by gradient descent to update the parameters layer by layer, moving toward minimizing the loss function.
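Returning to the perceptron defined above, the following is a minimal sketch of the threshold function f(x) = h(w · x + b) and the classic perceptron learning rule, written in Python with NumPy. The toy AND dataset, learning rate, and helper names are illustrative assumptions, not part of the original slides.

```python
import numpy as np

def heaviside(z):
    # Heaviside step function: 1 if input > 0, else 0
    return (z > 0).astype(int)

def predict(x, w, b):
    # Threshold function f(x) = h(w . x + b)
    return heaviside(np.dot(x, w) + b)

def train_perceptron(X, y, lr=0.1, epochs=20):
    # Perceptron learning rule: nudge weights toward misclassified examples
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for xi, target in zip(X, y):
            error = target - predict(xi, w, b)   # -1, 0, or +1
            w += lr * error * xi                 # update weights
            b += lr * error                      # update bias
    return w, b

# Toy, linearly separable data: logical AND
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 0, 0, 1])
w, b = train_perceptron(X, y)
print(predict(X, w, b))   # expected: [0 0 0 1]
```

Because AND is linearly separable, the learned weights define a linear boundary that separates the two classes, which is exactly the behavior depicted in the figure described above.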
Step 2. Minimizing the Cost Function

The algorithm minimizes a cost function, which quantifies the error or loss of the model's predictions compared to the true labels.

Gradient Descent for Linear Regression

Gradient descent minimizes the Mean Squared Error (MSE), which serves as the loss function for finding the best-fit line. Gradient descent iteratively updates the weights (coefficients) and the bias by computing the gradient of the MSE with respect to these parameters. With predictions ŷi = w · xi + b and cost J(w, b) = (1/n) Σi (ŷi − yi)², the gradients are

    ∂J/∂w = (2/n) Σi (ŷi − yi) xi    (gradient of J(w, b) with respect to w)
    ∂J/∂b = (2/n) Σi (ŷi − yi)       (gradient of J(w, b) with respect to b)

and the weights (w) and bias (b) are updated iteratively as

    w = w − γ · ∂J/∂w
    b = b − γ · ∂J/∂b

Here we have considered linear regression, so the only parameters are the weight and the bias. In a fully connected neural network there can be multiple layers and many parameters, but the concept is the same everywhere, and the formula below works for every parameter:

    Parameter = Parameter − γ · ∂J/∂Parameter

Gradient Descent Learning Rate

The learning rate is a critical hyperparameter in gradient descent: it controls the size of the steps taken to update the model parameters during optimization. When the learning rate is too small, the optimization process progresses very slowly; the model makes tiny updates to its parameters in each iteration, leading to sluggish convergence and potentially getting stuck in local minima. An excessively large learning rate can cause the optimization algorithm to overshoot the optimal parameter values, leading to divergence or oscillations that hinder convergence. Achieving the right balance is essential: a small learning rate might result in vanishing gradients and slow convergence, while a large learning rate may lead to overshooting and instability.

(A gradient-descent code sketch for linear regression follows this section.)
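To make the update rule concrete, here is a minimal NumPy sketch of batch gradient descent for linear regression with the MSE loss. The synthetic data, learning rate, and iteration count are illustrative assumptions.

```python
import numpy as np

def gradient_descent(x, y, lr=0.05, epochs=500):
    """Fit y ≈ w*x + b by minimizing MSE with batch gradient descent."""
    w, b = 0.0, 0.0
    n = len(x)
    for _ in range(epochs):
        y_pred = w * x + b                     # current predictions
        error = y_pred - y
        dJ_dw = (2.0 / n) * np.sum(error * x)  # gradient of MSE w.r.t. w
        dJ_db = (2.0 / n) * np.sum(error)      # gradient of MSE w.r.t. b
        w -= lr * dJ_dw                        # Parameter = Parameter - γ * ∂J/∂Parameter
        b -= lr * dJ_db
    return w, b

# Synthetic data drawn from y = 2x + 1 (illustrative)
x = np.linspace(0, 5, 50)
y = 2 * x + 1
w, b = gradient_descent(x, y)
print(round(w, 2), round(b, 2))   # should be close to 2 and 1
```

Raising `lr` well above this value makes the updates overshoot and the loss diverge, while a much smaller `lr` needs many more epochs to converge, which is the learning-rate trade-off described above.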
Backpropagation

Backpropagation is a technique used in deep learning to train artificial neural networks, particularly feedforward networks. It works iteratively to adjust weights and biases to minimize the cost function, and it is typically paired with optimization algorithms such as gradient descent or stochastic gradient descent. The algorithm computes the gradient using the chain rule from calculus, allowing it to navigate the layers of the network effectively while minimizing the cost function.

Working of the Backpropagation Algorithm

The backpropagation algorithm involves two main steps: the forward pass and the backward pass. (A worked code sketch follows this section.)

Forward pass: The input data is fed into the input layer. These inputs, combined with their respective weights, are passed to the hidden layers. For example, in a network with two hidden layers (h1 and h2), the output from h1 serves as the input to h2. Before applying an activation function, a bias is added to the weighted inputs. Each hidden layer applies an activation function such as ReLU (Rectified Linear Unit), which returns the input if it is positive and zero otherwise; this adds non-linearity, allowing the model to learn complex relationships in the data. Finally, the outputs from the last hidden layer are passed to the output layer, where an activation function such as softmax converts the weighted outputs into probabilities for classification.

Backward pass: The error (the difference between the predicted and actual output) is propagated back through the network to adjust the weights and biases. One common method for error calculation is the Mean Squared Error (MSE). Once the error is calculated, the network adjusts the weights using gradients computed with the chain rule. These gradients indicate how much each weight and bias should be adjusted to reduce the error in the next iteration. The backward pass continues layer by layer, ensuring that the network learns and improves its performance.

Advantages of Backpropagation

Efficient Weight Update: It computes the gradient of the loss function with respect to each weight using the chain rule, making it possible to update weights efficiently.
Scalability: The backpropagation algorithm scales well to networks with multiple layers and complex architectures, making deep learning feasible.
Automated Learning: With backpropagation, the learning process becomes automated and the model can adjust itself to optimize its performance.
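The following is a minimal NumPy sketch of one forward pass and one backward pass for a tiny network with a single ReLU hidden layer and an MSE loss. The layer sizes, toy data, and learning rate are illustrative assumptions rather than values from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny network: 3 inputs -> 4 hidden units (ReLU) -> 1 linear output, MSE loss
W1, b1 = rng.normal(size=(3, 4)) * 0.5, np.zeros(4)
W2, b2 = rng.normal(size=(4, 1)) * 0.5, np.zeros(1)

X = rng.normal(size=(8, 3))            # 8 toy examples
y = X.sum(axis=1, keepdims=True)       # toy target: sum of the inputs
lr = 0.05

for step in range(200):
    # Forward pass: weighted sums plus bias, then activation
    z1 = X @ W1 + b1
    h1 = np.maximum(z1, 0)             # ReLU
    y_pred = h1 @ W2 + b2
    loss = np.mean((y_pred - y) ** 2)  # MSE

    # Backward pass: chain rule, layer by layer (output -> input)
    dy = 2 * (y_pred - y) / len(X)     # dLoss/dy_pred
    dW2 = h1.T @ dy                    # dLoss/dW2
    db2 = dy.sum(axis=0)
    dh1 = dy @ W2.T                    # propagate the error to the hidden layer
    dz1 = dh1 * (z1 > 0)               # ReLU derivative
    dW1 = X.T @ dz1
    db1 = dz1.sum(axis=0)

    # Gradient descent update for every parameter
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2

print(f"final MSE: {loss:.4f}")        # should be much smaller than at step 0
```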
Vanishing and Exploding Gradients

Gradient descent, a fundamental optimization algorithm, can sometimes encounter two common issues: vanishing gradients and exploding gradients.

Vanishing Gradients

The vanishing gradient problem is a challenge that emerges during backpropagation when the derivatives or slopes of the activation functions become progressively smaller as we move backward through the layers of a neural network. This phenomenon is particularly prominent in deep networks with many layers and hinders the effective training of the model. When the weight updates become extremely tiny, or even exponentially small, training time can be significantly prolonged and, in the worst case, training can halt altogether.

As the gradients propagate back through the layers of the network during backpropagation, they decrease significantly: as they leave the output layer and approach the input layer, the gradients become progressively smaller. As a result, the weights of the initial layers, which receive these small gradients, are updated little or not at all at each iteration of the optimization process.

Exploding Gradients

The exploding gradient problem is a challenge encountered while training deep neural networks. It occurs when the gradients of the network's loss function with respect to the weights (parameters) become excessively large. The issue arises when, during backpropagation, the derivatives or slopes of the network's layers grow progressively larger as we move backward; it is essentially the opposite of the vanishing gradient problem.

The root cause of this problem lies in the weights of the network rather than in the choice of activation function. High weight values lead to correspondingly high derivatives, causing large deviations of the new weight values from the previous ones. As a result, gradient descent fails to converge, and the network can oscillate around local minima, making it challenging to reach the global minimum.

Solutions for These Issues

Weights Regularization: The initialization of the weights can be adjusted to ensure that they are in an appropriate range. Using a different activation function, such as the Rectified Linear Unit (ReLU), can also help to mitigate the vanishing gradient problem.
Gradient Clipping: This involves limiting the maximum and minimum values of the gradient during backpropagation. It can prevent the gradients from becoming too large or too small and helps to stabilize the training process.
Batch Normalization: Normalizing the input to each layer can prevent the activation functions from saturating and helps to reduce both the vanishing and the exploding gradient problems.

Convolutional Neural Network (CNN)

Convolutional Neural Networks (CNNs) are a specialized class of neural networks designed to process grid-like data, such as images. They are particularly well suited for image recognition and processing tasks and are inspired by the visual processing mechanisms in the human brain. CNNs excel at capturing hierarchical patterns and spatial dependencies within images.

Working Principle of CNN

Input Image: The CNN receives an input image, which is typically preprocessed to ensure uniformity in size and format.
Convolutional Layers: Filters are applied to the input image to extract features like edges, textures, and shapes.
Pooling Layers: The feature maps generated by the convolutional layers are downsampled to reduce dimensionality.
Fully Connected Layers: The downsampled feature maps are passed through fully connected layers to produce the final output, such as a classification label.
Output: The CNN outputs a prediction, such as the class of the image.

Components of a Convolutional Neural Network

Convolutional Layers: These layers apply convolution operations to input images, using filters (also known as kernels) to detect features such as edges, textures, and more complex patterns. Convolution helps preserve the spatial relationships between pixels.
Pooling Layers: These layers downsample the spatial dimensions of the input, reducing the computational complexity and the number of parameters in the network. Max pooling is a common pooling operation, selecting the maximum value from a group of neighboring pixels.
Activation Functions: These introduce non-linearity to the model, allowing it to learn more complex relationships in the data.
Fully Connected Layers: These layers are responsible for making predictions based on the high-level features learned by the previous layers. They connect every neuron in one layer to every neuron in the next layer.

(A small architecture sketch combining these components follows this list.)
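As a minimal illustration of how these components fit together, here is a sketch of a small image classifier using PyTorch. The layer sizes, channel counts, and the assumption of 28x28 grayscale inputs with 10 classes are illustrative, not from the original slides.

```python
import torch
import torch.nn as nn

class TinyCNN(nn.Module):
    """Minimal CNN: conv -> ReLU -> pool, twice, then fully connected layers."""
    def __init__(self, num_classes: int = 10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 8, kernel_size=3, padding=1),   # convolutional layer: learned filters
            nn.ReLU(),                                    # activation: non-linearity
            nn.MaxPool2d(2),                              # pooling: 28x28 -> 14x14
            nn.Conv2d(8, 16, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),                              # 14x14 -> 7x7
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(16 * 7 * 7, 64),                    # fully connected layer
            nn.ReLU(),
            nn.Linear(64, num_classes),                   # one score per class
        )

    def forward(self, x):
        return self.classifier(self.features(x))

# One batch of 4 fake 28x28 grayscale images, just to check shapes
model = TinyCNN()
logits = model(torch.randn(4, 1, 28, 28))
print(logits.shape)   # torch.Size([4, 10])
```

In practice a softmax (or a cross-entropy loss that applies it internally) converts these output scores into class probabilities, as described in the working principle above.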
Convolutional Neural Network Training

The training process for a CNN involves the following steps (a training-loop sketch appears at the end of this transcript):

Data Preparation: The training images are preprocessed to ensure that they are all in the same format and size.
Loss Function: A loss function is used to measure how well the CNN is performing on the training data. It is typically calculated from the difference between the predicted labels and the actual labels of the training images.
Optimizer: An optimizer is used to update the weights of the CNN in order to minimize the loss function.
Backpropagation: Backpropagation is used to calculate the gradients of the loss function with respect to the weights of the CNN. The gradients are then used by the optimizer to update the weights.

Advantages of CNN

High Accuracy: CNNs achieve state-of-the-art accuracy in various image recognition tasks.
Efficiency: CNNs are efficient, especially when implemented on GPUs.
Robustness: CNNs are robust to noise and variations in input data.
Adaptability: CNNs can be adapted to different tasks by modifying their architecture.

Disadvantages of CNN

Complexity: CNNs can be complex and difficult to train, especially for large datasets.
Resource-Intensive: CNNs require significant computational resources for training and deployment.
Data Requirements: CNNs need large amounts of labeled data for training.
Interpretability: CNNs can be difficult to interpret, making it challenging to understand their predictions.

Sources: Gradient Descent Algorithm in Machine Learning | GeeksforGeeks; Vanishing and Exploding Gradients Problems in Deep Learning | GeeksforGeeks; Introduction to Convolution Neural Network | GeeksforGeeks; Backpropagation in Data Mining | GeeksforGeeks
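To tie the training steps above together, here is a sketch of a typical PyTorch training loop for a small CNN. The one-block model, random stand-in data, optimizer choice, learning rate, and batch size are illustrative assumptions; a real setup would use a preprocessed image dataset and a fuller architecture such as the TinyCNN sketched earlier.

```python
import torch
import torch.nn as nn

# A small stand-in CNN (one conv block followed by a fully connected layer)
model = nn.Sequential(
    nn.Conv2d(1, 8, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Flatten(), nn.Linear(8 * 14 * 14, 10),
)
criterion = nn.CrossEntropyLoss()                          # loss function (softmax applied internally)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)   # gradient-descent optimizer

# Stand-in for a real, preprocessed dataset: random 28x28 "images" and labels
images = torch.randn(32, 1, 28, 28)
labels = torch.randint(0, 10, (32,))

for epoch in range(5):
    optimizer.zero_grad()              # clear gradients from the previous step
    logits = model(images)             # forward pass
    loss = criterion(logits, labels)   # measure error against the labels
    loss.backward()                    # backpropagation: gradients via the chain rule
    optimizer.step()                   # optimizer: gradient-descent weight update
    print(f"epoch {epoch}: loss = {loss.item():.4f}")
```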