Artificial Intelligence Lecture Notes PDF

Document Details


Uploaded by LyricalCherryTree

Kafr El Sheikh University

Tags

artificial intelligence, machine learning, neural networks, deep learning

Summary

These lecture notes provide an introduction to artificial intelligence, focusing on key concepts like deep learning, machine learning, and neural networks, along with different types of activation functions. The document includes diagrams and mathematical representations to illustrate the concepts.

Full Transcript

# Artificial Intelligence

## Introduction
- The image shows several robotic heads with gears and wires.
- It also shows a Venn diagram of the relationship between deep learning, machine learning, and artificial intelligence.
- Deep Learning is a subfield of Machine Learning, which is a subfield of Artificial Intelligence.

## Model of an artificial neuron
- An image depicting the model of an artificial neuron.
- It shows inputs ($x_1$, $x_2$, ..., $x_n$), weights ($w_1$, $w_2$, ..., $w_n$), a summation function with "net" as output, an activation function $f$, and output $y$.
- That is, $net = \sum_{i=1}^{n} w_i x_i$ and $y = f(net)$.

## Multi-layer net
- An image depicting the model of a multi-layer neural network.
- It shows input nodes ($x_1$, $x_2$, ..., $x_n$) connected to several hidden layers, each with multiple nodes. All nodes are connected by edges, and the output nodes are connected to the nodes of the last hidden layer.

## Supervised Learning
- **The Supervised Learning Process:**
  - The image shows a flow chart representing the supervised learning process.
  - Labeled training data is passed through a machine learning algorithm to create a predictive model; the model is then evaluated.
  - A feedback loop is used to improve the prediction model.
- **Example:**
  - "Let's say you want to predict, for a group of people, their chances of becoming Coronavirus infected."
  - "You will need a training dataset from previous cases."
  - "You will discover new direct relationships via a feedback loop."
  - "Your model will evolve and become more and more accurate."

## Unsupervised Learning
- **The Unsupervised Learning Process:**
  - The image shows the flow chart of the unsupervised learning process.
  - Unlabeled input data is passed through a machine learning algorithm to generate output.
- **Example:**
  - "There are many animals, snakes, birds and insects that you have never ever seen in your life."
  - "But when you see a new bird that no one has labeled as a bird for you, you can still make out that it is a bird because it has feathers, it has a beak, it can fly, etc."
  - "This is unsupervised learning. Computationally complex and less accurate."

## Supervised Learning Problems
- Supervised learning problems can be further grouped into regression and classification problems.
- **Classification:**
  - "A classification problem is when the output variable is a category, such as 'red' or 'blue', or 'disease' and 'no disease'."
- **Regression:**
  - "A regression problem is when the output variable is a real value, such as 'dollars' or 'weight'."

## List of Common Supervised Machine Learning Algorithms
- Decision Trees
- K Nearest Neighbors
- Linear SVC (Support Vector Classifier)
- Logistic Regression
- Linear Regression

## Evaluating Classification Methods
- **Predictive accuracy:**
  - Accuracy = Number of correct classifications / Total number of test cases
- **Efficiency:**
  - Time to construct the model.
  - Time to use the model.
- **Robustness:**
  - Handling noise and missing values.
- **Scalability:**
  - Efficiency on disk-resident databases.
- **Interpretability:**
  - Understandability of, and insight provided by, the model.
- **Compactness of the model:**
  - Size of the tree, or the number of rules.

## Classification Model
- **Step 1: Import libraries** - Import the necessary libraries from scikit-learn.
- **Step 2: Load dataset** - Load the Iris dataset, which is included in scikit-learn.
- **Step 3: Split dataset** - Split the dataset into training and testing sets.
- **Step 4: Create SVM Classifier** - Create an SVM classifier with a linear kernel.
- **Step 5: Train Classifier** - Train the classifier using the training data.
- **Step 6: Make Predictions** - Use the trained classifier to make predictions on the test data.
- **Step 7: Evaluate Model** - Calculate the accuracy of the model and print it, along with the predicted and actual labels (see the code sketch after this list).
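The steps above correspond to a standard scikit-learn workflow. The transcript does not reproduce the original code, so the sketch below is a minimal reconstruction; parameter choices such as `test_size` and `random_state` are illustrative assumptions.

```python
# Step 1: Import libraries
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# Step 2: Load the Iris dataset bundled with scikit-learn
iris = datasets.load_iris()
X, y = iris.data, iris.target

# Step 3: Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Step 4: Create an SVM classifier with a linear kernel
clf = SVC(kernel="linear")

# Step 5: Train the classifier on the training data
clf.fit(X_train, y_train)

# Step 6: Make predictions on the test data
y_pred = clf.predict(X_test)

# Step 7: Evaluate the model
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Predicted:", y_pred)
print("Actual:   ", y_test)
```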
## Machine vs Deep Code
- **Machine Learning Code:**
  - Library Importation
  - Data Preparation
  - Model Definition
  - Model Compilation
  - Training
  - Evaluation and Plotting
- **Deep Learning Code:**
  - Import Libraries
  - Load Dataset
  - Split Dataset
  - Create SVM Classifier
  - Train Classifier
  - Make Predictions
  - Evaluate Model

## Softmax Layer
- **Introduction to Neural Networks:**
  - The softmax layer applies softmax activations to output probability values in the range [0, 1].
  - The values $z$ input to the softmax layer are referred to as **logits**.
- **Probability:**
  - $0 < y_i < 1$
  - $\sum_i y_i = 1$
- *An image depicting a softmax layer:*
  - The image shows a network of three nodes, each with an input $z$, an output $e^z$, and a weight assigned to it.
  - The network sums $e^z$ over all the nodes and then calculates the probability $y_i$ for each of the three nodes.

## Activation Functions
- **Introduction to Neural Networks:**
  - Non-linear activations are needed to learn complex (non-linear) data representations.
  - Otherwise, NNs would be just a linear function (since $W_1 W_2 x = Wx$).
  - NNs with a large number of layers (and neurons) can approximate more complex functions.
- *Figure: more neurons improve representation (but may overfit).*
  - An image showcasing how the number of hidden neurons affects the decision boundary: it shows three different decision boundaries created using 3 hidden neurons, 6 hidden neurons, and 20 hidden neurons.

## Activation: Sigmoid
- **Introduction to Neural Networks:**
  - The sigmoid function $\sigma$ takes a real-valued number and "squashes" it into the range between 0 and 1.
  - The output can be interpreted as the firing rate of a biological neuron.
    - Not firing = 0; fully firing = 1.
  - When the neuron's activations are 0 or 1, sigmoid neurons saturate.
    - Gradients in these regions are almost zero (almost no signal will flow).
  - Sigmoid activations are less common in modern NNs.
- *An image depicting the sigmoid activation function.*
  - The image shows the sigmoid function $f(x) = 1/(1+e^{-x})$; its range is $[0, 1]$.

## Activation: Tanh
- **Introduction to Neural Networks:**
  - **Tanh function:** takes a real-valued number and "squashes" it into the range between -1 and 1.
  - Like sigmoid, tanh neurons saturate.
  - Unlike sigmoid, the output is zero-centered.
    - It is therefore preferred over sigmoid.
  - Tanh is a scaled sigmoid: $\tanh(x) = 2\sigma(2x) - 1$.
- *An image depicting the tanh activation function.*
  - The image shows the tanh function $\tanh(x) = 2/(1+e^{-2x}) - 1$; its range is $[-1, 1]$.

## Activation: ReLU
- **Introduction to Neural Networks:**
  - **ReLU (Rectified Linear Unit):** takes a real-valued number and thresholds it at zero: $f(x) = \max(0, x)$.
- **Properties of ReLU:**
  - Most modern deep NNs use ReLU activations.
  - **ReLU is fast to compute** compared to sigmoid and tanh: simply threshold a matrix at zero.
  - **Accelerates the convergence of gradient descent** due to its linear, non-saturating form.
  - Helps prevent the vanishing gradient problem.
- *An image depicting the ReLU activation function.*
  - The image shows the ReLU function: $f(x) = 0$ for $x < 0$ and $f(x) = x$ for $x \geq 0$.

## Activation: Leaky ReLU
- **Introduction to Neural Networks:**
  - **The problem with ReLU activations: they can "die".**
    - ReLU could cause the weights to update in a way that the gradient becomes zero and the neuron never activates again on any data.
    - E.g., when a large learning rate is used.
  - **The Leaky ReLU activation function is a variant of ReLU.**
    - Instead of the function being 0 when $x < 0$, a Leaky ReLU has a small negative slope (e.g., $\alpha = 0.01$, or similar).
  - **This resolves the dying ReLU problem.**
  - **Most current works still use ReLU.**
    - With a proper setting of the learning rate, the dying ReLU problem can be avoided.
- *An image depicting the Leaky ReLU activation function.*
  - The image shows the Leaky ReLU function: $f(x) = \alpha x$ for $x < 0$ and $f(x) = x$ for $x \geq 0$.

## Activation: Linear Function
- **Introduction to Neural Networks:**
  - A **linear function** means that the output signal is proportional to the input signal to the neuron.
  - If the value of the constant $c$ is 1, it is also called the **identity activation function**.
  - This activation type is used in regression problems.
    - E.g., the last layer can have a linear activation function, in order to output a real number (and not a class membership).
- *An image depicting the linear function.*
  - The image shows the linear function $f(x) = cx$.
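Not part of the original slides: a small NumPy sketch of the activation functions and the softmax defined above, assuming $\alpha = 0.01$ for Leaky ReLU and $c = 1$ for the linear (identity) activation.

```python
import numpy as np

def sigmoid(x):
    # Squashes a real-valued input into (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    # Zero-centered, squashes input into (-1, 1); equal to 2*sigmoid(2x) - 1
    return np.tanh(x)

def relu(x):
    # Thresholds at zero: f(x) = max(0, x)
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.01):
    # Small negative slope alpha for x < 0 avoids "dying" units
    return np.where(x < 0, alpha * x, x)

def linear(x, c=1.0):
    # Identity activation when c = 1; used for regression outputs
    return c * x

def softmax(z):
    # Maps logits z to probabilities in (0, 1) that sum to 1
    e = np.exp(z - np.max(z))  # subtract the max for numerical stability
    return e / e.sum()

z = np.array([-2.0, 0.0, 3.0])
print(sigmoid(z), tanh(z), relu(z), leaky_relu(z), linear(z))
print(softmax(z), softmax(z).sum())  # probabilities; the sum is 1.0
```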
## Training Neural Networks
- **Training NNs:**
  - The network **parameters Θ** include the **weight matrices and bias vectors** of all layers: $\Theta = \{W^1, b^1, W^2, b^2, \ldots, W^L, b^L\}$.
  - Often, the model parameters Θ are referred to simply as **weights**.
  - Training a model to learn a set of parameters Θ that is optimal (according to a criterion) is one of the greatest challenges in ML.
- *An image depicting training a neural network to identify handwritten digits, with an output probability for each digit.*
- **Example:**
  - Input digit 1: output probability 0.1 of being 1.
  - Input digit 2: output probability 0.7 of being 2.
  - Input digit 9: output probability 0.2 of being 0.

## Training NNs
- **Data preprocessing** helps convergence during training.
  - Mean subtraction, to obtain zero-centered data.
    - Subtract the mean for each individual data dimension (feature).
  - Normalization.
    - Divide each feature by its standard deviation, to obtain a standard deviation of 1 for each data dimension (feature).
    - Or, scale the data to the range [0, 1] or [-1, 1].
    - E.g., image pixel intensities are divided by 255 to scale them into the [0, 1] range.
- *Image showing an example of original, zero-centered, and normalized data.*

## Batch Normalization
- **Batch normalization layers** act similarly to the **data preprocessing** steps mentioned earlier.
  - They calculate the mean $\mu$ and variance $\sigma^2$ of a batch of input data, and normalize the data $x$ to zero mean and unit variance: $\mu = \frac{1}{n}\sum_i x_i$ and $\sigma^2 = \frac{1}{n-1}\sum_i (x_i - \mu)^2$.
- **BatchNorm layers** alleviate the problems of proper initialization of the parameters and hyper-parameters.
  - They result in faster training convergence and allow larger learning rates.
  - They reduce the internal covariate shift.
- **BatchNorm layers** are inserted immediately after convolutional layers or fully-connected layers, and before activation layers.
  - They are very common in convolutional NNs.
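Not part of the original slides: a NumPy sketch of the preprocessing steps described above (scaling pixel intensities by 255, mean subtraction, and division by the standard deviation). The toy batch values are made up for illustration.

```python
import numpy as np

# Toy batch: n = 4 samples with 3 features (e.g., flattened pixel intensities)
X = np.array([[255.0, 10.0, 128.0],
              [  0.0, 20.0,  64.0],
              [128.0, 30.0, 192.0],
              [ 64.0, 40.0,  32.0]])

# Scale the data to the [0, 1] range (divide pixel intensities by 255)
X_scaled = X / 255.0

# Mean subtraction: zero-centered data, per feature (data dimension)
mu = X.mean(axis=0)
X_centered = X - mu

# Normalization: divide each feature by its standard deviation
sigma = X.std(axis=0, ddof=1)  # ddof=1 matches the (n - 1) denominator above
X_normalized = X_centered / sigma

print(X_normalized.mean(axis=0))          # approximately 0 for each feature
print(X_normalized.std(axis=0, ddof=1))   # approximately 1 for each feature
```

A BatchNorm layer performs this same standardization per batch inside the network (followed by a learnable scale and shift), which is why it is described as acting like the preprocessing steps above.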
## Training NNs
- To train a NN, set the parameters Θ such that, for a training subset of images, the corresponding elements of the predicted output have maximum values.
- *Image showing an example of the output probabilities for different digits: 1, 2, 9, 0.*

## Training NNs
- Define a **loss function / objective function / cost function** $L(\Theta)$ that calculates the difference (error) between the model prediction and the true label.
  - E.g., $L(\Theta)$ can be the **mean squared error**, **cross-entropy**, etc.
- *An image depicting a neural network with loss function $L(\Theta)$.*
- **Example:**
  - Input digit: 7
  - Output probability $y_1$: 0.2
  - Output probability $y_2$: 0.3
  - Output probability $y_3$: 0.5
  - True label: 1
  - The **cost function** $L(\Theta)$ calculates the error between the predicted output and the true label.

## Loss Functions
- **Classification tasks:**
  - Training examples: pairs of $N$ inputs $x_i$ and ground-truth class labels $y_i$.
  - Output layer: softmax activations (maps to a probability distribution): $P(y = j \mid x) = e^{z_j} / \sum_{l=1}^{K} e^{z_l}$.
  - **Loss function:** cross-entropy, $L(\Theta) = -\frac{1}{N}\sum_{i=1}^{N}\left[y_i \log(\hat{y}_i) + (1 - y_i)\log(1 - \hat{y}_i)\right]$, where $y_i$ are the ground-truth class labels and $\hat{y}_i$ the model's predicted class labels.
- *Image depicting the cross-entropy loss function, which measures the difference between the predicted and true class distributions.*

## Loss Functions
- **Regression tasks:**
  - Training examples: pairs of $N$ inputs $x_i$ and ground-truth output values $y_i$.
  - Output layer: linear (identity) or sigmoid activation.
  - **Loss function:**
    - **Mean Squared Error:** $L(\Theta) = \frac{1}{n}\sum_{i=1}^{n} (\hat{y}(x_i) - y_i)^2$.
    - **Mean Absolute Error:** $L(\Theta) = \frac{1}{n}\sum_{i=1}^{n} |\hat{y}(x_i) - y_i|$.
- *Image depicting the Mean Squared Error (MSE) loss. Both of these loss functions quantify the difference between predicted and true values, and both are used in regression.*

## Training NNs
- Optimizing the **loss function** $L(\Theta)$:
  - Almost all DL models these days are trained with a variant of the **gradient descent** (GD) algorithm.
  - GD applies iterative refinement of the network parameters Θ.
  - GD uses the opposite direction of the **gradient** of the loss with respect to the NN parameters (i.e., $\nabla L(\Theta) = [\partial L / \partial \Theta_i]$) to update Θ.
  - The gradient of the loss function, $\nabla L(\Theta)$, gives the direction of fastest increase of the loss $L(\Theta)$ when the parameters Θ are changed.
- *Image depicting gradient descent, with the cost function $L(\Theta)$, parameters Θ, and gradient $\nabla L(\Theta)$.*

## Gradient Descent Algorithm
- Steps in the **gradient descent** algorithm:
  1. Randomly initialize the model parameters, $\Theta^0$.
  2. Compute the gradient of the loss function at the initial parameters $\Theta^0$: $\nabla L(\Theta^0)$.
  3. Update the parameters as $\Theta^{new} = \Theta^0 - \alpha \nabla L(\Theta^0)$, where $\alpha$ is the learning rate.
  4. Go to step 2 and repeat (until a terminating criterion is reached).
- *Image depicting gradient descent with the cost function $L(\Theta)$, initial parameters $\Theta^0$, gradient $\nabla L(\Theta)$, and parameter update $\Theta^{new}$.*

## Gradient Descent Algorithm
- Example: a NN with only 2 parameters $w_1$ and $w_2$, i.e., $\Theta = \{w_1, w_2\}$.
- The different colors represent the values of the loss (the minimum loss, at $\Theta^*$, is 1.3).
- *An image depicting GD with initial parameters $\Theta^0$ and updated parameters $\Theta^1$, $\Theta^2$, and $\Theta^3$.*
- **Steps:**
  1. **Randomly pick a starting point** $\Theta^0$.
  2. **Compute the gradient at $\Theta^0$**: $\nabla L(\Theta^0)$.
  3. **Multiply by the learning rate $\eta$ and update Θ**: $\Theta^{new} = \Theta^0 - \eta \nabla L(\Theta^0)$.
  4. **Go to step 2 and repeat** (a code sketch of this loop follows below).
- $\nabla L(\Theta) = \left(\partial L(\Theta)/\partial w_1,\ \partial L(\Theta)/\partial w_2\right)$
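Not part of the original slides: a toy NumPy sketch of the gradient descent loop above for a model with two parameters $w_1$ and $w_2$. A simple linear model $\hat{y} = w_1 x + w_2$ and the MSE loss from the regression section are assumed; the data, learning rate, and number of steps are illustrative.

```python
import numpy as np

# Toy data: y = 3x + 2 plus a little noise (assumed for illustration)
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 50)
y = 3.0 * x + 2.0 + 0.05 * rng.standard_normal(50)

def loss(w1, w2):
    # Mean squared error L(Theta) for the linear model y_hat = w1*x + w2
    y_hat = w1 * x + w2
    return np.mean((y_hat - y) ** 2)

def gradient(w1, w2):
    # Partial derivatives dL/dw1 and dL/dw2 of the MSE loss above
    y_hat = w1 * x + w2
    dw1 = np.mean(2.0 * (y_hat - y) * x)
    dw2 = np.mean(2.0 * (y_hat - y))
    return dw1, dw2

# Step 1: randomly initialize Theta^0 = {w1, w2}
w1, w2 = rng.standard_normal(2)
eta = 0.5  # learning rate

# Steps 2-4: compute the gradient, update the parameters, repeat
for step in range(200):
    dw1, dw2 = gradient(w1, w2)
    w1 -= eta * dw1
    w2 -= eta * dw2

print(f"w1 = {w1:.2f}, w2 = {w2:.2f}, loss = {loss(w1, w2):.4f}")
```

Each iteration moves Θ in the direction opposite to the gradient, so the loss decreases until the parameters approach the minimizer $\Theta^*$ (here, roughly $w_1 \approx 3$, $w_2 \approx 2$).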
