Artificial Intelligence Lecture Notes PDF
Document Details
Kafr El Sheikh University
Summary
These lecture notes provide an introduction to artificial intelligence, focusing on key concepts like deep learning, machine learning, and neural networks, along with different types of activation functions. The document includes diagrams and mathematical representations to illustrate the concepts.
Full Transcript
# Artificial Intelligence

## Introduction
- The image shows several robotic heads with gears and wires.
- It also shows a Venn diagram showing the relationship between deep learning, machine learning, and artificial intelligence.
- Deep Learning is a subfield of Machine Learning, which is a subfield of Artificial Intelligence.

## Model of an artificial neuron
- An image depicting the model of an artificial neuron.
- It shows a network of inputs ($x_1$, $x_2$, ..., $x_n$), weights ($w_1$, $w_2$, ..., $w_n$), a summation function with "net" as output, an activation function $f$, and output $y$.

## Multi-layer net
- An image depicting the model of a multi-layer neural network.
- It shows input nodes ($x_1$, $x_2$, ..., $x_n$) connected to several hidden layers with multiple nodes in each layer. All nodes are connected by edges, and output nodes are connected to hidden-layer nodes.

## Supervised Learning
- **The Supervised Learning Process:**
  - The image shows a flow chart representing the supervised learning process.
  - Labeled training data is passed through a machine learning algorithm to create a predictive model; the model is then evaluated.
  - A feedback loop is used to improve the prediction model.
- **Example:**
  - "Let's say you want to predict, for a group of people, their chances of becoming Coronavirus infected."
  - "You will need a training dataset from previous cases."
  - "You will discover new direct relationships via a feedback loop."
  - "Your model will evolve and become more and more accurate."

## Unsupervised Learning
- **The Unsupervised Learning Process:**
  - The image shows the flow chart of the unsupervised learning process.
  - Unlabeled input data is passed through a machine learning algorithm to generate output.
- **Example:**
  - "There are many animals, snakes, birds and insects that you have never ever seen in your life."
  - "But when you see a new bird that no one has labeled as a bird for you, you can still make out that it is a bird because it has feathers, it has a beak, it can fly, etc."
  - "This is unsupervised learning. Computationally complex and less accurate."

## Supervised Learning Problems
- Supervised learning problems can be further grouped into regression and classification problems.
- **Classification:**
  - "A classification problem is when the output variable is a category, such as 'red' or 'blue', or 'disease' and 'no disease'."
- **Regression:**
  - "A regression problem is when the output variable is a real value, such as 'dollars' or 'weight'."

## List of Common Supervised Machine Learning Algorithms
- Decision Trees
- K Nearest Neighbors
- Linear SVC (Support Vector Classifier)
- Logistic Regression
- Linear Regression

## Evaluating Classification Methods
- **Predictive accuracy:**
  - Accuracy = Number of correct classifications / Total number of test cases
- **Efficiency:**
  - Time to construct the model.
  - Time to use the model.
- **Robustness:**
  - Handling noise and missing values.
- **Scalability:**
  - Efficiency on disk-resident databases.
- **Interpretability:**
  - Understandability of, and insight provided by, the model.
- **Compactness of the model:**
  - Size of the tree, or the number of rules.

## Classification Model
- **Step 1: Import libraries** - Import the necessary libraries from scikit-learn.
- **Step 2: Load dataset** - Load the Iris dataset, which is included in scikit-learn.
- **Step 3: Split dataset** - Split the dataset into training and testing sets using `train_test_split`.
- **Step 4: Create SVM classifier** - Create an SVM classifier with a linear kernel.
- **Step 5: Train classifier** - Train the classifier using the training data.
- **Step 6: Make predictions** - Use the trained classifier to make predictions on the test data.
- **Step 7: Evaluate model** - Calculate the accuracy of the model and print it, along with the predicted and actual labels (see the sketch after this list).
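The transcript lists these seven steps without reproducing the listing itself. A minimal sketch of what such code could look like, assuming scikit-learn's bundled Iris dataset and a linear-kernel `SVC` as described:

```python
# Step 1: Import libraries
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# Step 2: Load the Iris dataset bundled with scikit-learn
iris = datasets.load_iris()
X, y = iris.data, iris.target

# Step 3: Split into training and testing sets
# (test_size and random_state are illustrative choices, not from the notes)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Step 4: Create an SVM classifier with a linear kernel
clf = SVC(kernel="linear")

# Step 5: Train the classifier on the training data
clf.fit(X_train, y_train)

# Step 6: Make predictions on the test data
y_pred = clf.predict(X_test)

# Step 7: Evaluate the model
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Predicted labels:", y_pred)
print("Actual labels:   ", y_test)
```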
## Machine vs Deep Code
- **Machine Learning Code:**
  - Library Importation
  - Data Preparation
  - Model Definition
  - Model Compilation
  - Training
  - Evaluation and Plotting
- **Deep Learning Code:**
  - Import Libraries
  - Load Dataset
  - Split Dataset
  - Create SVM Classifier
  - Train Classifier
  - Make Predictions
  - Evaluate Model

## Softmax Layer
- **Introduction to Neural Networks:**
  - The softmax layer applies softmax activations to output a probability value in the range [0, 1].
  - The values $z$ input to the softmax layer are referred to as **logits**.
- **Probability:**
  - $0 < y_i < 1$
  - $\sum_i y_i = 1$
- *An image depicting a softmax layer:*
  - The image shows a network consisting of three nodes, each with an input $z$, an output $e^z$, and an assigned weight.
  - The network sums over all the nodes and then calculates the probability $y_i$ for each of the three nodes.

## Activation Functions
- **Introduction to Neural Networks:**
  - Non-linear activations are needed to learn complex (non-linear) data representations.
  - Otherwise, NNs would be just a linear function (such as $W_1 W_2 x = Wx$).
  - NNs with a large number of layers (and neurons) can approximate more complex functions.
    - *Figure: more neurons improve representation (but may overfit).*
- *An image showing how the number of hidden neurons affects the decision boundary.*
  - The image shows three different decision boundaries created using 3 hidden neurons, 6 hidden neurons, and 20 hidden neurons.

## Activation: Sigmoid
- **Introduction to Neural Networks:**
  - The sigmoid function $\sigma$ takes a real-valued number and "squashes" it into the range between 0 and 1.
  - The output can be interpreted as the firing rate of a biological neuron.
    - **Not firing = 0; fully firing = 1.**
  - When the neuron's activations are 0 or 1, sigmoid neurons saturate.
    - Gradients in these regions are almost zero (almost no signal will flow).
  - Sigmoid activations are less common in modern NNs.
- *An image depicting the sigmoid activation function.*
  - The image shows the sigmoid function $f(x) = 1/(1+e^{-x})$ and its range $[0, 1]$.

## Activation: Tanh
- **Introduction to Neural Networks:**
  - **Tanh function:** takes a real-valued number and "squashes" it into the range between -1 and 1.
  - Like sigmoid, tanh neurons saturate.
  - Unlike sigmoid, the output is zero-centered.
    - It is therefore preferred over sigmoid.
  - Tanh is a scaled sigmoid: $\tanh(x) = 2\sigma(2x) - 1$.
- *An image depicting the tanh activation function.*
  - The image shows the tanh function $\tanh(x) = 2/(1+e^{-2x}) - 1$ and its range $[-1, 1]$.

## Activation: ReLU
- **Introduction to Neural Networks:**
  - **ReLU (Rectified Linear Unit):** takes a real-valued number and thresholds it at zero.
    - $f(x) = \max(0, x)$
- **Properties of ReLU:**
  - Most modern deep NNs use ReLU activations.
  - **ReLU is fast to compute,** compared to sigmoid and tanh.
    - Simply threshold a matrix at zero.
  - **Accelerates the convergence of gradient descent,** due to its linear, non-saturating form.
    - Helps prevent the vanishing gradient problem.
- *An image depicting the ReLU activation function.*
  - The image shows the ReLU function: $f(x) = 0$ for $x < 0$ and $f(x) = x$ for $x \geq 0$.
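As a quick illustration of the sigmoid, tanh, and ReLU activations just described, here is a minimal NumPy sketch (not from the slides):

```python
import numpy as np

def sigmoid(x):
    # Squashes a real value into the range (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    # Zero-centered squashing into (-1, 1); equivalently 2*sigmoid(2x) - 1
    return np.tanh(x)

def relu(x):
    # Thresholds at zero: f(x) = max(0, x)
    return np.maximum(0.0, x)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print("sigmoid:", sigmoid(x))
print("tanh:   ", tanh(x))
print("ReLU:   ", relu(x))
```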
## Activation: Leaky ReLU
- **Introduction to Neural Networks:**
  - **The problem with ReLU activations: they can "die".**
    - ReLU can cause weights to update in such a way that the gradients become zero and the neuron never activates again on any data.
    - E.g., when a large learning rate is used.
  - **The Leaky ReLU activation function is a variant of ReLU.**
    - Instead of the function being 0 when $x < 0$, a leaky ReLU has a small negative slope (e.g., $\alpha = 0.01$, or similar).
    - This resolves the dying ReLU problem.
  - **Most current works still use ReLU.**
    - With a proper setting of the learning rate, the problem of dying ReLU can be avoided.
- *An image depicting the Leaky ReLU activation function.*
  - The image shows the Leaky ReLU function: $f(x) = \alpha x$ for $x < 0$ and $f(x) = x$ for $x \geq 0$.

## Activation: Linear Function
- **Introduction to Neural Networks:**
  - A **linear function** means that the output signal is proportional to the input signal to the neuron.
  - If the value of the constant $c$ is 1, it is also called the **identity activation function**.
  - This activation type is used in regression problems.
    - E.g., the last layer can have a linear activation function, in order to output a real number (and not a class membership).
- *An image depicting the linear function.*
  - The image shows the linear function $f(x) = cx$.

## Training Neural Networks
- **Training NNs:**
  - The network **parameters $\Theta$** include the **weight matrices and bias vectors** from all layers.
    - $\Theta = \{W^1, b^1, W^2, b^2, \ldots, W^L, b^L\}$
  - Often, the model parameters $\Theta$ are referred to as **weights**.
  - Training a model to learn a set of parameters $\Theta$ that are optimal (according to a criterion) is one of the greatest challenges in ML.
- *An image depicting training a neural network to identify handwritten digits, with an output probability for each digit.*
- **Example:**
  - Input number: 1, output probability of 0.1 of being 1.
  - Input number: 2, output probability of 0.7 of being 2.
  - Input number: 9, output probability of 0.2 of being 0.

## Training NNs
- **Data preprocessing** helps convergence during training.
  - Mean subtraction, to obtain zero-centered data.
    - Subtract the mean for each individual data dimension (feature).
  - Normalization.
    - Divide each feature by its standard deviation, to obtain a standard deviation of 1 for each data dimension (feature).
    - Or, scale the data within the range [0, 1] or [-1, 1].
      - E.g., image pixel intensities are divided by 255 to be scaled into the [0, 1] range.
- *Image showing an example of original, zero-centered, and normalized data.*

## Batch Normalization
- **Batch normalization layers** act similarly to the **data preprocessing** steps mentioned earlier.
  - They calculate the mean $\mu$ and variance $\sigma^2$ of a batch of input data, and normalize the data $x$ to zero mean and unit variance.
    - $\mu = \frac{1}{n}\sum_i x_i$ and $\sigma^2 = \frac{1}{n-1}\sum_i (x_i - \mu)^2$
- **BatchNorm layers** alleviate the problems of proper initialization of the parameters and hyper-parameters.
  - They result in faster training convergence and allow larger learning rates.
  - They reduce the internal covariate shift.
- **BatchNorm layers** are inserted immediately after convolutional layers or fully-connected layers, and before activation layers.
  - They are very common in convolutional NNs.
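A minimal NumPy sketch of the normalization a BatchNorm layer performs on a batch, following the mean/variance formulas above. The learnable scale and shift parameters `gamma` and `beta` are an assumption added for completeness; the notes only describe the normalization itself:

```python
import numpy as np

def batch_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize a batch of shape (batch_size, num_features) to zero mean, unit variance."""
    mu = x.mean(axis=0)                    # per-feature batch mean
    var = x.var(axis=0, ddof=1)            # per-feature batch variance (n-1, as in the notes)
    x_hat = (x - mu) / np.sqrt(var + eps)  # zero mean, unit variance
    return gamma * x_hat + beta            # assumed learnable scale/shift

batch = np.random.randn(8, 4) * 10 + 3     # toy batch with non-zero mean and large scale
normalized = batch_norm(batch)
print(normalized.mean(axis=0))             # approximately 0 per feature
print(normalized.std(axis=0, ddof=1))      # approximately 1 per feature
```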
## Training NNs
- To train a NN, set the parameters $\Theta$ such that, for a training subset of images, the corresponding elements in the predicted output have maximum values.
- *Image showing an example of the output probabilities for different digits: 1, 2, 9, 0.*

## Training NNs
- Define a **loss function / objective function / cost function** $L(\Theta)$ that calculates the difference (error) between the model prediction and the true label.
  - E.g., $L(\Theta)$ can be **mean-squared error**, **cross-entropy**, etc.
- *An image depicting a neural network with loss function $L(\Theta)$.*
- **Example:**
  - Input number: 7
  - Output probability $y_1$: 0.2
  - Output probability $y_2$: 0.3
  - Output probability $y_3$: 0.5
  - True label: 1
  - **Cost function:** $L(\Theta)$
  - The cost function $L(\Theta)$ calculates the error between the predicted output and the true label.

## Loss Functions
- **Classification tasks:**
  - Training examples: pairs of $N$ inputs $x_i$ and ground-truth class labels $y_i$.
  - Output layer: softmax activations (maps to a probability distribution).
    - $P(y = j \mid x) = \dfrac{e^{z_j}}{\sum_{l=1}^{K} e^{z_l}}$
  - **Loss function:** cross-entropy.
    - $L(\Theta) = -\dfrac{1}{N}\sum_{i=1}^{N}\left[y_i \log(\hat{y}_i) + (1 - y_i)\log(1 - \hat{y}_i)\right]$
    - Ground-truth class labels $y_i$ and model-predicted class labels $\hat{y}_i$.
  - *Image depicting the loss function and the cross-entropy function, which measures the difference between the predicted and true class distributions.*

## Loss Functions
- **Regression tasks:**
  - Training examples: pairs of $N$ inputs $x_i$ and ground-truth output values $y_i$.
  - Output layer: linear (identity) or sigmoid activation.
  - **Loss functions:**
    - **Mean Squared Error:** $L(\Theta) = \dfrac{1}{n}\sum_{i=1}^{n} (\hat{y}(x_i) - y_i)^2$
    - **Mean Absolute Error:** $L(\Theta) = \dfrac{1}{n}\sum_{i=1}^{n} |\hat{y}(x_i) - y_i|$
  - *Image depicting the Mean Squared Error (MSE) and Mean Absolute Error (MAE) loss functions. Both quantify the difference between predicted and true values, and both are used in regression.*

## Training NNs
- Optimizing the **loss function** $L(\Theta)$:
  - Almost all DL models these days are trained with a variant of the **gradient descent** (GD) algorithm.
  - GD applies iterative refinement of the network parameters $\Theta$.
  - GD uses the opposite direction of the **gradient** of the loss with respect to the NN parameters, i.e., $\nabla L(\Theta) = \left[\frac{\partial L}{\partial \Theta_1}, \ldots, \frac{\partial L}{\partial \Theta_n}\right]$, for updating $\Theta$.
  - The gradient of the loss function $\nabla L(\Theta)$ gives the direction of fastest increase of the loss function $L(\Theta)$ when the parameters $\Theta$ are changed.
- *Image depicting gradient descent, the cost function $L(\Theta)$, $\Theta$, and $\nabla L(\Theta)$.*

## Gradient Descent Algorithm
- Steps of the **gradient descent** algorithm:
  1. Randomly initialize the model parameters, $\Theta^0$.
  2. Compute the gradient of the loss function at the current parameters $\Theta^0$: $\nabla L(\Theta^0)$.
  3. Update the parameters as $\Theta^{new} = \Theta^0 - \alpha \nabla L(\Theta^0)$, where $\alpha$ is the learning rate.
  4. Go to step 2 and repeat (until a terminating criterion is reached).
- *Image depicting gradient descent, the cost function $L(\Theta)$, the initial parameters $\Theta^0$, the gradient $\nabla L(\Theta)$, and the parameter update $\Theta^{new}$.*

## Gradient Descent Algorithm
- Example: a NN with only 2 parameters $w_1$ and $w_2$, i.e., $\Theta = \{w_1, w_2\}$.
  - The different colors represent the values of the loss (the minimum of the loss, at $\Theta^*$, is 1.3).
- *An image depicting GD with initial parameters $\Theta^0$ and updated parameters $\Theta^1$, $\Theta^2$, and $\Theta^3$.*
- **Steps (see the sketch after this list):**
  1. **Randomly pick a starting point** $\Theta^0$.
  2. **Compute the gradient at $\Theta^0$:** $\nabla L(\Theta^0)$.
  3. **Multiply by the learning rate $\eta$ and update $\Theta$:** $\Theta^{new} = \Theta^0 - \eta \nabla L(\Theta^0)$.
  4. **Go to step 2 and repeat.**
- $\nabla L(\Theta) = \left(\dfrac{\partial L(\Theta)}{\partial w_1}, \dfrac{\partial L(\Theta)}{\partial w_2}\right)$
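To tie the loss-function and gradient-descent slides together, here is a minimal sketch (not from the notes) that runs gradient descent on a mean-squared-error loss with two parameters $w_1$ and $w_2$, mirroring the two-parameter example above; the toy data and learning rate are illustrative assumptions:

```python
import numpy as np

# Toy regression data: y = 2*x1 - 3*x2 plus a little noise
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = X @ np.array([2.0, -3.0]) + 0.1 * rng.normal(size=100)

def mse_loss(theta):
    # Mean squared error: L(theta) = 1/n * sum (y_hat - y)^2
    return np.mean((X @ theta - y) ** 2)

def gradient(theta):
    # Gradient of the MSE loss with respect to theta = (w1, w2)
    return 2.0 / len(y) * X.T @ (X @ theta - y)

# Step 1: randomly initialize the parameters Theta^0
theta = rng.normal(size=2)
eta = 0.1  # learning rate (illustrative value)

for step in range(100):
    # Steps 2-3: compute the gradient and update Theta_new = Theta - eta * grad
    theta = theta - eta * gradient(theta)

print("Learned parameters:", theta)   # close to [2, -3]
print("Final loss:", mse_loss(theta))
```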