Lec 5. Artificial Intelligence PDF
Summary
This document provides an overview of artificial intelligence, focusing on neural networks, machine learning, and deep learning concepts. It describes various types of neural networks like ANN, CNN, and RNN and includes diagrams illustrating these concepts. The document also explores the difference between machine learning and deep learning, as well as topics like optimization techniques and regularization.
Full Transcript
# ARTIFICIAL INTELLIGENCE

## Neural Network
- **Artificial Neural Network (ANN)**: Regression and classification
- **Convolutional Neural Network (CNN)**: Computer vision
- **Recurrent Neural Network (RNN)**: Time series analysis

## ML vs. Deep Learning
- **Deep Learning (DL)** is a machine learning subfield that uses multiple layers for learning data representations.
- DL is exceptionally effective at learning patterns.

### Machine Learning
- Diagram: An arrow points from Input → Feature extraction → Classification → Output. The output is labelled "Car / not Car".

### Deep Learning
- Diagram: An arrow points from Input → Feature extraction + Classification → Output. The output is labelled "Car / not Car".
- DL applies a multi-layer process for learning rich hierarchical features (i.e., data representations).
- Input image pixels → Edges → Textures → Parts → Objects

### The Whole CNN
- Diagram: A single image (of a cat) is fed into the network and passed through alternating layers: Convolution, Max Pooling, Convolution, Max Pooling.
- The network on the left is labelled "Fully Connected Feedforward network".
- The network on the right is labelled "Convolution".
- The network on the right can be repeated multiple times.

## Convolutional Neural Networks (CNNs)
- Feature extraction architecture:
  - After 2 convolutional layers, a max-pooling layer reduces the size of the feature maps (typically by 2).
  - A fully connected layer and a softmax layer are added last to perform classification.

## Machine vs. Deep Code
- **Machine Learning**
  - Import Libraries
  - Load Dataset
  - Split Dataset
  - Create SVM Classifier
  - Train Classifier
  - Make Predictions
  - Evaluate Model
- **Deep Learning**
  - Library Importation
  - Data Preparation
  - Model Definition
  - Model Compilation
  - Training
  - Evaluation and Plotting

## Batches vs. Epochs vs. Iterations
- **Batch**: A batch is a subset of the dataset.
  - The number of samples processed before the model is updated.
- **Epoch**: An epoch is one complete pass through the entire dataset.
  - During one epoch, the model sees every sample in the dataset exactly once.
- **Iteration**: An iteration refers to one **update of the model's parameters**.
  - In one epoch, the number of iterations is equal to the number of batches.
- Let's assume you have the following:
  - Dataset: 1000 samples
  - Batch size: 100 samples
  - Epochs: 10
- **Calculation:**
  - Number of batches per epoch: 1000 samples / 100 samples per batch = 10 batches
  - Total iterations: Number of epochs × Number of batches per epoch = 10 × 10 = 100 iterations

## Training the model
- **train_ds**: This is the dataset used for training. It contains the features and labels that will be used to train the model.
- **epochs = 30**: The model will go through the entire training dataset 30 times.
- **batch_size = 32**: During each epoch, the training data will be split into smaller batches of 32 samples. The model will update its weights after each batch.
- **history**: The output of model.fit() is stored in history. It contains details about the training process, such as the loss and accuracy metrics for each epoch. You can use this to plot the **training progress**, like loss or accuracy curves over epochs (see the sketch below).
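A minimal sketch of this training call, using dummy NumPy arrays in place of train_ds (the 2592-sample size matches the worked example that follows; the small network and random data are purely illustrative, not from the lecture):

```python
import numpy as np
import matplotlib.pyplot as plt
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

# Dummy stand-in for train_ds: 2592 samples with 20 features and binary labels
X_train = np.random.rand(2592, 20)
y_train = np.random.randint(2, size=(2592, 1))

model = Sequential([
    Dense(32, activation='relu', input_dim=20),
    Dense(1, activation='sigmoid'),
])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# 30 complete passes over the data, updating the weights after every 32 samples
history = model.fit(X_train, y_train, epochs=30, batch_size=32)

# history.history stores one value per epoch for each tracked metric
plt.plot(history.history['loss'], label='loss')
plt.plot(history.history['accuracy'], label='accuracy')
plt.xlabel('epoch')
plt.legend()
plt.show()
```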
- Suppose your dataset contains 2592 samples.
- You've set **batch_size = 32** in your model.fit() function, meaning the model will process 32 samples per batch.
- Calculate how many batches per epoch:
  - Number of batches = Total number of samples / Batch size = 2592 / 32 = 81 batches
- So, each epoch will consist of 81 batches of data, where each batch contains 32 samples. During training, after processing each batch, the counter increments, like:
  - 1/81 (after the first batch is processed)
  - 2/81 (after the second batch is processed)
  - ...
  - 81/81 (after all batches are processed for that epoch)

## # Import libraries
- `import numpy as np`
- `from tensorflow.keras.models import Sequential`
- `from tensorflow.keras.layers import Dense`
- `from tensorflow.keras.optimizers import SGD`

## # Generate dummy dataset
- `X = np.random.rand(1000, 20)`
- `y = np.random.randint(2, size=(1000, 1))`

## # Define a simple model
- `model = Sequential([Dense(32, activation='relu', input_dim=20), Dense(1, activation='sigmoid')])`

## # Compile the model
- `model.compile(optimizer=SGD(), loss='binary_crossentropy', metrics=['accuracy'])`

## # Define parameters
- `batch_size = 100`
- `epochs = 10`

## # Train the model
- `history = model.fit(X, y, epochs=epochs, batch_size=batch_size)`

## Adam
- **Adaptive Moment Estimation (Adam)**: Adam combines insights from the momentum optimizers that accumulate the values of past gradients, and it also introduces new terms based on the second moment of the gradient.
- **Similar to GD with momentum**: Adam computes a weighted average of past gradients (**first moment of the gradient**), i.e., $V_t = β_1 V_{t-1} + (1 - β_1) g_{t-1}$.
- **Adam also computes a weighted average of past squared gradients (second moment of the gradient)**, i.e., $U_t = β_2 U_{t-1} + (1 - β_2) g_{t-1}^2$.

## Generalization
- **Underfitting**: The model is too "simple" to represent all the relevant class characteristics.
  - E.g., a model with too few parameters.
  - Produces high error on the training set and high error on the validation set.
- **Overfitting**: The model is too "complex" and fits irrelevant characteristics (noise) in the data.
  - E.g., a model with too many parameters.
  - Produces low error on the training set and high error on the validation set.

## Optimization
- **Mathematical optimization**: "the selection of a best element (with regard to some criterion) from some set of available alternatives" (Wikipedia).
- **Main types**: finite-step, iterative, heuristic.
- **Learning as an optimization problem**:
  - **Cost function**: $J(θ) = \frac{1}{m} \sum_{i=1}^m L(f(x_i; θ), y_i) + R(θ)$

## Regularization: Weight Decay
- **l2 weight decay**: A regularization term that penalizes large weights is added to the loss function.
  - $L_{reg}(θ) = L(θ) + λ\sum_{k} θ_k^2$
- For every weight in the network, we add the regularization term to the loss value.
  - During the gradient descent parameter update, every weight is decayed linearly toward zero.
- **The weight decay coefficient λ determines how dominant the regularization is during the gradient computation.**

## Regularization: Dropout
- **Dropout**:
  - Randomly drop units (along with their connections) during training.
  - Each unit is dropped with a fixed **dropout rate p**, independently of the other units.
  - The hyper-parameter p needs to be chosen (tuned).
  - Often, between 20% and 50% of the units are dropped.
- **Dropout is a kind of ensemble learning.**
  - Each mini-batch trains a network with a slightly different architecture.
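A minimal Keras sketch of the two regularizers above, l2 weight decay and dropout, applied to a small dense network (the layer sizes, the weight-decay coefficient 0.01, and the dropout rate 0.3 are illustrative choices, not values from the lecture):

```python
from tensorflow.keras import Sequential, layers, regularizers

model = Sequential([
    # l2 weight decay: adds lambda * sum(w^2) for this layer's weights to the loss
    layers.Dense(64, activation='relu', input_dim=20,
                 kernel_regularizer=regularizers.l2(0.01)),
    # Dropout: randomly zeroes 30% of the units, during training only
    layers.Dropout(0.3),
    layers.Dense(1, activation='sigmoid'),
])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
```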
## Regularization: Early Stopping
- **Early stopping**:
  - During model training, use a validation set.
    - E.g., a validation/train split of about 25% / 75%.
  - Stop when the validation accuracy (or loss) has not improved after **n epochs**.
  - **The parameter n is called patience.**

## Batch Normalization
- **Batch normalization layers act similarly to the data preprocessing steps mentioned earlier.**
  - They calculate the mean μ and variance σ² of a batch of input data, and normalize the data x to zero mean and unit variance, i.e., $x' = \frac{x - μ}{σ}$.
- **BatchNorm layers alleviate the problems of proper initialization of the parameters and hyper-parameters.**
  - They result in faster training convergence and allow larger learning rates.
  - They reduce the internal covariate shift.
- **BatchNorm layers are inserted immediately after convolutional layers or fully-connected layers, and before activation layers.**
  - They are very common with convolutional NNs.

## Hyper-parameter Tuning
- **Training NNs can involve setting many hyper-parameters.**
- **The most common hyper-parameters include:**
  - Number of layers, and number of neurons per layer
  - Initial learning rate
  - Learning rate decay schedule (e.g., decay constant)
  - Optimizer type
- **Other hyper-parameters may include:**
  - Regularization parameters (l2 penalty, dropout rate)
  - Batch size
  - Activation functions
  - Loss function
- **Hyper-parameter tuning can be time-consuming for larger NNs.**

## k-Fold Cross-Validation
- **Illustration of a 5-fold cross-validation:**
- Diagram: A table shows the dataset divided into 5 folds; in each of the 5 splits a different fold is held out while the model parameters are found on the remaining folds, and a final evaluation is performed at the end (see the sketch below).
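A small scikit-learn sketch of the 5-fold idea illustrated above (the logistic-regression model and the random data are just placeholders for whatever model and dataset are being validated):

```python
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.linear_model import LogisticRegression

# Placeholder dataset: 100 samples, 20 features, binary labels
X = np.random.rand(100, 20)
y = np.random.randint(2, size=100)

# 5 folds: each split trains on 4 folds and validates on the held-out fold
kf = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=kf)

print(scores)         # one validation score per fold
print(scores.mean())  # average performance across the 5 splits
```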
## Adam Optimizer
- **The optimizer is responsible for adjusting the weights of the network based on the loss gradient to minimize the loss function. Adam combines the advantages of the Adaptive Gradient Algorithm (AdaGrad) and Root Mean Square Propagation (RMSProp).**
- **loss='categorical_crossentropy'**:
  - Commonly used for **multi-class classification problems**.
  - The loss function calculates the **difference between the true labels and the predictions made by the model**, guiding the optimizer on how to adjust the weights to reduce that difference.

## Parameters for Model Compilation
1. **Optimizer**: Specifies the optimization algorithm to use.
2. **Loss**: Specifies the loss function to use.
3. **Metrics**: Specifies the metrics to evaluate during training and testing.

## Common Options for Each Parameter

### 1. Optimizer
- **Adam: Adaptive Moment Estimation.**
  - `optimizer = 'adam'`
  - `optimizer = tf.keras.optimizers.Adam(learning_rate=0.001)`
- **SGD: Stochastic Gradient Descent.**
  - `optimizer = 'sgd'`
  - `optimizer = tf.keras.optimizers.SGD(learning_rate=0.01, momentum=0.9)`
- **RMSprop: Root Mean Square Propagation.**
  - `optimizer = 'rmsprop'`
  - `optimizer = tf.keras.optimizers.RMSprop(learning_rate=0.001)`
- **Adagrad: Adaptive Gradient Algorithm.**
  - `optimizer = 'adagrad'`
  - `optimizer = tf.keras.optimizers.Adagrad(learning_rate=0.01)`
- **Adadelta: Adaptive Delta.**
  - `optimizer = 'adadelta'`
  - `optimizer = tf.keras.optimizers.Adadelta(learning_rate=1.0)`
- **Adamax: Variant of Adam based on the infinity norm.**
  - `optimizer = 'adamax'`
  - `optimizer = tf.keras.optimizers.Adamax(learning_rate=0.002)`

### 2. Loss
- **Binary Crossentropy**: For **binary classification**.
  - `loss = 'binary_crossentropy'`
  - `loss = tf.keras.losses.BinaryCrossentropy()`
- **Categorical Crossentropy**: For **multi-class classification**.
  - `loss = 'categorical_crossentropy'`
- **Mean Squared Error**: For regression tasks.
  - `loss = 'mean_squared_error'`
  - `loss = tf.keras.losses.MeanSquaredError()`

### 3. Metrics
- `metrics = ['accuracy']`
- `metrics = ['precision']`
- `metrics = ['recall']`
- `metrics = ['AUC']`

## # Create the confusion matrix
- `from sklearn.metrics import confusion_matrix`
- `cm = confusion_matrix(y_true, y_pred)`

## # Compile the model with specified parameters
- `model.compile(optimizer=Adam(learning_rate=0.001), loss='binary_crossentropy', metrics=['accuracy', 'AUC'])`

## # Display the model summary
- `model.summary()`

## How to use a HuggingFace model with LangChain on Google Colab?
- Use a HuggingFace Hub model.
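One possible sketch of that setup, assuming the `langchain`, `langchain-community`, and `huggingface_hub` packages are installed in the Colab session and a HuggingFace access token is available; the repo id, the prompt, and the exact import path are assumptions that vary between LangChain versions, not details from the lecture:

```python
# In a Colab cell: !pip install langchain langchain-community huggingface_hub
import os
from langchain_community.llms import HuggingFaceHub

# LangChain reads the token from this environment variable; paste your own token
os.environ["HUGGINGFACEHUB_API_TOKEN"] = "hf_..."  # placeholder

# repo_id names a hosted model on the HuggingFace Hub (placeholder choice)
llm = HuggingFaceHub(
    repo_id="google/flan-t5-large",
    model_kwargs={"temperature": 0.5, "max_length": 64},
)

# Send a prompt to the hosted model through LangChain and print the reply
print(llm.invoke("What is a convolutional neural network?"))
```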