TBL notes for Week 2.pdf

Chi Wei: Many years from now, when you are out working, I may no longer be with you, but I hope these notes will still accompany you. So I have decided to share the PowerPoint notes with extra examples. In future, you can check with a foundation model or a search engine to see whether any of this has been updated, but I hope this basic knowledge will always stay with you. (2024)

PowerPoint summary and extra explanation (proofread by Max and my colleagues)

Slide 2
The learning objectives focus on understanding the fundamental concepts of AI and its relevance to biomedical engineering, exploring the history and evolution of AI applications in the field, and identifying the key challenges and opportunities in using AI to solve biomedical engineering problems. You will learn or revise basic AI principles, including machine learning and neural networks, and their applications in analyzing complex biomedical data to enhance diagnostics and patient care. We will also cover the historical perspective, including significant milestones and breakthroughs such as advancements in medical imaging and personalized medicine. Additionally, the objectives highlight the challenges of data privacy, algorithm transparency, and clinical integration, while emphasizing the opportunities AI presents in predictive analytics, drug discovery, and wearable health technologies.

Slide 5
Questions you may want to ask yourself after this class: What is AI? What if AI fails? Where are we heading?

Slide 6
Artificial intelligence (AI) in healthcare has revolutionized the diagnostic process by leveraging advanced algorithms, machine learning models, and vast amounts of data to improve the accuracy, speed, and efficiency of diagnosing diseases.

Slide 15
Artificial intelligence (AI) is a broad field encompassing the development of systems that can perform tasks typically requiring human intelligence.
Machine learning (ML) is a subset of AI that involves training algorithms on data to make predictions or decisions without being explicitly programmed for specific tasks. Neural networks, inspired by the structure of the human brain, are a key technique within ML, consisting of interconnected layers of nodes that process and learn from data. Deep learning, a specialized area within neural networks, involves multiple layers (deep neural networks) that enable the extraction of high-level features and complex patterns from vast amounts of data, driving advancements in fields such as image and speech recognition.

In short: AI is the overarching field aiming to create systems that mimic human intelligence. ML, a subset of AI, focuses on algorithms that improve with experience. Deep learning (DL), a subset of ML, uses neural networks with many layers for complex tasks. Data science (DS) involves extracting insights from data using various techniques, including ML. Each plays a distinct role in harnessing data for intelligent solutions.

From 2010 onwards, CNNs, RNNs, LSTMs, and GANs came to prominence; the current wave is foundation models, with quantum computing seen as the next wave of technology.

Slide 20
Kaul, Enslin, and Gross (2020) provide an overview of the history of artificial intelligence (AI) in medicine, with a focus on its application in gastrointestinal endoscopy. The article traces the evolution of AI from early expert systems to contemporary deep learning algorithms. It discusses milestones such as the development of MYCIN and the increasing integration of AI in diagnostic imaging and clinical decision support systems. The authors highlight how advancements in computational power and data availability have accelerated AI's impact on improving diagnostic accuracy and patient outcomes in gastroenterology.

Slide 31
Deep learning builds upon the foundation of artificial neural networks (ANNs). The process involves several key concepts:
1. Artificial Neural Networks (ANN): The basic structure, loosely modeled on the human brain's network of neurons.
2. Backpropagation: A method for training ANNs that propagates the error backward through the network to work out how each weight should be adjusted to reduce it.
3. Fully Connected Layers: Layers where each neuron connects to every neuron in the next layer.
4. Convolutional Layers: Specialized layers for processing grid-like data, such as images, by focusing on local features.
5. Overfitting: A challenge where the model performs well on training data but poorly on new, unseen data due to excessive complexity.

Slide 32
An Artificial Neural Network (ANN) is analogous to the human brain's network of neurons. In a biological neuron, dendrites receive input signals, which are processed in the cell body, and if the signal is strong enough, it triggers an output signal through the axon. Similarly, in an ANN, input nodes (analogous to dendrites) receive data, which is processed through layers of interconnected nodes (like the cell body), and an output is produced by the final layer (like an axon). This structure enables the network to learn and make decisions based on input data, mimicking how the brain processes information.

Slide 33
In an Artificial Neural Network (ANN), the transfer of signals between nodes (neurons) through weighted connections is analogous to neurotransmitters crossing the synaptic cleft. Just as neurotransmitters convey information across the synaptic gap to influence the next neuron's activity, weighted connections in an ANN transmit signals between nodes, influencing the activation of subsequent layers. These weights are adjusted during training to optimize the network's performance, similar to how neurotransmitter activity can modulate synaptic strength in biological neurons.

Slide 34
Biological Neuron: A biological neuron is a nerve cell in the brain and nervous system that processes and transmits information through electrical and chemical signals.
It consists of dendrites that receive signals, a cell body that processes these signals, and an axon that transmits the signal to other neurons or muscles via neurotransmitters across a synaptic cleft.
Artificial Neuron: An artificial neuron, modeled after the biological neuron, is a mathematical function used in artificial neural networks. It takes input values, combines them in a weighted sum, applies an activation function, and produces an output. This output is then passed to other neurons in the network to mimic learning and decision-making processes.

Slide 37
This equation describes how an artificial neuron processes inputs to produce an output, analogous to how biological neurons process signals. The inputs are weighted, summed, and passed through an activation function to produce the output, similar to how a biological neuron sums up incoming signals and decides whether or not to fire.
Neural Network: A simple neural network, also known as a shallow neural network, typically consists of an input layer, one or two hidden layers, and an output layer.
Deep Neural Network (DNN): A deep neural network is characterized by having multiple hidden layers between the input and output layers.

Slide 40
Yann LeCun, Geoffrey Hinton, and Yoshua Bengio are widely recognized as pioneers in the field of deep learning. They were jointly awarded the 2018 Turing Award, often referred to as the "Nobel Prize of Computing," for their significant contributions to the development and advancement of deep learning techniques. Their work laid the foundation for many of the AI and machine learning technologies that are widely used today. Specifically:
Geoffrey Hinton is known for his work on backpropagation and deep belief networks, which are foundational to training deep neural networks.
Yann LeCun developed convolutional neural networks (CNNs), which are now a core technology in image and video recognition.
Yoshua Bengio contributed extensively to the development of probabilistic models and deep learning algorithms, further advancing the field.
Together, their work has revolutionized the way machines learn from data and has driven significant progress in artificial intelligence.

Slide 41
A brief description of each component of an Artificial Neural Network (ANN):
1. Activation Function: The activation function determines whether, and how strongly, a neuron is activated, based on the weighted sum of its inputs. It introduces non-linearity into the model, allowing the network to learn and model complex data patterns. Common activation functions include the sigmoid, ReLU (Rectified Linear Unit), and tanh functions.
2. Weights: Weights are the parameters within an ANN that are adjusted during the training process. They determine the strength and direction (positive or negative) of the influence of input signals on the neuron's output. By adjusting the weights, the network learns from the data, improving its ability to make accurate predictions.
3. Cost Function: The cost function, also known as the loss function, measures the difference between the predicted output and the actual output during training. It quantifies the error in the network's predictions. The goal of training an ANN is to minimize this cost function, thereby improving the accuracy of the model.
4. Learning Algorithm: The learning algorithm is the method used to update the weights in the ANN based on the error calculated by the cost function. A common learning algorithm is backpropagation combined with gradient descent, which iteratively adjusts the weights to minimize the cost function. The learning algorithm is crucial for enabling the network to learn from data and improve over time.

Slide 43
Neurons in an artificial neural network can be seen as functions that take input values, apply weights, sum them up, and then pass the result through an activation function to produce an output.
The gradient, which is the derivative of the cost function with respect to the weights, guides how the weights should be adjusted during training. When training a neural network, the gradient provides information on how to change the weights to reduce the error or cost. This process is part of backpropagation, where the gradient is used to update the weights in a direction that minimizes the cost function, thereby optimizing the network's performance. In summary, neurons act as functions that process inputs, and gradients are used to optimize these functions by adjusting the weights during learning. Backpropagation performs a backward pass to compute these gradients so that the model's parameters can be adjusted to minimize the cost, for example the mean squared error (MSE).

Slide 46
In this scenario, an image is processed by stretching its pixels into a single column to serve as input to a neural network. This approach is often used in simple feedforward neural networks, where each pixel value is treated as an individual input feature.
Process Overview:
1. Input Image Stretching:
  o The image, typically a matrix of pixel values, is flattened into a single column vector. For example, a 28x28 pixel image would be stretched into a column of 784 pixel values.
2. Neural Network Processing:
  o This column vector serves as the input layer to a neural network.
  o The network processes the input through its layers, where each neuron applies a weighted sum and activation function.
3. Output Scores:
  o The final layer of the network outputs scores for different classes, such as "cat score," "dog score," and "ship score."
  o Each score represents the network's confidence that the input image belongs to that particular class.
4. Interpretation:
  o The class with the highest score is typically chosen as the network's prediction for the image.
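The process overview above can be sketched in plain Python. This is only an illustration under stated assumptions: it uses a single fully connected layer with random, untrained weights, so the predicted label is arbitrary, and the names `flatten`, `class_scores`, and `predict` are hypothetical helpers rather than anything taken from the slides.

```python
import random


def flatten(image):
    """Stretch a 2-D grid of pixel values into one flat list, row by row."""
    return [pixel for row in image for pixel in row]


def class_scores(pixels, weights, biases):
    """One fully connected layer: a weighted sum of all pixels per class."""
    return [
        sum(w * p for w, p in zip(class_weights, pixels)) + b
        for class_weights, b in zip(weights, biases)
    ]


def predict(image, weights, biases, labels):
    """Return the label with the highest score, plus all the scores."""
    scores = class_scores(flatten(image), weights, biases)
    best = max(range(len(scores)), key=scores.__getitem__)
    return labels[best], scores


random.seed(0)
labels = ["cat", "dog", "ship"]
# Assumed toy input: a 28x28 image of random pixel intensities in [0, 1).
image = [[random.random() for _ in range(28)] for _ in range(28)]
# Assumed untrained weights: one row of 784 weights per class.
weights = [[random.uniform(-0.01, 0.01) for _ in range(28 * 28)] for _ in labels]
biases = [0.0 for _ in labels]

label, scores = predict(image, weights, biases, labels)
print(len(flatten(image)), "inputs ->", label)
```

In a trained network the weights would come from backpropagation rather than from a random number generator, and there would usually be hidden layers with activation functions between the input and the scores.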
Example: If the input is an image of a cat, the network will process the stretched pixel values, and ideally, the "cat score" will be the highest among the output scores, indicating that the network has classified the image as a cat.
This method of input processing is straightforward and can be effective for simple tasks, but for more complex tasks or larger images, more advanced architectures like convolutional neural networks (CNNs) are typically used to better capture spatial relationships in the data.

Slide 48
Here is an example of underfitting and overfitting in regression. When the predictor is too simple or rigid, it fails to capture the underlying pattern in the data, leading to underfitting. This results in poor model performance and inaccurate predictions. On the other hand, when the predictor is too flexible, it captures not only the true pattern but also the noise in the data, leading to overfitting. This causes the model to perform well on training data but poorly on new, unseen data due to its excessive sensitivity to minor fluctuations.

Slide 49
To detect underfitting and overfitting during training using the test error, the training error, and a stopper, you can follow these steps:
1. Monitor Training and Test Error:
   Underfitting: If both the training error and test error are high and do not decrease significantly during training, this indicates underfitting. The model is too simple to capture the underlying patterns in the data.
   Overfitting: If the training error continues to decrease while the test error starts to increase or stabilizes, this suggests overfitting. The model is learning the noise and specific details in the training data that do not generalize to new data.
   In practice, the held-out error monitored during training is usually measured on a separate validation set, so that the test set stays untouched for the final evaluation.
2. Use a Stopper (Early Stopping):
   Early Stopping: Implement early stopping to halt training when the test error starts to increase while the training error keeps decreasing. This indicates the point where the model begins to overfit.
   By stopping training at this point, you can prevent overfitting and keep a model that generalizes better to unseen data.
3. Visualize the Errors:
   Plot the Training and Test Error: Create a plot with the number of training epochs on the x-axis and error on the y-axis. As training progresses, observe the behavior of the training and test error curves. Underfitting is evident when both errors are high, and overfitting is indicated when the test error starts increasing after a certain point, even as the training error continues to decrease.
4. Adjust Model Complexity:
   If underfitting is detected, consider increasing the complexity of the model (e.g., adding more layers or neurons). If overfitting is detected, consider using regularization techniques, reducing model complexity, or adding more training data.
By carefully monitoring these metrics and using early stopping, you can balance the trade-off between underfitting and overfitting, leading to a model that performs well on both training and test data.

Dropout is a regularization technique used to prevent overfitting in neural networks, especially in deep learning models. It works by randomly "dropping out" or deactivating a subset of neurons on each iteration of the training process.
How Dropout Works:
During Training: At each training step, a fraction of the neurons in a layer (set by the dropout rate; e.g., 0.2 means 20% of neurons) is randomly selected and ignored: their outputs are set to zero, so they take no part in that step's forward pass or weight update. This forces the network to not rely too heavily on any one neuron or set of neurons, which promotes robustness and helps the model generalize better to new data.
During Inference (Testing/Validation): Dropout is turned off, and all neurons are used. To keep the expected output consistent with training, the activations are scaled by the keep probability (1 - dropout rate) at inference. Equivalently, and more commonly in current frameworks, the surviving activations are scaled up by 1/(1 - dropout rate) during training ("inverted dropout"), so nothing needs rescaling at inference.
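A minimal sketch of dropout in plain Python may help here. It assumes the "inverted" variant, in which surviving activations are scaled up by 1/(1 - rate) during training so that inference needs no rescaling; this is an alternative to rescaling at inference time. The function name `dropout` and its parameters are hypothetical, not from the slides.

```python
import random


def dropout(activations, rate, training, rng=random):
    """Inverted dropout on a list of activations.

    During training each activation is zeroed with probability `rate`,
    and the survivors are divided by the keep probability (1 - rate) so
    the expected value of each output matches its input.
    During inference the activations pass through unchanged.
    """
    if not training or rate == 0.0:
        return list(activations)
    keep = 1.0 - rate
    return [a / keep if rng.random() < keep else 0.0 for a in activations]


rng = random.Random(0)
layer_output = [1.0] * 10
print(dropout(layer_output, rate=0.2, training=True, rng=rng))   # some zeros, rest scaled up
print(dropout(layer_output, rate=0.2, training=False, rng=rng))  # unchanged
```

Because a different random subset is zeroed on every call, no single neuron can be relied on, which is the co-adaptation argument made in the text.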
Why Dropout Helps:
Reduces Overfitting: By preventing neurons from co-adapting too much to the training data, dropout reduces the risk of overfitting, making the model less sensitive to specific training examples and better at generalizing to unseen data.
Creates Redundancy: Since different subsets of neurons are used at different times, the network learns a more distributed representation of the data, which increases redundancy and resilience in the model.
Implementation: Dropout is typically applied after fully connected layers in the network but can also be used after convolutional layers. The dropout rate is a hyperparameter that needs to be tuned; common values are between 0.2 and 0.5.
In the context of your training process, adding dropout can be an effective way to mitigate overfitting, especially if you observe the test error increasing after a certain point in training while the training error continues to decrease.

Slide 51
Here is an overview of the Sigmoid and ReLU activation functions and why you might switch from one to the other:
Sigmoid Activation Function:
• Formula: $\sigma(x) = \frac{1}{1 + e^{-x}}$
• Output Range: (0, 1)
• Characteristics:
  o The Sigmoid function maps input values into a range between 0 and 1.
  o It is often used in the output layer for binary classification problems.
  o Vanishing Gradient Problem: In deep networks, gradients of the sigmoid function can become very small during backpropagation, making it difficult to update the weights, especially in earlier layers.
  o Saturated Neurons: For large positive or negative inputs, the gradient approaches zero, leading to slow learning.
ReLU (Rectified Linear Unit) Activation Function:
• Formula: $\text{ReLU}(x) = \max(0, x)$
• Output Range: [0, ∞)
• Characteristics:
  o ReLU is simple and computationally efficient.
  o It introduces non-linearity while keeping the output for positive inputs as they are and zeroing out negative inputs.
  o Mitigates the Vanishing Gradient Problem: Unlike Sigmoid, ReLU doesn't saturate in the positive domain, allowing for more efficient learning in deep networks.
  o Dying ReLU Problem: Neurons can "die" during training if they end up outputting zero for every input, after which they stop receiving gradient updates; this can often be mitigated with proper initialization and learning rates.
Why Transition from Sigmoid to ReLU:
• Faster Convergence: ReLU typically leads to faster training convergence due to its ability to propagate gradients more effectively.
• Better Performance: Networks with ReLU activation functions often perform better in practice, particularly in deep networks, because ReLU helps mitigate the vanishing gradient problem.
• Non-Saturating Behavior: ReLU does not saturate for positive inputs, which helps keep the gradients flowing.
How to Implement the Change:
• Replace Sigmoid with ReLU: Simply change the activation function in your hidden layers from Sigmoid to ReLU.
• Adjust Learning Rate: After switching to ReLU, you might need to adjust your learning rate, as ReLU can sometimes require a different learning rate for optimal performance.
• Monitor Performance: Keep an eye on training and validation metrics to ensure that the transition is improving performance as expected.
In summary, transitioning from Sigmoid to ReLU is generally done to improve the efficiency and effectiveness of training deep neural networks, especially when dealing with large-scale data or deep architectures.
Example in Code (for a framework like TensorFlow/Keras):

    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import Dense

    # Before: hidden layer with Sigmoid
    model = Sequential()
    model.add(Dense(units=128, activation='sigmoid'))

    # After: the same hidden layer with ReLU
    model = Sequential()
    model.add(Dense(units=128, activation='relu'))

Switching to ReLU is generally advisable for hidden layers in deep networks, as it helps mitigate issues with gradients and can lead to faster and more effective training.
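To make the vanishing-gradient argument concrete, the sketch below compares the local gradients of Sigmoid and ReLU in plain Python. The helper `chained_gradient` is hypothetical and a deliberate simplification: it multiplies the same local gradient once per layer and ignores the weights, so it only illustrates the trend, not a real backward pass.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_grad(x):
    """Derivative of the sigmoid: s(x) * (1 - s(x)), never larger than 0.25."""
    s = sigmoid(x)
    return s * (1.0 - s)


def relu(x):
    return max(0.0, x)


def relu_grad(x):
    """Derivative of ReLU: 1 for positive inputs, 0 otherwise."""
    return 1.0 if x > 0 else 0.0


def chained_gradient(grad_fn, x, depth):
    """Product of `depth` identical local gradients (weights ignored)."""
    return grad_fn(x) ** depth


for depth in (1, 5, 10):
    print(
        depth,
        chained_gradient(sigmoid_grad, 0.0, depth),  # shrinks geometrically
        chained_gradient(relu_grad, 2.0, depth),     # stays at 1.0
    )
```

Even at x = 0, where the sigmoid gradient is at its maximum of 0.25, ten stacked layers shrink the signal by a factor of 0.25 to the power 10, which is below one in a million. For a positive input, the ReLU factor stays at 1.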
Slide 52
Activation functions play a critical role in neural networks by introducing non-linearity, allowing the network to learn complex patterns. Here are the main types of activation functions:
1. Sigmoid
• Formula: $\sigma(x) = \frac{1}{1 + e^{-x}}$
• Output Range: (0, 1)
• Pros: Useful for binary classification tasks, smooth gradient.
• Cons: Prone to the vanishing gradient problem, saturates for large positive/negative inputs.
• Use Case: Often used in the output layer of binary classification models.
2. Tanh (Hyperbolic Tangent)
• Formula: $\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$
• Output Range: (-1, 1)
• Pros: Zero-centered output, which helps with convergence.
• Cons: Like Sigmoid, it can suffer from the vanishing gradient problem.
• Use Case: Sometimes used in hidden layers of neural networks.
3. ReLU (Rectified Linear Unit)
• Formula: $\text{ReLU}(x) = \max(0, x)$
• Output Range: [0, ∞)
• Pros: Efficient computation, mitigates the vanishing gradient problem, accelerates convergence.
• Cons: "Dying ReLU" problem, where neurons can become inactive and only output zero.
• Use Case: Commonly used in hidden layers of deep networks.
4. Leaky ReLU
• Formula: $\text{Leaky ReLU}(x) = \max(0.01x, x)$
• Output Range: (-∞, ∞)
• Pros: Addresses the "dying ReLU" problem by allowing a small, non-zero gradient for negative inputs.
• Cons: The slope for negative inputs is a hyperparameter that must be chosen.
• Use Case: Used as an alternative to ReLU when experiencing dying neurons.
5. ELU (Exponential Linear Unit)
• Formula: $\text{ELU}(x) = x$ if $x > 0$, and $\text{ELU}(x) = \alpha(e^{x} - 1)$ if $x \leq 0$
• Output Range: (-α, ∞)
• Pros: Similar benefits to ReLU, with better performance for negative inputs, which helps in faster and more accurate learning.
• Cons: More computationally expensive than ReLU.
• Use Case: Used when a more robust performance is needed compared to ReLU.
6. Swish
• Formula: $\text{Swish}(x) = x \cdot \sigma(x)$, where $\sigma(x)$ is the sigmoid function.
• Output Range: approximately [-0.28, ∞); the function dips slightly below zero for negative inputs and is unbounded above.
• Pros: Self-gated, allows for a smoother and non-monotonic activation which can improve performance on certain tasks.
• Cons: More computationally expensive than ReLU.
• Use Case: Newer models, especially those requiring high accuracy.
7. Softmax
• Formula: $\text{Softmax}(x_i) = \frac{e^{x_i}}{\sum_{j=1}^{K} e^{x_j}}$
• Output Range: (0, 1) for each output, and the outputs sum to 1.
• Pros: Converts logits into probabilities, ideal for multi-class classification.
• Cons: Can be computationally expensive with large output spaces.
• Use Case: Used in the output layer of multi-class classification models.
8. Linear
• Formula: $f(x) = x$
• Output Range: (-∞, ∞)
• Pros: Simple, used when the output can take any value.
• Cons: No non-linearity, so it doesn't allow the network to capture complex patterns.
• Use Case: Used in regression tasks where the output is a continuous value.
Each activation function has its own strengths and weaknesses, and the choice of activation function can significantly affect the performance of a neural network. The most common practice is to use ReLU in hidden layers and specific functions like Sigmoid or Softmax in the output layer depending on the problem type (binary or multi-class classification).

Slide 53
In the context of machine learning, particularly classification tasks, performance measures and loss functions are critical for evaluating and optimizing models. Here is an explanation of each concept:
1. Performance Measure:
• Objective: The goal of a classifier is to maximize performance, typically measured by metrics like accuracy, precision, recall, F1-score, or AUC-ROC.
  However, these measures do not consider the cost of mistakes directly.
• Cost-sensitive Evaluation: In many real-world scenarios, the cost of different types of errors (false positives vs. false negatives) can vary significantly. Performance measures can be adjusted to account for these costs, such as through cost-sensitive learning or custom performance metrics.
2. Loss Function:
• Definition: A loss function quantifies the error between the predicted value and the actual value. The goal during training is to minimize this loss, thereby improving the model's predictions.
• Role in Optimization: The loss function guides the optimization process, helping to adjust model parameters to reduce errors and improve predictions.
3. Examples of Loss Functions:
• Misclassification Error:
  o Definition: It is a simple loss function that measures the proportion of incorrect predictions made by the classifier.
  o Formula: $\text{Error} = \frac{1}{N} \sum_{i=1}^{N} \mathbf{1}(y_i \neq \hat{y}_i)$
  o Use Case: It is commonly used in classification tasks for evaluating the final model's accuracy but is not differentiable, so it is not used during the training process.
• Hinge Loss:
  o Definition: Hinge loss is used primarily for "maximum-margin" classification, most notably for Support Vector Machines (SVMs).
  o Formula: $\text{Hinge Loss} = \max(0, 1 - y \cdot f(x))$, with labels $y \in \{-1, +1\}$ and raw classifier score $f(x)$.
  o Use Case: It is used when the goal is to maximize the margin between classes. Besides misclassified points, it also penalizes predictions that are correct but not confident enough, i.e., those close to the decision boundary.
• Logistic Loss (Cross-Entropy Loss):
  o Definition: Logistic loss, also known as cross-entropy loss, is used for binary classification. It measures the performance of a classification model whose output is a probability value between 0 and 1.
  o Formula: $\text{Logistic Loss} = -\left(y \cdot \log(\hat{y}) + (1 - y) \cdot \log(1 - \hat{y})\right)$
  o Use Case: Commonly used in logistic regression and neural networks. It is differentiable, making it suitable for training models with gradient-based optimization.
4. Misclassification Error:
• Definition: The misclassification error is simply the proportion of incorrect predictions out of the total number of predictions.
• Calculation: $\text{Misclassification Error} = 1 - \text{Accuracy}$
• Use Case: It is a straightforward metric but doesn't provide a nuanced view of the model's performance, especially in cases of imbalanced datasets.
5. Hinge Loss:
• Explanation: Hinge loss is specifically used in the context of SVMs, where the goal is not just to classify correctly but to classify with a certain margin. Hinge loss penalizes not only points that are misclassified but also those that are correctly classified but close to the decision boundary.
6. Logistic Loss (Log-Loss or Cross-Entropy Loss):
• Explanation: Logistic loss is more commonly used in models where the output is probabilistic (e.g., logistic regression, neural networks). It penalizes predictions more harshly the further the predicted probability deviates from the actual class label.
• Interpretation: Log-loss is particularly useful because it accounts for the confidence of predictions, providing a more granular measure of model performance compared to simple accuracy.
Summary:
• Performance measures are the metrics used to evaluate the overall effectiveness of the model.
• Loss functions guide the optimization during training by quantifying the cost of mistakes.
• Misclassification Error, Hinge Loss, and Logistic Loss are different types of loss functions used depending on the model and the problem.
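The three losses above can be written out directly in plain Python. This is a sketch under stated assumptions: hinge loss takes labels in {-1, +1} and a raw score f(x), logistic loss takes labels in {0, 1} and a predicted probability, and the `eps` clipping is an added safeguard against log(0) that is not part of the formula in the notes. The function names are illustrative.

```python
import math


def misclassification_error(y_true, y_pred):
    """Proportion of predictions that differ from the true labels."""
    return sum(t != p for t, p in zip(y_true, y_pred)) / len(y_true)


def hinge_loss(y, score):
    """max(0, 1 - y * f(x)) for a label y in {-1, +1} and raw score f(x)."""
    return max(0.0, 1.0 - y * score)


def logistic_loss(y, p, eps=1e-12):
    """Binary cross-entropy for a label y in {0, 1} and probability p."""
    p = min(max(p, eps), 1.0 - eps)  # assumed safeguard: avoid log(0)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))


print(misclassification_error([1, 0, 1, 1], [1, 1, 1, 0]))  # two of four wrong
print(hinge_loss(+1, 2.0))   # correct and outside the margin: no penalty
print(hinge_loss(+1, 0.5))   # correct but inside the margin: small penalty
print(hinge_loss(-1, 0.5))   # wrong side of the boundary: larger penalty
print(logistic_loss(1, 0.9), logistic_loss(1, 0.6), logistic_loss(1, 0.1))
```

The hinge examples show the margin behaviour described in the text, and the logistic examples show the loss growing as the predicted probability moves away from the true label.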
In practice, the choice of performance measure and loss function depends on the specific problem and the nature of the data, with considerations of whether errors have different costs and the need for differentiable functions to guide learning.

Training Error vs. Testing Error:
Training Error:
• Definition: Training error is the error rate of a machine learning model on the same dataset that was used to train the model. It measures how well the model has learned to predict the target variable for the training data.
• Calculation: It is typically calculated by applying the model to the training data and comparing the predicted outputs to the actual outputs.
• Implication: A small training error indicates that the model has learned the patterns in the training data well. However, it does not guarantee good performance on unseen data (testing data).
Testing Error:
• Definition: Testing error is the error rate of a machine learning model on a separate dataset that was not used during training (called the test set). It measures the model's ability to generalize to new, unseen data.
• Calculation: It is calculated by applying the trained model to the test set and comparing the predicted outputs to the actual outputs.
• Implication: A low testing error indicates that the model generalizes well and can make accurate predictions on new data.
Key Concepts:
• Minimizing Testing Error: The ultimate goal in training a machine learning model is not just to minimize the training error but to minimize the testing error. This is because a model that performs well on the training data but poorly on new, unseen data is of little practical use.
• Overfitting:
  o Definition: Overfitting occurs when a model learns the training data too well, including its noise and outliers. This results in a model that has very low training error but fails to generalize to new data, leading to high testing error.
  o Indication: A clear sign of overfitting is when the training error continues to decrease with more training, but the testing error starts to increase or remains high.
• Smaller Training Error ≠ Smaller Testing Error:
  o A smaller training error does not necessarily imply a smaller testing error. In fact, if the training error is significantly lower than the testing error, it suggests that the model might be overfitting.
  o The key is to find a balance where the model has a low enough training error without sacrificing its ability to perform well on unseen data (indicated by a low testing error).
Practical Example:
1. Model Training: You train a model on your training data and monitor both the training error and testing error.
2. Observation:
  o If both errors are high, the model may be underfitting.
  o If the training error is low but the testing error is high, the model is likely overfitting.
  o If both the training error and testing error are low, the model is well-generalized.
3. Goal: The goal is to minimize the testing error, which indicates that the model will perform well on new data, even though the training error might be slightly higher.
In summary, while a low training error indicates that the model has learned the patterns in the training data, it is the testing error that ultimately determines the model's effectiveness in real-world applications. The challenge in machine learning is to train a model that minimizes testing error, avoiding both underfitting and overfitting.

Generalization Error (or generation error) is a measure of how well a machine learning model performs on unseen data, or how well it generalizes from the training data to new, unseen data. It is also often described in terms of the gap between the error on the training data and the expected error on new data.
Key Points about Generalization Error:
1. Definition:
  o Generalization error is the error that occurs when a model is applied to new data that it has never seen before.
    It reflects the model's ability to generalize from the training data to the testing or real-world data.
2. Calculation:
  o It is typically estimated by evaluating the model on a test set that was not used during training.
  o Formally, the size of the generalization gap can be expressed as the difference between the expected (or average) loss on the test set and the loss on the training set.
3. Importance:
  o A model with low generalization error is considered to be well-generalized, meaning it performs well not just on the training data but also on new data.
  o Minimizing generalization error is the ultimate goal in building machine learning models because it indicates that the model can make accurate predictions in real-world scenarios.
4. Overfitting and Underfitting:
  o Overfitting: When a model has a very low training error but a high generalization error, it indicates overfitting. The model has learned the training data too well, including the noise, and does not perform well on new data.
  o Underfitting: When both the training error and the generalization error are high, the model is underfitting. It is too simple to capture the underlying patterns in the data.
5. Strategies to Minimize Generalization Error:
  o Regularization: Techniques like L1 and L2 regularization add a penalty for large weights in the model, helping to prevent overfitting.
  o Cross-Validation: Techniques like k-fold cross-validation help to check that the model generalizes well by testing it on multiple subsets of the data.
  o Dropout: In neural networks, dropout randomly deactivates neurons during training, which helps the model to generalize better.
  o Simplifying the Model: Reducing the complexity of the model (e.g., fewer layers, fewer parameters) can help in reducing overfitting.
  o More Data: Providing the model with more diverse training data can improve its ability to generalize.
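Of the strategies listed above, k-fold cross-validation is the easiest to sketch in a few lines of plain Python. The data are split into k folds, each fold serves once as the test set while the other k-1 folds are used for training, and the fold errors are averaged. Everything here is illustrative: `k_fold_splits` and `cross_validate` are hypothetical helpers, the "model" is a toy that just predicts the mean of the training targets, and the data are not shuffled, which real use would normally do first.

```python
def k_fold_splits(n_samples, k):
    """Yield (train_indices, test_indices) once for each of the k folds."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size


def cross_validate(xs, ys, k, fit, error):
    """Train on k-1 folds, test on the held-out fold, and average the errors."""
    fold_errors = []
    for train_idx, test_idx in k_fold_splits(len(xs), k):
        model = fit([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        fold_errors.append(
            error(model, [xs[i] for i in test_idx], [ys[i] for i in test_idx])
        )
    return sum(fold_errors) / k, fold_errors


# Assumed toy model: always predict the mean of the training targets.
def fit_mean(xs, ys):
    return sum(ys) / len(ys)


# Assumed error metric: mean squared error of that constant prediction.
def mse(model, xs, ys):
    return sum((y - model) ** 2 for y in ys) / len(ys)


xs = list(range(10))
ys = [2.0 * x for x in xs]
avg_error, fold_errors = cross_validate(xs, ys, k=5, fit=fit_mean, error=mse)
print(avg_error, fold_errors)
```

The averaged value is the cross-validation estimate of the generalization error. Swapping in a real model only requires different `fit` and `error` functions.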

Example: Suppose you have a model that achieves 95% accuracy on the training set but only 70% accuracy on the test set. The large drop in performance indicates a high generalization error, suggesting that the model may be overfitting to the training data.

In conclusion, generalization error is a critical concept in machine learning because it reflects how well a model is likely to perform in real-world applications. Minimizing generalization error is essential for creating robust models that can reliably make accurate predictions on new, unseen data.

Cross-Validation for Generalization Evaluation

Cross-validation is a robust technique for evaluating how well a machine learning model generalizes to an independent dataset. The most commonly used form is k-fold cross-validation. Here's how it works:

Steps for k-Fold Cross-Validation:
1. Split the Dataset:
   o Divide the original dataset into k roughly equal-sized subsets (folds). Typically, k is chosen as 5 or 10, but it can vary depending on the size of the dataset.
2. Training and Testing:
   o For each of the k iterations:
      • Retain one of the k folds as the test set.
      • Use the remaining k-1 folds as the training set to train the model.
      • Evaluate the model's performance on the test set.
3. Repeat for k Runs:
   o Repeat this process k times, each time with a different fold as the test set and the remaining k-1 folds as the training set.
4. Calculate Average Error Rate:
   o After all k runs, calculate the average error rate (or any other performance metric) across all k test sets. This average error rate provides an estimate of the model's generalization error.

Benefits of k-Fold Cross-Validation:
• Better Generalization Estimate: It provides a more reliable estimate of the model's performance on unseen data, as every data point gets to be in the test set exactly once and in the training set k-1 times.
• Efficiency: It makes efficient use of limited data by ensuring that every observation is used for both training and validation.
• Balancing Bias and Variance: Averaging over k folds gives a more stable (lower-variance) estimate of model performance than a single train/test split, while each model is still trained on most of the data.

Formula: If E_i is the error metric (e.g., the error rate, or 1 - accuracy) calculated for the i-th fold, the average error rate across all folds is given by:

\[ \text{Average Error Rate} = \frac{1}{k} \sum_{i=1}^{k} E_i \]

Where:
• k is the number of folds.
• E_i is the error metric for the i-th fold.

Example:
• Suppose you use 5-fold cross-validation (k=5).
• You split your dataset into 5 equal parts (folds).
• In the first run, you use the 1st fold as the test set and the remaining 4 folds for training. You then compute the error rate on the 1st fold.
• In the second run, you use the 2nd fold as the test set and the remaining 4 folds for training, and so on.
• After 5 runs, you average the error rates from all 5 folds to get the final estimate of your model's generalization error.

Conclusion: k-Fold cross-validation is a powerful technique for evaluating a model's ability to generalize. By averaging the performance across multiple folds, it provides a more reliable estimate of how the model will perform on unseen data, which helps to detect whether the model is overfitting or underfitting the training data.

A learning classifier is a type of machine learning model that is trained to categorize data into predefined classes or categories. The primary goal of a learning classifier is to learn from labeled training data so that it can accurately predict the class of new, unseen data points.

Slide 56
Key Components and Steps in Building a Learning Classifier:
1. Dataset:
   o Training Data: A set of examples where each instance has features (input variables) and a corresponding label (output class).
     This labeled data is used to train the classifier.
   o Test Data: A separate set of examples used to evaluate the classifier's performance after training. It helps assess how well the classifier generalizes to new data.
2. Feature Selection:
   o The process of identifying the most relevant features (variables) in the dataset that will be used as inputs for the classifier. Good feature selection can improve the accuracy and efficiency of the model.
3. Model Selection:
   o Algorithm Choice: Selecting a suitable learning algorithm based on the nature of the problem and data. Common algorithms include:
      • Linear Models: Logistic regression, Linear Discriminant Analysis (LDA)
      • Tree-Based Models: Decision trees, Random Forest, Gradient Boosting
      • Instance-Based: k-Nearest Neighbors (k-NN)
      • Kernel-Based: Support Vector Machines (SVM)
      • Neural Networks: Multi-layer perceptrons, Convolutional Neural Networks (CNNs) for image data
   o The choice of algorithm depends on factors like data size, number of features, linearity, and computational resources.
4. Training the Classifier:
   o The classifier is trained by feeding it the training data, where it learns to map the input features to the corresponding output labels.
   o During training, the model adjusts its parameters (weights) to minimize the error or loss function, which measures the difference between predicted and actual labels.
   o Optimization Techniques: Gradient descent, backpropagation (in neural networks), or specific tree-growing algorithms (in decision trees).
5. Validation:
   o Cross-Validation: Techniques like k-fold cross-validation are used to evaluate the model's performance during training and to tune hyperparameters (settings that control the learning process).
   o Hyperparameter Tuning: Adjusting parameters such as learning rate, depth of trees, or number of neighbors in k-NN to optimize the model's performance.
6. Testing and Evaluation:
   o After training, the classifier's performance is evaluated on the test data.
   o Common evaluation metrics include:
      • Accuracy: The percentage of correctly predicted labels.
      • Precision, Recall, and F1-Score: Metrics that provide more detailed insights, especially for imbalanced datasets.
      • ROC-AUC: For binary classifiers, the Area Under the Receiver Operating Characteristic Curve measures the model's ability to discriminate between classes.
7. Prediction:
   o Once trained and evaluated, the classifier can be used to predict the class of new, unseen data points based on the input features provided.
8. Model Deployment:
   o If the classifier performs well, it can be deployed in a real-world application to automatically classify incoming data, such as email filtering, image recognition, medical diagnosis, or customer segmentation.

Example: A spam email classifier might use features such as the frequency of certain keywords, the presence of links, and the sender's email address to predict whether an email is spam or not. After being trained on a labeled dataset of emails (spam and non-spam), the classifier can then be used to automatically filter out spam emails in a user's inbox.

Summary: A learning classifier is a fundamental component of many machine learning applications. It learns from labeled data to distinguish between different classes and is evaluated based on its ability to generalize to new, unseen data. The entire process involves selecting the right model, training it effectively, validating its performance, and finally deploying it for real-world use.
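
The validation and evaluation steps above (k-fold cross-validation, then accuracy, precision, recall and F1) can be sketched in plain Python. This is a minimal sketch under stated assumptions, not the method used in the slides: the function names, the toy "threshold" classifier, and the keyword-count data are all made up for illustration, and the toy trainer assumes every training fold contains both classes.

```python
from typing import Dict, List, Sequence, Tuple


def k_fold_indices(n_samples: int, k: int) -> List[Tuple[List[int], List[int]]]:
    """Return k (train_indices, test_indices) pairs; each sample is tested exactly once."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    folds = [list(range(i, n_samples, k)) for i in range(k)]  # interleaved folds
    splits = []
    for i in range(k):
        train = sorted(j for f, fold in enumerate(folds) if f != i for j in fold)
        splits.append((train, folds[i]))
    return splits


def binary_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Accuracy, precision, recall and F1 for labels in {0, 1}; 1 is the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": (tp + tn) / len(pairs), "precision": precision,
            "recall": recall, "f1": f1}


def train_threshold(xs: Sequence[float], ys: Sequence[int]) -> float:
    """Toy 'training': place the decision threshold midway between the class means."""
    pos = [x for x, y in zip(xs, ys) if y == 1]
    neg = [x for x, y in zip(xs, ys) if y == 0]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2


def cross_validated_error(xs: Sequence[float], ys: Sequence[int], k: int) -> float:
    """Average error rate (1 - accuracy) over the k held-out folds."""
    errors = []
    for train_idx, test_idx in k_fold_indices(len(xs), k):
        threshold = train_threshold([xs[i] for i in train_idx],
                                    [ys[i] for i in train_idx])
        preds = [1 if xs[i] > threshold else 0 for i in test_idx]
        accuracy = binary_metrics([ys[i] for i in test_idx], preds)["accuracy"]
        errors.append(1 - accuracy)
    return sum(errors) / k


# Made-up feature: number of suspicious keywords per email; label 1 = spam.
keyword_counts = [0, 1, 0, 2, 1, 7, 8, 6, 9, 7]
labels = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
print("5-fold average error:", cross_validated_error(keyword_counts, labels, k=5))
print(binary_metrics([1, 1, 1, 0, 0, 0], [1, 1, 0, 1, 0, 0]))
```

The folds are interleaved rather than shuffled so that the sketch stays deterministic; with real data the samples are usually shuffled (and often stratified by class) before splitting.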
