Foundamentals_of_Data_Science.pdf


Fundamentals of Data Science
Andrea Di Vincenzo
October 2024

Contents

1 Computer Vision
  1.1 Image Classification
    1.1.1 Image Classification Steps
    1.1.2 Advantages and Disadvantages of Image Classification
      Disadvantages of Image Classification
  1.2 Object Detection (Identification)
    1.2.1 Object Detection Works
  1.3 Color Recognition
    1.3.1 Advantages of Color Recognition
    1.3.2 Color Histograms
    1.3.3 Joint 3D Color Histograms
    1.3.4 Color Normalization by Intensity
    1.3.5 Recognition Using Histograms
  1.4 Histogram Comparison Technique
    1.4.1 Histogram Comparison: Intersection Method
      Advantages and Disadvantages
    1.4.2 Histogram Comparison: Euclidean Distance
    1.4.3 Histogram Comparison: Chi-square Distance
    1.4.4 Which Measure is Best?
    1.4.5 Recognition using Histograms - Nearest-Neighbor Strategy
  1.5 Performance Evaluation
    1.5.1 Score-Based Evaluation
    1.5.2 Score-Based Evaluation Example
      Confusion Matrix
    1.5.3 Overall Accuracy in Classification
      Limitations of Accuracy
      Alternatives to Accuracy
    1.5.4 Overall Precision in Classification
      Interpretation
      Limitations of Precision
    1.5.5 Recall in Classification
      Importance of Recall
      Trade-off Between Recall and Precision
2 Linear Regression
  2.1 Simple Linear Regression
    2.1.1 Interpretation of Parameters
      Assumptions of Simple Linear Regression
  2.2 Multiple Linear Regression
    2.2.1 Interpretation of Parameters in Multiple Linear Regression
    2.2.2 Assumptions of Multiple Linear Regression

Chapter 1
Computer Vision

1.1 Image Classification

Image classification is the most basic form of computer vision. It assigns a label or class to an entire image based on its content. The primary goal is to categorize an image into one of several predefined classes, which is typically achieved with machine learning algorithms.

1.1.1 Image Classification Steps

1. Preparing Your Data: Preprocessing removes unwanted distortions and enhances important parts of the image, so that computer vision models have better image data to work with.
2. Object Detection: Objects are localized, which involves segmenting each object and determining its position.
3. Identification of Patterns: Deep learning algorithms identify patterns in the image that can be specific to a certain label. The model learns from this labeled data, which improves its accuracy on future images.
4. Division of Observed Things into Predefined Classes: Machine learning algorithms assign the observed objects to predefined classes by comparing the patterns found in the image with the patterns expected for each class.

1.1.2 Advantages and Disadvantages of Image Classification

Advantages of Image Classification

Higher quality products - With accurate image classification, your AI product will be able to perform a variety of tasks connected with image object recognition.

Real-world training - Since your product will need to recognize items in the physical world, it makes sense to train it on real-life images instead of computer-generated ones.

Many practical applications - Image classification products can be used in areas like object identification in satellite images, traffic control systems, brake light detection, machine vision, and more.

Disadvantages of Image Classification

Working with data uncertainties - Regardless of how thoroughly you train an image recognition model, there are cases where the model fails to classify the objects correctly. Most of these failures can be attributed to the following factors.

Occlusion - In some images, the target object may not be entirely visible. Consider a dog hiding in a bush, with its legs and body hidden. Even if the dog's head is plainly visible, the model might not be able to identify it.

Background noise - The ability of the model to correctly identify the object in the image may be impaired by interfering background texture and color. A model would, for instance, have trouble distinguishing a red apple from a similar-colored table. Similarly, it is difficult to single out one vehicle in slow-moving traffic.

1.2 Object Detection (Identification)

Object detection is a computer vision solution that identifies objects and their location in an image.
An object detection system returns the image coordinates of each object it has been trained to identify.

1.2.1 Object Detection Works

Object detection (OD) algorithms usually make use of machine learning and deep learning. There are two main approaches to creating object recognition models:

Creating and training an object detector - To train a custom object detector from scratch, you have to build a network architecture that can learn the characteristics of the objects of interest. A sizable collection of labeled data must also be assembled to train the CNN. A custom object detector can produce excellent results, but configuring the layers and weights of the CNN manually takes a lot of time and training data.

Using a pre-trained object detector - Transfer learning is a technique that lets you start with a pre-trained network and then fine-tune it for your application. It is used in many deep-learning object detection procedures and can yield faster results, because the detector has already been trained on thousands, if not millions, of images.

Convolutional neural networks (CNN): This deep learning architecture is designed specifically for analyzing visual data and extracting features from it.

1.3 Color Recognition

The color of an object can be a useful feature for object recognition.

1.3.1 Advantages of Color Recognition

Color is consistent under geometric transformations: when an object undergoes translation, rotation, or scaling, its color remains unchanged.

Color is a local feature: it is defined at each pixel, which makes it highly localized. It is therefore robust to partial occlusion: even if part of the object is hidden, the color of the visible part still aids recognition.

The direct color usage approach uses the exact color of objects for identification or recognition.
Instead of relying on a single dominant color of the object for identification or recognition, we can use statistics of the object's colors, computed as histograms that capture the distribution of colors within the object. This adds robustness to noise and other variations in appearance.

1.3.2 Color Histograms

Color histograms represent the distribution of colors in an image. Each pixel has a Red, Green, and Blue (RGB) value, and a histogram is computed for each of these color channels: it counts how many pixels in the image have a particular intensity of that color.

Figure 1.1: Histogram of color distribution

Luminance histograms represent the brightness of the image. They measure how many pixels have each level of brightness, independent of color.

These histograms are used as features that describe the color distribution of an object in an image. They can be compared against the histograms of other images to identify similarities or match objects.

1.3.3 Joint 3D Color Histograms

Instead of separate 1D histograms for Red, Green, and Blue, a 3D histogram considers the RGB values together as a vector. This allows a more precise representation of the color combinations present in the image. Each bin in this 3D space corresponds to a combination of Red, Green, and Blue, and the count in the bin is the number of pixels with that color combination.

This representation makes it easier to compute the similarity of two images: comparing two 3D histograms shows whether two objects have a similar color composition. Like the 1D case, it is a robust representation because it works even if the objects in the image are rotated, partially occluded, or viewed under different lighting conditions.

Figure 1.2: Histogram of 3D color distribution
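The per-channel and joint histograms described above can be sketched in a few lines. This is a minimal illustration, not the method used in the notes: it assumes NumPy is available, 8-bit RGB images stored as arrays of shape (height, width, 3), and 8 bins per channel. The function names and the tiny test image are hypothetical.

```python
import numpy as np

def channel_histograms(image, bins=8):
    """Return one normalized 1D histogram per RGB channel (shape: 3 x bins)."""
    hists = []
    for c in range(3):
        values = image[..., c].astype(float).ravel()
        counts, _ = np.histogram(values, bins=bins, range=(0, 256))
        hists.append(counts / counts.sum())
    return np.stack(hists)

def joint_histogram(image, bins=8):
    """Return a normalized joint 3D RGB histogram (shape: bins x bins x bins)."""
    pixels = image.reshape(-1, 3).astype(float)  # one (R, G, B) row per pixel
    counts, _ = np.histogramdd(pixels, bins=(bins, bins, bins),
                               range=((0, 256),) * 3)
    return counts / counts.sum()

# Hypothetical 2x2 image: two pure-red pixels, one pure-green, one pure-blue.
image = np.array([[[255, 0, 0], [255, 0, 0]],
                  [[0, 255, 0], [0, 0, 255]]], dtype=np.uint8)

per_channel = channel_histograms(image)  # three separate 1D histograms
joint = joint_histogram(image)           # one 3D histogram over (R, G, B)
```

In this toy image, half of the mass of the joint histogram falls in the bin for "high red, low green, low blue". The per-channel histograms only record that half of the pixels have a high red value; they do not record which green and blue values those pixels have.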
1.3.4 Color Normalization by Intensity

When dealing with color images, each pixel's color is typically represented by its Red, Green, and Blue (RGB) components. However, the intensity of a color can vary due to changes in lighting or shading: even if the colors are the same, varying intensity can make them appear different. We can handle this with normalization.

Intensity of a Pixel

The total brightness of each pixel is defined as:

\[ I = R + G + B \]

The chromatic representation normalizes the color of each pixel by dividing each color component (R, G, B) by the intensity I. This transformation removes the effect of varying brightness or illumination, making the color representation consistent across different lighting conditions.

Figure 1.3: Normalized Colors

Using Normalization

After normalization the three components sum to 1, so if two of them are known, for example R and G, the third can be calculated as:

\[ B = 1 - R - G \]

The color can therefore be fully described using just two values. The cube represents the range of possible RGB values, with one axis per color channel. The constraint R + G + B = 1 implies that the normalized colors lie on a plane within this cube, forming a 2D space.

In image recognition, it is important to have a color representation that is invariant to lighting changes. The chromatic representation allows systems to focus on the actual color of the object rather than being influenced by lighting conditions.

Figure 1.4: RGB Cube

1.3.5 Recognition Using Histograms

This is a method for identifying objects based on their color distributions.

1. Histogram Comparison Step: A histogram representing the color distribution of a "test image" is compared to histograms from a database of "known objects". The object whose histogram most closely resembles that of the test image is identified as the best match.
2. Multiple Views per Object: Since an object can appear in different orientations and lighting conditions, the database stores multiple views of each object, each with its own histogram. This increases recognition accuracy, because the test image's histogram can be compared with histograms from different angles or views of the same object.
3. Histogram-Based Retrieval: In the example below, a "query" object is given (e.g., a yellow cat figurine), and its color histogram is used to retrieve similar objects from the database. The system retrieves objects whose color histograms closely match that of the query, such as yellow cars or other yellow objects (e.g., a nuclear waste barrel). This shows that histograms identify objects by color similarity, even when the objects belong to different categories but share similar color profiles.

Figure 1.5: Recognition using histograms

1.4 Histogram Comparison Technique

Histogram comparison is a method for measuring the similarity or dissimilarity between two histograms. A histogram represents the distribution of certain features (such as pixel intensity, color, or texture) within an image or dataset. In image processing and computer vision, histograms are often used to represent the distribution of colors, brightness, or other properties of an image.

1.4.1 Histogram Comparison: Intersection Method

Histogram comparison is a fundamental method in image analysis and computer vision for determining how similar two histograms are. One commonly used metric is Histogram Intersection, which measures the common parts of two histograms. It is particularly useful in tasks such as object recognition, color analysis, and retrieval of images from databases based on their content.
Histogram Intersection Formula

The histogram intersection method calculates the similarity between two histograms Q and V by summing the minimum values of each pair of corresponding bins:

\[ \cap(Q, V) = \sum_i \min(q_i, v_i) \]

Where:
- $Q = [q_1, q_2, \ldots, q_n]$ is the histogram of the first image.
- $V = [v_1, v_2, \ldots, v_n]$ is the histogram of the second image.
- $q_i$ and $v_i$ are the values of the i-th bin in histograms Q and V, respectively.
- $\min(q_i, v_i)$ returns the smaller of the two corresponding bin values.
- The sum is over all n bins.

Motivation

Histogram intersection has the following motivations and properties:

- Measures the Common Parts: The method directly measures the overlap between the histograms. The more similar two histograms are, the greater their intersection.
- Range: For normalized histograms, the result lies in the range [0, 1].
  – A value of 1 means the histograms are identical.
  – A value of 0 indicates no overlap between the histograms.
- Unnormalized Histograms: For unnormalized histograms, the intersection is scaled by the sum of the histogram values:

\[ \cap(Q, V) = \frac{1}{2} \left( \frac{\sum_i \min(q_i, v_i)}{\sum_i q_i} + \frac{\sum_i \min(q_i, v_i)}{\sum_i v_i} \right) \]

  This normalization ensures that the comparison is not biased by differences in the overall number of elements in the histograms.

Advantages and Disadvantages

Advantages:
  – Easy to implement and computationally efficient.
  – Measures commonality between histograms, which is useful in applications where overlapping features are important.

Disadvantages:
  – Less sensitive to subtle differences between histograms.
  – It only focuses on common parts and ignores differences, which might lead to poor results in discriminative tasks.
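The two intersection formulas above translate directly into code. The sketch below assumes NumPy and histograms given as 1D arrays of equal length; the function names and example values are illustrative only.

```python
import numpy as np

def intersection(q, v):
    """Histogram intersection: sum of bin-wise minima (for normalized inputs)."""
    q = np.asarray(q, dtype=float)
    v = np.asarray(v, dtype=float)
    return np.minimum(q, v).sum()

def intersection_unnormalized(q, v):
    """Intersection for raw counts, scaled by each histogram's total mass."""
    q = np.asarray(q, dtype=float)
    v = np.asarray(v, dtype=float)
    common = np.minimum(q, v).sum()
    return 0.5 * (common / q.sum() + common / v.sum())

# Two normalized histograms with three bins each (made-up values).
q = np.array([0.5, 0.3, 0.2])
v = np.array([0.2, 0.3, 0.5])
score = intersection(q, v)          # 0.2 + 0.3 + 0.2

# The same shapes expressed as raw pixel counts with equal totals.
q_counts = np.array([5, 3, 2])
v_counts = np.array([2, 3, 5])
score_counts = intersection_unnormalized(q_counts, v_counts)
```

When both histograms have the same total count, as in this example, the scaled formula gives the same value as the intersection of the normalized histograms.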
Applications

The histogram intersection method is widely used in tasks such as:

- Object Recognition: Comparing the color histograms of objects for recognition purposes.
- Image Retrieval: Finding images in a database based on similar content.
- Color-Based Matching: Comparing the color distributions of different images, which is useful in pattern recognition.

The histogram intersection method is a simple yet effective technique for comparing histograms. Its focus on the commonality between two histograms makes it suitable for tasks such as object recognition and image retrieval, where the overlap between histograms represents similarity. However, its inability to highlight differences between histograms limits its use in more discriminative tasks.

Figure 1.6: histograms Intersection

1.4.2 Histogram Comparison: Euclidean Distance

The Euclidean distance is a common metric for measuring the difference between two histograms. It is the straight-line distance between two points in a multidimensional space, where each bin of the histogram is one dimension. For histograms, it is computed from the sum of squared differences between corresponding bins.

Euclidean Distance Formula

The Euclidean distance between two histograms Q and V is given by:

\[ d(Q, V) = \sqrt{\sum_i (q_i - v_i)^2} \]

Where:
- Q and V are the two histograms being compared.
- $q_i$ and $v_i$ are the values in the i-th bin of histograms Q and V, respectively.

In what follows, the square root is omitted, and the formula is written as:

\[ d(Q, V) = \sum_i (q_i - v_i)^2 \]

Motivation

The Euclidean distance focuses on the absolute differences between corresponding bins in two histograms. It has the following characteristics:

- Focus on Differences: It highlights how different the two histograms are by directly measuring the squared difference between corresponding bins.
- Range: The Euclidean distance has a range of [0, ∞):
  – A distance of 0 means that the two histograms are identical.
  – As the difference between the histograms increases, the Euclidean distance grows larger.
- Equal Weighting: Each bin contributes equally to the total distance, so the metric does not prioritize certain parts of the histogram over others. This can be a disadvantage when certain regions of the histogram are more important than others.
- Not Very Discriminant: While the Euclidean distance is simple to compute, it is not very discriminant for histogram comparison. Small differences in some bins may be overshadowed by large differences in others, making it less effective at distinguishing between similar but slightly different histograms.

Advantages and Disadvantages

Advantages:
  – Simple and computationally efficient to calculate.
  – Effective when the histograms are well-aligned and the differences are significant.

Disadvantages:
  – Sensitive to noise and small variations in the histogram, which may result in large differences even for slightly different histograms.
  – Treats all differences equally, which may not be appropriate when some features of the histogram are more important than others.
  – Not invariant to transformations such as scaling or translation of the histogram values.

Applications

The Euclidean distance is commonly used in applications where a straightforward, direct comparison of the differences between two distributions is sufficient. These include:

- Image Retrieval: Comparing feature histograms of images to find similar images in a database.
- Object Detection: Detecting objects based on their color or texture histograms in different frames or images.

Figure 1.7: histograms Euclidian Distance

The Euclidean distance provides a simple and effective method for comparing histograms, particularly when we are interested in the overall difference between two distributions.
However, the Euclidean distance has some limitations, especially in terms of discriminative power, which make it less suitable when small differences matter or when robustness to noise is required.

1.4.3 Histogram Comparison: Chi-square Distance

The Chi-square ($\chi^2$) distance is a commonly used metric for comparing two histograms. It has its roots in statistics, where it is used to compare observed distributions to expected distributions. In histogram comparison, it measures how different two histograms are by considering the differences between their bins relative to the sum of the bin values.

Chi-square Distance Formula

The Chi-square distance between two histograms Q and V is given by:

\[ \chi^2(Q, V) = \sum_i \frac{(q_i - v_i)^2}{q_i + v_i} \]

Where:
- Q and V are the two histograms being compared.
- $q_i$ and $v_i$ are the values in the i-th bin of histograms Q and V, respectively.
- The denominator $(q_i + v_i)$ normalizes the difference between $q_i$ and $v_i$, making the distance more robust to larger bin values.

Motivation

The Chi-square distance has several properties that make it useful for histogram comparison:

- Statistical Background: The Chi-square distance is based on the Chi-square test from statistics, which is used to determine whether two distributions differ. This provides a rigorous way to compare histograms, with the possibility of computing a significance score.
  – It tests whether the distributions of two histograms are significantly different from each other.
  – It accounts for the fact that some bins may have higher values than others by normalizing by the sum of the bin values.
- Range: The Chi-square distance is non-negative, with range [0, ∞):
  – A value of $\chi^2(Q, V) = 0$ indicates that the two histograms are identical.
  – As the difference between the histograms increases, the Chi-square distance increases without an upper bound.
- Non-equal weighting of cells: Unlike the Euclidean distance, the Chi-square distance does not weight all bins equally. For the same absolute difference, bins with higher values contribute less to the overall distance than bins with smaller values, which makes the Chi-square distance more sensitive to differences in smaller bins. This property makes it more discriminative in some applications.
  – Cells with higher values are treated as less important than cells with lower values.
- Sensitivity to Outliers: The Chi-square distance can be sensitive to outliers, especially if the bin counts in some regions are very low. To address this, it is often assumed that each bin contains at least a minimum number of samples, which avoids large contributions from small differences.

Advantages and Disadvantages

Advantages:
  – It is more discriminative than simpler metrics such as the Euclidean distance because it emphasizes relative differences between bins.
  – It can be used in applications where the statistical significance of the difference between histograms is important.

Disadvantages:
  – It is sensitive to bins with small values, potentially leading to overemphasis on small differences in sparsely populated bins.
  – It may have problems with outliers or sparse histograms, especially if some bins contain very low values.

Applications

The Chi-square distance is commonly used in applications where it is important to compare the relative proportions of values in different categories. Typical applications include:

- Image Retrieval: Comparing color histograms of images in large image databases.
- Texture Analysis: Comparing texture histograms in image processing.
- Statistical Analysis: Comparing histograms in situations where the statistical distribution of values is of interest.

Figure 1.8: histograms Chi-Square

The Chi-square distance is a useful and statistically motivated method for comparing histograms, particularly when the relative difference between bin values is important.
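The difference in bin weighting between the (squared) Euclidean distance and the Chi-square distance can be seen on a toy example. The sketch below assumes NumPy; the function names and counts are made up for illustration, and skipping bins that are empty in both histograms is an assumption of this sketch (it avoids a 0/0 term), not something stated in the notes.

```python
import numpy as np

def squared_euclidean(q, v):
    """Sum of squared bin differences (Euclidean distance without the root)."""
    q = np.asarray(q, dtype=float)
    v = np.asarray(v, dtype=float)
    return np.sum((q - v) ** 2)

def chi_square(q, v):
    """Chi-square distance: squared differences divided by q_i + v_i."""
    q = np.asarray(q, dtype=float)
    v = np.asarray(v, dtype=float)
    denom = q + v
    mask = denom > 0  # assumption: ignore bins that are empty in both histograms
    return np.sum((q[mask] - v[mask]) ** 2 / denom[mask])

# Reference histogram with one large bin and one small bin (made-up counts).
q = np.array([100, 10])
v_big = np.array([110, 10])    # difference of 10 in the large bin
v_small = np.array([100, 20])  # difference of 10 in the small bin

d_big = squared_euclidean(q, v_big)
d_small = squared_euclidean(q, v_small)
c_big = chi_square(q, v_big)
c_small = chi_square(q, v_small)
```

Both perturbations have the same squared Euclidean distance (100), while the Chi-square distance is about 0.48 for the change in the large bin and about 3.33 for the change in the small bin.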
The Chi-square distance provides a more discriminative measure than simpler metrics, but care must be taken to manage its sensitivity to outliers and sparsely populated bins.

1.4.4 Which Measure is Best?

The best histogram comparison measure depends on the application. The most commonly used measures are:

Intersection
- Robustness: Intersection is generally more robust because it only considers the overlapping parts of the histograms.
- Best use: It works well for comparing similar color distributions, but may be less effective when there are large differences between histograms.

Chi-square ($\chi^2$)
- Discriminative Power: Chi-square is more discriminative than intersection, giving more weight to differences relative to the bin values.
- Best use: It is well suited to distinguishing between histograms that represent complex data or that differ only slightly.

Euclidean Distance
- Not Robust: Euclidean distance is less robust because it gives equal weight to all parts of the histogram, making it sensitive to outliers.
- Best use: It works well when histograms are smooth and similar, without many outliers.

Other Measures

Many other measures exist, depending on the context:

- Kolmogorov-Smirnov test: A statistical test used to compare two distributions and determine whether they differ significantly.
- Kullback-Leibler divergence: An information-theoretic measure that quantifies the difference between two probability distributions.
- Jeffrey Divergence: A symmetrized version of the Kullback-Leibler divergence that handles situations where distributions differ in both directions.

1.4.5 Recognition using Histograms - Nearest-Neighbor Strategy

The recognition process using histograms is based on a simple nearest-neighbor strategy. The algorithm proceeds as follows:

1. Build a set of histograms $H = \{M_1, M_2, M_3, \ldots\}$ for the known objects. More precisely, build one histogram for each view of each object to account for changes in perspective.
2. Build a histogram T for the test image.
3. Compare T to each $M_k \in H$ using a suitable comparison measure.
4. Select the object with the best matching score, or reject the test image if no object is similar enough (i.e., the distance is above a threshold t).

This approach is commonly referred to as the "Nearest-Neighbor" strategy, because the histogram of the test image is matched to the closest histogram in the database.

1.5 Performance Evaluation

When comparing methods for the same task, there are two main approaches to evaluating performance:

1. Compare a Single Number: Metrics such as accuracy (recognition rate) or top-k accuracy are commonly used. Accuracy is defined as the percentage of correct predictions over the total number of test cases.
   Example: The accuracy bar chart compares the recognition accuracy across different experimental setups. The method "PIPER" achieves the highest accuracy across all setups compared to "naeil" and "bbq baseline".
2. Compare Curves: Performance curves such as Precision-Recall curves and ROC curves provide a more detailed evaluation of a model's performance. The Precision-Recall curve is useful in cases of class imbalance, showing the trade-off between precision and recall. The ROC curve plots the true positive rate (recall) against the false positive rate, with the Area Under the Curve (AUC) summarizing performance.
   Example: The Precision-Recall curve illustrates the performance of the DetectorNet method in two stages for a bird classification task. The ROC figure shows ROC curves for multiple face verification models, where higher curves represent better performance.

Figure 1.9: Accuracy
Figure 1.10: Precision Recall
Figure 1.11: ROC

1.5.1 Score-Based Evaluation

In object recognition tasks, the recognition algorithm evaluates the similarity between the query object and the training image using a similarity score.
The recognition decision is based on a threshold t, which determines whether the query object is classified as matching the training image.

- Similarity Score: The similarity score s is a normalized value between 0 and 1 that quantifies how closely the query object resembles the training image.
  – s = 1: indicates a perfect match between the query and training images.
  – s = 0: indicates no match between the query and training images.
- Threshold t: The threshold t is the cutoff value used to classify the query object as matching the training image.
  – If s ≥ t, the query object is classified as a match (positive example).
  – If s < t, the query object is classified as a non-match (negative example).
- Positive and Negative Examples:
  – Positive Example: Represented by a filled circle (•), where the similarity score s is greater than or equal to the threshold t.
  – Negative Example: Represented by an empty circle (◦), where the similarity score s is less than the threshold t.

1.5.2 Score-Based Evaluation Example

In the diagram, positive examples (green circles) are cases where the similarity score is above the threshold t, and negative examples (empty circles) are cases where the score is below the threshold.

Figure 1.12: Score-based evaluation showing positive and negative examples.

Threshold, Classifier, and Point Metrics

The recognition algorithm identifies (classifies) the query object as matching the training image if their similarity is above a threshold t. This decision process is central to building a classifier: the threshold t sets the decision boundary, and objects are classified as either positive or negative accordingly. The performance of this classifier can be evaluated using tools such as the confusion matrix, precision, and recall.

Figure 1.13: Score-based evaluation showing positive and negative examples.

Confusion Matrix

A confusion matrix is used to evaluate the classification performance of a model. It is a table that describes the performance of a classification model on a set of test data for which the true values are known. The matrix contains the following values:

- True Positives (TP): Number of positive samples correctly predicted as positive.
- False Positives (FP): Number of negative samples incorrectly predicted as positive.
- False Negatives (FN): Number of positive samples incorrectly predicted as negative.
- True Negatives (TN): Number of negative samples correctly predicted as negative.

Figure 1.14: Score-based evaluation showing positive and negative examples.

The goal is to have high values on the diagonal (TP, TN) and low values off the diagonal (FP, FN).

                      Actual positive   Actual negative
Predicted positive          TP                FP
Predicted negative          FN                TN

1.5.3 Overall Accuracy in Classification

In machine learning classification tasks, accuracy is one of the most commonly used evaluation metrics. It is the proportion of correctly predicted instances (both positive and negative) out of the total number of predictions made by the model.

More formally, overall accuracy is the ratio of the sum of True Positives (TP) and True Negatives (TN) to the total number of instances, which also includes False Positives (FP) and False Negatives (FN):

\[ \text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \]

Where:
- TP (True Positives): Correctly predicted positive instances.
- TN (True Negatives): Correctly predicted negative instances.
- FP (False Positives): Negative instances incorrectly predicted as positive.
- FN (False Negatives): Positive instances incorrectly predicted as negative.

Accuracy measures the diagonal elements of the confusion matrix (TP and TN), which are the correct classifications, as a fraction of the total number of instances. It gives a simple and intuitive view of the overall performance of a model: how often the classifier is correct across all categories. For example, if you predict on 100 test cases and 90 of them are correct, your accuracy is 90%.
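The thresholding rule, the confusion matrix entries, and the accuracy formula above can be combined in a short sketch. It assumes NumPy, a list of similarity scores, and ground-truth labels (1 if the query really matches, 0 otherwise). The function names, scores, and labels are hypothetical and chosen only for illustration.

```python
import numpy as np

def confusion_counts(scores, labels, t):
    """Count TP, FP, FN, TN for the rule 'predict positive if s >= t'."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)   # ground truth: True = real match
    predicted = scores >= t                   # classifier decision
    tp = int(np.sum(predicted & labels))
    fp = int(np.sum(predicted & ~labels))
    fn = int(np.sum(~predicted & labels))
    tn = int(np.sum(~predicted & ~labels))
    return tp, fp, fn, tn

def accuracy(tp, fp, fn, tn):
    """Fraction of all instances that were classified correctly."""
    return (tp + tn) / (tp + tn + fp + fn)

# Made-up similarity scores and ground-truth labels for eight queries.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
labels = [1, 1, 0, 1, 0, 1, 0, 0]

tp, fp, fn, tn = confusion_counts(scores, labels, t=0.5)
acc = accuracy(tp, fp, fn, tn)
```

With the threshold t = 0.5, this example yields TP = 3, FP = 1, FN = 1, TN = 3, so the accuracy is 6/8 = 0.75.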
PERFORMANCE EVALUATION 19 Limitations of Accuracy While accuracy might seem like a comprehensive metric at first glance, it can be misleading in certain scenarios, especially when dealing with imbalanced datasets(i.e., where one class significantly outnumbers the other). In im- balanced datasets, a model that predicts the majority class all the time can have high accuracy even though it fails to correctly predict the minority class. If a dataset has a 95% negative class (non-spam) and only 5% positive class (spam), a classifier that predicts all instances as negative will still achieve high accuracy, even though it never correctly identifies any positive instances. For example, if the model simply predicts ”non-spam” for every email, the accuracy would still be 95% even though it’s completely ineffective at catching spam. TP + TN 0 + 950 Accuracy = = = 0.95 = 95% TP + TN + FP + FN 0 + 950 + 50 + 0 Accuracy is a good metric when: Classes are balanced in the dataset. The cost of different types of errors (false positives vs. false negatives) is similar. In cases of class imbalance or when different error types have different costs, it is better to consider additional metrics (such as precision, recall, F1-score, or ROC-AUC) to make a more informed evaluation of the model’s performance. Alternatives to Accuracy Given the limitations of accuracy in certain scenarios, it is often helpful to use other evaluation metrics that provide a more detailed understanding of the model’s performance, such as: Precision: Focuses on the accuracy of positive predictions (how many of the predicted positives were actually positive). Recall (Sensitivity/True Positive Rate): Measures how many actual positives were correctly identified by the model. F1 Score: The harmonic mean of precision and recall, balancing the trade-off between the two. 
AUC-ROC (Area Under the Receiver Operating Characteristic Curve): Measures the model's ability to discriminate between classes across various thresholds, providing a more comprehensive evaluation than accuracy alone.

1.5.4 Overall Precision in Classification

Precision is a metric that measures the correctness of the positive predictions made by the model. It is the proportion of true positive predictions out of all predicted positive instances. The formula for precision is:

\[ \text{Precision} = \frac{TP}{TP + FP} \]

Where:

TP (True Positives): The number of correctly predicted positive instances.
FP (False Positives): The number of incorrectly predicted positive instances (false alarms).

Interpretation

High Precision: A high precision value means that most of the model's positive predictions are correct, with few false positives.
Low Precision: A low precision value means that the model is making a large number of false positive predictions.

Precision focuses on the quality of positive predictions, while recall focuses on how many actual positives are correctly identified. The trade-off between precision and recall can be summarized using the F1 score, which is the harmonic mean of precision and recall:

\[ F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} \]

Precision answers: "Out of all instances predicted as positive, how many were correct?"
Recall answers: "Out of all actual positive instances, how many were correctly predicted?"

Precision does not account for false negatives, so it may not be suitable when identifying all positive cases is important.
In imbalanced datasets, precision may be misleading if the model makes very few positive predictions.
There is often a trade-off between precision and recall. Increasing precision typically reduces recall and vice versa.
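
The formulas above can be sketched in a few lines. The example below applies them to the "always non-spam" classifier from the accuracy discussion (950 true negatives, 50 false negatives) and to a second, purely hypothetical classifier that makes only a handful of positive predictions. Recall is computed as TP / (TP + FN), following its description as the share of actual positives that were found. Returning 0.0 when a denominator is zero is a convention chosen for this sketch, not something the text prescribes.

```python
def precision(tp, fp):
    """Precision = TP / (TP + FP); 0.0 by convention if nothing was predicted positive."""
    return tp / (tp + fp) if (tp + fp) > 0 else 0.0


def recall(tp, fn):
    """Recall = TP / (TP + FN); 0.0 by convention if there are no actual positives."""
    return tp / (tp + fn) if (tp + fn) > 0 else 0.0


def f1_score(p, r):
    """Harmonic mean of precision p and recall r; 0.0 by convention if both are zero."""
    return 2 * p * r / (p + r) if (p + r) > 0 else 0.0


# "Always non-spam" classifier on 1,000 emails: 950 non-spam, 50 spam.
tp, fp, fn, tn = 0, 0, 50, 950
acc = (tp + tn) / (tp + tn + fp + fn)
p = precision(tp, fp)
r = recall(tp, fn)
print("always negative:", acc, p, r, f1_score(p, r))

# Hypothetical cautious classifier: flags only 10 emails, all of them truly spam.
p2 = precision(10, 0)
r2 = recall(10, 40)
print("cautious:", p2, r2, f1_score(p2, r2))
```

In the first case accuracy looks strong while precision, recall and F1 are all zero. In the second, precision is perfect but recall is low, and the F1 score stays low as well because the harmonic mean is pulled toward the smaller of the two values.
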
The trade-off between precision and recall is commonly summarized using the F1 score, which is the harmonic mean of precision and recall and offers a balance between the two metrics.

Limitations of Precision

Precision Does Not Consider False Negatives: Precision only evaluates how correct the positive predictions are; it does not consider the instances where the model failed to identify positive cases (false negatives). This can be problematic in situations where identifying all positive instances is important, as in medical tests.
Imbalanced Datasets: In highly imbalanced datasets (where one class significantly outnumbers the other), precision can be misleading. For example, if there are very few actual positive instances, the model could achieve high precision by making very few positive predictions but still miss many of the actual positives.

1.5.5 Recall in Classification

Recall is a metric that measures the ability of a classifier to correctly identify all positive instances in a dataset. It is also known as True Positive Rate (TPR) or Sensitivity. In simpler terms, recall answers the question: "Out of all the actual positive cases, how many were correctly predicted as positive?" It is defined as:

\[ \text{Recall} = \frac{TP}{TP + FN} \]

Where:

TP (True Positives): The number of correctly predicted positive instances.
FN (False Negatives): The number of actual positive instances that were incorrectly predicted as negative.

Importance of Recall

High Recall: A high recall value indicates that the model is good at identifying positive instances. In applications where missing a positive case is critical, such as detecting cancer or fraud, high recall is important.
Low Recall: A low recall value means that the model is missing many positive cases, i.e., it produces many false negatives.

Trade-off Between Recall and Precision

In classification tasks, there is often a trade-off between recall and precision.
Increasing recall typically comes at the expense of precision, and vice versa. This happens because:

Recall increases when the model predicts more instances as positive, which can also increase the number of False Positives (FP). The model captures more actual positives, but at the cost of wrongly classifying some negatives as positives.
Precision improves when the model becomes more conservative in predicting positives, which reduces the number of False Positives. This can decrease recall, because some actual positives are missed (i.e., the number of False Negatives (FN) increases).

To balance the trade-off between recall and precision, we can use the F1 score, which is the harmonic mean of precision and recall:

\[ F_1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} \]

The F1 score provides a single metric that takes both precision and recall into account. It is high only when both precision and recall are high.

100% Recall: Achieving 100% recall means that the model correctly identifies all actual positive cases (i.e., no false negatives), but it may produce a large number of false positives, lowering precision.
High Recall vs. High Precision: A model with high recall might have many false positives, while a model with high precision might miss some actual positive cases. The right balance between recall and precision depends on the specific application.
F1 Score: The F1 score is most useful when both precision and recall are important, as it balances the two. A high F1 score indicates good performance in terms of both precision and recall.

Chapter 2
Linear Regression

Linear regression is a statistical method used to model the relationship between a dependent variable (also called the target, response, or output) and one or more independent variables (also called predictors, explanatory variables, or features) by fitting a linear equation to observed data.
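
To make "fitting a linear equation to observed data" concrete, here is a minimal sketch for the case of a single predictor. It assumes ordinary least squares as the fitting criterion, which the text above does not name explicitly. The function name `fit_line` and the data are invented for illustration.

```python
def fit_line(xs, ys):
    """Return (intercept, slope) minimising the sum of squared vertical errors."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need at least two (x, y) pairs of equal length")
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # Spread of x around its mean; zero means every x is the same value.
    sxx = sum((x - x_mean) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; the slope is undefined")
    # Co-variation of x and y around their means.
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return intercept, slope


# Invented data lying exactly on y = 1 + 2x, so the fit should recover that line.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
intercept, slope = fit_line(xs, ys)
print(intercept, slope)
```

With noisy real data the recovered line would not pass through every point; the returned pair is then the line with the smallest total squared vertical distance to the observations.
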
2.1 Simple Linear Regression

In simple linear regression, there is one dependent variable $Y$ and one independent variable $X$. The goal is to model the relationship between $X$ and $Y$ using a linear function of $X$. The equation of the model is:

\[ Y = \beta_0 + \beta_1 X + \epsilon \]

Where:

$Y$: The dependent variable (response variable).
$X$: The independent variable (explanatory variable).
$\beta_0$: The intercept of the regression line, which represents the value of $Y$ when $X = 0$.
$\beta_1$: The slope of the regression line, which represents the change in $Y$ for a one-unit change in $X$.
$\epsilon$: The error term, representing the difference between the observed value of $Y$ and the value given by the linear part of the model. Its estimate from a fitted model is called the residual.

2.1.1 Interpretation of Parameters

Intercept $\beta_0$: This is the predicted value of $Y$ when $X = 0$. In many cases, this value is not meaningful (e.g., predicting salary when experience is zero), but it provides a reference point.
Slope $\beta_1$: This describes the relationship between $X$ and $Y$. Specifically, it quantifies the expected change in $Y$ for a unit increase in $X$. For example, if $\beta_1 = 2$, then for every unit increase in $X$, $Y$ is expected to increase by 2 units.

Assumptions of Simple Linear Regression:

1. Linearity: The relationship between the dependent variable $Y$ and the independent variable $X$ is linear.
2. Independence: The residuals (errors) $\epsilon$ are independent.
3. Homoscedasticity: The residuals have constant variance (i.e., the variance of the errors is the same across all values of $X$).
4. Normality: The residuals are normally distributed.

2.2 Multiple Linear Regression

In multiple linear regression, there are multiple independent variables $X_1, X_2, \dots, X_p$. The goal is to model the relationship between the dependent variable $Y$ and the set of independent variables $X_1, X_2, \dots, X_p$.
The equation of the model is given by:

\[ Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p + \epsilon \]

Where:

$Y$: The dependent variable (response variable).
$X_1, X_2, \dots, X_p$: The independent variables (predictors).
$\beta_0$: The intercept of the regression hyperplane.
$\beta_1, \beta_2, \dots, \beta_p$: The coefficients (slopes) associated with each independent variable.
$\epsilon$: The error term.

In matrix notation, this can be written as:

\[ Y = X\beta + \epsilon \]

Where:

$Y$ is an $n \times 1$ vector of the dependent variable values.
$X$ is an $n \times (p + 1)$ matrix of the independent variables (with a column of 1s for the intercept).
$\beta$ is a $(p + 1) \times 1$ vector of the model coefficients.
$\epsilon$ is an $n \times 1$ vector of the error terms.

2.2.1 Interpretation of Parameters in Multiple Linear Regression

Intercept $\beta_0$: The predicted value of $Y$ when all the independent variables $X_1, X_2, \dots, X_p$ are equal to 0.
Slope $\beta_i$: The expected change in $Y$ for a one-unit increase in $X_i$, holding all other independent variables constant.

2.2.2 Assumptions of Multiple Linear Regression

1. Linearity: The relationship between the dependent variable $Y$ and each independent variable $X_j$ is linear.
2. Independence: The residuals are independent.
3. Homoscedasticity: The residuals have constant variance.
4. Normality: The residuals are normally distributed.
5. No Multicollinearity: The independent variables $X_1, X_2, \dots, X_p$ are not too highly correlated with each other.
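
The matrix form above can be turned into a small numerical sketch. The version below assumes ordinary least squares, estimating the coefficients by solving the normal equations $(X^T X)\beta = X^T Y$; the text does not prescribe an estimation method, so this choice is an assumption. It uses only the standard library, and the helper names `solve_linear_system` and `fit_multiple` as well as the data are invented for illustration.

```python
def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix [a | b], copied so the inputs are left untouched.
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        # Pick the row with the largest entry in this column as the pivot.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system: predictors may be perfectly collinear")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_multiple(rows, ys):
    """Least-squares coefficients [b0, b1, ..., bp] from the normal equations."""
    # Design matrix X: a leading column of 1s for the intercept, then the predictors.
    X = [[1.0] + list(row) for row in rows]
    k = len(X[0])
    xtx = [[sum(x[i] * x[j] for x in X) for j in range(k)] for i in range(k)]
    xty = [sum(x[i] * y for x, y in zip(X, ys)) for i in range(k)]
    return solve_linear_system(xtx, xty)


# Invented data generated exactly from y = 1 + 2*x1 + 3*x2 (no noise).
rows = [(0, 0), (1, 0), (0, 1), (1, 1), (2, 1)]
ys = [1 + 2 * x1 + 3 * x2 for x1, x2 in rows]
beta = fit_multiple(rows, ys)
print(beta)
```

The singular-system check relates to the multicollinearity assumption: when one predictor is an exact linear combination of the others, $X^T X$ cannot be inverted and the coefficients are not uniquely determined. In practice a numerical library routine such as `numpy.linalg.lstsq` would usually be preferred over hand-written elimination.
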
