Fundamentals of Data Science PDF
Document Details
Università "La Sapienza" di Roma
2024
Andrea Di Vincenzo
Summary
This document provides notes on fundamentals of data science, focusing on topics like computer vision, edge detection, performance evaluation, machine learning techniques (including linear regression and gradient descent), classification models, and convolutional neural networks. The document is likely intended for a course or self-study on data science.
Fundamentals of Data Science
Andrea Di Vincenzo
October 2024

Contents

1 Computer Vision Basics
  1.1 Object Recognition
    1.1.1 Object Identification
    1.1.2 Object Classification
  1.2 Image Linear Filtering
    1.2.1 Image as a Mathematical Function
    1.2.2 Convolution
    1.2.3 Filtering to Reduce Noise
  1.3 Multi Scale Image Representation (to be completed)
2 Edge Detection
  2.1 1D Edge Detection Steps
    2.1.1 Preprocessing the Image with Gaussian Smoothing
    2.1.2 First Derivative (Gradient)
    2.1.3 Second Derivative (Laplacian)
    2.1.4 Simplified Edge Detection
    2.1.5 Hysteresis
  2.2 Color Recognition
    2.2.1 Advantages of Color Recognition
    2.2.2 Color Histograms
    2.2.3 Joint 3D Color Histograms
    2.2.4 Color Normalization by Intensity
    2.2.5 Recognition Using Histograms
  2.3 Histogram Comparison Technique
    2.3.1 Histogram Comparison: Intersection Method
    2.3.2 Histogram Comparison: Euclidean Distance
    2.3.3 Histogram Comparison: Chi-square Distance
    2.3.4 Which Measure is Best?
    2.3.5 Recognition using Histograms - Nearest-Neighbor Strategy
3 Performance Evaluation
  3.1 Performance Evaluation
    3.1.1 Score-Based Evaluation
    3.1.2 Overall Accuracy in Classification
    3.1.3 Overall Precision in Classification
    3.1.4 Recall in Classification
    3.1.5 F1 Score in Classification
4 Machine Learning
  4.1 Linear Regression
    4.1.1 Simple Linear Regression
    4.1.2 Multiple Linear Regression
    4.1.3 Assumptions of Linear Regression
    4.1.4 Cost Function for Choosing the Best Parameters in Linear Regression
    4.1.5 Understanding Underfitting and Overfitting in Linear Regression
  4.2 Gradient Descent in Linear Regression
    4.2.1 Gradient Descent and Convergence in Linear Regression
    4.2.2 Simultaneous Update in Gradient Descent
    4.2.3 Effect of Learning Rate in Gradient Descent
    4.2.4 Global Minimum vs. Local Minima in Gradient Descent
    4.2.5 Batch Gradient Descent in Linear Regression
    4.2.6 Stochastic Gradient Descent (SGD) in Linear Regression
    4.2.7 Mini-Batch Gradient Descent
    4.2.8 Gradient Descent for Multiple Variables
  4.3 Polynomial Regression
    4.3.1 Importance of Feature Scaling in Polynomial Regression
    4.3.2 Choice of Features
    4.3.3 Dangers of (Polynomial) Regression
    4.3.4 Cost Function in Polynomial Regression
5 Classification
  5.1 Interpreting the Output of a Classification Model
    5.1.1 Probability Simplex
    5.1.2 Classification Decision and Argmax Function
    5.1.3 Example
  5.2 Binary Classification
    5.2.1 Sigmoid Function
    5.2.2 Decision Boundary
    5.2.3 Circular Decision Boundary
    5.2.4 Beam Decision Boundary
    5.2.5 Log-Likelihood Function
    5.2.6 Newton's Method
  5.3 Multi-Class Classification
    5.3.1 Softmax Function
    5.3.2 One-Hot Encoding of Classes
    5.3.3 Likelihood Function
    5.3.4 Log-Likelihood Functions
    5.3.5 Negative Log-Likelihood
    5.3.6 Cross-Entropy Function
    5.3.7 Softmax Stability
    5.3.8 Final Overview
  5.4 Calibration
    5.4.1 Expected Calibration Error (ECE)
    5.4.2 Hyperplanes
6 Bias-Variance
  6.1 Bias-Variance Tradeoff
    6.1.1 Bias in Machine Learning
    6.1.2 Variance in Machine Learning
    6.1.3 The Tradeoff
  6.2 Expected Risk and Empirical Risk in Machine Learning
    6.2.1 Expected Risk
    6.2.2 Empirical Risk
  6.3 Setup/Assumptions in Machine Learning
    6.3.1 Data Distribution and Training/Testing Sets
    6.3.2 True Parameters and Model Estimation
    6.3.3 Process of Learning
    6.3.4 Estimators
  6.4 Addressing High Variance (Overfitting)
    6.4.1 Regularization in Machine Learning
    6.4.2 Regularization in Gradient Descent Optimization
    6.4.3 Regularization as a Probabilistic Interpretation
    6.4.4 The Prior Distribution
    6.4.5 L2-Regularization Term
    6.4.6 Generative vs. Discriminative and Bayesian vs. Frequentist
  6.5 Generalization and Empirical Risk in the Context of Learning Theory
    6.5.1 Generalization Error
    6.5.2 Error Decomposition
    6.5.3 Estimation Error
    6.5.4 Approximation Error
    6.5.5 Irreducible Error
    6.5.6 Combined Error Decomposition
  6.6 Hold-out Cross Validation
    6.6.1 Relationship Between Model Complexity and Error
    6.6.2 Dataset Splitting for Training, Validation and Testing in Machine Learning
7 Neural Networks
  7.1 Learning Paradigms in Neural Network Architectures
    7.1.1 Supervised and Unsupervised Learning
    7.1.2 Deep and Shallow Architectures
    7.1.3 Model Types within Each Quadrant
  7.2 Workflow of Convolutional Neural Network (CNN)
    7.2.1 The Artificial Neuron is a Fundamental Unit of Neural Networks
    7.2.2 Neural Network Layers Structure and Function
    7.2.3 Activation Functions Introduce Non-Linearity in Neural Networks
    7.2.4 Loss Functions and Optimization in Neural Networks
  7.3 Backpropagation and Gradient Descent
    7.3.1 Computational Graphs and The Two-Step Learning Process in Backpropagation
  7.4 FINISH BACKPROPAGATION
8 Convolutional Neural Network (CNN)
  8.1 Basic Components of CNNs
    8.1.1 Convolutional Layers
    8.1.2 Pooling Layers
    8.1.3 Fully Connected Layers
    8.1.4 Applications and Impact of CNN Components
    8.1.5 Feature Hierarchy in CNNs
  8.2 Training Process
    8.2.1 The Output Layer in CNNs
    8.2.2 The Loss Function in CNNs
    8.2.3 Gradient Descent in CNNs
9 Enhancing CNN Performance: Advanced Building Blocks
  9.1 Dropout
    9.1.1 Dropout Training
    9.1.2 Dropout Inference
    9.1.3 Inverse Dropout
  9.2 Vanishing Gradient Problem
    9.2.1 Residual Connections
    9.2.2 Prevention of Vanishing Gradients Through Identity Mappings
    9.2.3 ResNets
  9.3 Standard Scaling
    9.3.1 Feature Mean (µ) in the Context of Standard Scaling
    9.3.2 Feature Variance (σ²) in the Context of Standard Scaling
  9.4 Batch Normalization (BN)
    9.4.1 Batch Normalization Training
    9.4.2 Batch Normalization Inference
    9.4.3 Batch Normalization in Convolutional Layers
  9.5 Data Augmentation
    9.5.1 Steps in Data Augmentation
    9.5.2 Benefits of Data Augmentation
  9.6 CONTINUE FROM SLIDE 21
10 Data Representation in Unsupervised Learning
  10.1 Representing Data via Spanning Sets
    10.1.1 Visualizing Vectors and Basis Representation
    10.1.2 Spanning Sets and Linear Independence in Data Representation
  10.2 Encoding Data Using a Spanning Set for Dimensionality Reduction
    10.2.1 Minimizing the Cost Function for Data Representation in a Spanning Set
    10.2.2 Orthonormal Spanning Sets for Perfect Data Representation
    10.2.3 Imperfect Data Representation with a Limited Spanning Set
    10.2.4 Digital Image Compression as a Fixed Basis Dimension Reduction Technique
  10.3 Principal Component Analysis (PCA)
    10.3.1 Gradient Descent for Finding an Optimal Basis
    10.3.2 Enforcing Orthogonality During Learning in PCA
  10.4 Autoencoder
    10.4.1 Linear Autoencoder
    10.4.2 Deep Autoencoders
    10.4.3 Stacked Autoencoders
    10.4.4 Denoising Autoencoders
    10.4.5 Sparse Autoencoders
  10.5 Variational Autoencoders (VAEs)
    10.5.1 The Architecture of Variational Autoencoders (VAEs)
    10.5.2 Deep Latent Variables in Neural Networks
    10.5.3 Latent Code and the Role of Encoder-Decoder in VAEs
    10.5.4 Trivial and Linear Cases in Variational Autoencoders (VAEs)
    10.5.5 Parameterization Using Neural Networks
    10.5.6 Maximizing the Probability of the Observed Data
    10.5.7 Lower Bound (ELBO) as the Loss Function in VAEs
    10.5.8 Limitations of ELBO: Variance and Approximation Issues

Chapter 1 Computer Vision Basics

Computer vision is a field of computer science that focuses on teaching machines to understand and interpret visual information from the world, such as images and videos. It aims to enable computers to recognize objects, people, or patterns, much like how humans use their eyes and brains to see and understand. In the context of data science, computer vision often involves using data to train models that can perform tasks like identifying faces, detecting objects in photos, or analyzing movements in videos.
These models rely on techniques from machine learning and artificial intelligence to improve their accuracy and usefulness.

1.1 Object Recognition

Recognition problems in computer vision are fundamental tasks that focus on understanding and interpreting images by identifying objects within them. Recognition in computer vision often involves breaking down the process into key components to achieve a deeper understanding of an image.

Segmentation is the first step, where the goal is to distinguish between the pixels that belong to the object of interest (the foreground) and those that make up the background. This process creates a precise boundary around the object, enabling the system to isolate it from other elements in the image. For instance, in an image of a dog on grass, segmentation would separate the pixels forming the dog from those representing the grass.

Localization/Detection goes beyond segmentation to identify where the object is located within the scene. This includes pinpointing the object's position using bounding boxes or coordinates and estimating its pose, which refers to properties like orientation, size, scale, and even 3D position. For example, localization might indicate that a car is in the top-right corner of the image, while pose estimation could reveal that the car is facing forward, is scaled to appear close, and is tilted slightly to the left. These components work together to provide a detailed understanding of both the object's presence and its spatial context in the scene.

While recognition focuses on understanding and interpreting the content of an image, object identification and classification are specific sub-tasks within this process. Both tasks rely on recognizing patterns, but identification is more detailed and instance-specific, while classification abstracts these details to categorize based on shared characteristics. Together, they form the building blocks of object recognition systems.

1.1.1 Object Identification

Object Identification refers to the precise recognition of a specific instance of an object. For example, identifying "your apple" involves distinguishing that particular apple from all other apples or objects, based on unique features such as its shape, color, or even distinguishing marks. Object identification is often referred to as "instance recognition" and requires detailed knowledge about the exact object being observed, often through extensive training on the object's characteristics in different settings and viewpoints.

Object Identification Steps

This process requires careful preparation and training to ensure the model can uniquely distinguish the target object, even in complex or changing environments. The following steps outline the key components involved in successfully identifying specific objects using computer vision techniques:

1. Collecting and Labeling Specific Object Data: The first step in object identification is gathering a dataset that includes multiple images of the specific object to be identified. Each image must be accurately labeled to ensure the model can associate the correct features with the object.

2. Feature Extraction and Analysis: Advanced algorithms analyze the images to extract unique features of the specific object, such as texture, color, shape, or distinguishing marks, which set it apart from similar objects.
3. Training a Model for Specific Object Recognition: A deep learning model is trained using the labeled dataset, learning the unique characteristics of the object to distinguish it from others, even in varying conditions like changes in lighting, angle, or partial occlusion.

4. Validation and Real-World Testing: The trained model is validated and tested in real-world scenarios to confirm it can consistently recognize the specific object. Adjustments are made to fine-tune its performance based on real-world feedback.

Advantages of Object Identification

- Enhanced Precision for Specific Applications: Object identification enables highly accurate recognition of specific instances of objects, making it ideal for applications like inventory tracking, personal object recognition, or identifying unique items such as custom-made equipment.

- Customization for Specialized Use Cases: By focusing on recognizing specific objects, this method allows businesses and systems to tailor solutions to meet niche needs, such as identifying counterfeit products or tracking personalized belongings.

- Robustness in Differentiation: Object identification excels in distinguishing between nearly identical items. This capability is particularly beneficial in quality control processes, where identifying subtle defects or variations in products is crucial.

Disadvantages of Object Identification

- High Dependence on Quality Training Data: Object identification requires extensive and detailed datasets of the specific object to achieve high accuracy. Gathering such data for all potential use cases can be time-consuming and costly.

- Sensitivity to Environmental Variations: Models trained for specific object identification might struggle when faced with changes in lighting, angle, or background clutter, requiring additional adjustments to maintain performance.

- Limited Generalization: Unlike classification, which can generalize across a category, object identification is highly specialized. This specificity means it may require re-training for every new object to be identified, making scalability a challenge.

1.1.2 Object Classification

Object Classification involves recognizing objects as belonging to a general category or class, such as identifying any apple, any cup, or any dog. This process does not focus on specific instances but rather on shared features that define a category, such as the general roundness and stem of an apple or the handles of cups. Object classification is often associated with "basic-level categories," which are the most intuitive and general levels at which humans naturally classify objects. For instance, humans are quicker to identify a "dog" than to specify a breed like "Golden Retriever." In computer vision, object classification relies on training models to learn these shared features and categorize objects accurately, regardless of variations in size, color, or orientation.

Object Classification Steps

Unlike object identification, classification focuses on recognizing shared features among objects of the same category, rather than distinguishing specific instances. The following steps outline the essential process of object classification using computer vision techniques:

1. Preparing Your Data: Preprocessing provides better image data for computer vision models to work with by removing unwanted deformities and enhancing important parts of the image.
2. Object Detection: Objects are localized, which involves object segmentation and object position determination.

3. Identification of Patterns: Deep learning algorithms then identify patterns in the image that can be specific to a certain label. With the help of this dataset, the model gains future accuracy improvements.

4. Division of Observed Things into Predefined Classes: Machine learning algorithms divide observed things into predefined classes using the classification strategy, contrasting desired patterns with picture patterns.

Advantages of Image Classification

- Higher quality products: With accurate image classification, your AI product will be able to perform a variety of tasks connected with image object recognition.

- Real-world training: Since your product will need to recognize items in the physical world, it makes sense to train it on real-life images instead of computer-generated ones.

- Many practical applications: Image classification products can be used in areas like object identification in satellite images, traffic control systems, brake light detection, machine vision, and more.

Disadvantages of Image Classification

- Working with data uncertainties: Regardless of how thoroughly you train an image recognition model, there are cases where the model fails to classify the objects correctly. Most discrepancies can be attributed to the factors below.

- Occlusion: In some images, the target object may not be entirely visible. Consider a dog hiding in a bush, with its legs and body hidden. In this case, even if the dog's head is plainly visible in the viewpoint, the imaging algorithm might not be able to identify it.

- Background noise: When working with object detection machine learning, interfering ambient texture and color may impact the model's ability to correctly identify the object in the image. An imaging model, for instance, would have trouble distinguishing a red apple from a similar-colored table. Similarly, it is difficult to concentrate on a single vehicle in slow-moving traffic.

1.2 Image Linear Filtering

Image filtering is a basic yet powerful technique in image processing, used to enhance images or extract important information from them. It involves applying a mathematical function to the pixels of an image to achieve a desired effect. For example, filters can be used to smooth out noise, sharpen edges, or highlight certain features like lines or corners. Imagine taking a blurry photo and using a filter to make it clearer, or removing grainy imperfections caused by poor lighting: these are practical applications of image filtering. The process typically works by examining a small group of pixels (called a "neighborhood") and adjusting their values based on specific rules, like averaging them to reduce noise or emphasizing differences to detect edges. Filters are foundational tools in tasks like improving image quality or preparing images for further analysis in computer vision.

1.2.1 Image as a Mathematical Function

An image can be understood mathematically as a function f(x, y), where:

- x and y represent the spatial coordinates corresponding to specific locations within the image. In this context, (x, y) defines a point in the two-dimensional plane of the image, such as a pixel's position.
- The function f(x, y) assigns a value to each coordinate pair, encapsulating the visual information at that point.
This value, often referred to as the pixel value, carries the intensity or color information for the respective location. By interpreting the image as a function, it becomes possible to apply mathematical operations to analyze, modify, or extract details from the image. For example, in an image with dimensions M × N, x ranges from 0 to M − 1 and y ranges from 0 to N − 1, defining a grid of pixel values. This representation is fundamental in computer vision and image processing, allowing algorithms to treat the image as a structured dataset where each point corresponds to a measurable quantity that can be processed for tasks like filtering, object detection, and more. Understanding this mathematical framework helps bridge the gap between raw image data and computational analysis, forming the basis for advanced image manipulation techniques.

Grayscale Images

In a grayscale image, each pixel is represented by a single component that captures the intensity of gray at a specific location. This intensity value indicates how light or dark the pixel appears, where lower values represent darker shades (closer to black) and higher values represent lighter shades (closer to white). The typical range for these values is from 0 to 255, which corresponds to an 8-bit depth. This means each pixel can take on one of 256 possible shades of gray, providing a smooth transition between black and white. The simplicity of grayscale representation makes it ideal for many image processing tasks, as it reduces the complexity of the data while retaining critical visual information about texture, structure, and contrast within the image. This compact yet informative representation is widely used in applications like medical imaging, pattern recognition, and image filtering.

Figure 1.1: Grayscale Image Example

RGB Images

An RGB (Red, Green, Blue) image represents each pixel using three separate components, r(x, y), g(x, y), and b(x, y), which correspond to the intensities of the red, green, and blue color channels at a specific location (x, y) in the image. Each of these components typically has a value ranging from 0 to 255, which corresponds to an 8-bit depth, allowing for 256 possible intensity levels per channel. The combination of these three channels determines the final color of each pixel, with the values of r(x, y), g(x, y), and b(x, y) blending to produce a wide variety of colors. By varying the intensities of these primary colors, the RGB model can produce a total of 256 × 256 × 256 = 16,777,216 possible colors, defining the full color range available in standard digital images. This vast array of colors enables RGB images to capture intricate details and variations in visual scenes, making them a powerful representation for applications requiring rich color information. The three-channel structure also makes RGB images more complex to process compared to grayscale images, as each pixel carries three interdependent values that collectively define its color. Understanding the RGB representation is fundamental in fields like computer graphics, image processing, and digital photography, where precise manipulation of color is essential.

Figure 1.2: RGB Possible Colors
Figure 1.3: RGB Image
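To make the function view concrete, the short sketch below is an illustrative addition (not part of the original notes); the array names and pixel values are made up. It builds a small grayscale image as a NumPy array, reads f(x, y) at one pixel, and does the same for a synthetic RGB image with its three channels.

```python
import numpy as np

# A tiny 4x4 grayscale "image": f(x, y) is a single 8-bit intensity per pixel.
gray = np.array([[  0,  64, 128, 255],
                 [ 32,  96, 160, 224],
                 [ 16,  80, 144, 208],
                 [  8,  72, 136, 200]], dtype=np.uint8)

x, y = 1, 2                                 # a pixel location (row, column)
print("f(1, 2) =", gray[x, y])              # a single intensity in [0, 255]

# A tiny RGB image: three stacked channels r(x, y), g(x, y), b(x, y).
rgb = np.zeros((4, 4, 3), dtype=np.uint8)
rgb[..., 0] = 255                           # pure red everywhere
print("(r, g, b) at (1, 2):", rgb[x, y])    # -> [255 0 0]
print("shapes:", gray.shape, rgb.shape)     # (4, 4) vs (4, 4, 3)
```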
1.2.2 Convolution

Convolutions are fundamental mathematical operations in image processing, playing a key role in feature detection and analysis. They operate by applying a kernel (a small matrix of numerical values) to localized regions of an image, systematically performing element-wise multiplication and summation across the overlapping areas. The kernel defines the specific feature or transformation being applied, such as detecting edges or blurring the image. Convolutional operations are shift-invariant, meaning they respond consistently to a feature regardless of its position within the image. This property ensures that spatial structures, such as shapes or patterns, are preserved during the processing. By maintaining spatial relationships, convolutions are especially powerful in tasks requiring pattern recognition or feature extraction, making them a cornerstone in both traditional image processing and modern deep learning applications.

Convolutions offer several benefits in image processing and deep learning, making them a cornerstone in analyzing visual data.

- Feature extraction: One of their primary advantages is that convolutions enable the detection of local patterns such as edges, textures, and shapes within images. In deep learning, convolutional layers automatically learn these features at varying levels of abstraction, starting from simple edges in the initial layers to more complex patterns like textures and entire objects in deeper layers. This hierarchical learning makes convolutions highly effective for tasks like object detection and image classification.

- Localized operations: Convolutions work on small, localized regions of the input (kernels or filters), enabling efficient tasks like smoothing, sharpening, or gradient detection.

- Shift invariance: Convolutional operations preserve the spatial structure of the input data, allowing features to be recognized regardless of their position in the image. This property is essential for applications like object recognition, where the location of a feature should not affect its detection.

These combined benefits make convolutions versatile and powerful in both traditional and AI-driven image processing.

Image Filters

Image filtering is a crucial technique in image processing, where a filter is defined as a mathematical function applied to the pixels of an image to transform or extract information. This transformation is performed by systematically altering pixel values based on their surroundings, enabling the extraction of useful patterns or the modification of the image's appearance. Filters play diverse roles in image analysis:

- Enhancement: Improving the contrast of an image refines the visual clarity and highlights specific features.

- Smoothing: Reducing noise by averaging or other techniques results in a cleaner image without unwanted variations.

- Template matching: Detecting predefined patterns or shapes within an image aids in object recognition or alignment tasks.

Image filtering serves multiple important purposes and offers significant benefits in the field of image processing.

- Filtering for enhancement: Focuses on improving the contrast of an image, making features more distinguishable and visually prominent. This is particularly useful in highlighting edges, textures, or specific details that may otherwise blend into the background.

- Filtering for smoothing: Employed to remove noise, such as random variations in pixel values caused by poor lighting or sensor imperfections, resulting in a cleaner and more visually appealing image.
This is a crucial preprocessing step for advanced analysis, as it eliminates unwanted artifacts that could interfere with feature extraction or classification tasks.

- Filtering for template matching: Enables the detection of predefined patterns or structures within an image, such as recognizing shapes or objects. This is essential for applications like object recognition, feature detection, and alignment tasks, as it allows systems to identify and locate specific elements efficiently.

- Warping filters: Specialized tools used for geometric transformations in image processing, allowing for the modification of pixel positions to alter the image layout. These filters enable operations like resizing, rotating, or skewing an image, as well as more complex transformations such as perspective shifts or non-linear warping. By changing the spatial arrangement of pixels, warping filters are essential for tasks like image alignment, correcting distortions, or simulating visual effects, while preserving the visual coherence of the image's features.

Together, these filtering techniques enhance image quality, reduce noise, and facilitate reliable pattern detection, forming the foundation for many computer vision and image analysis applications.

Figure 1.4: Filtering

2D Convolution

Linear filtering is one of the simplest and most fundamental techniques in image processing. It involves replacing the value of each pixel in an image with a weighted sum of its own value and the values of its neighboring pixels. The weights used for this operation are defined by a filter or kernel, a small matrix that specifies how much influence each neighboring pixel has.

Convolution facilitates linear filtering by systematically applying the kernel across the image, performing element-wise multiplications and summing the results for each pixel position. This process is key to achieving effects like blurring, sharpening, or edge detection while maintaining computational efficiency and ensuring the transformation aligns with the spatial structure of the image.

Figure 1.5: Filter/Kernel Examples

2D Convolution Formulation

The mathematical operation of 2D convolution is defined as:

f[m, n] = (I ⊗ g)[m, n] = Σ_{k,l} I[m − k, n − l] · g[k, l]

where:
- I[m, n]: the input image, represented as a grid of pixel values.
- I[m − k, n − l]: the image region to which the kernel is applied.
- g[k, l]: the filter (or kernel), a small matrix that is applied to the image to define the operation (e.g., edge detection, smoothing).
- f[m, n]: the filtered image, where each pixel value is computed through the convolution operation.

This formula describes how the convolution process works by sliding a filter (kernel) over an input image and calculating the weighted sum of pixel values within the filter's receptive field. Each resulting value forms a pixel in the filtered image, preserving spatial relationships while applying transformations such as blurring or edge detection.

Practical Example of 2D Convolution

Given:

Input Image (I[k, l]): a portion of an image matrix (feature).

I = [ 8 5 2
      7 5 3
      9 4 1 ]

Filter Kernel (g[k, l]): a 3 × 3 kernel used for edge detection or smoothing, depending on its values. In this case, the kernel emphasizes differences in the horizontal direction:

g = [ −1 0 1
      −1 0 1
      −1 0 1 ]

Resulting Filtered Pixel (f[m, n]): an element-wise multiplication of the kernel is performed with the image region it overlays, and the results are summed.
For example:

f[1, 1] = (−1 · 8) + (0 · 5) + (1 · 2) + (−1 · 7) + (0 · 5) + (1 · 3) + (−1 · 9) + (0 · 4) + (1 · 1) = −18

The output size of an image decreases after a convolution operation because the kernel used for convolution requires a full overlap with the input image's pixels to compute valid results. A convolution kernel, such as a 3 × 3 matrix, performs its operation by sliding across the input image, calculating the weighted sum of the pixel values it overlaps at each position. However, this process restricts the kernel to regions where it can fully align with the image without extending beyond its boundaries.

Figure 1.6: Fully Align Kernel

For instance, when the kernel is centered near the edges of the image, parts of it fall outside the image area, leaving insufficient pixels for a valid computation. This boundary limitation means that the outermost rows and columns of pixels are excluded from the output, resulting in an image with reduced dimensions. The size of the output image after applying a convolution is determined by the dimensions of the input image and the kernel. For an input image of size N × N and a kernel of size K × K, the formula for the output size is given by:

Output Size = (N − K + 1) × (N − K + 1).

This formula accounts for the fact that the kernel slides across the input image and only computes values where it can fully overlap the pixels. For example, consider an input image of size 5 × 5 and a kernel of size 3 × 3. Substituting into the formula, the output size becomes:

(5 − 3 + 1) × (5 − 3 + 1) = 3 × 3.

Figure 1.7: Out of Bound Kernel

This means that the convolution reduces the spatial dimensions of the image because boundary regions are excluded as the kernel cannot extend beyond the image. The larger the kernel relative to the image size, the greater the reduction in the output dimensions. This property highlights the importance of choosing appropriate kernel sizes based on the image and task at hand. While this behavior is inherent to convolution, techniques like padding can be applied to preserve the original size by adding extra pixels around the image's edges, allowing the kernel to process these boundary regions fully.

Padding

To preserve the output size after applying a convolution, padding can be used to add extra rows and columns around the image. Padding involves surrounding the original image with a border of zero values (or another specified padding value) to ensure that the convolution kernel can fully overlap every pixel in the original image, including those near the edges. For a K × K kernel, a typical padding size is ⌊K/2⌋, where ⌊·⌋ represents the floor function, which takes the largest integer less than or equal to K/2. For instance, a 3 × 3 kernel requires a padding of 1 pixel, while a 5 × 5 kernel requires 2 pixels of padding. This approach ensures that the kernel has sufficient space to slide across the entire input image without reducing its dimensions, maintaining the output size equal to the input size. Padding is especially useful in applications like deep learning, where consistent input and output dimensions simplify the design of convolutional neural networks.

Figure 1.8: Padding
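The following sketch is an illustrative addition (the array and function names are my own). It reproduces the worked example with NumPy by multiplying the kernel element-wise with the patch it overlays and summing, exactly as described above (yielding −18), and then checks the (N − K + 1) output-size rule on a toy 5 × 5 input with and without zero padding.

```python
import numpy as np

# Input patch I and kernel g from the worked example above.
I = np.array([[8, 5, 2],
              [7, 5, 3],
              [9, 4, 1]])
g = np.array([[-1, 0, 1],
              [-1, 0, 1],
              [-1, 0, 1]])

# Element-wise multiply the kernel with the patch it overlays and sum.
f_11 = np.sum(I * g)
print("f[1, 1] =", f_11)                     # -> -18
# Strictly speaking this is cross-correlation; true convolution would flip the
# kernel first (see the Correlation vs. Convolution discussion below).

def valid_filter(image, kernel):
    """Slide a KxK kernel over an NxN image with no padding ('valid' output)."""
    N, K = image.shape[0], kernel.shape[0]
    out = np.zeros((N - K + 1, N - K + 1))
    for m in range(N - K + 1):
        for n in range(N - K + 1):
            out[m, n] = np.sum(image[m:m + K, n:n + K] * kernel)
    return out

img5 = np.arange(25).reshape(5, 5)           # a toy 5x5 "image"
print(valid_filter(img5, g).shape)           # -> (3, 3), matching (5 - 3 + 1)

# Zero padding by floor(K/2) = 1 preserves the original 5x5 size.
padded = np.pad(img5, 1, mode="constant")
print(valid_filter(padded, g).shape)         # -> (5, 5)
```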
Correlation vs. Convolution

The key difference between convolution and correlation lies in the flipping of the kernel. For an operation to qualify as convolution, the kernel must be flipped both horizontally and vertically before being applied to the input data. For example, consider a kernel:

[ 1 2
  3 4 ]

Flipping this kernel would result in:

[ 4 3
  2 1 ]

If this flipping is omitted, the operation becomes cross-correlation instead of convolution. Convolution emphasizes specific transformations, such as edge detection or blurring, by combining the flipped kernel with the input data to generate a transformed output. Cross-correlation, on the other hand, measures the similarity between two signals by directly applying the kernel to the input without flipping, making it useful for pattern matching and feature alignment. While these operations are mathematically related, their distinct characteristics make them suitable for different applications in image processing and signal analysis.

Properties of Linear Systems

Linear systems are governed by the principles of homogeneity, additivity, and superposition, which collectively define their behavior and response to inputs.

Homogeneity

Homogeneity states that if the input to a system is scaled by a factor a, the output will also be scaled by the same factor. Mathematically, for a system T and input X, this property is expressed as:

T[aX] = a T[X].

Proof: Let the input X produce an output T[X]. Scaling X by a means each element of X is multiplied by a. Since the system is linear, the operation performed by T on aX distributes over a, producing a · T[X], which satisfies the homogeneity property.

Additivity

Additivity means the response of the system to the sum of two inputs X1 and X2 is equal to the sum of the individual responses. This can be written as:

T[X1 + X2] = T[X1] + T[X2].

Proof: Suppose the system processes two inputs X1 and X2 separately, yielding outputs T[X1] and T[X2]. When the combined input X1 + X2 is passed through the system, linearity ensures that the system processes each input independently and then sums the results, demonstrating the additivity property.

Superposition

Superposition combines homogeneity and additivity into a single property, stating that for a linear system, the response to a weighted sum of inputs aX1 + bX2 is the weighted sum of their individual responses:

T[aX1 + bX2] = a T[X1] + b T[X2].

Proof: Using homogeneity, we know T[aX1] = a T[X1] and T[bX2] = b T[X2]. By additivity, the response to aX1 + bX2 is:

T[aX1 + bX2] = T[aX1] + T[bX2] = a T[X1] + b T[X2].

Linear Systems and Superposition

Superposition is a defining characteristic of linear systems, showing that the system's response to any linear combination of inputs is equivalent to the linear combination of their responses. For inputs aX1 + bX2, the output is:

T[aX1 + bX2] = a T[X1] + b T[X2].

Proof: By the definition of linear systems, superposition naturally follows, as it combines both homogeneity and additivity. Thus, any system satisfying superposition is inherently linear.

These properties ensure that linear systems behave predictably under operations like scaling and combination of inputs, making them fundamental in signal processing, control systems, and image processing. A small numerical check of these properties for a filtering operator is sketched below.
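As a quick sanity check of homogeneity, additivity, and superposition, the sketch below is illustrative only; the operator, variable names, and random inputs are my own choices. It uses a fixed 3 × 3 averaging filter as the linear operator T and verifies numerically that T[aX1 + bX2] = a T[X1] + b T[X2].

```python
import numpy as np

rng = np.random.default_rng(0)

def T(X, kernel=np.full((3, 3), 1 / 9)):
    """A linear operator: 'valid' filtering of X with a fixed 3x3 box kernel."""
    N, K = X.shape[0], kernel.shape[0]
    out = np.zeros((N - K + 1, N - K + 1))
    for m in range(out.shape[0]):
        for n in range(out.shape[1]):
            out[m, n] = np.sum(X[m:m + K, n:n + K] * kernel)
    return out

X1, X2 = rng.random((6, 6)), rng.random((6, 6))
a, b = 2.0, -0.5

print(np.allclose(T(a * X1), a * T(X1)))                       # homogeneity
print(np.allclose(T(X1 + X2), T(X1) + T(X2)))                  # additivity
print(np.allclose(T(a * X1 + b * X2), a * T(X1) + b * T(X2)))  # superposition
```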
1.2.3 Filtering to Reduce Noise

Filtering is a fundamental technique in image processing used to reduce noise and enhance image quality. Noise refers to unwanted variations in pixel values that obscure the true signal of the image, often caused by factors like sensor errors, quantization, or environmental conditions. Filtering methods are designed to smooth out these random fluctuations while preserving essential image details. By leveraging the assumption that neighboring pixels are similar in intensity, filters like the mean or Gaussian filter replace each pixel's value with a weighted average of its neighborhood. This process reduces the impact of noise by emphasizing consistent patterns over random variations. Filtering not only improves the visual quality of the image but also prepares it for more accurate analysis and feature extraction in tasks such as object detection or segmentation.

Image Noise

Noise in image processing refers to unwanted variations in pixel values that obscure or distort the meaningful content of an image. It is categorized into two main types:

- Low-level Noise: Includes issues such as light fluctuations, sensor imperfections, quantization errors, and finite precision during image acquisition or processing. These types of noise often manifest as random variations in intensity, making the image appear grainy or distorted.

- Complex Noise: Involves larger-scale disruptions such as shadows or extraneous objects that interfere with the image's interpretation, though these are not typically addressed in basic noise reduction techniques.

The fundamental assumption for reducing noise is that a pixel's neighborhood contains correlated intensity information, meaning nearby pixels are likely to have similar values. This assumption is leveraged in techniques like averaging, where the pixel value is replaced with the mean of its neighbors to smooth out random variations and reduce noise while preserving the underlying structure.

Additive Noise Model

The additive noise model in image processing assumes that an observed image I is the sum of the true signal S and a noise component N, represented as:

I = S + N

Figure 1.9: Noise Model

This model posits that the noise is independent of the signal, meaning it does not systematically affect or correlate with the true content of the image. In this framework, the intensity of a pixel I_i is expressed as the sum of its true intensity S_i and the noise N_i. Several key considerations underlie this model:

- Pixel Intensity reflects both the desired information and random variations caused by noise.

- Noise Properties are defined as follows: the noise N_i has an expected value of zero, E(N_i) = 0, ensuring that it does not introduce a bias to the image; noise values N_i and N_j are independent for i ≠ j, meaning there is no relationship between noise at different pixel locations; the noise is i.i.d. (independent and identically distributed), signifying that noise values follow the same statistical distribution across the image.

This model provides a foundational framework for analyzing and mitigating noise in image processing applications.

The additive noise model has significant implications for noise reduction techniques in image processing. One key approach is averaging noise, which leverages the assumption that a pixel's true intensity is similar to that of its neighbors (Box Filter). By averaging the values in a pixel's neighborhood, random variations introduced by noise are smoothed out, effectively reducing its impact on the signal. This technique relies on the statistical principle that averaging independent noise components tends to cancel out their random fluctuations, as the sketch below illustrates.
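The short simulation below is an illustrative sketch; the synthetic signal, noise level, and filter size are arbitrary choices of mine. It generates I = S + N with zero-mean i.i.d. Gaussian noise and applies a 3 × 3 box filter, showing that the averaged image is closer to the clean signal than the noisy observation.

```python
import numpy as np

rng = np.random.default_rng(42)

S = np.tile(np.linspace(0, 255, 64), (64, 1))   # smooth "true" signal (a ramp)
N = rng.normal(0.0, 20.0, S.shape)              # zero-mean i.i.d. Gaussian noise
I = S + N                                       # observed image: I = S + N

def box_filter(img, K=3):
    """Replace each pixel by the mean of its KxK neighborhood (edges cropped)."""
    H, W = img.shape
    out = np.zeros((H - K + 1, W - K + 1))
    for m in range(out.shape[0]):
        for n in range(out.shape[1]):
            out[m, n] = img[m:m + K, n:n + K].mean()
    return out

smoothed = box_filter(I)
crop = S[1:-1, 1:-1]                            # align the clean signal with the output

print("RMSE before filtering:", np.sqrt(np.mean((I[1:-1, 1:-1] - crop) ** 2)))
print("RMSE after  filtering:", np.sqrt(np.mean((smoothed - crop) ** 2)))
# Averaging 9 i.i.d. noise samples reduces the noise std by roughly a factor of 3.
```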
Smoothing as inference takes this concept further by viewing smoothing operations as a way to infer the true underlying signal from noisy observations. By applying techniques like Gaussian or mean filtering, the system mitigates noise and enhances the clarity of the image, while preserving essential features like edges and textures to the extent possible. These methods are foundational in preprocessing tasks, ensuring that subsequent analysis operates on cleaner, more reliable data.

Gaussian Filtering

Gaussian filtering is a widely used technique in image processing for noise reduction. It works by applying a Gaussian kernel to an image, which smooths out variations by averaging the pixel values in a local neighborhood. The Gaussian kernel gives higher weights to pixels closer to the center of the neighborhood and progressively lower weights to farther pixels, ensuring that nearby pixels have a more significant influence on the final output. This property makes Gaussian filtering particularly effective at reducing random noise while preserving important image features, such as edges, to a certain extent. By treating high-frequency noise as rapid intensity changes, the Gaussian filter acts as a low-pass filter, smoothing out these fluctuations and producing a cleaner, more uniform image.

Probabilistic Inference Approach

Reminder: Standard Deviation and Mean

The formula for the standard deviation (σ) of a set of data points is:

σ = sqrt( (1/N) Σ_{i=1}^{N} (x_i − µ)² )

where:
- N: total number of data points.
- x_i: the i-th data point in the dataset.
- µ: the mean (average) of the dataset, calculated as µ = (1/N) Σ_{i=1}^{N} x_i.

Explanation:
- x_i − µ: the deviation of each data point from the mean.
- (x_i − µ)²: squaring ensures all deviations are positive.
- (1/N) Σ_{i=1}^{N} (x_i − µ)²: the average of the squared deviations (the variance).
- Taking the square root of the variance gives the standard deviation, which represents the spread of the data around the mean.

If the data represents a sample rather than the entire population, N − 1 is used in the denominator instead of N (this is called the sample standard deviation).

The probabilistic inference approach in Gaussian smoothing ensures that nearby pixels contribute more significantly to the smoothing process than distant ones, reflecting the intuition that closer pixels are more likely to share similar intensities. This concept is mathematically captured by the Gaussian kernel formula:

g(x, y) = (1 / (2πσ²)) · exp( −(x² + y²) / (2σ²) )

The kernel assigns weights to each pixel based on its distance from the center (µ), with the weights exponentially decreasing as the distance increases. The parameter σ (standard deviation) controls the spread of the kernel, determining how quickly the weights decay. A smaller σ results in a narrower kernel, where only very close pixels have a significant influence, while a larger σ creates a broader kernel that incorporates contributions from more distant pixels. This probabilistic framework ensures that the smoothing process respects the spatial structure of the image, reducing noise while preserving essential features.
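A small sketch of how such a kernel can be built in practice (illustrative only; the kernel size, σ values, and helper name are my own choices): it evaluates g(x, y) on a discrete grid, normalizes the weights so they sum to 1, and shows that a larger σ spreads the weight toward more distant pixels.

```python
import numpy as np

def gaussian_kernel(size=5, sigma=1.0):
    """Discrete Gaussian kernel g(x, y) ∝ exp(-(x^2 + y^2) / (2 sigma^2)), normalized."""
    half = size // 2
    x, y = np.meshgrid(np.arange(-half, half + 1), np.arange(-half, half + 1))
    g = np.exp(-(x**2 + y**2) / (2 * sigma**2)) / (2 * np.pi * sigma**2)
    return g / g.sum()                     # normalize so the weights sum to 1

k_narrow = gaussian_kernel(5, sigma=0.8)
k_broad  = gaussian_kernel(5, sigma=2.0)

print(np.round(k_narrow, 3))               # weight concentrated at the center
print(np.round(k_broad, 3))                # weight spread toward distant pixels
print(k_narrow.sum(), k_broad.sum())       # both sum to 1
```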
Separability of Box and Gaussian Filters

The separable property of filters refers to the ability to decompose a 2D convolution operation into two sequential 1D convolutions, one along the rows and the other along the columns. This property applies to certain filters where the 2D kernel can be expressed as the outer product of two 1D kernels. Mathematically, separability allows convolution to be performed in a sequence:

(f_x ⊗ f_y) ⊗ I = f_x ⊗ (f_y ⊗ I),

where:
- f_x and f_y are the 1D kernels,
- I is the input image.

Figure 1.10: Smoothing Grid
Figure 1.11: Gaussian at σ Variation

This formulation leverages the linearity, associativity, and commutativity of convolution, ensuring that the output remains equivalent to that of a full 2D convolution. The advantages of separability are significant:

- It drastically reduces computational cost, as the per-pixel complexity of a full K × K 2D convolution is reduced from O(K²) to O(2K).

- It simplifies implementation, as the two sequential 1D convolutions are easier to compute and optimize.

Despite these benefits, the separable property achieves the same effect as the full 2D convolution, making it an efficient choice in many image processing and machine learning tasks.

Practical Example of Filter Separability

Filters like box and Gaussian filters can exploit their separable property to be efficiently applied as two sequential 1D filters instead of a full 2D convolution. This involves first convolving the image with a 1D filter along the rows, followed by convolving the result with a 1D filter along the columns.

Box Filter Separability

Figure 1.12: Box Filter Separability

A box filter is a simple averaging filter represented by a 3 × 3 kernel, where each value is 1/9, effectively computing the average of the pixel and its 8 neighbors. This 3 × 3 box filter is separable, meaning it can be decomposed into two 1D filters for computational efficiency. The horizontal filter is [1/3, 1/3, 1/3], which applies averaging along the rows of the image, and the vertical filter is the column vector [1/3, 1/3, 1/3]ᵀ, which performs averaging along the columns. By applying these two 1D filters sequentially, the same result as the full 3 × 3 convolution is achieved, but with reduced computational complexity, making this approach more efficient for image processing tasks.

Gaussian Filter Separability

A Gaussian filter, widely used for smoothing and noise reduction in image processing, is separable, allowing it to be decomposed into two 1D filters applied sequentially. In the x-direction, the 1D Gaussian filter is defined as:

g(x) = (1 / (√(2π) σ)) · exp( −x² / (2σ²) ),

and in the y-direction, it is given by:

g(y) = (1 / (√(2π) σ)) · exp( −y² / (2σ²) ).

The full 2D Gaussian filter combines these components into a single expression:

g(x, y) = (1 / (2πσ²)) · exp( −(x² + y²) / (2σ²) ).

By leveraging the separable property, the 2D convolution with a Gaussian filter can be performed efficiently as two 1D convolutions: first along the rows using g(x), and then along the columns using g(y). This decomposition not only reduces computational complexity but also simplifies implementation while achieving the same smoothing effect as a direct 2D Gaussian filter.

Proof

h(i, j) = f(i, j) ∗ g(i, j) = Σ_{k=1}^{m} Σ_{l=1}^{n} g(k, l) · f(i − k, j − l)
        = Σ_{k=1}^{m} Σ_{l=1}^{n} exp( −(k² + l²) / (2σ²) ) · f(i − k, j − l)
        = Σ_{k=1}^{m} exp( −k² / (2σ²) ) [ Σ_{l=1}^{n} exp( −l² / (2σ²) ) · f(i − k, j − l) ]
        = Σ_{k=1}^{m} exp( −k² / (2σ²) ) · h′(i − k, j)

where h′(i − k, j) = Σ_{l=1}^{n} exp( −l² / (2σ²) ) · f(i − k, j − l), representing the 1D Gaussian applied horizontally, while the final summation over k represents the 1D Gaussian applied vertically.

Starting with the 2D convolution formula, h(i, j) = f(i, j) ∗ g(i, j), it is expressed as a double summation over the input image f(i, j) and the Gaussian kernel g(k, l), where k and l are the indices of the kernel.
The Gaussian kernel g(k, l) is expanded as a product of two 1D Gaussian functions, exp(−k²/(2σ²)) and exp(−l²/(2σ²)), due to its separable nature. The double summation is then split into two steps:

1. For each fixed k, the summation over l is isolated, effectively applying a 1D Gaussian convolution along the rows of the image, resulting in an intermediate result h′(i − k, j). This step corresponds to the horizontal 1D Gaussian convolution.

2. In the second step, the summation over k is performed, applying a 1D Gaussian convolution along the columns, using the intermediate result h′. This corresponds to the vertical 1D Gaussian convolution.

By performing the horizontal and vertical convolutions sequentially, the same result as the full 2D Gaussian convolution is achieved. This proof demonstrates how the separability property reduces the computational complexity, as the double summation over two dimensions is replaced with two independent 1D convolutions, making the operation more efficient without compromising the outcome.

1.3 Multi Scale Image Representation (to be completed)

Chapter 2 Edge Detection

Edge recognition is a fundamental concept in image processing, focusing on identifying the boundaries or transitions between different regions in an image. One common approach is recognition using line drawings, where the goal is to extract the structural outlines or edges in an image, simplifying it into a representation of shapes and objects. This process often relies on image derivatives, which measure changes in pixel intensity. The 1st-order derivative detects rapid intensity changes, identifying edges as areas where the gradient (rate of change) is highest. The 2nd-order derivative goes further, detecting variations in the gradient itself, which can highlight finer edge details or transitions. These techniques provide the mathematical foundation for identifying and analyzing edges, helping in tasks like object recognition, segmentation, and scene understanding.

Recognition Using Line Drawings

Recognition using line drawings is an image processing technique that focuses on extracting and analyzing the structural outlines of objects in an image. The primary purpose of line drawings in image processing is to simplify visual information by reducing it to essential lines and edges. This simplification helps algorithms focus on an object's shape and structure without being influenced by less relevant details like textures, colors, or noise. By abstracting the image into a basic representation, line drawings make it easier to analyze the underlying geometry and structural features. Additionally, line drawings are invaluable for feature extraction, as they highlight critical elements such as edges, corners, and outlines. These features are essential for recognition algorithms, which often rely on geometric and structural information to classify objects, detect patterns, or understand scenes. This makes line drawings an effective tool in applications such as object recognition, robotics, and computer vision.

Typical use cases of line drawings include sketch-based recognition, a technique where users provide sketches or line drawings as input for a system to recognize and match them to real-world objects. This approach leverages the simplified representation of objects through their contours and edges, allowing the system to focus on shape and structure rather than textures or colors.
By comparing the geometric and structural features of the sketch to a database of objects, the system can identify similar items or categories. This method is widely used in applications such as design tools, where users create sketches to search for matching templates, and in augmented reality, where hand-drawn inputs are used to interact with virtual objects. Sketch-based recognition highlights the power of line drawings in bridging human creativity with computational understanding.

Edge detection involves a series of steps to identify significant transitions in intensity within an image.

1. Smoothing Using a Gaussian Filter: Smoothing reduces noise and high-frequency variations that could obscure meaningful edges. By smoothing the image, only the significant transitions in intensity are retained, making subsequent steps more accurate.

2. Derivatives of the Smoothed Image: These are calculated to measure the rate of intensity change. These derivatives, such as gradients, help highlight regions where intensity varies significantly, which typically correspond to edges.

3. Maxima of the Derivative: These are identified as the locations of edges. The maxima represent points of the highest intensity change, marking the boundaries between different regions in the image.

This step-by-step approach ensures that edges are detected robustly while minimizing false positives caused by noise or minor intensity variations.

Goals of Edge Detection

The primary goals of edge detection are to accurately identify significant intensity changes in an image while minimizing errors caused by noise or artifacts.

- Good detection: The filter or algorithm must differentiate between actual edges and random noise. This requires high sensitivity to edges and low sensitivity to noise, allowing the detection of true boundaries without being misled by irrelevant variations.

- Good location: The detected edge must correspond precisely to the location where the intensity change occurs, without any shifting or offset.

- Single response: Each edge should be detected only once. Multiple responses to the same edge can lead to redundancy, confusion, and errors in subsequent processing.

Together, these goals ensure accurate, precise, and reliable edge detection.

Edge Detection Issues

Despite these goals, several issues can arise in edge detection:

- Poor Localization: A common problem where detected edges are shifted from their true location, leading to inaccuracies in identifying boundaries. This can occur due to improper filtering or interference from noise.

- Too Many Responses: Multiple edges are identified for a single true edge. This redundancy can result in false detections and clutter, complicating further image analysis.

These challenges highlight the importance of designing robust edge detection algorithms that balance sensitivity, precision, and noise handling to achieve accurate results.

2.1 1D Edge Detection Steps

1D edge detection involves analyzing intensity changes along a single dimension of an image, typically a row or column, to identify boundaries or transitions. The process detects sharp changes in pixel intensity, which correspond to edges in the image. This approach is useful for simplifying edge detection to a single dimension, where intensity profiles are analyzed using techniques like derivatives or gradient-based methods to locate peaks and troughs representing edges.
Example: a specific section of the "Barbara" image is selected, corresponding to row 250, i.e., a single row of pixels. The intensity values of the pixels along this row are plotted as a profile graph.

Figure 2.1: Single-Dimension Line Function

In this plot:

The x-axis corresponds to the pixel positions along the row.

The y-axis represents the intensity values.

The profile shows variations in intensity, with sharp changes indicating potential edges. This visualization highlights the changes in intensity along a single dimension, making it easier to identify edges within the selected row. The intensity profile extracted from the image section appears noisy, particularly in regions with rapid and frequent changes in intensity. These variations are caused by high-frequency components in the image, such as textures, fine patterns, or sharp transitions between pixel intensities. High-frequency fluctuations represent intricate details, like the fabric's texture in the "Barbara" image, which are visually important but complicate edge detection. Such fluctuations mask the larger, more meaningful intensity transitions that define edges, making it harder to identify them accurately. Without preprocessing, this noise can lead to false edge detections or missed boundaries. Techniques such as smoothing filters (e.g., Gaussian filters) are therefore applied to reduce the influence of high-frequency components, suppressing noise while retaining the significant edges for better analysis and detection.

2.1.1 Preprocessing the Image with Gaussian Smoothing

Smoothing an intensity profile is a crucial preprocessing step for 1D edge detection, as it reduces noise and emphasizes significant intensity transitions. Applying a Gaussian smoothing filter is a common method for this purpose. The Gaussian filter reduces high-frequency components and noise in the intensity profile, resulting in a smoother curve that is easier to analyze. This smoothing minimizes random fluctuations caused by noise or fine textures, which can otherwise obscure meaningful edges in the profile. By attenuating these minor variations, the process highlights the more prominent changes in intensity, which correspond to edges or boundaries between different regions in the image. The smoothed profile thus provides a clearer indication of where significant transitions occur, improving the reliability of edge detection algorithms. The purpose of smoothing is to enhance the ability to identify edges or transitions by filtering out unnecessary noise while preserving important intensity changes. This enables more accurate detection of boundaries, leading to better recognition and analysis of objects or patterns in the image. Overall, smoothing is an essential step that balances noise reduction with edge preservation, making edge detection more robust and effective.

Figure 2.2: Smoothed Function

Edge detection via derivatives of smoothed images is a fundamental method in image processing for identifying significant transitions in intensity within an image. An edge is mathematically defined as a point or region where there is a sharp change in intensity, marking the boundary between different regions. To detect these changes, derivatives are employed to measure the rate of change in intensity values.

2.1.2 First Derivative (Gradient)

The first derivative in image processing measures the rate of change of intensity across the image.
It quantifies how quickly pixel values change, providing a mathematical representation of intensity variations. Large values of the first derivative correspond to rapid changes in intensity, making it a reliable indicator of edges, where significant transitions occur between regions. This property makes the first derivative particularly useful for detecting edges, as it highlights areas with sharp intensity transitions. By applying gradient operators such as Sobel or Prewitt, the first derivative can be computed efficiently, aiding in edge detection tasks and emphasizing critical boundaries within an image.

First Derivative Formulation

The first derivative measures the rate of change of a function f(x). It is defined as the limit of the difference quotient as the step size h approaches zero:

d/dx f(x) = lim_{h→0} [f(x + h) − f(x)] / h

In image processing, this derivative is approximated using a difference operation, with h set to 1 (the distance between adjacent pixels). The approximation becomes:

d/dx f(x) ≈ f(x + 1) − f(x),

which calculates the difference in intensity between neighboring pixels. This discrete difference is widely used in digital images to compute derivatives efficiently, since the continuous mathematical derivative cannot be applied directly to discrete pixel data. The difference operation allows intensity changes to be detected across an image, which is the basis of many edge detection techniques. Because it relies on simple subtraction, this approach highlights regions with significant changes in brightness and is a computationally efficient way to approximate the derivative and analyze intensity transitions in digital images.

Linear Kernels

Linear kernels are commonly used to approximate the derivatives of an image by calculating intensity differences between pixels, enabling the detection of edges and transitions. A direct filter uses a simple kernel, [−1, 1], to compute the difference between two neighboring pixels. This filter directly subtracts the intensity of one pixel from its adjacent pixel, making it an efficient method for identifying rapid intensity changes and detecting edges. A symmetric filter employs the kernel [−1, 0, 1], which computes the central difference by comparing the two pixels on either side of a center pixel. This approach provides a more balanced estimate of intensity changes by considering both sides of the central pixel, resulting in more accurate edge detection. Both filters are fundamental tools for estimating intensity variations in an image, with the symmetric filter offering greater precision in preserving directional information.

First Derivative Example

Consider a 2D grayscale image represented as a matrix of intensity values:

I =
10 15 20 25
30 35 40 45
50 55 60 65
70 75 80 85

We want to compute the horizontal derivative using the direct filter [−1, 1] and the symmetric filter [−1, 0, 1].

Using the Direct Filter [−1, 1]

For horizontal derivatives, we apply the kernel row-wise:

K = [−1 1]

Convolving this kernel with the image computes differences between adjacent pixels in each row. For example, for the first row:

(15 − 10), (20 − 15), (25 − 20) = [5, 5, 5]

Applying this to the entire image:

Horizontal Derivative (Direct Filter) =
5 5 5
5 5 5
5 5 5
5 5 5

This highlights the changes along the horizontal direction.
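As a quick check of the numbers above, the following sketch applies both kernels to the example matrix with SciPy. It uses cross-correlation (no kernel flip), which matches the sliding-difference convention used in the text; the variable names are illustrative.

```python
import numpy as np
from scipy.signal import correlate2d

I = np.array([[10, 15, 20, 25],
              [30, 35, 40, 45],
              [50, 55, 60, 65],
              [70, 75, 80, 85]], dtype=float)

# Direct filter [-1, 1]: difference between each pixel and its right neighbor.
direct = correlate2d(I, np.array([[-1, 1]]), mode='valid')
print(direct)      # 4x3 matrix filled with 5

# Symmetric filter [-1, 0, 1]: central difference, skipping the middle pixel.
symmetric = correlate2d(I, np.array([[-1, 0, 1]]), mode='valid')
print(symmetric)   # 4x2 matrix filled with 10
```

Here `mode='valid'` keeps only positions where the kernel fully overlaps the image, which is why the output matrices are narrower than I, matching the hand computation above.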
Using the Symmetric Filter [−1, 0, 1]

For horizontal derivatives, the symmetric kernel takes the difference between the two neighbors of a pixel, skipping the center pixel:

K = [−1 0 1]

For the first row, the central difference calculation is:

(20 − 10), (25 − 15) = [10, 10]

Applying this across the image (ignoring boundaries):

Horizontal Derivative (Symmetric Filter) =
10 10
10 10
10 10
10 10

This provides a more balanced difference measure than the direct filter.

Extending to Vertical Derivatives

For vertical derivatives, the kernels are transposed:

Direct filter: K = [−1; 1] (a column vector)

Symmetric filter: K = [−1; 0; 1] (a column vector)

These are applied column-wise to compute intensity changes along the vertical direction, and the resulting matrices highlight intensity transitions between rows. This example shows how linear kernels are applied to 2D images to detect edges and intensity changes in both the horizontal and vertical directions.

Types of Edges in 1D

Edges in an image are critical features that represent transitions or boundaries between different regions. Several types of edges exist, each characterized by the way intensity values change across them. Step edges show an abrupt change in intensity, indicating sharp boundaries. Ramp edges have a gradual transition in intensity, often caused by lighting variations or smoother boundaries. Line (or bar) edges appear as narrow, intense changes within a single region, often representing thin structures or lines. Finally, roof edges exhibit a peak in intensity, with a rise and fall resembling a roof shape, commonly seen at curved or soft boundaries. Understanding these types of edges is essential for designing detection techniques tailored to their characteristics.

Step Edges

A step edge occurs when there is an abrupt change in intensity from one level to another, creating a distinct boundary between two regions. It is characterized by a sharp, vertical transition in the intensity profile, where pixel values suddenly shift from one level to another without gradual variation. This type of edge involves significantly different brightness levels, such as the boundary between a dark and a bright region. Step edges are prominent in images with clear, well-defined objects and are crucial for identifying boundaries and segmenting regions effectively in image analysis.

Ramp Edges

A ramp edge is characterized by a gradual transition between two intensity levels, where the change in brightness occurs smoothly over a region rather than abruptly. For example, a shadow transitioning gradually from dark to light illustrates a ramp edge, as the intensity values increase progressively rather than jumping sharply. Ramp edges are common in images with soft boundaries or gradual shading, and they appear as sloped transitions in the intensity profile. This type of edge is typically more challenging to detect than a step edge, as the lack of an abrupt change requires more sensitive techniques to identify the transition accurately.

Line or Bar Edges

A line or bar edge consists of a narrow region of high intensity flanked by two lower-intensity regions, creating a distinct, thin feature within the image. For example, a thin bright line against a darker background is a typical line or bar edge. This type of edge highlights thin, well-defined features, such as wires, cracks, or small structural elements, and is essential for detecting fine details in images.
The intensity profile of a line or bar edge shows a sharp peak, making it easily identifiable and useful for tasks requiring precise feature recognition.

Roof Edges

A roof edge is defined by a sharp peak in intensity with sloping sides, resembling a ridge or pinnacle in the intensity profile. It typically represents thin structures or sharp transitions in an image where the intensity gradually increases to a peak and then decreases, forming a triangular or roof-like shape. Roof edges are common in features such as ridges, creases, or curved surfaces and are characterized by their distinct yet smooth transitions, making them crucial for identifying detailed structures or subtle boundaries in images.

Figure 2.3: Types of Edges

2.1.3 Second Derivative (Laplacian)

The second derivative in image processing measures the rate at which the first derivative changes, providing insight into the curvature of the intensity profile. It determines whether the first derivative is increasing or decreasing, effectively capturing changes in the shape of the intensity curve. This makes the second derivative particularly useful for identifying where the curve changes direction, such as at edges or inflection points. A key concept in second-derivative edge detection is the zero-crossing, which occurs when the second derivative transitions from positive to negative or vice versa. These zero-crossings often correspond to points where the first derivative reaches a peak, marking the location of edges. This property is leveraged in methods like the Laplacian of Gaussian (LoG) filter, where zero-crossings help identify fine edge details and improve edge localization in an image.

Figure 2.4: First and Second Derivatives in 1D Edge Detection

2.1.4 Simplified Edge Detection

Normally the edge detection process is the following:

Figure 2.5: Edge Detection Steps

The process can be simplified by combining smoothing and the derivative calculation into a single step. Instead of first convolving the image with a Gaussian function g for smoothing and then calculating the derivative of the smoothed image, the derivative of the Gaussian filter g can be computed beforehand. This yields a derivative filter that combines both operations. By convolving this derivative filter directly with the image, smoothing and edge detection are performed simultaneously. This approach not only reduces computational complexity but also ensures that the detected edges are based on the smoothed intensity transitions, effectively minimizing noise while identifying significant changes. This simplification is widely used in edge detection algorithms.

d/dx (g ⊗ f) = (d/dx g) ⊗ f

g ⊗ f: convolution of the Gaussian filter with the signal.

d/dx g: derivative of the Gaussian filter.

Figure 2.6: Simplified Edge Detection Steps

The operation simplifies by swapping the order of convolution and differentiation. This swap is possible because both convolution and differentiation are linear operations, so their order of application can be exchanged without altering the final result. Instead of smoothing the image with a Gaussian filter and then calculating its derivative, the derivative of the Gaussian can be computed beforehand and applied directly to the image as a single convolution operation, as the sketch below illustrates.
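A minimal sketch of this identity, assuming SciPy is available: `gaussian_filter1d` with `order=1` convolves the signal with the derivative of a Gaussian in one pass. The σ value and array names are illustrative.

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d

# A noisy 1D step edge: intensity jumps from 40 to 120 around position 100.
rng = np.random.default_rng(0)
x = np.concatenate([np.full(100, 40.0), np.full(100, 120.0)])
x += rng.normal(0, 2, size=x.shape)

sigma = 3.0

# Two-step version: smooth with a Gaussian, then differentiate.
smoothed = gaussian_filter1d(x, sigma=sigma)
two_step = np.gradient(smoothed)

# One-step version: convolve directly with the derivative of the Gaussian
# (order=1 requests the first-derivative-of-Gaussian kernel).
one_step = gaussian_filter1d(x, sigma=sigma, order=1)

# Both responses peak at essentially the same location (the edge).
print(np.argmax(np.abs(two_step)), np.argmax(np.abs(one_step)))
```

The same idea extends to 2D, where derivative-of-Gaussian filters are the basis of detectors such as the Canny edge detector mentioned in the next section.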
This approach significantly reduces computational complexity by eliminating the need for intermediate calculations, such as separately convolving the image with the Gaussian filter and then taking the derivative. By combining these steps into one, the process becomes more efficient while still preserving accuracy in edge detection.

2.1.5 Hysteresis

Hysteresis is a technique used in edge detection to ensure that detected edges are continuous and not fragmented by minor intensity fluctuations. It employs two thresholds to make the edge detection process more robust.

High threshold: used to start detecting an edge; the gradient (or derivative) must exceed this value to initialize an edge segment.

Low threshold: used to continue an edge; connected points with gradient values below the high threshold but above the low threshold are included as part of the edge.

This dual-threshold mechanism ensures that strong edges are reliably detected while weaker but connected edge segments are preserved. The benefits of hysteresis include robust edge detection, which maintains edge continuity despite small fluctuations in intensity, and the avoidance of fragmentation, ensuring edges are smooth rather than broken into disconnected segments. This makes hysteresis a key component of advanced edge detection techniques, such as the Canny edge detector.

Figure 2.7: Hysteresis

2.2 Color Recognition

A useful feature for object recognition can be the color of the object.

2.2.1 Advantages of Color Recognition

Color is consistent under geometric transformations: when an object undergoes translation, rotation, or scaling, its color remains unchanged.

Color is a local feature: color is defined at each pixel, making it a highly localized feature. This makes it robust to partial occlusion, meaning that even if part of the object is hidden, the color of the visible part still aids recognition.

The direct color usage approach uses the exact colors of objects for identification or recognition. Instead of relying on a single dominant color of the object, we can use statistics of the object's colors, computing histograms that capture the distribution of colors within the object. This adds robustness to noise and other variations in appearance.

2.2.2 Color Histograms

Color histograms are a representation of the distribution of colors in an image. For each pixel, the values for Red, Green, and Blue (RGB) are given, and histograms are computed for these color channels. For each channel (Red, Green, Blue), a histogram counts how many pixels in the image have a particular intensity of that color. Luminance histograms represent the brightness of the image: they measure how many pixels have specific brightness levels, independent of color. These histograms are used as features to describe the color distribution of an object in an image, and they can be compared against histograms of other images to identify similarities or match objects.

2.2.3 Joint 3D Color Histograms

Instead of separate 1D histograms for Red, Green, and Blue, a 3D histogram considers the RGB values together as a vector. This allows a more precise representation of the color combinations present in the image. Each bin in this 3D space represents a combination of Red, Green, and Blue, and the count in each bin records how many pixels have that specific color combination. This representation makes it easier to compute the similarity of two images; a small sketch of building such a histogram follows.
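A minimal sketch of a joint 3D RGB histogram using NumPy's `histogramdd`; the choice of 8 bins per channel and the random test image are illustrative assumptions, not values given in the text.

```python
import numpy as np

def joint_rgb_histogram(image, bins=8):
    """Joint 3D color histogram of an (H, W, 3) uint8 RGB image."""
    pixels = image.reshape(-1, 3).astype(float)
    hist, _ = np.histogramdd(pixels,
                             bins=(bins, bins, bins),
                             range=((0, 256), (0, 256), (0, 256)))
    return hist / hist.sum()   # normalize so the bins sum to 1

# Example with a random 64x64 RGB image.
rng = np.random.default_rng(0)
img = rng.integers(0, 256, size=(64, 64, 3), dtype=np.uint8)
h = joint_rgb_histogram(img)
print(h.shape)   # (8, 8, 8) bins spanning the RGB cube
print(h.sum())   # 1.0 after normalization
```

Flattening `h` into a vector makes it directly usable with the comparison measures introduced in Section 2.3.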
Comparing two 3D histograms can show whether two objects have a similar color composition. Like the 1D case, this is a robust representation: it works even if the objects in the image are rotated, partially occluded, or viewed under different lighting conditions.

Figure 2.8: Histogram of color distribution

2.2.4 Color Normalization by Intensity

When dealing with color images, each pixel's color is typically represented by its Red, Green, and Blue (RGB) components. However, the intensity of a color can vary due to changes in lighting or shading: even if the colors are the same, varying intensity can make them appear different. We can handle this with normalization.

Intensity of a Pixel

The total brightness of each pixel is defined as:

I = R + G + B

The chromatic representation normalizes the color of each pixel by dividing each color component (R, G, B) by the intensity I. This transformation removes the effect of varying brightness or illumination, making the color representation consistent across different lighting conditions.

Using Normalization

After normalization the components sum to one, so knowing two of them, for example R and G, determines the third:

B = 1 − R − G

We can therefore fully describe the color using just two values. The cube represents the range of possible RGB values, with axes corresponding to each color channel. The constraint R + G + B = 1 implies that the normalized colors lie on a plane within this cube, forming a 2D space of normalized color. In image recognition, it is important to have a color representation that is invariant to lighting changes. The chromatic representation allows systems to focus on the actual color of the object rather than being influenced by lighting conditions.

Figure 2.9: Histogram of 3D color distribution

2.2.5 Recognition Using Histograms

This is a method for identifying objects based on their color distributions.

1. Histogram comparison step: a histogram representing the color distribution of a "test image" is compared to histograms from a database of "known objects". The object whose histogram most closely resembles that of the test image is identified as the best match.

2. Multiple views per object: since an object can appear in different orientations and lighting conditions, the database stores multiple views of each object, each with its own histogram. This increases the accuracy of object recognition, as the system can compare the test image's histogram with histograms taken from different angles or views of the same object.

3. Histogram-based retrieval: in the example below, a "query" object is given (e.g., a yellow cat figurine), and its color histogram is used to retrieve similar objects from the database. The system retrieves objects whose color histograms closely match that of the query, such as yellow cars or other yellow objects (e.g., a nuclear waste barrel). This process highlights the use of histograms for identifying objects based on color similarities, even when the objects belong to different categories but share similar color profiles.

Figure 2.10: Normalized Colors

2.3 Histogram Comparison Technique

The histogram comparison technique is a method used to measure the similarity or dissimilarity between two histograms.
A histogram represents the distribution of certain features (such as pixel intensity, color, or texture) within an image or dataset. In image processing and computer vision, histograms are often used to represent the distribution of colors, brightness, or other properties of an image.

2.3.1 Histogram Comparison: Intersection Method

Histogram comparison is a fundamental method in image analysis and computer vision for determining how similar two histograms are. One commonly used metric is histogram intersection, which measures the common parts of two histograms. This method is particularly useful in tasks such as object recognition, color analysis, and content-based retrieval of images from databases.

Histogram Intersection Formula

The histogram intersection method calculates the similarity between two histograms Q and V by summing the minimum values of each pair of corresponding bins:

∩(Q, V) = Σ_i min(q_i, v_i)

Where:

Q = [q_1, q_2, ..., q_n] represents the histogram of the first image.

V = [v_1, v_2, ..., v_n] represents the histogram of the second image.

q_i and v_i are the values of the i-th bin in histograms Q and V, respectively.

min(q_i, v_i) returns the smaller of the two corresponding bin values.

The sum runs over all n bins.

Figure 2.11: RGB Cube

Motivation

Histogram intersection has the following motivations and properties:

Measures the common parts: this method directly measures the overlap between the histograms by focusing on their common parts. The more similar two histograms are, the greater their intersection.

Range: for normalized histograms, the result of the intersection lies in the range [0, 1].

∗ A value of 1 means the histograms are perfectly similar.

∗ A value of 0 indicates no similarity between the histograms.

Unnormalized histograms: for unnormalized histograms, the intersection is scaled by the sum of the histogram values. The following formula is used to normalize the intersection for unnormalized histograms: P P