Summary

This document introduces object detection with deep neural networks and contrasts it with image classification. It walks through the components of a general detection framework (input preprocessing, CNN-based feature extraction, optional region proposals, localization and classification, and non-maximum suppression), the evaluation metrics FPS, IoU, and mean average precision (mAP), and then covers the R-CNN family (R-CNN, Fast R-CNN, Faster R-CNN), highlighting the move toward end-to-end learning and faster processing, before closing with the single-stage SSD and YOLO detectors.

Full Transcript


Chapter 6: Object Detection with R-CNN, SSD, and YOLO

Introduction

In the previous chapters, we discussed how deep neural networks can be used for image classification tasks. In image classification, the primary assumption is that there is a single main target object in the image, and the model's focus is to identify its category. However, in many real-world scenarios, we encounter images with multiple objects of interest. In these cases, we not only want to classify the objects but also determine their specific locations within the image. This task is referred to as object detection in computer vision.

Figure 6.1 illustrates the distinction between image classification and object detection. While image classification involves identifying the category of a single target object, object detection requires both object localization and classification. Localization involves determining the position of an object by drawing a bounding box around it, while classification predicts the correct category for each detected object.

Object detection is inherently more challenging than image classification because it requires the system to handle multiple tasks simultaneously: detecting and accurately localizing one or more objects within an image and classifying each of them (see Table 6.1). The model must not only predict the class of each object but also provide the coordinates of the bounding box that fits each detected object. This dual responsibility makes object detection a complex but vital task in modern computer vision applications.

Figure 6.1: Comparison of image classification and object detection tasks.

Table 6.1: Comparison of Image Classification and Object Detection

Aspect             | Image Classification                                            | Object Detection
Task               | Predicts the class of the main object in the image.             | Predicts both the classes and locations of multiple objects in the image.
Output             | A single label for the entire image (e.g., "Cat").               | Multiple labels with corresponding bounding box coordinates (e.g., "Cat: [x1, y1, x2, y2]").
Input Assumption   | Assumes one primary object per image.                            | Handles multiple objects of varying sizes and classes within the same image.
Localization       | Not required; focuses only on classification.                    | Required; predicts bounding boxes for each object.
Complexity         | Relatively simple; deals with one target.                        | More complex; combines localization and classification.
Applications       | Identifying the overall content of an image (e.g., "dog photo"). | Real-time detection tasks, such as autonomous driving, security, and robotics.
Example Models     | ResNet, VGG, AlexNet                                             | YOLO, Faster R-CNN, SSD
Evaluation Metrics | Accuracy, Precision, Recall                                      | Mean Average Precision (mAP), Intersection over Union (IoU)

General Object Detection Framework

A general object detection framework consists of several key components, as outlined below.

1. Input Image Processing
   Input Image: The system receives an image of any size or resolution.
   Preprocessing:
   – Resize the image to a fixed size for compatibility with the model.
   – Normalize pixel values for numerical stability.
   – Optionally perform data augmentation (e.g., flipping, rotation) to improve model generalization.

2. Feature Extraction
   Backbone Network: A convolutional neural network (CNN) extracts features from the input image. Popular backbone networks include ResNet, VGG, or MobileNet. The backbone transforms the input image into feature maps that encode high-level spatial and semantic information (see the sketch after this list).

3. Region Proposal (Optional)
   Region Proposal Network (RPN):
   – Identifies regions in the image that are likely to contain objects.
   – Generates anchor boxes of different sizes and aspect ratios.
   – Filters out irrelevant regions using heuristics like non-maximum suppression (NMS).
   This step is used in two-stage detectors (e.g., Faster R-CNN).

4. Object Localization and Classification
   Bounding Box Regression: Predicts the coordinates of bounding boxes for detected objects. Bounding boxes can be represented as (x_min, y_min, x_max, y_max) or (center_x, center_y, width, height).
   Object Classification: Assigns a class label to each detected object (e.g., "Cat", "Car").
   Single-stage vs. Two-stage Frameworks:
   – Single-stage detectors (e.g., YOLO, SSD): Perform localization and classification in one step.
   – Two-stage detectors (e.g., Faster R-CNN): Use a separate stage for region proposals before classification.

5. Postprocessing
   Non-Maximum Suppression (NMS): Eliminates overlapping bounding boxes by retaining the box with the highest confidence score.
   Confidence Thresholding: Filters out predictions with low confidence scores.

6. Output
   For each detected object, the framework outputs:
   Class Label: The predicted category (e.g., "Dog").
   Bounding Box Coordinates: The location of the object in the image.
   Confidence Score: The probability that the prediction is correct.
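To make the feature-extraction step concrete, here is a minimal sketch (assuming PyTorch and a recent torchvision are available) that truncates a pretrained ResNet-50 before its classification head so it outputs a spatial feature map rather than class scores; the input tensor is a dummy stand-in for a preprocessed image.

import torch
import torchvision

backbone = torchvision.models.resnet50(weights="IMAGENET1K_V1")
# Keep only the convolutional trunk; drop the average-pool and fully connected layers.
feature_extractor = torch.nn.Sequential(*list(backbone.children())[:-2])
feature_extractor.eval()

image = torch.randn(1, 3, 224, 224)          # stand-in for a resized, normalized input image
with torch.no_grad():
    feature_map = feature_extractor(image)
print(feature_map.shape)                      # torch.Size([1, 2048, 7, 7]) for a 224 x 224 input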
Popular Object Detection Architectures

R-CNN Family: R-CNN, Fast R-CNN, Faster R-CNN (two-stage detectors with region proposals).
YOLO (You Only Look Once): Single-stage detector with real-time performance.
SSD (Single Shot MultiBox Detector): Predicts bounding boxes and labels at multiple scales directly from feature maps.
Transformers: DETR (DEtection TRansformer) uses attention mechanisms for detection.

Region Proposals in Object Detection

In the region proposals step, the system analyzes the input image and identifies regions of interest (RoIs) that are likely to contain objects. The process is described as follows.

1. Generating Regions of Interest (RoIs)
   RoIs are regions in the image that the system predicts have a high likelihood of containing an object.
   Each RoI is assigned an objectness score, which indicates the probability that the region contains an object.
   Regions with high objectness scores are forwarded to the next steps for further processing, while regions with low scores are discarded.

Figure 6.2: Regions of interest (RoIs) proposed by the system. Regions with a high objectness score represent areas with a high likelihood of containing objects (foreground); the ones with a low objectness score are ignored because they have a low likelihood of containing objects (background).

2. Approaches for Generating Region Proposals
   Selective Search Algorithm:
   – Originally used to generate object proposals in traditional object detection systems.
   – Combines texture, color, and edge information to identify regions (see the sketch after this list).
   Deep Learning-based Approaches:
   – Use complex visual features extracted from the image by a deep neural network.
   – Examples include Region Proposal Networks (RPNs) used in Faster R-CNN.

3. Trade-offs in Region Proposal Generation
   Number of Regions vs. Computational Complexity:
   – Generating more regions increases the likelihood of detecting all objects but significantly increases computational cost.
   – The goal is to use problem-specific information to reduce the number of RoIs without compromising detection accuracy.
   Threshold for Objectness Score:
   – A configurable threshold determines whether a region is considered foreground (object) or background (no object).
   – Low Threshold: Increases the number of RoIs, improving object detection coverage but at a higher computational cost.
   – High Threshold: Reduces the number of RoIs, improving efficiency but with a risk of missing objects.

4. Outcome of the Region Proposal Step
   The system generates thousands of bounding boxes for further analysis and classification. Each bounding box is classified as either:
   – Foreground: If the objectness score is above the threshold.
   – Background: If the objectness score is below the threshold.
   Foreground regions are forwarded to subsequent steps in the network for further processing.
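As a concrete illustration of the classical approach, the following sketch runs OpenCV's Selective Search implementation on an image; it assumes the opencv-contrib-python package is installed (the ximgproc module is not in base OpenCV) and uses a hypothetical file name.

import cv2

image = cv2.imread("street_scene.jpg")        # hypothetical input image
ss = cv2.ximgproc.segmentation.createSelectiveSearchSegmentation()
ss.setBaseImage(image)
ss.switchToSelectiveSearchFast()              # faster, coarser mode; a slower "quality" mode also exists
proposals = ss.process()                      # array of candidate boxes as (x, y, w, h)
print(f"{len(proposals)} region proposals generated")   # typically a few thousand per image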
Network Predictions in Object Detection

In object detection, the network predictions phase involves the following steps.

Feature Extraction

A pretrained Convolutional Neural Network (CNN), such as ResNet, VGG, or EfficientNet, is commonly used to extract visual features from input images. These pretrained models, often trained on datasets like ImageNet or MS COCO, generalize well to various tasks. The extracted features are then analyzed to detect regions of interest (RoIs) likely to contain objects.

Region Analysis and Predictions

For each identified region, the network makes two key predictions:

1. Bounding Box Prediction: The coordinates of the box surrounding the object are represented as (x, y, w, h), where:
   x, y: Coordinates of the bounding box center.
   w, h: Width and height of the bounding box.

2. Class Prediction: Using the softmax function, the network predicts the class probability for each object and assigns the object to the most likely class.

Since the network evaluates thousands of proposed regions, an object may be detected multiple times, each with slightly different bounding box coordinates and classification probabilities. For example, detecting a dog might result in several bounding boxes around the same object.

Figure 6.3: The bounding-box detector produces more than one bounding box for an object. We want to consolidate these boxes into the one bounding box that fits the object best.
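Detectors routinely move between the two box parameterizations mentioned so far: the (center_x, center_y, width, height) form used here and the (x_min, y_min, x_max, y_max) corner form used earlier. A small NumPy sketch of the conversion (the values are toy examples):

import numpy as np

def center_to_corners(boxes):
    """(center_x, center_y, width, height) -> (x_min, y_min, x_max, y_max)."""
    boxes = np.asarray(boxes, dtype=float)
    half = boxes[:, 2:] / 2.0
    return np.concatenate([boxes[:, :2] - half, boxes[:, :2] + half], axis=1)

def corners_to_center(boxes):
    """(x_min, y_min, x_max, y_max) -> (center_x, center_y, width, height)."""
    boxes = np.asarray(boxes, dtype=float)
    wh = boxes[:, 2:] - boxes[:, :2]
    return np.concatenate([boxes[:, :2] + wh / 2.0, wh], axis=1)

print(center_to_corners([[50, 50, 20, 10]]))   # [[40. 45. 60. 55.]]
print(corners_to_center([[40, 45, 60, 55]]))   # [[50. 50. 20. 10.]]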
Reducing Redundancy with Non-Maximum Suppression (NMS)

To address the issue of multiple detections for the same object, Non-Maximum Suppression (NMS) is applied. The process is as follows:

1. Input: A set of bounding boxes, each with:
   Predicted coordinates (x, y, w, h).
   A class label and confidence score indicating how likely the box is to contain the object.

2. Steps:
   (1) Sort the Boxes: Rank all bounding boxes for a specific class by their confidence scores in descending order.
   (2) Select the Top Box: Start with the box having the highest confidence score and mark it as a final prediction.
   (3) Remove Overlapping Boxes: Compute the Intersection over Union (IoU) between the selected box and each remaining box. Discard boxes with IoU greater than a specified threshold (e.g., 0.5), as they are considered redundant.
   (4) Repeat: Move to the next highest-confidence box that has not been discarded and repeat the process.

3. Output: A reduced set of bounding boxes, each representing one detected object.

Non-Maximum Suppression (NMS)

One of the key challenges in object detection algorithms is dealing with multiple detections for the same object. Instead of creating a single bounding box for an object, the algorithm may produce multiple overlapping boxes. Non-Maximum Suppression (NMS) is a technique used to ensure that each object is detected only once by selecting the best bounding box and suppressing the rest.

As the name implies, NMS examines all the bounding boxes around an object, finds the box with the maximum prediction probability, and eliminates or suppresses the others. This process reduces the number of candidate boxes to a single bounding box per object.

Steps of the NMS Algorithm

1. Discard Low-Confidence Predictions:
   Eliminate all bounding boxes with prediction probabilities below a certain threshold, called the confidence threshold.
   The confidence threshold is a tunable hyperparameter that determines the minimum probability required for a box to be considered.

2. Select the Box with the Highest Confidence:
   From the remaining boxes, identify the bounding box with the highest prediction probability.

3. Calculate the Overlap (IoU):
   For all remaining boxes that predict the same class, calculate the overlap with the selected box. This overlap is measured using the Intersection over Union (IoU), which is defined as:

   IoU = Area of Overlap / Area of Union

4. Suppress Overlapping Boxes:
   Discard any bounding boxes that have an IoU value greater than a certain threshold, known as the NMS threshold.
   The NMS threshold is typically set to 0.5 but can be tuned to produce fewer or more bounding boxes.

5. Repeat the Process:
   Repeat steps 2–4 for the remaining bounding boxes until all boxes have been processed.

Figure 6.4: Multiple regions are proposed for the same object. After NMS, only the box that fits the object best remains; the rest are ignored, as they have large overlaps with the selected box.

Key Hyperparameters in NMS

Confidence Threshold: Determines the minimum probability required for a bounding box to be considered valid.
NMS Threshold: Controls the degree of overlap allowed between bounding boxes. A lower threshold suppresses more boxes, while a higher threshold retains more boxes.

Example Scenario

For instance, if an object is fairly large and over 2,000 object proposals are generated, many of these proposals will have significant overlap with the object. NMS ensures that only the most accurate bounding box is retained while suppressing the others.

Importance of NMS

NMS is a standard technique across most object detection frameworks and is essential for:

Reducing Redundancy: Ensures only one bounding box per object.
Improving Accuracy: Helps focus on the most confident predictions.
Optimizing Detection: Balances between suppressing redundant boxes and retaining accurate predictions.

Tuning hyperparameters like the confidence threshold and NMS threshold is crucial for adapting the algorithm to specific scenarios. The combination of network predictions and non-maximum suppression forms the foundation of many object detection frameworks, such as Faster R-CNN, YOLO, and SSD, enabling them to detect objects efficiently and accurately.
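The steps above translate directly into a short per-class routine. A minimal NumPy sketch (the confidence and NMS thresholds are illustrative defaults, and boxes are taken in corner format):

import numpy as np

def nms(boxes, scores, conf_thresh=0.5, nms_thresh=0.5):
    """boxes: (N, 4) as (x_min, y_min, x_max, y_max); scores: (N,) confidences for one class."""
    keep = scores >= conf_thresh                      # step 1: discard low-confidence predictions
    boxes, scores = boxes[keep], scores[keep]
    order = np.argsort(scores)[::-1]                  # sort remaining boxes by confidence
    selected = []
    while order.size > 0:
        best, rest = order[0], order[1:]              # step 2: take the highest-confidence box
        selected.append(best)
        # step 3: IoU between the selected box and every remaining box
        x1 = np.maximum(boxes[best, 0], boxes[rest, 0])
        y1 = np.maximum(boxes[best, 1], boxes[rest, 1])
        x2 = np.minimum(boxes[best, 2], boxes[rest, 2])
        y2 = np.minimum(boxes[best, 3], boxes[rest, 3])
        overlap = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        area_best = (boxes[best, 2] - boxes[best, 0]) * (boxes[best, 3] - boxes[best, 1])
        area_rest = (boxes[rest, 2] - boxes[rest, 0]) * (boxes[rest, 3] - boxes[rest, 1])
        iou = overlap / (area_best + area_rest - overlap)
        order = rest[iou <= nms_thresh]               # steps 4-5: suppress overlaps, then repeat
    return boxes[selected], scores[selected]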
Object-Detector Evaluation Metrics

When evaluating the performance of an object detector, two main metrics are commonly used: frames per second (FPS) to measure detection speed and mean average precision (mAP) to measure detection accuracy.

Frames Per Second (FPS) for Detection Speed

The most common metric for evaluating the detection speed of an object detector is frames per second (FPS), which measures how many frames the network can process in one second. For example:

Faster R-CNN operates at approximately 7 FPS.
SSD operates at approximately 59 FPS.

In benchmarking experiments, results are often reported as "Network X achieves mAP of Y% at Z FPS," where:

X: The name of the network.
Y: The mean average precision (mAP) percentage.
Z: The FPS value.

Mean Average Precision (mAP) for Network Precision

The most common metric for evaluating the accuracy of an object detection model is the mean average precision (mAP). It is expressed as a percentage from 0 to 100, where higher values indicate better performance. Unlike the accuracy metric used in classification tasks, mAP considers both localization and classification.

To understand how mAP is calculated, we must first explain two foundational concepts:

1. Intersection over Union (IoU)
2. Precision-Recall (PR) Curve

Intersection over Union (IoU)

The Intersection over Union (IoU) is a measure of overlap between two bounding boxes:

The ground truth bounding box (B_ground truth).
The predicted bounding box (B_predicted).

Figure 6.5: The IoU score is the overlap between the ground truth bounding box and the predicted bounding box.

The IoU is calculated as:

   IoU = Area of Overlap / Area of Union

where:

Area of Overlap: The shared area between B_ground truth and B_predicted.
Area of Union: The total area covered by both bounding boxes.

Figure 6.6: IoU scores range from 0 (no overlap) to 1 (100% overlap). The higher the overlap (IoU) between the two bounding boxes, the better.

Using IoU, we can determine:

True Positive (TP): When the IoU between the predicted box and the ground truth box exceeds a certain threshold (e.g., 0.5).
False Positive (FP): When the IoU is below the threshold, indicating an incorrect prediction.

Precision-Recall (PR) Curve

The Precision-Recall (PR) Curve is used to evaluate the performance of an object detector at various confidence thresholds. It is built from:

Precision: The proportion of correct detections among all detections.

   Precision = True Positives / (True Positives + False Positives)

Recall: The proportion of correct detections among all ground truth objects.

   Recall = True Positives / (True Positives + False Negatives)

A PR curve plots precision against recall at various thresholds. The area under this curve (AUC) provides the average precision (AP) for a single class.

Mean Average Precision (mAP)

The mean average precision (mAP) is the mean of the average precision (AP) values across all classes. It provides an overall measure of the model's ability to detect and classify objects accurately. The formula for mAP is:

   mAP = (1/N) * sum_{i=1}^{N} AP_i

where:

N: Total number of object classes.
AP_i: Average precision for the i-th class.

Summary of Metrics

FPS: Measures detection speed, critical for real-time applications.
mAP: Measures detection accuracy, accounting for both classification and localization.
IoU: Determines whether a detection is valid based on bounding box overlap.
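Both definitions are small enough to compute by hand. The sketch below implements IoU for two corner-format boxes and mAP as the mean of per-class AP values; the AP numbers used here are made-up placeholders, not benchmark results.

def iou(box_a, box_b):
    """Boxes given as (x_min, y_min, x_max, y_max)."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    overlap = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return overlap / (area_a + area_b - overlap)

def mean_average_precision(per_class_ap):
    """mAP = (1/N) * sum of AP_i over the N classes."""
    return sum(per_class_ap) / len(per_class_ap)

print(round(iou((0, 0, 10, 10), (5, 5, 15, 15)), 3))    # 0.143, below a 0.5 TP threshold
print(mean_average_precision([0.80, 0.65, 0.71]))        # 0.72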
The R-CNN Family of Object Detection Techniques

The R-CNN family of object detection techniques, usually referred to as R-CNNs (short for Region-based Convolutional Neural Networks), was developed by Ross Girshick et al. in 2014. The R-CNN family later expanded to include Fast R-CNN and Faster R-CNN in 2015 and 2016, respectively. In this section, I'll quickly walk you through the evolution of the R-CNN family:

From the original R-CNN (2014),
To Fast R-CNN (2015),
And then to Faster R-CNN (2016).

R-CNN: The Foundation of Region-Based Architectures

R-CNN is the least sophisticated region-based architecture in its family, but it serves as the basis for understanding how multiple object-recognition algorithms work. It was one of the first large, successful applications of Convolutional Neural Networks (CNNs) to the problem of object detection and localization, paving the way for other advanced detection algorithms.

The approach was demonstrated on benchmark datasets, achieving then-state-of-the-art results on:

The PASCAL VOC-2012 dataset, and
The ILSVRC 2013 object detection challenge.

Figure 6.7 provides a summary of the R-CNN model architecture.

Figure 6.7: Summary of the R-CNN model architecture.

R-CNN Architecture and Training Process

R-CNN (Region-Based Convolutional Neural Networks) is composed of four key modules:

1. Selective Search Region Proposal:
   Role: Generate potential bounding boxes for objects in the image.
   Characteristic: This algorithm is pre-designed and not learned, meaning it cannot be optimized for specific datasets or tasks.
   Drawback: Computationally expensive and time-consuming, as it generates thousands of region proposals per image.

2. Feature Extractor (CNN):
   Role: Extract feature representations for each region proposed by Selective Search.
   Training: Typically involves fine-tuning a pretrained network (e.g., AlexNet or VGG). Training from scratch is rare due to high computational and data requirements.
   Issue: Each region is processed independently, leading to redundant computations and high costs.

3. SVM Classifier:
   Role: Classify the extracted features into object categories or background.
   Training: Binary SVM classifiers are trained separately for each object class using labeled data.
   Drawback: The classifier is trained independently of the feature extractor, resulting in no end-to-end learning.

4. Bounding-Box Regressor:
   Role: Refine bounding box coordinates to better match the objects.
   Output: Produces four real-valued numbers ([x, y, width, height]) per object class.
   Training: This module is trained separately for each class, adding complexity to the pipeline.

Figure 6.8: Illustration of the R-CNN architecture. Each proposed RoI is passed through the CNN to extract features, followed by a bounding-box regressor and an SVM classifier to produce the network output prediction.
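Putting the four modules together, R-CNN inference can be sketched as the loop below. This is hedged pseudocode in Python form: selective_search, warp, cnn_features, svm_classifiers, bbox_regressors, and non_maximum_suppression are hypothetical stand-ins for the modules just described, not a real library API. The loop makes plain why running the CNN on every proposal is so costly.

def rcnn_inference(image):
    detections = []
    proposals = selective_search(image)              # module 1: ~2,000 class-agnostic proposals
    for box in proposals:
        crop = warp(image, box, size=(227, 227))     # each proposal is cropped and warped separately
        features = cnn_features(crop)                # module 2: one CNN forward pass per proposal
        for cls, svm in svm_classifiers.items():     # module 3: a binary SVM per object class
            score = svm.decision_function(features)
            if score > 0:
                refined_box = bbox_regressors[cls].predict(features, box)   # module 4
                detections.append((cls, refined_box, score))
    return non_maximum_suppression(detections)       # consolidate overlapping detections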
Challenges in Training R-CNN

The R-CNN training process faces several limitations:

1. Multistage Pipeline: Each module (feature extractor, SVM, and regressor) is trained independently, leading to inefficiencies due to the lack of shared computation.

2. Expensive Training Process: Thousands of region proposals are generated per image, and each region is passed through a CNN, making the process computationally intensive.

3. Storage Issues: Features for all proposals are often precomputed and stored on disk, requiring significant disk space.

4. Inference Time: Each image must go through region proposal, feature extraction, classification, and regression for every region, making predictions slow.

6.1 Fast R-CNN

Fast R-CNN is an improvement over the original R-CNN (Regions with Convolutional Neural Networks) for object detection tasks. Developed by Ross Girshick in 2015, it introduces several key changes that lead to faster and more efficient object detection while also improving classification accuracy.

6.2 Key Improvements Over R-CNN

Single CNN for Feature Extraction: Unlike R-CNN, which applied the CNN feature extractor to each of the proposed regions (around 2,000 regions per image), Fast R-CNN applies the CNN to the entire input image first. It then proposes regions using the feature map, reducing the need to run multiple ConvNets for each region. This is much more efficient, since it runs just one ConvNet over the entire image as opposed to 2,000 ConvNets over 2,000 overlapping regions.

Softmax Layer Instead of SVM: In R-CNN, a Support Vector Machine (SVM) classifier was used to classify the objects in each region. Fast R-CNN replaces this SVM with a softmax layer. This allows the CNN to directly output the class probabilities for each region, simplifying the model by combining feature extraction and classification into a single neural network. The use of a softmax layer also improves the accuracy of object classification.

6.3 Fast R-CNN Architecture

As shown in Figure 6.9, Fast R-CNN consists of several key components:

1. Feature Extractor Module: The network begins by applying a ConvNet (Convolutional Neural Network) to extract features from the entire image. This produces a feature map, which will be used to generate region proposals.

2. Region Proposal Generation (RoI): The Selective Search algorithm is used to propose about 2,000 region candidates per image. These are the potential regions where objects may be located.

3. RoI Pooling Layer: The RoI pooling layer is a key component of Fast R-CNN. It takes each of the proposed regions and converts it into a fixed-size feature map, regardless of its original size. This is done by applying max pooling to the feature map produced by the ConvNet. The RoI pooling layer ensures that each region proposal can be fed into the fully connected layers of the network, enabling the model to handle regions of different sizes and aspect ratios efficiently.

4. Two-Head Output Layer: The output of Fast R-CNN consists of two parts:
   Softmax Classifier: This layer outputs a probability distribution for each region proposal, indicating the likelihood of each object class being present in the region.
   Bounding Box Regressor: This layer predicts the offsets (location adjustments) for each region proposal relative to the original bounding box.

Figure 6.9: The Fast R-CNN architecture consists of a feature extractor ConvNet, RoI extractor, RoI pooling layers, fully connected layers, and a two-head output layer. Note that, unlike R-CNN, Fast R-CNN applies the feature extractor to the entire input image before applying the region proposal module.
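The RoI pooling idea can be tried directly with torchvision's built-in op. The sketch below (assuming PyTorch/torchvision, with a random feature map and one toy RoI) shows how an arbitrarily sized region is pooled to a fixed 7 x 7 grid.

import torch
from torchvision.ops import roi_pool

feature_map = torch.randn(1, 256, 50, 50)        # (batch, channels, H, W) from the ConvNet
# One region of interest given as (batch_index, x_min, y_min, x_max, y_max) in image coordinates.
rois = torch.tensor([[0.0, 64.0, 32.0, 448.0, 256.0]])
# spatial_scale maps image coordinates onto the feature map (e.g., 1/16 for a VGG16-style stride).
pooled = roi_pool(feature_map, rois, output_size=(7, 7), spatial_scale=1.0 / 16)
print(pooled.shape)                               # torch.Size([1, 256, 7, 7]) regardless of the RoI size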
6.4 Loss Function

Since Fast R-CNN is designed to learn both the classification (which object is present) and the localization (where the object is located), it uses a multi-task loss function. The loss function is composed of two parts:

Classification Loss (L_cls): The loss associated with the object classification task. It is minimized during training to improve the accuracy of object class predictions.
Localization Loss (L_loc): The loss associated with predicting the bounding box. It measures how well the predicted bounding box matches the ground truth bounding box, and the model is trained to minimize this loss.

The model is trained end to end using these two losses, and during optimization the goal is to minimize both simultaneously. The total loss function used in Fast R-CNN is a multi-task loss that combines the classification loss and the localization loss. It is given by:

   L(p, u, t, v) = L_cls(p, u) + λ [u ≥ 1] L_loc(t, v)

where:

L_cls(p, u): The classification loss, defined as the log loss over the true class u and the predicted probability p:

   L_cls(p, u) = −log p_u

Here, p_u is the predicted probability of the true class u.

L_loc(t, v): The localization loss, defined as the Smooth L1 loss between the predicted bounding box offsets t = (t_x, t_y, t_w, t_h) and the ground truth box v = (v_x, v_y, v_w, v_h):

   L_loc(t, v) = sum over i in {x, y, w, h} of SmoothL1(t_i − v_i)

The Smooth L1 loss is given by:

   SmoothL1(x) = 0.5 x^2       if |x| < 1,
                 |x| − 0.5     otherwise.

[u ≥ 1]: An indicator function that activates the localization loss only for foreground (object-containing) regions.

λ: A hyperparameter that balances the contribution of the two loss components.
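As an illustration, here is a hedged PyTorch sketch of this multi-task loss: cross-entropy implements the log loss L_cls, F.smooth_l1_loss implements L_loc, and the [u ≥ 1] indicator is realized by masking to foreground RoIs. Label 0 is assumed to mean background, and the tensor shapes and λ value are illustrative, not the paper's exact setup.

import torch
import torch.nn.functional as F

def fast_rcnn_loss(class_logits, box_deltas, labels, target_deltas, lam=1.0):
    # L_cls(p, u) = -log p_u, computed as cross-entropy over the class scores.
    cls_loss = F.cross_entropy(class_logits, labels)
    # [u >= 1]: only RoIs labeled as a real object contribute to the localization term.
    foreground = labels >= 1
    if foreground.any():
        loc_loss = F.smooth_l1_loss(box_deltas[foreground], target_deltas[foreground])
    else:
        loc_loss = box_deltas.sum() * 0.0            # keeps the graph valid when a batch has no objects
    return cls_loss + lam * loc_loss

# Toy batch: 4 RoIs, 3 classes (0 = background), 4 box offsets per RoI.
loss = fast_rcnn_loss(torch.randn(4, 3), torch.randn(4, 4),
                      torch.tensor([0, 1, 2, 0]), torch.randn(4, 4))
print(loss.item())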
6.5 Summary of Fast R-CNN Architecture

1. Feature Extraction: A ConvNet processes the entire image to produce a feature map.
2. Region Proposal: Selective Search proposes region candidates from the feature map.
3. RoI Pooling: Regions of interest are pooled into a fixed-size feature map for each region proposal.
4. Classification & Bounding Box Prediction: The two-head output layer classifies the regions and predicts bounding box adjustments.
5. Loss Function: A multi-task loss combines classification and localization losses to optimize the model.

6.6 Advantages of Fast R-CNN

Efficiency: Only one ConvNet is used, which significantly speeds up the process by reducing the number of convolutions.
End-to-End Learning: The entire architecture is learned together, simplifying the training process and eliminating the need for separate SVM classifiers.
Improved Accuracy: By using a softmax classifier and a more integrated approach, Fast R-CNN improves both classification accuracy and bounding box localization.

Faster R-CNN

Faster R-CNN is a deep learning-based object detection framework introduced in 2016 by Shaoqing Ren et al. It improves upon its predecessors (R-CNN and Fast R-CNN) by incorporating a learnable Region Proposal Network (RPN) for generating region proposals, making it faster and more accurate. Faster R-CNN is end-to-end trainable, enabling a unified object detection pipeline.

Architecture

The architecture of Faster R-CNN can be broken down into the following components:

1. Base Network (Feature Extractor): A pretrained Convolutional Neural Network (CNN), such as VGG or ResNet, is used to extract a convolutional feature map from the input image.

2. Region Proposal Network (RPN): The RPN replaces traditional region proposal methods (e.g., Selective Search) by using a small sliding window over the feature map. The RPN predicts:
   Objectness Score: The probability that an anchor contains an object.
   Bounding Box Regression: Adjustments to anchor box coordinates.

3. RoI Pooling Layer: The proposed regions are projected onto the feature map and pooled into fixed-size regions (e.g., 7 × 7). This ensures a uniform input size for the detection head.

4. Detection Head: This component has two branches:
   Softmax Classifier: Outputs class probabilities for each region.
   Bounding Box Regressor: Refines the coordinates of each region proposal.

Figure 6.10: The Faster R-CNN architecture has two main components: an RPN that identifies regions that may contain objects of interest and their approximate location, and a Fast R-CNN network that classifies objects and refines their location defined using bounding boxes. The two components share the convolutional layers of the pretrained VGG16.

Workflow

1. The input image is passed through the base network to generate a feature map.
2. The RPN processes the feature map to generate region proposals.
3. The RoI Pooling layer extracts fixed-size feature regions for each proposal.
4. The detection head classifies the objects and refines the bounding boxes.
5. The final output is a set of detected objects with their bounding boxes and class labels.

Advantages

Eliminates the need for external region proposal algorithms, speeding up the process.
End-to-end trainable, for a unified detection pipeline.
High accuracy, achieving state-of-the-art results in object detection tasks.

Role of the RPN

The Region Proposal Network (RPN) is a critical component of Faster R-CNN, responsible for generating potential regions that may contain objects of interest. Unlike traditional methods such as Selective Search, the RPN integrates region proposal generation directly into the Faster R-CNN architecture, significantly improving efficiency and speed.

The RPN is an attention network that guides the network's focus toward potentially interesting regions in the image. It outputs region proposals, which are used to identify objects during the object detection process. Specifically, the RPN provides:

An objectness score, indicating whether a region contains an object (foreground) or is background.
A bounding box prediction, refining the location of the detected object by adjusting the anchor box coordinates.

Architecture of the RPN

The architecture of the RPN consists of the following layers:

3 × 3 Convolutional Layer: The feature map from the backbone network (such as VGG or ResNet) is passed through a 3 × 3 convolutional layer with 512 output channels. This layer learns spatial features for potential object regions.

Two Parallel 1 × 1 Convolutional Layers: After the 3 × 3 convolutional layer, the output is passed through two separate 1 × 1 convolutional layers:
– Classification Layer: Predicts the objectness score for each anchor box, determining whether it contains an object (foreground) or not (background).
– Bounding Box Regression Layer: Predicts the coordinates to refine the location of the anchor box, improving its accuracy in localizing the object.
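The RPN head described above is only a few layers. A minimal PyTorch sketch (the class name and feature-map size are our own choices, with 512 input channels and k = 9 anchors per location as in the text):

import torch
import torch.nn as nn

class RPNHead(nn.Module):
    def __init__(self, in_channels=512, k=9):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, 512, kernel_size=3, padding=1)   # 3x3 sliding window
        self.cls = nn.Conv2d(512, 2 * k, kernel_size=1)    # objectness: 2k scores per location
        self.reg = nn.Conv2d(512, 4 * k, kernel_size=1)    # box refinement: 4k coordinates per location

    def forward(self, feature_map):
        x = torch.relu(self.conv(feature_map))
        return self.cls(x), self.reg(x)

scores, deltas = RPNHead()(torch.randn(1, 512, 38, 50))
print(scores.shape, deltas.shape)   # torch.Size([1, 18, 38, 50]) torch.Size([1, 36, 38, 50])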
Figure 6.11: Convolutional implementation of an RPN architecture, where k is the number of anchors.

The Region Proposal Network (RPN) plays a crucial role in object detection pipelines such as Faster R-CNN. Below is a detailed explanation of its structure and functionality:

1. Sliding Window (3 × 3 Kernel):
   A 3 × 3 sliding window is applied over the input feature map, which is typically the output of a backbone convolutional neural network (e.g., ResNet).
   This operation captures localized spatial context within the feature map for small regions.

2. Two 1 × 1 Convolutional Layers:
   The output of the sliding window is passed to two separate 1 × 1 convolutional layers:
   (1) Classifier: Outputs an objectness score, a binary classification indicating whether the region is foreground (contains an object) or background (does not contain an object).
   (2) Bounding-box regressor: Predicts adjustments to refine anchor box positions into tighter bounding boxes.

3. Purpose of the RPN:
   The primary goal of the RPN is not to classify objects but to determine whether a region is worth further investigation.
   It generates candidate regions (proposals) likely to contain objects, which are passed to downstream components for classification and bounding box refinement.

4. Pipeline Flow:
   Regions identified by the RPN as likely to contain objects are processed further:
   (1) RoI Pooling: Aligns features for regions of interest.
   (2) Final Layers: Fully connected layers perform:
       – Classification: Predict the class of the object.
       – Bounding-box regression: Provide the final refined bounding box coordinates.

The RPN thus serves as a filtering mechanism, efficiently focusing computational resources on promising regions of the image. This modular approach is a key feature of modern object detection architectures.

Anchors in RPN

Anchors are predefined bounding boxes of different sizes and aspect ratios, placed at each location on the feature map. These anchors serve as initial proposals that the RPN refines during the process. The use of multiple anchors allows the RPN to detect objects of various sizes and shapes in the same image. The process is as follows:

Anchor Generation: Anchors are placed at each sliding window location on the feature map (see the sketch after this list).

Figure 6.12: Anchors are at the center of each sliding window. IoU is calculated to select the bounding box that overlaps the most with the ground truth.

Anchor Refinement: The RPN predicts the objectness score and bounding box regression for each anchor.

Selection: The RPN selects the top-K anchors based on objectness scores (e.g., 2,000 proposals) to be passed on to the RoI Pooling layer for further processing.
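A simple way to see how many anchors this produces is to enumerate them explicitly. The sketch below places k = 9 anchors (three scales × three aspect ratios, values chosen for illustration) at every feature-map location, with a stride mapping back to image coordinates.

import numpy as np

def generate_anchors(feat_h, feat_w, stride=16,
                     scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return (feat_h * feat_w * len(scales) * len(ratios), 4) corner-format anchors."""
    anchors = []
    for y in range(feat_h):
        for x in range(feat_w):
            cx, cy = (x + 0.5) * stride, (y + 0.5) * stride    # anchor center in image coordinates
            for s in scales:
                for r in ratios:                                # r is the width:height aspect ratio
                    w, h = s * np.sqrt(r), s / np.sqrt(r)
                    anchors.append([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2])
    return np.array(anchors)

anchors = generate_anchors(38, 50)
print(anchors.shape)    # (17100, 4): 38 * 50 locations * 9 anchors per location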
Workflow of the RPN

The workflow of the RPN can be summarized as follows:

1. Input: The feature map from the backbone CNN is passed into the RPN.
2. Anchor Placement: Anchors of various sizes and aspect ratios are generated at each sliding window location.
3. Convolutional Layers: The feature map undergoes a 3 × 3 convolution followed by two 1 × 1 convolutional layers for classification and bounding box regression.
4. Outputs: The RPN produces objectness scores and bounding box refinements for each anchor.
5. Top-K Selection: The highest-scoring region proposals are selected for further processing by the RoI Pooling layer.

Figure 6.13: The RPN classifier predicts the objectness score, which is the probability of an image containing an object (foreground) or a background.

Training the RPN

The Region Proposal Network (RPN) is trained to classify an anchor box and output:

An objectness score, which indicates whether the anchor box contains an object or not.
The four location parameters that approximate the coordinates of the object.

The training process uses human annotations to label the ground truth bounding boxes in the dataset. The objective is to correctly classify the anchor boxes based on their overlap with the ground truth and to predict the bounding box coordinates.

IoU and Anchor Classification

For each anchor box, the label p is computed based on the Intersection over Union (IoU) between the anchor and the ground-truth bounding box. The overlap is categorized as follows:

   p =  1    if IoU > 0.7
       −1    if IoU < 0.3
        0    otherwise

where:

p = 1: The anchor box has a high overlap (IoU > 0.7) with a ground-truth box and is classified as a positive anchor (likely to contain an object).
p = −1: The anchor box has a low overlap (IoU < 0.3) with a ground-truth box and is classified as a negative anchor (background).
p = 0: The anchor box has an intermediate overlap and is considered neutral, meaning it is ignored during training.

During the training process, anchors with positive and negative labels are used as input to the RPN for two tasks:

Classification Task: Classify each anchor as either containing an object or being background.
Regression Task: Predict the location of the object by refining the anchor box coordinates (bounding box regression).

Anchor Output and RPN Structure

For each sliding window location, the RPN generates multiple anchor boxes. If the number of anchors per sliding window is k, the RPN produces:

2k objectness scores (a foreground/background pair for each anchor box).
4k bounding box coordinates (the regression parameters for each anchor box to better fit the object).

For example, if there are 9 anchors per sliding window (k = 9), the RPN will output 18 objectness scores and 36 location coordinates.

Illustration of Anchor Generation and Output

The RPN generates anchors at each spatial location of the feature map, where each anchor is parameterized by its objectness score and bounding box coordinates. The network then refines these proposals to generate accurate region proposals for object detection.

Figure 6.14: Region proposal network.
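The labelling rule above is easy to express in code. In the sketch below, iou_matrix is assumed to be a precomputed (num_anchors, num_ground_truth_boxes) array of IoU values, and each anchor is labelled from its best overlap.

import numpy as np

def label_anchors(iou_matrix, pos_thresh=0.7, neg_thresh=0.3):
    labels = np.zeros(iou_matrix.shape[0], dtype=int)   # 0 = neutral, ignored during training
    best_iou = iou_matrix.max(axis=1)                   # best overlap of each anchor with any ground truth
    labels[best_iou > pos_thresh] = 1                   # positive (foreground) anchors
    labels[best_iou < neg_thresh] = -1                  # negative (background) anchors
    return labels

iou_matrix = np.array([[0.82, 0.10],
                       [0.45, 0.20],
                       [0.05, 0.12]])
print(label_anchors(iou_matrix))   # [ 1  0 -1]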
Loss Function in Faster R-CNN

The total loss in Faster R-CNN is a combination of two parts:

RPN Loss (Region Proposal Network Loss): This part covers the classification of anchors (whether they contain an object or not) and the bounding box regression that refines the object locations.
Fast R-CNN Loss: This part covers the classification of objects within the proposed regions and the bounding box regression that improves localization.

The total loss L is given by:

   L = L_RPN + L_Fast-RCNN

Loss Function for the RPN

The RPN is trained to classify anchors and predict the location parameters (bounding box) for each anchor. The RPN loss has two components:

Classification Loss (L_cls): The loss for classifying anchors as foreground or background (object or not).
Regression Loss (L_reg): The loss used for refining the bounding box coordinates.

The RPN loss can be written as:

   L_RPN = (1 / N_cls) * sum_i L_cls(p_i, p_i*) + (λ / N_reg) * sum_i p_i* L_reg(t_i, t_i*)

where:

p_i is the predicted objectness score for the i-th anchor.
p_i* is the ground truth label of the i-th anchor (1 for a foreground anchor, 0 for a background anchor), so the regression term is active only for foreground anchors.
t_i is the predicted bounding box coordinates.
t_i* is the ground truth bounding box coordinates.
N_cls is the number of anchors used for classification.
N_reg is the number of anchors used for bounding box regression.
λ is a balancing factor between the classification and regression losses.

Loss Function for Fast R-CNN

The Fast R-CNN component is responsible for classifying objects within the proposed regions and refining the bounding boxes for these regions. The loss function for Fast R-CNN is:

   L_Fast-RCNN = (1 / N_cls) * sum_i L_cls(p_i, p_i*) + (λ / N_reg) * sum_i [p_i* > 0] L_reg(t_i, t_i*)

where:

L_cls(p_i, p_i*) is the classification loss, usually a softmax loss over multiple classes.
L_reg(t_i, t_i*) is the bounding box regression loss, typically calculated using the Smooth L1 loss.
[p_i* > 0] ensures the regression loss is computed only for foreground objects (p_i* > 0 indicates that the region corresponds to a foreground object).
N_cls and N_reg are the number of samples used for classification and regression, respectively.

Smooth L1 Loss for Bounding Box Regression

The bounding box regression loss is computed using the Smooth L1 loss, defined as:

   L_reg(t, t*) = SmoothL1(t − t*) = 0.5 (t − t*)^2     if |t − t*| < 1,
                                     |t − t*| − 0.5     otherwise,

where:

t represents the predicted bounding box coordinates.
t* represents the ground truth bounding box coordinates.

Table 6.2: Comparison of the R-CNN Family

Aspect              | R-CNN                      | Fast R-CNN                   | Faster R-CNN                   | Mask R-CNN                   | Raster CNN
Year Introduced     | 2014                       | 2015                         | 2015                           | 2017                         | Recent
Architecture        | Two-stage: CNN + SVM       | Two-stage: Shared CNN Backbone | Two-stage: RPN + RCNN        | Faster R-CNN + Mask Head     | Faster R-CNN + Raster Alignment
Region Proposal     | Selective Search           | Selective Search             | Region Proposal Network (RPN)  | Region Proposal Network (RPN) | Region Proposal Network (RPN)
Feature Extraction  | Per region: CNN per RoI    | Shared features: RoI Pooling | Shared features: RoI Pooling   | Shared features: RoI Align   | Fine-grained raster alignment
Speed Improvement   | Slow: separate steps       | Faster: end-to-end           | Real-time feasible             | Similar to Faster R-CNN      | Enhanced feature fidelity
Key Loss Components | Cls + Reg (SVM)            | Cls + Reg                    | Cls + Reg                      | Cls + Reg + Mask             | Cls + Reg + Raster
Mask Support        | No                         | No                           | No                             | Yes                          | No
Key Innovation      | Selective Search proposals | RoI Pooling                  | Integrated RPN                 | Pixel-level mask prediction  | Raster feature alignment
Training            | Multi-step                 | End-to-end                   | End-to-end                     | End-to-end                   | End-to-end
Accuracy (mAP)      | ~0.53 (VOC)                | ~0.66 (VOC)                  | ~0.73 (VOC) / ~0.35 (COCO)     | ~0.37 (COCO)                 | –
Test Time per Image | ~47 s                      | ~2.3 s                       | ~0.2 s                         | ~0.25 s (box + mask)         | ~0.25–0.3 s
Use Case            | Generic object detection   | Faster object detection      | Real-time object detection     | Instance segmentation        | Dense detection (complex regions)
6.7 Single Shot MultiBox Detector (SSD)

The Single Shot MultiBox Detector (SSD), introduced by Wei Liu et al. in 2016, offers a fast and efficient approach to object detection. Unlike traditional two-stage methods like R-CNN and its variants, SSD combines object localization and classification into a single network pass. This streamlined architecture achieves high accuracy and real-time performance, making it suitable for a wide range of applications.

Key Achievements

Performance: Scored 74.3% mAP on PASCAL VOC and performed competitively on COCO datasets.
Speed: Operates at 59 FPS for 300 × 300 input resolution, enabling real-time applications.

6.8 Understanding the Name

The name Single Shot MultiBox Detector reflects its core principles:

6.8.1 Single Shot

SSD performs object localization and classification in a single forward pass through the network, avoiding the separate region proposal stage used by multi-stage detectors.

6.8.2 MultiBox

The MultiBox component uses:

Default Boxes (Anchors): Predefined bounding boxes with various aspect ratios and scales.
Bounding Box Regression: Adjusting default boxes to match ground-truth objects.
Class Scores: Predicting class probabilities for each bounding box.

Figure 6.15: (a) Image with ground-truth boxes. (b) 8 × 8 feature map. (c) 4 × 4 feature map.

6.8.3 Detector

SSD acts as a full object detector, identifying objects via bounding boxes and assigning class labels.

6.9 Architecture and Design

6.9.1 Backbone Network

SSD typically uses a pretrained VGG-16 network as the base for feature extraction, removing the fully connected layers.

Figure 6.16: The SSD architecture is composed of a base network (VGG16), extra convolutional layers for object detection, and a non-maximum suppression (NMS) layer for final detections. Note that convolution layers 7, 8, 9, 10, and 11 make predictions that are directly fed to the NMS layer.

6.9.2 Additional Feature Layers

Additional convolutional layers are appended to progressively reduce spatial dimensions, enabling multi-scale detection.

6.9.3 Multiscale Feature Maps

Feature maps of different sizes are used:

High-Resolution Maps: Detect small objects.
Low-Resolution Maps: Detect large objects.

6.9.4 Default Boxes

Default boxes are predefined at each feature map location with varying aspect ratios (e.g., 1:1, 2:1, 1:2) and scales.

6.9.5 Prediction Heads

Separate heads predict:

Bounding Box Regression: Refining box coordinates.
Class Scores: Assigning confidence scores for object classes.

6.9.6 Loss Function

SSD optimizes a multitask loss combining:

Localization Loss: Smooth L1 loss for bounding box regression.
Confidence Loss: Cross-entropy loss for class prediction.
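To make the default-box (anchor) idea from Sections 6.8.2 and 6.9.4 concrete, the sketch below tiles one set of default boxes over a feature map in normalized coordinates. The single scale and three aspect ratios are illustrative choices; the real SSD uses several scales per prediction layer.

import numpy as np

def default_boxes(feat_size, scale=0.2, ratios=(1.0, 2.0, 0.5)):
    """Return (feat_size * feat_size * len(ratios), 4) boxes as (cx, cy, w, h), normalized to [0, 1]."""
    boxes = []
    for i in range(feat_size):
        for j in range(feat_size):
            cx, cy = (j + 0.5) / feat_size, (i + 0.5) / feat_size   # cell center
            for r in ratios:                                         # r is the width:height aspect ratio
                w, h = scale * np.sqrt(r), scale / np.sqrt(r)
                boxes.append([cx, cy, w, h])
    return np.array(boxes)

print(default_boxes(8).shape)   # (192, 4): the fine 8 x 8 map, suited to smaller objects
print(default_boxes(4).shape)   # (48, 4): the coarse 4 x 4 map, suited to larger objects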
6.10 Key Features

6.10.1 Speed

Real-time performance is achieved by eliminating the region proposal stage, processing images at 59 FPS for 300 × 300 input.

6.10.2 Multiscale Detection

Feature maps of different scales enhance detection of objects of various sizes.

6.10.3 Simplicity

A single-stage architecture reduces training and inference complexity.

6.11 Comparison with the R-CNN Family

Table 6.3: Comparison of SSD and the R-CNN Family

Feature      | SSD                             | R-CNN Family
Architecture | Single-stage                    | Multi-stage
Speed        | Real-time (59 FPS)              | Slower (region proposals)
Accuracy     | High (weaker for small objects) | Higher (small object detection)
Complexity   | Simpler                         | More complex

6.12 Challenges

Small Object Detection: Struggles with detecting small objects due to lower feature map resolution.
Default Box Optimization: Requires careful tuning of default box sizes and aspect ratios.
Performance on Complex Datasets: Faces challenges with datasets having dense or occluded objects.

6.13 Applications

Autonomous Driving: Real-time detection of pedestrians, vehicles, and obstacles.
Video Surveillance: Efficient monitoring in security footage.
Drones and Robotics: Fast detection for navigation in dynamic environments.
Augmented Reality: Real-time object recognition for AR experiences.

The Single Shot MultiBox Detector (SSD) combines speed and accuracy in a single-stage object detection framework. Its multiscale feature maps and default box strategy paved the way for real-time applications while inspiring future advancements in single-stage detectors.

Introduction to YOLO

YOLO, an acronym for You Only Look Once, is a state-of-the-art object detection algorithm that has revolutionized computer vision tasks with its speed and efficiency. Introduced by Joseph Redmon et al. in 2016, YOLO offers a groundbreaking approach to object detection by treating it as a single regression problem, rather than breaking it into multiple stages like traditional methods.

At its core, YOLO combines object localization and classification into a single, unified neural network. Unlike earlier object detection frameworks such as R-CNN and its variants, YOLO processes the entire image in a single pass, enabling real-time object detection. This characteristic has made it immensely popular for applications where speed is critical, such as autonomous vehicles, video surveillance, and robotics.

YOLO is particularly notable for:

1. Real-Time Performance: With the ability to process over 30 frames per second, YOLO is capable of handling real-time scenarios effectively.
2. Unified Architecture: YOLO uses a single neural network for predicting bounding boxes and class probabilities, simplifying the overall process.
3. Wide Applicability: From detecting pedestrians in traffic to identifying products on store shelves, YOLO has been widely adopted across industries.

Key Points on the YOLO Family

1. Overview:
   YOLO (You Only Look Once) is a family of object detection networks developed by Joseph Redmon et al.
   Similar to the R-CNN family, YOLO focuses on real-time object detection using end-to-end deep learning models.

2. YOLOv1 (2016):
   Called "unified, real-time object detection" because it unifies the object detector and class predictor into a single network.

3. YOLOv2 (YOLO9000, 2016):
   Capable of detecting over 9,000 object categories, hence the name YOLO9000.
   Trained on the ImageNet and COCO datasets.
   Achieved a 16% mAP (mean Average Precision), which indicates modest accuracy but demonstrated extremely fast inference times.

4. YOLOv3 (2018):
   Significantly larger and more accurate than its predecessors.
   Achieved a mAP of 57.9%, the best result within the YOLO family at the time.

5. Strengths of the YOLO Family:
   Designed for fast object detection, making it ideal for real-time applications.
   Often demonstrated on real-time video or camera feed inputs.

6. Comparison to R-CNN:
   While R-CNN models generally offer better accuracy, YOLO models are significantly faster, making them suitable for applications that prioritize speed over precision.

YOLO's Approach to Object Detection

The creators of YOLO took a different approach compared to previous networks like R-CNNs. Unlike R-CNNs, YOLO does not include a region proposal step. Instead, it simplifies the process by directly predicting a limited number of bounding boxes.
The key aspects of YOLO's approach are:

Grid-Based Prediction: The input image is divided into a grid of cells, where each cell is responsible for predicting bounding boxes and the classification of objects within that region.
Direct Bounding Box Prediction: Instead of generating region proposals, YOLO predicts bounding boxes and associated class probabilities directly from the input image.
Non-Maximum Suppression (NMS): A large number of candidate bounding boxes are consolidated into the final prediction using NMS to remove redundant or overlapping boxes.

This innovative approach allows YOLO to achieve real-time performance by significantly reducing the computational overhead associated with traditional region proposal methods.

Figure 6.17: YOLO splits the image into grids, predicts objects for each grid, and then uses NMS to finalize predictions.

Evolution of YOLO Architectures

YOLOv1: Proposed the general architecture for object detection, introducing the concept of real-time, unified object detection.
YOLOv2: Refined the design and incorporated predefined anchor boxes to improve bounding-box proposals, enhancing detection accuracy.
YOLOv3: Further refined both the model architecture and training process, achieving significant improvements in detection accuracy and scalability.

The YOLOv3 network splits the input image into a grid of S × S cells. Each grid cell is responsible for detecting the existence of an object if the center of the ground-truth bounding box falls into it. Specifically, each grid cell predicts B bounding boxes, their objectness scores, and class predictions, as follows:

Coordinates of B Bounding Boxes: Similar to previous detectors, YOLOv3 predicts four coordinates for each bounding box: (b_x, b_y, b_w, b_h), where (b_x, b_y) are the offsets relative to the grid cell's location, and (b_w, b_h) represent the width and height of the bounding box.

Objectness Score (P_0): The objectness score indicates the probability that the cell contains an object. This score is passed through a sigmoid function, treating it as a probability with values between 0 and 1. The objectness score is calculated as:

   P_0 = Pr(containing an object) × IoU(pred, truth)

where IoU (Intersection over Union) is a measure of the overlap between the predicted bounding box and the ground-truth box.

Class Prediction: If the bounding box contains an object, the network predicts the probability for K classes, where K is the total number of classes in the problem. It is important to note that in previous YOLO versions, the softmax function was used for class scores. However, in YOLOv3, Redmon et al. switched to using the sigmoid function. This change was made because softmax assumes that each box can belong to only one class, which is not always true. For instance, an object might belong to both "Woman" and "Person." The sigmoid function allows for a multi-label classification approach, which models the data more accurately.

As illustrated in Figure 6.19, for each bounding box B, the prediction includes the following components:

   [(bounding box coordinates), (objectness score), (class predictions)]

Figure 6.18: Example of a YOLOv3 workflow when applying a 13 × 13 grid to the input image. The input image is split into 169 cells. Each cell predicts B bounding boxes and their objectness scores along with their class predictions. In this example, we show the cell at the center of the ground truth making predictions for 3 boxes (B = 3). Each prediction has the following attributes: bounding box coordinates, objectness score, and class predictions.
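As a rough illustration of what one such prediction vector looks like after decoding, the sketch below takes raw values for a single cell and a single box (4 box values, 1 objectness value, K class values) and applies the sigmoid activations described above. The exact box parameterization differs between implementations (real YOLOv3 also scales widths and heights by anchor priors), so the width/height handling here is a simplifying assumption.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def decode_cell_prediction(raw, cell_x, cell_y, grid_size=13, num_classes=3):
    tx, ty, tw, th = raw[:4]
    bx = (cell_x + sigmoid(tx)) / grid_size            # box center: offset inside the responsible cell
    by = (cell_y + sigmoid(ty)) / grid_size
    bw, bh = np.exp(tw) / grid_size, np.exp(th) / grid_size   # simplified width/height (no anchor priors)
    objectness = sigmoid(raw[4])                       # P_0: probability the box contains an object
    class_probs = sigmoid(raw[5:5 + num_classes])      # sigmoid, not softmax: multi-label class scores
    return (bx, by, bw, bh), objectness, class_probs

raw = np.array([0.2, -0.1, 0.4, 0.3, 2.0, 1.5, -2.0, 0.1])   # toy raw outputs for one box
box, obj, classes = decode_cell_prediction(raw, cell_x=6, cell_y=6)
print(box, round(obj, 2), np.round(classes, 2))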
Figure 6.19: High-level architecture of YOLO.

Bibliography

Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 580–587, 2014.

Ross Girshick. Fast R-CNN. arXiv preprint arXiv:1504.08083, 2015.
