COMP9517 Computer Vision 2024 Term 2 Week 8 Deep Learning II PDF
UNSW Sydney
Dr Dong Gong
Summary
This document covers object detection using convolutional neural networks (CNNs). It describes proposal-based algorithms (R-CNN, Fast R-CNN, Faster R-CNN) and proposal-free algorithms (SSD, YOLO, RetinaNet), together with example datasets (PASCAL VOC, Microsoft COCO). It is lecture material for the UNSW course COMP9517 Computer Vision, 2024 Term 2, Week 8.
Full Transcript
COMP9517 Computer Vision
2024 Term 2 Week 8
Dr Dong Gong
Deep Learning II
Object Detection using CNNs
Copyright (C) UNSW COMP9517 24T2W8 Deep Learning II

Outline
- Computer Vision tasks
- Object Detection dataset example
- Proposal based algorithms
  - R-CNN
  - Fast R-CNN
  - Faster R-CNN
- Proposal free algorithms
  - Single Shot Detector (SSD)
  - You Only Look Once (YOLO)
  - RetinaNet

Vision tasks
- Image classification: Assigning a label or class to an image
- Object detection: Locating objects in an image with bounding boxes and assigning a class to each located object
- Semantic segmentation: Labeling every pixel (pixel-wise classification)
- Instance segmentation: Pixel-wise labeling of instances

Image Classification
[Diagram: image → CNN → class "Cow"]
Image Credit: Creative Commons Licenses

Object Detection
- Determine “what” and “where”
  - “what”: Classification
  - “where”: Localization
[Diagram: image → CNN → class "Cow" and Bbox coordinates (x, y, w, h)]
Image Credit: Creative Commons Licenses

How to use CNNs
- Use fully-connected layers for mapping to class and Bbox coordinates
Image Credit: Creative Commons Licenses

Object Detection
- Use fully-connected layers for mapping to class and Bbox coordinates
[Diagram: one FC layer (4096 to 1000) outputs class scores, e.g. Cow: 0.85, Horse: 0.10, …; a second FC layer (4096 to 4) outputs Bbox coordinates (x, y, w, h)]
Image Credit: Creative Commons Licenses

Intersection over Union (IoU)
- To maximize overlap between the predicted bounding box and the actual bounding box, maximize Intersection over Union (IoU)

Object Detection dataset example
- Example: PASCAL VOC challenge dataset
- Goal: To recognize objects from a number of visual object classes in realistic scenes (i.e.
not pre-segmented objects).
- It is fundamentally a supervised learning problem in that a training set of labelled images is provided.
- The twenty object classes that have been selected are:
  - Person: person
  - Animal: bird, cat, cow, dog, horse, sheep
  - Vehicle: aeroplane, bicycle, boat, bus, car, motorbike, train
  - Indoor: bottle, chair, dining table, potted plant, sofa, tv/monitor
- The data provided consists of a set of images; each image has an annotation file giving a bounding box and object class label for each object present in the image.
Source: PASCAL VOC 2012 Challenge Dataset http://host.robots.ox.ac.uk/pascal/VOC/voc2012/

Object Detection dataset example
- Example: Microsoft COCO dataset
- Large-scale object detection, segmentation, and captioning dataset.
Source: Microsoft COCO https://cocodataset.org/#home

Object Detection
- How to train?
  - Use Softmax Loss for Classification
  - Use Regression loss for Localization
[Diagram: FC layer (4096 to 1000) → predicted class, compared with the actual class "Cow" by the softmax loss; FC layer (4096 to 4) → predicted Bbox coordinates (x’, y’, w’, h’), compared with the actual Bbox coordinates (x, y, w, h) by the L2 loss; both losses feed the final loss]
Image Credit: Creative Commons Licenses

Object Detection
- How to detect multiple objects?
[Diagram: three example images, each passed through a CNN. Outputs: (1) Cow: (x, y, w, h); (2) Cow: (x, y, w, h), Bull: (x, y, w, h); (3) Zebra: (x, y, w, h), Zebra: (x, y, w, h), Zebra: (x, y, w, h), …]
Image Credit: Creative Commons Licenses

Object Detection
- How to detect multiple objects?
Challenge of mapping to different number of outputs
[Diagram: three example images, each passed through a CNN. Outputs: (1) Cow: (x, y, w, h), i.e. 4 numbers; (2) Cow: (x, y, w, h), Bull: (x, y, w, h), i.e. 8 numbers; (3) Dog: (x, y, w, h), Rabbit: (x, y, w, h), Sheep: (x, y, w, h), Chicken: (x, y, w, h), …, i.e. many numbers]
Image Credit: Creative Commons Licenses

Object Detection
- How to detect multiple objects?
- A (naïve) solution: Apply a CNN to various crops of the image; the CNN classifies each crop as object or background (sliding window with varying location and scale)
[Diagram sequence over three slides: the CNN is applied to one crop (x, y, w, h) at a time. First crop: Dog No, Rabbit No, Sheep No, Background Yes. Second crop: Dog Yes, Rabbit No, Sheep No, Background No. Third crop: Dog No, Rabbit Yes, Sheep No, Background No.]
Image Credit: Creative Commons Licenses
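To get a feel for how many crops a sliding window with varying location and scale produces, the enumeration can be sketched as below. This is only an illustration: the image size, the window sizes, and the stride are hypothetical values chosen for the example, not settings from the lecture.

```python
def sliding_windows(img_w, img_h, sizes, stride):
    """Yield (x, y, w, h) crops for every window size and every
    stride-aligned position that fits inside the image."""
    for w, h in sizes:
        for y in range(0, img_h - h + 1, stride):
            for x in range(0, img_w - w + 1, stride):
                yield (x, y, w, h)


# Hypothetical settings: a 640 x 480 image, four window shapes, stride 16.
sizes = [(64, 64), (128, 128), (128, 64), (64, 128)]
crops = list(sliding_windows(640, 480, sizes, stride=16))
print(len(crops))  # each crop would need its own CNN forward pass
```

Under these assumed settings the loop already yields 3500 crops for a single image, and a finer stride or more scales increases that number quickly.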
- A (naïve) solution: Apply a CNN to various crops of the image; the CNN classifies each crop as object or background (sliding window with varying location and scale)
- Limitation: Computationally very expensive
  - Need to apply the CNN on many crops with different locations and scales
[Diagram: for the current crop (x, y, w, h): Dog No, Rabbit No, Sheep Yes, Background No]
Image Credit: Creative Commons Licenses

Selective Search
1. Merge the two most similar regions based on the similarity S.
2. Update similarities between the new region and its neighbors.
3. Go back to step 1 until the whole image is a single region.
Felzenszwalb and Huttenlocher, Efficient Graph-Based Image Segmentation, IJCV 2004

Selective Search for Object Detection
- Selective Search is a region proposal algorithm
- It starts by over-segmenting the image based on the intensity of the pixels using a graph-based segmentation (Felzenszwalb and Huttenlocher)
- Selective Search then takes these oversegments as initial input and performs the following steps:
1. Add all bounding boxes corresponding to segmented parts to the list of region proposals
2. Group adjacent segments based on similarity
3.
Go to step 1
- It’s a bottom-up approach
Felzenszwalb and Huttenlocher, Efficient Graph-Based Image Segmentation, IJCV 2004

Selective Search for Object Detection
- Selective Search design constraints:
  - Capture all scales (objects can occur at any scale within the image)
  - Diversification (use a diverse set of strategies such as forming regions based on color, texture, shading, etc.)
  - Fast to compute
Van de Sande et al., Segmentation as Selective Search for Object Recognition, ICCV 2011.

Region Proposals
- Find image regions that are likely to contain objects
- Region proposals: generate bounding box proposals (regions that potentially contain objects) based on other methods/priors; this can be fast
- Gives around 2000 region proposals in a few seconds on CPU
[Diagram: image → region proposal algorithm → proposals]
Van de Sande et al., Segmentation as Selective Search for Object Recognition, ICCV 2011.

R-CNN: Regions with CNN Features
[Diagram: each region is passed through a CNN, which outputs a class and a Bbox]
- R-CNN
1. Takes an input image
2. Extracts around 2000 (2k) bottom-up region proposals
3. Computes features for each proposal using a CNN
4. Classifies each region using class-specific linear SVMs.
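Steps 2 and 3 above imply that every proposal, whatever its shape, is cropped and warped to the fixed input size the CNN expects. A minimal sketch of that crop-and-warp step follows, assuming a single-channel image stored as a list of rows and nearest-neighbour resampling; the function name `crop_and_warp`, the tiny 4 x 4 output size, and the example boxes are hypothetical choices for illustration, not details taken from the slides.

```python
def crop_and_warp(image, box, out_size):
    """Crop box = (x, y, w, h) from a 2D image (list of rows) and resize it
    to out_size x out_size using nearest-neighbour sampling."""
    x, y, w, h = box
    warped = []
    for i in range(out_size):
        # Map output row i back to a source row inside the crop.
        src_r = y + (i * h) // out_size
        row = []
        for j in range(out_size):
            # Map output column j back to a source column inside the crop.
            src_c = x + (j * w) // out_size
            row.append(image[src_r][src_c])
        warped.append(row)
    return warped


# Toy 8 x 6 image whose pixel value encodes its position (10 * row + col).
image = [[10 * r + c for c in range(8)] for r in range(6)]
proposals = [(0, 0, 4, 2), (2, 1, 6, 5)]  # hypothetical (x, y, w, h) boxes
warped = [crop_and_warp(image, box, 4) for box in proposals]
```

Both proposals end up as 4 x 4 arrays even though their aspect ratios differ, which is why the warped regions can share one CNN input shape.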
[Diagram labels: warped image regions; Regions of Interest (RoI) from a proposal method (around 2k)]
Girshick et al., Rich feature hierarchies for accurate object detection and semantic segmentation, CVPR 2014

R-CNN: Regions with CNN Features
- R-CNN predicts “corrections” to the RoI: 4 numbers (dx, dy, dw, dh)
Girshick et al., Rich feature hierarchies for accurate object detection and semantic segmentation, CVPR 2014
https://dl.dropboxusercontent.com/s/vlyrkgd8nz8gy5l/fast-rcnn.pdf?dl=0

R-CNN was very slow!
- Why?
  - Training was a multi-stage pipeline
    1. R-CNN first fine-tunes a ConvNet on object proposals
    2. Fits SVMs to ConvNet features
    3. Bounding box regressors are learned
  - Training is expensive in space and time
    - For SVM and bounding-box regressor training, features are extracted from each object proposal in each image and written to disk
  - The selective search is a fixed algorithm. No learning is happening!
  - At test time, approximately 2000 independent forward passes are needed for each image. Object detection with VGG-16 takes 47s/image (on a GPU)
Girshick, Fast R-CNN, ICCV 2015.

Spatial Pyramid Pooling (SPP-Net)
- “Pool” features into a common size
He et al. “Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition”, ECCV 2014.
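The idea of pooling features into a common size can be sketched for a single feature-map channel as follows. This is an illustrative sketch, not the SPP-Net implementation: the pyramid levels (1, 2, 4) and the floor/ceil bin boundaries are assumptions made for the example, and `spp_single_channel` is a hypothetical name.

```python
import math


def spp_single_channel(fmap, levels=(1, 2, 4)):
    """Max-pool a 2D feature map into n x n bins for each pyramid level and
    concatenate the results into one fixed-length vector."""
    h, w = len(fmap), len(fmap[0])
    pooled = []
    for n in levels:
        for i in range(n):
            # Row range of bin i; floor/ceil keeps every bin non-empty.
            r0, r1 = (i * h) // n, math.ceil((i + 1) * h / n)
            for j in range(n):
                c0, c1 = (j * w) // n, math.ceil((j + 1) * w / n)
                pooled.append(max(fmap[r][c]
                                  for r in range(r0, r1)
                                  for c in range(c0, c1)))
    return pooled


# Two feature maps of different spatial size give vectors of equal length.
small = [[r * 7 + c for c in range(7)] for r in range(5)]    # 5 x 7
large = [[r * 13 + c for c in range(13)] for r in range(9)]  # 9 x 13
vec_small = spp_single_channel(small)
vec_large = spp_single_channel(large)
```

With levels (1, 2, 4) each channel contributes 1 + 4 + 16 = 21 values regardless of the input size, so the fully-connected layers that follow always see the same input length.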
Spatial Pyramid Pooling (SPP-Net)
- SPP-Net solved the R-CNN problem of being slow at test time
- However, it still has problems inherited from R-CNN:
  - Training is still slow (a bit faster than R-CNN)
  - Training scheme is still complex
  - Still no end-to-end training
He et al. “Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition”, ECCV 2014.

Fast R-CNN
- Idea: Pass the image through the CNN and then crop RoIs from the feature map
[Diagram: pass the entire image through the CNN; get Regions of Interest (RoIs) from the feature maps using a region proposal method]
Girshick, Fast R-CNN, ICCV 2015.

Fast R-CNN training
- Relies on pre-generated proposals, as R-CNN does
- Forward the image through the CNN before cropping with the proposal bbox
- Cropping on the conv feature map (RoI pooling): the proposal bbox, given in image coordinates, is applied to the feature maps
- End-to-end
Girshick, Fast R-CNN, ICCV 2015.

Fast R-CNN
- Training is single-stage, using a multi-task loss
- No disk storage is required for feature caching
- Higher detection quality (higher mAP) than R-CNN
Girshick, Fast R-CNN, ICCV 2015.

How to crop features?
- RoI Pooling: Projecting proposals onto the feature map
[Diagram: Input Image (e.g., 3 x W x H) → CNN → Feature Map (e.g., 512 x W x H)]
Girshick, Fast R-CNN, ICCV 2015.

How to crop features?
- RoI Pooling: “Snap” to grid cells (Controls are automatically aligned to grid points)
[Diagram: Input Image (e.g., 3 x W x H) → CNN → Feature Map (e.g., 512 x W x H) → max-pooling → Region features (e.g., 512 x 2 x 2)]
Girshick, Fast R-CNN, ICCV 2015.

How to crop features?
- RoI Pooling: “Snap” to grid cells (Controls are automatically aligned to grid points)
- Problem: Slight misalignment
- Solution: RoI Align using bilinear interpolation
[Diagram: Input Image (e.g., 3 x W x H) → CNN → Feature Map (e.g., 512 x W x H) → max-pooling → Region features (e.g., 512 x 2 x 2)]
Girshick, Fast R-CNN, ICCV 2015.

R-CNN vs SPP-Net vs Fast R-CNN
- Observation: During test time, Fast R-CNN computation time is dominated by computing region proposals.
Girshick, Fast R-CNN, ICCV 2015.

How to further improve Fast R-CNN?
- Observation: The convolutional (conv) feature maps used by region-based detectors (e.g., Fast R-CNN) can also be used for generating region proposals
- Time spent on region proposals is the real bottleneck in state-of-the-art object detection systems
- Can we unify region proposals and object detection networks (e.g., Fast R-CNN)?
- Idea: Region Proposal Networks (RPNs)
  - Computing proposals with deep neural networks.
Ren et al., “Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks”, NeurIPS 2015.

Region Proposal Network (RPN)
- Slide a small window over the feature map
- Predict object/no object
- Regress bounding box coordinates
- Box regression is with reference to anchors (3 scales x 3 aspect ratios)
Ren et al., “Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks”, NeurIPS 2015.
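The anchor idea in the last bullet can be sketched as follows: at each sliding-window position, 3 scales x 3 aspect ratios give 9 reference boxes, and the regression output is read as a correction to one of them. The concrete scales (128, 256, 512) and ratios (1:2, 1:1, 2:1) are the defaults reported in the Faster R-CNN paper rather than values stated on the slide, and the decoding formula is the standard R-CNN box parameterization; the function names are hypothetical.

```python
import math


def make_anchors(cx, cy, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return 9 anchors (cx, cy, w, h) centred at one sliding-window position.
    Each anchor has area scale**2 and aspect ratio h / w = ratio."""
    anchors = []
    for s in scales:
        for r in ratios:
            w = s / math.sqrt(r)
            h = s * math.sqrt(r)
            anchors.append((cx, cy, w, h))
    return anchors


def apply_deltas(anchor, deltas):
    """Shift and rescale an anchor by predicted corrections (dx, dy, dw, dh)."""
    cx, cy, w, h = anchor
    dx, dy, dw, dh = deltas
    return (cx + dx * w, cy + dy * h, w * math.exp(dw), h * math.exp(dh))


anchors = make_anchors(320.0, 240.0)
# A hypothetical network output for the first anchor.
proposal = apply_deltas(anchors[0], (0.1, -0.05, 0.2, 0.0))
```

Predicting offsets relative to an anchor, instead of absolute coordinates, keeps the regression targets small and comparable across box sizes, which is one common motivation for this parameterization.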
Faster R-CNN
- Region Proposal Network (RPN) to predict proposals from features
- Jointly train with four losses:
  - RPN classification: anchor box is an object/not an object
  - RPN regression: predict transform from anchor box to proposal box
  - Object classification: classify proposals as background or object class
  - Object regression: predict transform from proposal box to object box.
[Diagram labels: classification loss and Bbox regression loss, each appearing twice]
Ren et al., “Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks”, NeurIPS 2015.

Faster R-CNN
- Anchor Boxes
  - Set of predefined bounding boxes of a certain height and width. These are defined to capture the scale and aspect ratio of specific object classes we want to detect (typically chosen based on object sizes in the training datasets)
[Diagram: CNN → for each anchor: is the anchor an object? (Yes/No), and a Bbox transform (predict a correction from the anchor to the GT Bbox)]
Ren et al., “Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks”, NeurIPS 2015.

Faster R-CNN loss
- In practice, we have k different anchor boxes of different size and scale at each point.
- reg layer has 4k outputs encoding the coordinates of k boxes
- cls layer outputs 2k scores that estimate the probability of object or not object for each anchor box
- Loss function
Ren et al., “Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks”, NeurIPS 2015.

Region Proposals and Anchor Boxes
- Extract 9 proposals relative to 9 anchors
- 9 anchor boxes = 3 ratios x 3 scales
- Input: each sliding window on the conv feature map
- Output: for i = 1, 2, …, 9: 4 coordinates (reg) and 2 scores (cls)
Ren et al., “Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks”, NeurIPS 2015.

Region Proposals and Anchor Boxes
- The proposals highly overlap each other! Need to reduce redundancy.
- Total # of proposals: 11 x 15 x 9 = 1485 (total # of windows x # of proposals per window)
Ren et al., “Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks”, NeurIPS 2015.

Reduce redundancy by Non-Maximum Suppression (NMS)
- Step 1: Take the most probable proposal from the 1485 proposals
- Step 2: Compute the IoU between the most probable proposal and the other proposals, and remove proposals having IoU > threshold (0.7)
- Step 3: Take the most probable proposal among the remaining (original minus removed) proposals and repeat steps 1 to 3.
Image Credit: Analytics Vidhya. https://www.analyticsvidhya.com/blog/2020/08/selecting-the-right-bounding-box-using-non-max-suppression-with-implementation/

Online hard example mining (OHEM)
- Class imbalance hurts training (overwhelming number of easy examples and small number of hard examples)
- We are training the model to learn background rather than detecting objects.
- OHEM is a bootstrapping technique that modifies SGD to sample from examples in a non-uniform way depending on the current loss of each example.
  - Sort anchors by their calculated loss, apply NMS.
  - Pick the top ones such that the ratio between the picked negatives and positives is at most 3:1.
- Faster R-CNN selects 256 anchors: 128 positive, 128 negative
Image Credit: Analytics Vidhya. https://www.analyticsvidhya.com/blog/2020/08/selecting-the-right-bounding-box-using-non-max-suppression-with-implementation/

Faster R-CNN selected examples
- PASCAL VOC 2007 test set using Faster R-CNN
Ren et al., “Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks”, NeurIPS 2015.

Faster R-CNN speed
- Faster R-CNN is super fast
Ren et al., “Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks”, NeurIPS 2015.

Two-stage vs One-stage Detectors
- Two-stage detectors
  - Models in the R-CNN family
  - Step 1. The model proposes a set of RoIs by selective search or RPN
  - Step 2. Classification of region proposals
- Single-stage detectors
  - Detecting objects in images using a single deep neural network
Image Credits: https://github.com/yehengchen/Object-Detection-and-Tracking/blob/master/Two-stage%20vs%20One-stage%20Detectors.md

Two-stage vs One-stage Detectors
Object Detection
- Two-stage/Proposal-based: R-CNN, Fast R-CNN, Faster R-CNN, R-FCN, Mask R-CNN
- One-stage/Proposal-free: SSD, YOLO, RetinaNet
Redmon et al. “You Only Look Once: Unified, Real-Time Object Detection”, CVPR 2016
Liu et al. “SSD: Single Shot MultiBox Detector”, ECCV 2016
Lin et al. “Focal Loss for Dense Object Detection”, ICCV 2017
He et al. “Mask R-CNN”, ICCV 2017
Dai et al.
“R-FCN: Object Detection via Region-based Fully Convolutional Networks”, NeurIPS 2016

SSD: Single Shot MultiBox Detector
- Two-stage pipeline: Propose bounding boxes, resample pixels or features for each box, and apply a high-quality classifier.
- Can we speed up by eliminating bounding box proposals and the subsequent pixel or feature resampling stage?
- Series of modifications:
  - Using small convolutional filters to predict object categories and offsets in bounding box locations
  - Using separate filters for different aspect ratio detections
  - Applying filters to multiple feature maps from the later stages of the network to perform detection at multiple scales
Liu et al. “SSD: Single Shot MultiBox Detector”, ECCV 2016

SSD: Single Shot MultiBox Detector
- SSD has two components: a backbone network and an SSD head
  - The backbone network is a pre-trained image classification network used as a feature extractor (e.g., ResNet trained on ImageNet)
  - The SSD head has one or more convolutional layers and outputs bounding boxes and classes of objects in the spatial location of the final layers' activations.
Liu et al. “SSD: Single Shot MultiBox Detector”, ECCV 2016

SSD: Single Shot MultiBox Detector
- Findings
  - Data augmentation is crucial
  - More default box shapes are better (different scales and aspect ratios)
  - Multiple output layers at different resolutions are better
[Figure: Detection examples on COCO test-dev with SSD512 model]
Liu et al.
“SSD: Single Shot MultiBox Detector”, ECCV 2016

YOLO: You Only Look Once
- Proposal-based detection repurposes classifiers or localizers to perform detection (apply the model to an image at multiple locations and scales; high-scoring regions are considered detections)
- YOLO reframes object detection as a single regression problem (image pixels → bounding box coordinates and class probabilities)
- YOLO: Apply a single neural network to the full image
  - Step 1. Network divides the image into regions and predicts bounding boxes and probabilities for each region
  - Step 2. These bounding boxes are weighted by the predicted probabilities
- Looking at the whole image at once during test time makes it very fast: 1000x faster than R-CNN, 100x faster than Fast R-CNN
Redmon et al. “You Only Look Once: Unified, Real-Time Object Detection”, CVPR 2016

YOLO: You Only Look Once
[Figures: Results on PASCAL VOC 2007; Error Analysis]
Redmon et al. “You Only Look Once: Unified, Real-Time Object Detection”, CVPR 2016

YOLO timeline
Timeline credit: DataCamp https://www.datacamp.com/blog/yolo-object-detection-explained
Image Credit: Creative Commons Licenses

Why do one-stage detectors trail in accuracy?
- Extreme foreground-background class imbalance encountered.
- One-stage:
  - Have to process a much larger set of candidate object locations regularly sampled across an image, which amounts to enumerating ~100k locations that densely cover spatial positions, scales, and aspect ratios.
- Two-stage:
  - The proposal stage rapidly narrows down the number of candidate object locations to a small number (e.g., 2k proposals), filtering out most background samples.
  - In the classification stage, fix the foreground-to-background ratio to 1:3, or use online hard example mining (OHEM).
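The two remedies named for the classification stage, a fixed foreground-to-background ratio and mining of hard examples, can be combined in a small sketch: keep every foreground example and only the highest-loss background examples, capped at three per foreground example. This is a simplified illustration, not the procedure from the OHEM paper; `sample_minibatch` and its inputs are hypothetical.

```python
def sample_minibatch(losses, labels, max_neg_per_pos=3):
    """Keep every positive (label 1) and only the hardest negatives (label 0),
    i.e. those with the highest current loss, at most max_neg_per_pos
    negatives per positive. Returns the selected indices in ascending order."""
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    # Hardest negatives first: sort by current loss, largest to smallest.
    neg.sort(key=lambda i: losses[i], reverse=True)
    keep_neg = neg[: max_neg_per_pos * len(pos)]
    return sorted(pos + keep_neg)


# Hypothetical per-candidate losses: one object and seven background boxes.
labels = [1, 0, 0, 0, 0, 0, 0, 0]
losses = [0.9, 0.1, 0.8, 0.05, 0.7, 0.02, 0.6, 0.01]
selected = sample_minibatch(losses, labels)
```

Here the four easy background boxes (losses of 0.1 and below) are dropped, so the gradient is no longer dominated by examples the model already classifies well.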
Image Credit: Creative Commons Licenses

Activation Maps
- Extreme foreground-background class imbalance encountered.
- As the image goes deeper through the network, resolution decreases and semantic value increases.
- How about predicting from multiple maps?
[Diagram labels: semantic value, resolution]

Feature Pyramid Network (FPN)
- Improve predictive power of lower-level feature maps by adding contextual information from higher-level feature maps.
- FPN creates an architecture with rich semantics at all levels by combining low-resolution, semantically strong features with high-resolution, semantically weak features (top-down + lateral connections).
Lin et al., “Feature Pyramid Networks for Object Detection”, CVPR 2017.

RetinaNet
- One-stage object detectors proven to work well with dense and small-scale objects
- RetinaNet: Feature Pyramid Network + Focal Loss (avoiding class-imbalance)
Lin et al., “Focal Loss for Dense Object Detection”, ICCV 2017.

RetinaNet Architecture
- Bottom-up pathway: The backbone network, which calculates the feature maps at different scales
- Top-down pathway and lateral connections: The top-down pathway upsamples the spatially coarser feature maps from higher pyramid levels, and the lateral connections merge the top-down layers and the bottom-up layers with the same spatial size.
- Classification subnetwork: It predicts the probability of an object being present at each spatial location for each anchor box and object class.
- Bbox subnetwork: It regresses the offset for the bounding boxes from the anchor boxes for each ground-truth object.
Lin et al., “Focal Loss for Dense Object Detection”, ICCV 2017.

RetinaNet Results
- Speed (ms) versus accuracy (AP) on COCO test-dev.
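The focal loss that RetinaNet pairs with the FPN is not written out on the slides. The sketch below uses the form given in the cited Focal Loss paper, FL(p_t) = -alpha_t (1 - p_t)^gamma log(p_t), with that paper's commonly quoted defaults gamma = 2 and alpha = 0.25; it handles a single binary prediction only and is meant as an illustration rather than the RetinaNet implementation.

```python
import math


def cross_entropy(p, y):
    """Binary cross-entropy for predicted foreground probability p, label y."""
    p_t = p if y == 1 else 1.0 - p
    return -math.log(p_t)


def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Binary focal loss: cross-entropy scaled by (1 - p_t)**gamma, so that
    well-classified (easy) examples contribute very little."""
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


# Hypothetical predictions for two background (y = 0) anchors.
easy = focal_loss(0.01, 0)  # confidently and correctly rejected
hard = focal_loss(0.90, 0)  # wrongly scored as an object
print(easy, hard)
```

The easy background anchor is down-weighted by several orders of magnitude relative to the hard one, which is how the loss counters the foreground-background imbalance without discarding any anchors.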
Lin et al., “Focal Loss for Dense Object Detection”, ICCV 2017.
Image Credit: Creative Commons Licenses

Object Detection in 20 Years
Zou et al. “Object Detection in 20 Years: A Survey”, ArXiv, 2023

Implementations
- PyTorch https://pytorch.org/vision/main/models/faster_rcnn.html
- TensorFlow https://www.tensorflow.org/hub/tutorials/object_detection
- Detectron2 (Meta AI’s state-of-the-art object detection and segmentation library) https://github.com/facebookresearch/detectron2
- MMDetection (Open-source object detection toolbox based on PyTorch) https://github.com/open-mmlab/mmdetection
- Understanding and Implementing Faster R-CNN: A step-by-step guide https://towardsdatascience.com/understanding-and-implementing-faster-r-cnn-a-step-by-step-guide-11acfff216b0
- Faster R-CNN in PyTorch and TensorFlow with Keras https://github.com/trzy/FasterRCNN
- Mask Detection using Faster R-CNN https://pseudo-lab.github.io/Tutorial-Book-en/chapters/en/object-detection/Ch5-Faster-R-CNN.html
- Object Detection using Faster R-CNN https://github.com/spmallick/learnopencv/blob/master/PyTorch-faster-RCNN/PyTorch_faster_RCNN.ipynb
- SSD implementation in PyTorch https://pytorch.org/hub/nvidia_deeplearningexamples_ssd/
- YOLO implementation in PyTorch https://pytorch.org/hub/ultralytics_yolov5/ https://github.com/ultralytics/yolov5
- Mask Detection using RetinaNet https://pseudo-lab.github.io/Tutorial-Book-en/chapters/en/object-detection/Ch4-RetinaNet.html
- Faster R-CNN on PASCAL VOC Dataset https://cv.gluon.ai/build/examples_detection/train_faster_rcnn_voc.html

Further reading on discussed topics
- Chapter 7 of Deep Learning Book by Ian Goodfellow, Yoshua Bengio and Aaron Courville.
https://www.deeplearningbook.org/
- Chapter 4: Object Detection and Image Segmentation from Practical Machine Learning for Computer Vision by Valliappa Lakshmanan, Martin Gorner, Ryan Gillard. https://www.oreilly.com/library/view/practical-machine-learning/9781098102357/ch04.html
- Chapter 7 of Deep Learning for Vision Systems by Mohamed Elgendy.

Acknowledgements
- Some material drawn from referenced and associated online sources
- Image sources credited where possible
- Some slides adapted from cs231n Lecture 9 “Object Detection and Image Segmentation”

References
- Felzenszwalb and Huttenlocher, “Efficient Graph-Based Image Segmentation”, IJCV 2004. https://link.springer.com/article/10.1023/B:VISI.0000022288.19776.77
- Van de Sande et al., “Segmentation as Selective Search for Object Recognition”, ICCV 2011. https://ieeexplore.ieee.org/document/6126456
- Girshick et al., “Rich feature hierarchies for accurate object detection and semantic segmentation”, CVPR 2014. https://www.cv-foundation.org/openaccess/content_cvpr_2014/papers/Girshick_Rich_Feature_Hierarchies_2014_CVPR_paper.pdf
- Girshick, “Fast R-CNN”, ICCV 2015. https://www.cv-foundation.org/openaccess/content_iccv_2015/papers/Girshick_Fast_R-CNN_ICCV_2015_paper.pdf
- Ren et al., “Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks”, NeurIPS 2015. https://papers.nips.cc/paper_files/paper/2015/hash/14bfa6bb14875e45bba028a21ed38046-Abstract.html
- Liu et al., “SSD: Single Shot MultiBox Detector”, ECCV 2016. https://link.springer.com/chapter/10.1007/978-3-319-46448-0_2
- Redmon et al., “You Only Look Once: Unified, Real-Time Object Detection”, CVPR 2016. https://www.cv-foundation.org/openaccess/content_cvpr_2016/papers/Redmon_You_Only_Look_CVPR_2016_paper.pdf
- Zou et al.
“Object Detection in 20 Years: A Survey”, ArXiv, 2023. https://arxiv.org/abs/1905.05055
- Long et al., “Fully Convolutional Networks for Semantic Segmentation”, CVPR 2015. https://www.cv-foundation.org/openaccess/content_cvpr_2015/papers/Long_Fully_Convolutional_Networks_2015_CVPR_paper.pdf
- Ronneberger et al., “U-Net: Convolutional Networks for Biomedical Image Segmentation”, MICCAI 2015. https://link.springer.com/chapter/10.1007/978-3-319-24574-4_28
- Oktay et al., “Attention U-Net: Learning where to look for the Pancreas”, MIDL 2018. https://openreview.net/forum?id=Skft7cijM
- Diakogiannis et al., “ResUNet-a: A deep learning framework for semantic segmentation of remotely sensed data”, ISPRS Journal of Photogrammetry and Remote Sensing, 2019. https://www.sciencedirect.com/science/article/abs/pii/S0924271620300149
- Chen et al., “TransUNet: Transformers Make Strong Encoders for Medical Image Segmentation”, ArXiv, 2021. https://arxiv.org/abs/2102.04306
- He et al., “Mask R-CNN”, ICCV 2017. https://openaccess.thecvf.com/content_ICCV_2017/papers/He_Mask_R-CNN_ICCV_2017_paper.pdf

Example exam question
What is the key benefit of CNNs over traditional ANNs for image classification?
A. CNNs are computationally less expensive.
B. CNNs automatically learn hierarchical features.
C. CNNs require fewer network layers.
D. CNNs can learn nonlinear mappings.