Neural Networks Lecture 4: Convolutional Neural Networks Applications
Document Details
Uploaded by VictoriousGlockenspiel
2023
Summary
These lecture notes cover applications of convolutional neural networks (CNNs). Topics discussed include semantic and instance segmentation, object detection, and CNNs for time series analysis. The lecture also details fine-tuning techniques.
Full Transcript
Neural Networks - Lecture 4
Applications of Convolutional Neural Networks

Outline
● Semantic and Instance Segmentation
● Object Detection
● CNN for Time Series Analysis
● Fine tuning techniques

Task definitions
Source: AI Pool blog: How does instance segmentation work

Semantic Segmentation

Semantic Segmentation – DeepLab-v3
● Highlights:
  – Use of dilated convolutions (atrous convolutions) to enable larger effective receptive field sizes. Helps in capturing spatial context information
  – Use of Atrous Spatial Pyramid Pooling to extract content information from several scale levels at the same time

Semantic Segmentation – DeepLab-v3
● Conv Layer vs Dilated Conv Layer:
  – Use of dilated convolutions to extract a larger information context
  – Output stride (i.e. reduction factor for image resolution) limited to 16, as larger values are harmful for semantic segmentation

Semantic Segmentation – DeepLab-v3
● Atrous Spatial Pyramid Pooling:
  – Uses the idea that it is effective to resample features at different scales for more accurate region classification
  – Implements Spatial Pyramid Pooling using a combination of atrous convolutions and global average pooling
  – BatchNorm applied in between all dilated-conv filters

Instance Segmentation

Instance Segmentation Approaches
● Two main streams of methods in Instance Segmentation
  – Proposal based – propose possible regions of interest where an object might be found → extract features from that region → use features to classify which pixels are part of the object and what class it actually is
  – Segmentation based – start from segmentation as first objective (previously explained models) and learn specially designed transformations or instance boundaries

Instance Segmentation – the case of Mask R-CNN
● Region of Interest (RoI) proposal based architecture
● Builds upon Faster R-CNN [Ren et al., 2015] but introduces two important aspects
  – RoI Align Layer
  – binary mask prediction (for individual instances)
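Returning to the dilated (atrous) convolutions from the DeepLab-v3 slides above, the claim that dilation enlarges the effective receptive field can be checked with a few lines of arithmetic. The sketch below is not taken from the lecture; the function names `effective_kernel_size` and `receptive_field` are hypothetical, and it assumes stride-1 layers only. It uses the standard relation k_eff = k + (k - 1)(d - 1) for a kernel of size k with dilation d.

```python
def effective_kernel_size(kernel_size: int, dilation: int) -> int:
    """Span (in input pixels, along one axis) covered by one dilated kernel."""
    return kernel_size + (kernel_size - 1) * (dilation - 1)


def receptive_field(layers) -> int:
    """Receptive field of a stack of stride-1 conv layers.

    `layers` is a list of (kernel_size, dilation) pairs, first layer first.
    Assumption: every layer has stride 1 (no pooling or strided convs).
    """
    rf = 1
    for kernel_size, dilation in layers:
        # Each stride-1 layer widens the field by (effective size - 1).
        rf += effective_kernel_size(kernel_size, dilation) - 1
    return rf


if __name__ == "__main__":
    # Three plain 3x3 layers vs. three 3x3 layers with dilations 1, 2, 4.
    print(receptive_field([(3, 1), (3, 1), (3, 1)]))
    print(receptive_field([(3, 1), (3, 2), (3, 4)]))
```

With the same number of weights per layer, the dilated stack covers roughly twice the span of the plain stack, which is the "larger context at no extra parameter cost" argument made on the slide.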
The Mask R-CNN framework for instance segmentation. Source: Mask R-CNN, He et al., 2018

Instance Segmentation – A prior on Faster R-CNN
● Faster R-CNN used for object detection
● Fundamental concepts: Region Proposal Network, Anchor Boxes, RoI Pooling
The different layers of Faster R-CNN. Source: alegion.com/faster-r-cnn

Instance Segmentation – A prior on Faster R-CNN
● Faster R-CNN has a backbone network (VGG-16) which feeds the Region Proposal Network and the Class and BBox regressor network
● RPN proposes regions where objects might be found
  – Uses 9 anchor types (3 different scales, 3 different aspect ratios)
  – Predicts 2k object classification scores (whether foreground or background object) and 4k region regression coords; k – number of anchors
Faster R-CNN Architecture. Source: https://towardsdatascience.com/faster-rcnn-object-detection-f865e5ed7fc4

Instance Segmentation – A prior on Faster R-CNN
● Anchor Boxes
  – Take the final feature map from backbone network as input
  – Their sizes are relative to input image
Downsampling ratios of CNN feature maps. Source: Mathworks, anchor boxes for object detection
Anchor boxes: 3 scales, 3 aspect ratios. Source: Medium @smallfishbigsea

Instance Segmentation – A prior on Faster R-CNN
● RoI Pooling
  – For collected Region Proposals, reuse existing convolutional feature map to extract features
  – Crop the convolutional feature map using each proposal and then resize each crop to a fixed size (e.g. 14 x 14 x convdepth) using interpolation (usually bilinear)
Region of Interest Pooling. Source: TryoLabs

Instance Segmentation – A prior on Faster R-CNN
● Final Object Class Assignment and BBox fine-tuning
  – Use RoI Pooling features
  – Classify proposals into one of the classes, plus a background class
  – Better adjust BBox for proposal given the predicted class
R-CNN architecture – final object classifications and BBox regression.
Source: TryoLabs

Instance Segmentation – Back to Mask R-CNN
● Mask R-CNN changes
  – Uses different backbones: ResNeXt-101, FPN (Feature Pyramid Network)
  – Adds a fully convolutional Mask Prediction branch
Left/Right panels show the heads for the ResNet C4 and FPN backbones, to which a mask branch is added. Numbers denote spatial resolution and channels. Arrows denote either conv, deconv, or fc layers. All convs are 3×3, except the output conv which is 1×1, deconvs are 2×2 with stride 2, and we use ReLU in hidden layers. Left: ‘res5’ denotes ResNet’s fifth stage, which for simplicity we altered so that the first conv operates on a 7×7 RoI with stride 1 (instead of 14×14 / stride 2 as in [19]). Right: ‘×4’ denotes a stack of four consecutive convs. Source: Mask R-CNN, He et al., 2018

Instance Segmentation – Back to Mask R-CNN
● Mask R-CNN changes
  – Replaces RoI Pooling with a RoI Align Layer
    ● Avoids quantization of RoI coordinates or spatial bins to feature map grid
RoI Align Layer in action. Source: Mask R-CNN, He et al., 2018

Instance Segmentation – Mask R-CNN Results

A note on Deformable Convolution
Although convolutions help with learning spatially-local biases, they suffer greatly from signal-subsampling issues:
● We are limited to processing patterns with rectangular templates, whereas most natural patterns are not well suited to this template.
● We are limited to sampling points only from the discrete grid. Think of an 8x8 grid in which the object of interest is 2x2. After downsampling to a latent 2x2 grid, said object would no longer have a direct correspondent.
  – In this case, ideally we should be able to query fractional positions in the grid, e.g. a 0.5x0.5 pixel

A note on Deformable Convolution
Deformable convolutions work similarly to RoI Pool / Align operations:
● They can learn non-rectangular patterns by computing pixel offsets.
  ○ For a given filter, instead of applying it to its discrete neighbours, we compute a set of (k, dx, dy) offsets based on which new pixels will be selected.
  ○ Since the (dx, dy) values can be fractional, we can now query subpixel features, i.e. query a position such as (1, 0.5)
  ○ Ideally we would like to have such offsets for each spatial position in each filter, but this would blow up memory: we would be required to compute C·H·W·k^2 additional values at each stage.

A note on Deformable Convolution
The deformation mechanism can be used to build more general mechanisms:
● We make a trade-off between modelling capacity and compute, by setting a constant spatial-offset (k, x, y) for each channel C
● We can retrieve an adaptive-pooling mechanism.
  ○ Average pooling is just a convolution with a uniform kernel with a value of 1/K, where K is the total filter size
  ○ Learning a scalar for each pixel k in the filter allows us to learn a sampling operation that can be a mixture between (Average, Max, Min) operations or something more general.
● This scaling mechanism can be extended to the normal deformable convolution as well
  ○ This can be a double-edged sword, since we might not use a kernel at its full capacity. Scaling a pixel k in the kernel with a value of 0 renders the pixel useless.

Object Detection

Object Detection Approaches
● Two main approaches to Object Detection
  – Multi stage predictors (region proposals and region classification / bounding box regression happen in separate steps): e.g.
Fast/Faster R-CNN, R-FCN
  – Single stage predictors (use a single forward pass through the architecture to perform both object classification and object bounding box regression at the same time): e.g. YOLO, SSD, RetinaNet

Object Detection Approaches
The “anatomy” of an object detector. Source: YOLOv4, Bochkovskiy et al., 2020

A History of YOLOs - YOLO v1
● Processes frames in real time (45 fps)
● Trains on full image and has a single pipeline for detection and localization (as opposed to competitors such as R-CNN)
  – Has high detection accuracy
  – Has lower (than competitors at the time) localization accuracy

A History of YOLOs - YOLO v1 Algorithm
● YOLO algorithm works using the following three techniques:
  – Residual blocks (the image is divided into an S x S grid of cells)
  – Bounding box regression
  – Intersection Over Union (IOU)
● Conceptual design diagram

A History of YOLOs - YOLO v1 Algorithm
● Bounding box regression
  – 5 parameters per bounding box:
    ● (x, y) – relative to the bounds of the grid cell
    ● width, height – relative to whole image
    ● c – confidence = how confident the model is that the box contains an object and also how accurate it thinks the box is that it predicts
  – Confidence = Pr(Object) ∗ IOU^{truth}_{pred}
● Each grid cell predicts B bounding boxes + class conditional probability
  – class conditional probability p(class i | Object) – conditioning is on grid cell containing an object
=> YOLO makes S x S x (B x 5 + C) predictions

A History of YOLOs - YOLO v1 Algorithm
● Intersection over Union
  – Metric used to force predicted output box to coincide with ground truth

A History of YOLOs - YOLO v1 Algorithm
● Loss Function:
  – Regression objective: optimize directly for detection of objects

A History of YOLOs - YOLO v1 Architecture
● YOLO architecture:
  – Inspired by GoogLeNet - 24 convolution layers + 2 fully connected
  – Activation function: ReLU function (all layers except final), linear function (final layer)
  – Input image of 448x448x3
  – To avoid overfitting we use
Dropout Layer and extensive data augmentation.

A History of YOLOs - YOLO v1 Algorithm
● YOLO training procedure:
  – Pretrain first 20 conv layers + avgpool + fc on ImageNet (for 1 week :-) ), at image resolution of 224 x 224
  – Add 4 conv layers and 2 FC layers with random initialization when fine tuning for detection at 448 x 448 resolution

A History of YOLOs - YOLO v1 Limitations
● Original YOLO limitations:
  – Struggles with objects of small sizes that appear in groups (due to strong spatial constraints on bounding box predictions – each grid cell only predicts two boxes and can only have one class)
  – Struggles to generalize to objects in new or unusual aspect ratios – since bounding boxes are learned from data
  – Loss function treats errors the same in small bounding boxes versus large bounding boxes => incorrect localizations

A History of YOLOs - YOLOv2
● Differences to v1:
  – 23 conv layers instead of 24 in YOLO v1
  – Activation function: Leaky ReLU function (all layers except final), linear function (final layer)
  – Batch Normalization: adding BN to all layers of YOLOv1 → 2% improvement in mAP. Helps regularize the model, allowing the Dropout layer to be removed.
  – Pretrain conv network directly on 448x448 resolution for ImageNet classification → 4% gain in mAP
  – Use the concept of anchor boxes – predefined list of best match box sizes (determined using k-means clustering on labeled train set). Predicted box is scaled w.r.t. defined anchor boxes
    ● Predict box center w.r.t. top left corner of the grid cell (scaled by grid width and height)
    ● Predict width and height of the box w.r.t.
an anchor box
  – Multi-scale training – every 10th batch the network is trained on a different input size (image resolution) → facilitates detections at different scales

A History of YOLOs - YOLOv2
● Filtering out boxes of low quality
  – Filter out boxes with low confidence (as in YOLOv1)
  – Filter out boxes with high overlap (high IOU) – non-maximum suppression – keep the box with the highest confidence score

A History of YOLOs - YOLO v3
● Highlights of YOLOv3:
  – Improvements focus on accuracy (especially on small objects) and speed
  – 106 layer network, using 75 conv layers
    ● Uses 23 residual layers at regular intervals to tackle the vanishing gradient problem
  – Predictions at different scales
  – Darknet-53 used as feature extractor (part that can be pretrained)

A History of YOLOs - YOLO v3
● Differences to v1 and v2:
  – Predict objectness score for each bounding box using logistic regression
  – Width and height of box predicted as offsets from cluster centroids (anchors)
  – Center coordinates of BBox predicted relative to location of filter application using sigmoid function
  – Object class classifier uses logistic function instead of softmax (to help with multi-label classification), i.e. classification uses individual binary cross-entropy for each class label. Allows training on the Open Images Dataset, which has multiple labels per detection.
  – Prediction across scales – use of Feature Pyramid Networks – YOLOv3 uses 3 different scales

A History of YOLOs - YOLOv4 and YOLOv5
● Highlights of YOLOv4, YOLOv5:
  – Released within one month of each other – v5 is mostly a PyTorch implementation of v4 (which was developed using the Darknet framework)
  – Much higher detection speed – running at almost 140 fps
  – Improved performance on small object detection (primarily due to the mosaic data augmentation technique)
  – Extensive test of backbone network
  – Use of Path Aggregation Net and Spatial Pyramid Pooling (SPP) blocks to aggregate features from several stages (scales) of the feature extractor network

A History of YOLOs - YOLO v4 and YOLOv5
● YOLOv4/v5 template architecture

A History of YOLOs - YOLO v4 and YOLOv5
● YOLOv4 backbone selection
  – CSPResNext50 and CSPDarknet are both based on DenseNet
  – YOLOv4 chooses CSPDarknet53 as a compromise between receptive field size (the higher the resolution, the higher the chances for small object detection) and FPS

A History of YOLOs - YOLO v4 and YOLOv5
● YOLOv4 neck selection
  – Path Aggregation Network selected as multi-scale feature extractor
    ● Bottom up Path Augmentation → propagate low-level features to enhance entire feature hierarchy
    ● Adaptive Feature Pooling → for a RoI proposal aggregate features from all levels of the bottom up path
    ● Fully connected fusion → exploit inductive bias in FC layer → for detection, relations of features from different spatial positions matter

A History of YOLOs - YOLO v4 and YOLOv5
● YOLOv4 additional modifications (Bag of Freebies and Bag of Specials).
Some highlights:
  – Data Augmentation: Mosaic Data Augmentation (works well for small object detection)
  – Regularization: DropBlock regularization, Class label smoothing
  – Activation Functions: Mish activation
  – Loss: CIoU (complete IoU loss) – addresses problem of non-overlapping bounding boxes; takes into account overlapping area, center distance and aspect ratio

A History of YOLOs - YOLO v4 and YOLOv5 results

Convolutions for Time Series Modeling

Convolutions for Time Series Modeling
● Why Convolutional Architectures for sequence data?
  – More efficient to train than Recurrent Neural Net architectures (see next lecture)
  – Have comparable (if not better) performance than RNN architectures, especially on multi-variate numeric time series (e.g. sensor data streams)
  – Variable length inputs can be handled by sliding the convolution kernel along the sequence input

Convolutions for Time Series - TCN
● Temporal Convolutional Network (TCN)
  – Introduced in article: “An Empirical Evaluation of Generic Convolutional and Recurrent Networks for Sequence Modeling”, Bai et al., 2018
  – Highlights
    ● 1D conv kernels are used to make an output sequence have the same length as the input sequence
    ● Use of causal convolutions
      – There can be no leakage from the future into the past
      – The convolution kernel that outputs yj only looks at inputs xi where i <= j
    ● Use of dilated convolutions to extend receptive field and increase “historical context” when predicting the output sequence
    ● Uses a residual block instead of single conv layers → increases max depth of network → increases “historical context” modeling capabilities

Convolutions for Time Series - TCN
Source: An Empirical Evaluation of Generic Convolutional and Recurrent Networks for Sequence Modeling, Bai et al., 2018

Convolutions for Time Series - TCN
● Temporal Convolutional Networks – strengths and limitations
  – Strengths:
    ● Parallelism – due to application of conv filter => input sequence can be processed as a
whole, not sequentially
    ● Flexible receptive field size: stack more causal conv layers, increase dilation factor, increase filter size
    ● Stable gradients: use of residual blocks, backprop easier than for RNNs
  – Limitations
    ● In test/evaluation mode, the network needs the whole input sequence in memory, instead of just the hidden state vectors (as in the case of RNNs – see next lecture)
    ● May require parameter change for transfer learning
      – If transferring to a domain where the needed “historical context” is greater (i.e. larger filter size or larger dilation factor are needed), TCNs may perform poorly

Convolutions for Time Series - TCN

Convolutions for Time Series - InceptionTime
● InceptionTime
  – Introduced in article: “InceptionTime: Finding AlexNet for Time Series Classification”, Fawaz et al.
  – Highlights
    ● Focuses on time series classification
    ● Uses the multiple perspectives principle introduced in GoogLeNet (Inception Cell) and applies it to time series → extract relevant features from both long and short time series
    ● Uses ensembling (in a bagging manner – initialize the same network architecture 5 times and average the results) to reduce variance in classification performance

Convolutions for Time Series - InceptionTime

Convolutions for Time Series - InceptionTime
● InceptionTime Network
  – 2 residual blocks, each composed of 3 inception modules
  – Global average pooling applied before a FC layer that gives the final classification dimension
  – Ensemble model to reduce variance in classification accuracy of single models

The end :-)
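As a supplementary illustration of the TCN slides above, the causal and dilated convolution properties (output yj depends only on xi with i <= j, output length equals input length, dilation extends the "historical context") can be sketched in plain Python. This is a minimal sketch, not the implementation from Bai et al.; the function names `causal_dilated_conv1d` and `tcn_receptive_field` are hypothetical, the input is a single-channel list of numbers, and the receptive field formula assumes one convolution per dilation level (the residual blocks of a real TCN stack more than one).

```python
def causal_dilated_conv1d(x, weights, dilation=1):
    """Causal dilated 1D convolution on a single-channel sequence.

    y[j] = sum_m weights[m] * x[j - m * dilation], where positions before
    the start of the sequence are treated as zeros (left zero-padding).
    The output therefore has the same length as the input.
    """
    y = []
    for j in range(len(x)):
        total = 0.0
        for m, w in enumerate(weights):
            i = j - m * dilation  # only current or past positions are read
            if i >= 0:
                total += w * x[i]
        y.append(total)
    return y


def tcn_receptive_field(kernel_size, dilations):
    """History length seen by the last output, assuming one stride-1
    causal conv per entry in `dilations` (a simplifying assumption)."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)


if __name__ == "__main__":
    x = [1.0, 2.0, 3.0, 4.0, 5.0]
    # Kernel of size 2 with dilation 2: y[j] = x[j] + x[j - 2]
    print(causal_dilated_conv1d(x, [1.0, 1.0], dilation=2))
    # Doubling dilations grow the history exponentially with depth.
    print(tcn_receptive_field(kernel_size=2, dilations=[1, 2, 4, 8]))
```

Changing the last input value leaves all earlier outputs untouched, which is the "no leakage" property; doubling the dilation at each level is the usual way to reach a long history with few layers.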