Average Precision & Atrous Spatial Pyramid Pooling

Questions and Answers

What does Atrous Spatial Pyramid Pooling (ASPP) primarily aim to improve in the context of semantic segmentation?

  • The quality of feature extraction by capturing multi-scale contextual information. (correct)
  • The accuracy of object detection within the segmentation.
  • The speed of computation for real-time applications.
  • The reduction of memory usage during training.

What is the primary metric used for evaluating the performance of object detection models, considering both precision and recall?

  • Average Precision (AP) (correct)
  • Intersection over Union (IoU)
  • Accuracy
  • F1-Score

In the context of object detection evaluation, increasing the IoU threshold always results in a higher Average Precision (AP).

False (B)

During bipartite matching for object tracking, what does the Hungarian algorithm help minimize?

The assignment costs between predictions and detections. (C)

What is a key advantage of using dilated convolutions in semantic segmentation?

Increasing the receptive field size without increasing the number of parameters. (A)

Transformers are spatially invariant by default due to their inherent design.

False (B)

What is the purpose of the (\sqrt{d}) scaling term in the denominator within the softmax of the self-attention mechanism?

To stabilize gradients

In the context of Conditional Random Fields (CRF) for image segmentation, what does the pairwise potential primarily model?

The compatibility between the labels of neighboring pixels. (C)

The goal of DINO (self-distillation with no labels) is to train a complex, computationally expensive network using self-supervised learning.

False (B)

What is the main purpose of the 'neural message passing' step in offline MOT (Multi-Object Tracking)?

Propagate cues across the entire graph

In the context of cost-flow networks for multi-object tracking, what do the 'entrance/exit costs' typically represent?

The cost associated with starting or ending a track. (A)

A Region Proposal Network (RPN) directly outputs the final object detections without requiring further refinement.

False (B)

Why is maintaining a high feature resolution considered best practice in semantic segmentation?

To enable finer details and boundaries to be captured accurately. (A)

What is the purpose of Domain Alignment in the context of computer vision tasks?

Align feature domains

In SimCLR, using a higher number of negative samples generally decreases the gradient bias during training.

True (A)

What key assumption underlies the concept of "entropy minimization" in semi-supervised learning?

The classifier should be confident about its predictions on unlabeled data. (A)

In OSVOS, what is the purpose of fine-tuning on the first frame mask of a video sequence?

Learn the appearance of the target object

What is the fundamental idea behind Virtual Adversarial Networks (VANs)?

Making small changes to the input should not change the output label. (A)

How does Fast R-CNN differ from R-CNN in terms of processing images?

Fast R-CNN processes the entire image through the ConvNet only once to generate a feature map. (C)

In Fast R-CNN, the RoI pooling layer handles feature maps of different sizes by warping them into a fixed size.

True (A)

What layer is used to combine information from different images in FlowNet?

Correlation layer

Which of the following is a disadvantage of Feature Pyramid Networks (FPN)?

Increased model complexity. (A)

GOTURN does not require annotation of the first frame.

False (B)

What happens to the gradients when IoU is zero?

They disappear (vanish). (D)

Self-attention has a complexity of O(______) given a sequence with length (n) and a representation dimension of size (d).

n^2d

In depthwise separable convolutions:

It requires a lower number of parameters and has the same output shape as normal convolutions. (A)

Match the following tracking challenges with their descriptions:

Fast motion = Camera and object move rapidly, blurring images. Changing appearance/pose = Object shape or texture changes over time. Dynamic background = Background scene contents evolve over time. Occlusions = Objects are partially or fully obscured for short periods.

What is the purpose of the ReID similarity in Tracktor?

Track recovery. (C)

What describes Hinge Loss?

Positive pair -> minimize distance; negative pair -> distance greater than the margin. (A)

In Histogram of Oriented Gradients (HOG), what is accomplished during the block normalization step?

Contrast is normalized to improve robustness to lighting. (B)

In image segmentation, what is the main difference between semantic and instance segmentation?

Instance segmentation labels different instances from the same class. (C)

Panoptic segmentation can be considered as a combination of semantic and instance segmentation.

True (A)

In multi-object tracking, what does the IDF1 score primarily evaluate?

The consistency of identity assignments over time. (C)

In MDNet, what does single object tracking require as input?

An annotation of the first frame

In the context of self-supervised learning, what are two things to consider when selecting Pseudo-Labels?

A confidence threshold; noisy labels lead to low accuracy. (B)

In the context of Multi-Object Tracking, MOTA is a metric where 0 is the best score.

False (B)

In one-stage detectors how are positive and negative samples balanced?

Focal Loss

In ResNet, INTERPOLATE then CONVOLVE is implemented to deal with resizing.

True (A)

If referring to a sequence with length (n) and a representation dimension of size (d), what is the complexity for self-attention?

(O(n^2d)) (D)

Flashcards

Average Precision (AP)

Area beneath the recall-precision curve. Metric for evaluating detection results.

Steps to compute AP

Sort predictions, assign to ground truth with IoU threshold, compute TP/FP, compute recall/precision, and calculate area beneath PR curve.

AP Derivatives

Averages over multiple IoU thresholds, or categorizes by ground-truth size; mAP averages over classes.

Bipartite matching

Match predictions/tracks with detections by minimizing assignment costs using the Hungarian algorithm.

Best practices in Segmentation

Keep high feature resolution, receptive field size, use skip connections, and image-adaptive post-processing.

CNN locality bias

Pixels close to each other are closely related.

CNN Translation Invariance

Convolutions are translation invariant.

Transformers Spatial Relationships

Transformers divide the image into patches; spatial relationships come from positional encodings.

Self-attention equation

Attention(Q, K, V) = softmax(QK^T / (\sqrt{d})) V over Query, Key, and Value projections; the scaling prevents small gradients.

Conditional Random Fields (CRF)

Goal is to minimize the energy function, often used for binary labeling.

Contrastive Random Walk

Maximize similarity between connected patches in video frames.

DINO (self-distillation with no labels)

A lightweight student network mimics the teacher's output via distillation.

Offline MOT steps with Neural Message Passing

Encode appearance/geometry into node/edge embeddings, propagate cues across the graph, classify edges.

Cost-flow network parts

Nodes: Detections, Edges: Relationship of nodes. Goal: disjoint set of trajectories

Panoptic Segmentation

Combines semantic and instance segmentation.

Dilated Convolutions (Why)

Enlarges the receptive field while keeping a stride of 1; improves scale invariance.

Edge Boxes

Scores candidate boxes by the ratio of edge contours inside vs. outside the box.

Domain alignment

Align feature domains by training a discriminator while training a generator to fool it.

When to use Domain Alignment

Used when data may come from different distributions/domains.

SimCLR

Simple framework for contrastive learning: train a base encoder with a contrastive loss on augmented views.

Smoothness semi-supervised learning

If two input points are close by, their labels should be the same.

Virtual Adversarial Networks

Maximise the similarity between predictions on unlabelled data and on perturbed versions of the same unlabelled data.

Tracktor Concept

Detect objects in a frame, copy locations from the previous frame, refine the boxes with the regression head.

How new and lost boxes are recovered in Tracktor

New tracks come from new detections; lost tracks are recovered via ReID similarity.

Difference between depthwise conv and vanilla conv

Depthwise convolutions operate on individual channel slices of the input; vanilla convolutions are applied across all channels.

Hinge Loss

Positive pair: minimize distance; negative pair: distance > margin.

Histogram of Oriented Gradients

Get patches, compute gradients, extract HOG features, block normalization, train an SVM classifier.

Semantic Segmentation

Labels all pixels into categories ("stuff").

Disjoint trajectories

Maximum flow at minimum cost.

Message Passing (MP)

Message-passing steps can be implemented as learnable network layers.

Faster R-CNN

Replaces the external region proposals that make R-CNN slow with a learned Region Proposal Network.

Masked Autoencoders

Learn feature representations by reconstructing randomly masked-out input patches.

Pros and cons of offline MOT

Pros: more accurate than online trackers, almost end-to-end. Cons: slower than online trackers, post-processing required.

NMS

Keeps only the "best" (highest-scoring, non-overlapping) boxes.

Tracking difficulties

Fast motion (blurred images), changing appearance/pose, dynamic backgrounds, occlusions.

One-Stage Detector Output

Localisation (box), objectness, and class scores.

Overfeat Concept

Combines classification and bounding-box regression in a sliding-window ConvNet.

RoIAlign

Computes feature values with bilinear interpolation (no quantisation).

High feature resolution

Keep high-resolution layers, upsample and concatenate via skip connections; may require stage-wise training.

Sliding window coverage

A template is evaluated at multiple locations covering the image.

Study Notes

  • Anki cards from 2023 for the Computer Vision III: Detection, Segmentation and Tracking course at Technische Universität München.

Average Precision

  • Average Precision (AP) serves as a metric for evaluating detection results by measuring the area beneath the precision-recall curve.
  • Predictions are sorted or binned, and then assigned to a ground truth based on an Intersection over Union (IoU) threshold.
  • True Positives (TP) and False Positives (FP) are computed over confidence levels, and Recall/Precision calculated.
  • The area beneath the Precision-Recall curve is computed; AP variants average over multiple IoU thresholds or categorize by Ground Truth (GT) size.
  • Mean Average Precision (mAP) averages AP over object classes.
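The steps above can be sketched in plain Python (an illustrative sketch, not course code; the function name and interface are made up, and matching predictions to ground truth via an IoU threshold is assumed to have happened already):

```python
def average_precision(scored_predictions, num_gt):
    """AP for one class: area beneath the precision-recall curve.

    scored_predictions: (confidence, is_true_positive) pairs, one per
    prediction, after matching to ground truth with an IoU threshold.
    num_gt: number of ground-truth objects.
    """
    preds = sorted(scored_predictions, key=lambda p: -p[0])  # by confidence
    tp = fp = 0
    points = []  # (recall, precision) as the confidence threshold sweeps down
    for _, is_tp in preds:
        if is_tp:
            tp += 1
        else:
            fp += 1
        points.append((tp / num_gt, tp / (tp + fp)))
    # all-point interpolation: at each recall level use the best precision
    # achievable at equal-or-higher recall
    ap, prev_recall = 0.0, 0.0
    for i, (recall, _) in enumerate(points):
        best_precision = max(p for _, p in points[i:])
        ap += (recall - prev_recall) * best_precision
        prev_recall = recall
    return ap
```

A perfect detector (all predictions true positives, all ground truth found) yields AP = 1.0.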

Atrous Spatial Pyramid Pooling

  • Atrous Spatial Pyramid Pooling (ASPP) uses multiple dilated convolutional layers to capture multi-scale context in images.
  • Incorporates various dilation rates within its convolutional kernels (e.g., rates of 6, 12, 18, and 24).
  • It can effectively integrate contextual information, which is especially relevant for semantic image segmentation tasks.

Bipartite Matching

  • Bipartite matching matches predictions or tracks with detections using the Hungarian algorithm and minimizes assignment costs.
  • Values indicate the likelihood of a detection and prediction belonging together, potentially informed by 1-IoU.
  • A missing detection signifies the termination of a track, whereas a missing prediction indicates the start of a new track.
  • Bounding boxes are combined or refined using a Bounding Box regressor for matched entities.
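The assignment step can be illustrated as follows (a brute-force search stands in for the Hungarian algorithm, which solves the same problem in polynomial time; in practice one would call e.g. `scipy.optimize.linear_sum_assignment`; the 1-IoU costs below are made-up numbers):

```python
from itertools import permutations

def min_cost_assignment(cost):
    """Minimum-cost bipartite matching over a square cost matrix.

    Brute force over all permutations; a stand-in for the Hungarian
    algorithm, which finds the same optimum in O(n^3).
    """
    n = len(cost)
    best_total, best_perm = None, None
    for perm in permutations(range(n)):  # perm maps track i -> detection perm[i]
        total = sum(cost[i][perm[i]] for i in range(n))
        if best_total is None or total < best_total:
            best_total, best_perm = total, perm
    return best_total, best_perm

# Example costs, e.g. 1 - IoU between tracks (rows) and detections (columns).
cost = [[0.1, 0.9, 0.8],
        [0.9, 0.2, 0.7],
        [0.8, 0.9, 0.3]]
total, assignment = min_cost_assignment(cost)
```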

CNNs vs Transformers

  • CNNs possess inductive biases, including locality (related pixels are close) and translation invariance (consistent convolutions).
  • Transformers aren't spatially invariant by default; spatial information comes from positional encodings, and images are divided into patches.
  • Locality is used by CNNs compared to the global attention mechanism used by transformers.
  • Transformers tend to need substantially more computational resources than CNNs.
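For comparison, the global attention mechanism can be sketched in NumPy (an illustrative single-head version with made-up names; real models learn the projections). The n × n score matrix is what makes the cost O(n²d):

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention:
    softmax(Q K^T / sqrt(d)) V, cost O(n^2 d) for n tokens of size d."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d = Q.shape[-1]
    # sqrt(d) scaling keeps the logits moderate, preventing a saturated
    # softmax and thus vanishingly small gradients
    scores = Q @ K.T / np.sqrt(d)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V  # each output is a convex combination of the values

rng = np.random.default_rng(0)
n, d = 4, 8  # 4 "patches", 8-dimensional representation
X = rng.normal(size=(n, d))
identity = np.eye(d)  # toy projections in place of learned Wq, Wk, Wv
out = self_attention(X, identity, identity, identity)
```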

Conditional Random Fields

  • Conditional Random Fields minimize an energy function, primarily for binary labeling.
  • The energy function is defined as a sum of unary potentials and pairwise potentials.
  • The unary potential represents local information: the likelihood of a pixel belonging to a certain class.
  • The pairwise potential represents neighborhood information: the compatibility of the labels of neighboring pixels.

Contrastive Random Walk

  • Contrastive random walk is a technique involving constructing palindromes from frames in a video.
  • Each frame is divided into patches, which are connected across frames so that a walk through the palindrome sequence should arrive back at its starting location.
  • Works by maximizing the similarity between connected patches.

Cosine Similarity

  • Cosine Similarity measures the similarity between two non-zero vectors, by finding the cosine of the angle between them
  • The resulting value corresponds to a distance on the unit sphere.
  • Ranges from -1 (opposite directions) to 1 (same direction).
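A minimal sketch of the formula (hypothetical helper; the definition is just the dot product over the product of norms):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors: a.b / (|a| |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)
```

Orthogonal vectors score 0, parallel vectors 1, and opposite vectors -1.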

DINO

  • DINO (self-distillation with no labels) concept involves a student network that mimics the output of a teacher network.
  • The teacher network is pretrained and has a large architecture, while the student network is lightweight and simple.
  • Aims to create an efficient, lightweight network (student) with comparable performance.

Dilated Convolutions

  • In dilated convolutions the receptive field size is r = D(K-1) + 1, where 'r' is the receptive field, 'D' is the dilation rate, and 'K' is the kernel size.
  • Scale invariance is improved, and a large receptive field is obtained while keeping a stride of 1.
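The receptive-field formula is easy to check numerically (hypothetical helper; for a 3×3 kernel, the ASPP rates 6/12/18/24 give receptive fields of 13, 25, 37, and 49):

```python
def dilated_receptive_field(kernel_size, dilation):
    """Receptive field r = D * (K - 1) + 1 of one dilated convolution."""
    return dilation * (kernel_size - 1) + 1
```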

Domain Alignment

  • Domain alignment is used when data originates from distinct distributions or sources.
  • Used with Generative Adversarial Networks (GANs), with discriminator trained to distinguish domains.
  • Generator network aligns outputs to deceive the discriminator in the GAN setup.

Edge Boxes

  • Edge boxes generates object proposals and returns boxes with a high ratio of contours inside versus outside.
  • Based on detected edges, grouped into edge groups, from which high-scoring boxes are found.

Entropy Minimization

  • Entropy minimization exploits uncertainty on unlabeled data to improve classification.
  • Train on labeled data, predict classes of unlabeled data, then add the lowest-entropy (most confident) samples to the labeled data.
  • Can be used to strengthen decision boundaries, but risks miscalibration due to wrong pseudo-labels.
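The selection step can be sketched as follows (hypothetical helper names; predictions are assumed to be class-probability vectors, and the entropy cutoff is a made-up hyperparameter):

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def select_pseudo_labels(predictions, max_entropy):
    """Keep only the unlabeled samples whose prediction is confident
    (low entropy) enough to be trusted as a pseudo-label."""
    return [p for p in predictions if entropy(p) <= max_entropy]
```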

Feature Pyramid Network

  • Feature Pyramid Networks address the issue that CNN features aren't scale invariant.
  • Rather than generating an image pyramid and running each scale through the CNN, an RPN can be defined on each level of the feature pyramid.

Generalised IoU

  • Generalised IoU (GIoU) helps to prevent vanishing gradients when Intersection over Union (IoU) is zero.
  • GIoU is defined as \(GIoU = IoU - \frac{|C \setminus (A \cup B)|}{|C|}\), where \(C\) is the smallest box enclosing \(A\) and \(B\); this ensures a gradient even when IoU is zero.
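For axis-aligned boxes the definition translates directly into code (an illustrative sketch with a made-up function name; boxes are (x1, y1, x2, y2) tuples):

```python
def giou(a, b):
    """Generalised IoU of two axis-aligned boxes (x1, y1, x2, y2).

    GIoU = IoU - |C \\ (A u B)| / |C| with C the smallest enclosing box;
    unlike plain IoU it stays informative when the boxes do not overlap.
    """
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    iou = inter / union
    # smallest box C enclosing both A and B
    c_area = ((max(a[2], b[2]) - min(a[0], b[0]))
              * (max(a[3], b[3]) - min(a[1], b[1])))
    return iou - (c_area - union) / c_area
```

Identical boxes give 1.0; for disjoint boxes the value goes negative instead of flattening at 0.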

Non-Max Suppression

  • Non-Max Suppression (NMS) chooses the highest-scoring box and eliminates all other boxes that overlap it above a set IoU threshold.
  • This is repeated with the remaining boxes until none are left.
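The greedy procedure above can be sketched as follows (an illustrative implementation with made-up names; boxes are (x1, y1, x2, y2) tuples):

```python
def nms(boxes, scores, iou_threshold):
    """Greedy Non-Max Suppression; returns the indices of kept boxes.

    Keep the highest-scoring box, drop every remaining box that overlaps
    it above the IoU threshold, and repeat.
    """
    def iou(a, b):
        iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
        ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
        inter = iw * ih
        union = ((a[2] - a[0]) * (a[3] - a[1])
                 + (b[2] - b[0]) * (b[3] - b[1]) - inter)
        return inter / union

    remaining = sorted(range(len(boxes)), key=lambda i: -scores[i])
    keep = []
    while remaining:
        best = remaining.pop(0)
        keep.append(best)
        remaining = [i for i in remaining
                     if iou(boxes[best], boxes[i]) <= iou_threshold]
    return keep
```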

Hinge Loss

  • Hinge loss is used so the positive pair minimizes distance, and the negative pair's distance is greater than the margin.
  • Takes a pair of samples as input.
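A minimal sketch of the pairwise loss (hypothetical function name; a common variant squares the positive distance, omitted here for simplicity):

```python
def hinge_pair_loss(distance, is_positive, margin=1.0):
    """Pairwise hinge loss: pull positive pairs together (loss = d) and
    push negative pairs at least `margin` apart (loss = max(0, m - d))."""
    if is_positive:
        return distance
    return max(0.0, margin - distance)
```

A negative pair already beyond the margin contributes zero loss.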

Histogram of Oriented Gradients

  • Image patches are preprocessed to a fixed ratio/size, then gradient magnitudes and directions are calculated.
  • Patches are divided into cells, and a histogram of gradients is calculated, which is then normalized to create a flattened result.

How do you do time-aware edges

  • Time-aware edge-to-node updates independently compute edge-to-node updates for nodes from past and future frames.
  • Multiply concatenated results with a weight matrix to perform time-aware aggregation.

IDF1

  • IDF1 evaluates the number of correctly identified detections over the ground truth and computed detections.

List of what applies

  • YOLO and SSD are both single-stage detectors.
  • DETR uses positional encodings.
  • Faster R-CNN (not Fast R-CNN) uses anchors, via its Region Proposal Network.

List of what applies (Metric Learning)

  • InfoNCE-style contrastive losses use all the relations in the mini-batch.
  • Triplet loss modifies pairwise contrastive loss so that it uses more relations (anchor, positive, negative).

Mask R-CNN

  • Mask R-CNN follows the Faster R-CNN framework but includes a mask head and ROI align.
  • Consists of a region proposal network (RPN), and object recognition head combined with a segmentation head.

MDNet

Concept

  • MDNet does single object tracking given only an annotation of the first frame, with no consistent feature patterns known in advance.
  • Does not need non-maximum suppression; uses online adaptation.
  • After the first regression, target candidates are drawn, and the optimal state is found.
  • Because of class imbalance, the classifier can't concentrate on meaningful proposals without hard-negative mining.

MOT

  • Steps for offline Multi-Object Tracking: graph construction, feature encoding, neural message passing, and edge classification; alternatively a min-cost flow formulation.
  • Cost-flow networks use detections as nodes and recover disjoint trajectories as the maximum flow with minimum cost.

Multi Head Attention

  • Masked Multi-Head Attention is used at the beginning of the decoder.
  • It feeds the Transformer the whole sequence but masks future tokens to avoid breaking causality.

Name the pros and cons of offline MOT with a Message Passing Network.

  • Pros: more accurate than online trackers, with costs learned from data; handles occlusions; almost end-to-end.
  • Cons: strong assumptions on the graph structure, slower than online trackers, and post-processing is required.

One Stage Detectors

  • One-Stage Detectors predict the class directly for each anchor.
  • Class imbalance between foreground and background keeps the classifier from concentrating on meaningful proposals; Focal Loss counteracts this by emphasising hard examples.
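The Focal Loss idea can be sketched for the binary case (an illustrative version with the commonly used gamma and alpha parameters; not course code):

```python
import math

def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Binary focal loss: cross-entropy scaled by (1 - p_t)^gamma.

    Easy, confidently classified samples (p_t near 1) are down-weighted,
    so the many easy background anchors no longer dominate training.
    """
    p_t = p if y == 1 else 1.0 - p          # probability of the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)
```

With gamma = 0 (and alpha = 1) this reduces to plain cross-entropy; larger gamma suppresses easy examples more aggressively.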

OSVOS

  • OSVOS trains in steps: pre-training, parent-network training, and fine-tuning on the first-frame mask.
  • First classification, then segmentation.

PAN

  • PAN allows semantic segmentation without requiring itemised detection of every object; improves connectivity.

Resize Convolutions

  • Interpolate first, then convolve: fewer checkerboard artifacts than transposed convolutions, at the cost of some information.

RetinaNet

  • RetinaNet combines a ResNet backbone with a Feature Pyramid Network in a single-stage detector trained with Focal Loss.
  • Performs feature extraction and multi-scale prediction.

RoI Pool versus RoI Align

  • RoI Pool quantises RoI coordinates; RoI Align does not.
  • RoI Align computes feature values with bilinear interpolation; Align is therefore used for segmentation, whereas Pool suffices for classification.

Self Training

  • Self-Training: pre-train on labeled data, predict pseudo-labels for unlabeled data, then combine and repeat on the dataset.
  • It fails if initial predictions are inaccurate; noisy pseudo-labels lead to low accuracy.

SegNet

SegNet is a convolutional encoder-decoder that passes pooling indices from encoder layers to the decoder for upsampling.

SimCLR

  • Self-supervised contrastive learning involves a contrastive loss and a base encoder such as a CNN.
  • The gradient is biased; more negative samples reduce the bias.

Single Shot Detector (SSD)

  • Predicts from multiple feature scales, creating a pyramid inside the network; data augmentation is crucial.
  • Skip connections concatenate high-resolution features, which may require stage-wise training.

Sliding Window

  • Sliding Window: select a template, then measure distance or correlation at each position.
  • Depending on the measure, either the highest (correlation) or lowest (distance) score is best.
  • Fails when the object's appearance changes.

Spatial Pyramids

  • Creates fixed-length representations of inputs by applying pooling on grids that divide the input into multiple sizes.

Swin

  • Swin Transformers construct hierarchies of image patches with attention in shifted local windows, giving computation linear in image size and state-of-the-art results.

Tracktor + Faster RCNN

  • The bounding box from the previous frame is refined on the current frame with the bounding-box regressor, or replaced by a detection in that frame.

Transformers

Transformer layers typically have both a multi-head attention and an MLP sublayer; the number of layers in the encoder need not equal the number of attention heads.

Triplet Loss

  • Given an anchor, the objective of Triplet Loss is to bring similar (positive) samples closer to the anchor than dissimilar (negative) ones, by at least a margin. Training uses triplets of images from the dataset.
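Given precomputed anchor-positive and anchor-negative distances, the loss is one line (hypothetical function name; margin value is illustrative):

```python
def triplet_loss(d_anchor_positive, d_anchor_negative, margin=0.2):
    """Triplet loss: the anchor must be closer to the positive than to the
    negative by at least `margin`, otherwise the violation is penalised."""
    return max(0.0, d_anchor_positive - d_anchor_negative + margin)
```

Triplets that already satisfy the margin contribute zero loss, so training focuses on the violating ones.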

Two Stage Detectors

  • Stage 1 generates object proposals; stage 2 classifies and refines them.
  • Proposals are bounding boxes likely to enclose objects.

U-Net

U-Net improves segmentation quality by adding skip connections between encoder and decoder.

Virtual Adversarial Networks

  • Pre-train on labelled (and augmented) examples, then maximise the similarity between predictions on unlabelled data and on perturbed versions of the same data.
