Questions and Answers
What does Atrous Spatial Pyramid Pooling (ASPP) primarily aim to improve in the context of semantic segmentation?
- The quality of feature extraction by capturing multi-scale contextual information. (correct)
- The accuracy of object detection within the segmentation.
- The speed of computation for real-time applications.
- The reduction of memory usage during training.
What is the primary metric used for evaluating the performance of object detection models, considering both precision and recall?
- Average Precision (AP) (correct)
- Intersection over Union (IoU)
- Accuracy
- F1-Score
In the context of object detection evaluation, increasing the IoU threshold always results in a higher Average Precision (AP).
False (B)
During bipartite matching for object tracking, what does the Hungarian algorithm help minimize?
What is a key advantage of using dilated convolutions in semantic segmentation?
Transformers are spatially invariant by default due to their inherent design.
What is the purpose of adding the (\sqrt{n}) term in the denominator of the softmax function within the self-attention mechanism?
In the context of Conditional Random Fields (CRF) for image segmentation, what does the pairwise potential primarily model?
The goal of DINO (self-distillation with no labels) is to train a complex, computationally expensive network using self-supervised learning.
What is the main purpose of the 'neural message passing' step in offline MOT (Multi-Object Tracking)?
In the context of cost-flow networks for multi-object tracking, what do the 'entrance/exit costs' typically represent?
A Region Proposal Network (RPN) directly outputs the final object detections without requiring further refinement.
Why is maintaining a high feature resolution considered best practice in semantic segmentation?
What is the purpose of Domain Alignment in the context of computer vision tasks?
In SimCLR, using a higher number of negative samples generally decreases the gradient bias during training.
What key assumption underlies the concept of "entropy minimization" in semi-supervised learning?
In OSVOS, what is the purpose of fine-tuning on the first frame mask of a video sequence?
What is the fundamental idea behind Virtual Adversarial Networks (VANs)?
How does Fast R-CNN differ from R-CNN in terms of processing images?
In Fast R-CNN, the RoI pooling layer handles feature maps of different sizes by warping them into a fixed size.
What layer is used to combine information from different images in FlowNet?
Which of the following is a disadvantage of Feature Pyramid Networks (FPN)?
GOTURN does not require annotation of the first frame.
What happens to the gradients when IoU is zero?
Self-attention has a complexity of O(______) given a sequence with length (n) and a representation dimension of size (d).
In depthwise separable convolutions:
Match the following tracking challenges with their descriptions:
What is the purpose of the ReID similarity in Tracktor?
What describes Hinge Loss?
In Histogram of Oriented Gradients (HOG), what is accomplished during the block normalization step?
In image segmentation, what is the main difference between semantic and instance segmentation?
Panoptic segmentation can be considered as a combination of semantic and instance segmentation.
In multi-object tracking, what does the IDF1 score primarily evaluate?
In MDNet, what is single object tracking in more detail?
In the context of self-supervised learning, what are two things to consider when selecting Pseudo-Labels?
In the context of Multi-Object Tracking, MOTA is a metric where 0 is the best score.
In one-stage detectors how are positive and negative samples balanced?
In ResNet, INTERPOLATE then CONVOLVE is implemented to deal with resizing.
If referring to a sequence with length (n) and a representation dimension of size (d), what is the complexity for self-attention?
Flashcards
Average Precision (AP)
Area beneath the recall-precision curve. Metric for evaluating detection results.
Steps to compute AP
Sort predictions, assign to ground truth with IoU threshold, compute TP/FP, compute recall/precision, and calculate area beneath PR curve.
AP Derivatives
Averages over multiple IoU thresholds, categories by ground truth size. mAP is averaged over classes.
Bipartite matching
Best practices in Segmentation
CNN locality bias
CNN Translation Invariance
Transformers Spatial Relationships
Self-attention equation
Conditional Random Fields (CRF)
Contrastive Random Walk
DINO (self-distillation with no labels)
Offline MOT steps with Neural Message Passing
Cost-flow network parts
Panoptic Segmentation
Dilated Convolutions (Why)
Edge Boxes
Domain alignment
When to use Domain Alignment
SimCLR
Smoothness semi-supervised learning
Virtual Adversarial Networks
Tracktor Concept
How new boxes and lost are recovered in Tracktor.
Difference between depthwise conv and vanilla conv
Hinge Loss
Histogram of Oriented Gradients
Semantic Segmentation
Disjoint trajectories
MP
Faster R-CNN
Masked Autoencoders
Pros and cons of offline MOT
NMS
Difficulties
One-Stage Detectors Output
Overfeat Concept
RoIAlign
High-Resolution good
Image
Study Notes
- Anki Cards from 2023 from the Computer Vision III: Detection, Segmentation and Tracking course at Technische Universität München.
Average Precision
- Average Precision (AP) serves as a metric for evaluating detection results by measuring the area beneath the precision-recall curve.
- Predictions are sorted or binned, and then assigned to a ground truth based on an Intersection over Union (IoU) threshold.
- True Positives (TP) and False Positives (FP) are computed over confidence levels, and recall and precision are calculated from them.
- The area beneath the precision-recall curve is computed; derived metrics average over multiple IoU thresholds or categorize by ground-truth (GT) size.
- Mean Average Precision (mAP) is AP averaged over classes.
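The AP steps can be sketched in a few lines of Python. This is a minimal illustration, not a benchmark implementation: it assumes each prediction has already been marked TP or FP by IoU matching, and it integrates the raw precision-recall curve step by step (benchmarks often use an interpolated variant). The function name `average_precision` and the example data are hypothetical.

```python
def average_precision(scores, is_true_positive, num_ground_truth):
    """Step-wise area under the precision-recall curve (no interpolation)."""
    # Sort predictions by descending confidence.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    tp = fp = 0
    ap = 0.0
    prev_recall = 0.0
    for i in order:
        if is_true_positive[i]:
            tp += 1
        else:
            fp += 1
        precision = tp / (tp + fp)
        recall = tp / num_ground_truth
        # Recall only grows on a true positive, so false positives add no area.
        ap += precision * (recall - prev_recall)
        prev_recall = recall
    return ap


# Four predictions, two ground-truth objects (TP/FP flags assumed given).
scores = [0.9, 0.8, 0.7, 0.6]
flags = [True, False, True, False]
ap = average_precision(scores, flags, num_ground_truth=2)
```

Averaging this value over classes would give mAP.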
Atrous Spatial Pyramid Pooling
- Atrous Spatial Pyramid Pooling (ASPP) uses multiple dilated convolutional layers to capture multi-scale context in images.
- Incorporates various dilation rates within its convolutional kernels (e.g., rates of 6, 12, 18, and 24).
- It can effectively integrate contextual information, which is especially relevant for semantic image segmentation tasks.
Bipartite Matching
- Bipartite matching matches predictions or tracks with detections using the Hungarian algorithm and minimizes assignment costs.
- Cost values indicate how unlikely it is that a detection and a prediction belong together, and can be derived from 1 - IoU.
- A missing detection signifies the termination of a track, whereas a missing prediction indicates the start of a new track.
- Bounding boxes are combined or refined using a Bounding Box regressor for matched entities.
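A small sketch of the matching step with a 1 - IoU cost. To stay dependency-free it finds the minimum-cost assignment by brute force over permutations rather than with the Hungarian algorithm, which gives the same optimum but only scales to a handful of boxes. Boxes are assumed to be `(x1, y1, x2, y2)` tuples, and there are assumed to be at least as many detections as tracks; all names are illustrative.

```python
from itertools import permutations


def iou(a, b):
    """Intersection over Union of two (x1, y1, x2, y2) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def match(tracks, detections):
    """Return a tuple where entry i is the detection index assigned to track i."""
    cost = [[1.0 - iou(t, d) for d in detections] for t in tracks]
    return min(
        permutations(range(len(detections)), len(tracks)),
        key=lambda p: sum(cost[i][j] for i, j in enumerate(p)),
    )


tracks = [(0, 0, 10, 10), (20, 20, 30, 30)]
detections = [(21, 21, 31, 31), (1, 1, 11, 11)]
assignment = match(tracks, detections)
```

Here track 0 is paired with detection 1 and track 1 with detection 0, because those pairs overlap most.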
CNNs vs Transformers
- CNNs possess inductive biases, including locality (related pixels are close together) and translation invariance (the same convolution weights are applied at every position).
- Transformers have no built-in notion of spatial structure: images are divided into patches, and spatial relationships are supplied through positional encodings.
- CNNs rely on locality, whereas transformers use a global attention mechanism.
- Transformers tend to need substantially more computational resources than CNNs.
Conditional Random Fields
- Conditional Random Fields minimize an energy function, primarily for binary labeling.
- The energy function is defined as a sum of unary potentials and pairwise potentials.
- Unary potential represents local information and the likelihood of belonging to a certain class.
- Pairwise potential represents neighborhood information and the dissimilarity of neighboring pixels.
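Written out, the energy described above commonly takes the following form. The symbols are assumptions for illustration: \(x_i\) is the label of pixel \(i\), \(\psi_u\) the unary potential, \(\psi_p\) the pairwise potential, and \(\mathcal{N}\) the set of neighboring pixel pairs.

```latex
E(x) = \sum_{i} \psi_u(x_i) \;+\; \sum_{(i,j) \in \mathcal{N}} \psi_p(x_i, x_j)
```

The labeling that minimizes \(E(x)\) is the segmentation returned by the CRF.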
Contrastive Random Walk
- Contrastive random walk is a technique that constructs palindromes from the frames of a video (the sequence is played forward and then backward).
- Each frame is divided into patches, which are connected across frames so that a walk along the palindrome arrives back at its original location.
- It works by maximizing the similarity between connected patches.
Cosine Similarity
- Cosine similarity measures the similarity between two non-zero vectors by computing the cosine of the angle between them.
- The resulting value corresponds to how close the two vectors are once projected onto the unit sphere.
- It ranges from -1 (opposite directions) to 1 (same direction).
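A minimal sketch of the computation in plain Python, assuming both inputs are non-zero vectors of equal length (the function name is illustrative):

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


same_direction = cosine_similarity([1.0, 0.0], [2.0, 0.0])
opposite = cosine_similarity([1.0, 0.0], [-1.0, 0.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 3.0])
```

Because both vectors are divided by their norms, only direction matters, not length.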
DINO
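- DINO (self-distillation with no labels) involves a student network that is trained to match the output of a teacher network, without using labels.
- In classical knowledge distillation the teacher is pretrained and large while the student is lightweight and simple; in DINO the two networks share the same architecture, and the teacher's weights are an exponential moving average of the student's.
- Classical distillation aims to create an efficient, lightweight student with comparable performance; DINO aims to learn good representations through self-supervision.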
Dilated Convolutions
- In dilated convolutions the receptive field size is r = D(K-1) + 1, where 'r' is the receptive field, 'D' is the dilation rate, and 'K' is the kernel size.
- Scale invariance is improved, and a large receptive field is maintained while using a stride of 1 (so feature resolution is not reduced).
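The formula r = D(K-1) + 1 is easy to check numerically. The sketch below covers a single dilated layer only, and the helper names are hypothetical:

```python
def dilated_receptive_field(kernel_size, dilation):
    """Extent covered by one dilated kernel: r = D * (K - 1) + 1."""
    return dilation * (kernel_size - 1) + 1


def tap_offsets(kernel_size, dilation):
    """Input offsets sampled by the kernel along one axis."""
    return [i * dilation for i in range(kernel_size)]


r_plain = dilated_receptive_field(3, 1)    # ordinary 3x3 convolution
r_dilated = dilated_receptive_field(3, 2)  # same 3 weights, wider footprint
r_aspp = dilated_receptive_field(3, 6)     # a rate like those used in ASPP
offsets = tap_offsets(3, 2)
```

The number of weights stays at K per axis; only the spacing between the sampled inputs grows.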
Domain Alignment
- Domain alignment is used when data originates from distinct distributions or sources.
- Used with Generative Adversarial Networks (GANs), with discriminator trained to distinguish domains.
- Generator network aligns outputs to deceive the discriminator in the GAN setup.
Edge Boxes
- Edge Boxes generates object proposals and returns boxes with a high ratio of contours inside versus outside.
- It works on detected edges: edges are grouped, and candidate boxes are scored against these edge groups.
Entropy Minimization
- Entropy minimization uses the prediction uncertainty on unlabeled data to improve classification.
- Train on labeled data, predict classes of unlabeled data, then add the lowest-entropy (most confident) predictions to the labeled data as pseudo-labels.
- It can be used to strengthen decision boundaries, but risks miscalibration due to wrong pseudo-labels.
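A sketch of the selection step, assuming that confident (low-entropy) predictions are the ones kept as pseudo-labels. The threshold `max_entropy` and the function names are illustrative assumptions, not part of any particular method.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a class-probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def select_pseudo_labels(predictions, max_entropy):
    """Return (sample index, predicted class) for confident predictions."""
    selected = []
    for idx, probs in enumerate(predictions):
        if entropy(probs) <= max_entropy:
            selected.append((idx, probs.index(max(probs))))
    return selected


predictions = [
    [0.98, 0.01, 0.01],  # confident: low entropy
    [0.40, 0.30, 0.30],  # uncertain: high entropy
    [0.05, 0.90, 0.05],  # fairly confident
]
pseudo = select_pseudo_labels(predictions, max_entropy=0.5)
```

The uncertain second sample is left out, which limits (but does not remove) the risk of training on wrong pseudo-labels.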
Feature Pyramid Network
- Feature Pyramid Networks are designed to address the issue that CNNs aren't scale invariant.
- Rather than generating an image pyramid and running each scale through a CNN, an RPN can be defined on each level of a feature pyramid.
Generalised IoU
- Generalised IoU (GIoU) helps to prevent vanishing gradients when Intersection over Union (IoU) is zero.
- GIoU is defined as [GIoU = IoU - \frac{|C \setminus (A \cup B)|}{|C|}], where C is the smallest box enclosing A and B; this ensures a gradient even when IoU is zero.
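The definition can be checked on a pair of disjoint boxes. This is a minimal sketch assuming axis-aligned boxes given as `(x1, y1, x2, y2)` with positive area, where C is taken to be the smallest axis-aligned box enclosing both:

```python
def giou(a, b):
    """Generalised IoU of two (x1, y1, x2, y2) boxes with positive area."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    # Smallest enclosing box C.
    area_c = (max(a[2], b[2]) - min(a[0], b[0])) * (max(a[3], b[3]) - min(a[1], b[1]))
    return inter / union - (area_c - union) / area_c


near = giou((0, 0, 1, 1), (2, 0, 3, 1))  # disjoint, gap of 1
far = giou((0, 0, 1, 1), (3, 0, 4, 1))   # disjoint, gap of 2
same = giou((0, 0, 1, 1), (0, 0, 1, 1))
```

Both disjoint pairs have IoU = 0, yet `far` is lower than `near`, so the value still changes as the boxes move apart and a loss built on it keeps a useful gradient.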
Non-Max Suppression (given bounding boxes)
- Non-Max Suppression is implemented by choosing the highest-scoring box and then eliminating any other box whose overlap (IoU) with it is above a set threshold.
- This is repeated on the remaining boxes until none are left.
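A compact sketch of greedy NMS, assuming boxes are `(x1, y1, x2, y2)` tuples and using an illustrative default threshold of 0.5:

```python
def iou(a, b):
    """Intersection over Union of two (x1, y1, x2, y2) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Return indices of kept boxes, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        # Drop every remaining box that overlaps the kept one too much.
        order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_threshold]
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)
```

The second box overlaps the first heavily and is suppressed; the third is far away and survives.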
Hinge Loss
- Hinge loss is used so that the distance of a positive pair is minimized, while the distance of a negative pair is pushed to be greater than a margin.
- It takes two inputs: a pair of samples.
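One common way to write this pairwise loss is sketched below. The linear (unsquared) form, the default margin of 1.0, and the function name are assumptions; squared variants are also widely used.

```python
def pairwise_hinge_loss(distance, is_positive_pair, margin=1.0):
    """Pull positive pairs together; push negative pairs beyond the margin."""
    if is_positive_pair:
        # Positive pair: any remaining distance is penalised.
        return distance
    # Negative pair: penalised only while it is closer than the margin.
    return max(0.0, margin - distance)


loss_pos = pairwise_hinge_loss(0.3, is_positive_pair=True)
loss_neg_close = pairwise_hinge_loss(0.3, is_positive_pair=False)
loss_neg_far = pairwise_hinge_loss(1.5, is_positive_pair=False)
```

Once a negative pair is farther apart than the margin, it contributes no loss and no gradient.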
Histogram of Oriented Gradients
- Image patches are preprocessed to have a fixed ratio/size, and gradient magnitudes and directions are calculated.
- Patches are divided into cells, and a histogram of gradient orientations is calculated per cell; the histograms are normalized over blocks of cells and flattened into a single feature vector.
How do you do time-aware edges
- Time-aware updates compute the edge-to-node aggregation separately for edges to nodes in past frames and edges to nodes in future frames.
- The two results are concatenated and multiplied with a weight matrix to perform time-aware aggregation.
IDF1
- IDF1 is the ratio of correctly identified detections to the average number of ground-truth and computed detections.
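As a formula, IDF1 is usually written in terms of identity true positives (IDTP), identity false positives (IDFP) and identity false negatives (IDFN). The sketch below assumes those counts have already been obtained from the identity matching between ground-truth and predicted trajectories:

```python
def idf1(idtp, idfp, idfn):
    """IDF1 = 2 * IDTP / (2 * IDTP + IDFP + IDFN)."""
    return 2 * idtp / (2 * idtp + idfp + idfn)


# 80 correctly identified detections, 20 identity FPs, 20 identity FNs.
score = idf1(idtp=80, idfp=20, idfn=20)
```

Here there are 100 ground-truth and 100 computed detections, so the score is 80 / 100.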
List of what applies
- YOLO and SSD are both single-stage detectors.
- DETR does use positional encodings.
- Anchors are used by Faster R-CNN (in its RPN), not by Fast R-CNN, which relies on external region proposals.
List of what applies (Metric Learning)
- Contrastive loss operates on pairs of samples, so it does not use all the relations in the mini-batch.
- Triplet loss modifies contrastive loss to work on triplets (anchor, positive, negative), so it uses more relations.
Mask R-CNN
- Mask R-CNN follows the Faster R-CNN framework but adds a mask head and RoIAlign.
- It consists of a region proposal network (RPN) and an object detection head combined with a segmentation head.
MDNet
Concept
- MDNet performs single-object tracking starting from the first frame, without relying on consistent feature patterns to track.
- It does not need non-maximum suppression. The architecture is encoder-decoder. It uses online adaptation.
- After the first regression, target candidates are drawn, and the optimal state is found.
- A classifier can't concentrate on meaningful proposals because of class imbalance.
MOT
- The steps for offline Multi-Object Tracking are graph construction, feature encoding, neural message passing, and edge classification, which yields a flow solution as in min-cost flow algorithms.
- Cost-flow networks use detections as nodes, model trajectories as disjoint ordered sets of detections, and determine the maximum flow with minimum cost.
Multi Head Attention
- Masked multi-head attention is used at the beginning of the decoder.
- It feeds the Transformer the whole sequence but masks future tokens, so that a position cannot attend to later positions (which would break causality).
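The masking can be sketched on a matrix of raw attention scores: entries for future positions are set to minus infinity before the softmax, so they receive zero weight. This single-head, pure-Python version with an illustrative function name only shows the mask, not a full attention layer.

```python
import math


def causal_attention_weights(scores):
    """Row-wise softmax where position i may only attend to positions j <= i."""
    n = len(scores)
    weights = []
    for i in range(n):
        row = [scores[i][j] if j <= i else float("-inf") for j in range(n)]
        m = max(row)  # subtract the row maximum for numerical stability
        exps = [math.exp(s - m) for s in row]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


# Equal raw scores: each position spreads its weight over itself and the past.
weights = causal_attention_weights([[0.0] * 3 for _ in range(3)])
```

The first row attends only to itself, the second splits its weight over two positions, and no row puts weight on a later position.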
Name the pros and cons of offline MOT with a message passing network.
- A message passing network is more accurate than an online tracker, and its costs are learned from data.
- Strong assumptions on the graph structure help handle occlusions without requiring an end-to-end process, but slower post-processing is required.
One Stage Detectors
- One-stage detectors predict the class directly for each anchor.
- Class imbalance between foreground and background keeps the classifier from concentrating on meaningful proposals; one solution is focal loss, which emphasizes hard examples.
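The effect of focal loss can be seen in a binary sketch. It assumes the common form FL = -(1 - p_t)^gamma * log(p_t) with gamma = 2 and leaves out the optional class-balancing weight; `p` is the predicted foreground probability.

```python
import math


def focal_loss(p, is_positive, gamma=2.0):
    """Binary focal loss for one anchor; gamma = 0 recovers cross-entropy."""
    p_t = p if is_positive else 1.0 - p
    return -((1.0 - p_t) ** gamma) * math.log(p_t)


easy = focal_loss(0.9, is_positive=True)  # well-classified example
hard = focal_loss(0.1, is_positive=True)  # badly misclassified example
cross_entropy_easy = focal_loss(0.9, is_positive=True, gamma=0.0)
```

The easy example is scaled by (1 - 0.9)^2 = 0.01 relative to cross-entropy, so the many easy background anchors contribute little to the total loss.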
OSVOS
- OSVOS trains in three steps: pre-training, parent network training, and fine-tuning on the first-frame mask of the video sequence.
- The network is first trained for classification, then for segmentation.
PAN
- PAN allows you to do semantic segmentation without every item needing itemized detection. Improves connectedness.
Resize Convolutions
- Done by first interpolating and then convolving, which produces fewer artifacts but carries less information.
RetinaNet
- A single-stage detector that combines a ResNet backbone with a Feature Pyramid Network and is trained with focal loss.
- Feature extraction is followed by prediction at multiple pyramid levels.
RoI Pool versus RoI Align
- RoI Pool quantizes the region coordinates to the feature grid.
- RoI Align instead computes feature values with bilinear interpolation. RoI Align is therefore used for segmentation, whereas RoI Pool is used for classification.
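The difference comes down to how a feature is read at a fractional location. The sketch below assumes a feature map stored as a list of rows and a sampling point inside the map; it contrasts bilinear interpolation with simply snapping to the grid, and is not a full RoI Align layer.

```python
import math


def bilinear_sample(feature, x, y):
    """Bilinearly interpolate feature[row][col] at fractional (x, y)."""
    x0, y0 = math.floor(x), math.floor(y)
    x1 = min(x0 + 1, len(feature[0]) - 1)
    y1 = min(y0 + 1, len(feature) - 1)
    dx, dy = x - x0, y - y0
    top = feature[y0][x0] * (1 - dx) + feature[y0][x1] * dx
    bottom = feature[y1][x0] * (1 - dx) + feature[y1][x1] * dx
    return top * (1 - dy) + bottom * dy


feature = [[0.0, 10.0],
           [20.0, 30.0]]
interpolated = bilinear_sample(feature, 0.5, 0.5)  # RoI Align style
quantized = feature[int(0.5)][int(0.5)]            # coordinates snapped to the grid
```

The interpolated value blends all four neighbors, while the quantized lookup returns the top-left cell and loses the sub-pixel position, which matters for pixel-accurate masks.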
Self Training
- Self-training pre-trains on labeled data, predicts pseudo-labels for the unlabeled data, then trains on the combined dataset and repeats.
- It fails if the initial predictions are inaccurate, since low initial accuracy is hard to recover from.
SegNet
- SegNet is a convolutional encoder-decoder that relies on passing pooling indices from the encoder layers to the decoder.
SimCLR
- Self-supervised contrastive learning involves a contrastive loss and a base encoder such as a CNN.
- The gradient is biased; using more negative samples reduces the bias.
Single Shot Detector (SSD)
- Multiple feature scales create a feature pyramid, and data augmentation is crucial.
- Skip connections bring in high-resolution features by concatenation, which may require additional training.
Sliding Window
- Sliding Window selects a template, then measures a distance or correlation at each position.
- The best match is the position with the lowest distance or the highest correlation, depending on the measure.
- However, this can fail when the object changes.
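A minimal sketch of template matching with a sliding window, using the sum of squared differences as the distance (so lower is better). The image and template are assumed to be small 2-D lists of intensities; the function name is illustrative.

```python
def best_match(image, template):
    """Return ((x, y), distance) of the window closest to the template."""
    th, tw = len(template), len(template[0])
    best_pos, best_dist = None, float("inf")
    for y in range(len(image) - th + 1):
        for x in range(len(image[0]) - tw + 1):
            # Sum of squared differences between the window and the template.
            dist = sum(
                (image[y + dy][x + dx] - template[dy][dx]) ** 2
                for dy in range(th)
                for dx in range(tw)
            )
            if dist < best_dist:
                best_pos, best_dist = (x, y), dist
    return best_pos, best_dist


image = [[0, 0, 0, 0],
         [0, 5, 6, 0],
         [0, 7, 8, 0]]
template = [[5, 6],
            [7, 8]]
position, distance = best_match(image, template)
```

With a correlation score instead of a distance, the comparison would be flipped to keep the highest value.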
Spatial Pyramids
- Creates fixed-length representations of inputs by applying pooling on grids that divide the input at multiple sizes, so inputs of any size yield a fixed-length representation.
Swin
- Swin Transformers construct a hierarchy of image patches, with resolution decreasing from stage to stage, and achieve state-of-the-art results with linear computational complexity.
Tracktor + Faster RCNN
- The bounding box from the previous frame is refined on the current frame with the detector's bounding box regressor; detections in the current frame are used to start new tracks.
Transformers
- Transformer layers typically have both a multi-head attention sublayer and an MLP sublayer; the number of layers in the encoder does not need to be the same as the number of attention heads.
Triplet
- Given an anchor, the objective of triplet loss is to make the model bring similar samples closer together and push dissimilar samples farther apart. Training uses sets of images (anchor, positive, negative).
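A sketch of the usual margin-based formulation, assuming Euclidean distance between embeddings and an illustrative margin of 1.0:

```python
import math


def euclidean(u, v):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """max(0, d(anchor, positive) - d(anchor, negative) + margin)."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


anchor = (0.0, 0.0)
satisfied = triplet_loss(anchor, positive=(0.0, 1.0), negative=(3.0, 4.0))
violated = triplet_loss(anchor, positive=(3.0, 4.0), negative=(0.0, 1.0))
```

The loss is zero once the negative is farther from the anchor than the positive by at least the margin; otherwise the violation is penalised linearly.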
Two Stage Detectors
- Stage 1 generates object proposals; stage 2 classifies and refines them.
- It also regresses bounding boxes that enclose the detected objects.
U-Net Network
- U-Net improves segmentation by using skip connections.
Virtual adversarial networks
- Done by pretraining on labelled and augmented examples and maximizing the similarity of predictions on the unlabelled set.