Image Segmentation and CNNs

Questions and Answers

What is the primary goal of image segmentation?

To classify an entire image into a single category.
To enhance the resolution of an image.
To label each pixel in the image with a category label. (correct)
To detect objects in an image.

In semantic segmentation, different instances of the same object class are differentiated.

False (B)

What is a key limitation of using a sliding window approach for semantic segmentation?

It fails to capture contextual information.
It is computationally efficient.
It does not classify each pixel in the image.
It is very inefficient due to not reusing shared features between overlapping patches. (correct)

Fully convolutional networks (FCNs) address the limitations of sliding window approaches by making _______ for pixels all at once.

predictions

Match the image analysis tasks with their descriptions.

Image Classification = Assigning a single label to an entire image.
Object Detection = Identifying and localizing multiple objects within an image.
Semantic Segmentation = Labeling each pixel in an image with a category.
Instance Segmentation = Identifying and segmenting each distinct object instance in an image.

Why can classification architectures be problematic for semantic segmentation?

They often reduce the feature spatial sizes to go deeper, but semantic segmentation requires the output size to be same as input size. (C)

Downsampling operators are essential in a fully convolutional network for semantic segmentation to maintain the original resolution of the input image throughout the network.

False (B)

In the context of semantic segmentation, what is the purpose of 'upsampling'?

To increase the spatial resolution of the feature maps. (A)

The 'bed of nails' method is a type of ___________ technique used in semantic segmentation.

unpooling

What information is retained and used in max unpooling during the upsampling process?

the positions of the maximum elements from the pooling layer

In transposed convolution, the filter moves a number of pixels in the output for every one pixel in the input.

True (A)

What is the main purpose of using transposed convolution in semantic segmentation?

To increase the spatial resolution of feature maps. (C)

A key issue with downsampling-then-upsampling approaches is that important details and ________ information may be lost during downsampling.

spatial

What type of connections helps to recover lost spatial details in semantic segmentation when adopting a downsampling-then-upsampling approach?

residual connections

What is the primary benefit of using residual connections in segmentation networks?

They help recover lost spatial details and improve segmentation accuracy. (B)

In a U-Net architecture, features are downsampled in the decoder and upsampled in the encoder, with skip connections in between.

False (B)

What is a characteristic feature of the U-Net architecture?

It resembles the shape of the letter "U". (C)

A U-Net is widely used for ________ tasks, especially in biomedical imaging.

segmentation

What is one advantage of using a pre-trained encoder (e.g., ResNet, EfficientNet) in a U-Net architecture?

It helps utilize features learned on large datasets (e.g., ImageNet).

Regarding a pretrained U-Net, which part of the network is usually retrained for the specific task?

Only the decoder is trained since the encoder already contains pre-trained weights. (B)

With semantic segmentation, we can differentiate each pixel in the image, and differentiate instances of the same class.

False (B)

What is the primary difference between semantic segmentation and instance segmentation?

Instance segmentation separates object instances, while semantic segmentation does not. (A)

While semantic segmentation labels each pixel in the image, it does not differentiate ________; it only cares about pixels.

instances

What does instance segmentation achieve that semantic segmentation does not?

It separates individual object instances, but only for things (distinct objects), not stuff (background).

Which of the following is true for panoptic segmentation?

It labels all pixels in the image, whether they belong to 'things' or 'stuff'. (B)

Panoptic segmentation focuses only on labeling distinct object instances (things) in an image and ignores labeling the background (stuff).

False (B)

In contrast to semantic segmentation, panoptic segmentation labels all pixels in the image, both ________ and ________.

things / stuff

The output O(x, y) of a transposed convolution is computed as: O(x, y) = ∑ I(i,j) ⋅ K(x − i ⋅ s, _______)

y − j ⋅ s

What does dilated convolution do?

It inserts spacing between kernel elements.

How is strided convolution related to learnable downsampling?

Strided convolution can be interpreted as learnable downsampling. (B)

Match the description with the relevant layer:

Feature maps = Passed as the output to the next layer.
Conv 3x3, ReLU = Adds non-linearity to the process.
Concatenation = Combines the tensors.
Up-sampling 2x2 = Resizes feature maps to a larger size.

What does pooling do?

It downsamples the image or feature map. (B)

In the U-Net architecture, features are downsampled in both the encoder and the decoder.

False (B)

What type of features are downsampled and upsampled between layers?

segments objects (C)

What is the most crucial component when convolving an image?

kernel (C)

What can be very expensive when performed at the original image resolution?

Convolutions (C)

In the Nearest Neighbor method, a 2x2 input produces a 2x2 output.

False (B)

The stride gives the ratio between ___________ in the output and in the input.

movement

Which type of segmentation does not differentiate instances?

Semantic (B)

Match the following with its description

Classification = Assigning one label (e.g., cat, dog, truck, plane) to the whole image.
Semantic Segmentation = Labeling the pixels of the image with categories such as cat, sky, trees, and grass.
Object Detection = Drawing bounding boxes around detected object instances.

What is the core computer vision task?

Image Classification (A)

Flashcards

Semantic Segmentation

Assigning a category label to each pixel in an image.

Instance Segmentation

Classifying each pixel and differentiating distinct object instances.

Panoptic Segmentation

Classifying every pixel, distinguishing between distinct objects and background elements.

Semantic Segmentation with Sliding Window

Classifying each pixel by running a CNN on a patch extracted around it.

Semantic Segmentation with Convolution

Encoding the entire image with a conv net and doing semantic segmentation on top.

Fully Convolutional Network

A network with convolutional layers that makes pixel predictions all at once, without downsampling operators.

Upsampling

Increasing the spatial resolution of a feature map.

Nearest Neighbor Upsampling

Repeating values to increase resolution.

Bed of Nails Unpooling

Placing each input value in the top-left corner of its output block and filling the rest with zeros.

Max Unpooling

Uses max pooling indices to upsample.

Transposed Convolution

Learns to upsample using filters.

Stride

The step size of the filter movement in transposed convolution.

Residual Connections

Connections that pass features from downsampling layers to upsampling layers to preserve spatial information.

Addition (Residual)

Adds features element-wise from the encoder to the decoder.

Concatenation (Residual)

Concatenates features from the encoder to the decoder along the channel dimension.

U-Net

U-shaped architecture with residual connections for segmentation.

Pretrained U-Net

Utilizing a pre-trained model for the encoder part of U-Net.

Study Notes

Computer Vision presentation by Naeemullah Khan, February 19, 2025

Learning Outcomes

Understand image segmentation fundamentals and its importance.
Understand how Convolutional Neural Networks (CNNs) are adapted for segmentation tasks.
Understand different upsampling techniques in segmentation models.
Understand the role of residual connections and the U-Net architecture in segmentation.
Differentiate between instance and panoptic segmentation.

Image Classification

Image classification is a core task in computer vision.

Computer Vision Tasks

Classification assigns a single label to an entire image; the label has no spatial extent.
Semantic Segmentation classifies each pixel in an image into a predefined set of categories, resulting in a pixel-wise labeling of the image.
Object Detection identifies and localizes multiple objects within an image by drawing a bounding box around each object.
Instance Segmentation is similar to object detection, but instead of providing bounding boxes, it delineates each object instance at the pixel level.

Semantic Segmentation

With paired training data, each pixel is labeled with a semantic category.
At test time, each pixel of a new image gets classified.
Classifying each pixel in isolation is not feasible; the surrounding context has to be taken into account.

Semantic Segmentation Idea: Sliding Window

A sliding window approach extracts a patch around each pixel of interest and classifies the patch's center pixel with a CNN.
This is very inefficient because shared features between overlapping patches are not reused.
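
To make the inefficiency concrete, here is a minimal pure-Python sketch of the patch extraction step. The function names, the zero padding at the borders, and the odd patch size are assumptions made for illustration; the CNN that would classify each patch is left out.

```python
def extract_patch(image, row, col, size):
    """Return the size x size patch centered on (row, col).

    Assumes an odd patch size; positions outside the image are zero-padded.
    """
    half = size // 2
    height, width = len(image), len(image[0])
    patch = []
    for r in range(row - half, row + half + 1):
        patch_row = []
        for c in range(col - half, col + half + 1):
            inside = 0 <= r < height and 0 <= c < width
            patch_row.append(image[r][c] if inside else 0)
        patch.append(patch_row)
    return patch


def sliding_window_reads(height, width, size):
    """Count pixel reads when one patch is extracted for every pixel."""
    return height * width * size * size


image = [
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9],
]
center_patch = extract_patch(image, 1, 1, 3)  # covers the whole 3x3 image
corner_patch = extract_patch(image, 0, 0, 3)  # [[0, 0, 0], [0, 1, 2], [0, 4, 5]]
reads = sliding_window_reads(3, 3, 3)         # 81 reads for only 9 distinct pixels
```

Every pixel is read once per patch that contains it, so neighboring patches repeat almost all of each other's work.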

Semantic Segmentation Idea: Convolution

Instead, encode the entire image with a convolutional network and perform semantic segmentation on top of the resulting features.
Classification architectures reduce feature spatial sizes to go deeper, but semantic segmentation requires the output size to be the same as input size.

Semantic Segmentation Idea: Fully Convolutional

Design a network of convolutional layers without downsampling operators, so that predictions are made for all pixels at once.
However, convolutions at the original image resolution can be very expensive.
Instead, design the network with downsampling and upsampling inside:
- High-res: D₁ x H/2 x W/2
- Med-res: D₂ x H/4 x W/4
- Low-res: D₃ x H/8 x W/8

In-Network Upsampling: Unpooling

Nearest Neighbor unpooling duplicates each value across its output block, resulting in a blocky output.
"Bed of Nails" unpooling places each input value in the top-left corner of its output block and fills the rest with zeros.

In-Network Upsampling: Max Unpooling

Max Unpooling remembers which element was the maximum in each window of the corresponding pooling layer and uses those positions to place values during upsampling.
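
The three in-network upsampling schemes described so far (nearest neighbor, bed of nails, and max unpooling) can be sketched in plain Python on nested lists. This is an illustrative sketch only: the function names are made up here, the factor and window size are assumed to be 2, and the input height and width are assumed to be divisible by the window size.

```python
def nearest_neighbor_unpool(feature, factor=2):
    """Repeat every value factor x factor times."""
    out = []
    for row in feature:
        wide = [v for v in row for _ in range(factor)]
        for _ in range(factor):
            out.append(list(wide))
    return out


def bed_of_nails_unpool(feature, factor=2):
    """Put each value in the top-left corner of its block, zeros elsewhere."""
    height, width = len(feature), len(feature[0])
    out = [[0] * (width * factor) for _ in range(height * factor)]
    for r in range(height):
        for c in range(width):
            out[r * factor][c * factor] = feature[r][c]
    return out


def max_pool_with_indices(feature, size=2):
    """Max pooling that also records where each maximum came from."""
    pooled, indices = [], []
    for r0 in range(0, len(feature), size):
        pooled_row, index_row = [], []
        for c0 in range(0, len(feature[0]), size):
            best = (r0, c0)
            for r in range(r0, r0 + size):
                for c in range(c0, c0 + size):
                    if feature[r][c] > feature[best[0]][best[1]]:
                        best = (r, c)
            pooled_row.append(feature[best[0]][best[1]])
            index_row.append(best)
        pooled.append(pooled_row)
        indices.append(index_row)
    return pooled, indices


def max_unpool(values, indices, out_height, out_width):
    """Place each value at its remembered max position, zeros elsewhere."""
    out = [[0] * out_width for _ in range(out_height)]
    for value_row, index_row in zip(values, indices):
        for value, (r, c) in zip(value_row, index_row):
            out[r][c] = value
    return out


small = [[1, 2], [3, 4]]
nn = nearest_neighbor_unpool(small)   # [[1, 1, 2, 2], [1, 1, 2, 2], [3, 3, 4, 4], [3, 3, 4, 4]]
nails = bed_of_nails_unpool(small)    # [[1, 0, 2, 0], [0, 0, 0, 0], [3, 0, 4, 0], [0, 0, 0, 0]]

feature = [
    [1, 2, 6, 3],
    [3, 5, 2, 1],
    [1, 2, 2, 1],
    [7, 3, 4, 8],
]
pooled, indices = max_pool_with_indices(feature)  # pooled == [[5, 6], [7, 8]]
restored = max_unpool(pooled, indices, 4, 4)
```

In `restored`, the values 5, 6, 7, and 8 return to the positions they occupied in `feature`, which is the spatial information that bed-of-nails unpooling discards.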

Learnable Upsampling: Transposed Convolution

Normal Convolution: a 3x3 convolution with stride 2 and pad 1.
In normal convolution, the filter moves 2 pixels in the input for every one pixel in the output.
Strided convolution can be interpreted as "learnable downsampling".
Transposed Convolution: a 3x3 transposed convolution with stride 2 and pad 1.
Each input value gives the weight for a copy of the filter that is written into the output.
The filter moves 2 pixels in the output for every one pixel in the input.
The stride gives the ratio between movement in the output and movement in the input.
Where the filter copies overlap in the output, their contributions are summed.
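
As a rough illustration of the "learnable downsampling" reading of strided convolution, the sketch below applies a 3x3 filter with stride 2 and pad 1 to a 4x4 input. The function name and the all-ones filter are assumptions for the example; as in most deep learning libraries, the filter is applied without flipping (cross-correlation).

```python
def strided_conv2d(inp, kernel, stride=2, pad=1):
    """Single-channel convolution with zero padding (no kernel flip)."""
    in_h, in_w = len(inp), len(inp[0])
    k_h, k_w = len(kernel), len(kernel[0])
    out_h = (in_h + 2 * pad - k_h) // stride + 1
    out_w = (in_w + 2 * pad - k_w) // stride + 1
    out = [[0] * out_w for _ in range(out_h)]
    for y in range(out_h):
        for x in range(out_w):
            total = 0
            for a in range(k_h):
                for b in range(k_w):
                    # The filter origin moves `stride` input pixels per output pixel.
                    r = y * stride + a - pad
                    c = x * stride + b - pad
                    if 0 <= r < in_h and 0 <= c < in_w:
                        total += inp[r][c] * kernel[a][b]
            out[y][x] = total
    return out


image = [[1] * 4 for _ in range(4)]
kernel = [[1] * 3 for _ in range(3)]
downsampled = strided_conv2d(image, kernel)  # [[4, 6], [6, 9]]
```

The 4x4 input becomes a 2x2 output, and in a trained network the filter weights that produce it would be learned rather than fixed.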

Mathematical Definition: Transposed Convolution

The output O(x, y) of a transposed convolution is computed as: O(x, y) = ∑ I(i,j) ⋅ K(x − i ⋅ s, y − j ⋅ s)
- I(i,j) is the input value at (i,j)
- K(x', y') is the kernel value at (x', y')
- s is the stride
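
The definition above can be turned into a short pure-Python sketch. It assumes a single channel, no padding, and that the first index of O, I, and K is the row; the function name is hypothetical. Rather than looping over output positions, it loops over inputs and scatters I(i,j) ⋅ K into the output, which yields the same sum.

```python
def transposed_conv2d(inp, kernel, stride=2):
    """O(x, y) = sum over (i, j) of I(i, j) * K(x - i*s, y - j*s), no padding."""
    in_h, in_w = len(inp), len(inp[0])
    k_h, k_w = len(kernel), len(kernel[0])
    out_h = (in_h - 1) * stride + k_h
    out_w = (in_w - 1) * stride + k_w
    out = [[0] * out_w for _ in range(out_h)]
    for i in range(in_h):
        for j in range(in_w):
            for a in range(k_h):
                for b in range(k_w):
                    # x = i*s + a and y = j*s + b, so K(x - i*s, y - j*s) = K(a, b).
                    out[i * stride + a][j * stride + b] += inp[i][j] * kernel[a][b]
    return out


kernel_of_ones = [[1] * 3 for _ in range(3)]
upsampled = transposed_conv2d([[1, 2], [3, 4]], kernel_of_ones)
# [[1, 1, 3, 2, 2],
#  [1, 1, 3, 2, 2],
#  [4, 4, 10, 6, 6],
#  [3, 3, 7, 4, 4],
#  [3, 3, 7, 4, 4]]
```

The center value 10 is where all four filter copies overlap and their contributions are summed.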

Transposed Convolution Output Size Formula

Hout = (Hin − 1) × stride[0] − 2 × padding[0] + dilation[0] × (kernel_size[0] − 1) + output_padding[0] + 1
Wout = (Win − 1) × stride[1] − 2 × padding[1] + dilation[1] × (kernel_size[1] − 1) + output_padding[1] + 1
- Hout, Wout: output height and width
- Hin, Win: input height and width
- stride: step size of the filter movement
- padding: number of pixels added around the input
- dilation: spacing between kernel elements
- kernel_size: size of the convolution filter
- output_padding: additional padding applied to the output
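
The output size formula is easy to check numerically. The helper below is a direct transcription for one spatial dimension; the function name and default values are chosen here for illustration.

```python
def transposed_conv_output_size(size_in, stride=1, padding=0, dilation=1,
                                kernel_size=3, output_padding=0):
    """Output height or width of a transposed convolution along one dimension."""
    return ((size_in - 1) * stride
            - 2 * padding
            + dilation * (kernel_size - 1)
            + output_padding
            + 1)


# 3x3 kernel, stride 2, no padding: a 2-pixel input grows to 5 pixels.
no_pad = transposed_conv_output_size(2, stride=2, kernel_size=3)

# 3x3 kernel, stride 2, pad 1: a 4-pixel input grows to 7 pixels.
pad_one = transposed_conv_output_size(4, stride=2, padding=1, kernel_size=3)

# Adding output_padding=1 makes the same layer double the size exactly (4 -> 8).
doubled = transposed_conv_output_size(4, stride=2, padding=1, kernel_size=3,
                                      output_padding=1)
```

The last case shows one common reason for the output_padding term: with stride 2, pad 1, and a 3x3 kernel, it is what makes the output exactly twice the input size.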

Semantic Segmentation Idea: Fully Convolutional

Design the network as a stack of convolutional layers, with downsampling and upsampling inside the network:
- High-res: D₁ x H/2 x W/2
- Med-res: D₂ x H/4 x W/4
- Low-res: D₃ x H/8 x W/8
Downsampling: Pooling, strided convolution
Upsampling: Unpooling or strided transposed convolution

Can We Do Better?

The downsampling-then-upsampling approach works well for semantic segmentation.
However, important details and spatial information may be lost during downsampling.
Introduce residual (skip) connections to preserve spatial information.

Residual Connections in Segmentation

Connect features from downsampling layers to upsampling layers.
Help recover lost spatial details and improve segmentation accuracy.
Two types of residual connections:
- Addition: Adds features from the encoder to the decoder element-wise.
- Concatenation: Concatenates features from the encoder to the decoder along the channel dimension.
Concatenation is often better because it retains more feature information from the encoder.
Concatenation might be harder to implement because the encoder and decoder feature maps must be aligned to the same spatial size before they are joined.
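
The difference between the two connection types shows up most clearly in the resulting shapes. The sketch below stores a feature map as nested lists indexed [channel][row][column]; the helper names and the toy values are assumptions for illustration.

```python
def shape(tensor):
    """Return (channels, height, width) of a nested-list feature map."""
    return (len(tensor), len(tensor[0]), len(tensor[0][0]))


def add_skip(encoder, decoder):
    """Element-wise sum; both maps must have the same C x H x W shape."""
    if shape(encoder) != shape(decoder):
        raise ValueError("addition needs identical shapes")
    return [
        [[e + d for e, d in zip(e_row, d_row)]
         for e_row, d_row in zip(e_channel, d_channel)]
        for e_channel, d_channel in zip(encoder, decoder)
    ]


def concat_skip(encoder, decoder):
    """Stack along the channel dimension; only H and W must match."""
    if shape(encoder)[1:] != shape(decoder)[1:]:
        raise ValueError("concatenation needs matching height and width")
    return [[list(row) for row in channel] for channel in encoder + decoder]


encoder = [[[1, 2], [3, 4]], [[5, 6], [7, 8]]]          # 2 x 2 x 2
decoder = [[[10, 20], [30, 40]], [[50, 60], [70, 80]]]  # 2 x 2 x 2

added = add_skip(encoder, decoder)      # shape stays (2, 2, 2)
joined = concat_skip(encoder, decoder)  # shape becomes (4, 2, 2)
```

Addition merges the two sources into one set of channels, while concatenation keeps the encoder channels intact next to the decoder channels, at the cost of a wider input for the following layer.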

U-Net

With these encoder-to-decoder connections, the architecture is called U-Net.
The architecture resembles the shape of the letter "U".
Features are downsampled in the encoder and upsampled in the decoder, with skip connections in between.
U-Net is widely used for segmentation tasks, especially in biomedical imaging.
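
The U shape can be seen by tracing feature map shapes through the network. The sketch below is only shape bookkeeping for a hypothetical three-level U-Net: the channel counts (64, 128, 256), the factor-2 downsampling and upsampling, and the assumption that each upsampling step reduces the channels to match the skip connection are all illustrative choices, not taken from the notes.

```python
def unet_shape_trace(height, width, channels=(64, 128, 256)):
    """List (stage, channels, height, width) for a hypothetical U-Net."""
    trace = []
    skips = []
    h, w = height, width

    # Encoder: every level except the deepest saves a skip and halves H and W.
    for level, c in enumerate(channels):
        trace.append(("encoder", c, h, w))
        if level < len(channels) - 1:
            skips.append((c, h, w))
            h, w = h // 2, w // 2

    # Decoder: upsample by 2, concatenate the matching skip, then convolve.
    for skip_c, skip_h, skip_w in reversed(skips):
        h, w = h * 2, w * 2
        if (h, w) != (skip_h, skip_w):
            raise ValueError("input size must be divisible by 2 at every level")
        trace.append(("concat", 2 * skip_c, h, w))  # skip_c upsampled + skip_c skip
        trace.append(("decoder", skip_c, h, w))
    return trace


trace = unet_shape_trace(64, 64)
for stage in trace:
    print(stage)
```

Reading the trace top to bottom, the resolution falls from 64 to 16 along the encoder and climbs back to 64 along the decoder, with each concat stage joining features of equal resolution from the two arms of the U.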

Using a Pretrained U-Net

The encoder can use a pretrained backbone (e.g., ResNet, EfficientNet).
This reuses useful features learned on large datasets (e.g., ImageNet).
Only the decoder is trained from scratch for segmentation-specific tasks.

Semantic Segmentation

Labels each pixel in the image with a category label.
Does not differentiate instances; it only cares about pixels.

Instance Segmentation

Separates object instances, but only for things (distinct objects), not for stuff (background).

Panoptic Segmentation

Labels all pixels in the image, both things (distinct object instances) and stuff (background).
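
One way to see how the three outputs relate is to compare their label maps on a toy image. In the sketch below the class names, the set of thing classes, and the instance ids are all made up for illustration, and the instance ids are given rather than predicted. The panoptic map is represented as (class, instance id) pairs, with id 0 used for stuff.

```python
THING_CLASSES = {"cat"}  # hypothetical: classes that have countable instances


def to_panoptic(semantic, instance_ids):
    """Combine a semantic map and an instance id map into (class, id) pairs."""
    panoptic = []
    for label_row, id_row in zip(semantic, instance_ids):
        row = []
        for label, instance in zip(label_row, id_row):
            # Stuff pixels keep their class but carry no instance id.
            row.append((label, instance if label in THING_CLASSES else 0))
        panoptic.append(row)
    return panoptic


# Semantic segmentation: both cats share the label "cat".
semantic = [
    ["sky", "sky", "sky", "sky"],
    ["cat", "cat", "grass", "cat"],
]

# Instance segmentation: the two cats get different ids; stuff is not covered.
instance_ids = [
    [0, 0, 0, 0],
    [1, 1, 0, 2],
]

panoptic = to_panoptic(semantic, instance_ids)
# Bottom row: [("cat", 1), ("cat", 1), ("grass", 0), ("cat", 2)]
```

Every pixel ends up with a class, and pixels of thing classes additionally say which instance they belong to.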
