

Full Transcript


Outline
- Semantic segmentation
- Fully convolutional networks
- Operations for dense prediction: transposed convolutions, unpooling
- Architectures for dense prediction: DeconvNet, SegNet, U-Net
- Instance segmentation: Mask R-CNN

The Task
Assign a class label to every pixel (e.g., person, grass, trees, motorbike, road).

Semantic Segmentation vs. Instance Segmentation
CNNs for dense image labeling span image classification, object detection, semantic segmentation, and instance segmentation. Instance segmentation is an extension of object detection in which a binary mask (object vs. background) is associated with every bounding box. This gives more fine-grained information about the extent of the object within the box. In short: Instance Segmentation = Object Detection + Semantic Segmentation.
K. He, G. Gkioxari, P. Dollar, and R. Girshick, Mask R-CNN, ICCV 2017 (Best Paper Award)

Semantic segmentation: applications
Medical imaging: segmentation output compared to ground truth ("manual segmentation"), with lesions in green and the liver in red (liver segmentation and extraction). Autonomous driving: a real-time segmented road scene.

Fully convolutional networks
Design a network with only convolutional layers and make predictions for all pixels at once. Can the network operate at full image resolution? The practical solution is to first downsample, then upsample. (Source: Stanford CS231n)

Fully convolutional networks (FCN)
- Predictions by 1x1 conv layers, bilinear upsampling to the original image resolution.
- Predictions by 1x1 conv layers, learned 2x upsampling using transposed convolutions, fusion by summing.
Refining fully convolutional nets by fusing information from layers with different strides improves segmentation details. Comparison on a subset of PASCAL 2011 validation data.
J. Long, E. Shelhamer, and T. Darrell, Fully Convolutional Networks for Semantic Segmentation, CVPR 2015
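As a concrete illustration, here is a minimal sketch of an FCN-style prediction head, assuming PyTorch; the class name FCNHead, the channel counts, and the feature-map sizes are illustrative choices, not taken from the paper or its released code.

```python
# Sketch of an FCN-style head: 1x1 convs give per-pixel class scores at two
# strides, a transposed conv learns 2x upsampling, and the score maps are
# fused by summing (FCN-16s-style skip fusion).
import torch
import torch.nn as nn
import torch.nn.functional as F

class FCNHead(nn.Module):
    def __init__(self, c_pool4, c_pool5, num_classes):
        super().__init__()
        # 1x1 convolutions produce per-pixel class scores at each stride
        self.score_pool5 = nn.Conv2d(c_pool5, num_classes, kernel_size=1)
        self.score_pool4 = nn.Conv2d(c_pool4, num_classes, kernel_size=1)
        # learned 2x upsampling: stride-2 transposed convolution,
        # output size = (I - 1) * 2 + 4 - 2 * 1 = 2 * I
        self.up2 = nn.ConvTranspose2d(num_classes, num_classes,
                                      kernel_size=4, stride=2, padding=1)

    def forward(self, pool4_feat, pool5_feat, image_size):
        s5 = self.up2(self.score_pool5(pool5_feat))  # coarse scores, upsampled 2x
        s4 = self.score_pool4(pool4_feat)            # scores from the finer layer
        fused = s4 + s5                              # fusion by summing
        # final upsampling to the input resolution (bilinear here for brevity)
        return F.interpolate(fused, size=image_size,
                             mode="bilinear", align_corners=False)

# toy usage: pool4 at stride 16, pool5 at stride 32 for a 224x224 input
head = FCNHead(c_pool4=512, c_pool5=512, num_classes=21)
scores = head(torch.randn(1, 512, 14, 14), torch.randn(1, 512, 7, 7), (224, 224))
print(scores.shape)  # torch.Size([1, 21, 224, 224])
```

The stride-2 transposed convolution with kernel size 4 and padding 1 exactly doubles the spatial resolution, which is what lets the coarse scores be summed with the finer ones.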
Outline
- Operations for dense prediction: transposed convolutions, unpooling

Transposed Convolution
A transposed convolutional layer is an upsampling layer that produces an output feature map larger than its input feature map. It is similar to a deconvolutional layer. Its operation resembles that of a normal convolutional layer, except that it performs the convolution in the opposite direction: instead of sliding the kernel over the input and performing element-wise multiplication and summation, a transposed convolutional layer slides the input over the kernel and performs element-wise multiplication and summation. The result is an output larger than the input, whose size can be controlled by the stride and padding parameters of the layer. Examples: transposed convolution with a 2 x 2 kernel, with stride 1 and with stride 2.

Output shape
Oh = (Ih - 1) x S + Kh - 2P
Ow = (Iw - 1) x S + Kw - 2P
where (Ih, Iw) is the input size, (Kh, Kw) the kernel size, S the stride, and P the padding.

Upsampling by transposed convolution
Backwards-strided convolution: to increase resolution, use output stride > 1. For stride 2, dilate the input by inserting rows and columns of zeros between adjacent entries, then convolve with the flipped filter. This is sometimes called convolution with fractional input stride 1/2. Q: What 3x3 filter would correspond to bilinear upsampling?
V. Dumoulin and F. Visin, A guide to convolution arithmetic for deep learning, arXiv 2018

Upsampling by unpooling
An alternative to transposed convolution is max unpooling: remember the pooling indices (which element was the max) during max pooling, and place each value back at its remembered location during unpooling. The output is sparse, so unpooling is typically followed by a transposed convolution layer.

Outline
- Architectures for dense prediction: DeconvNet, SegNet, U-Net

DeconvNet
Decoder stages, from the original image to full resolution: 14x14 deconv, 28x28 unpooling, 28x28 deconv, 54x54 unpooling, 54x54 deconv, 112x112 unpooling, 112x112 deconv, 224x224 unpooling, 224x224 deconv.
H. Noh, S. Hong, and B. Han, Learning Deconvolution Network for Semantic Segmentation, ICCV 2015

DeconvNet results (PASCAL VOC 2012, mIoU)
FCN-8: 62.2
DeconvNet: 69.6
Ensemble of DeconvNet and FCN: 71.7

SegNet
Similar architecture: SegNet. Drop the FC layers, get better results.
V. Badrinarayanan, A. Kendall, and R. Cipolla, SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation, PAMI 2017

U-Net
Like FCN, fuse upsampled higher-level feature maps with higher-resolution, lower-level feature maps. Unlike FCN, fuse by concatenation and predict at the end.
O. Ronneberger, P. Fischer, T. Brox, U-Net: Convolutional Networks for Biomedical Image Segmentation, MICCAI 2015
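To make the concatenation-based fusion concrete, here is a small sketch of a U-Net-style decoder step, assuming PyTorch; the class name UNetUpBlock, the channel counts, and the use of padded 3x3 convolutions (the original paper uses unpadded ones) are illustrative simplifications, not the authors' code.

```python
# Sketch of one U-Net decoder step: upsample with a stride-2 transposed
# convolution, concatenate the same-resolution encoder feature map
# (skip connection), then convolve. Contrast with FCN, which fuses by summing.
import torch
import torch.nn as nn

class UNetUpBlock(nn.Module):
    def __init__(self, in_ch, skip_ch, out_ch):
        super().__init__()
        # stride-2 transposed convolution doubles the spatial resolution
        self.up = nn.ConvTranspose2d(in_ch, out_ch, kernel_size=2, stride=2)
        self.conv = nn.Sequential(
            nn.Conv2d(out_ch + skip_ch, out_ch, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
        )

    def forward(self, x, skip):
        x = self.up(x)                   # 2x upsampling of the deeper feature map
        x = torch.cat([x, skip], dim=1)  # fuse by concatenation along channels
        return self.conv(x)

# toy usage: deeper 32x32 features fused with a 64x64 encoder skip connection
block = UNetUpBlock(in_ch=256, skip_ch=128, out_ch=128)
out = block(torch.randn(1, 256, 32, 32), torch.randn(1, 128, 64, 64))
print(out.shape)  # torch.Size([1, 128, 64, 64])
```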
Last time: Dense prediction
- Fully convolutional networks
- Operations for dense prediction: transposed convolutions, unpooling
- Architectures for dense prediction: DeconvNet, SegNet, U-Net
Summary of dense prediction architectures.

Outline
- Instance segmentation: Mask R-CNN

Instance segmentation
Source: Kaiming He

Mask R-CNN
Mask R-CNN = Faster R-CNN + FCN on RoIs: Faster R-CNN supplies the classification + regression branch, and a mask branch separately predicts a segmentation for each possible class. In other words, Mask R-CNN extends Faster R-CNN by adding a branch for predicting an object mask in parallel with the existing bounding-box branch.
K. He, G. Gkioxari, P. Dollar, and R. Girshick, Mask R-CNN, ICCV 2017 (Best Paper Award)

RoIAlign vs. RoIPool
RoIPool maps each RoI onto the feature map with nearest-neighbor quantization; RoIAlign avoids this quantization by sampling the features at exact sub-pixel locations with bilinear interpolation (a small numeric illustration appears at the end of this section).

Mask R-CNN heads
From RoIAlign features, predict the class label, bounding box, and segmentation mask. The classification/regression head comes from an established object detector (e.g., FPN). Separately predict a binary mask for each class with per-pixel sigmoids, and train with an average binary cross-entropy loss (sketched below).

Example results.
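A small sketch of the per-class mask loss idea, assuming PyTorch; the function name mask_loss, the shapes, and the toy inputs are hypothetical and only illustrate the "per-pixel sigmoid, binary cross-entropy on the ground-truth class only" recipe described above, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def mask_loss(mask_logits, gt_masks, gt_classes):
    """
    mask_logits: (num_rois, num_classes, M, M) raw scores from a mask head
    gt_masks:    (num_rois, M, M) binary ground-truth masks resampled to M x M
    gt_classes:  (num_rois,) ground-truth class index for each RoI
    """
    idx = torch.arange(mask_logits.size(0), device=mask_logits.device)
    # keep only the mask predicted for each RoI's ground-truth class;
    # masks for other classes get no gradient (no competition across classes)
    selected = mask_logits[idx, gt_classes]  # (num_rois, M, M)
    # per-pixel sigmoid + average binary cross-entropy
    return F.binary_cross_entropy_with_logits(selected, gt_masks.float())

# toy usage with hypothetical shapes: 4 RoIs, 80 classes, 28x28 masks
logits = torch.randn(4, 80, 28, 28)
targets = (torch.rand(4, 28, 28) > 0.5)
classes = torch.randint(0, 80, (4,))
print(mask_loss(logits, targets, classes))
```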

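Finally, to revisit the RoIAlign vs. RoIPool comparison above, a quick numeric illustration using torchvision's roi_pool and roi_align ops; the feature map, box coordinates, and output size are arbitrary toy values, and this is a sketch for intuition rather than a reproduction of the paper's experiments.

```python
import torch
from torchvision.ops import roi_pool, roi_align

# an 8x8 single-channel feature map with easily readable values
feat = torch.arange(64, dtype=torch.float32).reshape(1, 1, 8, 8)
# one RoI per row: (batch_index, x1, y1, x2, y2), given directly in
# feature-map coordinates (spatial_scale=1.0); note the fractional offsets
rois = torch.tensor([[0.0, 1.3, 1.3, 5.7, 5.7]])

pooled = roi_pool(feat, rois, output_size=(2, 2), spatial_scale=1.0)
aligned = roi_align(feat, rois, output_size=(2, 2), spatial_scale=1.0,
                    sampling_ratio=2)

print(pooled)   # RoIPool: values snapped to the integer grid (quantization)
print(aligned)  # RoIAlign: bilinearly interpolated sub-pixel samples
```

Because RoIAlign interpolates instead of rounding, small shifts of the box change its output smoothly, which is what makes the pooled features better aligned with the mask targets.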