COMP9517_24T2W8_Deep_Learning_Part_2-2.pdf
COMP9517 Computer Vision
2024 Term 2, Week 8
Dr Dong Gong
Deep Learning II: Semantic Segmentation, Instance Segmentation and Video Understanding using CNNs
Copyright (C) UNSW

Outline
Ø Computer Vision tasks
Ø Semantic Segmentation
Ø Sliding Window
Ø Fully Convolutional Networks (FCNs)
Ø U-Net
Ø U-Net variants
Ø Instance Segmentation
Ø Mask R-CNN
Ø Video understanding
Ø Challenges in processing videos
Ø Video datasets
Ø C3D: Learning spatiotemporal features with 3D CNN
Ø Two-stream network for video classification

Vision tasks
Ø Image classification: assign a label or class to an image.
Ø Object detection: locate the presence of objects with a bounding box and the class of each located object in an image.
Ø Semantic segmentation: label every pixel (pixel-wise classification).
Ø Instance segmentation: differentiate individual instances.

Semantic Segmentation
Ø Classify each pixel in an image.
Image Credit: Creative Commons Licenses

How to train a semantic segmentation network?
Ø For each training image, an annotated (ground-truth) mask is given.
(Figure: training pairs of an input CT image and its annotated mask are used to fit the segmentation model; at test time the model labels every pixel.)
Image Credit: Creative Commons Licenses

Sliding Window approach
Ø Classify individual pixels.
Ø Classifying each pixel in isolation is not a good idea: there is no context!
Ø How can we include neighbourhood context when classifying an individual pixel?
Image Credit: Creative Commons Licenses

Sliding Window approach
Ø Idea: extract "patches" from the entire image and classify the centre pixel of each patch based on its neighbouring context (e.g. a CNN labels patches as Sky, Cow, Grass).
Ø Limitations: very inefficient, and shared features between overlapping patches are not reused.
Farabet et al. "Learning Hierarchical Features for Scene Labeling", TPAMI 2013
Pinheiro and Collobert. "Recurrent Convolutional Neural Networks for Scene Labeling", ICML 2014
Image Credit: Creative Commons Licenses

Semantic Segmentation using Convolution
Ø Idea: encode the entire image with a ConvNet and perform semantic segmentation on top of it.
Ø Problem: semantic segmentation requires the output to be the same size as the input, but CNN classification architectures reduce the spatial size of the feature maps as they go deeper (due to downsampling).
Image Credit: Creative Commons Licenses

Fully Convolutional Networks for Semantic Segmentation
Ø Design a network with only convolutional layers and no downsampling operators, so the prediction map has the same size as the input image.
(Figure: input 3 x H x W passes through a stack of convolutions; an argmax over the class scores gives H x W predictions.)
Long et al. "Fully Convolutional Networks for Semantic Segmentation", CVPR 2015
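To make the idea concrete, here is a minimal PyTorch sketch of such a fully convolutional network with no downsampling; the layer widths and number of classes are illustrative only, not the architecture of Long et al.

```python
# Minimal sketch (assumed layer widths): a fully convolutional network that keeps
# the spatial size H x W throughout, so per-pixel class scores match the input size.
import torch
import torch.nn as nn

num_classes = 21  # illustrative number of classes

fcn_no_downsampling = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=3, padding=1), nn.ReLU(inplace=True),
    nn.Conv2d(64, 64, kernel_size=3, padding=1), nn.ReLU(inplace=True),
    nn.Conv2d(64, 64, kernel_size=3, padding=1), nn.ReLU(inplace=True),
    nn.Conv2d(64, num_classes, kernel_size=1),   # 1x1 conv: per-pixel class scores
)

x = torch.randn(1, 3, 128, 128)       # input: 3 x H x W
scores = fcn_no_downsampling(x)       # C x H x W class scores
pred = scores.argmax(dim=1)           # H x W label map
print(scores.shape, pred.shape)       # [1, 21, 128, 128] and [1, 128, 128]
```

Because every convolution runs at full input resolution, this is exactly the expensive design the next slide warns about.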
Fully Convolutional Networks for Semantic Segmentation
Ø Problem: convolutions at the original image resolution throughout the network would be very expensive.
Ø Instead, design a network of convolutional layers that downsamples and then upsamples inside the network, trained in an end-to-end manner.
(Figure: input 3 x H x W -> downsampling to low-resolution features -> upsampling back to high resolution -> C x H x W scores -> H x W predictions.)
Long et al. "Fully Convolutional Networks for Semantic Segmentation", CVPR 2015

In-network Upsampling: Unpooling
Ø Abstract feature maps are upsampled to make their spatial dimensions equal to those of the input image.
Ø Nearest-neighbour unpooling copies each value into its whole output block. Example: unpooling the 2 x 2 input [[1, 2], [3, 4]] gives the 4 x 4 output [[1, 1, 2, 2], [1, 1, 2, 2], [3, 3, 4, 4], [3, 3, 4, 4]].
Ø "Bed of nails" unpooling places each value in one corner of its output block and fills the rest with zeros. Example: unpooling [[1, 2], [3, 4]] gives [[1, 0, 2, 0], [0, 0, 0, 0], [3, 0, 4, 0], [0, 0, 0, 0]].
Ø (For comparison, max pooling the 4 x 4 input [[1, 2, 6, 3], [3, 5, 2, 1], [1, 2, 2, 1], [7, 3, 4, 8]] with a 2 x 2 window gives the 2 x 2 output [[5, 6], [7, 8]].)
Long et al. "Fully Convolutional Networks for Semantic Segmentation", CVPR 2015
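The two fixed (non-learned) unpooling schemes above can be reproduced in a few lines of PyTorch; this is a small sketch of the 2 x 2 to 4 x 4 example, not library-specific "unpooling layers".

```python
# Nearest-neighbour vs "bed of nails" unpooling on the 2x2 example above.
import torch
import torch.nn.functional as F

x = torch.tensor([[1., 2.],
                  [3., 4.]]).view(1, 1, 2, 2)     # N x C x 2 x 2

# Nearest-neighbour unpooling: each value is copied into its 2x2 output block.
nn_up = F.interpolate(x, scale_factor=2, mode="nearest")
# [[1,1,2,2],[1,1,2,2],[3,3,4,4],[3,3,4,4]]

# Bed-of-nails unpooling: each value goes to the top-left corner of its block,
# all other positions are zero.
bed = torch.zeros(1, 1, 4, 4)
bed[:, :, ::2, ::2] = x
# [[1,0,2,0],[0,0,0,0],[3,0,4,0],[0,0,0,0]]

print(nn_up.squeeze())
print(bed.squeeze())
```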
In-network Upsampling: Max Unpooling
Ø Max pooling remembers the position of the max element in each window; max unpooling (after the rest of the network) uses those positions from the pooling layer to place each value back where its maximum came from, filling the other positions with zeros.
Ø Example: pooling the 4 x 4 input above gives [[5, 6], [7, 8]]; max-unpooling the 2 x 2 input [[1, 2], [3, 4]] with the remembered positions gives [[0, 0, 2, 0], [0, 1, 0, 0], [0, 0, 0, 0], [3, 0, 0, 4]].

Learning Upsampling: Transpose Convolution
Ø Recall: a typical 3 x 3 convolution with stride 1 and padding 1 takes a dot product between the filter/kernel and the input; a 4 x 4 input gives a 4 x 4 output.
Ø Recall: a strided 3 x 3 convolution with stride 2 and padding 1 moves the filter 2 pixels in the input for every one pixel in the output, so a 4 x 4 input gives a 2 x 2 output. The stride gives the ratio between movement in the input and movement in the output.
Ø A 3 x 3 transpose convolution with stride 2 and padding 1 does the opposite: the filter moves 2 pixels in the output for every one pixel in the input, and the input value gives the weight for the filter. Where the outputs overlap, they are summed. A 2 x 2 input gives a 4 x 4 output.

Fully Convolutional Networks for Semantic Segmentation
Ø Design a network of convolutional layers that downsamples and then upsamples inside the network, trained in an end-to-end manner.
Ø Downsampling: pooling or strided convolution. Upsampling: unpooling or strided transpose convolution.
Ø Instead of suddenly blowing up the resolution, upsample gradually, and learn the upsampling with convolutions.
Long et al. "Fully Convolutional Networks for Semantic Segmentation", CVPR 2015

U-Net
Ø Combines all the previous improvements and also adds skip connections.
Ø Skip connections allow outputs from earlier layers to feed in directly as input to later layers.
Ø U-Net learns segmentation in an end-to-end manner.
Ronneberger et al. "U-Net: Convolutional Networks for Biomedical Image Segmentation", MICCAI 2015
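The following is a one-level, U-Net-style sketch (channel sizes illustrative, not Ronneberger et al.'s exact architecture) showing the two ingredients just discussed: learned upsampling with a transpose convolution and a skip connection that concatenates encoder features into the decoder.

```python
# Tiny U-Net-style encoder/decoder with one skip connection (illustrative sizes).
import torch
import torch.nn as nn

class TinyUNet(nn.Module):
    def __init__(self, in_ch=3, num_classes=2):
        super().__init__()
        self.enc = nn.Sequential(nn.Conv2d(in_ch, 32, 3, padding=1), nn.ReLU(inplace=True))
        self.down = nn.MaxPool2d(2)                                    # H x W -> H/2 x W/2
        self.bottleneck = nn.Sequential(nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(inplace=True))
        self.up = nn.ConvTranspose2d(64, 32, kernel_size=2, stride=2)  # learned upsampling
        self.dec = nn.Sequential(nn.Conv2d(64, 32, 3, padding=1), nn.ReLU(inplace=True))
        self.head = nn.Conv2d(32, num_classes, 1)                      # per-pixel class scores

    def forward(self, x):
        e = self.enc(x)                          # 32 x H x W
        b = self.bottleneck(self.down(e))        # 64 x H/2 x W/2
        u = self.up(b)                           # 32 x H x W
        d = self.dec(torch.cat([u, e], dim=1))   # skip connection: concatenate encoder features
        return self.head(d)                      # num_classes x H x W

print(TinyUNet()(torch.randn(1, 3, 64, 64)).shape)  # torch.Size([1, 2, 64, 64])
```

A full U-Net simply stacks several such levels, with a skip connection at each resolution.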
U-Net variants
Ø Attention U-Net: Oktay et al. (2018). "Attention U-Net: Learning Where to Look for the Pancreas", MIDL 2018
Ø ResUNet: Diakogiannis et al. (2019). "ResUNet-a: A Deep Learning Framework for Semantic Segmentation of Remotely Sensed Data", ISPRS Journal of Photogrammetry and Remote Sensing
Ø TransUNet: Chen et al. (2021). "TransUNet: Transformers Make Strong Encoders for Medical Image Segmentation", arXiv

Instance Segmentation
Ø Differentiate individual instances.

Mask R-CNN
Ø An extension of the Faster R-CNN framework for solving the instance segmentation problem.
Ø Detects and delineates each object in an image at a fine-grained, per-pixel level.
Ø Mask R-CNN outputs a binary mask for each RoI on top of the Faster R-CNN outputs.
He et al. "Mask R-CNN", ICCV 2017

Need for RoIAlign
Ø One pixel in the RoI feature map corresponds to many pixels in the original image.
https://www.youtube.com/watch?v=Ul25zSysk2A&list=PLkRkKTC6HZMxZrxnHUDYSLiPZxiUUFD2C&index=2

RoIAlign
Ø To extract a pixel-to-pixel mask, the RoI needs to be well aligned to preserve the explicit per-pixel spatial correspondence.
Ø RoIPool quantizes a floating-point RoI to the discrete granularity of the feature map.
Ø RoIAlign instead uses bilinear interpolation to compute the exact values of the input features.
Ø A multi-task loss is applied to each sampled RoI: L = L_cls + L_box + L_mask.
He et al. "Mask R-CNN", ICCV 2017

Mask R-CNN Architecture
Ø The architecture has two parts:
Ø Backbone architecture: used for feature extraction.
Ø Network head: comprises the object detection and segmentation branches.
Ø Backbone options: ResNet and ResNeXt (depth 50 and 101 layers), with a Feature Pyramid Network (FPN).
Ø Network head: uses almost the same architecture as Faster R-CNN, but adds a convolutional mask-prediction branch.
He et al. "Mask R-CNN", ICCV 2017

Mask R-CNN Results
Ø Results on the MS COCO test set, based on ResNet-101.
He et al. "Mask R-CNN", ICCV 2017
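For a hands-on feel, here is a short inference sketch with torchvision's pre-trained Mask R-CNN (the same model the implementation links at the end of this lecture point to). The exact weights argument depends on your torchvision version; "DEFAULT" assumes torchvision 0.13 or newer, and the input image here is random dummy data.

```python
# Sketch: instance segmentation inference with torchvision's Mask R-CNN.
import torch
import torchvision

model = torchvision.models.detection.maskrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

image = torch.rand(3, 480, 640)           # dummy RGB image with values in [0, 1]
with torch.no_grad():
    out = model([image])[0]               # one output dict per input image

keep = out["scores"] > 0.5                # keep confident detections
boxes = out["boxes"][keep]                # [N, 4] boxes (x1, y1, x2, y2)
labels = out["labels"][keep]              # [N] class ids
masks = out["masks"][keep] > 0.5          # [N, 1, H, W] binary mask per instance (per RoI)
print(boxes.shape, masks.shape)
```

Note how the per-instance binary masks sit on top of the usual Faster R-CNN outputs (boxes, labels, scores), exactly as described above.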
Video Understanding

Video
Ø A sequence of images.
Ø A 4D tensor: T x 3 x H x W, or 3 x T x H x W.

Challenges in processing videos
Ø Capturing the information across frames.
Ø Huge computational cost! Videos have approximately 30 frames per second (fps).
Ø Size of uncompressed video (3 bytes per pixel): SD video (640 x 480) is ~1.5 GB per minute; HD video (1920 x 1080) is ~10 GB per minute.
Ø Solution: train on short clips (low fps and low spatial resolution).
Source: Weizmann Dataset. Gorelick et al. "Actions as Space-Time Shapes", TPAMI 2007

Video Classification
Ø Simple approach: apply a 2D CNN to classify each frame independently (e.g. every frame of a walking clip is labelled "Walk"). This is a very strong baseline for video classification!
Ø Per-frame predictions or features can then be combined with fusion techniques: early fusion, mid-level fusion, or late fusion.
Source: Weizmann Dataset. Gorelick et al. "Actions as Space-Time Shapes", TPAMI 2007

3D CNN
Ø How can we process an entire clip?
Ø Idea: use 3D versions of convolution and pooling to slowly fuse temporal information over the course of the network.
Source: https://towardsdatascience.com/understanding-1d-and-3d-convolution-neural-network-keras-9d8f76e29610

2D vs 3D Convolution
Source: Liu et al. "A Uniform Architecture Design for Accelerating 2D and 3D CNNs on FPGAs"

Classifying 3D data; 3D CNN for video classification
Source: Vu et al. "3D CNN for Feature Extraction and Classification of fMRI Volumes", PRNI 2018

3D CNN for 3D Scene Understanding
Li, Jie, Yu Liu, Dong Gong, Qinfeng Shi, Xia Yuan, Chunxia Zhao, and Ian Reid. "RGBD Based Dimensional Decomposition Residual Network for 3D Semantic Scene Completion", CVPR 2019, pp. 7693-7702.
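A small shape-level sketch of the 2D vs 3D convolution contrast on a short clip (channel sizes illustrative): a 2D kernel sees one frame at a time, while a 3D kernel also slides along the time axis and so fuses temporal information inside the network.

```python
# 2D vs 3D convolution on a 16-frame clip (shapes only; illustrative sizes).
import torch
import torch.nn as nn

clip = torch.randn(1, 3, 16, 112, 112)       # N x 3 x T x H x W

# 2D convolution applied to each frame independently: no temporal fusion.
conv2d = nn.Conv2d(3, 64, kernel_size=3, padding=1)
frame_feats = torch.stack([conv2d(clip[:, :, t]) for t in range(clip.shape[2])], dim=2)
print(frame_feats.shape)                      # torch.Size([1, 64, 16, 112, 112])

# 3D convolution: the 3 x 3 x 3 kernel spans (time, height, width).
conv3d = nn.Conv3d(3, 64, kernel_size=3, padding=1)
print(conv3d(clip).shape)                     # torch.Size([1, 64, 16, 112, 112])

# 3D pooling with a (2, 2, 2) kernel halves the temporal and spatial dimensions,
# slowly fusing information over the course of the network.
pool3d = nn.MaxPool3d(kernel_size=2)
print(pool3d(conv3d(clip)).shape)             # torch.Size([1, 64, 8, 56, 56])
```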
Video Datasets
Ø Sports-1M: a sports action recognition dataset containing 1 million videos from 487 classes of sports, such as basketball, soccer, and ice hockey. Source: https://cs.stanford.edu/people/karpathy/deepvideo/
Ø UCF101 (Action Recognition): one of the most widely used video classification datasets, consisting of 13,320 videos from 101 different action classes, such as walking, jogging, and playing soccer. It is commonly used for evaluating video classification algorithms on a wide range of action recognition tasks. Source: https://www.crcv.ucf.edu/data/UCF101.php
Ø Kinetics: a collection of large-scale, high-quality datasets of URL links to up to 650,000 video clips covering 400/600/700 human action classes, depending on the dataset version. The videos include human-object interactions such as playing instruments, as well as human-human interactions such as shaking hands and hugging. Each action class has at least 400/600/700 video clips; each clip is human-annotated with a single action class and lasts around 10 seconds. Source: https://www.deepmind.com/open-source/kinetics
Ø HMDB: contains 6,849 videos from 51 different action classes; similar to UCF101, but with a smaller number of classes and videos.
Ø YouTube-8M. Source: https://research.google.com/youtube8m/
Ø The Action Similarity Labelling (ASLAN) challenge: 3,631 videos from 432 action classes; the task is to predict whether a given pair of videos shows the same or a different action. Source: https://talhassner.github.io/home/projects/ASLAN/ASLAN-main.html

C3D: Learning spatiotemporal features with 3D CNN
Ø C3D architecture (layer: output size):
Input: 3 x 16 x 112 x 112
Conv1 (3 x 3 x 3): 64 x 16 x 112 x 112
Pool1 (1 x 2 x 2): 64 x 16 x 56 x 56
Conv2 (3 x 3 x 3): 128 x 16 x 56 x 56
Pool2 (2 x 2 x 2): 128 x 8 x 28 x 28
Conv3a (3 x 3 x 3): 256 x 8 x 28 x 28
Conv3b (3 x 3 x 3): 256 x 8 x 28 x 28
Pool3 (2 x 2 x 2): 256 x 4 x 14 x 14
Conv4a (3 x 3 x 3): 512 x 4 x 14 x 14
Conv4b (3 x 3 x 3): 512 x 4 x 14 x 14
Pool4 (2 x 2 x 2): 512 x 2 x 7 x 7
Conv5a (3 x 3 x 3): 512 x 2 x 7 x 7
Conv5b (3 x 3 x 3): 512 x 2 x 7 x 7
Pool5: 512 x 1 x 3 x 3
FC6: 4096
FC7: 4096
FC8: C
Ø Visualization of the C3D model: it captures appearance in the first few frames but thereafter attends only to salient motion.
Source: Tran et al. "Learning Spatiotemporal Features with 3D Convolutional Networks", ICCV 2015

C3D Results
Ø Action recognition results on the UCF-101 dataset; action similarity labeling results and ROC curve on ASLAN.
Source: Tran et al. "Learning Spatiotemporal Features with 3D Convolutional Networks", ICCV 2015

Recognizing Actions from Motion
Ø Actions can be recognized using only motion information.
Ø Optical flow is the pattern of apparent motion of image objects between two consecutive frames, caused by the movement of the objects or the camera. It is a 2D vector field where each vector is a displacement vector showing the movement of points from the first frame to the second.
Source: OpenCV, https://docs.opencv.org/3.4/d4/dee/tutorial_optical_flow.html

Optical Flow
Ø Each arrow points in the direction of the predicted flow of the corresponding pixel (sparse vs dense optical flow).
Ø Useful in many applications: structure from motion, video compression, video stabilization, and action recognition.
Source: Introduction to Motion Estimation with Optical Flow, https://nanonets.com/blog/optical-flow/

Optical Flow ConvNets
Ø The input to the ConvNet is formed by stacking optical flow displacement fields between several consecutive frames.
Ø This explicitly describes the motion between video frames, making recognition easier.
(Figure: (a), (b) a pair of consecutive video frames; (c) a close-up of the dense optical flow in the outlined area; (d) horizontal and (e) vertical components of the displacement vector field.)
Source: Simonyan and Zisserman. "Two-Stream Convolutional Networks for Action Recognition in Videos", NeurIPS 2014
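As a minimal sketch of how such displacement fields are computed in practice, the OpenCV tutorial cited above provides dense optical flow via the Farneback method; the file names below are hypothetical placeholders for two consecutive frames.

```python
# Dense optical flow between two consecutive frames (Farneback method, OpenCV).
import cv2
import numpy as np

prev = cv2.cvtColor(cv2.imread("frame1.jpg"), cv2.COLOR_BGR2GRAY)  # hypothetical frame files
curr = cv2.cvtColor(cv2.imread("frame2.jpg"), cv2.COLOR_BGR2GRAY)

# Arguments: prev, next, flow, pyr_scale, levels, winsize, iterations, poly_n, poly_sigma, flags.
flow = cv2.calcOpticalFlowFarneback(prev, curr, None, 0.5, 3, 15, 3, 5, 1.2, 0)

# flow[y, x] = (dx, dy): displacement of each pixel from the first frame to the second.
dx, dy = flow[..., 0], flow[..., 1]           # horizontal and vertical components
magnitude, angle = cv2.cartToPolar(dx, dy)    # per-pixel motion magnitude and direction
print(flow.shape, float(magnitude.mean()))    # (H, W, 2), average motion strength
```

Stacking several such dx/dy fields over consecutive frames gives the multi-channel motion input used by the temporal stream described next.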
Two-Stream Network for video classification
Ø Videos can naturally be decomposed into spatial and temporal components.
Ø The spatial component carries information about the scenes and objects depicted in the video.
Ø The temporal component conveys the movement of the observer (the camera) and the objects.
Source: Simonyan and Zisserman. "Two-Stream Convolutional Networks for Action Recognition in Videos", NeurIPS 2014

Two-Stream Network Results
Ø Mean accuracy on UCF101 and HMDB-51.
Source: Simonyan and Zisserman. "Two-Stream Convolutional Networks for Action Recognition in Videos", NeurIPS 2014

How to model long-term temporal dependency?
Ø Often the salient information in a video is many frames apart.
Ø Problem: how can we model long-term temporal structure in videos?
Ø Recall: Convolutional Neural Networks (CNNs) capture local structure/context, while Recurrent Neural Networks (RNNs) capture global structure/context.
Ø We can therefore use a combination of CNNs and RNNs to model long-term temporal structure in videos, as in the sketch shown after the following slides.

Long-term Recurrent Convolutional Network (LRCN)
Source: Donahue et al. "Long-term Recurrent Convolutional Networks for Visual Recognition and Description", CVPR 2015

Recurrent Neural Networks (RNNs)
Ø Sequential modelling: RNN, GRU, LSTM, ...
Ø Used for action recognition or video classification (which can also be handled by 3D CNNs) and for image captioning.
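Below is a minimal CNN + RNN sketch of this idea (per-frame 2D CNN features fed into an LSTM), in the spirit of LRCN but not Donahue et al.'s exact model; the backbone, feature size and class count are illustrative, and in practice a pre-trained 2D CNN would be used.

```python
# CNN + LSTM video classifier sketch (illustrative sizes, not the exact LRCN model).
import torch
import torch.nn as nn

class CNNLSTMClassifier(nn.Module):
    def __init__(self, num_classes=101, feat_dim=128, hidden_dim=256):
        super().__init__()
        # Small per-frame 2D CNN (a pre-trained backbone would normally go here).
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, feat_dim),
        )
        self.lstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True)  # temporal model
        self.head = nn.Linear(hidden_dim, num_classes)

    def forward(self, clip):                    # clip: N x T x 3 x H x W
        n, t = clip.shape[:2]
        feats = self.cnn(clip.flatten(0, 1))    # (N*T) x feat_dim, one vector per frame
        feats = feats.view(n, t, -1)
        out, _ = self.lstm(feats)               # N x T x hidden_dim
        return self.head(out[:, -1])            # classify from the last time step

print(CNNLSTMClassifier()(torch.randn(2, 16, 3, 112, 112)).shape)  # torch.Size([2, 101])
```

The CNN handles local spatial context within each frame, while the LSTM carries information across frames that may be far apart in time.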
Is Space-Time Attention All You Need for Video Understanding?
Ø TimeSformer: a convolution-free approach to video classification built exclusively on self-attention over space and time.
Ø It applies the standard Transformer architecture to video by enabling spatio-temporal feature learning directly from a sequence of frame-level patches.
Ø The paper investigates several designs of video self-attention blocks.
Source: Bertasius et al. "Is Space-Time Attention All You Need for Video Understanding?", ICML 2021

TimeSformer Results
Ø Video-level accuracy on Kinetics-400, and visualization of space-time attention mapped from the output back to the input space.
Source: Bertasius et al. "Is Space-Time Attention All You Need for Video Understanding?", ICML 2021

ViViT: A Video Vision Transformer
Source: Arnab et al. "ViViT: A Video Vision Transformer", ICCV 2021

Implementation
Ø Implementing U-Net from scratch in PyTorch: https://nn.labml.ai/unet/index.html and https://towardsdatascience.com/cook-your-first-u-net-in-pytorch-b3297a844cf3
Ø Semantic segmentation using U-Net in PyTorch: https://wandb.ai/ishandutta/semantic_segmentation_unet/reports/Semantic-Segmentation-with-UNets-in-PyTorch--VmlldzoyMzA3MTk1 and https://github.com/PacktPublishing/Modern-Computer-Vision-with-PyTorch/blob/master/Chapter09/Semantic_Segmentation_with_U_Net.ipynb
Ø U-Net model in PyTorch: https://pytorch.org/hub/mateuszbuda_brain-segmentation-pytorch_unet/
Ø Northern Pike segmentation using U-Net: https://www.datainwater.com/post/pike_segmentation/
Ø PyImageSearch's tutorial on U-Net for the TGS Salt Segmentation Challenge: https://pyimagesearch.com/2021/11/08/u-net-training-image-segmentation-models-in-pytorch/
Ø Semantic segmentation using TensorFlow Model Garden: https://www.tensorflow.org/tfmodels/vision/semantic_segmentation
Ø Mask R-CNN in PyTorch: https://pytorch.org/vision/main/models/mask_rcnn.html
Ø Mask R-CNN for pedestrian instance segmentation: https://pytorch.org/tutorials/intermediate/torchvision_tutorial.html
Ø Instance segmentation using Mask R-CNN: https://haochen23.github.io/2020/05/instance-segmentation-mask-rcnn.html
Ø Fine-tuning Mask R-CNN on custom data using Detectron2: https://geekyrakshit.dev/geekyrakshit-blog/computervision/deeplearning/segmentation/objectdetction/neuralnetwork/instancesegmentation/convolution/detectron/maskrcnn/python/pytorch/2020/04/13/detectron-mask-rcnn.html

Further reading on discussed topics
Ø Chapter 7 of the Deep Learning book by Ian Goodfellow, Yoshua Bengio and Aaron Courville. https://www.deeplearningbook.org/
Ø Chapter 4, "Object Detection and Image Segmentation", of Practical Machine Learning for Computer Vision by Valliappa Lakshmanan, Martin Gorner and Ryan Gillard. https://www.oreilly.com/library/view/practical-machine-learning/9781098102357/ch04.html
Ø Chapter 7 of Deep Learning for Vision Systems by Mohamed Elgendy.

Acknowledgements
Ø Some material drawn from referenced and associated online sources.
Ø Image sources credited where possible.
Ø Some slides adapted from cs231n Lecture 9, "Object Detection and Image Segmentation".

References
Farabet et al. "Learning Hierarchical Features for Scene Labeling", TPAMI 2013. https://ieeexplore.ieee.org/document/6338939
Long et al. "Fully Convolutional Networks for Semantic Segmentation", CVPR 2015. https://www.cv-foundation.org/openaccess/content_cvpr_2015/papers/Long_Fully_Convolutional_Networks_2015_CVPR_paper.pdf
Ronneberger et al. "U-Net: Convolutional Networks for Biomedical Image Segmentation", MICCAI 2015. https://link.springer.com/chapter/10.1007/978-3-319-24574-4_28
Oktay et al. "Attention U-Net: Learning Where to Look for the Pancreas", MIDL 2018. https://openreview.net/forum?id=Skft7cijM
Diakogiannis et al. "ResUNet-a: A Deep Learning Framework for Semantic Segmentation of Remotely Sensed Data", ISPRS Journal of Photogrammetry and Remote Sensing. https://www.sciencedirect.com/science/article/abs/pii/S0924271620300149
Chen et al. "TransUNet: Transformers Make Strong Encoders for Medical Image Segmentation", arXiv 2021. https://arxiv.org/abs/2102.04306
He et al. "Mask R-CNN", ICCV 2017. https://openaccess.thecvf.com/content_ICCV_2017/papers/He_Mask_R-CNN_ICCV_2017_paper.pdf
Vu et al. "3D CNN for Feature Extraction and Classification of fMRI Volumes", PRNI 2018. https://ieeexplore.ieee.org/document/8423964
Tran et al. "Learning Spatiotemporal Features with 3D Convolutional Networks", ICCV 2015. https://www.cv-foundation.org/openaccess/content_iccv_2015/papers/Tran_Learning_Spatiotemporal_Features_ICCV_2015_paper.pdf
Simonyan and Zisserman. "Two-Stream Convolutional Networks for Action Recognition in Videos", NeurIPS 2014. https://papers.nips.cc/paper_files/paper/2014/hash/00ec53c4682d36f5c4359f4ae7bd7ba1-Abstract.html
Donahue et al. "Long-term Recurrent Convolutional Networks for Visual Recognition and Description", CVPR 2015. https://openaccess.thecvf.com/content_cvpr_2015/papers/Donahue_Long-Term_Recurrent_Convolutional_2015_CVPR_paper.pdf
Bertasius et al. "Is Space-Time Attention All You Need for Video Understanding?", ICML 2021. http://proceedings.mlr.press/v139/bertasius21a/bertasius21a.pdf
Arnab et al. "ViViT: A Video Vision Transformer", ICCV 2021. https://openaccess.thecvf.com/content/ICCV2021/papers/Arnab_ViViT_A_Video_Vision_Transformer_ICCV_2021_paper.pdf

Example exam question
What kind of neural network is most suited for image segmentation tasks?
A. Multilayer perceptron (MLP)
B. Fully convolutional network (FCN)
C. Region proposal network (RPN)
D. Recurrent neural network (RNN)