COMP9517_24T2W7_Deep_Learning_Part_1-1.pdf




COMP9517 Computer Vision 2024 Term 2 Week 7, Dr Dong Gong
Deep Learning: Overview of Convolutional Neural Networks
Copyright (C) UNSW. Several slides are adapted from the Stanford CS231n lectures by Fei-Fei Li, Ranjay Krishna, and Danfei Xu (April 2021).

Challenges in CV
Consider object detection as an example:
§ Variations in viewpoint
§ Differences in illumination
§ Hidden parts of images
§ Background clutter

Linear Classifier for Image Classification
Image classification with a linear classifier. There are hard cases for a linear classifier: extracting better features (manually) may help but cannot (always) solve the problem.

From Linear Classifiers to (Non-linear) Neural Networks

Neural networks: starting from the original linear classifier
(Before) Linear score function: f = W x

Neural networks: 2 layers
(Before) Linear score function: f = W x
(Now) 2-layer neural network: f = W2 max(0, W1 x)
(In practice we will usually add a learnable bias at each layer as well.)

Neural networks are also called fully connected networks, since they are built from fully connected (FC) layers. "Neural network" is a very broad term; these are more accurately called "fully-connected networks" or sometimes "multi-layer perceptrons" (MLPs).

Hierarchical computation: x (3072-dimensional input) -> W1 -> h (100 hidden units) -> W2 -> s (10 class scores)

Neural networks: 3 layers
(Now) 2-layer neural network: f = W2 max(0, W1 x)
or 3-layer neural network: f = W3 max(0, W2 max(0, W1 x))
(In practice we will usually add a learnable bias at each layer as well.)

Activation function
The function max(0, z) is called the activation function. Why is the max operator important?
Q: What if we try to build a neural network without one?
– The model will be linear.
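To make the two-layer formula above concrete, here is a minimal NumPy sketch (an illustration, not code from the slides) using the sizes on the slide (3072-dimensional input, 100 hidden units, 10 scores); random weights stand in for learned ones and biases are omitted. The last line shows why dropping the activation collapses the model back to a single linear classifier.

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.standard_normal((100, 3072)) * 0.01   # first-layer weights (would be learned)
W2 = rng.standard_normal((10, 100)) * 0.01     # second-layer weights (would be learned)
x = rng.standard_normal(3072)                  # a flattened 32x32x3 image

h = np.maximum(0, W1 @ x)   # hidden layer: ReLU activation max(0, z) introduces non-linearity
s = W2 @ h                  # 10 class scores: f = W2 max(0, W1 x)

# Without the activation function, the two layers collapse into one linear map,
# i.e. a single linear classifier with W = W2 @ W1:
s_linear = (W2 @ W1) @ x
```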
Activation functions
§ Non-linear functions: Sigmoid, tanh, ReLU, Leaky ReLU, Maxout, ELU (several of these are sketched in NumPy below)
§ ReLU is a good default choice for most problems

Neural network architectures (for MLPs)
§ A "2-layer Neural Net" is also called a "1-hidden-layer Neural Net".
§ A "3-layer Neural Net" is also called a "2-hidden-layer Neural Net".
§ Both consist of "fully-connected" layers.
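For reference, a minimal NumPy sketch (illustrative only, not from the slides) of some of the activation functions listed above; the 0.01 negative slope for Leaky ReLU is an assumed default, and Maxout/ELU are omitted.

```python
import numpy as np

def relu(z):
    """ReLU: max(0, z), the usual default choice."""
    return np.maximum(0, z)

def leaky_relu(z, alpha=0.01):
    """Like ReLU, but keeps a small slope (alpha, assumed 0.01) for negative inputs."""
    return np.where(z > 0, z, alpha * z)

def sigmoid(z):
    """Squashes inputs into the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    """Squashes inputs into the range (-1, 1)."""
    return np.tanh(z)

z = np.linspace(-3.0, 3.0, 7)
for f in (relu, leaky_relu, sigmoid, tanh):
    print(f.__name__, f(z))
```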
Convolutional Neural Networks (CNNs): architectures for CNNs are covered in the rest of this lecture.

From Neural Networks to "Deep Learning"
Deep learning is a collection of artificial neural network techniques that are widely used at present. Predominantly, deep learning techniques rely on large amounts of data and deeper learning architectures. Some well-known paradigms for different types of data and applications:
§ Convolutional Neural Networks (CNNs) (Week 7)
§ Recurrent Neural Networks (Week 8)
§ Generative Adversarial Networks (GANs) (Week 8)
§ Transformers (Week 8)

Traditional Approach vs DL
Convolutional neural networks (CNNs) are a type of DNN for processing images. CNNs can be interpreted as gradually transforming the image into a representation in which the classes are separable by a linear classifier. CNNs learn low-level features such as edges and lines in the early layers, then parts of objects, and then high-level representations of an object in subsequent layers.
http://www.analyticsvidhya.com/blog/2017/04/comparison-between-deep-learning-machine-learning/
https://towardsdatascience.com/convolutional-neural-networks-for-all-part-i-cdd282ee7947

From Neural Networks to "Deep Learning"
§ The core ideas go back many decades.
§ Visual features are extracted in different layers of a CNN (Simonyan and Zisserman, 2014).
§ Vision Transformer (ViT): Dosovitskiy, Alexey, et al. "An image is worth 16x16 words: Transformers for image recognition at scale." arXiv preprint arXiv:2010.11929 (2020).
§ CLIP (Contrastive Language–Image Pre-training): https://openai.com/research/clip

DL is everywhere
§ 3D vision understanding: https://arxiv.org/pdf/2001.01349.pdf, http://openaccess.thecvf.com/content_CVPR_2019/papers/Li_RGBD_Based_Dimensional_Decomposition_Residual_Network_for_3D_Semantic_Scene_CVPR_2019_paper.pdf
§ Neural Radiance Fields (NeRF) for 3D vision: https://www.matthewtancik.com/nerf
§ Deep learning for depth estimation: https://ruili3.github.io/dymultidepth/index.html
§ https://github.com/donggong1/learn-optimizer-rgdn, https://donggong1.github.io/blur2mflow.html
§ Vision question answering (VQA) and image captioning (Vinyals et al., 2015; Karpathy and Fei-Fei, 2015)
§ Image generation: "A raccoon astronaut with the cosmos reflecting on the glass of his helmet dreaming of the stars", generated by DALL·E 2 (Ramesh et al., "DALL·E: Creating Images from Text", 2021, https://openai.com/blog/dall-e/)

3 major international CV conferences: CVPR, ICCV, ECCV; and others. Top machine learning conferences with CV research: NeurIPS, ICML, ICLR, ...
https://csrankings.org/#/index?vision&mlmining&australasia

Convolutional Neural Network (CNN): from MLP to CNN

What are CNNs?
Essentially neural networks that use convolution in place of general matrix multiplication in at least one of their layers. The convolutional layer acts as a feature extractor that extracts features of the input such as edges, corners, or endpoints.

CNNs: what do they learn?
https://www.slideshare.net/NirthikaRajendran/cnn-126271677 [From recent Yann LeCun slides]

An overview of a CNN: its components
CNNs are made up of neurons with learnable weights, as in regular neural networks. The CNN architecture assumes that the inputs are images and uses specific assumptions for images (local features), which allows us to encode certain properties in the architecture that make the forward pass more efficient and significantly reduce the number of parameters needed for the network.

Recap: fully connected (FC) layer
A single FC layer is a linear model, not a CNN, but FC layers are a component of CNNs. A 32x32x3 image is stretched to a 3072 x 1 vector; a 10 x 3072 weight matrix maps it to a 10-dimensional activation, where each number is the result of taking a dot product between a row of W and the input (a 3072-dimensional dot product).

Convolution operator parameters (the sketch after this list shows how the first four determine the output size):
§ Filter size
§ Padding
§ Stride
§ Dilation
§ Activation function
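A small helper (an illustration, assuming the output-size convention used by common deep learning frameworks such as PyTorch) that computes the output spatial size of a convolution from these parameters; with dilation 1 it reduces to the (W - F + 2P)/S + 1 rule used later in this lecture.

```python
def conv_output_size(in_size, filter_size, padding=0, stride=1, dilation=1):
    """Output spatial size of a 2D convolution along one dimension.

    A dilated filter covers dilation*(filter_size - 1) + 1 input pixels;
    with dilation = 1 this reduces to (in_size - filter_size + 2*padding) // stride + 1.
    """
    effective = dilation * (filter_size - 1) + 1
    return (in_size + 2 * padding - effective) // stride + 1

# Settings from the lecture (the 32-wide input is taken from the 32x32x3 example):
print(conv_output_size(32, 3, padding=1, stride=1))              # 32: padding 1 keeps the size
print(conv_output_size(32, 3, padding=1, stride=2))              # 16: stride 2 halves it
print(conv_output_size(32, 3, padding=0, stride=1, dilation=2))  # 28: a 3x3 filter covers a 5x5 region
```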
Filter size
§ Filter size can be 5 by 5, 3 by 3, and so on.
§ Larger filter sizes should be avoided in many cases (not always!)
– The learning algorithm needs to learn the filter values (weights).
§ Odd-sized filters are used more often than even-sized filters (not always!)
– Nice geometric property: all input pixels are arranged around the output pixel.

Padding
§ After applying a 3 by 3 filter to a 4 by 4 image, we get a 2 by 2 image: the size of the image has gone down.
§ If we want to keep the image size the same, we can use padding:
– We pad the input in every direction with 0's before applying the filter.
– If padding is 1 by 1, then we add 1 zero in every direction; if padding is 2 by 2, then we add 2 zeros in every direction, and so on.
Figure: 3 by 3 filter with padding of 1 (https://training.galaxyproject.org/training-material/topics/statistics/tutorials/CNN/slides-plain.html)

Stride
§ The stride is how many pixels we move the filter to the right/down.
§ Stride 1: move the filter one pixel to the right/down. Stride 2: move the filter two pixels to the right/down.
Figure: 3 by 3 filter with stride of 2 (Galaxy Training tutorial, linked above)

Dilation
§ When we apply a 3 by 3 filter, the output is affected by pixels in a 3 by 3 subset of the image.
§ Dilation is used to obtain a larger receptive field (the portion of the image affecting the filter's output).
§ If dilation is set to 2, then instead of a contiguous 3 by 3 subset of the image, every other pixel of a 5 by 5 subset of the image affects the output.
Figure: 3 by 3 filter with dilation of 2 (Galaxy Training tutorial, linked above)

Activation function
§ After the filter has been applied to the whole image, an activation function is applied to the output to introduce non-linearity.
§ The preferred activation function in CNNs is ReLU: it leaves positive outputs as they are and replaces negative values with 0.

Figure: single-channel 2D convolution (Galaxy Training tutorial, linked above)

CNN: Convolutional Layer
The output of the Conv layer can be interpreted as holding neurons arranged in a 3D volume. The Conv layer's parameters consist of a set of learnable filters. Every filter is small spatially (along width and height), but extends through the full depth of the input volume. During the forward pass, each filter is slid (convolved) across the width and height of the input volume, producing a 2-dimensional activation map of that filter. The network will learn filters (via backpropagation) that activate (through the activation function) when they see some specific type of feature at some spatial position in the input.
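To show what "sliding a filter to produce an activation map" means, here is a minimal NumPy sketch (illustrative only): one single-channel filter, no padding, followed by ReLU; multi-channel filters and the bias are omitted for brevity.

```python
import numpy as np

def conv2d_single_channel(image, kernel, stride=1):
    """Slide one F x F filter over a single-channel image (no padding) -> 2D activation map."""
    H, W = image.shape
    F = kernel.shape[0]
    out_h = (H - F) // stride + 1
    out_w = (W - F) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i*stride:i*stride + F, j*stride:j*stride + F]
            out[i, j] = np.sum(patch * kernel)   # dot product of the filter with the local patch
    return out

image = np.arange(16, dtype=float).reshape(4, 4)   # toy 4x4 "image"
kernel = np.ones((3, 3)) / 9.0                     # a 3x3 averaging filter standing in for a learned one
activation_map = np.maximum(0, conv2d_single_channel(image, kernel))  # apply ReLU to the map
print(activation_map.shape)   # (2, 2): the 4x4 -> 2x2 shrinkage from the padding slide
```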
CNN: Convolutional Layer
Stacking these activation maps for all filters along the depth dimension forms the full output volume. Every entry in the output volume can thus also be interpreted as the output of a neuron that looks at only a small region of the input and shares parameters with the neurons in the same activation map (since these numbers all result from applying the same filter). With 6 filters, we get 6 activation maps.

Local Connectivity
As we have realized by now, it is impractical to use fully connected networks when dealing with high-dimensional images/data. Hence the concept of local connectivity: each neuron only connects to a local region of the input volume. The spatial extent of this connectivity is called the receptive field of the neuron. The extent of the connectivity along the depth axis is always equal to the depth of the input volume. In other words, the connections are local in space (along width and height), but always full along the entire depth of the input volume.

A closer look at spatial dimensions (worked through over several slides).

Other padding operations: replication padding, reflection padding, ...

Pooling Layer and Max Pooling
(Figures from Fei-Fei Li & Andrej Karpathy & Justin Johnson lecture slides.)
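Before the summary below, a minimal NumPy sketch (illustrative only) of max pooling; the 2 by 2 window with stride 2 is an assumed common setting, and, as the summary notes, there are no parameters to learn.

```python
import numpy as np

def max_pool2d(x, size=2, stride=2):
    """Max pooling: keep the maximum of each (size x size) window of a 2D activation map."""
    H, W = x.shape
    out_h = (H - size) // stride + 1
    out_w = (W - size) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = x[i*stride:i*stride + size, j*stride:j*stride + size].max()
    return out

x = np.array([[1., 1., 2., 4.],
              [5., 6., 7., 8.],
              [3., 2., 1., 0.],
              [1., 2., 3., 4.]])
print(max_pool2d(x))   # [[6. 8.] [3. 4.]]: the 4x4 map is downsampled to 2x2
```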
Pooling layer: summary
Let's assume the input is W1 x H1 x C. The pooling layer needs 2 hyperparameters:
- the spatial extent F
- the stride S
This will produce an output of W2 x H2 x C, where:
- W2 = (W1 - F)/S + 1
- H2 = (H1 - F)/S + 1
Number of parameters: 0

Fully Connected Layer (FC layer)
Contains neurons that connect to the entire input volume, as in ordinary neural networks.

Summary of CNNs
§ ConvNets stack CONV, POOL, and FC layers.
§ Trend towards smaller filters and deeper architectures.
§ Trend towards getting rid of POOL/FC layers (just CONV).
§ Historically, architectures looked like [(CONV-RELU)*N - POOL?]*M - (FC-RELU)*K, SOFTMAX, where N is usually up to ~5, M is large, and 0 <= K <= 2 (a sketch of this pattern follows this list).
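A minimal PyTorch sketch (an illustration, not an architecture from the slides) of that historical pattern with N = 1, M = 2, K = 1 on 32x32x3 inputs; the layer widths are arbitrary choices.

```python
import torch
import torch.nn as nn

# [(CONV-RELU)*N - POOL?]*M - (FC-RELU)*K - SOFTMAX with N = 1, M = 2, K = 1 (widths are assumptions)
model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),   # CONV-RELU
    nn.MaxPool2d(kernel_size=2, stride=2),                   # POOL: 32x32 -> 16x16
    nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),  # CONV-RELU
    nn.MaxPool2d(kernel_size=2, stride=2),                   # POOL: 16x16 -> 8x8
    nn.Flatten(),                                            # 32 * 8 * 8 = 2048 features
    nn.Linear(32 * 8 * 8, 128), nn.ReLU(),                   # FC-RELU
    nn.Linear(128, 10),                                      # class scores
    nn.Softmax(dim=1),                                       # SOFTMAX over the 10 classes
)

scores = model(torch.randn(1, 3, 32, 32))   # one 32x32 RGB image -> 10 class probabilities
print(scores.shape)                         # torch.Size([1, 10])
```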
