
NN_lecture_3_CNNs_2023.pdf


Document Details


Uploaded by VictoriousGlockenspiel

2023

Tags

neural networks, convolutional neural networks, machine learning

Full Transcript


Convolutional Neural Networks
Alexandru Sorici, October 18, 2023

Table of contents
1. Introduction
2. Building blocks of CNNs
3. Fully Convolutional Networks
4. Normalization
5. Classic Networks

Intro

Conv Nets are everywhere
Figure 1: SafeUAV: Learning to estimate depth and safe landing areas for UAVs from synthetic data. Alina Marcu, Dragos Costea, Vlad Licaret, Mihai Parvu, Marius Leordeanu
Figure 2: Spatio-Temporal Features in Action Recognition Using 3D Skeletal Joints. M. Trascau, M. Nan, AM Florea
Figure 3: End-to-end models for self-driving cars on UPB campus roads. A. Mihalea, R. Samoilescu, A. Nica, M. Trascau, A. Sorici, AM Florea

CNN Building Blocks

Convolution Operation
For the continuous case, the convolution product of two functions f and g is defined as:

(f * g)(t) = \int_{-\infty}^{+\infty} f(\tau) g(t - \tau) \, d\tau = \int_{-\infty}^{+\infty} f(t - \tau) g(\tau) \, d\tau

For the discrete case, the integral is replaced by a sum:

(f * g)(n) = \sum_{m=-\infty}^{\infty} f(m) g(n - m) = \sum_{m=-\infty}^{\infty} f(n - m) g(m)

Convolution Operation in the spatial domain
An RGB image can be considered a function f : R^2 → R^3, where f(x, y) = (red value, green value, blue value)^T. Operators can be applied to the image by treating it as a function.

Convolution Operation on images
The convolution operation uses a convolution kernel K (of dimension w × w, where w = 2k + 1) applied to a source image I:

O = I * K,  with  O(i, j) = \sum_{u=0}^{w-1} \sum_{v=0}^{w-1} I(i + u - k, j + v - k) \cdot K(u, v)

Properties: the convolution operation is associative and commutative.

CNN Layers
We use 3 main types of layers when constructing CNN architectures:
• Convolution Layer
• Pooling Layer
• Fully Connected Layer (as in an MLP)
Figure 4: Example of CNN layer stacking. Source: Stanford CS231n Lecture Notes

Fully Connected Layer
Consider an image of dimension 32 × 32 × 3 that we wish to assign to one of 10 classes.
1. Linearize the image ⟹ a vector of dimension 3072 × 1.
2. Pass the resulting vector through a fully connected layer, H(x) = W \cdot x + b:

\begin{pmatrix} w_{1,1} & w_{1,2} & \dots & w_{1,3072} \\ \vdots & \vdots & \ddots & \vdots \\ w_{10,1} & w_{10,2} & \dots & w_{10,3072} \end{pmatrix} \cdot \begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_{3072} \end{pmatrix} + \begin{pmatrix} b_1 \\ \vdots \\ b_{10} \end{pmatrix} = \begin{pmatrix} y_1 \\ \vdots \\ y_{10} \end{pmatrix}  (1)

Disadvantage: we lose information about the spatial arrangement of pixels (an example of inductive bias).

Convolution Layer - I
Alternative: connect each neuron to only a local region of the input volume.
Receptive Field = the size of the filter, i.e. the spatial extent of the local connectivity of each neuron.

Convolution Layer - II
Case study: for an input volume (e.g. a source image) of 32 × 32 × 3, applying a filter of size 5 × 5 × 3 gives an output map of dimension 28 × 28 × 1.

Worked example (the slides animate the kernel sliding over the input; only the matrices are reproduced here): convolving the 7 × 7 binary image

I = [[0,1,1,1,0,0,0], [0,0,1,1,1,0,0], [0,0,0,1,1,1,0], [0,0,0,1,1,0,0], [0,0,1,1,0,0,0], [0,1,1,0,0,0,0], [1,1,0,0,0,0,0]]

with the 3 × 3 kernel K = [[1,0,1], [0,1,0], [1,0,1]] yields the 5 × 5 map

I * K = [[1,4,3,4,1], [1,2,4,3,3], [1,2,3,4,1], [1,3,3,1,1], [3,3,1,1,0]]
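To make the worked example concrete, here is a minimal PyTorch sketch (not from the lecture; only the matrix values above are taken from the slides, everything else is illustrative). It reproduces the 5 × 5 result and checks the 28 × 28 spatial size of the case study. Note that torch.nn.functional.conv2d actually computes a cross-correlation, which coincides with the convolution here because the kernel is symmetric.

```python
import torch
import torch.nn.functional as F

# 7x7 binary input image and 3x3 kernel from the worked example
I = torch.tensor([[0, 1, 1, 1, 0, 0, 0],
                  [0, 0, 1, 1, 1, 0, 0],
                  [0, 0, 0, 1, 1, 1, 0],
                  [0, 0, 0, 1, 1, 0, 0],
                  [0, 0, 1, 1, 0, 0, 0],
                  [0, 1, 1, 0, 0, 0, 0],
                  [1, 1, 0, 0, 0, 0, 0]], dtype=torch.float32)
K = torch.tensor([[1, 0, 1],
                  [0, 1, 0],
                  [1, 0, 1]], dtype=torch.float32)

# conv2d expects (N, C, H, W) inputs and (C_out, C_in, kH, kW) weights
out = F.conv2d(I.view(1, 1, 7, 7), K.view(1, 1, 3, 3))
print(out.squeeze())          # 5x5 map: [[1,4,3,4,1], [1,2,4,3,3], ...]

# Case study: 32x32x3 input, one 5x5x3 filter -> 28x28x1 activation map
x = torch.randn(1, 3, 32, 32)
w = torch.randn(1, 3, 5, 5)
print(F.conv2d(x, w).shape)   # torch.Size([1, 1, 28, 28])
```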
General case
Starting from an input volume of dimension Height × Width × Depth and applying a convolution layer with a kernel (filter) of size K × K × Depth, we get an activation map of size [Height − (K − 1)] × [Width − (K − 1)] × 1.

Convolution Layer - III - Hyperparameters
Four hyperparameters control the size of the output volume:
• Depth: the number of filters (each looks at something different in the input).
• Stride: the step taken when sliding the filter. Usual practice: stride = 1 or 2 (3 or more is very uncommon).
• Zero-Padding: the number of 0s that surround the border of the input volume. Most common use: preserve the spatial size of the input volume, i.e. input and output have the same width and height.
• Dilation: the distance between elements of the convolution kernel.

Convolution arithmetic animations: https://github.com/vdumoulin/conv_arithmetic

Convolution Layer - III - Hyperparameter arithmetic
Convolution output size computation: how do kernel size, stride, padding and dilation influence the size of the output?

H_{out} = \left\lfloor \frac{H_{in} + 2 \cdot padding - dilation \cdot (kernel\_size - 1) - 1}{stride} \right\rfloor + 1

W_{out} = \left\lfloor \frac{W_{in} + 2 \cdot padding - dilation \cdot (kernel\_size - 1) - 1}{stride} \right\rfloor + 1

Observation: there is a constraint on strides if the filter is to tile the input exactly: the result of the division has to be an integer. E.g. for an input of size H_in = 10, padding = 0, kernel_size = 3, dilation = 1, a stride of 2 does not fit exactly, because (10 + 2·0 − 1·(3 − 1) − 1)/2 + 1 = 7/2 + 1 is not an integer (frameworks that apply the floor simply leave the last rows/columns uncovered).

Convolution Layer - III - Parameter Sharing
Parameter sharing is used in CNNs to control the number of learnable parameters. E.g. if the output volume is 55 × 55 × 96 and the filter size is 11 × 11 × 3, having a separate weight vector for each neuron would result in 290,400 × 364 = 105,705,600 parameters for a single layer (55·55·96 = 290,400 neurons, each with 11·11·3 = 363 weights + 1 bias).
⟹ Make a simplifying assumption: if one feature is useful to compute at some spatial position (x, y), then it should also be useful to compute at a different position (x2, y2).
⟹ Parameters within a depth slice are shared: 11 × 11 × 3 × 96 = 34,944 weights (+ 96 biases).

Convolution Layer - III - Learned Features
Figure 5: Example filters learned by Krizhevsky et al. Each of the 96 filters shown here is of size [11x11x3], and each one is shared by the 55·55 neurons in one depth slice. Notice that the parameter sharing assumption is relatively reasonable: if detecting a horizontal edge is important at some location in the image, it should intuitively be useful at some other location as well, due to the translationally-invariant structure of images. There is therefore no need to relearn to detect a horizontal edge at every one of the 55·55 distinct locations in the Conv layer output volume.
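To check output sizes like the ones above without doing the arithmetic by hand, a small helper function can mirror the formula (my own illustrative sketch, not part of the lecture; the same expression appears in the torch.nn.Conv1d/2d/3d documentation below):

```python
import math

def conv_output_size(size_in: int, kernel_size: int, stride: int = 1,
                     padding: int = 0, dilation: int = 1) -> int:
    """Output size along one spatial dimension, as in the formula above."""
    numerator = size_in + 2 * padding - dilation * (kernel_size - 1) - 1
    return math.floor(numerator / stride) + 1

# Case study: 32x32 input, 5x5 kernel, stride 1, no padding -> 28
print(conv_output_size(32, kernel_size=5))            # 28

# Observation example: H_in = 10, kernel 3, stride 2, no padding.
# (10 - 2 - 1) / 2 = 3.5 is not an integer; the floor gives 4,
# so the last input column is simply not covered by the filter.
print(conv_output_size(10, kernel_size=3, stride=2))  # 4
```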
Source: Stanford CS231n notes.

Convolution Layer - IV - PyTorch classes
• Class header: class torch.nn.Conv1d(in_channels, out_channels, kernel_size, stride=1, padding=0, dilation=1, groups=1, bias=True)
• Input: (N, C_in, L_in)
• Output: (N, C_out, L_out), where
  L_{out} = \left\lfloor \frac{L_{in} + 2 \cdot padding - dilation \cdot (kernel\_size - 1) - 1}{stride} \right\rfloor + 1
• Formula: out(N_i, C_{out_j}) = bias(C_{out_j}) + \sum_{k=0}^{C_{in}-1} weight(C_{out_j}, k) \star input(N_i, k)

Convolution Layer - IV - PyTorch Conv2d
• Class header: class torch.nn.Conv2d(in_channels, out_channels, kernel_size, stride=1, padding=0, dilation=1, groups=1, bias=True)
• Input: (N, C_in, H_in, W_in)
• Output: (N, C_out, H_out, W_out), where
  H_{out} = \left\lfloor \frac{H_{in} + 2 \cdot padding[0] - dilation[0] \cdot (kernel\_size[0] - 1) - 1}{stride[0]} \right\rfloor + 1
  W_{out} = \left\lfloor \frac{W_{in} + 2 \cdot padding[1] - dilation[1] \cdot (kernel\_size[1] - 1) - 1}{stride[1]} \right\rfloor + 1
• Formula: out(N_i, C_{out_j}) = bias(C_{out_j}) + \sum_{k=0}^{C_{in}-1} weight(C_{out_j}, k) \star input(N_i, k)

Convolution Layer - IV - PyTorch Conv3d
• Class header: class torch.nn.Conv3d(in_channels, out_channels, kernel_size, stride=1, padding=0, dilation=1, groups=1, bias=True)
• Input: (N, C_in, D_in, H_in, W_in)
• Output: (N, C_out, D_out, H_out, W_out), where
  D_{out} = \left\lfloor \frac{D_{in} + 2 \cdot padding[0] - dilation[0] \cdot (kernel\_size[0] - 1) - 1}{stride[0]} \right\rfloor + 1
  H_{out} = \left\lfloor \frac{H_{in} + 2 \cdot padding[1] - dilation[1] \cdot (kernel\_size[1] - 1) - 1}{stride[1]} \right\rfloor + 1
  W_{out} = \left\lfloor \frac{W_{in} + 2 \cdot padding[2] - dilation[2] \cdot (kernel\_size[2] - 1) - 1}{stride[2]} \right\rfloor + 1
• Formula: out(N_i, C_{out_j}) = bias(C_{out_j}) + \sum_{k=0}^{C_{in}-1} weight(C_{out_j}, k) \star input(N_i, k)

Convolution Layer - IV - Grouped Convolutions
Figure 6: a) Normal convolution, b) Grouped convolution with 2 groups. Source: Optimizing Grouped Convolutions on Edge Devices, Gibson et al., 2018

Convolution Layer - V - Spatially Separable Convolutions
Figure 7: Spatially separable convolution example. Image source: towardsdatascience.com
• Advantage: reduced computational load — fewer kernel parameters → fewer multiplications
• Disadvantage: loss of representational power (not all 2D kernels can be factorized into a sequence of two 1-D kernel applications)
• Some useful kernels can be factorized this way, e.g. the Sobel kernel for edge detection

Convolution Layer - VI - Depthwise Separable Convolutions
• Separate the convolution operation into two operations: (i) a depthwise convolution and (ii) a pointwise convolution
Figure 8: Typical 2D convolution. Image source: opengenus.org
• Number of multiplications for a typical 2D convolution: 128 kernels of size 3x3x3, each moving over 5x5 positions ⟹ 128 × 3 × 3 × 3 × 5 × 5 = 86,400

Convolution Layer - VI - Depthwise Separable Convolutions (contd)
Figure 9: Composition of depthwise and pointwise convolutions. Image source: opengenus.org
Computational advantage:
• depthwise: 3 kernels of size 3x3x1, each moving over 5x5 positions ⟹ 3 × 3 × 3 × 1 × 5 × 5 = 675
• pointwise: 128 filters of size 1x1x3, each moving over 5x5 positions ⟹ 128 × 1 × 1 × 3 × 5 × 5 = 9,600
• In general, the reduction is proportional to 1/k², where k is the kernel size
• Depthwise separable convolutions are used in small, efficient conv nets (e.g. the MobileNet family)
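A minimal PyTorch sketch of the depthwise-separable idea, using the groups argument of nn.Conv2d. The channel counts mirror the 3-channel / 128-filter example above; everything else is illustrative and not code from the lecture:

```python
import torch
import torch.nn as nn

in_ch, out_ch, k = 3, 128, 3

# Standard convolution: 128 filters of size 3x3x3
standard = nn.Conv2d(in_ch, out_ch, kernel_size=k, padding=1, bias=False)

# Depthwise separable: a per-channel 3x3 convolution (groups=in_channels)
# followed by a 1x1 pointwise convolution that mixes channels
depthwise_separable = nn.Sequential(
    nn.Conv2d(in_ch, in_ch, kernel_size=k, padding=1, groups=in_ch, bias=False),
    nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False),
)

def n_params(m):
    return sum(p.numel() for p in m.parameters())

print(n_params(standard))             # 128*3*3*3 = 3456
print(n_params(depthwise_separable))  # 3*3*3 + 128*3 = 411

x = torch.randn(1, in_ch, 5, 5)
print(standard(x).shape, depthwise_separable(x).shape)  # both (1, 128, 5, 5)
```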
Pooling Layer
• Reduces the input volume to a smaller, more manageable representation
• Operates independently over each activation map
Common settings:
• filter size = 2, stride = 2
• filter size = 3, stride = 2

Pooling Layer (contd)
• Usually used at the final stages of a conv net architecture to replace fully connected layers
• Tries to avoid overfitting by forcing feature maps to hold "global" information that is relevant for the subsequent task (e.g. classification, identification)

Fully Convolutional Networks
• Used most often in semantic segmentation or in generative networks
• Relies on two important operations: unpooling and upsampling (through transpose convolution / deconvolution)
Figure 10: Semantic Segmentation
• Intuitively: the Conv network captures context information (what) about the input, while the DeConv network (re)captures the spatial information (where)

Fully Convolutional Networks - Unpooling
Figure 11: Max pooling stores the indices of the maxima; unpooling places the pooled activations back at those indices and fills the remaining positions with zeros. Source: Towards Data Science - Review: DeconvNet - Unpooling Layer (Semantic Segmentation)

Fully Convolutional Networks - Transpose Convolution
The convolution operation can be expressed as a matrix multiplication after reshaping the input into a linearized vector.
Figure 12: Input = 4 × 4, stride = 1, no padding. Figure 13: Original convolution kernel.
In convolution we have a many-to-one relationship (in the example, 9-to-1) between input and output.
Source: Medium - Up-sampling with Transposed Convolution

Fully Convolutional Networks - Transpose Convolution (contd)
• The transpose convolution operation can be expressed using a transposed matrix.
• We implement a one-to-many relationship (in our example, 1-to-9) between input and output.
• The parameters of the upsampling kernel are learned.
Figure 14: Input = 2 × 2, stride = 1, no padding

Normalization
Figure 15: Visual comparison of different normalization methods. N - batch dimension, C - channel (feature map) dimension, H, W - input size dimensions.
Uses of normalization:
• addresses the covariate-shift problem and simplifies the learning dynamics ⟹ higher learning rates are possible ⟹ faster convergence
• makes the loss surface smoother (the magnitude of gradients is bounded better ⟹ "better behaved" gradients)
• some form of normalization becomes necessary in very deep networks

Batch Normalization
γ and β are learnable parameters → the mean and variance of layer activations are decided by two simple parameters (as opposed to complex interactions between layers during learning).

Batch Normalization - Discussion
Caveats:
• The normalization statistics in BatchNorm correspond to the mean and variance of the mini-batch (instead of the whole data set)
• Small mini-batches may produce noisy estimates ⟹ negatively affects training
• Not suitable for recurrent connections in an RNN (recurrent activations will, and also should, have different statistics ⟹ one would have to fit a separate BatchNorm for each time step)

Fine tuning:
• Question: use the statistics of mean and variance from the original dataset or from the new one?
• Most frameworks do not freeze the BatchNorm params (γ, β) when freezing a layer ⟹ they use the mini-batch statistics from the new dataset
• If the new dataset is large and the mini-batch size is comparable (equal) to the original one → the statistics can be retrained
• If the new dataset is small or the mini-batch size is small → it may be better to use the original dataset statistics

Where to place the BatchNorm layer?
• Most often before the non-linearity
• Sometimes worthwhile to test placing it after the activation function
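As a concrete sketch of the "BatchNorm before the non-linearity" placement discussed above, a typical conv block in PyTorch looks like this. The channel sizes are arbitrary and the block is illustrative, not taken from the lecture:

```python
import torch
import torch.nn as nn

# Conv -> BatchNorm -> ReLU, the most common ordering mentioned above.
# bias=False in the conv: BatchNorm's beta already provides a per-channel shift.
block = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=3, padding=1, bias=False),
    nn.BatchNorm2d(64),   # learnable gamma (weight) and beta (bias)
    nn.ReLU(inplace=True),
)

x = torch.randn(8, 3, 32, 32)   # statistics are computed over the mini-batch
print(block(x).shape)           # torch.Size([8, 64, 32, 32])

# At inference time, running estimates of mean/variance are used instead:
block.eval()
with torch.no_grad():
    print(block(x).shape)
```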
Classic Networks

Classic Networks - LeNet 5
Figure 16: LeNet-5 - Y. Lecun, L. Bottou, Y. Bengio, P. Haffner (1998)
• No. of params: 60,000

Classic Networks - AlexNet
Figure 17: AlexNet - Alex Krizhevsky, Geoffrey Hinton, Ilya Sutskever (2012)
• No. of params: 60 million

Classic Networks - AlexNet Details / Retrospectives
• First CNN to win the ImageNet challenge
• First use of ReLU
• Heavy data augmentation (lots of params to train)
• Dropout of 0.5
• Batch size of 128
• Optimization using SGD + Momentum 0.9
• Learning rate 1e-2, reduced by 10 manually when accuracy plateaus
• L2 weight decay 5e-4
• In competition, used a 7-CNN ensemble: 18.2% → 15.2%

Classic Networks - VGG-16
Figure 18: VGG-16 - Karen Simonyan and Andrew Zisserman (2014)
• No. of params: 138 million
• Uses only 3 × 3 CONV with stride 1, pad 1 and 2 × 2 MAX POOL with stride 2
• A stack of three 3 × 3 conv layers (stride 1) has the same effective receptive field as one 7 × 7 conv layer, but is deeper, has more non-linearities and fewer params: 3 × (3² C²) vs 7² C², where C is the number of channels per layer
• Similar training procedure as AlexNet
• FC7 features generalize well to other tasks
Figure 19: Block diagram of VGG-16 and VGG-19. Source: Stanford CS231n Lecture 9, slide 42

Classic Networks - GoogleNet - Inception Cell

Classic Networks - GoogleNet - Architecture
Details:
• Introduces bottleneck layers, using 1 × 1 convolutions to reduce computational complexity (reduce the number of channels over which the 3 × 3 and 5 × 5 conv layers have to operate)
• Philosophy of the "inception module": design a good local network topology and stack the modules on top of one another
• Apply parallel filter operations on the same input
• Multiple receptive field sizes (1 × 1, 3 × 3, 5 × 5)
• Pooling operation (3 × 3)
• Concatenate filter outputs depth-wise
Figure 20: GoogleNet - Christian Szegedy et al. (2014)
• No. of params: 5 million

Classic Networks - GoogleNet - Architecture Details
• After the last convolutional layer, a global average pooling layer spatially averages each feature map before the final FC layer
• No longer multiple expensive FC layers
• Uses auxiliary classification outputs to inject additional gradient at the lower layers
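The inception-module idea (parallel 1×1 / 3×3 / 5×5 branches with 1×1 bottlenecks, plus a pooling branch, concatenated depth-wise) can be sketched in a few lines of PyTorch. The branch channel counts below are made up for illustration and are not the ones used in GoogleNet:

```python
import torch
import torch.nn as nn

class InceptionBlock(nn.Module):
    """Illustrative inception-style module: parallel branches with 1x1
    bottlenecks, outputs concatenated along the channel dimension."""
    def __init__(self, in_ch):
        super().__init__()
        self.branch1x1 = nn.Conv2d(in_ch, 64, kernel_size=1)
        self.branch3x3 = nn.Sequential(
            nn.Conv2d(in_ch, 32, kernel_size=1),               # bottleneck
            nn.Conv2d(32, 64, kernel_size=3, padding=1),
        )
        self.branch5x5 = nn.Sequential(
            nn.Conv2d(in_ch, 16, kernel_size=1),               # bottleneck
            nn.Conv2d(16, 32, kernel_size=5, padding=2),
        )
        self.branch_pool = nn.Sequential(
            nn.MaxPool2d(kernel_size=3, stride=1, padding=1),
            nn.Conv2d(in_ch, 32, kernel_size=1),
        )

    def forward(self, x):
        branches = [self.branch1x1(x), self.branch3x3(x),
                    self.branch5x5(x), self.branch_pool(x)]
        return torch.cat(branches, dim=1)   # depth-wise concatenation

x = torch.randn(1, 192, 28, 28)
print(InceptionBlock(192)(x).shape)   # 64+64+32+32 = 192 output channels
```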
Classic Networks - Residual Network (ResNet) - Skip Connection
Figure 21: Skip connection

y = x + F(x)  ⟹  \frac{\partial E}{\partial x} = \frac{\partial E}{\partial y} \cdot \frac{\partial y}{\partial x} = \frac{\partial E}{\partial y} \cdot (1 + F'(x))

Classic Networks - ResNet
Figure 22: ResNet - Kaiming He et al. (2015)
• No. of params: 25 million (ResNet-50)
• ImageNet winner 2015: 3.6% error rate (better than human performance of 5.1%)

Classic Networks - Residual Network (ResNet) - Architecture Details
• Issue: deep networks built from stacks of "plain" conv layers suffer from an optimization problem
• Solution: use the layers to fit the residual F(x) = H(x) − x instead of H(x) directly (where H(x) would be the output of the "plain" layers)
• Stack residual blocks; periodically double the number of filters and downsample spatially using stride 2
• No FC layers at the end
• For deeper networks (ResNet-50+), use a "bottleneck layer" to improve computational efficiency
• Training ResNet in practice:
  • BatchNorm after every CONV layer, Xavier/2 initialization
  • SGD + Momentum (0.9), learning rate 0.1 (divided by 10 when the validation error plateaus)
  • Mini-batch size 256, weight decay of 1e-5, no dropout

Classic Networks - Wide ResNet
Highlights:
• Performs experiments to explore the problem of diminishing feature reuse (an assumed weakness of the original ResNet design: because of the skip connections, the network may forego using the residual blocks)
• Experiments with:
  • Adding dropout regularization within the residual block
  • Different numbers of conv layers within a residual block (l-factor)
  • Multiplication of the feature map size within a residual block (k-factor)
  • conv-bn-relu vs bn-relu-conv (post-activation vs pre-activation)
Figures 23-24: Source - Wide Residual Networks, Zagoruyko and Komodakis, 2017

Wide ResNet block variants:
• B(3,3) - original "basic" block
• B(3,1,3) - with one extra 1×1 layer
• B(1,3,1) - with the same dimensionality of all convolutions, "straightened" bottleneck
• B(1,3) - the network has alternating 1×1 - 3×3 convolutions everywhere
• B(3,1) - similar idea to the previous block
• B(3,1,1) - Network-in-Network style block
Figures 25-27: Source - Wide Residual Networks, Zagoruyko and Komodakis, 2017

Wide ResNet takeaways:
• Widening consistently improves performance across residual networks of different depths
• Increasing both depth and width helps until the number of parameters becomes too high and stronger regularization is needed
• The main power of residual networks is in the residual blocks, not in extreme depth

Classic Networks - ResNeXt
Highlights:
• Focuses on three central aspects inspired by previous conv nets (VGG, ResNet and Inception): uniform blocks, residual connections and multi-branch convolutional architectures
• Uses grouped convolutions to implement multi-branch aggregated transformations
• Considers cardinality (the number of branches) as the important control dimension for network performance
• Strives to develop an easy-to-use template for building both wide and deep conv nets
Figure 28: Equivalent building blocks of ResNeXt. (a) Aggregated residual transformations. (b) A block equivalent to (a), implemented as early concatenation. (c) A block equivalent to (a, b), implemented as grouped convolutions. Notations in bold text highlight the reformulation changes. A layer is denoted as (# input channels, filter size, # output channels). Source: Aggregated Residual Transformations for Deep Neural Networks, Xie et al., 2017
Figures 29-30: Increased cardinality leads to better accuracy, and is better than increased depth or increased width of the network. Source: Aggregated Residual Transformations for Deep Neural Networks, Xie et al., 2017
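A minimal sketch of the grouped-convolution form of a ResNeXt-style block (variant (c) above), with a ResNet skip connection, y = x + F(x). The 256 → 128 → 256 channel layout and cardinality 32 are the example values commonly shown for this block; the code itself is illustrative, not from the lecture:

```python
import torch
import torch.nn as nn

class ResNeXtBlock(nn.Module):
    """Grouped-convolution form of an aggregated-transformations block
    (illustrative sketch; channel sizes follow the 256-d example block)."""
    def __init__(self, channels=256, bottleneck=128, cardinality=32):
        super().__init__()
        self.transform = nn.Sequential(
            nn.Conv2d(channels, bottleneck, kernel_size=1, bias=False),
            nn.BatchNorm2d(bottleneck), nn.ReLU(inplace=True),
            # the 32 branches are implemented as one grouped convolution
            nn.Conv2d(bottleneck, bottleneck, kernel_size=3, padding=1,
                      groups=cardinality, bias=False),
            nn.BatchNorm2d(bottleneck), nn.ReLU(inplace=True),
            nn.Conv2d(bottleneck, channels, kernel_size=1, bias=False),
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(x + self.transform(x))   # y = x + F(x)

x = torch.randn(1, 256, 14, 14)
print(ResNeXtBlock()(x).shape)   # torch.Size([1, 256, 14, 14])
```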
Classic Networks - DenseNet
Highlights:
• Simplifies the connectivity pattern between layers introduced in other networks (e.g. ResNet)
• Focuses on obtaining representational power through feature reuse, instead of very deep or very wide network architectures
• Introduces direct connections between any two layers with the same feature-map size
• Information (feature maps) from previous volumes is concatenated to subsequent volumes; the amount of concatenation is controlled by a growth factor
Figures 31, 32, 34: Source: Densely Connected Convolutional Networks, Huang et al., 2018
Figure 33: Full schematic representation of DenseNet-121. Source: Pablo Ruiz Blog, https://towardsdatascience.com/understanding-and-visualizing-densenets-7f688092391a

Classic Networks - Performance vs. Number of Operations
Figure 35: Source: Benchmark analysis of representative deep neural network architectures, 2018, S. Bianco et al.

Classic Networks - Performance vs. Speed (fps)
Figure 36: Source: Benchmark analysis of representative deep neural network architectures, 2018, S. Bianco et al.

Transfer Learning
Figures 37-38: Source: Stanford CS231n Lecture 9, 2019
• Transfer learning with CNNs is the norm, not the exception
• CNNs trained on ImageNet are used in object detection (e.g. Fast R-CNN) and image captioning (CNN + RNN)
• Takeaway: if you have a dataset of interest, but it has fewer than 1 million images:
  • Find a large dataset with similar data and train a big ConvNet there
  • Transfer learn to your dataset (a sketch follows below)
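As referenced in the takeaway above, a minimal transfer-learning sketch in PyTorch: load an ImageNet-pretrained backbone from torchvision, freeze it, and replace the final FC layer for the new task. It assumes torchvision ≥ 0.13 for the weights argument; the 10-class head and hyperparameters are illustrative:

```python
import torch
import torch.nn as nn
from torchvision import models

# ImageNet-pretrained backbone (torchvision >= 0.13 weights API)
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)

# Freeze the pretrained feature extractor
for param in model.parameters():
    param.requires_grad = False

# Replace the final FC layer with a head for the new, smaller dataset
model.fc = nn.Linear(model.fc.in_features, 10)   # e.g. 10 target classes

# Only the new head's parameters are optimized
optimizer = torch.optim.SGD(model.fc.parameters(), lr=1e-3, momentum=0.9)

x = torch.randn(4, 3, 224, 224)
print(model(x).shape)   # torch.Size([4, 10])
```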
