Summary

This presentation introduces the fundamentals of convolutional neural networks (CNNs) for multimedia applications. Starting from the limits of fully connected networks on MNIST, it covers the convolution operation, zero-padding, pooling and strided convolutions, layer complexity (parameters and MACs), and landmark architectures from LeNet to AlexNet, VGG and ResNet.

Full Transcript


MN909 - Deep Learning for Multimedia
Convolutional Neural Networks
Enzo Tartaglione

Till now…
- We have seen the working principles of any deep neural network.
- But what kind of neural networks have we been working on until now?
- Models in which every neuron of the n-th layer has one (non-shared!) parameter for each neuron of layer n-1 are called «fully connected», «multi-layer perceptrons», etc.
- They place NO prior on the distribution of the features at the input!

A known dataset…
- C=10 classes
- 60k digit images
- 28x28 size (32x32)
- Grayscale, 8 bit
- Each 28 px x 28 px image is flattened into a vector of 784 pixels (x1, ..., x784).
(https://en.wikipedia.org/wiki/MNIST_database, http://yann.lecun.com/exdb/mnist/)

The LeNet-300 architecture
- Number of inputs equal to the number of pixels (e.g. 28x28 = 784)
- All inputs connected to all hidden-layer neurons
- C=10 neurons in the output layer
- Input: vector of 784 px (normalized), x1, ..., x784
Y. LeCun, L. Bottou, Y. Bengio and P. Haffner: Gradient-Based Learning Applied to Document Recognition, Proceedings of the IEEE, November 1998

A look at the complexity…
LeNet300
Layer    Units    Complexity [prms]
1        300      300 * (28*28) ≈ 235k
2        100      100 * 300 = 30k
3        10       10 * 100 = 1k
- Most of the complexity is in the first FC layer (88%).
- Layer complexity: m * (1+n) ~ m * n (m units, n inputs)
- What about larger images (larger n)?
- Increased computational and storage complexity
- More training images required to avoid overfitting

We lose spatial correlation!
- Pixels in natural images are spatially correlated.
- FCNs treat images as 1D vectors.
- They are unaware of the 2D structure of the image.

Feature learning
- Images are characterized by features such as edges, corners, endpoints, etc.
- We know how to locate keypoints and encode features.
- This requires ad-hoc feature detector design.
- Instead, let the network learn local feature detector(s).
- This requires a new neuron structure that is aware of the spatial topology.

The convolution operation

The convolution
- Usually defined as f ∗ g (f and g continuous functions over x)
- E.g.: sliding filter (or kernel) f applied to signal g
(https://fr.wikipedia.org/wiki/Produit_de_convolution)

Not only for images!

Feature detection with 2D convolutions
- Filter or kernel sized n x n

Practical example
- 5x5 input image crop
- 3x3 filter, slid over the crop one stride at a time
- Each filter position yields one value of the «feature» (a misleading name):
    1st position:  1*1 + 1*0 + 1*1 + ... + 1*1 = 4
    2nd position:  1*1 + 1*0 + 0*1 + ... + 1*1 = 3
    3rd position:  1*1 + 0*0 + 1*0 + ... + 1*1 = 4
    4th position:  1*0 + 0*1 + 1*1 + ... + 1*1 = 2
    ...
    last position: 1*1 + 1*0 + 1*1 + ... + 0*1 = 4
- Result: a 3x3 feature map
    4 3 4
    2 4 3
    2 3 4
(http://deeplearning.stanford.edu/wiki/index.php/Feature_extraction_using_convolution)

The convolutional neuron
- 2D convolution operation: filter weights (w1, ..., w9) applied to the input (x1, ..., x25)
    z = w1 x1 + ... + w9 x13   (= 4 at the first position)
- Functionally equivalent neuron: the inputs from the input layer are weighted by w1, ..., w9 and summed, together with the bias b, to give the feature z.
- It is a standard neuron with spatially-constrained inputs.
- It detects the same feature across all image positions.
- The filter is learnable via backpropagation: no more handcrafting of feature detectors!
- A 28 px x 28 px input image gives a 26 «px» x 26 «px» feature map.

Effect of the convolution
- The feature map is smaller than the input image (28 px x 28 px -> 26 «px» x 26 «px»).

Zero-padding
- Zero-padding all around the image (F=3, S=1, P=1)
- The input image size is preserved: the 28 px image is padded to 30 px x 30 px, giving a 28 «px» x 28 «px» feature map.

Multiple input channels
- Most images are RGB, not grayscale.
- A different filter is independently applied to every channel.
- Co-located features are summed.

Complexity in a neural network layer

Complexity in deep neural networks
- A very blurry and foggy concept…
- Does it refer to the number of parameters of a model? -> memory required
- Does it refer to the number of operations executed to obtain the output of the model (forward propagation)? -> hardware capability (parallel execution) to guarantee faster computation; number of operations needed for the forward propagation (FLOPs, MACs…)
- In reality, it is a sort of compromise!

Parameters of the conv layer
- Number of parameters per FC layer (e.g., n_in = 28^2): n_out * (1 + n_in), i.e. m * (1 + n). It depends on the size of the input feature maps.
- Number of parameters per convolutional layer (e.g., F=5): m * (1 + F^2) * #ch. It depends on the number of input feature maps (channels).

Parameters in LeNet-300
- One of LeNet300's problems was the complexity of the 1st FC layer.
Fully Connected
Layer    Type     Complexity [prms]
1        FC-300   300 * (28*28) ≈ 235k
2        FC-100   100 * 300 = 30k
3        FC-10    10 * 100 = 1k

MACs in LeNet-300
Fully Connected
Layer    Type     Complexity [MACs]
1        FC-300   300 * (28*28) ≈ 235k
2        FC-100   100 * 300 = 30k
3        FC-10    10 * 100 = 1k

Moving to conv layers…
- One convolutional layer with 6 neurons extracts, e.g., 6 feature maps of size 28x28.
- The feature maps are vectorized and concatenated.
- One or more FC layers classify the feature maps.
- The network output (SoftMax over y0, ..., y9) is a class probability distribution (C=10).
- The 1st layer's #params drop from ≈235k to 156!
- But the total #params soars from ~266k to ~472k!
         Fully Connected                   Convolutional
Layer    Type     Complexity [prms]        Type     Complexity [prms]
1        FC-300   300 * (28*28) ≈ 235k     Conv-6   6 * (5x5 + 1) * 1 = 156
2        FC-100   100 * 300 = 30k          FC-100   100 * (6 * (28x28)) ≈ 470k
3        FC-10    10 * 100 = 1k            FC-10    10 * 100 = 1k
Tot               ~266k                             ~472k
- The 6 conv neurons (6 filters) produce 6 x (28x28) ≈ 4700 features.
- The total #MACs soars as well, from ~266k to ~557k!
         Fully Connected                   Convolutional
Layer    Type     Complexity [MACs]        Type     Complexity [MACs]
1        FC-300   300 * (28*28) ≈ 235k     Conv-6   6 * (5x5) * (24*24*1) ≈ 86k
2        FC-100   100 * 300 = 30k          FC-100   100 * (6 * (28x28)) ≈ 470k
3        FC-10    10 * 100 = 1k            FC-10    10 * 100 = 1k
Tot               ~266k                             ~557k

Moving to conv layers…
- Feature maps can be seen as filtered images.
- Feature maps can be subsampled, just as images are.

The (max)-pooling layers
- Pick the maximum value in each, e.g., 2x2 non-overlapping area.
- This spatially subsamples the feature map (input feature map -> max pooling -> output feature map).
- Why not just average?
- Non-maxima suppression (i.e., we identify very sharp features)

Conv+pool is a common pattern!
- Each Conv layer is followed by a MaxPool layer (28x28 image -> 5x5 filters -> 28x28 feature map -> max pooling -> 14x14 feature map).
- This increases the receptive field of each feature.
- With F=5, S=1, P=2: from 5x5 to 6x6 (a 1x1 feature after max pooling comes from a 2x2 input field, which has a 6x6 receptive field).

Conv with stride is another pattern!
- Each Conv has stride S=2, no MaxPooling (28x28 image -> 5x5 filters, S=2 -> 14x14 feature map).
- No non-maxima suppression
- Standard in recent architectures

Complexity with pooling
- Complexity goes from ~266k to ~118k params thanks to max pooling.
         Fully Connected                   Convolutional
Layer    Type     Complexity [prms]        Type     Complexity [prms]
1        FC-300   300 * (28*28) ≈ 235k     Conv-6   6 * (5x5 + 1) * 1 = 156
2        FC-100   100 * 300 = 30k          FC-100   100 * (6 * (14x14)) ≈ 117k
3        FC-10    10 * 100 = 1k            FC-10    10 * 100 = 1k
Tot               ~266k                             ~118k
- 6 conv neurons (6 filters), followed by max pooling

Training more feature maps
- Output of the first conv-and-pool layer (6 conv neurons with 6 5x5 filters on the 28 px x 28 px input, 6 feature maps, max pool): a single 14x14 «image» with six «channels».
- This is the input to the second conv-and-pool layer (16 conv neurons with 16 5x5 filters, 16 feature maps, max pool).
- Its output is a 7x7 «image» with 16 «channels».

More conv layers
- Total complexity drops from ~266k to ~82k params.
         Fully Connected                   Convolutional
Layer    Type     Complexity [prms]        Type     Complexity [prms]
1        FC-300   300 * (28*28) ≈ 235k     Conv-6   6 * (5x5 + 1) * 1 = 156
2        FC-100   100 * 300 = 30k          Conv-16  16 * (5x5 + 1) * 6 = 2496*
3                                          FC-100   100 * ((16 * 7x7) + 1) ≈ 78.5k
4        FC-10    10 * 100 = 1k            FC-10    10 * 100 = 1k
Tot               ~266k                             ~82k
- 6 conv neurons (6 filters), then 16 conv neurons (16x6 filters), each followed by max pooling

Performance evaluation
- Experiments on the MNIST 32x32 dataset
Network            Num. layers                                                  Error [%]
Fully connected    1 FC output layer (10 U)                                     12.0
                   1 hidden FC (300 U), 1 output FC (10 U)                      4.7
                   2 hidden FCs (300 + 100 U), 1 output FC (10 U) (LeNet300)    3.05
Convolutional      2 Conv (3 F), 1 output FC (LeNet1)                           1.7
                   2 conv (6+16 F), 3 FC layers (LeNet300)                      0.95
- Better performance for lower complexity

To wrap up…
- The convolutional LeNet300 performs better than its fully connected counterpart even though:
  - it has fewer parameters, thanks to the convolutional layers;
  - its filters (5x5) are not big enough to capture an entire digit (at least 20x20 pixels in a 32x32 image).
- Let us look at the receptive field.
- The receptive field of a feature is its back-projection through the pooling and convolutional layers onto the input image.

Architectures and datasets

LeNet-5
- Stacked sigmoid convolutional layers for feature extraction
- Repeated convolve-and-pool pattern
- Multiple FC layers for classification
Y. LeCun, L. Bottou, Y. Bengio and P. Haffner: Gradient-Based Learning Applied to Document Recognition, Proceedings of the IEEE, November 1998 (PDF available online)

LeNet-5 robustness demo (~1998)!!!
- Shift invariance
- Noise robustness
- Multiple characters
- Overlaps
Y. LeCun, LeNet5 demos, http://yann.lecun.com/exdb/lenet/index.html

LeNet-5: feature map visualization
http://scs.ryerson.ca/~aharley/vis/conv/flat.html

LeNet-1 (1993)!!!
This is a demo of LeNet 1, the first convolutional network that could recognize handwritten digits with good speed and accuracy […] developed between 1988 and 1993 […] at Bell Labs in Holmdel, NJ. This "real time" demo shows ran on a DSP card sitting in a 486 PC with a video camera and frame grabber card. The DSP card had a […] 32-bit floating-point DSP and could reach an amazing 12.5 million multiply-accumulate operations per second. Shortly after […], we started working with a development group and a product group at NCR (then a subsidiary of AT&T). NCR soon deployed ATM machines that could read the numerical amounts on checks, initially in Europe and then in the US.
At some point in the late 90's these machines were processing 10 to 20% of all the checks in the US.
Y. LeCun, “Convolutional Network Demo from 1993” (https://www.youtube.com/watch?v=FwFduRA_L6Q)

CIFAR-10
- CIFAR10 dataset (2009)
- 50k train images, 10k test images, 10 classes, 32x32
https://www.cs.toronto.edu/~kriz/cifar.html

CIFAR-10: more difficult than MNIST
[Figure: loss and accuracy curves for MNIST and CIFAR-10; annotation: -50%]

The ImageNet challenge (ILSVRC12)
- 1000 object classes (categories)
- 1.2M images in the training set
- 100k images in the test set
- Images of various shapes: typically scaled to 224x224
- Images here are RGB

Before the deep learning era
- 2010: SIFT descriptors + SVM (NEC)
- 2011: SIFT descriptors, Fisher Vectors, SVM (XRCE)

AlexNet (2012)
- One of the first «deep» convolutional networks
- 5 convolutional layers, 3 fully connected layers
- 62.3 M parameters (the conv layers hold 6% of them but take 95% of the time)
A. Krizhevsky, I. Sutskever, G. E. Hinton. "Imagenet classification with deep convolutional neural networks." In Advances in neural information processing systems, pp. 1097-1105. 2012.

AlexNet (2012): training details
- Trained over two GTX 580 GPUs (3GB memory each)
- Convolutions split across the two GPUs
- Fully connected layers distributed across the two GPUs
- Training took 5~6 days (90 epochs)
(Krizhevsky et al., 2012)

AlexNet vs LeNet5
- Deeper than LeNet5 (5 Conv vs. 3)
- ReLU activations in place of sigmoids
- Dropout before FC layers (+ L2 regularization)
- Batch size of 128 images
- Data augmentation
- LR divided by 10 when the validation error settles

AlexNet (2012) on ImageNet
- 2012 ILSVRC winner with top-5 error rate 16.4% (vs. 26.2%)
- Problem: very large 11x11 filters in the first conv layer

Going deeper: VGG architecture
- Up to 19 weight layers (16 convolutional + 3 fully connected)
- Key idea: 3x3 filters everywhere
K. Simonyan, A. Zisserman. "Very deep convolutional networks for large-scale image recognition." arXiv preprint arXiv:1409.1556 (2014).

Some configurations for VGG
(Simonyan and Zisserman, 2014)

VGG on ImageNet
- 2014 ILSVRC runner-up with top-5 error rate 7.3%

ResNet (2015)
- 2015 ILSVRC winner with top-5 error rate 3.57%
- 18, 34, 50, 101, 152 layers
- (Almost) pool-less (2 px convolution stride)
He, Kaiming, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. "Deep residual learning for image recognition." In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770-778. 2016.

The vanishing gradient problem (recap)
- More evident in sigmoid-activated models
- Intuitively: the more layers we add to the model, the more products we have when computing the gradient (remember the chain rule):
    \frac{\partial L}{\partial w_{l,i}} = \frac{\partial L}{\partial X_L} \frac{\partial X_L}{\partial X_{L-1}} \cdots \frac{\partial X_l}{\partial w_{l,i}}
  ◦ If these values are > 1 in magnitude, we have gradient explosion.
  ◦ If these values are < 1 in magnitude, we have gradient vanishing.
- Diagram: Input -> Layer 1 -> Layer 2 -> Layer 3 -> … -> Layer L-1 -> Layer L -> Output -> loss L

Skip connections
- ResNet relies on skip/shortcut connections.
- Gradient backpropagation becomes easier.
Understanding and Implementing Architectures of ResNet, https://medium.com/@14prakash/understanding-and-implementing-architectures-of-resnet-and-resnext-for-state-of-the-art-image-cf51669e1624

Skip connections effectiveness
(He et al., 2016)

Deeper gets better performance!

This is all for convolutional neural networks for classification (for the moment!)
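
The sliding-window computation from the "Practical example" and "Zero-padding" slides can be sketched in plain Python. This is only an illustration under stated assumptions: the 5x5 crop and 3x3 filter values are not in the transcript and are assumed here to be those of the cited Stanford tutorial figure, because they reproduce the quoted feature values (4, 3, 4, 2, ..., 4). The helper names conv_output_size and conv2d are hypothetical, and the size formula (n - F + 2P) / S + 1 is the standard one rather than something stated on the slides.

```python
def conv_output_size(n, f, stride=1, padding=0):
    """Side of the feature map for an n x n input and an f x f filter."""
    return (n - f + 2 * padding) // stride + 1


def conv2d(image, kernel, stride=1, padding=0):
    """Slide a square kernel over a square, zero-padded image (no kernel flip)."""
    n, f = len(image), len(kernel)
    size = n + 2 * padding
    # Build the zero-padded copy of the image.
    padded = [[0] * size for _ in range(size)]
    for r in range(n):
        for c in range(n):
            padded[r + padding][c + padding] = image[r][c]
    out = conv_output_size(n, f, stride, padding)
    # One weighted sum per filter position.
    return [
        [
            sum(
                padded[i * stride + a][j * stride + b] * kernel[a][b]
                for a in range(f)
                for b in range(f)
            )
            for j in range(out)
        ]
        for i in range(out)
    ]


# Assumed example values (not present in the transcript).
IMAGE = [
    [1, 1, 1, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 0],
    [0, 1, 1, 0, 0],
]
KERNEL = [
    [1, 0, 1],
    [0, 1, 0],
    [1, 0, 1],
]

feature_map = conv2d(IMAGE, KERNEL)
for row in feature_map:
    print(row)
```

With these assumed values the feature map rows are 4 3 4, 2 4 3 and 2 3 4. The same size helper reproduces the sizes quoted in the slides: 28 -> 26 for F=3 without padding, 28 -> 28 for F=3 with P=1, and 28 -> 14 for F=5 with S=2 (taking P=2, which is an assumption).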

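The parameter counts in the complexity tables can be checked with a few lines of Python. This is a sketch that follows the slides' own conventions, which are assumptions worth naming: FC layers are counted without bias unless the table shows a "+1", and conv layers use m * (1 + F^2) * #ch, which counts one bias per input channel (frameworks usually count one bias per filter, giving slightly smaller numbers). The function and variable names are hypothetical.

```python
def fc_params(n_in, n_out, bias=False):
    """Fully connected layer: n_out * n_in weights, plus n_out biases if requested."""
    return n_out * (n_in + (1 if bias else 0))


def conv_params(n_filters, f, in_channels):
    """Convolutional layer in the slide convention: m * (1 + F^2) * #ch."""
    return n_filters * (1 + f * f) * in_channels


# Fully connected LeNet-300 on 28x28 inputs.
fully_connected = fc_params(28 * 28, 300) + fc_params(300, 100) + fc_params(100, 10)

# One conv layer (6 filters, 5x5); the 28x28 feature maps go straight to FC-100.
conv_no_pool = conv_params(6, 5, 1) + fc_params(6 * 28 * 28, 100) + fc_params(100, 10)

# Same network with 2x2 max pooling: the feature maps shrink to 14x14.
conv_pool = conv_params(6, 5, 1) + fc_params(6 * 14 * 14, 100) + fc_params(100, 10)

# Two conv+pool stages (6 then 16 filters): 16 feature maps of 7x7 reach FC-100.
two_conv_pool = (
    conv_params(6, 5, 1)
    + conv_params(16, 5, 6)
    + fc_params(16 * 7 * 7, 100, bias=True)
    + fc_params(100, 10)
)

totals = {
    "fully connected": fully_connected,
    "conv, no pooling": conv_no_pool,
    "conv + pooling": conv_pool,
    "2 x (conv + pooling)": two_conv_pool,
}
for name, total in totals.items():
    print(f"{name:>22}: {total:,} params")
```

Under these conventions the totals are 266,200, 471,556, 118,756 and 82,152 parameters. The jump in the second case comes entirely from the FC-100 layer, which sees 6 * 28 * 28 = 4,704 inputs (470,400 weights); pooling and the second conv stage are what bring the total back down.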