Neural Networks - Lecture 6: Normalization Techniques (PDF)

Document Details

Uploaded by VictoriousGlockenspiel

2022

Alexandru Sorici

Tags

neural networks, normalization, machine learning, deep learning

Summary

These lecture notes cover various normalization techniques in neural networks, focusing on how they improve model training. Techniques like Batch Normalization, Layer Normalization, and Instance Normalization are explored, along with their advantages and disadvantages.

Full Transcript

Neural Networks - Lecture 6: Normalization Techniques

Outline
● Batch Normalization
● Layer Normalization
● Group Normalization
● Instance Normalization
  – Batch-Instance Normalization
  – Adaptive Instance Normalization
● Normalization from the Conditioning Perspective
● Weight Normalization, Weight Standardization

Normalization – General Overview
Uses of normalization
● Normalization reduces extreme values in the normalized outputs
● (Partially) addresses the covariate-shift problem (the change in layer input distributions as training of a network progresses)
● Makes the loss surface smoother => better-behaved gradients, making higher learning rates possible => lower convergence time
● => Normalization is required to train models more effectively (lower training time, better performance, higher stability)

Normalization – General Overview
Notations (for possible input volumes)
● μ – mean
● σ – standard deviation
● N – batch size (number of examples in the batch)
● C – number of channels
● H, W – height and width
Img source: In-layer normalization techniques for training very deep neural networks, AI Summer, 2020

Batch Normalization - again :-)

Batch Normalization
● γ and β are learnable parameters
● the mean and variance of layer activations are determined by two parameters (instead of by the complex interaction between layers during training)
Img source: In-layer normalization techniques for training very deep neural networks, AI Summer, 2020

BatchNorm – Advantages and Downsides
Advantages
● Accelerates training
  – Reduces the dependence of gradients on the scale of parameters and their initial values => higher learning rates are possible
● Every mini-batch has slightly different statistics => acts as a form of regularization → may alleviate the need to use Dropout in some cases
● Makes gradients more predictive, makes the loss surface smoother [Santurkar et al., 2018]
● BatchNorm avoids "getting stuck" when using saturating nonlinearities (e.g. tanh, sigmoid)

BatchNorm – Advantages and Downsides
Disadvantages
● Small batch sizes lead to inaccurate estimates => a problem for using BN in tasks such as video prediction, segmentation, or medical image processing, where the batch size is usually low (due to GPU memory constraints)
● Problems when the batch size is varying
  – Training vs. inference (e.g. prediction for a single instance). Options:
    ● Keep the training stats
    ● Compute the mean and std. over the entire test set
    ● Note: γ and β stay the same, but μ and σ change!
  – Pre-training vs. fine-tuning
  – Backbone vs. head network (where the backbone is pre-trained)
  – Cannot be easily applied in RNNs
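As a concrete illustration of the mechanics above, here is a minimal NumPy sketch of training-time Batch Normalization for an N×C×H×W input volume. This is not code from the lecture; the function name batch_norm_2d is illustrative.

```python
# Minimal NumPy sketch of training-time BatchNorm for an NCHW input volume
# (illustrative only, not the lecture's reference implementation).
import numpy as np

def batch_norm_2d(x, gamma, beta, eps=1e-5):
    """x: (N, C, H, W); gamma, beta: learnable per-channel parameters of shape (C,)."""
    # Statistics are computed over the batch and spatial dims, separately per channel.
    mu = x.mean(axis=(0, 2, 3), keepdims=True)    # shape (1, C, 1, 1)
    var = x.var(axis=(0, 2, 3), keepdims=True)    # shape (1, C, 1, 1)
    x_hat = (x - mu) / np.sqrt(var + eps)         # normalized activations
    return gamma.reshape(1, -1, 1, 1) * x_hat + beta.reshape(1, -1, 1, 1)

# At inference time, mu and var would be replaced by running averages kept during
# training (gamma and beta stay the same, but mu and sigma change).
x = np.random.randn(8, 3, 4, 4)
y = batch_norm_2d(x, gamma=np.ones(3), beta=np.zeros(3))
```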
BatchNorm – Use in RNNs
● Batch Normalized Recurrent Neural Networks [Laurent et al., 2015]
  – Apply BN only to the input-to-hidden transition: h_t = φ(W_h h_{t−1} + BN(W_x x_t))
  – In multi-layer RNNs this means BN is applied only between RNN layers, not also along the hidden dimension (along the timesteps)
  – Normalization of x can be done frame-wise (in cases where prediction occurs one step at a time) or sequence-wise (in applications where one processes the whole sequence at a time – e.g. speech recognition)
  – Frame-wise: normalize each time step across the batch; sequence-wise: normalize across timesteps and batch

BatchNorm – Use in RNNs
● Recurrent batch normalization [Cooijmans et al., 2016]
  – Apply BN to the hidden-to-hidden transition as well
  – Requires separate statistics (μ and σ) for each timestep (the statistics of activations differ significantly in the initial time step transitions)
  – At test time, use μ and σ estimates obtained over minibatch averages from the training set
  – The γ parameter needs to be initialized to a small value (e.g. 0.1)

Layer Normalization

Layer Normalization
● Layer Normalization computes the statistics (μ and σ) across channels and spatial dimensions
● Statistics are independent of the batch
  – makes it suitable for normalizing the hidden state outputs of an RNN (the use case for which it was first introduced)
  – usage took off when Transformers became popular (attention models next week)
● Consider the case of a batch of N sequences of length K:
  μ_n = (1/K) Σ_{k=1..K} x_{nk}
  σ_n² = (1/K) Σ_{k=1..K} (x_{nk} − μ_n)²
  x̂_{nk} = (x_{nk} − μ_n) / √(σ_n² + ε),  x̂_n ∈ ℝ^K
  LN_{γ,β}(x_n) = γ ⊙ x̂_n + β,  x_n ∈ ℝ^K

Layer Normalization in RNNs
● In an RNN, h_t is based on the summed inputs computed from the current input x_t and the previous hidden state h_{t−1}: a_t = W_{hh} h_{t−1} + W_{hx} x_t
● Layer Normalization rescales and re-centers these activations
● For the LSTM:
  – The normalization terms make the average magnitude of the summed input activations a_t invariant to a rescaling of the inputs

Layer Normalization
● Generalization to the 4D tensor case (i.e. application to input volumes for images)

Instance Normalization

Instance Normalization
● Instance Normalization computes the statistics (μ and σ) only across the spatial dimensions of each input
● Statistics are independent of the batch and are different for each channel
  – makes it suitable for Style Transfer: each individual sample can be normalized to a target style (modeled by γ and β)

Adaptive Instance Normalization
● What if the mean (β) and variance (γ) governing parameters come from an external source (the style image)?
● Adaptive Instance Norm takes an input image x (content) and a style image y and performs channel-wise alignment of the mean and variance of x to match the mean and variance of y
Img source: Arbitrary Style Transfer in Real-time with Adaptive Instance Normalization, Huang & Belongie, 2017
Where:
● t is the AdaIN output
● Φ_i are layers from the VGG encoder: relu1_1, relu2_1, relu3_1, relu4_1
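To make the channel-wise alignment concrete, a minimal NumPy sketch of AdaIN follows. This is illustrative only; the function name adain is not taken from the paper's code.

```python
# Minimal NumPy sketch of Adaptive Instance Normalization (AdaIN):
# align the per-channel mean/variance of the content features x to those
# of the style features y (illustrative only).
import numpy as np

def adain(x, y, eps=1e-5):
    """x, y: feature maps of shape (N, C, H, W); returns re-stylized content features."""
    mu_x = x.mean(axis=(2, 3), keepdims=True)    # per-sample, per-channel statistics
    std_x = x.std(axis=(2, 3), keepdims=True)
    mu_y = y.mean(axis=(2, 3), keepdims=True)
    std_y = y.std(axis=(2, 3), keepdims=True)
    # Normalize the content features, then scale/shift with the style statistics.
    return std_y * (x - mu_x) / (std_x + eps) + mu_y

content = np.random.randn(1, 4, 8, 8)
style = np.random.randn(1, 4, 8, 8)
t = adain(content, style)
```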
Batch-Instance Normalization
● Instance Normalization performs a form of style normalization → by controlling the feature statistics (mean and variance) one can generate different styles
  – Useful for style transfer and GANs
  – However, it can be a problem in classification tasks where the style information provides the discriminating factor (e.g. the brightness of an image for weather classification)
● Batch-Instance Normalization [Nam & Kim, 2019] → normalize the styles adaptively to the task and selectively for individual feature maps
  – Learn to control how much of the style information to propagate through each channel using a learnable gating parameter

Batch-Instance Normalization
● Batch-Instance Normalization → normalize the styles adaptively to the task and selectively for individual feature maps
  – Learn to control how much of the style information to propagate through each channel using a learnable gating parameter
● ρ ∈ [0,1]^C and γ, β ∈ ℝ^C are all learnable
● Yields ~1% improvement on CIFAR-10/CIFAR-100/ImageNet image classification and on Domain Adaptation tasks compared to BN alone

Group Normalization

Group Normalization
● Group Normalization can be used when batch sizes are small (e.g. for object detection and segmentation tasks)
  – For groups = C → Instance Norm
  – For groups = 1 → Layer Norm
  – Stable across a greater range of batch sizes

Group Normalization
Img source: Group Normalization, Wu and He, ECCV 2018
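A minimal NumPy sketch of Group Normalization, illustrating how the group count interpolates between Instance Norm and Layer Norm. The function name group_norm is illustrative, not the paper's code.

```python
# Minimal NumPy sketch of Group Normalization for an NCHW input
# (illustrative only).
import numpy as np

def group_norm(x, gamma, beta, groups, eps=1e-5):
    """x: (N, C, H, W); gamma, beta: (C,); groups must divide C."""
    n, c, h, w = x.shape
    xg = x.reshape(n, groups, c // groups, h, w)
    # Statistics per sample and per group, over the channels in the group and space.
    mu = xg.mean(axis=(2, 3, 4), keepdims=True)
    var = xg.var(axis=(2, 3, 4), keepdims=True)
    x_hat = ((xg - mu) / np.sqrt(var + eps)).reshape(n, c, h, w)
    return gamma.reshape(1, -1, 1, 1) * x_hat + beta.reshape(1, -1, 1, 1)

# groups = C recovers Instance Norm, groups = 1 recovers Layer Norm.
x = np.random.randn(2, 8, 5, 5)
y = group_norm(x, np.ones(8), np.zeros(8), groups=4)
```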
Conditioning from the normalization perspective

A discussion on conditioning
● Often used in the context of generative networks (generate content informed by / conditioned on external attributes)
● Can also be viewed from the perspective of task descriptions (perform an inference/classification in a given context: e.g. analyze an image in the context of a question about the image)

Basic methods of conditioning – Concatenation
● Q: Where to add the conditioning representation? (just at the input?, at the end?, in between?)
  – The operation is cheap enough → avoid making assumptions and add it to all layers of the network

Basic methods of conditioning – Conditional Biasing
● Add a bias to the hidden layers of a network architecture based on the conditioning representation

Basic methods of conditioning
● Concatenation and Conditional Biasing can be shown to have equivalent processing (as long as no non-linearities are involved)

Conditioning – transition to normalization – conditional scaling
● Scaling hidden layers based on the conditioning representation
  – Sigmoidal gating is a special case where the gates select which features to pass on as a function of the conditioning

Conditioning – transition to normalization
● Multiplicative conditioning → useful when the interest is to learn relationships between inputs (match based)
● Additive conditioning → useful when the interest is to perform feature aggregation or feature detection (whether a feature is present or not)
● How to have the best of both worlds and let "the network decide"?
  – Conditional Affine Transformation: y = m∗x + b
● Important: all these transforms operate feature-wise

Conditioning – transition to normalization – FiLM model
● FiLM = Feature-wise Linear Modulation (Perez et al., FiLM: Visual Reasoning with a General Conditioning Layer, AAAI 2018)
  FiLM(x) = γ(z) ⊙ x + β(z)
  – x is the input, z the conditioning representation

Conditioning – transition to normalization – FiLM model

Conditioning – FiLM model
● The FiLM model can be used in a variety of problem settings:
  – Visual QA
  – Style Transfer (alongside AdaIN)
  – Image Recognition (e.g. Highway networks, Squeeze-and-Excitation blocks – forms of self-conditioning)
  – NLP (LSTM and GRU models are examples of sigmoidal gating)
  – Domain Adaptation / Few-shot learning
  – Speech Recognition
  – Generative Modeling
  – Reinforcement Learning

Conditioning – FiLM model – Style Transfer Example
● The FiLM generator models each style as a separate set of instance normalization parameters

Conditioning – FiLM model – Adaptive Instance Normalization
● AdaIN → can be seen as an instance of FiLM-ing where the main network is used both as the FiLM generator and as the FiLMed network
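A minimal PyTorch-style sketch of a FiLM layer, assuming a small linear generator that maps the conditioning vector z to per-feature (γ, β). The class name FiLM and its attributes are illustrative, not the authors' implementation.

```python
# Minimal sketch of a FiLM layer: the conditioning representation z produces a
# per-feature scale gamma(z) and shift beta(z) applied to the input features x
# (illustrative only).
import torch
import torch.nn as nn

class FiLM(nn.Module):
    def __init__(self, z_dim, num_features):
        super().__init__()
        # A small generator maps the conditioning vector to (gamma, beta).
        self.to_gamma_beta = nn.Linear(z_dim, 2 * num_features)

    def forward(self, x, z):
        """x: (N, C, H, W) feature maps; z: (N, z_dim) conditioning vectors."""
        gamma, beta = self.to_gamma_beta(z).chunk(2, dim=1)
        gamma = gamma.unsqueeze(-1).unsqueeze(-1)   # (N, C, 1, 1)
        beta = beta.unsqueeze(-1).unsqueeze(-1)
        return gamma * x + beta                     # FiLM(x) = gamma(z) * x + beta(z)

x = torch.randn(4, 16, 8, 8)
z = torch.randn(4, 32)
y = FiLM(z_dim=32, num_features=16)(x, z)
```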
Conditioning – FiLM model – Spatially Adaptive Denormalization (SPADE)
● AdaIN enables a form of conditioning by "inserting" a target style into the content of another image – using first-order statistics (γ and β) that come from the target style
● This normalizes the input across the spatial dimensions
● However, what happens when the conditioning is made using a segmentation map (which has many regions of identical pixel values => regions where normalization leads to losing the semantic information)?

SPADE – Idea
● Similar to BatchNorm, normalize with the channel-wise mean and standard deviation
● However, rescaling is not performed using single scalars (γ and β)
● Instead, γ and β are 3D tensors computed using convolutions on the segmentation mask

SPADE – Generator
● The learned SPADE modulation parameters encode information about the semantic label layout
● A random input vector (from a multivariate Gaussian distribution) is given as input to the generator
● The generator network is based on a modified ResNet architecture with upsampling layers; after each upsampling – a SPADE Res Block
Source: GauGAN: semantic image synthesis with spatially adaptive normalization, Park et al., CVPR 2019

Weight Normalization and Weight Standardization

Weight Standardization
● Weight Standardization considers the smoothing effect on the weights of a convolution kernel instead of on the output activations
● Objective: normalize the gradients during back-propagation
  – Weights W are scaled as Ŵ = WS(W) and the loss is optimized with respect to W
● I = C_in × K, where K is the kernel size
Source: Micro-Batch Training with Batch-Channel Normalization and Weight Standardization [Qiao et al., 2019]

Weight Standardization - Results
● Weight Standardization combined with Group Normalization improves results on ImageNet (classification) and COCO (object detection) for micro-batch training
Source: Micro-Batch Training with Batch-Channel Normalization and Weight Standardization [Qiao et al., 2019]

Weight Normalization
● Simple weight normalization (more rarely used) scales the magnitude of the weights to a hyperparameter g (gain):
  w = (g / ‖v‖) · v
● Separation of the magnitude of the weight (g) from its direction (v), which is trainable

Spectral Normalization
● Normalize the weight matrix W by its spectral norm:
  Ŵ = W / ‖W‖₂
  where W ∈ ℝ^{C_out × (C_in·W·H)} is a 2D representation of W ∈ ℝ^{C_out × C_in × W × H}
● ‖W‖₂ = max_h ‖Wh‖₂ / ‖h‖₂ = max_{‖h‖₂=1} ‖Wh‖₂ = σ₁(W), the largest singular value of W
● Used in the training of Generative Adversarial Networks (GANs) – see more in the ATAI course
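A minimal PyTorch sketch of spectral normalization via power iteration: estimate the largest singular value σ₁(W) and divide the weight by it. This is illustrative only; in practice torch.nn.utils.spectral_norm provides this as a ready-made hook.

```python
# Minimal sketch of spectral normalization via power iteration
# (illustrative only).
import torch

def spectral_normalize(w, n_iters=5, eps=1e-12):
    """w: 2D reshaped weight of shape (C_out, C_in*W*H); returns w / sigma_1(w)."""
    u = torch.randn(w.shape[0])
    for _ in range(n_iters):                      # power iteration
        v = torch.nn.functional.normalize(w.t() @ u, dim=0, eps=eps)
        u = torch.nn.functional.normalize(w @ v, dim=0, eps=eps)
    sigma1 = torch.dot(u, w @ v)                  # approximates the spectral norm
    return w / sigma1

w = torch.randn(64, 3 * 3 * 3)
w_hat = spectral_normalize(w)
```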

References
● [Laurent et al., 2015] Laurent, C., Pereyra, G., Brakel, P., Zhang, Y., & Bengio, Y. (2016). Batch normalized recurrent neural networks. In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 2657-2661). IEEE.
● [Cooijmans et al., 2016] Cooijmans, T., Ballas, N., Laurent, C., Gülçehre, Ç., & Courville, A. (2016). Recurrent batch normalization. arXiv preprint arXiv:1603.09025.
● [Santurkar et al., 2018] Santurkar, S., Tsipras, D., Ilyas, A., & Mądry, A. (2018). How does batch normalization help optimization? In Proceedings of the 32nd International Conference on Neural Information Processing Systems (pp. 2488-2498).
● [Nam & Kim, 2019] Nam, H., & Kim, H. E. (2018). Batch-instance normalization for adaptively style-invariant neural networks. arXiv preprint arXiv:1805.07925.
● [Wu & He, 2018] Wu, Y., & He, K. (2018). Group normalization. In Proceedings of the European Conference on Computer Vision (ECCV) (pp. 3-19).
● [Park et al., 2019] Park, T., Liu, M. Y., Wang, T. C., & Zhu, J. Y. (2019). Semantic image synthesis with spatially-adaptive normalization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 2337-2346).
● [Qiao et al., 2019] Qiao, S., Wang, H., Liu, C., Shen, W., & Yuille, A. (2019). Micro-batch training with batch-channel normalization and weight standardization. arXiv preprint arXiv:1903.10520.
● Pictures for "Normalization from the Conditioning Perspective": https://distill.pub/2018/feature-wise-transformations/