Neural Networks - Lecture 6: Normalization Techniques (PDF)

Document Details

Uploaded by VictoriousGlockenspiel

2022

Alexandru Sorici

Tags

neural networks, normalization, machine learning, deep learning

Summary

These lecture notes cover various normalization techniques in neural networks, focusing on how they improve model training. Techniques like Batch Normalization, Layer Normalization, and Instance Normalization are explored, along with their advantages and disadvantages.

Full Transcript

Neural Networks - Lecture 6: Normalization Techniques

Outline
● Batch Normalization
● Layer Normalization
● Group Normalization
● Instance Normalization
  – Batch-Instance Normalization
  – Adaptive Instance Normalization
● Normalization from the Conditioning Perspective
● Weight Normalization, Weight Standardization

Normalization – General Overview
Uses of normalization
● Normalization reduces extreme values in the normalized outputs
● (Partially) addresses the covariate-shift problem (the change in layer input distributions as training of a network progresses)
● Makes the loss surface smoother => better-behaved gradients, making higher learning rates possible => lower convergence time
● => Normalization is required to train models more effectively (lower training time, better performance, higher stability)

Normalization – General Overview
Notations (for possible input volumes)
● μ – mean
● σ – standard deviation
● N – batch size (number of examples in the batch)
● C – number of channels
● H, W – height and width
Img source: In-layer normalization techniques for training very deep neural networks, AI Summer, 2020

Batch Normalization - again :-)

Batch Normalization
● γ and β are learnable parameters
● the mean and variance of layer activations are determined by two parameters (instead of by the complex interaction between layers during training)
Img source: In-layer normalization techniques for training very deep neural networks, AI Summer, 2020

BatchNorm – Advantages and Downsides
Advantages
● Accelerates training
  – Reduces the dependence of gradients on the scale of parameters and their initial values => higher learning rates are possible
● Every mini-batch has slightly different statistics => acts as a form of regularization → may alleviate the need to use Dropout in some cases
● Makes gradients more predictive, makes the loss surface smoother [Santurkar et al., 2018]
● BatchNorm avoids "getting stuck" when using saturating nonlinearities (e.g. tanh, sigmoid)

BatchNorm – Advantages and Downsides
Disadvantages
● Small batch sizes lead to inaccurate estimates => a problem for using BN in tasks such as video prediction, segmentation, or medical image processing, where the batch size is usually low (due to GPU memory constraints)
● Problems when the batch size is varying
  – Training vs. inference (e.g. prediction for a single instance). Options:
    ● Keep the training stats
    ● Compute the mean and std. over the entire test set
    ● Note: γ and β stay the same, but μ and σ change!
  – Pre-training vs. fine-tuning
  – Backbone vs. head network (where the backbone is pre-trained)
  – Cannot be easily applied in RNNs
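As a concrete illustration of the mechanics above, here is a minimal NumPy sketch of training-time Batch Normalization for an N×C×H×W input volume. This is not code from the lecture; the function name batch_norm_2d is illustrative.

```python
# Minimal NumPy sketch of training-time BatchNorm for an NCHW input volume
# (illustrative only, not the lecture's reference implementation).
import numpy as np

def batch_norm_2d(x, gamma, beta, eps=1e-5):
    """x: (N, C, H, W); gamma, beta: learnable per-channel parameters of shape (C,)."""
    # Statistics are computed over the batch and spatial dims, separately per channel.
    mu = x.mean(axis=(0, 2, 3), keepdims=True)    # shape (1, C, 1, 1)
    var = x.var(axis=(0, 2, 3), keepdims=True)    # shape (1, C, 1, 1)
    x_hat = (x - mu) / np.sqrt(var + eps)         # normalized activations
    return gamma.reshape(1, -1, 1, 1) * x_hat + beta.reshape(1, -1, 1, 1)

# At inference time, mu and var would be replaced by running averages kept during
# training (gamma and beta stay the same, but mu and sigma change).
x = np.random.randn(8, 3, 4, 4)
y = batch_norm_2d(x, gamma=np.ones(3), beta=np.zeros(3))
```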
BatchNorm – Use in RNNs
● Batch Normalized Recurrent Neural Networks [Laurent et al., 2015]
  – Apply BN only to the input-to-hidden transition: h_t = φ(W_h h_{t−1} + BN(W_x x_t))
  – In multi-layer RNNs this means BN is applied only between RNN layers, not also along the hidden dimension (along the timesteps)
  – Normalization of x can be done frame-wise (in cases where prediction occurs one step at a time) or sequence-wise (in applications where one processes the whole sequence at a time – e.g. speech recognition)
  – Frame-wise: normalize each time step across the batch; sequence-wise: normalize across timesteps and batch

BatchNorm – Use in RNNs
● Recurrent batch normalization [Cooijmans et al., 2016]
  – Apply BN to the hidden-to-hidden transition as well
  – Requires separate statistics (μ and σ) for each timestep (the statistics of activations differ significantly in the initial time step transitions)
  – At test time, use μ and σ estimates obtained over minibatch averages from the training set
  – The γ parameter needs to be initialized to a small value (e.g. 0.1)

Layer Normalization

Layer Normalization
● Layer Normalization computes the statistics (μ and σ) across channels and spatial dimensions
● Statistics are independent of the batch
  – makes it suitable for normalizing the hidden state outputs of an RNN (the use case for which it was first introduced)
  – usage took off when Transformers became popular (attention models next week)
● Consider the case of a batch of N sequences of length K:
  μ_n = (1/K) Σ_{k=1..K} x_{nk}
  σ_n² = (1/K) Σ_{k=1..K} (x_{nk} − μ_n)²
  x̂_{nk} = (x_{nk} − μ_n) / √(σ_n² + ε),  x̂_n ∈ ℝ^K
  LN_{γ,β}(x_n) = γ ⊙ x̂_n + β,  x_n ∈ ℝ^K

Layer Normalization in RNNs
● In an RNN, h_t is based on the summed inputs computed from the current input x_t and the previous hidden state h_{t−1}: a_t = W_{hh} h_{t−1} + W_{hx} x_t
● Layer Normalization rescales and re-centers these activations
● For the LSTM:
  – The normalization terms make the average magnitude of the summed input activations a_t invariant to a rescaling of the inputs

Layer Normalization
● Generalization to the 4D tensor case (i.e. application to input volumes for images)

Instance Normalization

Instance Normalization
● Instance Normalization computes the statistics (μ and σ) only across the spatial dimensions of each input
● Statistics are independent of the batch and are different for each channel
  – makes it suitable for Style Transfer: each individual sample can be normalized to a target style (modeled by γ and β)

Adaptive Instance Normalization
● What if the mean (β) and variance (γ) governing parameters come from an external source (the style image)?
● Adaptive Instance Norm takes an input image x (content) and a style image y and performs channel-wise alignment of the mean and variance of x to match the mean and variance of y
Img source: Arbitrary Style Transfer in Real-time with Adaptive Instance Normalization, Huang & Belongie, 2017
Where:
● t is the AdaIN output
● Φ_i are layers from the VGG encoder: relu1_1, relu2_1, relu3_1, relu4_1
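To make the channel-wise alignment concrete, a minimal NumPy sketch of AdaIN follows. This is illustrative only; the function name adain is not taken from the paper's code.

```python
# Minimal NumPy sketch of Adaptive Instance Normalization (AdaIN):
# align the per-channel mean/variance of the content features x to those
# of the style features y (illustrative only).
import numpy as np

def adain(x, y, eps=1e-5):
    """x, y: feature maps of shape (N, C, H, W); returns re-stylized content features."""
    mu_x = x.mean(axis=(2, 3), keepdims=True)    # per-sample, per-channel statistics
    std_x = x.std(axis=(2, 3), keepdims=True)
    mu_y = y.mean(axis=(2, 3), keepdims=True)
    std_y = y.std(axis=(2, 3), keepdims=True)
    # Normalize the content features, then scale/shift with the style statistics.
    return std_y * (x - mu_x) / (std_x + eps) + mu_y

content = np.random.randn(1, 4, 8, 8)
style = np.random.randn(1, 4, 8, 8)
t = adain(content, style)
```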
Batch-Instance Normalization
● Instance Normalization performs a form of style normalization → by controlling the feature statistics (mean and variance) one can generate different styles
  – Useful for style transfer and GANs
  – However, it can be a problem in classification tasks where the style information provides the discriminating factor (e.g. the brightness of an image for weather classification)
● Batch-Instance Normalization [Nam & Kim, 2019] → normalize the styles adaptively to the task and selectively for individual feature maps
  – Learn to control how much of the style information to propagate through each channel using a learnable gating parameter

Batch-Instance Normalization
● Batch-Instance Normalization → normalize the styles adaptively to the task and selectively for individual feature maps
  – Learn to control how much of the style information to propagate through each channel using a learnable gating parameter
● ρ ∈ [0,1]^C and γ, β ∈ ℝ^C are all learnable
● Yields ~1% improvement on CIFAR-10/CIFAR-100/ImageNet image classification and on Domain Adaptation tasks compared to BN alone

Group Normalization

Group Normalization
● Group Normalization can be used when batch sizes are small (e.g. for object detection and segmentation tasks)
  – For groups = C → Instance Norm
  – For groups = 1 → Layer Norm
  – Stable across a greater range of batch sizes

Group Normalization
Img source: Group Normalization, Wu and He, ECCV 2018
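A minimal NumPy sketch of Group Normalization, illustrating how the group count interpolates between Instance Norm and Layer Norm. The function name group_norm is illustrative, not the paper's code.

```python
# Minimal NumPy sketch of Group Normalization for an NCHW input
# (illustrative only).
import numpy as np

def group_norm(x, gamma, beta, groups, eps=1e-5):
    """x: (N, C, H, W); gamma, beta: (C,); groups must divide C."""
    n, c, h, w = x.shape
    xg = x.reshape(n, groups, c // groups, h, w)
    # Statistics per sample and per group, over the channels in the group and space.
    mu = xg.mean(axis=(2, 3, 4), keepdims=True)
    var = xg.var(axis=(2, 3, 4), keepdims=True)
    x_hat = ((xg - mu) / np.sqrt(var + eps)).reshape(n, c, h, w)
    return gamma.reshape(1, -1, 1, 1) * x_hat + beta.reshape(1, -1, 1, 1)

# groups = C recovers Instance Norm, groups = 1 recovers Layer Norm.
x = np.random.randn(2, 8, 5, 5)
y = group_norm(x, np.ones(8), np.zeros(8), groups=4)
```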
Conditioning from the normalization perspective

A discussion on conditioning
● Often used in the context of generative networks (generate content informed by / conditioned on external attributes)
● Can also be viewed from the perspective of task descriptions (perform an inference/classification in a given context: e.g. analyze an image in the context of a question about the image)

Basic methods of conditioning – Concatenation
● Q: Where to add the conditioning representation? (just at the input?, at the end?, in between?)
  – The operation is cheap enough → avoid making assumptions and add it to all layers of the network

Basic methods of conditioning – Conditional Biasing
● Add a bias to the hidden layers of a network architecture based on the conditioning representation

Basic methods of conditioning
● Concatenation and Conditional Biasing can be shown to have equivalent processing (as long as no non-linearities are involved)

Conditioning – transition to normalization – conditional scaling
● Scaling hidden layers based on the conditioning representation
  – Sigmoidal gating is a special case where the gates select which features to pass on as a function of the conditioning

Conditioning – transition to normalization
● Multiplicative conditioning → useful when the interest is to learn relationships between inputs (match based)
● Additive conditioning → useful when the interest is to perform feature aggregation or feature detection (whether a feature is present or not)
● How to have the best of both worlds and let "the network decide"?
  – Conditional Affine Transformation: y = m∗x + b
● Important: all these transforms operate feature-wise

Conditioning – transition to normalization – FiLM model
● FiLM = Feature-wise Linear Modulation (Perez et al., FiLM: Visual Reasoning with a General Conditioning Layer, AAAI 2018)
  FiLM(x) = γ(z) ⊙ x + β(z)
  – x is the input, z the conditioning representation

Conditioning – transition to normalization – FiLM model

Conditioning – FiLM model
● The FiLM model can be used in a variety of problem settings:
  – Visual QA
  – Style Transfer (alongside AdaIN)
  – Image Recognition (e.g. Highway networks, Squeeze-and-Excitation blocks – forms of self-conditioning)
  – NLP (LSTM and GRU models are examples of sigmoidal gating)
  – Domain Adaptation / Few-shot learning
  – Speech Recognition
  – Generative Modeling
  – Reinforcement Learning

Conditioning – FiLM model – Style Transfer Example
● The FiLM generator models each style as a separate set of instance normalization parameters

Conditioning – FiLM model – Adaptive Instance Normalization
● AdaIN → can be seen as an instance of FiLM-ing where the main network is used both as the FiLM generator and as the FiLMed network
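A minimal PyTorch-style sketch of a FiLM layer, assuming a small linear generator that maps the conditioning vector z to per-feature (γ, β). The class name FiLM and its attributes are illustrative, not the authors' implementation.

```python
# Minimal sketch of a FiLM layer: the conditioning representation z produces a
# per-feature scale gamma(z) and shift beta(z) applied to the input features x
# (illustrative only).
import torch
import torch.nn as nn

class FiLM(nn.Module):
    def __init__(self, z_dim, num_features):
        super().__init__()
        # A small generator maps the conditioning vector to (gamma, beta).
        self.to_gamma_beta = nn.Linear(z_dim, 2 * num_features)

    def forward(self, x, z):
        """x: (N, C, H, W) feature maps; z: (N, z_dim) conditioning vectors."""
        gamma, beta = self.to_gamma_beta(z).chunk(2, dim=1)
        gamma = gamma.unsqueeze(-1).unsqueeze(-1)   # (N, C, 1, 1)
        beta = beta.unsqueeze(-1).unsqueeze(-1)
        return gamma * x + beta                     # FiLM(x) = gamma(z) * x + beta(z)

x = torch.randn(4, 16, 8, 8)
z = torch.randn(4, 32)
y = FiLM(z_dim=32, num_features=16)(x, z)
```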
Conditioning – FiLM model – Spatially Adaptive Denormalization (SPADE)
● AdaIN enables a form of conditioning by "inserting" a target style into the content of another image – using first-order statistics (γ and β) that come from the target style
● This normalizes the input across the spatial dimensions
● However, what happens when the conditioning is made using a segmentation map (which has many regions of identical pixel values => regions where normalization leads to losing the semantic information)?

SPADE – Idea
● Similar to BatchNorm, normalize with the channel-wise mean and standard deviation
● However, rescaling is not performed using single scalars (γ and β)
● Instead, γ and β are 3D tensors computed using convolutions on the segmentation mask

SPADE – Generator
● The learned SPADE modulation parameters encode information about the semantic label layout
● A random input vector (from a multivariate Gaussian distribution) is given as input to the generator
● The generator network is based on a modified ResNet architecture with upsampling layers; after each upsampling – a SPADE Res Block
Source: GauGAN: semantic image synthesis with spatially adaptive normalization, Park et al., CVPR 2019

Weight Normalization and Weight Standardization

Weight Standardization
● Weight Standardization considers the smoothing effect on the weights of a convolution kernel instead of on the output activations
● Objective: normalize the gradients during back-propagation
  – Weights W are scaled as Ŵ = WS(W) and the loss is optimized with respect to W
● I = C_in × K, where K is the kernel size
Source: Micro-Batch Training with Batch-Channel Normalization and Weight Standardization [Qiao et al., 2019]

Weight Standardization - Results
● Weight Standardization combined with Group Normalization improves results on ImageNet (classification) and COCO (object detection) for micro-batch training
Source: Micro-Batch Training with Batch-Channel Normalization and Weight Standardization [Qiao et al., 2019]

Weight Normalization
● Simple weight normalization (more rarely used) scales the magnitude of the weights to a hyperparameter g (gain):
  w = (g / ‖v‖) · v
● Separation of the magnitude of the weight (g) from its direction (v), which is trainable

Spectral Normalization
● Normalize the weight matrix W by its spectral norm:
  Ŵ = W / ‖W‖₂
  where W ∈ ℝ^{C_out × (C_in·W·H)} is a 2D representation of W ∈ ℝ^{C_out × C_in × W × H}
● ‖W‖₂ = max_h ‖Wh‖₂ / ‖h‖₂ = max_{‖h‖₂=1} ‖Wh‖₂ = σ₁(W), the largest singular value of W
● Used in the training of Generative Adversarial Networks (GANs) – see more in the ATAI course
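A minimal PyTorch sketch of spectral normalization via power iteration: estimate the largest singular value σ₁(W) and divide the weight by it. This is illustrative only; in practice torch.nn.utils.spectral_norm provides this as a ready-made hook.

```python
# Minimal sketch of spectral normalization via power iteration
# (illustrative only).
import torch

def spectral_normalize(w, n_iters=5, eps=1e-12):
    """w: 2D reshaped weight of shape (C_out, C_in*W*H); returns w / sigma_1(w)."""
    u = torch.randn(w.shape[0])
    for _ in range(n_iters):                      # power iteration
        v = torch.nn.functional.normalize(w.t() @ u, dim=0, eps=eps)
        u = torch.nn.functional.normalize(w @ v, dim=0, eps=eps)
    sigma1 = torch.dot(u, w @ v)                  # approximates the spectral norm
    return w / sigma1

w = torch.randn(64, 3 * 3 * 3)
w_hat = spectral_normalize(w)
```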

References
● [Laurent et al., 2015] Laurent, C., Pereyra, G., Brakel, P., Zhang, Y., & Bengio, Y. (2016). Batch normalized recurrent neural networks. In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 2657-2661). IEEE.
● [Cooijmans et al., 2016] Cooijmans, T., Ballas, N., Laurent, C., Gülçehre, Ç., & Courville, A. (2016). Recurrent batch normalization. arXiv preprint arXiv:1603.09025.
● [Santurkar et al., 2018] Santurkar, S., Tsipras, D., Ilyas, A., & Mądry, A. (2018). How does batch normalization help optimization? In Proceedings of the 32nd International Conference on Neural Information Processing Systems (pp. 2488-2498).
● [Nam & Kim, 2019] Nam, H., & Kim, H. E. (2018). Batch-instance normalization for adaptively style-invariant neural networks. arXiv preprint arXiv:1805.07925.
● [Wu & He, 2018] Wu, Y., & He, K. (2018). Group normalization. In Proceedings of the European Conference on Computer Vision (ECCV) (pp. 3-19).
● [Park et al., 2019] Park, T., Liu, M. Y., Wang, T. C., & Zhu, J. Y. (2019). Semantic image synthesis with spatially-adaptive normalization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 2337-2346).
● [Qiao et al., 2019] Qiao, S., Wang, H., Liu, C., Shen, W., & Yuille, A. (2019). Micro-batch training with batch-channel normalization and weight standardization. arXiv preprint arXiv:1903.10520.
● Pictures for "Normalization from the Conditioning Perspective": https://distill.pub/2018/feature-wise-transformations/