Neural Networks Lecture 5 (contd) RNN Applications 2022 PDF

Document Details

Uploaded by VictoriousGlockenspiel

2022

Tags

recurrent neural networks, time series modeling, lstm, deep learning

Summary

This document is a lecture on recurrent neural networks (RNNs), focusing on RNN-based models and training techniques for time series analysis. It covers sequence-to-sequence models, bidirectional LSTMs, ConvLSTMs, issues in time series modeling, and regularization and normalization in RNNs.

Full Transcript

Neural Networks - Lecture 5 (contd): RNN based models and training techniques

Outline
● Sequence-to-Sequence Models
● RNNs in time series modeling
  – Bidirectional LSTM
  – ConvLSTM
  – Issues in time series modeling
● Regularization and Normalization in RNNs

Sequence-to-Sequence Models

RNN Tasks
● e.g. Translation: sequence of words → sequence of words (Source: CS231n Lecture 10)

Sequence-to-sequence models
● Original paper: “Sequence to Sequence Learning with Neural Networks”, Sutskever et al., 2014
● Encoder-Decoder model
● Applied to:
  – Machine Translation
  – Text Summarization
  – Speech Recognition
  – Multi-step Time Series Prediction

Sequence-to-sequence models
● seq2seq models are encoder-decoder models
  – They commonly use RNNs to encode the source (input) into a single vector → intuitively the “context” of the whole input sequence
  – They use RNNs again in the decoder to decode the “context” vector into a sequence of output tokens
● (Figures: Encoder model, Decoder model)

Sequence-to-sequence models
● Forward pass in a seq2seq model (a code sketch follows at the end of this section):
  – Pass the input, previous hidden state and previous cell state (y_t, s_{t-1}, c_{t-1}) to the decoder
  – Receive the prediction, next hidden state and next cell state (ŷ_{t+1}, s_t, c_t)
  – Place the prediction ŷ_{t+1} in the tensor of predictions (outputs)
  – Decide whether to use teacher forcing:
    ● If true: the next input is the ground-truth next token y_{t+1} (trg[t])
    ● If false: the next input is the predicted token ŷ_{t+1} (top1 → argmax over the prediction tensor)
  – Break the loop when the decoder predicts the <e> (END) token

Sequence-to-sequence models
● Advantages
  – Can work with variable-length input and output sequence pairs
  – Can be used in an auto-encoding setup (input and output are the same) → use the “context vector” (z) in subsequent time series analysis tasks (e.g. change point detection, classification)
● Disadvantages
  – On long input sequences, the context vector is forced to “compress” a lot of information before any of it is decoded
  – Long sequences still suffer from training problems, even when using LSTMs or GRUs
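To make the forward-pass bullets above concrete, here is a minimal PyTorch sketch of the decoding loop with teacher forcing. It is not the lecture's code: the class, argument and variable names (Seq2Seq, src, trg, teacher_forcing_ratio) are illustrative, and it assumes encoder/decoder modules with the per-step interface described on the slides.

import random
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    def __init__(self, encoder: nn.Module, decoder: nn.Module, trg_vocab_size: int):
        super().__init__()
        self.encoder = encoder          # assumed: maps src -> (hidden, cell), the "context"
        self.decoder = decoder          # assumed one step: (y_t, s_{t-1}, c_{t-1}) -> (ŷ_{t+1}, s_t, c_t)
        self.trg_vocab_size = trg_vocab_size

    def forward(self, src, trg, teacher_forcing_ratio=0.5):
        # src: (src_len, batch) and trg: (trg_len, batch) tensors of token ids
        trg_len, batch_size = trg.shape
        outputs = torch.zeros(trg_len, batch_size, self.trg_vocab_size, device=trg.device)

        # encode the whole input sequence into a single context (hidden, cell)
        hidden, cell = self.encoder(src)

        # first decoder input is the start-of-sequence token
        input_token = trg[0]
        for t in range(1, trg_len):
            # one decoding step: prediction plus updated hidden and cell states
            prediction, hidden, cell = self.decoder(input_token, hidden, cell)
            outputs[t] = prediction      # place the prediction in the outputs tensor

            # teacher forcing: feed the ground-truth token with some probability,
            # otherwise feed the model's own most likely prediction (argmax / top1)
            use_teacher_forcing = random.random() < teacher_forcing_ratio
            top1 = prediction.argmax(dim=1)
            input_token = trg[t] if use_teacher_forcing else top1
            # at inference time one would instead break the loop once the
            # decoder emits the <e> (END) token
        return outputs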
RNNs for Time Series Modeling

RNN for Time Series Forecasting
● Alternative models for time series analysis: Bidirectional LSTM, ConvLSTM
● Univariate vs. multivariate time series
● Single-step vs. multi-step forecasting

Bidirectional LSTM
● Addresses the issue of long-range dependencies (how to learn sequences on the order of hundreds or thousands of time steps). (Image source: Medium, Analytics Vidhya)
● Customization point in a Bidirectional LSTM: how to combine the outputs of the forward and backward layers in the activation layer that produces y_t
  – Sum, point-wise multiplication, concatenation, average

ConvLSTM
● Applied to sequences of visual information, e.g. social media videos, satellite pictures, security footage
● Operates similarly to an LSTM cell, but the internal matrix multiplications are replaced with convolution operations
  – In the cell equations, “*” denotes the convolution operation and “∘” the element-wise product
  – X_t, H_t, C_t are all 3D tensors
● Data flowing through the cell keeps its 3D shape (instead of a 1D vector of features)
● The input to a ConvLSTM is a 5D tensor of shape (samples, time_steps, channels, rows, cols)
● (Figures: ConvLSTM cell, image source: Medium, neurino; inner structure of ConvLSTM, image source: “Convolutional LSTM Network: A Machine Learning Approach for Precipitation Nowcasting”, Shi et al., 2015)
● Highlights:
  – ConvLSTM can be better than a fully connected LSTM at handling spatio-temporal relations
  – A state-to-state (W_hc) kernel size > 1 leads to better capturing of spatio-temporal motion patterns
  – Deeper models (stacked ConvLSTMs) can produce better results with fewer parameters

Univariate vs Multivariate
● Univariate
  – Each item in the time series is a scalar value
● Multivariate
  – Each item in the time series is a vector; the shape of x_t is (1, n_features)
  – The output can be a single time series or multiple series (multivariate output)
  – Challenge: input series alignment

Single-step vs Multi-step Forecast
● Single-step learning setup:
  – The model learns to predict x_{t+1} given [x_{t-k}, x_{t-k+1}, …, x_{t-1}]
● Multi-step learning setup:
  – The model learns to predict [x_{t+1}, x_{t+2}, …, x_{t+n_ahead}] given [x_{t-n_lag}, x_{t-n_lag+1}, …, x_{t-1}]
  – Strategies:
    ● Repeated single-step prediction
    ● Direct vector output prediction (needs a loss function suited to vector targets, e.g. Mean Absolute Error)
    ● Encoder-Decoder setup

Considerations for RNN-based Time Series Analysis
● Challenges to be addressed when applying RNNs to time series:
  – Handling missing values
  – Modeling seasonality (e.g. Seasonal and Trend decomposition using Loess - STL)
  – Variance stabilization
  – Multiple-output strategy selection
  – Trend or mean normalization (to alleviate saturation of the RNN gates – sigmoid, tanh functions)
● See a very useful survey: “Recurrent Neural Networks for Time Series Forecasting: Current Status and Future Directions”, Hewamalage et al., 2021
● Check out the GluonTS suite for time series forecasting

Regularization and Normalization in RNNs
● For RNNs there is discussion on whether to apply dropout to the feed-forward connections only or to the recurrent (hidden-to-hidden) connections as well
● Typical dropout removes whole nodes
● Variations for RNNs:
  – DropConnect – drops individual weights rather than neurons
  – Variational Dropout – the same dropout mask is applied at every time step, including on the hidden-layer connections → better performance on sentiment analysis and language modeling (Gal & Ghahramani, 2016 - A Theoretically Grounded Application of Dropout in Recurrent Neural Networks)
  – Recurrent Dropout – applies dropout to the hidden state update vectors rather than to the hidden state itself (figure: Semeniuta et al., “Illustration of the three types of dropout in recurrent connections of LSTM networks. Dashed arrows refer to dropped connections.”)
  – Zoneout – a variation on dropout (Krueger et al., 2017 - Zoneout: Regularizing RNNs by Randomly Preserving Hidden Activations): replace some unit activations with their activations from the previous time step → preserves history and improves gradient flow
  – See more techniques here
● Layer Normalization
  – Normalizes the inputs across features ⇒ the normalization statistics are independent of the other examples in the batch
  – Advantage: can be applied to the layers of an RNN; works with small minibatches
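As a rough illustration of two ideas from this last section (not code from the lecture), the sketch below wraps a PyTorch nn.LSTMCell in a manual time loop, samples one variational-style dropout mask that is reused at every time step on the hidden state, and applies nn.LayerNorm to the hidden state; the class name VDLayerNormLSTM and all shapes are my own assumptions.

import torch
import torch.nn as nn

class VDLayerNormLSTM(nn.Module):
    def __init__(self, input_size: int, hidden_size: int, dropout: float = 0.25):
        super().__init__()
        self.cell = nn.LSTMCell(input_size, hidden_size)
        self.norm = nn.LayerNorm(hidden_size)    # normalizes across features, per example
        self.hidden_size = hidden_size
        self.dropout = dropout

    def forward(self, x):                        # x: (seq_len, batch, input_size)
        seq_len, batch, _ = x.shape
        h = x.new_zeros(batch, self.hidden_size)
        c = x.new_zeros(batch, self.hidden_size)

        # variational-style dropout: ONE mask per sequence, reused at every time step
        if self.training and self.dropout > 0:
            keep = 1.0 - self.dropout
            mask = x.new_empty(batch, self.hidden_size).bernoulli_(keep) / keep
        else:
            mask = x.new_ones(batch, self.hidden_size)

        outputs = []
        for t in range(seq_len):
            h, c = self.cell(x[t], (h, c))       # one recurrent step
            h = self.norm(h)                     # layer normalization of the hidden state
            h = h * mask                         # same dropout mask at each step
            outputs.append(h)
        return torch.stack(outputs), (h, c)

# usage: ln_lstm = VDLayerNormLSTM(8, 16); y, (h, c) = ln_lstm(torch.randn(20, 4, 8))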
