ML: Lecture 6 – Recurrent Neural Networks
Panagiotis Papapetrou, PhD
Professor, Stockholm University

Syllabus
Jan 15 Introduction to machine learning
Jan 17 Regression analysis
Jan 18 Laboratory session I: numpy and linear regression
Jan 22 Machine learning pipelines
Jan 25 Laboratory session II: ML pipelines and grid search
Jan 29 Training neural networks
Jan 31 Laboratory session III: training NNs and tensorflow basics
Feb 6 Convolutional neural networks
Feb 8 Recurrent neural networks and autoencoders
Feb 12 Time series classification
Feb 15 Laboratory session IV: CNNs and RNNs
Feb 19 Deep time series classification

Today
– What are RNNs? What are their main principles?
– How to train RNNs?
– Gated RNNs and their benefits

Recurrent Neural Networks (RNNs) (Rumelhart et al. 1986)
A family of neural networks for processing sequential data, specialized for processing a sequence of values x(1), ..., x(τ).
Just as CNNs can be applied to images of large width and height, RNNs can scale to much longer sequences and can process sequences of variable length.
A fully connected feed-forward network:
– has separate parameters for each input feature
– must therefore learn the rules of the input separately at each position
An RNN, in contrast, shares the same weights across several time steps.

Applications of RNNs
Language modelling, machine translation, stock market prediction, speech recognition, generating image captions, video tagging, text summarization, learning from medical data.

Dynamical Systems
The classical form: s(t) = f(s(t-1); θ)
– s(t): the state of the system at time t
– the definition of s at time t refers back to its definition at time t-1
– for a finite number of steps τ, the graph can be unfolded by applying the function f τ-1 times
– e.g., for τ = 3: s(3) = f(s(2); θ) = f(f(s(1); θ); θ)

Dynamical Systems (continued)
Consider a dynamical system driven by an exogenous signal x(t): s(t) = f(s(t-1), x(t); θ).
Any function involving such a recurrence can be considered an RNN.
The intermediate (i.e., hidden) states are defined by the function h: h(t) = f(h(t-1), x(t); θ).
What is left: the addition of extra architectural features, such as output layers that read information out of the state h to make predictions.

The basic RNN task
Predict the future from the past: map the past sequence into a fixed-length vector h(t).
How much of the past should one store/model in h(t)?
Recurrent vs unfolded version?

Three common vanilla RNNs
Version I:
– output at each time step
– recurrent connections between hidden units
Version II:
– output at each time step
– recurrent connections only from the output at one time step to the hidden units at the next
Version III:
– recurrent connections between hidden units
– reads an entire sequence and produces a single output

The Vanilla RNN – ver 1
Output at each time step; recurrent connections between hidden units.
o: output values; y: target values; L: comparison between softmax(o) and the target.
The weights are shared over time: copies of the RNN cell are made over time (unfolding), with different inputs at different time steps.
U: input-to-hidden connections
W: hidden-to-hidden recurrent connections
V: hidden-to-output connections
b, c: bias vectors
For each time step t = 1 to t = τ:
h(t) = tanh(b + W h(t-1) + U x(t))   (tanh being the typical choice of hidden activation)
o(t) = c + V h(t)
ŷ(t) = softmax(o(t))
L(t): the cross-entropy loss of the prediction ŷ(t) against the target y(t), given {x(1), ..., x(t)}.
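To make the ver 1 updates concrete, here is a minimal numpy sketch of the forward pass. The tanh hidden activation, the parameter shapes and the toy dimensions are illustrative assumptions, not something fixed by the slides.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def vanilla_rnn_forward(xs, U, W, V, b, c, h0):
    """Vanilla RNN (ver 1): one prediction per time step.
    U: input-to-hidden, W: hidden-to-hidden, V: hidden-to-output, b/c: bias vectors."""
    h = h0
    predictions = []
    for x_t in xs:                          # xs = [x(1), ..., x(tau)]
        h = np.tanh(b + W @ h + U @ x_t)    # hidden state h(t)
        o = c + V @ h                       # output o(t)
        predictions.append(softmax(o))      # yhat(t) = softmax(o(t))
    return predictions

# Toy usage: 4-dimensional inputs, 3 hidden units, 2 output classes, 5 time steps.
rng = np.random.default_rng(0)
U, W, V = rng.normal(size=(3, 4)), rng.normal(size=(3, 3)), rng.normal(size=(2, 3))
b, c, h0 = np.zeros(3), np.zeros(2), np.zeros(3)
yhat = vanilla_rnn_forward([rng.normal(size=4) for _ in range(5)], U, W, V, b, c, h0)
```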
The Vanilla RNN – ver 2
Output at each time step; recurrent connections only from the output at one time step to the hidden units at the next.
– no direct connections from h going forward
– the network is trained on the output o, and this is the only information it is allowed to send to the future
– easier to train: the time steps can be trained in parallel, since there is no need to compute the output of the previous time step first… why?

Teacher Forcing
Applicable to RNNs that have connections from their output to their hidden states at the next time step.
At training time, we feed the correct output y(t), drawn from the training set, as input to h(t+1).

The Vanilla RNN – ver 3
Recurrent connections between hidden units; the network reads an entire sequence and produces a single output.
It summarizes a sequence into a fixed-size representation, e.g., as input for further processing.

Input – Output Scenarios
Single – Single: feed-forward network
Single – Multiple: image captioning (image only)
Multiple – Single: sentiment classification
Multiple – Multiple: translation; image captioning (previous word too)
Example (Multiple – Single, sentiment classification): the review "Great service again." is fed word by word through hidden states h1, h2, ..., hn, and a single sentiment label is produced at the end.

Image Captioning
Given an image, produce a sentence describing its contents.
Input: an image feature vector (from a CNN). Output: multiple words (e.g., one sentence), such as "The dog is hiding".
(Architectures discussed in https://arxiv.org/pdf/1703.09137.pdf)
At which stage should the image information be introduced (injected) into the RNN? Several possibilities: Init-Inject, Pre-Inject, Par-Inject, Merge.

Init-Inject
The image vector is used as the RNN's initial hidden state vector.
It requires the image vector to have the same size as the RNN hidden state vector.
This is an early-binding architecture and allows the image representation to be modified by the RNN.

Pre-Inject
The image vector is treated as the first word of the prefix.
This requires the image vector to have the same size as the word vectors.
This too is an early-binding architecture and allows the image representation to be modified by the RNN.

Par-Inject
The image vector is input to the RNN together with the word vectors: either (a) as two separate inputs, or (b) with each word vector combined with the image vector into a single input.
The image vector does not need to be exactly the same for each word, nor does it need to be included with every word.

Merge
The RNN is not exposed to the image vector at any point.
Instead, the image is introduced into the language model after the prefix has been encoded by the RNN.
This is a late-binding architecture, and it does not modify the image representation at any time step.

Back-Propagation Through Time (BPTT)
Treat the unfolded network as one big feed-forward network that takes the entire sequence as input.
Compute the gradients through the usual back-propagation procedure and update the shared weights.
The weight gradients are computed for each copy in the unfolded network, then summed (or averaged), and then applied to the common RNN weights.
(Figure: an RNN unfolded over three time steps, with inputs x1–x3, hidden states h0–h3, outputs y1–y3 and losses L1–L3.)

Truncated BPTT
Run forward and backward propagation through chunks (segments) of the sequence instead of the whole sequence.
Back-propagate through a chunk and make a gradient step on the weights.
For the next batch: keep the hidden state from the previous batch and carry the hidden states forward.

Multi-layer RNNs
We can of course design RNNs with multiple hidden layers, in order to solve more complex sequential problems.
(Figure: a multi-layer RNN unfolded over six time steps, with inputs x1–x6 and outputs y1–y6.)
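As a hedged illustration of a multi-layer RNN with an output at every time step (as in ver 1), here is a minimal tf.keras sketch; the SimpleRNN cells, layer sizes and feature/class counts are illustrative assumptions, not values from the slides.

```python
import tensorflow as tf

# Two stacked recurrent layers; return_sequences=True keeps one hidden vector per
# time step so the next layer (and the per-step output) see the whole sequence.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(None, 8)),                 # variable-length sequences, 8 features
    tf.keras.layers.SimpleRNN(32, return_sequences=True),   # first hidden layer
    tf.keras.layers.SimpleRNN(32, return_sequences=True),   # second hidden layer
    tf.keras.layers.TimeDistributed(
        tf.keras.layers.Dense(3, activation="softmax")),    # one prediction y(t) per time step
])
model.compile(optimizer="adam", loss="categorical_crossentropy")
model.summary()
```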
Bi-directional RNNs
RNNs can also process the input sequence in both the forward and the reverse direction.
Popular in speech recognition.
(Figure: a bi-directional RNN unfolded over six time steps, with inputs x1–x6 and outputs y1–y6.)

Vanishing or Exploding Gradients
In the same way that a product of k real numbers can shrink to zero or explode to infinity, so can a product of matrices.
Vanishing gradients: the effect of a change in L on a weight in some layer (i.e., the norm of the gradient) becomes so small, due to increased model complexity (more hidden units and more time steps), that it is effectively zero after a certain point.
Exploding gradients: a large increase in the norm of the gradient during training, due to an explosion of the long-term components, which can result in the gradients growing exponentially.

Gradient Scaling and Clipping
Gradient scaling: normalizing the error gradient vector so that its norm (magnitude) equals a defined value, such as 1.0.
Gradient clipping: forcing the gradient values (element-wise) to a specified minimum or maximum value if the gradient exceeds an expected range.
Together, these methods are often simply referred to as "gradient clipping".
Clipping by norm ensures that the gradient vector g has a norm at most equal to a threshold; the threshold is typically 0.5 to 10 times the average gradient norm over a sufficiently large number of updates.
It is common to use the same gradient clipping configuration for all layers; in some cases, a larger threshold of error gradients is permitted in the output layer than in the hidden layers.
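A minimal tf.keras sketch of both options; the norm threshold of 1.0 echoes the example value above, while the element-wise range of ±0.5 is an illustrative assumption, not a tuned setting.

```python
import tensorflow as tf

# Gradient scaling / clipping by norm: each gradient tensor whose norm exceeds 1.0
# is rescaled so that its norm equals 1.0.
opt_by_norm = tf.keras.optimizers.SGD(learning_rate=0.01, clipnorm=1.0)

# Element-wise gradient clipping: every gradient component is forced into [-0.5, 0.5].
opt_by_value = tf.keras.optimizers.SGD(learning_rate=0.01, clipvalue=0.5)

# In a custom training loop the same idea can be applied by hand, e.g.
# grads = [tf.clip_by_norm(g, 1.0) for g in grads] before optimizer.apply_gradients(...).
```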
Gated RNNs
The most effective sequence models used in practical applications:
– long short-term memory (LSTM)
– gated recurrent unit (GRU)
Gated RNNs create paths through time whose derivatives neither vanish nor explode:
– they allow the network to accumulate information over a long duration
– once this information has been used, the network can forget the old state
– instead of manually deciding when to clear the state, we want the network to learn to decide when to do it

Long Short-Term Memory (LSTM) [Hochreiter et al. 1997]
The LSTM uses this idea of "constant error flow" to create a "Constant Error Carousel" (CEC), which ensures that gradients do not decay.
The key component is a memory cell that acts like an accumulator (it contains the identity relationship) over time.
Instead of computing a new state as a matrix product with the old state, it computes the difference between them.
Expressivity is the same, but the gradients are "better behaved".

LSTM: structure
Each cell has the same input and output as an ordinary recurrent network, but also more parameters and a system of gating units that controls the flow of information.
A sigmoid layer outputs numbers between 0 and 1, determining how much of each component should be let through.
(In the cell diagrams: × denotes point-wise multiplication; + denotes point-wise addition.)

LSTM: gates
The forget gate determines how much of the previous information goes through; the input gate decides what information to add to the cell state; the output gate controls what goes to the output.
C(t): the cell state; it flows along largely unchanged, with only minor linear interactions, and provides the long-term memory capability, storing and loading information about previous events.
f(t): estimates which features of the cell state should be forgotten.
i(t): decides which features of the cell memory should be updated.
C̃(t): the candidate values, i.e., the new input that may go into the long-term memory.
The cell state is then updated, and the output gate decides what part of the cell state goes to the output.

RNN vs LSTM
(Figure: a standard RNN cell next to an LSTM cell.)

Peephole LSTM [Gers & Schmidhuber 2000]
Allows "peeping" into the memory: the gates can also see the cell state from the previous time step, and can hence affect the construction of the output.

Gated Recurrent Unit (GRU) [Cho et al. 2014]
Update gate: determines which information from the previous hidden state and the current input to keep.
Reset gate: determines which information to discard.
A single gating unit combines the forget and input gates into one update gate; the GRU also merges the cell state and the hidden state.
This is simpler than the LSTM (and there are many other variants too).
The GRU has fewer parameters than an LSTM and has been shown to outperform the LSTM on some tasks.

LSTM vs GRU
The GRU uses fewer training parameters and hence needs less memory.
The GRU executes faster than the LSTM, whereas the LSTM is more accurate on larger datasets and longer sequences.
One can choose the LSTM when dealing with long sequences; the GRU is used when less memory is available.

Coming up next…
Jan 15 Introduction to machine learning
Jan 17 Regression analysis
Jan 18 Laboratory session I: numpy and linear regression
Jan 22 Machine learning pipelines
Jan 25 Laboratory session II: ML pipelines and grid search
Jan 29 Training neural networks
Jan 31 Laboratory session III: training NNs and tensorflow basics
Feb 6 Convolutional neural networks
Feb 8 Recurrent neural networks and autoencoders
Feb 12 Time series classification
Feb 15 Laboratory session IV: CNNs and RNNs
Feb 19 Deep time series classification

TODOs
Reading:
– The Deep Learning Book: Chapter 10
Homework 2
– Due: Feb 25
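To make the gate descriptions from the LSTM slides concrete, here is a minimal numpy sketch of a single LSTM step following the standard gate equations; the weight dictionaries W, U, b are illustrative names, not the notation used in the slides.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM time step. W/U/b hold the input weights, recurrent weights and biases
    of the forget (f), input (i), output (o) gates and the candidate cell state (g)."""
    f_t = sigmoid(W["f"] @ x_t + U["f"] @ h_prev + b["f"])  # forget gate: what to drop from c_prev
    i_t = sigmoid(W["i"] @ x_t + U["i"] @ h_prev + b["i"])  # input gate: what new information to write
    o_t = sigmoid(W["o"] @ x_t + U["o"] @ h_prev + b["o"])  # output gate: what part of the cell state to expose
    g_t = np.tanh(W["g"] @ x_t + U["g"] @ h_prev + b["g"])  # candidate cell state C~(t)
    c_t = f_t * c_prev + i_t * g_t                          # cell state: mostly linear "carousel"
    h_t = o_t * np.tanh(c_t)                                # hidden state / output
    return h_t, c_t
```

A GRU step looks similar but merges the forget and input gates into a single update gate and keeps only one state vector, which is why it has fewer parameters than the LSTM.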