RNN Model and Applications

Summary

This document provides an overview of recurrent neural networks (RNNs), their architecture, applications, and extensions. It explains the differences between RNNs and CNNs and discusses various RNN applications, including natural language processing (NLP), machine translation, speech recognition, and music information retrieval. It also covers RNN extensions (deep and bidirectional RNNs) and the gated architectures (LSTM, GRU) designed to handle long-term dependencies.

Full Transcript


THE RECURRENT NEURAL NETWORK

- A recurrent neural network (RNN) is a universal approximator of dynamical systems.
- It can be trained to reproduce any target dynamics, up to a given degree of precision.
- [Figure: RNN input/output vs. the observed dynamical system; evolution of the RNN internal state vs. evolution of the system state]
- An RNN generalizes naturally to new inputs of any length.
- An RNN makes use of sequential information by modelling the temporal dependencies in the inputs.
- Example: if you want to predict the next word in a sentence, you need to know which words came before it.
- The output of the network depends on the current input and on the value of the previous internal state.
- The internal state maintains a (vanishing) memory of the history of all past inputs.
- RNNs can in principle use information coming from arbitrarily long sequences, but in practice they are limited to looking back only a few time steps.
- An RNN can be trained to predict a future value of the driving input.
- As a side effect we get a generative model, which allows us to generate new elements by sampling from the output probabilities.
- [Figure: training with teacher forcing vs. generative mode with output feedback]

DIFFERENCES WITH CNN

- Convolution in space (CNN) vs. convolution in time (RNN).
- CNN: models relationships in space; the filter slides along the x and y dimensions.
- RNN: models relationships in time; the "filter" slides along the time dimension.
- [Figure: CNN filter over spatial data (x, y) vs. RNN "filter" over the time dimension]

2. RNN APPLICATIONS

APPLICATION I: NATURAL LANGUAGE PROCESSING

- Given a sequence of words, the RNN predicts the probability of the next word given the previous ones.
- Input/output words are encoded as one-hot vectors (a minimal encoding sketch follows the text-generation examples below).
- We must provide the RNN with the whole dictionary of interest (often just the alphabet).
- In the output layer, we want the green numbers to be high and the red numbers to be low. [Image: Andrej Karpathy]
- Once trained, the RNN can work in generative mode.
- In the NLP context, a generative RNN can be used for Natural Language Generation (NLG).
- Applications:
  - Generate text (human-readable data) from databases of numbers and log files that are not readable by humans.
  - "What you see is what you meant": allows users to see and manipulate the continuously rendered view (NLG output) of an underlying formal-language document (NLG input), thereby editing the formal language without learning it.

NATURAL LANGUAGE GENERATION: SHAKESPEARE

- Dataset: all the works of Shakespeare, concatenated into a single (4.4 MB) file.
- 3-layer RNN with 512 hidden nodes in each layer.
- A few hours of training. [Source: Andrej Karpathy]

TEXT GENERATION: WIKIPEDIA

- Hutter Prize 100 MB dataset of raw Wikipedia.
- LSTM.
- The link in the generated sample does not exist. [Source: Andrej Karpathy]

TEXT GENERATION: SCIENTIFIC PAPER

- RNN trained on a book (16 MB of LaTeX source code).
- Multilayer LSTM. [Source: Andrej Karpathy]
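To make the one-hot encoding and the sampling-based generative mode described above concrete, here is a minimal sketch. The toy vocabulary and the hard-coded "probabilities" are illustrative stand-ins for a trained RNN's softmax output, not material from the slides.

```python
import numpy as np

# Minimal sketch of one-hot encoding and of sampling the next character
# from output probabilities. The vocabulary and the fake probabilities
# below are placeholders; in practice the probabilities would come from
# a trained RNN's softmax output layer.
vocab = ['h', 'e', 'l', 'o']
char_to_idx = {c: i for i, c in enumerate(vocab)}

def one_hot(char):
    """Encode a character as a one-hot vector over the vocabulary."""
    v = np.zeros(len(vocab))
    v[char_to_idx[char]] = 1.0
    return v

print(one_hot('e'))          # [0. 1. 0. 0.]

# Generative mode: sample the next character from the output probabilities.
rng = np.random.default_rng(0)
p_next = np.array([0.1, 0.2, 0.6, 0.1])   # pretend softmax output
next_char = rng.choice(vocab, p=p_next)
print(next_char)
```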
APPLICATION II: MACHINE TRANSLATION

- Similar to language modelling.
- Train 2 different RNNs.
- Input RNN: trained on the source language (e.g. German).
- Output RNN: trained on the target language (e.g. English).
- The second RNN computes the output from the hidden layer of the first RNN.
- Used, for example, in Google Translate. [Image: Richard Socher]

APPLICATION III: SPEECH RECOGNITION

- Input: a sequence of acoustic signals.
- Output: phonetic segments.
- An encoder/decoder is needed to move between the analogue and digital domains.
- Graves, Alex, and Navdeep Jaitly. "Towards End-to-End Speech Recognition with Recurrent Neural Networks", 2014.

APPLICATION IV: IMAGE TAGGING

- RNN + CNN jointly trained.
- The CNN generates features (hidden state representation).
- The RNN reads the CNN features and produces the output (end-to-end training).
- Aligns the generated words with features found in the images.
- Karpathy, Andrej, and Li Fei-Fei. "Deep visual-semantic alignments for generating image descriptions", 2015.

APPLICATION V: TIME SERIES PREDICTION

- Forecast of future values in a time series from past observed values.
- Many applications:
  - Weather forecast.
  - Load forecast.
  - Financial time series.
- [Figures: electricity load and telephone traffic time series]

APPLICATION VI: MUSIC INFORMATION RETRIEVAL

- MIR: identification of songs/music.
- Automatic categorization.
- Recommender systems.
- Track separation and instrument recognition.
- Music generation.
- Automatic music transcription.
- [Figures: music transcription example, automatic categorization software, software with a recommender system] [Source: Meinard Müller]

3. DESCRIPTION OF RNN MODEL

ARCHITECTURE COMPONENTS

- x: input
- y: output
- h: internal state (memory of the network)
- W_ih: input weights
- W_hh: recurrent layer weights
- W_ho: output weights
- z^-1: time-delay unit
- f: neuron transfer function

STATE UPDATE AND OUTPUT GENERATION

- An RNN selectively summarizes an input sequence in a fixed-size state vector via a recursive update.
- Discrete, time-independent difference equations of the RNN state and output:
  h(t+1) = f(W_hh h(t) + W_ih x(t+1) + b_h),
  y(t+1) = g(W_ho h(t+1) + b_o).
- f() is the transfer function implemented by each neuron (usually the same non-linear function for all neurons).
- g() is the readout of the RNN. It is usually the identity function (all the non-linearity is provided by the internal processing units, i.e. the neurons) or the softmax function.
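The two difference equations above map directly onto code. Below is a minimal sketch assuming tanh as the transfer function f and the identity as the readout g; the dimensions and random weights are placeholders, not values from the slides.

```python
import numpy as np

# Minimal sketch of the RNN state update and output generation above.
# Dimensions and random weights are placeholders; f = tanh, g = identity.
rng = np.random.default_rng(0)
n_in, n_hid, n_out = 3, 8, 2
W_ih = rng.normal(scale=0.1, size=(n_hid, n_in))   # input weights
W_hh = rng.normal(scale=0.1, size=(n_hid, n_hid))  # recurrent weights
W_ho = rng.normal(scale=0.1, size=(n_out, n_hid))  # output weights
b_h, b_o = np.zeros(n_hid), np.zeros(n_out)

def rnn_step(h, x):
    """h(t+1) = f(W_hh h(t) + W_ih x(t+1) + b_h); y(t+1) = g(W_ho h(t+1) + b_o)."""
    h_new = np.tanh(W_hh @ h + W_ih @ x + b_h)
    y_new = W_ho @ h_new + b_o
    return h_new, y_new

# The same cell is applied at every time step, whatever the sequence length.
h = np.zeros(n_hid)
for x_t in rng.normal(size=(5, n_in)):   # a toy input sequence of length 5
    h, y = rnn_step(h, x_t)
print(y)
```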
NEURON TRANSFER FUNCTION

- The activation function in an RNN is traditionally implemented by a sigmoid.
  - Saturation causes vanishing gradients.
  - Non-zero centering produces only positive outputs, which leads to zig-zagging dynamics in the gradient updates.
- Another common choice is the tanh.
  - Saturation causes vanishing gradients.
- ReLU (not much used in RNNs).
  - Greatly accelerates gradient convergence and has a low computational cost.
  - No vanishing gradient.
  - A large gradient flowing through a ReLU neuron can cause its "death".

TRAINING

- The model's parameters are trained with gradient descent.
- A loss function is evaluated on the error made by the network on the training set plus, usually, a regularization term:
  L = E(y, ŷ) + λR,
  where E() is the error function, y and ŷ are the target and estimated outputs, λ is the regularization parameter, and R is the regularization term.
- The derivative of the loss function with respect to the model parameters is backpropagated through the network.
- Weights are adjusted until a stop criterion is met:
  - The maximum number of epochs is reached.
  - The loss function stops decreasing.

REGULARIZATION

- Introduces a bias, necessary to prevent the RNN from overfitting the training data.
- In order to generalize well to unseen data, the variance (complexity) of the model should be limited.
- Common regularization terms:
  1. L1 regularization of the weights, ||W||_1: enforces sparsity in the weights.
  2. L2 regularization of the weights, ||W||_2: enforces small values for the weights.
  3. L1 + L2 (elastic net penalty): combines the two previous regularizations.
  4. Dropout: usually applied only to the output weights. Dropout on the recurrent layer is more complicated (the weights are constrained to be the same at each time step by BPTT) and requires a workaround.

RNN UNFOLDING

- In order to train the network with gradient descent, the RNN must be unfolded.
- Each replica of the network corresponds to a different time interval.
- The architecture of the unfolded network becomes very deep, even starting from a shallow RNN.
- The weights of all replicas are constrained to be the same.
- Fewer parameters than in other deep architectures.

BACK PROPAGATION THROUGH TIME

- In the example, we need to backpropagate the gradient from the current time (t_3) back to the initial time (t_0), using the chain rule:
  ∂E_3/∂W = Σ_{k=0..3} (∂E_3/∂ŷ_3) (∂ŷ_3/∂h_3) (∂h_3/∂h_k) (∂h_k/∂W).
- We sum up the contributions of each time step to the gradient.
- With a very long (possibly infinite) sequence, the depth becomes intractable.
- Solution: repeat the procedure only up to a given number of steps back in time (truncated BPTT).
- Why does it work? Because each state carries a little bit of information about every previous input.
- Once the network is unfolded, the procedure is analogous to standard backpropagation in deep feedforward neural networks.
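As a numerical illustration of the chain-rule sum above, the sketch below accumulates the contribution of each time step k to ∂E_3/∂W_hh for a tiny tanh RNN with a squared-error loss at t = 3. The loss, dimensions, weights and inputs are my assumptions, not values from the slides; truncated BPTT would simply stop the backward loop after a fixed window.

```python
import numpy as np

# Toy BPTT sketch: accumulate the contribution of each time step k to the
# gradient dE_3/dW_hh, as in the chain-rule sum on the slide above.
rng = np.random.default_rng(0)
n_in, n_hid = 2, 4
W_ih = rng.normal(scale=0.5, size=(n_hid, n_in))   # input weights
W_hh = rng.normal(scale=0.5, size=(n_hid, n_hid))  # recurrent weights
W_ho = rng.normal(scale=0.5, size=(1, n_hid))      # output weights

x = rng.normal(size=(4, n_in))   # inputs x_0 .. x_3
y_target = 1.0                   # target for the output at t = 3

# Forward pass: h_t = tanh(W_hh h_{t-1} + W_ih x_t), starting from h_{-1} = 0.
h = [np.zeros(n_hid)]            # h[k+1] holds h_k
for t in range(4):
    h.append(np.tanh(W_hh @ h[-1] + W_ih @ x[t]))
y3 = (W_ho @ h[-1]).item()       # linear readout at t = 3
dE_dy3 = y3 - y_target           # E_3 = 0.5 * (y3 - y_target)^2

# Backward pass: dE_3/dW_hh = sum_{k=0..3} of the chain-rule terms.
grad_W_hh = np.zeros_like(W_hh)
delta = dE_dy3 * W_ho.flatten() * (1 - h[4] ** 2)   # error at the pre-activation of h_3
for k in range(3, -1, -1):
    grad_W_hh += np.outer(delta, h[k])              # contribution of time step k
    delta = (W_hh.T @ delta) * (1 - h[k] ** 2)      # propagate one step further back
    # truncated BPTT would simply break out of this loop after a fixed window

print(grad_W_hh)
```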
VANISHING GRADIENT: SIMPLE EXPERIMENT

- Bengio, 1991.
- A simple RNN is trained to keep 1 bit of information for T time steps.
- P(success | T) decreases exponentially as T increases. [Image: Yoshua Bengio]

HOW TO LIMIT THE VANISHING GRADIENT ISSUE?

- Use ReLU activations (in RNNs, however, they cause the "dying neurons" problem).
- Use LSTM or GRU architectures (discussed later).
- Use a proper initialization of the weights in W.

4. RNN EXTENSIONS

DEEP RNN (1/2)

- Increase the depth of the RNN to increase its expressive power.
- N.B. here we add depth in SPACE (as in FFNNs), not in TIME.
- Pascanu, R., Gulcehre, C., Cho, K., & Bengio, Y. "How to construct deep recurrent neural networks", 2013.
- [Figure: standard RNN vs. stacked RNN, which learns different time scales at each layer, from fast to slow dynamics]

DEEP RNN (2/2)

- [Figure: deep input-to-hidden + deep hidden-to-hidden + deep hidden-to-output connections; variant with shortcut connections, useful for letting the gradient flow faster during backpropagation]

BIDIRECTIONAL RNN

- The output at time t may depend not only on the previous elements in the sequence, but also on future elements.
- Example: to predict a missing word in a sequence you want to look at both the left and the right context.
- Two RNNs stacked on top of each other.
- The output is computed from the hidden states of both RNNs.
- Huang, Zhiheng, Xu Wei, and Yu Kai. "Bidirectional LSTM Conditional Random Field Models for Sequence Tagging."

DEEP (BIDIRECTIONAL) RNN

- Similar to bidirectional RNNs, but with multiple hidden layers per time step.
- Higher learning capacity.
- Needs a lot of training data (the deeper the architecture, the harder the training).
- Graves, Alex, Navdeep Jaitly, and Abdel-rahman Mohamed. "Hybrid speech recognition with deep bidirectional LSTM", 2013.

5. GATED RNNS

Inspired by: http://colah.github.io/posts/2015-08-Understanding-LSTMs/

LONG-TERM DEPENDENCIES

- Due to the vanishing gradient, RNNs are incapable of learning long-term dependencies.
- Some applications require both short- and long-term dependencies.
- Example: Natural Language Processing (NLP).
- In some cases, short-term dependencies are sufficient to make predictions.
- Consider training the RNN to make 1-step-ahead predictions.
- To predict the word "sea" it is sufficient to look only 3 steps back in time.
- In this case, it is sufficient to backpropagate the gradient 3 steps back to successfully learn the task.
- Let's stick to 1-step-ahead prediction.
- Consider the sentence: "I am from Rome. I was born 30 years ago. I like eating good food and riding my bike. My native language is Italian."
- When we want to predict the word "Italian", we have to look back several time steps, up to the word "Rome".
- In this case, the short-term memory of the RNN would not do the trick.
- As the gap in time grows, it becomes harder for an RNN to handle the problem.
- Let's see how the Long Short-Term Memory (LSTM) handles this difficulty.

LSTM OVERVIEW

- Introduced by Hochreiter & Schmidhuber (1997).
- Works very well on many different problems and is widely used nowadays.
- Like RNNs, LSTMs must be unfolded in time to be trained and understood.
- Recall the unfolded version of an RNN: a very simple processing unit is repeated at each time step.
- The processing unit of the LSTM is more complex and is called a cell.
- An LSTM cell is composed of 4 layers, interacting with each other in a special way.

CELL STATE AND GATES

- The state of a cell at time t is C_t.
- The LSTM modifies the state only through linear interactions: information flows smoothly across time.
- The LSTM protects and controls the information in the cell through 3 gates.
- Gates are implemented by a sigmoid and a pointwise multiplication.

FORGET GATE

- Decides what information should be discarded from the cell state.
- f_t = σ(W_f · [h_{t-1}, x_t] + b_f), with f_t = 0 meaning "completely get rid of the content of C_{t-1}" and f_t = 1 meaning "completely keep it".
- The gate is controlled by the current input x_t and the past cell output h_{t-1}.
- NLP example: the cell state keeps the gender of the present subject in order to use the correct pronouns. When a new subject is seen, the gender of the old subject is forgotten.

UPDATE GATE

- With the forget gate we decided whether or not to forget the cell content. Next, with the update gate we decide how much to update the old state C_{t-1} with a new candidate C̃_t.
- Update gate: i_t = σ(W_i · [h_{t-1}, x_t] + b_i), with 0 meaning "no update" and 1 meaning "complete update".
- New candidate: C̃_t = tanh(W_C · [h_{t-1}, x_t] + b_C).
- New state: C_t = f_t ∗ C_{t-1} + i_t ∗ C̃_t.
- In the NLP example, we update the cell state when we see a new subject.
- Note that the new candidate is computed exactly like the state in a traditional RNN (same difference equation).

OUTPUT GATE

- The output is a filtered version of the cell state.
- The cell state is fed into a tanh, which squashes its values between -1 and 1.
- Then the gate selects the part to be returned as output.
- Output gate: o_t = σ(W_o · [h_{t-1}, x_t] + b_o), with 0 meaning "no output" and 1 meaning "return the complete cell state".
- Cell output: h_t = o_t ∗ tanh(C_t).
- In the NLP example, after having seen a subject, if a verb comes next the cell outputs information about it being singular or plural.

COMPLETE FORWARD PROPAGATION STEP

- f_t = σ(W_f · [h_{t-1}, x_t] + b_f)   (forget gate)
- i_t = σ(W_i · [h_{t-1}, x_t] + b_i)   (input gate)
- o_t = σ(W_o · [h_{t-1}, x_t] + b_o)   (output gate)
- C̃_t = tanh(W_C · [h_{t-1}, x_t] + b_C)   (new state candidate)
- C_t = f_t ∗ C_{t-1} + i_t ∗ C̃_t   (new cell state)
- h_t = o_t ∗ tanh(C_t)   (cell output)
- Parameters of the model: W_f, W_i, W_C, W_o, b_f, b_i, b_C, b_o.
- Note that the weight matrices are larger than in a normal RNN (they multiply the concatenation of x_t and h_{t-1}).
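A minimal sketch of one LSTM cell step implementing the six equations above; the dimensions, random weights, and toy input sequence are placeholders for illustration, not values from the slides.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One LSTM cell step following the six equations above.
# Dimensions and random weights are illustrative placeholders.
rng = np.random.default_rng(0)
n_in, n_hid = 3, 5
shape = (n_hid, n_hid + n_in)          # each matrix acts on the concatenation [h, x]
W_f, W_i, W_o, W_c = (rng.normal(scale=0.1, size=shape) for _ in range(4))
b_f, b_i, b_o, b_c = (np.zeros(n_hid) for _ in range(4))

def lstm_step(h_prev, C_prev, x_t):
    z = np.concatenate([h_prev, x_t])   # [h_{t-1}, x_t]
    f_t = sigmoid(W_f @ z + b_f)        # forget gate
    i_t = sigmoid(W_i @ z + b_i)        # input (update) gate
    o_t = sigmoid(W_o @ z + b_o)        # output gate
    C_tilde = np.tanh(W_c @ z + b_c)    # new state candidate
    C_t = f_t * C_prev + i_t * C_tilde  # new cell state
    h_t = o_t * np.tanh(C_t)            # cell output
    return h_t, C_t

h, C = np.zeros(n_hid), np.zeros(n_hid)
for x_t in rng.normal(size=(4, n_in)):  # a toy input sequence of length 4
    h, C = lstm_step(h, C, x_t)
print(h)
```

Note how every weight matrix acts on the concatenation [h_{t-1}, x_t], which is why the matrices are larger than in a plain RNN.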
LSTM DOWNSIDES

- Gates are never exactly 1 or 0: the content of the cell is inevitably corrupted after a long time.
- Even if the LSTM provides a huge improvement w.r.t. the RNN, it still struggles with very long time dependencies.
- Number of parameters: 4 · (N_i + 1) · N_o + N_o^2.
- Example: input = time series of 100 elements, 256 LSTM units → 4 · 101 · 256 + 256^2 = 103,424 + 65,536 = 168,960 parameters.
- It scales up quickly! Many parameters require a lot of training data.
- Rule of thumb: the number of data elements should be one order of magnitude greater than the number of parameters.
- Memory problems when dealing with a lot of data.
- Long training time (use GPU computing).

GATED RECURRENT UNITS

- Several LSTM variants exist.
- The GRU is one of the most famous alternative architectures (Cho et al., 2014).
- It combines the forget and input gates into a single update gate.
- It also merges the cell state and the hidden state, and makes some other changes.
- The cell is characterized by fewer parameters.

GRU FORWARD PROPAGATION STEP

- r_t = σ(W_r · [h_{t-1}, x_t])   (reset gate)
- z_t = σ(W_z · [h_{t-1}, x_t])   (update gate: the merge of the forget and input gates)
- h̃_t = tanh(W · [r_t ∗ h_{t-1}, x_t])   (new candidate output; cell state and output are merged)
- h_t = (1 - z_t) ∗ h_{t-1} + z_t ∗ h̃_t   (cell output)
- Which performs better, LSTM or GRU? It depends on the problem at hand: Chung, Junyoung, et al. "Empirical evaluation of gated recurrent neural networks on sequence modeling", 2014.
- 'No free lunch' theorem.
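For comparison with the LSTM sketch above, here is a minimal GRU cell step implementing the four equations; again, the dimensions and random weights are illustrative placeholders.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One GRU cell step following the four equations above.
# Dimensions and random weights are illustrative placeholders.
rng = np.random.default_rng(0)
n_in, n_hid = 3, 5
W_r = rng.normal(scale=0.1, size=(n_hid, n_hid + n_in))
W_z = rng.normal(scale=0.1, size=(n_hid, n_hid + n_in))
W_h = rng.normal(scale=0.1, size=(n_hid, n_hid + n_in))

def gru_step(h_prev, x_t):
    z_in = np.concatenate([h_prev, x_t])                           # [h_{t-1}, x_t]
    r_t = sigmoid(W_r @ z_in)                                      # reset gate
    z_t = sigmoid(W_z @ z_in)                                      # update gate
    h_tilde = np.tanh(W_h @ np.concatenate([r_t * h_prev, x_t]))   # candidate
    return (1.0 - z_t) * h_prev + z_t * h_tilde                    # new hidden state / output

h = np.zeros(n_hid)
for x_t in rng.normal(size=(4, n_in)):   # a toy input sequence of length 4
    h = gru_step(h, x_t)
print(h)
```

With one gate fewer and no separate cell state, the GRU cell indeed has fewer parameters than the LSTM cell, as noted above.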
