Recurrent Neural Network (RNN) Language Models
CS 263: Natural Language Processing
Nanyun (Violet) Peng
Course website: https://violetpeng.github.io/cs263_fall24.html (syllabus, announcements, slides, homework)

Important Announcements
Quiz 1: next Wednesday, in class, for the first 30 minutes. A practice quiz will be released on Gradescope by the end of today. The quiz will be on Gradescope.
We will hand out GCP free credit this Friday once enrollment is finalized.
The project planning report is due next Wednesday, 11:59pm. You should work on teaming ASAP; we're looking for teaming information and which project you pick.

What have we learned so far?
Word embeddings and language models. What are the similarities and differences between the two?
Similarities: both are backed by distributional semantics theory; both can be learned by statistical methods (count and divide) or by neural networks; both learn word representations.
Differences: word embeddings learn "semantic representations" (vectors) for words, while language models estimate probabilities of sentences (and generate sentences).

How about Long-Term Dependencies?
What does "long-term dependency" mean in language modeling? Consider an example:
Kevin's eye was caught by one number that was highlighted in the last report; this number represented the new algorithm's ability to create lengthy and coherent text on its own. This is unprecedented. The text contained more than 500 words, but the algorithm had generated many more than that – several thousands in fact. Kevin opened up two other files that contained several thousand words of AI generated text each. ...[52 words]... Kevin rubbed his hands together in ___

Outline
Recurrent Neural Networks (RNNs)
Long Short-Term Memory Networks (LSTMs)
RNNs/LSTMs for Language Modeling
Decoding Algorithms

Recall: Feedforward Neural Language Model (Y. Bengio et al., JMLR'03)
Model p(x_{t+1} | x_{t-n+1:t}) with a neural network.
Word representations project the inputs into low-dimensional vectors; the projected vectors are concatenated into a context vector c to capture multi-word contexts.
A non-linear function produces a hidden representation, e.g., h = tanh(W_h c + b_h).
Obtain p(y | x) by performing a non-linear projection and softmax: s = W_s h + b_s, p = softmax(s).

Recurrent Neural Networks (RNNs)
Main idea: make use of sequential information.
How is an RNN different from a feedforward neural network? Feedforward neural networks assume all inputs are independent of each other, and they look at a fixed context window. In many cases (especially for language), these assumptions are not ideal.
What do RNNs do? They perform the same computation at every step of a sequence (that's what "recurrent" stands for), and the inputs to the next step depend on previous computations. Another interpretation: RNNs have a "memory" that stores previous computations.

Recurrent Neural Networks (RNNs)
[Figure: an unrolled RNN, showing the hidden state, output state, and activation function at each time step, the inputs at time steps t-1, t, t+1, and the parameters, which are reused recurrently across steps.]

Recurrent Neural Networks (RNNs)
Mathematically, the computation at each time step is:
h_t = tanh(W x_t + U h_{t-1} + b): a linear combination of the current input and the previous step's "memory", passed through an activation function to add non-linearity.
y_t = V h_t + c: a linear transformation that converts the current "memory" into an output vector; predictions are made based on this output vector.
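To make the per-step computation above concrete, here is a minimal NumPy sketch of a single vanilla RNN step. It is an illustration, not the lecture's reference implementation: the parameter names W, U, b, V, c follow the parameter list given later in this lecture, while the dimensions, random initialization, and toy inputs are made up.

```python
import numpy as np

def softmax(s):
    # Subtract the max for numerical stability before exponentiating.
    e = np.exp(s - np.max(s))
    return e / e.sum()

def rnn_step(x_t, h_prev, W, U, b, V, c):
    """One time step of a vanilla RNN.

    h_t = tanh(W x_t + U h_{t-1} + b)   new "memory"
    y_t = softmax(V h_t + c)            distribution over the vocabulary
    """
    h_t = np.tanh(W @ x_t + U @ h_prev + b)
    y_t = softmax(V @ h_t + c)
    return h_t, y_t

# Toy dimensions (hypothetical): input size 8, hidden size 16, vocabulary size 10.
d_in, d_h, vocab = 8, 16, 10
rng = np.random.default_rng(0)
W = rng.normal(size=(d_h, d_in))
U = rng.normal(size=(d_h, d_h))
b = np.zeros(d_h)
V = rng.normal(size=(vocab, d_h))
c = np.zeros(vocab)

# The same parameters are reused ("recurrently") at every step of the sequence.
h = np.zeros(d_h)
for x in rng.normal(size=(5, d_in)):   # a sequence of 5 toy input vectors
    h, y = rnn_step(x, h, W, U, b, V, c)
print(y.sum())  # ~1.0: a probability distribution over the next word
```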
Problem of Long-Term Dependencies
What if we want to predict the next word in a long sentence? Do we know which past information is helpful for predicting the next word?
In theory, RNNs are capable of handling long-term dependencies. But in practice, they are not!
Reading: the vanishing gradient problem.

Long Short-Term Memory (LSTM)
A special type of recurrent neural network, explicitly designed to capture long-term dependencies. It solves/mitigates the vanishing gradient problem of vanilla RNNs.
So, what is the structural difference between an RNN and an LSTM?

Difference between RNN and LSTM
[Figure: side-by-side diagrams of a vanilla RNN cell and an LSTM cell.]

Core Idea Behind LSTM
The key to LSTMs is the memory cell state. LSTM memory cells add and remove information as the sequence goes on; mathematically, this turns the cascading multiplications of vanilla RNNs into additions.
How? Through a structure called a gate. An LSTM has three gates that control the memory in the cells.

Step-by-Step LSTM Walk Through
The input gate decides what information will be taken from the current input and stored in the cell state. It has two parts: a sigmoid layer (the input gate layer) decides which values we'll update, and a tanh layer creates a vector of new candidate values, C̃_t.
Example: add the gender of the new subject to the cell state, replacing the old one we're forgetting.

Step-by-Step LSTM Walk Through
The forget gate decides what information will be thrown away. A sigmoid layer decides which values we'll forget: it looks at h_{t-1} and x_t and outputs a vector of numbers between 0 and 1, where 1 represents "completely keep this" and 0 represents "completely get rid of this".
Example: forget the gender of the old subject when we see a new subject.

Step-by-Step LSTM Walk Through
Next step: update the old cell state C_{t-1} into the new cell state C_t. Multiply the old state by f_t (forgetting the things we decided to forget earlier), then add i_t * C̃_t (adding in the new information): C_t = f_t * C_{t-1} + i_t * C̃_t.

Step-by-Step LSTM Walk Through
Final step: decide what we're going to output. First, we compute an output gate, which decides what parts of the cell state we're going to output. Then we put the cell state through tanh and multiply it by the output of the sigmoid gate.

Putting Everything Together
Calculate the gates, the input information, the central memory, and the output information (a code sketch follows the summary below):
i_t = sigmoid(W_i x_t + U_i h_{t-1} + b_i) (input gate)
f_t = sigmoid(W_f x_t + U_f h_{t-1} + b_f) (forget gate)
o_t = sigmoid(W_o x_t + U_o h_{t-1} + b_o) (output gate)
C̃_t = tanh(W_g x_t + U_g h_{t-1} + b_g) (candidate input information)
C_t = f_t * C_{t-1} + i_t * C̃_t (central memory / cell state)
h_t = o_t * tanh(C_t) (output information / hidden state)

LSTMs Summary
LSTMs are an (advanced) variation of RNNs that capture long-term dependencies in the inputs. They have been shown to be effective on many NLP tasks and were a standard component for encoding text inputs during 2014-2018.
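The walk-through above can be turned into code almost line by line. Below is a minimal NumPy sketch of a single LSTM step; the weight names (W_i, W_f, W_o, W_g, U_i, U_f, U_o, U_g and the biases) follow the parameter list given later in this lecture, while the dimensions, random initialization, and toy inputs are hypothetical and purely for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, params):
    """One LSTM time step, following the gate-by-gate walk-through above.

    i_t  = sigmoid(W_i x_t + U_i h_{t-1} + b_i)   input gate
    f_t  = sigmoid(W_f x_t + U_f h_{t-1} + b_f)   forget gate
    o_t  = sigmoid(W_o x_t + U_o h_{t-1} + b_o)   output gate
    C~_t = tanh(W_g x_t + U_g h_{t-1} + b_g)      candidate values
    C_t  = f_t * C_{t-1} + i_t * C~_t             additive cell-state update
    h_t  = o_t * tanh(C_t)                        new hidden state
    """
    i = sigmoid(params["Wi"] @ x_t + params["Ui"] @ h_prev + params["bi"])
    f = sigmoid(params["Wf"] @ x_t + params["Uf"] @ h_prev + params["bf"])
    o = sigmoid(params["Wo"] @ x_t + params["Uo"] @ h_prev + params["bo"])
    C_tilde = np.tanh(params["Wg"] @ x_t + params["Ug"] @ h_prev + params["bg"])
    C_t = f * C_prev + i * C_tilde   # additions, not cascading multiplications
    h_t = o * np.tanh(C_t)
    return h_t, C_t

# Toy dimensions (hypothetical): input size 8, hidden/cell size 16.
d_in, d_h = 8, 16
rng = np.random.default_rng(0)
params = {}
for g in "ifog":
    params[f"W{g}"] = rng.normal(size=(d_h, d_in))
    params[f"U{g}"] = rng.normal(size=(d_h, d_h))
    params[f"b{g}"] = np.zeros(d_h)

h, C = np.zeros(d_h), np.zeros(d_h)
for x in rng.normal(size=(5, d_in)):   # a sequence of 5 toy input vectors
    h, C = lstm_step(x, h, C, params)
```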
RNN/LSTM Language Models
Model the probability of the next word at each step: P_RNN(w_t | w_{t-1}, h_{t-2}), P_RNN(w_{t+1} | w_t, h_{t-1}), P_RNN(w_{t+2} | w_{t+1}, h_t), ...
log P(w_1, w_2, ..., w_{T-1}, w_T) = log ∏_{t=1}^T P(w_t | w_1^{t-1}) = ∑_{t=1}^T log P(w_t | w_1^{t-1})
This handles infinite context, in theory.

Training an RNN Language Model
[Figure: a sequence of slides stepping through the corpus "the students opened their exams"; at each time step the loss is the negative log probability the model assigns to the next corpus word ("students", "opened", "their", "exams").]

Training an RNN Language Model
Get a corpus of text containing a sequence of words w_1, ..., w_T.
Feed the sequence into an RNN and compute the output distribution ŷ^(t) for every step t, i.e., predict the probability distribution of every next word given all previous context words.
Use the negative log-likelihood of the target word, i.e., the cross-entropy loss, at each step: L(θ) = -log P(w_t | w_{1:t-1}; θ).
Average each step's loss to get the loss for the entire training set.

Learning neural language models
Maximize the log-likelihood (minimize the negative log-likelihood) of the observed data with respect to the parameters θ of the neural language model:
L_t = log P(w_t = w | w_1^{t-1}) = s_θ(w) - log ∑_{v=1}^V e^{s_θ(v)}
argmax_θ log P(w_t = w | w_1^{t-1}, θ)
Parameters θ (in an RNN language model): the word embedding matrix R; RNN weights W, U, b, V, c; LSTM weights W_i, W_f, W_o, W_g, U_i, U_f, U_o, U_g (and, optionally, b_i, b_f, b_o, b_g).
Gradient descent with learning rate η, using auto-differentiation for the gradient: θ ← θ - η ∂L_t/∂θ
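To show how these pieces fit together in practice, here is a minimal sketch of one training step for an LSTM language model. It assumes PyTorch purely as an illustration (the lecture does not prescribe a framework), and the vocabulary size, embedding and hidden dimensions, token indices, and learning rate are all made up for the example.

```python
import torch
import torch.nn as nn

# Hypothetical sizes, for illustration only.
vocab_size, emb_dim, hidden_dim = 10_000, 128, 256

embed = nn.Embedding(vocab_size, emb_dim)            # word embedding matrix R
lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
proj = nn.Linear(hidden_dim, vocab_size)             # hidden state -> vocabulary scores
loss_fn = nn.CrossEntropyLoss()                      # average per-step negative log-likelihood
params = list(embed.parameters()) + list(lstm.parameters()) + list(proj.parameters())
optimizer = torch.optim.SGD(params, lr=0.1)          # learning rate eta

# Toy batch: one sequence of word indices, e.g. "the students opened their exams".
tokens = torch.tensor([[4, 87, 512, 33, 901]])       # made-up vocabulary indices
inputs, targets = tokens[:, :-1], tokens[:, 1:]      # predict each next word from its prefix

hidden_states, _ = lstm(embed(inputs))               # (batch, steps, hidden_dim)
logits = proj(hidden_states)                         # scores s_theta(v) at every step
loss = loss_fn(logits.reshape(-1, vocab_size), targets.reshape(-1))

optimizer.zero_grad()
loss.backward()                                      # auto-gradient: backprop through time
optimizer.step()                                     # theta <- theta - eta * dL/dtheta
```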
Decoding algorithms for LMs
Decoding: selecting the word to generate at each time step.
Greedy search: select the word with the highest p(x_t | x_{1:t-1}) given by the softmax layer. Issues?
Beam search: explore several different hypotheses instead of just a single one. Keep track of the k most probable partial hypotheses (e.g., partial translations) at each decoder step instead of just one, choosing the k words with the highest p(y_i) at each time step. k is the beam size (typically 5-10).

Beam search decoding: example
[Figure: several slides stepping through beam search with beam size k = 2, showing the log-likelihood of each partial hypothesis at each step, and the search state a few steps later.]
Does beam search always return the most probable sequence? What are the termination conditions of beam search?

What's the effect of changing beam size k?
A small k has problems similar to greedy decoding (k = 1): unnatural, incorrect (due to errors at earlier decoding stages), or generic output.
A larger k means you consider more hypotheses; increasing k reduces some of the problems above, but a larger k is more computationally expensive.
Increasing k can also introduce other problems: e.g., in open-ended tasks like chit-chat dialogue, a large k can make the output more generic.

Sampling-based decoding
Pure sampling: at each step t, randomly sample from the probability distribution p(x_t | x_{1:t-1}) to obtain the next word. Like greedy decoding, but sample instead of taking the argmax.
Top-k sampling: at each step t, instead of sampling from the full p(x_t | x_{1:t-1}), restrict to the top-k most probable words (k = number of words). Like pure sampling, but with a truncated probability distribution. k = 1 is greedy search; k = V is pure sampling. Increase k for more diverse/risky output; decrease k for more generic/safe output.

Sampling-based decoding
Top-P sampling: at each step t, instead of sampling from the full p(x_t | x_{1:t-1}), restrict to the most probable words up to a total probability mass of P. Like top-k sampling, it truncates the probability distribution, but in a different way. P = 1 is pure sampling. Increase P for more diverse/risky output; decrease P for more generic/safe output.

Comparing different sampling strategies
The Curious Case of Neural Text Degeneration, Holtzman et al., 2020.

Decoding algorithms: summary
Greedy decoding is a simple method that usually gives low-quality output.
Beam search (especially with a large beam size) searches for high-probability output. It delivers better quality than greedy decoding, but if the beam size is too large, it can return high-probability yet unsuitable output (e.g., generic, short).
Sampling methods are a way to get more diversity and randomness; they are good for open-ended / creative generation (chit-chat, poetry, stories). Top-k and top-p sampling allow you to control diversity. (A minimal sampling sketch follows at the end of this transcript.)

Next: Sequence-to-sequence modeling
Sequence-to-sequence tasks
RNN/LSTM-based seq2seq models
Seq2seq with attention
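Referring back to the decoding section above, here is a minimal NumPy sketch of greedy search, top-k sampling, and top-P (nucleus) sampling applied to a single next-word distribution. The distribution, vocabulary size, and the k and P values are made up for illustration; a real decoder would apply one of these choices repeatedly, feeding each chosen word back into the model.

```python
import numpy as np

rng = np.random.default_rng(0)

def greedy(p):
    # Greedy search: always pick the single most probable next word.
    return int(np.argmax(p))

def top_k_sample(p, k):
    # Top-k sampling: keep the k most probable words, renormalize, then sample.
    # k = 1 recovers greedy search; k = len(p) recovers pure sampling.
    top = np.argsort(p)[-k:]
    q = p[top] / p[top].sum()
    return int(rng.choice(top, p=q))

def top_p_sample(p, P):
    # Top-P (nucleus) sampling: keep the smallest set of words whose total
    # probability mass reaches P, renormalize, then sample. P = 1 is pure sampling.
    order = np.argsort(p)[::-1]
    cutoff = int(np.searchsorted(np.cumsum(p[order]), P)) + 1
    nucleus = order[:cutoff]
    q = p[nucleus] / p[nucleus].sum()
    return int(rng.choice(nucleus, p=q))

# A made-up next-word distribution over a 6-word vocabulary.
p = np.array([0.42, 0.25, 0.15, 0.10, 0.05, 0.03])
print(greedy(p), top_k_sample(p, k=3), top_p_sample(p, P=0.9))
```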