Recurrent Neural Networks (RNNs)
Moustafa Youssef, The American University in Cairo, 2021
Summary
These are lecture notes for a course on Recurrent Neural Networks (RNNs). The presentation introduces RNNs, then covers RNN architectures (including GRUs, LSTMs, bidirectional RNNs, and deep RNNs) and practical aspects of using them in different domains, with examples to help the reader visualize the concepts.
Full Transcript
Recurrent Neural Networks
Moustafa Youssef

Acknowledgment
Many slides taken from Andrew Ng's Stanford course, Adriana Kovashka's UPenn course, and the Google Coursera course.

Contents
- Introduction
- Traditional RNNs
- GRU
- LSTM
- Bidirectional RNNs (BRNNs)
- Deep RNNs (DRNNs)

Recurrent Neural Networks
A family of networks for processing sequence data:
- Speech recognition: audio → "The quick brown fox jumped over the lazy dog"
- Sentiment classification: "There is nothing I like in this movie" → rating
- DNA sequence analysis: AGCCCCTGTGAGGAACTAG → labeled AGCCCCTGTGAGGAACTAG
- Machine translation: "Voulez-vous chanter avec moi?" → "Do you want to sing with me?"
- Video activity recognition: video frames → Running
- Named entity recognition: "Yesterday, Mohamed Ali met me in Alexandria" → names and places tagged in the same sentence
- Location tracking
- Handling sparse data
- ...

Recurrent Neural Networks
A family of networks for processing sequence data:
- Humans don't start their thinking from scratch every second (think of counting backward, or reciting the alphabet in reverse).
- Can scale to much longer sequences than other networks.
- Can process sequences of variable length.
- Share parameters across time, which makes it possible to extend the model to examples of different forms. Consider "I went to Nepal in 2009" and "In 2009, I went to Nepal" with the question: in what year did I go to Nepal? A regular feedforward network needs to learn the rule separately at each feature (here, word position); an RNN shares the same weights across time.
- Note: the time index doesn't need to reflect physical time.

Recurrent Neural Networks
- Traditional neural networks assume all inputs are independent of each other.
- To predict the next word in a sentence, it is better to know which words came before it.
- RNNs perform the same task for every element of a sequence, with the output dependent on the previous computations. They have a "memory" which captures information about what has been calculated so far.

Typical Recurrent Neural Networks
Parameter sharing through time. (Figure source: Nature)

Typical Recurrent Neural Networks
x_t is the input at time step t. For example, x_1 could be a one-hot vector corresponding to the second word of a sentence.

Typical Recurrent Neural Networks
s_t is the hidden state at time step t; it is the "memory" of the network. It is calculated from the previous hidden state and the input at the current step:
s_t = f(U x_t + W s_{t-1})
where f is usually a nonlinearity such as tanh or ReLU. The state s_{-1}, required to calculate the first hidden state, is typically initialized to all zeros.

Typical Recurrent Neural Networks
o_t is the output at step t. For example, if we wanted to predict the next word in a sentence, o_t would be a vector of probabilities across our vocabulary:
o_t = softmax(V s_t)

Note
Parameter tying in time: W, U, and V are the same in the same layer across all time steps.

Notes
- You can think of the hidden state s_t as the memory of the network: it captures information about what happened in all the previous time steps.
- The output o_t is calculated solely from the memory at time t.
- We don't need an input or an output at each step. For example, when predicting the sentiment of a sentence we only care about the final output.
- The main feature of an RNN is its hidden state, which captures some information about a sequence.
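To make the update and output equations concrete, here is a minimal NumPy sketch of a vanilla RNN forward pass over one sequence. It is an illustration only: the layer sizes, random weights, and function names are assumptions, not part of the original slides.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax
    e = np.exp(z - np.max(z))
    return e / e.sum()

def rnn_forward(xs, U, W, V):
    """Vanilla RNN forward pass over a sequence of input vectors.

    Implements s_t = tanh(U x_t + W s_{t-1}) and o_t = softmax(V s_t),
    with s_{-1} initialized to all zeros as in the slides. The same
    U, W, V are reused at every step (parameter sharing through time).
    """
    s = np.zeros(W.shape[0])            # s_{-1}: initial hidden state
    states, outputs = [], []
    for x in xs:                        # one time step per input vector
        s = np.tanh(U @ x + W @ s)      # new hidden state ("memory")
        outputs.append(softmax(V @ s))  # probabilities over the vocabulary
        states.append(s)
    return states, outputs

# Tiny example: vocabulary of 5 words, hidden state of size 8, 3 one-hot inputs.
rng = np.random.default_rng(0)
vocab_size, hidden_size = 5, 8
U = rng.normal(scale=0.1, size=(hidden_size, vocab_size))
W = rng.normal(scale=0.1, size=(hidden_size, hidden_size))
V = rng.normal(scale=0.1, size=(vocab_size, hidden_size))
xs = [np.eye(vocab_size)[i] for i in (0, 3, 1)]   # one-hot encoded "words"
states, outputs = rnn_forward(xs, U, W, V)
print(outputs[-1])                                # distribution over the next word
```

Note how the loop body applies the same U, W, and V at every step; only the hidden state s changes, which is exactly the parameter tying in time described above.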
Examples (Different Types of RNNs)
- One-to-one
- One-to-many
- Many-to-one
- Many-to-many (same length)
- Many-to-many (different length): encoder-decoder

Notes
- When training language models we set o_t = x_{t+1}: the target output at each step is the next word in the sequence.
- Problem with this model: it only uses the history.
  He said, "Teddy Roosevelt was a great president"
  He said, "Teddy bears are on sale"
  We need the future context to detect whether "Teddy" is a named entity or not.
- Solution: bi-directional RNNs (more later).

Notation
"I met Mohamed Ali in Alexandria"
For named entity recognition, each word of the input sequence gets a 0/1 output label (here 0 0 1 1 0 1, marking which words are part of a name or place).
x^(i)<t> is the t-th input of training sequence example i; y^(i)<t> is the corresponding t-th output for training example i.

Word Representation
Words are one-hot encoded against a vocabulary (a, an, ..., apple, ..., Zulu). The vocabulary size is typically 10K, 100K, or even a million, with a special unknown-word token for new words that are not in the vocabulary.

Feedforward (figure)

Activation Functions
- For the hidden activation a: tanh (most common in RNNs) or ReLU.
- For the output y: depends on the estimated quantity: sigmoid, softmax, etc.

Back Propagation Through Time (BPTT)
Image source: https://www.techleer.com/articles/185-backpropagation-through-time-recurrent-neural-network-training-technique/

Vanishing Gradient Problem
- With deep networks, the gradient decreases as it flows through the layers, so it is hard for a layer to affect weights far away from it.
- Exploding gradients: handled with gradient clipping (rescale gradient vectors if they are above a threshold).
- Vanishing gradients are harder to handle. One solution: gated RNNs.

Gated Recurrent Unit (GRU)
- A normal RNN unit vs. the GRU unit (figures).
- GRU unit (simplified): allows the network to learn very long-term dependencies and helps address the vanishing gradient problem.
- GRU unit (full): adds a new gate, the reset (relevance) gate.
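The GRU unit diagrams did not survive in this transcript. As a hedged sketch of what they describe, here is one GRU step in NumPy following the common formulation used in Andrew Ng's course (an update gate, a relevance/reset gate, and a candidate memory). The sizes, weight names, and random initialization are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x, c_prev, Wu, Wr, Wc, bu, br, bc):
    """One GRU step (full version, with update and relevance/reset gates).

    c_prev is the memory cell / hidden state carried over from the previous step.
    """
    concat = np.concatenate([c_prev, x])
    gamma_u = sigmoid(Wu @ concat + bu)   # update gate: how much to overwrite
    gamma_r = sigmoid(Wr @ concat + br)   # relevance (reset) gate
    c_tilde = np.tanh(Wc @ np.concatenate([gamma_r * c_prev, x]) + bc)  # candidate memory
    c = gamma_u * c_tilde + (1.0 - gamma_u) * c_prev  # keep old memory where gamma_u is near 0
    return c

# Tiny example with assumed sizes: input of size 5, memory cell of size 8.
rng = np.random.default_rng(1)
n_x, n_c = 5, 8
Wu = rng.normal(scale=0.1, size=(n_c, n_c + n_x))
Wr = rng.normal(scale=0.1, size=(n_c, n_c + n_x))
Wc = rng.normal(scale=0.1, size=(n_c, n_c + n_x))
bu = br = bc = np.zeros(n_c)
c = np.zeros(n_c)                                  # initial memory
for x in [np.eye(n_x)[i] for i in (0, 3, 1)]:      # a short one-hot sequence
    c = gru_step(x, c, Wu, Wr, Wc, bu, br, bc)
print(c)
```

Because the update gate can stay close to 0 for many steps, the memory cell is carried forward almost unchanged, which is how the GRU preserves long-term information and mitigates the vanishing gradient problem.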
LSTM
- The LSTM unit was introduced in 1997.
- Use the input or not? Pass the output to the next unit or not? Forget the data or write it back?

Long Short Term Memory
Adds more gates than the GRU:
- Update gate ("write")
- Forget gate ("erase")
- Output gate ("read")

Notes
- Each gate is controlled by a simple logistic regression network.
- The GRU is more recent than the LSTM! Pros and cons?

Bi-Directional RNNs (BRNNs)
- Allow one to process both past and future context:
  He said, "Teddy Roosevelt was a great president"
  He said, "Teddy bears are on sale"
- Forward and backward units.
- Acyclic graph: a feedforward pass, then the prediction.

Note
- Works with GRU and LSTM units too.
- Needs the full input before processing, so it cannot do real-time processing.

Deep RNNs (DRNNs)
Multiple layers of RNNs.

Notes
- Typically 2-3 layers, for computational efficiency.
- May be followed by a normal deep network at each output.
- Each block can be any type of RNN unit: GRU, LSTM, BRNN.

Regularization with RNNs
- L2 is OK.
- Dropout is OK, but on the input and output layers only, not on the recurrent connections.

Demo
Source: https://techcrunch.com/2019/02/17/openai-text-generator-dangerous
"OpenAI built a text generator so good, it's considered too dangerous to release"
When GPT-2 was tasked with writing a response to the prompt, "Recycling is good for the world, no, you could not be more wrong," the machine spat back:

"Recycling is NOT good for the world. It is bad for the environment, it is bad for our health, and it is bad for our economy. I'm not kidding. Recycling is not good for the environment. It is destructive to the earth and it is a major contributor to global warming. Recycling is not good for our health. It contributes to obesity and diseases like heart disease and cancer. Recycling is bad for our economy. It increases the cost of a product, and in turn, the price of everything that is made with that product. Recycling is not good for our nation. We pay a tremendous price for the privilege of having the world's most advanced and efficient recycling system. Recycling is a huge, colossal waste of time, energy, money, and resources."

GPT-2 was trained to predict the next word in a sample of 40 gigabytes of Internet text.
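As a small closing illustration of the "predict the next word" idea behind both the language-model slides and the GPT-2 demo, here is a hedged sketch of text generation with the vanilla RNN defined earlier: the output distribution o_t is sampled to pick the next word, which is fed back in as x_{t+1}. The toy vocabulary and untrained random weights are assumptions for illustration only, so the generated text is meaningless; the point is the generation loop.

```python
import numpy as np

def softmax(z):
    # Repeated from the earlier sketch so this example is self-contained
    e = np.exp(z - np.max(z))
    return e / e.sum()

rng = np.random.default_rng(2)
vocab = ["<s>", "I", "went", "to", "Nepal"]        # hypothetical toy vocabulary
vocab_size, hidden_size = len(vocab), 8
U = rng.normal(scale=0.1, size=(hidden_size, vocab_size))
W = rng.normal(scale=0.1, size=(hidden_size, hidden_size))
V = rng.normal(scale=0.1, size=(vocab_size, hidden_size))

s = np.zeros(hidden_size)                          # s_{-1}: all zeros
word_id = 0                                        # start from the "<s>" token
generated = []
for _ in range(5):                                 # generate 5 words
    x = np.eye(vocab_size)[word_id]                # one-hot input x_t
    s = np.tanh(U @ x + W @ s)                     # s_t = tanh(U x_t + W s_{t-1})
    o = softmax(V @ s)                             # o_t: distribution over the vocabulary
    word_id = int(rng.choice(vocab_size, p=o))     # sample the next word: o_t becomes x_{t+1}
    generated.append(vocab[word_id])
print(" ".join(generated))
```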