BT4222 Mining Web Data for Business Insights
Topic 6. Recurrent Neural Network and Transformer
Dr. Qiuhong Wang, National University of Singapore, AY 2024-25 Term 1

Recap: Word Embedding (Word2Vec, Skip-gram)

Example input: "Mining web data for business insights is a useful course"

In word embeddings like Word2Vec, the window size refers to the number of words on either side of the central word that are considered its context. For the centre word $x_j$, the model predicts the context probabilities $p(x_{j-2}|x_j)$, $p(x_{j-1}|x_j)$, $p(x_{j+1}|x_j)$, $p(x_{j+2}|x_j)$.

Forward pass for one centre word (a minimal code sketch follows at the end of this part):
- $x_j$ is a one-hot vector of size $V$, e.g. $(\ldots, 0, 0, 1, 0, 0, \ldots)$; looking it up in the embedding matrix gives the embedding vector $h_j$ of size $N$.
- Scores over the vocabulary come from the context matrix $W'$: $U_j = h_j W' = (u_{1j}, u_{2j}, u_{3j}, \ldots, u_{Vj})$.
- The softmax turns the scores into a predicted distribution: $\hat{y}_j = \left( \frac{e^{u_{1j}}}{\sum_{i=1}^{V} e^{u_{ij}}}, \ldots, \frac{e^{u_{Vj}}}{\sum_{i=1}^{V} e^{u_{ij}}} \right) = (\hat{y}_{1j}, \hat{y}_{2j}, \hat{y}_{3j}, \ldots, \hat{y}_{Vj})$.

Cross entropy for each context position: the target is the one-hot vector of the true context word, so the loss reduces to the negative log of its predicted probability, e.g. for position $j-2$:
$-\sum_{i \in V} p(x_i|x_j) \log \hat{p}(x_i|x_j) = -p(x_{j-2}|x_j)\log \hat{p}(x_{j-2}|x_j) = -\log \hat{y}_{j-2,j}$
and similarly $-\log \hat{y}_{j-1,j}$, $-\log \hat{y}_{j+1,j}$, $-\log \hat{y}_{j+2,j}$ for the other positions.

Total loss for the centre word:
$H(\hat{y}_j, y_j) = -\sum_{i \in C} \log \hat{p}(x_i|x_j), \quad C = \{j-2, j-1, j+1, j+2\}$

Recap: Item and User Embeddings in Collaborative Filtering

Each user $u$ is represented by a latent vector $p_u$ (a row of the latent user matrix $P_{N \times K}$) and each item $i$ by a latent vector $q_i$ (a row of the latent item matrix $Q_{M \times K}$). Their interaction produces the prediction $\hat{y}_{ui}$, which is trained against the observed entry $y_{ui}$ of the user-to-item affinity matrix $Y_{N \times M}$, for example (0 = no observed interaction):

      i1  i2  i3  i4  i5  i6  …  iM
u1     5   4   0   5   0   0  …   0
u2     0   1   4   0   2   2  …   3
u3     4   2   5   0   4   0  …   4
u4     0   0   0   0   4   3  …   0
u5     0   2   4   0   0   5  …   0
⋮      ⋮   ⋮   ⋮   ⋮   ⋮   ⋮  …   ⋮
uN     1   5   3   5   3   5  …   2

Recap: Matrix Factorization + Neural Collaborative Filtering = Neural Matrix Factorization (NeuMF)

- User $u$ and item $i$ enter as one-hot (id) vectors and are mapped to two separate sets of embeddings: MF user/item embeddings and MLP user/item embeddings.
- GMF layer: the MF user and item embeddings are combined by a (generalized) dot product.
- MLP path: the MLP user and item embeddings are concatenated and passed through MLP layers 1 … J with activations.
- NeuMF layer(s): the two paths are concatenated and a final activation produces $\hat{y}_{ui}$, trained against $y_{ui}$.
- Reference: https://doi.org/10.48550/arXiv.1708.05031 (a hedged Keras sketch of this architecture also follows at the end of this part).

Agenda
- RNN
- RNN Language Model
- Stacked RNN
- RNN: Vanishing / Exploding Gradient and Long-Term Dependency Problem
- Long Short-Term Memory (LSTM)
- Attention
- Transformer
  ○ Self-Attention
  ○ Transformer Blocks
  ○ Multi-head Attention
  ○ Positional Embeddings
  ○ Transformers as Language Models

Assignment 2 (10 marks)
- Three tasks on LSTM, DNN and CNN
- Due by October 10
- Suggestion: complete the LSTM and DNN tasks by the recess week, to lighten your workload on the project and midterm after the recess week.
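Referring back to the skip-gram recap above, here is a minimal NumPy sketch of the forward pass and cross-entropy loss for one centre word. It is an illustration rather than the lecture's own code; the toy sizes and the names W (embedding matrix) and W_prime (context matrix $W'$) are assumptions.

```python
import numpy as np

V, N = 10, 4          # toy vocabulary size and embedding dimension (assumed)
rng = np.random.default_rng(0)
W = rng.normal(size=(V, N))        # input embedding matrix (one row per word)
W_prime = rng.normal(size=(N, V))  # context matrix W'

def skipgram_loss(j, context_ids):
    """Cross-entropy loss H(y_hat_j, y_j) for centre word j and its context C."""
    h_j = W[j]                                 # embedding vector of x_j (one-hot lookup)
    u_j = h_j @ W_prime                        # scores u_1j ... u_Vj
    y_hat = np.exp(u_j) / np.exp(u_j).sum()    # softmax over the vocabulary
    # H = - sum_{i in C} log p_hat(x_i | x_j)
    return -np.sum(np.log(y_hat[context_ids]))

# centre word j with context C = {j-2, j-1, j+1, j+2}
print(skipgram_loss(5, [3, 4, 6, 7]))
```

Minimising this loss over the corpus pushes up the predicted probability of the true context words of each centre word, which is exactly the cross-entropy objective on the recap slide.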
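Likewise, the NeuMF recap can be sketched with the Keras functional API. This is a hedged sketch under assumed toy sizes, not the paper's reference implementation: following He et al. (arXiv:1708.05031), the GMF path is written as an element-wise product of the MF embeddings (the "generalized" dot product), and the embedding dimension and MLP widths are arbitrary.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

num_users, num_items, k = 1000, 500, 16   # toy sizes (assumed)

u_in = layers.Input(shape=(1,), dtype="int32", name="user_id")
i_in = layers.Input(shape=(1,), dtype="int32", name="item_id")

# Two separate sets of embeddings, as in the NeuMF diagram
mf_u  = layers.Flatten()(layers.Embedding(num_users, k)(u_in))
mf_i  = layers.Flatten()(layers.Embedding(num_items, k)(i_in))
mlp_u = layers.Flatten()(layers.Embedding(num_users, k)(u_in))
mlp_i = layers.Flatten()(layers.Embedding(num_items, k)(i_in))

# GMF path: element-wise product of the MF embeddings
gmf = layers.Multiply()([mf_u, mf_i])

# MLP path: concatenate the MLP embeddings and pass through MLP layers 1..J
mlp = layers.Concatenate()([mlp_u, mlp_i])
mlp = layers.Dense(32, activation="relu")(mlp)
mlp = layers.Dense(16, activation="relu")(mlp)

# NeuMF layer: concatenate both paths; final activation gives y_hat_ui
out = layers.Dense(1, activation="sigmoid")(layers.Concatenate()([gmf, mlp]))

model = Model([u_in, i_in], out)
model.compile(optimizer="adam", loss="binary_crossentropy")
model.summary()
```

Keeping the MF and MLP embeddings separate is the point of the architecture: the GMF path captures the linear user-item affinity of matrix factorization, while the MLP path can learn non-linear interactions.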
Reading Material
- https://web.stanford.edu/~jurafsky/slp3/9.pdf
- https://web.stanford.edu/~jurafsky/slp3/10.pdf
- https://web.stanford.edu/~jurafsky/slp3/11.pdf

Examples
- BT4222_Sentiment_Analysis_LSTM.ipynb: a very naive tokenizer, but the focus is on how to compile data into an LSTM model and on a model structure containing an LSTM layer and hidden states.
- BT4222_Time_Series_Forecasting_LSTM.ipynb: an application of LSTM to time series data; get familiar with input tensors of dimensions (batch_size, sequence_length, num_features).
- BT4222 Sentiment Analysis-NER-BERT.ipynb: how to use a pretrained BERT model for NER and fine-tune it for a classification task.

Motivating example: "The concert was boring for the first 20 minutes while the band warmed up but then was terribly exciting."

Introduction
- Recurrent neural networks (RNNs) are a family of neural networks (typically) for processing sequential data, e.g., time series, text, audio.
- A recurrent neural network looks very much like a feedforward neural network, except it also has connections pointing backward.
- The most effective sequence models used in practical applications are gated RNNs, e.g., long short-term memory (LSTM).

Recurrent Neural Network
- The simplest possible RNN is composed of one neuron receiving inputs, producing an output, and sending that output back to itself.
- More generally, one RNN layer consists of multiple neurons receiving inputs; every neuron receives both the input vector x(t) and the output vector from the previous time step y(t-1).
- Source: Hands-On Machine Learning with Scikit-Learn & TensorFlow by Aurelien Geron

RNN: State Update and Output
- Memory cell: the practical building component of an RNN, which preserves some state across time steps.
- A cell preserves state $h_{t-1}$, takes input $x_t$, then produces output $y_t$ and updates the preserved state to $h_t$.
- Output vector (softmax function $f$): $y_t = f(V h_t + b_y)$
- Updated hidden state (tanh function $g$): $h_t = g(W x_t + U h_{t-1} + b_h)$

Forward pass (a NumPy sketch follows at the end of this part):

    function FORWARDRNN(x, network) returns output sequence y
        h_0 ← 0
        for t ← 1 to LENGTH(x) do
            h_t ← g(W x_t + U h_{t-1})
            y_t ← f(V h_t)
        return y

with $W \in \mathbb{R}^{d_h \times d_{in}}$, $U \in \mathbb{R}^{d_h \times d_h}$, $V \in \mathbb{R}^{d_{out} \times d_h}$, where $d_{in}$ is the input dimension, $d_h$ the hidden dimension and $d_{out}$ the output dimension.

A simple recurrent neural network shown unrolled in time: network layers are recalculated for each time step, while the weights U, V and W are shared across all time steps.

The total loss $L$ sums the per-time-step losses $L_0, L_1, L_2, \ldots, L_t$ computed from the outputs $y_0, y_1, y_2, \ldots, y_t$ of the unrolled network over the inputs $x_0, x_1, x_2, \ldots, x_t$.

RNN Language Model
- An RNN language model is like a feedforward neural language model moving through a text.
- Input: a sequence X = [x1 ... xt ...], each word represented as a one-hot vector of size $|V| \times 1$ and mapped through the input embedding matrix $E \in \mathbb{R}^{d_h \times |V|}$.
- Output: y, a vector representing a probability distribution over the vocabulary (via softmax).

RNN Language Model: Self-Supervision and Teacher Forcing
- Self-supervision: the training data is a corpus of text, and at each time step t the model predicts the next word.
- Teacher forcing: the model is given the correct history sequence to predict the next word, rather than being fed its own prediction from the previous time step.
- With teacher forcing, the target $y_t = (0, \cdots, 1, \cdots, 0)$ is the one-hot vector of size $|V|$ marking the correct next word, so the cross-entropy loss $L_{CE}(\hat{y}_t, y_t) = -y_t \cdot \log \hat{y}_t$ reduces to $-\log \hat{y}_t$ at the index of the correct word.

RNN Language Model: Sequence Classification
- The RNN reads the input and produces hidden states $h_0, h_1, h_2, \ldots, h_n$; the resulting representation (the final hidden state $h_n$ in the slide's figure) is passed to a feedforward network (FNN) and a softmax to classify the whole sequence.

Stacked RNN
- In a stacked RNN, several RNN layers (1, 2, 3, …) are placed on top of each other, with the output sequence of one layer serving as the input sequence to the next.
- Why do stacked RNNs generally outperform single-layer networks? The network induces representations at differing levels of abstraction across layers. (A hedged Keras sketch of a stacked LSTM classifier follows below.)
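As flagged after the FORWARDRNN pseudocode above, here is a minimal NumPy sketch of that forward pass, taking g = tanh and f = softmax as in the state-update and output equations. The toy dimensions and random weights are assumptions for illustration.

```python
import numpy as np

d_in, d_h, d_out = 3, 5, 4   # toy input, hidden and output dimensions (assumed)
rng = np.random.default_rng(1)
W = rng.normal(size=(d_h, d_in))    # input-to-hidden weights
U = rng.normal(size=(d_h, d_h))     # hidden-to-hidden weights
V = rng.normal(size=(d_out, d_h))   # hidden-to-output weights

def softmax(z):
    z = z - z.max()
    return np.exp(z) / np.exp(z).sum()

def forward_rnn(x_seq):
    """h_t = tanh(W x_t + U h_{t-1}); y_t = softmax(V h_t); weights shared across t."""
    h = np.zeros(d_h)                # h_0 = 0
    ys = []
    for x_t in x_seq:
        h = np.tanh(W @ x_t + U @ h)
        ys.append(softmax(V @ h))
    return ys

# a sequence of 6 input vectors, shape (sequence_length, d_in)
outputs = forward_rnn(rng.normal(size=(6, d_in)))
print(len(outputs), outputs[0].shape)   # 6 time steps, each y_t of size d_out
```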
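For the sequence-classification and stacked-RNN slides, a hedged Keras sketch in the spirit of BT4222_Sentiment_Analysis_LSTM.ipynb (not the notebook's actual code): after the Embedding layer the tensors have shape (batch_size, sequence_length, num_features), stacking is done with return_sequences=True, and the final hidden state feeds a small feedforward classifier. The vocabulary size, sequence length and layer widths are assumptions.

```python
import tensorflow as tf
from tensorflow.keras import layers

vocab_size, seq_len, embed_dim = 5000, 100, 64   # toy settings (assumed)

model = tf.keras.Sequential([
    tf.keras.Input(shape=(seq_len,), dtype="int32"),   # padded word-id sequences
    # maps (batch_size, seq_len) ids to (batch_size, seq_len, embed_dim)
    layers.Embedding(vocab_size, embed_dim),
    # stacked RNN: the first LSTM returns its full hidden-state sequence
    layers.LSTM(64, return_sequences=True),
    # the second LSTM returns only its final hidden state h_n
    layers.LSTM(32),
    # feedforward layer + sigmoid for binary sentiment classification
    layers.Dense(16, activation="relu"),
    layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()
```

Dropping return_sequences from the first LSTM would collapse the stack to a single recurrent layer; keeping it passes one layer's full output sequence as the next layer's input, which is exactly what "stacked RNN" means on the slide.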
RNN: Vanishing / Exploding Gradient and Long-Term Dependency Problem

In the unrolled network with losses $L_0, L_1, L_2, \ldots, L_t$, computing the gradient of a late loss with respect to the early hidden state $h_0$ involves many repeated factors of $U$. When many of these values are > 1, the gradients explode; when many of them are < 1, the gradients vanish.
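A small NumPy experiment (an illustration, not from the slides) showing why those repeated factors of U matter: back-propagating through many time steps multiplies a gradient vector by U over and over, so its norm collapses toward zero or blows up depending on whether U's dominant scale is below or above 1. The matrix construction and scales are assumptions.

```python
import numpy as np

rng = np.random.default_rng(42)
d_h, steps = 10, 50
g = rng.normal(size=d_h)   # a gradient vector arriving from a late time step

for scale in (0.5, 1.5):   # dominant scale of U below vs above 1 (assumed)
    U = scale * np.eye(d_h) + 0.01 * rng.normal(size=(d_h, d_h))
    v = g.copy()
    for _ in range(steps):
        v = U.T @ v        # each earlier time step adds another factor of U
    print(f"scale={scale}: after {steps} steps, gradient norm = {np.linalg.norm(v):.3e}")
# scale 0.5 -> norm shrinks toward 0 (vanishing); scale 1.5 -> norm explodes
```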