COMP9414 Artificial Intelligence 2024 T3 UNSW PDF

Summary

These notes cover weeks 8 and 9 of COMP9414 Artificial Intelligence at UNSW for 2024 T3. They focus on neural networks for natural language processing (NLP): word embeddings, recurrent networks, the Transformer architecture, encoder and decoder models (BERT and GPT), and applications such as retrieval-augmented generation. The document includes worked examples for each topic.

Full Transcript


"A single neuron in the brain is an incredibly complex machine that even today we don't understand." — Andrew Ng

COMP9414: Artificial Intelligence
Weeks 8 & 9: Natural Language Processing (NLP)
Lecturer: Dr. Aditya Joshi ([email protected])
Module: Neural Networks for NLP
Schedule: 2024 Term 3
All images from Wikimedia Commons, unless specified.

Module: Neural Networks for NLP
- Shallow networks to learn word representations
- Sequential networks to learn sentence representations
- Parallelised neural networks via attention
- Attention & Transformer
- Transformer-based models
- Applications and frontiers in NLP

Recap: Feedforward neural networks
- Neuron: input, weights, activation function
- Feedforward: hidden layers, expressive ability
- Learning: backpropagation

Part 0: Word Embeddings

Word Embeddings
- Dense vectors that represent words
- The vectors capture specific properties of words
- Can be used as feature vectors in a statistical classifier
- Example (word2vec): "I love the movie." -> [2.3, 4.5, ...] -> statistical/neural model -> 1; "I hate the movie." -> [3.6, 67.2, ...] -> statistical/neural model -> 0
- Word embeddings can be averaged to get sentence embeddings
- Effective for obtaining "structured" representations of unstructured (text) input
Schnabel, T., Labutov, I., Mimno, D. and Joachims, T., 2015. Evaluation methods for unsupervised word embeddings. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (pp. 298-307).

Let's think of a specific task with textual data
Sentence: "The boy eats rice"
- Task: "The boy ___ rice" -> eats; Output: 1
- Task: "The boy ___ rice" -> drinks; Output: 0
- Task: "The boy ___ rice" -> banana; Output: 0

Let's be more specific...
- P(w | c): predict the center word given the other (context) words — Continuous bag-of-words (CBOW)
- P(c | w): predict the other (context) words given the center word — Skip-gram

Neural network representation of word2vec
- The task is to learn to predict contextual words
- Learned through a shallow neural network (remember backpropagation?)
https://israelg99.github.io/2017-03-23-Word2Vec-Explained/

word2vec training (skip-gram)
- Notation: w_c = w(t) is the word as a center word; w_o = w(t+j) is the word as a context word
- Loss function (negative log-likelihood over context windows): J = -(1/T) Σ_t Σ_{-m ≤ j ≤ m, j ≠ 0} log P(w(t+j) | w(t)), where P(w_o | w_c) = exp(u_{w_o} · v_{w_c}) / Σ_{w ∈ V} exp(u_w · v_{w_c})
- Word vectors as center words (v) are used as the word embeddings

Interpretation of word embeddings
- Vector arithmetic: vector("king") - vector("man") + vector("woman") ≈ vector("queen")
- Vector similarity: cosine_similarity(vector("king"), vector("crown")) > cosine_similarity(vector("king"), vector("apple"))
- Usage: initialise text with embeddings for use in statistical or neural models
Demo time!
https://www.technologyreview.com/2015/09/17/166211/king-man-woman-queen-the-marvelous-mathematics-of-computational-linguistics/

Part 1: Sequential Neural Networks

Let's progress from words to longer text.
"For sale: Baby shoes. Never worn." (Trigger warning: morbid.)
https://en.wikipedia.org/wiki/For_sale:_baby_shoes,_never_worn

Probabilistic language models
- Compute the probability of a text: "I ate cereal for breakfast" vs. "I ate a dinosaur for breakfast"
- For a sentence w_1, w_2, ..., w_n, the likelihood is P(w_1, ..., w_n) = Π_i P(w_i | w_1, ..., w_{i-1})
- Example: P("The girl eats rice") = P("rice" | "The girl eats") · P("eats" | "The girl") · P("girl" | "The") · P("The" | <s>), where <s> is the start-of-sentence symbol
- How can these terms be computed?

From probabilities to recurrence
- Auto-regressive models: each word w_t is predicted from the words before it (w_1, ..., w_{t-1}). Why?
https://d2l.ai/chapter_recurrent-neural-networks/rnn.html

"Unrolling an RNN"
- At each timestep t, the input x_t and the previous hidden state h_{t-1} produce the new hidden state h_t, which in turn produces the output o_t

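To make the unrolled picture concrete, here is a minimal NumPy sketch of the recurrence just described. It is an illustrative sketch only, not the lecture's code: the weight names (W_xh, W_hh, W_hy) and the toy dimensions are assumptions.

```python
import numpy as np

# Minimal unrolled RNN forward pass (illustrative sketch):
#   h_t = tanh(W_xh @ x_t + W_hh @ h_{t-1} + b_h);   o_t = W_hy @ h_t + b_y
rng = np.random.default_rng(0)
input_dim, hidden_dim, output_dim, seq_len = 8, 16, 4, 5

# The same parameters are applied at every timestep: this sharing is the recurrence.
W_xh = rng.normal(scale=0.1, size=(hidden_dim, input_dim))
W_hh = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))
W_hy = rng.normal(scale=0.1, size=(output_dim, hidden_dim))
b_h = np.zeros(hidden_dim)
b_y = np.zeros(output_dim)

xs = rng.normal(size=(seq_len, input_dim))    # one input vector per timestep
h = np.zeros(hidden_dim)                      # initial hidden state h_0
outputs = []
for x_t in xs:
    h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)  # hidden state carries the sequence memory
    outputs.append(W_hy @ h + b_y)            # o_t is computed from h_t

print(np.array(outputs).shape)  # (5, 4): one output per timestep
```
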
What do 'hidden states' look like?
- For a batch of inputs (n sequence examples), the hidden state is kept per example, so at each timestep it is a matrix with one row per sequence in the batch

Recurrent Neural Network (RNN)
- Deep learning models that capture the dynamics of sequences via recurrent connections
- The input is a sequence; the hidden states store intermediate information about the sequence
- The same underlying parameters are applied at each step: recurrence
- Parameters: input-to-hidden, hidden-to-hidden and hidden-to-output weights, shared across timesteps

RNNs for different tasks
- POS tagging: "The dog barks" -> DT NN VB
- Next word prediction: "The dog barks" -> "dog barks ."

Training an RNN
- Forward pass: apply the transformations as required
- Backpropagation is non-trivial compared to a feedforward neural network. Why? Remember: parameters are shared across timesteps! How do we compute the gradient?

Backpropagation through time (BPTT)
- Since RNNs involve timesteps, backpropagation for RNNs is called backpropagation through time (BPTT)
- Backpropagate the gradient through the unrolled network
- Due to recurrence, weight updates must be summed across all places in the network where a parameter appears (in addition to summing across data points)

Issues with BPTT
- The gradient involves repeated multiplication of the same recurrent weight matrix across timesteps
- Vanishing gradient (for long sequences): the gradient decreases towards zero; the model underfits
- Exploding gradient: the gradient increases exponentially; the model overfits

Specialised RNNs: LSTM
- LSTM: Long Short-Term Memory (can we increase the distant memory of an RNN?)
- "Maintain a memory of the text and pass it on": gates decide what to collect in memory, what to forget, what to retrieve from memory, and what output to produce

Sequential networks for text representations
- Input: "the boy eats rice", one token per timestep, as one-hot or word2vec representations
- Target at each step: the next token ("boy eats rice $"), predicted via a softmax over the vocabulary
- Backpropagate the loss
Demo time!

Part 2: Transformer

"Attention is all you need" — my ex

Transformer
- A transducer that transforms a sentence into another: sequence-to-sequence (seq2seq)
- "eschewing recurrence and instead relying entirely on an attention mechanism to draw global dependencies between input and output"
Vaswani, A. et al. "Attention is all you need." Advances in Neural Information Processing Systems (2017).

Architecture
- Input sentence -> Encoder(s); incomplete output sentence -> Decoder(s) -> Transformer head -> probability over the vocabulary
- What is the decoder input when the input sentence is first passed?

Passing input through the encoder...
- "The boy drinks milk." -> encoder -> [[1.3, 2.3, 5.3, ...], ...]

... and then the decoder
Example: machine translation. Input: "The boy drinks milk"; output: "Il ragazzo beve il latte"
- X = encoder("The boy drinks milk")
- Given X and "" -> "Il"; given X and "Il" -> "ragazzo"; given X and "Il ragazzo" -> "beve"; given X and "Il ragazzo beve" -> "il"; given X and "Il ragazzo beve il" -> "latte"

Let's look deeper at the encoder...
- Tokenization: input text is tokenized into subwords, e.g. dehumanization -> de#, #human#, #isation. Algorithms: byte-pair encoding, WordPiece encoding, etc.
- Embeddings for tokens
- Position encoding: a static encoding for every position (i.e., the word at position k has the same encoding across all inputs); fixed or learnable (what does 'learnable' mean?)
- The token embedding and the position encoding are combined by element-wise addition

Let's look deeper at the encoder...
- Attention in the Transformer is modeled as (Key, Query, Value) triples: a query is compared against the keys, and the corresponding values are combined accordingly
- Multi-head: several parallel computations of attention

Attention
- Intuition: instead of the last state influencing the next state, can information from ALL previous states be combined as required?
- Each token has a key-value pair (key: representation; value: "information")
- The query is the token with respect to which attention is computed

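A minimal NumPy sketch of the key/query/value computation described above, in the scaled dot-product form used by Vaswani et al.; the projection matrices, toy sentence length and dimensions are illustrative assumptions, not the lecture's code.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Combine the values V, weighted by how well each query matches each key."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                   # (n_queries, n_keys) similarity scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # softmax over the keys, per query
    return weights @ V                                # each output is a weighted sum of values

# Toy example: 4 tokens, model dimension 8 (illustrative numbers).
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))                           # one representation per token
W_q, W_k, W_v = (rng.normal(size=(8, 8)) for _ in range(3))
out = scaled_dot_product_attention(X @ W_q, X @ W_k, X @ W_v)
print(out.shape)  # (4, 8): one attended representation per token
```

Note that the scores matrix has one entry per (query, key) pair, so attention over a sentence of n tokens costs O(n^2) time and memory — the complexity challenge raised on the next slide.
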
When applied to text...
- Self-attention: attention for a token with respect to all other tokens in the sentence
- Instead of recurrence, where one hidden state leads to the next, the Transformer uses attention to compute the relative importance of a token w.r.t. all tokens
- Challenge: what is the time complexity of the attention computation?

Let's look deeper at the decoder...
- Masked attention: attend only to the words in the sentence generated so far
  Input: "What is the capital of Australia?"; output so far: "The capital of"
- Multi-head: several parallel computations of attention

Let's look deeper at the decoder...
- Masked attention: attend only to the words in the output so far ("The capital of")
- Multi-head attention with input from the encoder (cross-attention): the decoder attends to the encoded input ("What is the capital of Australia?") while generating the output ("The capital of")
- Multi-head: several parallel computations of attention

Recap: Components of the Transformer
- Encoder: a stack of encoders receives the input sentence and encodes it into a hidden representation
- Decoder: a stack of decoders receives the hidden representation from the encoder AND the text generated so far, and produces the next token in the sequence
- Tokenizer: represents a sentence as a sequence of tokens
- Positional encoding: injects sequence information into the Transformer — a key reason why it can get rid of recurrence
- Transformer head: attaches the prediction formulation used to optimise the model

Recap: Pseudocode
- Encoder and decoder pseudocode; Transformer training
All pseudocode from: Phuong et al., Formal Algorithms for Transformers, https://arxiv.org/abs/2207.09238

Part 3: Encoder & Decoder Models

How can the Transformer be used for NLP tasks?
- Versatility of the Transformer: Transformer-based models

How can Transformers really be used? Language models
- Encoder models: use the encoder of the Transformer; the current word is estimated from its neighbouring words
- Decoder models: use the decoder of the Transformer; the current word is estimated from the previous words
- *Encoder-decoder models such as XLNet and BART have also been shown to be effective

Typical steps in Transformer-based models
- Pre-trained model: the basis; trained on large unlabeled corpora
- Fine-tuned model: task-specific; trained on specific labeled corpora via weight updates (remember backpropagation?)
- Pipeline: large unlabeled corpus -> pre-training -> pre-trained model -> fine-tuning -> fine-tuned model

Encoder models
- Example: Bidirectional Encoder Representations from Transformers (BERT)
- Several variants: RoBERTa, Longformer, etc. — look up HuggingFace
- Pre-trained on self-supervised data. What is self-supervision? Supervision created from unlabeled data to learn representations

Pre-training encoder models
BERT uses two pre-training objectives:
1. Masked language modeling (MLM): word prediction — learn to predict missing words in a sentence ("He entered the ____." -> "room"; "He saw his ex sitting at the ___." -> "table")
2. Next sentence prediction (NSP): classification — learn to predict whether a sentence follows another ("He entered the room." / "He saw his ex sitting at the table." -> 1)

Special tokens
BERT uses (at least) three special tokens:
- [CLS] — position: beginning of a pair of sentences; purpose: its representation is used to learn NSP
- [SEP] — position: between the pair; purpose: indicates a new sentence
- [MASK] — position: masked words; purpose: its representation is used to learn MLM
Example input: "[CLS] He entered the [MASK]. [SEP] He saw his ex sitting at the [MASK]."
What do you think the [PAD] token does?

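A toy sketch of how such a training example could be assembled from a sentence pair, using the special tokens above. This is a simplified assumption, not BERT's actual preprocessing (which, among other things, replaces some selected tokens with random words or leaves them unchanged); the function name and the 15% masking rate are illustrative.

```python
import random

SPECIAL = {"[CLS]", "[SEP]"}

def make_mlm_example(sentence_a, sentence_b, mask_prob=0.15, seed=0):
    """Build a masked-language-modeling example from a sentence pair (toy version)."""
    random.seed(seed)
    # [CLS] marks the start of the pair (used for NSP); [SEP] separates the two sentences.
    tokens = ["[CLS]"] + sentence_a.split() + ["[SEP]"] + sentence_b.split()
    masked, labels = [], []          # labels hold the original token at masked positions
    for tok in tokens:
        if tok not in SPECIAL and random.random() < mask_prob:
            masked.append("[MASK]")  # the model must recover the original token here
            labels.append(tok)
        else:
            masked.append(tok)
            labels.append(None)
    return masked, labels

tokens, labels = make_mlm_example("He entered the room .",
                                  "He saw his ex sitting at the table .")
print(" ".join(tokens))
print([t for t in labels if t is not None])
```
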
Decoder models
- Example: Generative Pre-trained Transformers (GPT)
- Other variants: Open Pre-trained Transformer (OPT), LLaMA, ...
- Pre-trained on self-supervised data, again. What is the task this time?
https://jalammar.github.io/illustrated-gpt2/

Pre-training decoder models
- Next word prediction: "He entered the ____." -> "room"; "He saw his ex sitting at the ___." -> "table"

Fine-tuning encoder models

Fine-tuning decoder models...?
- Shown to be effective as-is: prompting, in-context learning, etc. (remember using ChatGPT?)
- Can be adapted: instruction tuning, etc.
https://arxiv.org/pdf/2308.10792

Prompting
- Utilise the text-completion ability of decoder-based models
- The question and the input are incorporated into a 'prompt': "What is the sentiment of the following sentence? Sentence: I enjoyed this movie. Answer:"
- Chain-of-thought prompting: decompose a complex problem into a sequence of simpler steps
Demo time!

In-context learning
- A paradigm that allows language models to learn tasks given only a few examples in the form of demonstrations
- The input x is wrapped in a natural-language construction of the examples and an instruction
https://arxiv.org/pdf/2301.00234

Enhancing language models for better in-context learning
- Exercise: what does the argmax equation for pre-training look like?

Instruction fine-tuning
- Convert tasks to an "instruction: input -> output" format
- Fine-tune a decoder model to learn to complete the output given the instruction and the input

Optimizing fine-tuning
- Modern large language models have a massive number of parameters
- Reduce the number of weight updates: for example, freeze certain layers, or store only the weight updates
- Additional reading (not in the syllabus): LoRA (Low-Rank Adaptation)

Part 4: Applications & Frontiers

Key challenge with Transformer-based models: Hallucination
- Hallucination: undesirable generation that results in an output that is either nonsensical or unfaithful to the provided source input
- Source document: "The first vaccine for Ebola was approved by the FDA in 2019 in the US, five years after the initial outbreak in 2014. To produce the vaccine, scientists had to sequence the DNA, then identify possible vaccines, and finally show successful clinical trials. Scientists say a vaccine for COVID-19 is unlikely to be ready this year, although clinical trials have already started."
- Hallucinated outputs: "The first Ebola vaccine was approved in 2021." / "China has already started clinical trials of the COVID-19 vaccine."
- Can be reduced using 'retrieval-augmented' generation (RAG)

Retrieval-augmented generation (RAG)
- Retrieve a set of related documents
- Use the documents to generate the response

Mathematical formulation
- Let x be the question, y the response, and z the top-k documents relevant to x.

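The slide's equation is not reproduced in the transcript. One standard way to write this formulation, following Lewis et al. (2020), which the notation above appears to follow, is the marginalisation below; treat the exact form as a reconstruction rather than the lecture's own equation.

```latex
% Assumed RAG-Sequence style formulation (after Lewis et al., 2020):
% the retriever p_eta scores documents z given the question x, and the generator
% p_theta produces the response y token by token, conditioned on x, the retrieved
% document z, and the tokens generated so far.
p(y \mid x) \;\approx\;
  \sum_{z \,\in\, \mathrm{top\text{-}}k\!\left(p_{\eta}(\cdot \mid x)\right)}
  p_{\eta}(z \mid x)\;
  \prod_{i=1}^{|y|} p_{\theta}\!\left(y_i \mid x,\, z,\, y_{1:i-1}\right)
```
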
NLP & Cybersecurity
Motlagh, Farzad Nourmohammadzadeh, et al. "Large Language Models in Cybersecurity: State-of-the-Art." arXiv preprint arXiv:2402.00891 (2024).

NLP & Mobility
Xue, Hao, et al. "Prompt Mining for Language-based Human Mobility Forecasting." arXiv preprint arXiv:2403.03544 (2024).

NLP & Public Health
Olawade DB, Wada OJ, David-Olawade AC, Kunonga E, Abaire O and Ling J (2023). Using artificial intelligence to improve public health: a narrative review. Front. Public Health 11:1196397. doi: 10.3389/fpubh.2023.1196397

Summary
- Word Embeddings — Key ideas: shallow neural networks to learn representations; Demo: word2vec
- Sequential neural networks — Key ideas: recurrent neural networks and derivatives to accommodate the sequential nature of text; Demo: LSTM-based classifier
- Transformer — Key ideas: attention, architectural details; Demo: -
- Transformer-based models — Key ideas: encoder and decoder models, pre-training and fine-tuning; Demo: fine-tuning encoder & decoder models
- Applications and Frontiers — Key ideas: retrieval-augmented generation, applications to cybersecurity, public health, etc.; Demo: -

Where will the Generative AI revolution head next?
https://techcommunity.microsoft.com/t5/educator-developer-blog/understanding-the-difference-in-using-different-large-language/ba-p/3919444

Interested in NLP? Take a look at COMP6713.
Note: Lecture by Wayne next Monday.
Thank you!
