School of CIT – Social Computing Research Group
Advanced Natural Language Processing (CIT4230002)
Prof. Dr. Georg Groh, Carolin Schuster, M.Sc.

Lecture 2.1 Embeddings: Analysis & Applications

Lecture Outline
- Quick recap (NLP1) on semantics and Word2Vec
- Overview of embeddings: types, levels, training tasks
- Methods for analyzing embeddings
- Applications => focus on topic modeling

Embeddings: Recap NLP1 & Overview

Recap NLP1 | Lexical Semantics
The study of word meaning.
- Word relatedness / association. Semantic field: a set of semantically related items, e.g. mammals: rodents, bats, primates, etc.
- Distinction between word senses. WordNet examples for "bat":
  - S: (n) bat, chiropteran (nocturnal mouselike mammal with forelimbs modified to form membranous wings and anatomical adaptations for echolocation by which they navigate)
  - S: (n) bat (a club used for hitting a ball in various games)
- Relations between word senses: synonymy vs. antonymy, hyponymy vs. hypernymy
- Affective meaning of words (connotation): sentiment, opinions, evaluations

Recap NLP1 | Distributional Semantics
"You shall know a word by the company it keeps" (Firth, 1957): words that occur in similar contexts tend to have similar meanings.
- Distributional semantics: the meaning of a word is computed from the distribution of words around it
- Word embedding: a vector representation of a word; semantic similarity of words => similarity of word vectors
- Simplest version: rows of a word co-occurrence matrix as vectors

          the  bat  is  flying  bird  bites  sings
  bat      2    2   1     1      0     1      0
  bird     2    0   1     1      2     0      1

Recap NLP1 | Word2Vec
Word2Vec trains a shallow neural network to predict the current word from surrounding words (CBOW) or to predict context words from the current word (Skip-gram). The word embeddings are the trained network weights.

Recap NLP1 | Contextual Embeddings
Contextual embeddings from LLMs: deeper neural networks produce contextualized representations, i.e. the meaning of a word in its context. Often a transformer encoder, e.g. BERT. Word embedding: the network's representation of a word in a specific context.

Static vs. Contextual Embeddings
Static embeddings: Word2Vec, GloVe, FastText, etc. Contextual embeddings: BERT, ELMo, DeBERTa, etc.
[Figure: a static model assigns "bat" the identical vector (e.g. 0.4 2.1 0.3 7.5 7.5) in "A bat flies by.", "The bat bites.", and "Have you seen the bat?"]

Static vs. Contextual Embeddings
Static embeddings:
- Simple representation: one vector per word
- Context not considered, no representation of polysemous words
- Fast training, does not require as much data
- Can be trained from scratch
Contextual embeddings:
- More expressive representation of language: unlimited possible representations for each word
- Represent words in their contexts, distinguish word senses
- Training requires huge amounts of data and computing resources; finetuning for adaptation
BUT: pre-trained models introduce their own biases => even more challenges regarding interpretability.

Subword Embeddings
The current paradigm is subword tokenization.
- Byte-Pair Encoding (BPE): iteratively merge the most frequent pair of tokens until the desired vocabulary size is reached (see the sketch below)
- Algorithms such as BPE depend on a training dataset: the resulting tokenization may not be optimal for every application and domain, and important words might be split up!
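A toy sketch of the BPE merge loop just described. The corpus, the number of merges, and the "</w>" end-of-word marker are illustrative choices; production tokenizers (e.g. the HuggingFace tokenizers library) add byte-level handling, special tokens, and efficiency tricks omitted here.

```python
from collections import Counter

corpus = ["low", "lower", "lowest", "newer", "wider"]
NUM_MERGES = 10  # in practice: merge until a target vocabulary size is reached

# Represent each word as a sequence of characters plus an end-of-word marker.
words = Counter(tuple(w) + ("</w>",) for w in corpus)

merges = []
for _ in range(NUM_MERGES):
    # Count adjacent symbol pairs, weighted by word frequency.
    pairs = Counter()
    for word, freq in words.items():
        for a, b in zip(word, word[1:]):
            pairs[a, b] += freq
    if not pairs:
        break
    best = max(pairs, key=pairs.get)  # merge the most frequent pair
    merges.append(best)
    new_words = Counter()
    for word, freq in words.items():
        out, i = [], 0
        while i < len(word):
            if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                out.append(word[i] + word[i + 1])  # apply the merge
                i += 2
            else:
                out.append(word[i])
                i += 1
        new_words[tuple(out)] += freq
    words = new_words

print("learned merges:", merges)
print("resulting segmentations:", sorted(words))
```

Because the merges are learned from this tiny corpus, a word outside it (e.g. a domain-specific term) would fall back to many small pieces, which is exactly the "important words might be split up" caveat above.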
Pooling Token Embeddings
Strategies for higher-order embeddings:
- Embeddings of special tokens, e.g. [CLS] for BERT (sentence representations)
- Averaging sub-embeddings => the best simple strategy for obtaining a full word representation from subwords
- Finetuning with an extended domain-specific vocabulary: new word embeddings can be initialized from similar words or by averaging subtoken embeddings
Another dimension of pooling concerns the layers: e.g. we might average the embedding over the last four layers (common for semantic shift analysis).

Training Tasks for Contextual Embeddings
Token-level training (characters/subwords/words):
- Language modeling (only one-directional context)
- Masked language modeling
- Replaced token detection
- Whole word masking
- Span boundary objective (SpanBERT)
Training representations for sequences of tokens (phrases, sentences, documents):
- Cosine similarity
- Contrastive learning, triplet loss
- SimCSE
- PromptBERT
- Sequential denoising auto-encoder (TSDAE)

Example: Sentence-BERT Training
- Cosine similarity loss
- Triplet loss (contrastive learning), with Euclidean distance ||.||: minimize max(||s_a - s_p|| - ||s_a - s_n|| + ε, 0), where s_a is the anchor sentence embedding, s_p a positive sentence embedding, s_n a negative sentence embedding, and ε a margin

Specialized and Extended Contextual Embeddings
- Domain-specific: BERTweet, SciBERT, PatentBERT, BioBERT
- Monolingual: CamemBERT, FinBERT, TiBERT, IndoBERT, …
- Multilingual: mBERT, XLM-RoBERTa, multilingual SBERT, multilingual Universal Sentence Encoder
- Multi-modal: SpeechBERT, CLIP, BLIP
- Knowledge-enhanced: e.g. KnowBERT with WordNet and Wikipedia

Adapting Contextual Embeddings
Adaptation with training:
- Continued pre-training on in-domain data
- Finetuning with task-specific data, e.g. STS, NLI
- Instruction tuning (INSTRUCTOR): embeddings based on both the text input and a task description, e.g. "Represent the review comment for classifying the emotion as positive or negative"
Adaptation without any training

Embedding Benchmark
MTEB: Massive Text Embedding Benchmark
- 8 tasks, 56 datasets, 117 languages (bi-text mining)
- Sentence- and paragraph-level tasks
- Limitations: task and language imbalance, no word-level tasks
MTEB gives a general idea of embedding capability, but there is no one-size-fits-all solution.

Analysis of Contextual Embeddings

Why analyze contextual embeddings?
- What did the neural networks learn? What information is encoded, about language and about the world?
- Is there any harmful information encoded?
- What can the embeddings be used for?

How to analyze contextual embeddings?
- Dimensionality reduction
- Clustering
- "Probing" with classifiers
- Inspiration from the social sciences: association tests, semantic differentials

Dimensionality Reduction
PCA on BERT sentence embeddings reveals a moral direction (Schramowski, 2022).

Clustering
Layer-wise alignment of pre-defined concepts in BERT (Dalvi, 2022). [Figure: example concepts]

Probing | Predicting a property of interest
A probing classifier (linear or non-linear) is trained to predict a property of interest from the embeddings: "estimating the mutual information between representations and a property of interest". Properties of interest: part of speech, semantic roles/relations, etc. A minimal sketch follows below.
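A minimal probing sketch along these lines, assuming the HuggingFace transformers and scikit-learn packages are available: a linear classifier is trained to predict part of speech from frozen BERT token embeddings. The six toy examples and the word_embedding helper are illustrative stand-ins; real probing studies use annotated corpora and carefully designed control tasks.

```python
import torch
from transformers import AutoModel, AutoTokenizer
from sklearn.linear_model import LogisticRegression

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()

# Toy probing data: (pre-tokenized sentence, index of target word, POS label).
examples = [
    ("the bat flies at night", 1, "NOUN"),
    ("they bat the ball around", 1, "VERB"),
    ("the dog barks loudly", 1, "NOUN"),
    ("we walk to school", 1, "VERB"),
    ("the cat sleeps all day", 1, "NOUN"),
    ("birds sing in spring", 1, "VERB"),
]

def word_embedding(sentence, word_index):
    """Mean-pool the subword embeddings that belong to one word."""
    enc = tokenizer(sentence.split(), is_split_into_words=True,
                    return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state[0]  # (seq_len, hidden_size)
    rows = [i for i, w in enumerate(enc.word_ids()) if w == word_index]
    return hidden[rows].mean(dim=0)

X = torch.stack([word_embedding(s, i) for s, i, _ in examples]).numpy()
y = [label for _, _, label in examples]

probe = LogisticRegression(max_iter=1000).fit(X, y)  # a linear probe
print("training accuracy:", probe.score(X, y))
```

Note that the helper reuses the mean-pooling strategy from the slide above to assemble a word representation from its subword pieces.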
Probing | (Dis)Advantages
Advantage: flexibility, i.e. any type of embedding and any property.
Caveats:
- Accuracy-complexity trade-off: what classifier to use?
- Correlations with the property of interest => designing an accurate probing task is difficult

Association Tests | Inspiration
Implicit Association Test (Greenwald, 1998)
- Measures the differential association of two target concepts with an attribute in a choice task (e.g. sorting "ant" under "flowers or pleasant" vs. "insects or pleasant" pairings)
- Response time as an implicit measure of association: responses are faster when concepts are associated (e.g. flowers and pleasant)

Association Tests | Transfer to Embedding Space
Measuring the relative distances of two target concepts to an attribute in the embedding space.
[Figure: PCA projection (PC1/PC2) with flowers (hyacinth, clover, aster) close to pleasant words (caress, health, freedom) and insects (flea, caterpillar, ant) close to unpleasant words (abuse, crash, filth)]

Association Tests | WEAT
Word Embedding Association Test (Caliskan, 2017), sketched below. Cosine similarity measures the associations between:
- X: target set 1 (insects)
- Y: target set 2 (flowers)
- A: attribute set 1 (pleasant words)
- B: attribute set 2 (unpleasant words)
Findings => word embeddings contain human-like biases:
- Morally neutral bias (insects vs. flowers)
- Racial and gender bias. Example, math and arts: X = target set 1 (math, algebra, etc.), Y = target set 2 (poetry, art, etc.), A = attribute set 1 (male, man, boy, etc.), B = attribute set 2 (female, woman, girl, etc.)
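A sketch of the WEAT effect size over these sets, following Caliskan et al. (2017): s(w, A, B) is the difference in mean cosine similarity of w to the two attribute sets, and the effect size standardizes the difference between the target sets. The embed lookup is a placeholder (fixed random vector per word) to be replaced by real static or pooled contextual embeddings; the word sets are abbreviated versions of those on the slide.

```python
import numpy as np

rng = np.random.default_rng(0)
_vectors = {}

def embed(word):
    # Placeholder lookup: a fixed random vector per word.
    # Replace with real embeddings (e.g. GloVe, or pooled BERT states).
    if word not in _vectors:
        _vectors[word] = rng.standard_normal(50)
    return _vectors[word]

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

def association(w, A, B):
    # s(w, A, B): mean cosine similarity to A minus mean cosine similarity to B.
    return (np.mean([cosine(embed(w), embed(a)) for a in A])
            - np.mean([cosine(embed(w), embed(b)) for b in B]))

def weat_effect_size(X, Y, A, B):
    # Standardized differential association of the targets with the attributes.
    sx = [association(x, A, B) for x in X]
    sy = [association(y, A, B) for y in Y]
    return (np.mean(sx) - np.mean(sy)) / np.std(sx + sy, ddof=1)

X = ["ant", "flea", "caterpillar"]   # target set 1: insects
Y = ["hyacinth", "clover", "aster"]  # target set 2: flowers
A = ["health", "freedom", "caress"]  # attribute set 1: pleasant words
B = ["abuse", "crash", "filth"]      # attribute set 2: unpleasant words

print("WEAT effect size d =", weat_effect_size(X, Y, A, B))
```

With the random placeholder vectors the effect size hovers around zero; with real embeddings, Caliskan et al. report sizable positive values for tests like this one.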
Association Tests | CEAT
Contextualized Embedding Association Test (Guo, 2021): sampling contexts from a reference corpus.
[Table excerpt from Guo (2021), two rows per test: completely random samples vs. identical sentences across all neural language models.]

Polar Embeddings | Inspiration
Semantic differentials (Osgood, 1952): measuring the meaning of concepts with polar scales.

Polar Embeddings | POLAR Framework
POLAR framework (Mathew, 2020): embedding polar opposites (antonyms) to identify an interpretable subspace.
- Step 1: Define the polar embedding space with a set of polar opposites
- Step 2: Project a word into the interpretable embedding space

Polar Embeddings | SensePOLAR
Extension to contextual embeddings: SensePOLAR (Engler, 2022)
- Sense embeddings instead of word embeddings
- Context examples for word senses from dictionaries (e.g. WordNet)
- Context examples for projection
- Polar embeddings are interpretable while offering similar performance

Polar Embeddings | Extensions and Limitations
Extension to broader concepts: a word list for each pole. Example, stereotype dimensions (Fraser, 2021), with word lists for warmth (high: friendly, warm, pleasant, etc.; low: cold, repellent, disliked, etc.)
Limitations:
- Requirement of opposites
- Dependency on dictionaries (antonym relations, word lists)
- Requirement of quality context examples (a potential source of bias)

Analysis of Contextual Embeddings | Opportunities
- Getting a sense of what LLMs learn (+ a reality check of what LLMs are)
- Bias measurement, BUT measurements in embeddings and in downstream tasks may not give the same results
- De-biasing, with the same caveat as above
- Optimizing embedding spaces
- Using embeddings to study (digital) society

Analysis of Contextual Embeddings | Caveats
- The distributed nature of embeddings makes comprehensive analysis difficult: high dimensionality plus distribution across layers
- Anisotropy and artifacts (rogue dimensions)
- Entanglement of meaning, facts, syntax, …
- High variability depending on context (which context examples should be employed in the analysis?)

Applications of Embeddings: Topic Modeling

Applications of Embeddings
Pre-trained contextual embeddings beyond LLM analysis:
- Semantic search
- RAG (will be covered later in the lecture)
- Exploratory data analysis
- Studying meaning and semantic shift of words
- Document clustering (sentence/document embeddings)
- Topic modeling (see the sketch below)
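To make the topic-modeling application concrete, here is a minimal sketch in the spirit of embedding-based topic models such as BERTopic, assuming the sentence-transformers and scikit-learn packages are installed: embed the documents, cluster the embeddings, and describe each cluster by its most frequent terms. The document list, model name, and number of topics k are illustrative; real pipelines typically add dimensionality reduction (UMAP), density-based clustering (HDBSCAN), and class-based TF-IDF keyword scoring.

```python
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "the striker scored twice in the final",
    "the goalkeeper saved a late penalty",
    "the central bank raised interest rates",
    "inflation slowed as markets rallied",
    "the vaccine trial reported strong results",
    "doctors recommend the new treatment",
]

# Sentence/document embeddings from a small pre-trained encoder.
model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(docs)

# Cluster documents in embedding space; each cluster is one "topic".
k = 3
labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(embeddings)

# Describe each topic by its most frequent terms (a crude stand-in for c-TF-IDF).
vectorizer = CountVectorizer(stop_words="english")
counts = vectorizer.fit_transform(docs).toarray()
terms = np.array(vectorizer.get_feature_names_out())
for topic in range(k):
    topic_counts = counts[labels == topic].sum(axis=0)
    top = terms[np.argsort(topic_counts)[::-1][:3]]
    print(f"topic {topic}: {', '.join(top)}")
```

Because the topics emerge from distances between contextual embeddings rather than from word co-occurrence alone, this approach ties the application back to the distributional-semantics recap at the start of the lecture.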