Introduction to Word Embeddings

International Burch University (IBU)

Assist. Prof. Dr. Dželila Mehanović

Tags

word embeddings, natural language processing, NLP, artificial intelligence

Summary

This presentation covers the fundamentals of word embeddings, a crucial technique in natural language processing. It introduces the Bag-of-Words model and representation techniques such as TF-IDF, Word2Vec, GloVe, and FastText, explaining the underlying concepts and applications and illustrating the methods with examples.

Full Transcript


Introduction to Natural Language Processing - Intro to Word Embeddings
Assist. Prof. Dr. Dželila Mehanović

Learning Objectives
Understand the concept of word embeddings
Explore why word embeddings are important in NLP
Learn popular word embedding techniques (Word2Vec, GloVe, FastText)
Visualize word embeddings with examples

Bag of Words (BoW)
The Bag of Words model represents text as a collection (or "bag") of its words, disregarding grammar, word order, and syntax, while preserving word frequency.
Text Representation:
○ Each text document is represented as a vector of word counts or occurrences.
○ The position in the vector corresponds to a specific word from the vocabulary.
Vocabulary Creation:
○ Build a vocabulary of all unique words across the dataset.
○ Assign each word in the vocabulary a unique index.
Frequency Count:
○ For each document, count how often each word in the vocabulary appears.
○ Create a vector for each document based on these counts.

Bag of Words (BoW) - How It Works
Preprocessing:
○ Tokenize the text into words.
○ Normalize the text (e.g., convert to lowercase, remove punctuation/stopwords).
Build Vocabulary:
○ Collect all unique words from the dataset.
Vectorize:
○ Create a vector of word counts for each document using the vocabulary.

Bag of Words (BoW) - Example
Consider two simple documents:
○ Document 1: "I love NLP."
○ Document 2: "I love programming in Python."
Step 1: Build the vocabulary:
○ "I", "love", "NLP", "programming", "in", "Python"
Step 2: Count word frequencies:

             I   love  NLP  programming  in  Python
Document 1   1    1     1        0        0     0
Document 2   1    1     0        1        1     1

Overview
Word embeddings are a way to represent words as dense vectors.
These vectors capture semantic relationships based on context.
They transform textual data into a numerical format suitable for machine learning models.
Examples:
○ Words with similar meanings (e.g., "king" and "queen") are closer in the vector space.

Analogies by Vector Arithmetic
Analogies express the relationships between concepts. For example, "man is to king as woman is to _____". To arrive at the answer, we first find the relationship between "man" and "king" (the vector from "man" to "king"); adding that vector to "woman" points to the answer, "queen".

Measuring Euclidean Distance
We can compare two words by drawing a vector from one to the other and measuring its length.
The vector from "boy" to "infant" (child) can be computed by starting with "infant" (5, 1) and subtracting "boy" (1, 2), giving the vector (4, -1).
The length of a vector [x, y] is given by the formula sqrt(x² + y²).
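The subtraction and length computation above fit in a few lines of code. The sketch below is a minimal illustration using the toy 2-D coordinates from the slide and assumes NumPy is available; real embeddings typically have hundreds of dimensions, but the arithmetic is identical.

```python
import numpy as np

boy = np.array([1.0, 2.0])      # toy 2-D coordinate for "boy" from the slide
infant = np.array([5.0, 1.0])   # toy 2-D coordinate for "infant"

# Vector pointing from "boy" to "infant": end point minus start point.
boy_to_infant = infant - boy              # array([ 4., -1.])

# Euclidean length of that vector: sqrt(x^2 + y^2).
distance = np.linalg.norm(boy_to_infant)  # sqrt(4^2 + (-1)^2) = sqrt(17) ≈ 4.12

print(boy_to_infant, distance)
```

The same arithmetic underlies the analogy example: the difference Vector("king") - Vector("man") is itself such a vector, and adding it to Vector("woman") lands near Vector("queen").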
Why Are Word Embeddings Important?
Traditional NLP relied on sparse vectors (e.g., one-hot encoding), which were high-dimensional and lacked semantic meaning.
Word embeddings solve this by creating dense, low-dimensional representations that encode semantic similarity.
Applications:
○ Enhanced machine translation,
○ better sentiment classification, and
○ improved document retrieval systems.

One-Hot Encoding
One-hot encoding is a technique used to convert categorical data into a numerical format that can be provided as input to machine learning algorithms.
It represents each category as a unique binary vector, where all values are 0 except for a single 1 that indicates the presence of a specific category.
Consider a categorical variable, Color, with the values ["Red", "Blue", "Green"]:
○ Red → [1, 0, 0]
○ Blue → [0, 1, 0]
○ Green → [0, 0, 1]
If we have a dataset with a single Color column, its one-hot encoded version looks like this:

Color    Red  Blue  Green
Red       1    0     0
Blue      0    1     0
Green     0    0     1
Red       1    0     0

Types of Word Embeddings
Frequency-Based Embeddings:
○ Methods: Count Vectorization, Term Frequency-Inverse Document Frequency (TF-IDF).
○ Focus on word co-occurrence and context.
○ Limitation: Do not capture deep semantic relationships.
Prediction-Based Embeddings:
○ Methods: Word2Vec (CBOW, Skip-gram), GloVe (Global Vectors).
○ Train models to predict context words or target words.
○ Strength: Capture richer semantic information.

Term Frequency-Inverse Document Frequency (TF-IDF)
TF-IDF is a statistical measure used to evaluate the importance of a word in a document relative to a collection of documents (corpus).
Purpose:
○ Highlights words that are unique to a document while reducing the impact of common words (e.g., "the," "and").
Applications:
○ Search Engines: Rank documents based on query relevance.
○ Text Summarization: Extract key terms to summarize content.
○ Spam Filtering: Identify spam keywords in emails.
○ Recommendation Systems: Suggest items based on term similarity in descriptions.

How TF-IDF Works
Term Frequency (TF):
○ Measures how frequently a term occurs in a document.
○ TF(t, d) = (frequency of term t in document d) / (total terms in document d)
Inverse Document Frequency (IDF):
○ Measures how unique a term is across the corpus.
○ IDF(t) = log(total number of documents / number of documents containing t)
TF-IDF Score:
○ Combines TF and IDF to determine the importance of a term in a document.
○ TF-IDF(t, d) = TF(t, d) × IDF(t)

Example of TF-IDF Calculation
Document 1: "The cat sat on the mat."
Document 2: "The dog barked at the cat."
Term: "cat"
○ TF("cat", Document 1) = 1/6 ≈ 0.167
○ TF("cat", Document 2) = 1/6 ≈ 0.167
○ IDF("cat") = log(2/2) = log(1) = 0, but Scikit-learn adds smoothing to the IDF formula (to avoid division by zero) plus a constant 1:
  IDF(t) = log((1 + number of documents) / (1 + number of documents containing t)) + 1
○ IDF("cat") = log((2 + 1) / (2 + 1)) + 1 = log(1) + 1 = 1
○ TF-IDF("cat", Document 1) = 0.167 × 1 = 0.167
○ TF-IDF("cat", Document 2) = 0.167 × 1 = 0.167
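The example above can be checked against scikit-learn. The sketch below is only illustrative: TfidfVectorizer uses raw term counts for TF and L2-normalizes each document vector by default, so the absolute weights differ from the hand-computed 0.167, but the smoothed IDF value for "cat" (1.0) matches the calculation above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "The cat sat on the mat.",
    "The dog barked at the cat.",
]

vectorizer = TfidfVectorizer()          # smooth_idf=True is the default
tfidf = vectorizer.fit_transform(docs)  # shape: (2 documents, vocabulary size)

# Smoothed IDF per vocabulary term.
for term, idf in zip(vectorizer.get_feature_names_out(), vectorizer.idf_):
    print(f"{term:>8}  idf = {idf:.3f}")
# "cat" and "the" occur in both documents -> idf = log(3/3) + 1 = 1.0
# terms occurring in only one document    -> idf = log(3/2) + 1 ≈ 1.405

print(tfidf.toarray().round(3))         # per-document (normalized) TF-IDF weights
```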
Word2Vec
Predict the target word given its context words (CBOW), or predict the context words given a target word (Skip-gram).
Training Objective:
○ Maximize the probability of context words appearing given a target word.
Example:
○ Sentence: "The cat sits on the mat."
○ CBOW: Predict "cat" from "The ... sits."
○ Skip-gram: Predict "The" and "sits" from "cat."

How Word Embeddings Are Created - Skip-gram
Step 1: Build a text corpus, such as Wikipedia articles, news stories, or Shakespeare's works.
Step 2: Select a vocabulary size M, keeping only the most frequent words (e.g., the top 50,000) to simplify the embedding task. This excludes infrequent words, typos, and low-importance elements. Decide how to handle punctuation, contractions, special characters, and capitalization variations.
Step 3: Set a context window size C. For example, with C = 2, the context for each word includes the 2 words to its left and the 2 words to its right, forming a five-word chunk.
Step 4: Build a co-occurrence dictionary by scanning the corpus word by word and recording the words that appear within the context window C of each target word.
○ For example, with C = 2 and the text "Thou shalt not make a machine in the image of a human mind", each word is recorded together with its neighbours within two positions (e.g., "shalt" co-occurs with "Thou", "not", and "make").
Step 5: Select an embedding size N, which determines the number of dimensions in each word's representation. Common values are 100 (small), 300 (typical), or 700 (large). A larger N encodes more information but requires more computation and memory.
Step 6: Make two tables, each containing M rows and N columns, where each row represents a word.
○ For example, with a 50,000-word vocabulary and 300-element embeddings, each table has size 50,000 × 300.
○ One table, E, is for the target words we are trying to embed. The second table, U, is for words when they are used as context words. Initialize both tables with small random numbers.
Step 7: Training slides a window over the corpus, adjusting target and context word embeddings with gradient descent based on a sigmoid-transformed dot product. Positive examples aim for an output of 1, negative examples for 0, repeated over the corpus for 3 to 50 iterations.

GloVe (Global Vectors)
GloVe (Global Vectors) uses a word co-occurrence matrix computed across the entire corpus.
It captures both local (contextual) and global (corpus-wide) statistics.
Comparison to Word2Vec:
○ Word2Vec: Focuses on local context.
○ GloVe: Incorporates corpus-level co-occurrences.
Example: For the words king, queen, man, and woman, GloVe embeddings capture relationships such as:
○ Vector("king") - Vector("man") + Vector("woman") ≈ Vector("queen")
This demonstrates how GloVe embeddings encode analogies.

FastText
Motivation:
○ Improve handling of rare words and subword information.
How it Works:
○ Represents words as combinations of subword units (character n-grams).
○ Example: "cat" → "ca", "at", "cat" (see the sketch below).
○ Builds embeddings for the n-grams and combines them.
Advantages:
○ Handles out-of-vocabulary (OOV) words.
○ Encodes morphological information (e.g., prefixes and suffixes).
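To make the subword idea concrete, here is a minimal sketch that enumerates character n-grams in the simplified style of the example above ("cat" → "ca", "at", "cat"). The helper name char_ngrams is just for illustration; the real FastText implementation additionally wraps each word in boundary markers (e.g., "<cat>") and uses n-grams of length 3 to 6.

```python
def char_ngrams(word, sizes=(2,)):
    """Return all character n-grams of the given sizes, plus the word itself."""
    grams = []
    for n in sizes:
        grams.extend(word[i:i + n] for i in range(len(word) - n + 1))
    if word not in grams:
        grams.append(word)  # keep the full word as its own unit, as FastText does
    return grams

print(char_ngrams("cat"))                 # ['ca', 'at', 'cat']
print(char_ngrams("cats", sizes=(2, 3)))  # ['ca', 'at', 'ts', 'cat', 'ats', 'cats']
```

Because an unseen word such as "cats" shares most of its n-grams with "cat", its vector can still be assembled from the n-gram embeddings, which is how FastText handles out-of-vocabulary words.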

Example Applications
Sentiment Analysis:
○ Enables models to understand nuances in text sentiment (e.g., reviews, tweets).
○ Example: Identifying positive/negative sentiment in customer feedback.
Machine Translation:
○ Improves word alignment across languages by capturing semantic similarities.
○ Example: Translating idiomatic phrases more accurately.
Information Retrieval:
○ Helps search engines retrieve documents based on semantic content.
○ Example: Searching for synonyms or related terms in queries.
Named Entity Recognition (NER):
○ Identifying entities in text.

Challenges
Bias in Training Data:
○ Embeddings can reflect and perpetuate societal biases in the data.
○ Example: Gender bias in word associations.
Polysemy:
○ Words with multiple meanings (e.g., "bank") can confuse models.
○ Solution: Contextual embeddings like BERT.
Out-of-Vocabulary Words:
○ Embeddings struggle with new or rare words.
○ Solution: Subword embeddings (e.g., FastText).

Conclusion
Word embeddings are a cornerstone of modern NLP, enabling dense and semantically rich word representations.
They have transformed tasks like sentiment analysis, translation, and information retrieval.
Addressing challenges like bias and polysemy is key to advancing this field.
Future innovations will enhance the adaptability and fairness of word embeddings.

Thank you