Natural Language Processing and Chatbot Course Code: 20AIPC503 UNIT I INTRODUCTION TO NLP Overview and advantages of NLP NLP Libraries Language Modeling: Unigram Language Model Bigram Trigram N-gram Advanced smoothing for language modeling Empirical Comparison of Smoothing Techniques Applications of Language Modeling Overview of NLP Natural language processing (NLP) is a branch of artificial intelligence (AI) that enables computers to comprehend, generate, and manipulate human language. Natural language processing has the ability to interrogate the data with natural language text or voice. Natural language processing (NLP) combines computational linguistics, machine learning, and deep learning models to process human language. – Computational linguistics is the science of understanding and constructing human language mod- els with computers and software tools. 1 NLP and its Components Natural Language Natural language is the way that people communicate with each other on a daily basis. However, that doesn’t only include everyday conversations. Speech and text are around us all of the time. Everywhere you look, there is text that someone is using to communicate with you. Whether that is personal, through text messaging, emails, letters, etc., or if it is less personal, in the way of signs, instructions on things that you have bought, or websites. 1 Natural Language Processing Natural language processing (NLP) is a branch of linguistics, artificial intelligence, information engineering, and computer science. Computers use that method to understand and interpret the language that humans use for interaction with computer technology. Furthermore, it is how a program analyses and processes large amounts of natural language data. Natural language processing, while not new, is seeing a new era. The studies began in the 1950s, and honestly, it didn’t get as far as expected in the time they wanted it to. As time moves forward, we will see huge leaps in ability, coinciding with the computing power that is now available. When we look back at the differences between the computers of the 1950s and today, it is easy to see why there is such an increase in capability. For example, Lisp (Locator/Identifier Separation Protocol), one of the earliest forms of NLP, ran on an IBM 704. Not only are computers today vastly smaller in size, but they are also immensely more powerful. Some of the most common applications of NLP algorithms that you will already know about are Siri, Bixby, and Alexa. They all use NLP to take intention from your speech and turn it into action, whether that is in the form of creating a note, sending a message, or playing a song. Furthermore, they will analyze and use previous interactions to understand the user intent on a better level. However, that action and understanding come not only through one piece of technology. Along with machine learning and deep learning, there are five other components to NLP: Components of NLP 1.1 Morphological Analysis The morphological analysis looks at word components and meanings based on machine learning. It examines word formations and components. For example: Unhappy: [un prefix – not], [happy root – pleased]. Uncooked: [un prefix – not], [cook root – to heat something enough to be not raw], [ed suffix – confirming that it was in the past]. 1.2 Lexical Analysis Lexical analysis in NLP involves breaking down text into its fundamental units, such as words, punctuation, and whitespace. This process is often referred to as tokenization. 
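The sentence and word tokenization just described (and the worked examples that follow) can be reproduced with NLTK. A minimal sketch, assuming the NLTK Punkt tokenizer data has already been downloaded:

```python
# Minimal tokenization sketch using NLTK
# (assumes the Punkt data is installed, e.g. via nltk.download("punkt")).
from nltk.tokenize import sent_tokenize, word_tokenize

text = "Hello, world! This is a test."

sentences = sent_tokenize(text)  # ['Hello, world!', 'This is a test.']
tokens = word_tokenize(text)     # ['Hello', ',', 'world', '!', 'This', 'is', 'a', 'test', '.']

print(sentences)
print(tokens)
```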
Sentence Tokenization: The text is split into sentences. For example: Input: “Hello, world! This is a test.” Output: [’Hello, world!’, ’This is a test.’] Word Tokenization: Each sentence is split into words (tokens). For example: Input: “Hello, world! This is a test.” Output: [’Hello’, ’,’, ’world’, ’ !’, ’This’, ’is’, ’a’, ’test’, ’.’] 1.3 Syntactic Analysis Syntactic analysis, or parsing, involves analyzing the grammatical structure of a sentence. This process identifies the syntactic relationships between words and phrases in a sentence. Example text: ”The cat sat on the mat.” First, we need to perform part-of-speech (POS) tagging, then we can parse the sentence to analyze its syntactic structure. POS Tagging: Each word is tagged with its part of speech. Input: [’The’, ’cat’, ’sat’, ’on’, ’the’, ’mat’] Output: [(’The’, ’DT’), (’cat’, ’NN’), (’sat’, ’VBD’), (’on’, ’IN’), (’the’, ’DT’), (’mat’, ’NN’)] Page 2 — Prepared by: Jeena AT/AP/AIDS/SEC Here, DT is a determiner, NN is a noun, VBD is a verb (past tense), and IN is a preposition. Parsing: The sentence is parsed using a simple grammar that identifies noun phrases (NP). Input: [(’The’, ’DT’), (’cat’, ’NN’), (’sat’, ’VBD’), (’on’, ’IN’), (’the’, ’DT’), (’mat’, ’NN’)] This output shows that ”The cat” and ”the mat” are identified as noun phrases (NP). 1.4 Semantic Analysis Semantic analysis in NLP involves understanding the meaning of words and sentences. This process often includes tasks such as named entity recognition, word sense disambiguation, and semantic role labeling. For example: example of semantic analysis focusing on named entity recognition (NER) Input text = ”Apple is looking at buying U.K. startup for $1 billion.” Output: Named Entities: [(’Apple’, ’ORG’), (’U.K.’, ’GPE’), (’$1 billion’, ’MONEY’)] Apple is recognized as an ORG (organization). U.K. is recognized as a GPE (geopolitical entity). $1 billion is recognized as MONEY. 1.5 Discourse Integration Discourse integration in NLP refers to understanding how sentences in a text are connected to each other, capturing the context and coherence across sentences. A common task for discourse integration is coreference resolution, which involves identifying when different expressions refer to the same entity. Example text: “John bought a new car. He loves it.” Coreference Clusters: [John: [John, He], new car: [a new car, it]] Resolved Text: “John bought a new car. John loves a new car.” 1.6 Pragmatic Analysis Pragmatic analysis in NLP involves understanding the intended meaning and context of a statement beyond its literal meaning. This often includes interpreting idioms, sarcasm, and understanding context-specific implications. We can use the VADER (Valence Aware Dictionary and sEntiment Reasoner) tool from the nltk library for sentiment analysis. VADER is designed to understand the context of the sentiment, including sarcasm. For example: “Yeah, right. That was just what I needed.” Sentiment Scores: ’neg’: 0.65, ’neu’: 0.35, ’pos’: 0.0, ’compound’: -0.6588 The sentence ”Yeah, right. That was just what I needed.” is likely sarcastic. The VADER tool detects the negative sentiment despite the seemingly positive words ”just what I needed,” showing that the intended meaning is negative. 1.7 Machine Learning (ML) with NLP ML: A subset of artificial intelligence (AI) that enables systems to learn and improve from experience without being explicitly programmed. NLP: A field of AI that focuses on the interaction between computers and humans through natural language. 
Key Applications Text Classification: Categorizing text into predefined categories (e.g., spam detection, sentiment analysis). Language Translation: Translating text from one language to another (e.g., Google Translate). Page 3 — Prepared by: Jeena AT/AP/AIDS/SEC Speech Recognition: Converting spoken language into text (e.g., Siri, Alexa). Text Summarization: Condensing long documents into shorter versions while preserving key infor- mation. Question Answering: Systems that can answer questions posed in natural language (e.g., chatbots). Popular Techniques Bag of Words (BoW): Represents text as a set of word counts. TF-IDF (Term Frequency-Inverse Document Frequency): Measures the importance of a word in a document relative to a corpus. Word Embeddings: Represent words in continuous vector space (e.g., Word2Vec, GloVe). Recurrent Neural Networks (RNNs): Designed for sequential data, useful for tasks like language modeling. Transformers: Advanced models like BERT and GPT that excel in various NLP tasks. Challenges Ambiguity: Words and sentences can have multiple meanings. Context Understanding: Capturing the context of words in different scenarios. Data Quality: High-quality, annotated data is essential for training effective models. Computational Resources: Large models require significant computational power. Libraries and Tools NLTK: A comprehensive library for building Python programs to work with human language data. spaCy: Industrial-strength NLP library with pre-trained models and advanced features. Transformers by Hugging Face: Library providing state-of-the-art pre-trained models. 1.8 Deep Learning (DL) with NLP DL: A subset of machine learning involving neural networks with many layers (deep neural networks) that can learn representations of data. NLP: A field of AI that focuses on the interaction between computers and humans through natural language. Key Applications Text Generation: Creating human-like text (e.g., GPT-3, GPT-4). Machine Translation: Translating text between languages (e.g., Google Translate using Transform- ers). Sentiment Analysis: Identifying sentiment in text (e.g., product reviews). Named Entity Recognition (NER): Identifying entities like names, dates, and locations in text. Question Answering: Systems that answer questions posed in natural language (e.g., chatbots, virtual assistants). Page 4 — Prepared by: Jeena AT/AP/AIDS/SEC Popular Architectures Recurrent Neural Networks (RNNs): Designed for sequential data, though limited by short-term memory issues. Long Short-Term Memory (LSTM) Networks: A type of RNN designed to remember long-term dependencies. Gated Recurrent Units (GRUs): A variant of LSTMs, simpler and faster to train. Convolutional Neural Networks (CNNs): Originally for image processing, adapted for text clas- sification. Transformers: State-of-the-art architecture that processes entire sequences in parallel, addressing the limitations of RNNs (e.g., BERT, GPT). Key Techniques Word Embeddings: Representing words in continuous vector space (e.g., Word2Vec, GloVe). Attention Mechanism: Allows models to focus on relevant parts of the input sequence (e.g., Self- Attention in Transformers). Transfer Learning: Using pre-trained models on large datasets and fine-tuning them for specific tasks (e.g., BERT, GPT). Challenges Data Requirements: Deep learning models require large amounts of annotated data. Computational Resources: Training deep learning models can be resource-intensive. 
Model Interpretability: Deep learning models, especially large ones, are often seen as ”black boxes”. Overfitting: Risk of models performing well on training data but poorly on unseen data. Libraries and Frameworks TensorFlow: An open-source deep learning framework by Google. PyTorch: A flexible and widely-used deep learning framework by Facebook. Transformers by Hugging Face: Provides state-of-the-art pre-trained models and tools for NLP. Recent Advances BERT (Bidirectional Encoder Representations from Transformers): Pre-trained model de- signed to understand context in both directions. GPT (Generative Pre-trained Transformer): A series of models for text generation, with GPT-3 and GPT-4 being the most advanced. T5 (Text-to-Text Transfer Transformer): Converts all NLP tasks into a text-to-text format, enhancing versatility. NLP is progressing rapidly. Systems like Call Criteria utilize the latest technology to understand more words, phrases, and sentences within the English language, regardless of the caller’s accent. These systems learn from thousands of calls across various industries, evolving into better systems with each interaction. Page 5 — Prepared by: Jeena AT/AP/AIDS/SEC 2 NLP Applications Natural language processing tools can help businesses analyze data and discover insights, automate time- consuming processes, and help them gain a competitive advantage. Some of the most interesting applications of natural language processing in business are: Sentiment Analysis: Analyzes text to determine the sentiment expressed (positive, negative, or neutral). It is commonly used to gauge customer opinions in reviews, social media, and feedback. Text Classification: Categorizes text into predefined categories. Applications include spam detection, topic labeling, and document organization. Chatbots & Virtual Assistants: Automates customer interactions by understanding and responding to user queries in natural language. Examples include customer support bots and personal assistants like Siri and Alexa. Text Extraction: Identifies and extracts specific information from text, such as names, dates, prices, or product features. This is useful for data mining and knowledge discovery. Machine Translation: Automatically translates text from one language to another. Services like Google Translate use this technology to break language barriers. Text Summarization: Generates a concise summary of a larger body of text, preserving key infor- mation. This helps in quickly understanding large documents or articles. Market Intelligence: Analyzes various text sources (e.g., news articles, social media, reports) to gather insights about market trends, competitor activities, and consumer behavior. Auto-Correct: Automatically corrects spelling and grammatical errors in text. This is widely used in word processors, email clients, and messaging apps. Intent Classification: Determines the intent behind a user’s query or statement, such as making a purchase, seeking information, or requesting support. This is essential for improving user experience in conversational systems. Urgency Detection: Identifies the urgency level of a message or request, enabling prioritization in customer support and emergency response systems. Speech Recognition: Converts spoken language into text. This is used in voice-controlled applica- tions, transcription services, and accessibility tools for people with disabilities. 
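Several of the applications listed above, such as sentiment analysis and text classification, can be tried in a few lines with a pre-trained transformer. A minimal sketch using the Hugging Face Transformers pipeline API; the library downloads a default pre-trained sentiment model on first use, and the example texts are illustrative only:

```python
# Minimal sketch: sentiment analysis with a pre-trained transformer
# via the Hugging Face `transformers` pipeline API.
from transformers import pipeline

# Loads a default pre-trained sentiment model on first use.
classifier = pipeline("sentiment-analysis")

reviews = [
    "The product arrived quickly and works perfectly.",
    "Terrible support, I waited two weeks for a reply.",
]

for review, result in zip(reviews, classifier(reviews)):
    # Each result is a dict like {'label': 'POSITIVE', 'score': 0.99}
    print(review, "->", result["label"], round(result["score"], 3))
```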
3 Advantages of NLP Perform large-scale analysis NLP technology allows for text analysis at scale on all manner of documents, internal systems, emails, social media data, online reviews, and more. Process huge amounts of data in just seconds or minutes, that would take days or weeks of manual analysis. Get a more objective and accurate analysis When performing repetitive (and frankly boring) tasks, like reading and analyzing open-ended survey responses and other text data, humans are prone to mistakes or may have inherent biases that can skew the results. NLP-powered tools can be trained to the language and criteria of our business, often in just a few steps. So, once we have them up and running, they perform much more accurately than humans ever could. Streamline processes and reduce costs NLP tools work at whatever scale you need, 24/7, in real time. When we connect NLP tools to our data, we’ll be able to analyze our customer feedback on the go, so we’ll know right away when customers are Page 6 — Prepared by: Jeena AT/AP/AIDS/SEC having problems with our product or service. Automate ticket tagging and routing with NLP tools to streamline processes and free our agents from repetitive tasks. And remain on top of emerging trends just as they arise. Improve customer satisfaction NLP tools allow you to automatically analyze and sort customer service tickets by topic, intent, urgency, sentiment, etc., and route them directly to the proper department or employee, so you never leave a customer in the cold. Better understand your market Natural language processing is having a huge impact on marketing. When we put NLP to work to un- derstand the language of our customer base, we’ll have a better understanding of market segmentation, be better equipped to target our customers directly, and decrease customer churn. Empower your employees With all the human hours you’ll save by automating processes and using data analysis to its full potential, our employees will be able to focus on what matters: their actual jobs. Gain real, actionable insights The unstructured data of open-ended survey responses and online reviews and comments requires an extra level of analysis – we have to break down the text so it can be understood by machines. But AI-guided NLP tools can make it easy. 4 NLP Python Libraries Regex (Regular Expressions) Role: Pattern matching, text extraction, text cleaning, tokenization, validation. Use Cases: Extracting email addresses, removing punctuation, identifying specific patterns (e.g., phone numbers, dates). NLTK (Natural Language Toolkit) Role: Tokenization, stemming, lemmatization, part-of-speech tagging, parsing, named entity recogni- tion. Use Cases: Building Python programs for human language data processing, academic research, and education. spaCy Role: Tokenization, named entity recognition, dependency parsing, part-of-speech tagging, word vec- tors. Use Cases: Fast and efficient NLP tasks, suitable for production use with pre-trained models. TextBlob Role: Sentiment analysis, part-of-speech tagging, noun phrase extraction, translation, text classifica- tion. Use Cases: Beginners and rapid prototyping, high-level interface for common NLP tasks. Page 7 — Prepared by: Jeena AT/AP/AIDS/SEC Textacy Role: Preprocessing, linguistic feature extraction, topic modeling, sentiment analysis, keyword extrac- tion. Use Cases: Simplifying text analysis tasks, built on top of spaCy and scikit-learn. 
VADER (Valence Aware Dictionary and sEntiment Reasoner) Role: Rule-based sentiment analysis, sentiment intensity analysis. Use Cases: Analyzing social media text sentiments, handling informal language, slang, emojis. Gensim Role: Text preprocessing, document representation, word embeddings, topic modeling, document similarity and retrieval. Use Cases: Discovering semantic structures in large text corpora, topic modeling, and document similarity analysis. AllenNLP Role: Pre-built models, PyTorch integration, modular components. Use Cases: Deep learning for NLP tasks, including text classification, named entity recognition, semantic role labeling, machine reading comprehension. Stanza Role: Tokenization, part-of-speech tagging, named entity recognition, sentiment analysis, dependency parsing. Use Cases: Accessing Stanford CoreNLP functionality with a user-friendly Python interface. Pattern Role: Part-of-speech tagging, sentiment analysis, word lemmatization, language translation, web scraping, data visualization. Use Cases: Web mining, natural language processing, and machine learning tasks. PyNLPl Role: Corpus processing, morphological analysis, syntactic parsing, multilingual support. Use Cases: Multilingual text analysis projects, computational linguistics, and NLP research. Hugging Face Transformers Role: Pre-trained models, fine-tuning capabilities, inference support, compatibility with PyTorch and TensorFlow. Use Cases: Working with transformer-based models (e.g., BERT, GPT), text classification, question answering, text generation. Page 8 — Prepared by: Jeena AT/AP/AIDS/SEC Flair Role: Named entity recognition, part-of-speech tagging, text classification, deep learning techniques. Use Cases: High accuracy NLP tasks, pre-trained models for multiple languages, domain-specific tasks. FastText Role: Word embeddings, text classification, efficient processing, neural network architectures, multi- lingual support. Use Cases: Large text datasets, high-speed processing for sentiment analysis, document classification, language identification. Polyglot Role: Tokenization, named entity recognition, sentiment analysis, language detection, translation. Use Cases: Multilingual NLP tasks, analyzing text data from diverse sources, over 130 languages supported. 4.1 Importance of Text Analysis Libraries in Python Diverse Functionality: Specialized tools for various NLP tasks like tokenization, NER, sentiment analysis, and topic modeling. Ease of Use: User-friendly interfaces and intuitive APIs suitable for both beginners and experienced practitioners. Deep Learning Integration: State-of-the-art performance on complex text data using deep learning techniques. Efficiency and Scalability: Handling large text datasets efficiently, supporting analysis in multiple languages. Specialized Applications: Catering to specific use cases like social media sentiment analysis and multilingual text analysis. Open-Source Community: Active collaboration, innovation, and continuous improvement in text analysis. 5 NLP Tasks or Techniques NLP encompasses a wide array of techniques aimed at enabling computers to process and understand human language. These tasks can be categorized into several broad areas, each addressing different aspects of language processing. Here are some of the key NLP techniques: 5.1 Text Processing and Preprocessing in NLP Tokenization: Dividing text into smaller units, such as words or sentences. Stemming and Lemmatization: Reducing words to their base or root forms. 
Stopword Removal: Removing common words (like “and”, “the”, “is”) that may not carry significant meaning. Text Normalization: Standardizing text, including case normalization, removing punctuation, and correcting spelling errors. Page 9 — Prepared by: Jeena AT/AP/AIDS/SEC 5.2 Syntax and Parsing in NLP Part-of-Speech (POS) Tagging: Assigning parts of speech to each word in a sentence (e.g., noun, verb, adjective). Dependency Parsing: Analyzing the grammatical structure of a sentence to identify relationships between words. Constituency Parsing: Breaking down a sentence into its constituent parts or phrases (e.g., noun phrases, verb phrases). 5.3 Semantic Analysis Named Entity Recognition (NER): Identifying and classifying entities in text, such as names of people, organizations, locations, dates, etc. Word Sense Disambiguation (WSD): Determining which meaning of a word is used in a given context. Coreference Resolution: Identifying when different words refer to the same entity in a text (e.g., “he” refers to “John”). 5.4 Information Extraction Entity Extraction: Identifying specific entities and their relationships within the text. Relation Extraction: Identifying and categorizing the relationships between entities in a text. 5.5 Text Classification in NLP Sentiment Analysis: Determining the sentiment or emotional tone expressed in a text (e.g., positive, negative, neutral). Topic Modeling: Identifying topics or themes within a large collection of documents. Spam Detection: Classifying text as spam or not spam. 5.6 Language Generation Machine Translation: Translating text from one language to another. Text Summarization: Producing a concise summary of a larger text. Text Generation: Automatically generating coherent and contextually relevant text. 5.7 Speech Processing Speech Recognition: Converting spoken language into text. Text-to-Speech (TTS) Synthesis: Converting written text into spoken language. 5.8 Question Answering Retrieval-Based QA: Finding and returning the most relevant text passage in response to a query. Generative QA: Generating an answer based on the information available in a text corpus. Page 10 — Prepared by: Jeena AT/AP/AIDS/SEC 5.9 Dialogue Systems Chatbots and Virtual Assistants: Enabling systems to engage in conversations with users, pro- viding responses and performing tasks based on user input. 5.10 Sentiment and Emotion Analysis in NLP Emotion Detection: Identifying and categorizing emotions expressed in text. Opinion Mining: Analyzing opinions or reviews to understand public sentiment toward products, services, or topics. 6 Working of Natural Language Processing (NLP) Working in natural language processing (NLP) typically involves using computational techniques to ana- lyze and understand human language. This can include tasks such as language understanding, language generation, and language interaction. 6.1 Text Input and Data Collection Data Collection: Gathering text data from various sources such as websites, books, social media, or proprietary databases. Data Storage: Storing the collected text data in a structured format, such as a database or a collection of documents. 6.2 Text Preprocessing Preprocessing is crucial to clean and prepare the raw text data for analysis. Common preprocessing steps include: Tokenization: Splitting text into smaller units like words or sentences. Lowercasing: Converting all text to lowercase to ensure uniformity. 
Stopword Removal: Removing common words that do not contribute significant meaning, such as “and,” “the,” “is.” Punctuation Removal: Removing punctuation marks. Stemming and Lemmatization: Reducing words to their base or root forms. Stemming cuts off suffixes, while lemmatization considers the context and converts words to their meaningful base form. Text Normalization: Standardizing text format, including correcting spelling errors, expanding contractions, and handling special characters. 6.3 Text Representation Bag of Words (BoW): Representing text as a collection of words, ignoring grammar and word order but keeping track of word frequency. Term Frequency-Inverse Document Frequency (TF-IDF): A statistic that reflects the impor- tance of a word in a document relative to a collection of documents. Word Embeddings: Using dense vector representations of words where semantically similar words are closer together in the vector space (e.g., Word2Vec, GloVe). Page 11 — Prepared by: Jeena AT/AP/AIDS/SEC 6.4 Feature Extraction Extracting meaningful features from the text data that can be used for various NLP tasks. N-grams: Capturing sequences of N words to preserve some context and word order. Syntactic Features: Using parts of speech tags, syntactic dependencies, and parse trees. Semantic Features: Leveraging word embeddings and other representations to capture word meaning and context. 6.5 Model Selection and Training Selecting and training a machine learning or deep learning model to perform specific NLP tasks. Supervised Learning: Using labeled data to train models like Support Vector Machines (SVM), Random Forests, or deep learning models like Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs). Unsupervised Learning: Applying techniques like clustering or topic modeling (e.g., Latent Dirichlet Allocation) on unlabeled data. Pre-trained Models: Utilizing pre-trained language models such as BERT, GPT, or transformer- based models that have been trained on large corpora. 6.6 Model Deployment and Inference Deploying the trained model and using it to make predictions or extract insights from new text data. Text Classification: Categorizing text into predefined classes (e.g., spam detection, sentiment anal- ysis). Named Entity Recognition (NER): Identifying and classifying entities in the text. Machine Translation: Translating text from one language to another. Question Answering: Providing answers to questions based on the context provided by text data. 6.7 Evaluation and Optimization Evaluating the performance of the NLP algorithm using metrics such as accuracy, precision, recall, F1-score, and others. Hyperparameter Tuning: Adjusting model parameters to improve performance. Error Analysis: Analyzing errors to understand model weaknesses and improve robustness. 6.8 Iteration and Improvement Continuously improving the algorithm by incorporating new data, refining preprocessing techniques, exper- imenting with different models, and optimizing features. Page 12 — Prepared by: Jeena AT/AP/AIDS/SEC 7 Technologies related to Natural Language Processing There are a variety of technologies related to natural language processing (NLP) that are used to analyze and understand human language. Some of the most common include: Machine Learning: NLP relies heavily on machine learning techniques such as supervised and unsu- pervised learning, deep learning, and reinforcement learning to train models to understand and generate human language. 
Natural Language Toolkits (NLTK) and other libraries: NLTK is a popular open-source library in Python that provides tools for NLP tasks such as tokenization, stemming, and part-of-speech tagging. Other popular libraries include spaCy, OpenNLP, and CoreNLP. Parsers: Parsers are used to analyze the syntactic structure of sentences, such as dependency parsing and constituency parsing. Text-to-Speech (TTS) and Speech-to-Text (STT) systems: TTS systems convert written text into spoken words, while STT systems convert spoken words into written text. Named Entity Recognition (NER) systems: NER systems identify and extract named entities such as people, places, and organizations from the text. Sentiment Analysis: A technique to understand the emotions or opinions expressed in a piece of text, by using various techniques like Lexicon-Based, Machine Learning-Based, and Deep Learning- based methods. Machine Translation: NLP is used for language translation from one language to another through a computer. Chatbots: NLP is used for chatbots that communicate with other chatbots or humans through auditory or textual methods. AI Software: NLP is used in question-answering software for knowledge representation, analytical reasoning as well as information retrieval. 8 Language Models Models that assign probabilities to upcoming words, or sequences of words in general, are called language models or LMs. There are two types of Language Modelings: Statistical Language Modelings: Statistical Language Modeling, or Language Modeling, is the development of probabilistic models that are able to predict the next word in the sequence given the words that precede. Examples such as N-gram language modeling. Neural Language Modelings: Neural network methods are achieving better results than classical methods both on standalone language models and when models are incorporated into larger models on challenging tasks like speech recognition and machine translation. A way of performing a neural language model is through word embeddings. Language models can also assign a probability to an entire sentence. For example, they can predict that the following sequence has a much higher probability of appearing in a text: all of a sudden I notice three guys standing on the sidewalk than does this same set of words in a different order: on guys all I of notice sidewalk three a sudden standing the In many NLP applications, we can use the probability as a way to choose a better sentence or word over a less-appropriate one. For example, we can correct grammar or spelling errors like Their are two midterms, in which There was mistyped as Their, or Everything has improve, in which improve should have Page 13 — Prepared by: Jeena AT/AP/AIDS/SEC been improved. The phrase There are will be much more probable than Their are, and has improved than has improve, allowing a language model to help users select the more grammatical variant. Language models can also help in augmentative and alternative communication systems (Trnka et al. 2007, Kane et al. 2017). People often use such AAC devices if they are physically unable to speak or sign but can instead use eye gaze or other specific movements to select words from a menu. Word prediction can be used to suggest likely words for the menu. 9 N-Grams An n-gram is a sequence of n words. A 2-gram (which we’ll call bigram) is a two-word sequence of words like “please turn”, “turn your”, or “your homework”. 
A 3-gram (a trigram) is a three-word sequence of words like “please turn your”, or “turn your homework”. 9.1 N-gram Language Model N-gram language models are probabilistic models for word prediction. They estimate the probability of a word given the n-1 previous words. Estimators like N-grams that assign a conditional probability to possible next words can be used to assign a joint probability to an entire sentence. Whether estimating probabilities of next words or of whole sequences, the N-gram model is one of the most important tools in speech and language processing. 9.2 Applications of N-gram Language Model Speech recognition: For example, the input speech sounds are very confusable and many words sound extremely similar. N-grams are essential in any task in which we have to identify words in noisy, ambiguous input. Handwriting recognition: In the movie Take the Money and Run, Woody Allen tries to rob a bank with a sloppily written hold-up note that the teller incorrectly reads as “I have a gub”. Any speech and language processing system could avoid making this mistake by using the knowledge that the sequence “I have a gun” is far more probable than the non-word “I have a gub” or even “I have a gull”. Machine translation: Suppose we are translating Hindi source sentence Put out (Extinguish) the fire just after a while or else (otherwise) it will spread all around. Spelling Correction: A spellchecker can use a probability estimator both to detect these errors and to suggest higher probability functions. Ex: They are leaving in about fifteen minuets to go to her house. An N-gram model predicts that “in about fifteen minuets” is much less probable than “in about fifteen minutes”. Augmentative communication: Word prediction can also help the disabled. People who are unable to use speech or sign language to communicate, like the physicist Steven Hawking, can communicate by using simple body movements to select words from a menu that are spoken by the system. Word prediction can be used to suggest likely words for the menu. N-grams are also crucial in NLP tasks like part-of-speech tagging, natural language generation, and word similarity, as well as in applications from authorship identification and sentiment extraction to predictive text input systems for cell phones. Page 14 — Prepared by: Jeena AT/AP/AIDS/SEC 9.3 Corpora Probabilities are based on counting things. Counting of things in natural language is based on a corpus (plural corpora), an on-line collection of text or speech. Let’s look at two popular corpora, Brown and Switchboard. The Brown corpus is a 1 million word collection of samples from 500 written texts from different genres (newspaper, novels, non-fiction, academic, etc.), assembled at Brown University in 1963-64 (Kuˇcera and Francis, 1967; Francis, 1979; Francis and Kuˇcera, 1982). The Switchboard corpus of telephone conversations between strangers was collected in the early 1990s and contains 2430 conversations averaging 6 minutes each, totaling 240 hours of speech and about 3 million words (Godfrey et al., 1992). Example from Brown Corpus He stepped out into the hall, was delighted to encounter a water brother. Example has 13 words if we don’t count punctuation marks as words, 15 if we count punctuation. Whether we treat period (“.”), comma (“,”), and so on as words depends on the task. Punctuation is critical for finding boundaries of things (commas, periods, colons), and for identifying some aspects of meaning (question marks, exclamation marks, quotation marks). 
For some tasks, like part-of-speech tagging or parsing or speech synthesis, we sometimes treat punctuation marks as if they were separate words.

Example: An utterance from Switchboard [an utterance is the spoken correlate of a sentence]: "I do uh main- mainly business data processing"

This utterance has two kinds of disfluencies. The broken-off word main- is called a fragment. Words like uh and um are called fillers or filled pauses. If we are building an automatic dictation system based on automatic speech recognition, we might want to eventually strip out the disfluencies. But sometimes we also keep disfluencies around. How disfluent a person is can be used to identify them, or to detect whether they are stressed or confused. Disfluencies also often occur with particular syntactic structures, so they may help in parsing and word prediction. Stolcke and Shriberg (1996) found, for example, that treating uh as a word improves next-word prediction (why might this be?), and so most speech recognition systems treat uh and um as words.

10 Simple Unsmoothed N-Grams

10.1 Estimating the Probability of a Word Given a History

Our goal is to compute the probability of a word w given some history h, denoted as P(w | h). For example, if h is "its water is so transparent that" and we want to know the probability that the next word is "the":

P(the | its water is so transparent that)   (1)

One way to estimate this probability is from relative frequency counts:

P(the | its water is so transparent that) = C(its water is so transparent that the) / C(its water is so transparent that)   (2)

However, even with a large corpus, it may be challenging to find counts for all possible sentences due to the creative nature of language. This motivates the need for better methods to estimate these probabilities.

10.2 The Chain Rule of Probability

The joint probability of a sequence of words can be decomposed using the chain rule of probability:

P(X_1, X_2, \ldots, X_n) = \prod_{k=1}^{n} P(X_k \mid X_1^{k-1})   (3)

Applying this to words, we get:

P(w_1, w_2, \ldots, w_n) = \prod_{k=1}^{n} P(w_k \mid w_1^{k-1})   (4)

Since computing the exact probability P(w_n \mid w_1^{n-1}) is impractical, we use approximations like the N-gram model.

10.3 N-gram Models

The bigram model approximates P(w_n \mid w_1^{n-1}) by P(w_n \mid w_{n-1}):

P(w_n \mid w_1^{n-1}) \approx P(w_n \mid w_{n-1})   (5)

This Markov assumption can be generalized to the trigram and higher-order N-gram models:

P(w_n \mid w_1^{n-1}) \approx P(w_n \mid w_{n-N+1}^{n-1})   (6)

Using the bigram assumption, the probability of a word sequence is:

P(w_1, w_2, \ldots, w_n) \approx \prod_{k=1}^{n} P(w_k \mid w_{k-1})   (7)

10.4 Maximum Likelihood Estimation (MLE)

To estimate N-gram probabilities, we use Maximum Likelihood Estimation (MLE):

P(w_n \mid w_{n-1}) = \frac{C(w_{n-1} w_n)}{\sum_{w} C(w_{n-1} w)}   (8)

Simplified as:

P(w_n \mid w_{n-1}) = \frac{C(w_{n-1} w_n)}{C(w_{n-1})}   (9)

For a mini-corpus with sentences (augmented with sentence markers <s> and </s>):

<s> I am Sam </s>
<s> Sam I am </s>
<s> I do not like green eggs and ham </s>

The bigram probabilities are:

P(I | <s>) = 2/3 = 0.67
P(Sam | <s>) = 1/3 = 0.33
P(am | I) = 2/3 = 0.67
P(</s> | Sam) = 1/2 = 0.5
P(Sam | am) = 1/2 = 0.5
P(do | I) = 1/3 = 0.33

General MLE for N-grams:

P(w_n \mid w_{n-N+1}^{n-1}) = \frac{C(w_{n-N+1}^{n-1} w_n)}{C(w_{n-N+1}^{n-1})}   (10)

This equation estimates the N-gram probability by dividing the observed frequency of a particular sequence by the observed frequency of its prefix. This ratio is called a relative frequency. The use of relative frequencies to estimate probabilities is an example of Maximum Likelihood Estimation (MLE).
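The bigram estimates above can be checked with a short sketch that counts unigrams and bigrams over the mini-corpus; the sentence markers <s> and </s> are added explicitly, and the helper name p_mle is only for illustration:

```python
# Minimal sketch: bigram maximum likelihood estimates for the mini-corpus,
# P(w_n | w_{n-1}) = C(w_{n-1} w_n) / C(w_{n-1}).
from collections import Counter

corpus = ["I am Sam", "Sam I am", "I do not like green eggs and ham"]

unigrams, bigrams = Counter(), Counter()
for sentence in corpus:
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    unigrams.update(tokens)
    bigrams.update(zip(tokens, tokens[1:]))

def p_mle(word, prev):
    return bigrams[(prev, word)] / unigrams[prev]

print(p_mle("I", "<s>"))     # 2/3 ~ 0.67
print(p_mle("am", "I"))      # 2/3 ~ 0.67
print(p_mle("</s>", "Sam"))  # 1/2 = 0.5
```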
In Maximum Likelihood Estimation, the resulting parameter set maximizes the likelihood of the training set T given the model M:

P(T \mid M)   (11)

11 Evaluating Language Models: Extrinsic and Intrinsic Evaluation

Extrinsic Evaluation:
– Embed the language model in an application.
– Measure how much the application improves.
– Compare performance by running the application twice, once with each language model.
– Determines if an improvement in the language model helps the task at hand.

Intrinsic Evaluation:
– Measures the quality of a model independent of any application.
– Useful for quickly evaluating potential improvements in a language model.
– Perplexity: The standard intrinsic metric for measuring language model performance.

Data Sets for Evaluation:
– Training Set:
  ∗ Used to learn the parameters of the model.
  ∗ For n-gram language models, it is the corpus from which counts are normalized into probabilities.
– Test Set:
  ∗ Held-out set of data, not overlapping with the training set.
  ∗ Used to evaluate the model's generalization to new, unknown data.
  ∗ Should reflect the language intended for the model's use.
– Development Set (Devset):
  ∗ Used for testing during model development.
  ∗ Helps avoid tuning the model to the test set characteristics.

Choosing Training and Test Sets:
– The test set should reflect the language of the intended application.
– For general-purpose models, the test set should be drawn from a wide variety of texts.
– It is important to divide the data carefully to ensure representative training and test sets.

Evaluating Models:
– Compare performance by how well each model fits the test set.
– The model that assigns a higher probability to the test set is considered better.

Avoiding Training on the Test Set:
– Training on the test set introduces bias and inaccuracies in evaluation metrics.
– Only run the model on the test set once or a very few times to avoid overfitting.

Dividing Data:
– The test set should be as large as possible for statistical significance.
– The devset should be similar to the test set to measure performance accurately.

12 Evaluating Language Models: Perplexity

Definition of Perplexity: Perplexity is the inverse probability of the test set, normalized by the number of words. It is the standard intrinsic metric used to evaluate language models in natural language processing (NLP).

Formula for Perplexity: For a test set W = w_1 w_2 \ldots w_N, the perplexity is defined as:

perplexity(W) = P(w_1 w_2 \ldots w_N)^{-\frac{1}{N}}

Using the chain rule, it can be expanded to:

perplexity(W) = \left( \prod_{i=1}^{N} \frac{1}{P(w_i \mid w_1 \ldots w_{i-1})} \right)^{\frac{1}{N}}

Interpretation: A higher probability of the word sequence results in lower perplexity. Lower perplexity indicates a better language model. Minimizing perplexity is equivalent to maximizing the probability of the test set according to the language model.

Reason for Using Inverse Probability: The inverse relationship in perplexity arises from its original definition in terms of the cross-entropy rate in information theory.

Perplexity Calculation for Different Language Models:

Unigram Model: Perplexity is the geometric mean of the inverse unigram probabilities:

perplexity(W) = \left( \prod_{i=1}^{N} \frac{1}{P(w_i)} \right)^{\frac{1}{N}}

Bigram Model: Perplexity is the geometric mean of the inverse bigram probabilities:

perplexity(W) = \left( \prod_{i=1}^{N} \frac{1}{P(w_i \mid w_{i-1})} \right)^{\frac{1}{N}}

Inclusion of Sentence Markers: Sentence markers such as <s> and </s> can be included in the probability computation. These markers are also included in the total count of word tokens N.
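A minimal sketch of how perplexity can be computed for a bigram model, working in log space to avoid numerical underflow. The probabilities supplied below are illustrative values only, and the model must already be smoothed so that no probability is zero:

```python
# Minimal sketch: perplexity of a test sequence under a bigram model,
# computed in log space: PP(W) = exp(-(1/N) * sum(log P(w_i | w_{i-1}))).
import math

def perplexity(test_tokens, bigram_prob):
    """bigram_prob(prev, word) must return P(word | prev) > 0 (i.e. smoothed)."""
    log_prob = 0.0
    n = 0
    for prev, word in zip(test_tokens, test_tokens[1:]):
        log_prob += math.log(bigram_prob(prev, word))
        n += 1
    return math.exp(-log_prob / n)

# Toy example with made-up probabilities, for illustration only.
probs = {("<s>", "I"): 0.67, ("I", "am"): 0.67, ("am", "Sam"): 0.5, ("Sam", "</s>"): 0.5}
print(perplexity(["<s>", "I", "am", "Sam", "</s>"], lambda p, w: probs[(p, w)]))
```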
Comparison of Different N-Gram Models: Perplexity is used to compare different n-gram models. Example from Wall Street Journal corpus: – Unigram: Perplexity = 962 – Bigram: Perplexity = 170 – Trigram: Perplexity = 109 More information from higher-order n-grams (e.g., trigrams) results in higher probabilities and lower perplexities. Model Training and Testing: N-gram models must be constructed without any knowledge of the test set to avoid artificially low perplexity. Perplexity comparisons are valid only when models use identical vocabularies. Perplexity and Task Performance: Intrinsic improvement in perplexity does not guarantee extrinsic improvement in tasks like speech recognition or machine translation. Perplexity usually correlates with task improvements and is a common evaluation metric. Improvements in perplexity should be confirmed by end-to-end evaluation on real tasks. 13 Generalization and Zeros Dependence on Training Corpus: The n-gram model’s probabilities heavily depend on the training corpus used to build it. This means the probabilities encoded in the model reflect specific linguistic patterns and frequencies present in the training data. Improvement with Larger N : As the value of N (the order of the n-gram) increases, the model better captures and models the intricacies of the training corpus. Larger N allows the model to incorporate more context, leading to more accurate predictions based on longer sequences of words. Visualization through Sampling: Sampling from an n-gram model provides a method to visualize these facts. By generating sentences based on their likelihood as defined by the model, we can observe how well the model reflects the linguistic patterns and frequencies of the training corpus. Example Suppose we have a corpus of movie reviews and we train n-gram models with different values of N : For N = 2 (bigram model), the model captures immediate word dependencies (e.g., ”good movie”). Page 19 — Prepared by: Jeena AT/AP/AIDS/SEC For N = 3 (trigram model), the model can capture more complex dependencies (e.g., ”special effects were”). By sampling from these models, we can see how sentences generated align with the style and content of the original movie reviews. This approach not only helps in understanding how n-gram models encode specific corpus facts but also illustrates their ability to improve with higher N , thereby modeling the training data more accurately. Sparse Data Issues: N-gram models encounter sparsity issues due to the vast number of possible n-grams compared to the limited data available. This sparsity leads to situations where some valid word sequences have zero occurrences in the training data. Genre and Dialect Matching: To build effective language models, it’s crucial to use training data that matches the genre or dialect of the target application. Different genres or dialects may have distinct linguistic features and vocabulary, affecting the perfor- mance of the model. Handling Zero Probabilities: Zero probabilities for word sequences in the test set can cause problems such as inability to compute perplexity. Solutions like smoothing algorithms redistribute probability mass to unseen events to mitigate the impact of zero probabilities. 14 Handling Unknown Words in Language Models Handling Unknown Words (OOV): For models using word-based vocabularies, encountering unknown words (OOV) in the test set poses a challenge. An token can be introduced to represent unknown words during both training and testing phases. 
Most contemporary language models prefer closed vocabulary approaches where a fixed vocabulary is defined beforehand to avoid the use of an token. Creating an Open Vocabulary System: To accommodate unknown words in an open vocabulary system: 1. Choose a fixed vocabulary list before training. 2. Normalize the training text by replacing any word not in the fixed vocabulary with the token. 3. Estimate probabilities for the token based on its occurrences in the training data. Effect on Perplexity: The approach used to handle unknown words (e.g., using ) affects the perplexity of the lan- guage model. Perplexity comparisons between models using are valid only if they share identical vocabu- laries. Page 20 — Prepared by: Jeena AT/AP/AIDS/SEC Closed Vocabulary vs. Open Vocabulary: In a closed vocabulary system, the test set is restricted to words from a predefined lexicon known during training, avoiding issues with unknown words (out-of-vocabulary, OOV). Modern neural language models often utilize subword tokenization (e.g., Byte Pair Encoding, BPE) to represent unknown words as sequences of smaller subword units, ensuring comprehensive coverage even for previously unseen words. Closed vs. Open Vocabulary Understanding the difference between closed and open vocabulary is essential in the context of language modeling and natural language processing. Here’s a detailed breakdown: Closed Vocabulary Definition: A closed vocabulary consists of a fixed set of words that the model is limited to using. Characteristics: – The vocabulary is predefined and does not change. – Any word not in the vocabulary is typically treated as an unknown token (e.g., ). – Suitable for applications with a limited and specific domain, where the vocabulary is known in advance. Advantages: – Simpler to manage and implement. – Memory and storage requirements are fixed and predictable. – Faster lookup and processing since the vocabulary size is limited. Disadvantages: – Unable to handle out-of-vocabulary (OOV) words effectively. – Not suitable for dynamic or diverse domains with constantly evolving vocabulary. Open Vocabulary Definition: An open vocabulary allows for the inclusion of new words that were not seen during the initial training of the model. Characteristics: – The vocabulary can grow and adapt over time as new words are encountered. – Utilizes techniques such as subword units, byte-pair encoding (BPE), or character-level models to handle new words. – Suitable for applications with a broad or dynamic domain, where new terms frequently emerge. Advantages: – Capable of handling OOV words effectively. – Better suited for real-world applications where language is dynamic and evolving. – More flexible in dealing with rare or unique words. Disadvantages: – More complex to implement and manage. – Higher memory and storage requirements, as the vocabulary size can grow indefinitely. – Potentially slower lookup and processing due to a larger and dynamic vocabulary. Page 21 — Prepared by: Jeena AT/AP/AIDS/SEC 15 Handling Unseen Contexts in Language Models 15.1 Handling Unseen Contexts Even if words are in the vocabulary, they might appear in unseen contexts in the test set. A language model can assign zero probability to these unseen events, which is problematic. To avoid this, smoothing or discounting techniques are used. 15.2 Smoothing (Discounting) Smoothing adjusts the probability distribution by reallocating some probability mass from frequent events to unseen events. 
This helps provide non-zero probabilities to unseen n-grams. Types of Smoothing 1. Laplace Smoothing (Add-One Smoothing) – Adds a small constant (usually 1) to the counts of all n-gram counts. – Ensures no event has zero probability. – Normalizes probabilities by dividing by the total number of events plus the constant. – Easy to implement and understand. – Drawbacks: ∗ Assumes all events are equally likely, which is unrealistic for natural language. ∗ Can overestimate the probabilities of rare events. 2. Add-K Smoothing – A generalization of Laplace smoothing where a small value k is added to the n-gram counts. – Allows for more flexible adjustment of counts compared to add-one smoothing. 3. Backoff Smoothing (Katz Smoothing) – Uses lower-order models to estimate the probabilities of higher-order models. – Example: Uses a bigram model when there is enough evidence and falls back to a unigram model otherwise. – Avoids zero probabilities for unseen bigrams and discounts the probabilities of seen bigrams. – More flexible and realistic than additive smoothing. – Requires more parameters and computation. 4. Kneser-Ney Smoothing – A more complex and effective smoothing method. – Considers the frequency of the lower-order n-grams (back-off n-grams) to better estimate the probability of unseen n-grams. – Adjusts the probability based on the frequency of n-grams that can follow given contexts, improv- ing performance in handling unseen events. Page 22 — Prepared by: Jeena AT/AP/AIDS/SEC 5. Interpolation Smoothing (Jelinek-Mercer Smoothing) – Combines lower-order and higher-order models using a weighted average. – Example: An interpolated bigram model uses a linear combination of bigram and unigram prob- abilities. – Weights are usually determined by maximizing the likelihood of a held-out data set. – More effective and efficient than backoff smoothing. – Requires more data and tuning. 6. Neural Smoothing (Neural Language Modeling) – Uses neural networks to learn the probabilities of language models. – Neural networks encode the context of a word or sequence and decode the probability of the next word or sequence. – Captures long-range dependencies and semantic similarities that other techniques cannot. – Can generate novel and coherent texts. – Most advanced and powerful smoothing technique. – Requires the most data and resources. Importance of Smoothing Smoothing is essential for enhancing the generalization ability of language models. It prevents zero probabilities for unseen n-grams, which is crucial for calculating metrics like perplexity and improving model performance. 15.3 Laplace Smoothing In the case of Unigram Models: (i) Unsmoothed Maximum Likelihood Estimate (MLE) The unsmoothed MLE of the unigram probability P (wi ) is the count ci of the word wi normalized by the total number of word tokens N : ci P (wi ) = N (ii) Laplace Smoothing Laplace smoothing adds 1 to each count. This adjustment ensures that no probability is zero, which is particularly useful for dealing with unseen words in the training data. 1. Add one to each count: cLaplace i = ci + 1 2. Adjust the denominator to account for the V additional counts (where V is the vocabulary size): ci + 1 PLaplace (wi ) = N +V Page 23 — Prepared by: Jeena AT/AP/AIDS/SEC (iii) Adjusted Count Representation It can be useful to define an adjusted count c′i , which allows us to see the effect of smoothing on the counts directly. 1. Calculate the adjusted count: N c′i = (ci + 1) N +V 2. 
Normalize the adjusted count to get the probability: c′i PLaplace (wi ) = N In the case of Bigram Models (i) Unsmoothed Maximum Likelihood Estimate (MLE) The unsmoothed MLE of the bigram probability P (wn |wn−1 ) is the count C(wn−1 wn ) normalized by the count of the preceding word C(wn−1 ): C(wn−1 wn ) P (wn |wn−1 ) = C(wn−1 ) (ii) Laplace Smoothing For bigrams, we again add 1 to each count and adjust the denominator to account for the V possible following words. 1. Add one to each bigram count: CLaplace (wn−1 wn ) = C(wn−1 wn ) + 1 2. Adjust the denominator by adding the vocabulary size V to the count of the preceding word: C(wn−1 wn ) + 1 PLaplace (wn |wn−1 ) = C(wn−1 ) + V (iii) Discounting and Conceptual Understanding 1. Discounting and Redistribution: - Smoothing can be viewed as discounting the observed counts and redistributing the probability mass to the zero counts. - For instance, in a unigram model, the count ci is adjusted to c′i = (ci + 1) NN +V. 2. Relative Discount dc : - The relative discount dc is the ratio of the discounted counts to the original counts: c′ dc = i ci Why Use Laplace Smoothing? - Zero Probabilities: - Without smoothing, any unseen word or n-gram in the training data would have a probability of zero in the model, which is problematic for many applications. - Baseline Method: - Though not the most effective smoothing method for modern n-gram models, Laplace smoothing provides a simple and intuitive way to ensure all probabilities are non-zero. Page 24 — Prepared by: Jeena AT/AP/AIDS/SEC 15.4 Add-k Smoothing Add-k smoothing is a variation of smoothing techniques used in language modeling to handle unseen events (or events with zero counts) more gracefully than add-one smoothing. The main idea behind add-k smooth- ing is to add a fractional count k to each observed count before normalizing probabilities. This helps in distributing the probability mass more evenly between seen and unseen events. The probability PAdd-k (wn | wn−1 ) for a word wn given the previous word wn−1 is computed using the formula: C(wn−1 wn ) + k PAdd-k (wn | wn−1 ) = C(wn−1 ) + kV where: C(wn−1 wn ) is the count of the bigram (wn−1 , wn ) in the training corpus. C(wn−1 ) is the count of the unigram wn−1 in the training corpus. V is the vocabulary size (total number of unique words). k is a smoothing parameter, typically a small positive value. 1. Count Adjustment: Instead of adding 1 (as in add-one smoothing), add-k smoothing adds k to each observed count C(wn−1 wn ). This helps in mitigating the harshness of assigning zero probability to unseen events, especially when k is chosen carefully. 2. Normalization: After adjusting the counts, probabilities are normalized to ensure they sum up to 1. This is achieved by dividing the adjusted count C(wn−1 wn ) + k by C(wn−1 ) + kV. 3. Smoothing Parameter k: Choice of k: k is a hyperparameter that needs to be chosen. It is often determined empirically or through optimization on a development set (devset). Impact of k: When k is very small (e.g., k = 0.01), it adds a minimal amount of smoothing, which might still result in sharp changes in probability estimates. Larger values of k (e.g., k = 0.5) smooth more aggressively, reducing the impact of observed counts and moving probabilities closer to uniform distribution. 4. Effectiveness: Add-k smoothing is useful in scenarios where a minimal amount of smoothing is necessary but add-one smoothing is too aggressive. 
It can help in tasks like text classification where smoothing helps avoid zero probabilities for unseen words. Limitations Variance: Add-k smoothing can lead to poor variance in probability estimates, especially when k is too small or too large. Inappropriate Discounts: The choice of k can affect how probabilities are distributed, sometimes leading to underestimation or overestimation of probabilities. In summary, add-k smoothing provides a flexible approach to smoothing probabilities in language mod- eling by introducing a smoothing parameter k that balances between observed and unseen events more effectively than add-one smoothing in some contexts. However, its performance can vary depending on the task and the optimal choice of k. Page 25 — Prepared by: Jeena AT/AP/AIDS/SEC 15.5 Backoff and Interpolation Backoff Approach Definition: In a backoff model, if a higher-order n-gram (e.g., trigram) has zero counts in the training data, we approximate its probability by ”backing off” to a lower-order n-gram (e.g., bigram or unigram) that does have non-zero counts. Purpose: This approach is used to generalize better in contexts where the model hasn’t learned enough about higher-order n-grams. Usage: We only resort to a lower-order n-gram if we have insufficient evidence (zero counts) for the higher-order n-gram. Interpolation Approach Definition: Interpolation involves mixing the probability estimates from multiple n-gram models (trigram, bigram, and unigram) by assigning weights to each model. Linear Interpolation: The trigram probability P̂ (wn | wn−2 wn−1 ) is estimated by combining the unigram, bigram, and trigram probabilities with weights λ1 , λ2 , and λ3 respectively: P̂ (wn | wn−2 wn−1 ) = λ1 P (wn ) + λ2 P (wn | wn−1 ) + λ3 P (wn | wn−2 wn−1 ) where λ1 + λ2 + λ3 = 1. Context-Conditioned Interpolation: Weights λi can be conditioned on the context to improve accuracy, especially when certain n-grams are more reliable based on available counts. Setting λ Values Held-Out Corpus: To determine optimal weights λi , a held-out corpus (not used for training) is used. Optimization: The weights λi are chosen to maximize the likelihood of the held-out corpus, ensuring the model generalizes well to unseen data. Backoff with Discounting (Katz Backoff ) Discounting: In backoff models like Katz backoff, probabilities of higher-order n-grams are discounted to reserve some probability mass for lower-order n-grams when backing off. Katz Backoff Formula: For a backoff n-gram PBO (wn | wn−N +1:n−1 ): ( PMLE (wn | wn−N +1:n−1 ) if C(wn−N +1:n ) > 0 PBO (wn | wn−N +1:n−1 ) = a(wn−N +1:n−1 )PBO (wn | wn−N +2:n−1 ) otherwise where PMLE is the maximum likelihood estimate and a(wn−N +1:n−1 ) is a function to distribute the probability mass to lower-order n-grams. Good-Turing Smoothing with Katz Backoff Integration: Katz backoff is often combined with Good-Turing smoothing, which adjusts probability estimates to handle unseen events more effectively. In summary, while backoff and interpolation both address the issue of unseen n-grams in language mod- eling, they do so with different strategies. Backoff simplifies to lower-order n-grams when higher-order n-grams are unseen, while interpolation combines probabilities from multiple n-grams based on weighted averages. These techniques improve the robustness and generalization capability of n-gram language models in handling unseen contexts effectively. 
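A minimal sketch of simple linear interpolation over unigram, bigram, and trigram counts. The function name and the fixed λ weights are illustrative assumptions; in practice the weights are tuned on a held-out corpus:

```python
# Minimal sketch: interpolated trigram probability
# P̂(w_n | w_{n-2}, w_{n-1}) = λ1*P(w_n) + λ2*P(w_n | w_{n-1}) + λ3*P(w_n | w_{n-2}, w_{n-1}).
def interpolated_prob(w, prev2, prev1, unigrams, bigrams, trigrams, total_tokens,
                      lambdas=(0.1, 0.3, 0.6)):
    """unigrams/bigrams/trigrams map words or word tuples to raw counts;
    lambdas must sum to 1 (fixed here, tuned on held-out data in practice)."""
    l1, l2, l3 = lambdas
    p_uni = unigrams.get(w, 0) / total_tokens
    p_bi = bigrams.get((prev1, w), 0) / unigrams[prev1] if prev1 in unigrams else 0.0
    p_tri = (trigrams.get((prev2, prev1, w), 0) / bigrams[(prev2, prev1)]
             if (prev2, prev1) in bigrams else 0.0)
    return l1 * p_uni + l2 * p_bi + l3 * p_tri
```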
15.6 Stupid Backoff Algorithm

- Definition: Stupid Backoff is a simplified backoff algorithm used in language modeling. It simplifies the model by not attempting to maintain a proper probability distribution.

- Handling Unseen N-grams: If a higher-order n-gram (e.g., trigram or higher) has zero counts in the training data, instead of using smoothing techniques or complex backoff strategies, Stupid Backoff directly "backs off" to a lower-order n-gram (e.g., bigram or unigram). This is done with a fixed, context-independent weight \lambda.

- Probability Estimation: For an n-gram w_{i-N+1} \dots w_i:

  S(w_i \mid w_{i-N+1}^{i-1}) =
  \begin{cases}
    \dfrac{\text{count}(w_{i-N+1}^{i})}{\text{count}(w_{i-N+1}^{i-1})} & \text{if } \text{count}(w_{i-N+1}^{i}) > 0 \\
    \lambda \cdot S(w_i \mid w_{i-N+2}^{i-1}) & \text{otherwise}
  \end{cases}

  Here,
  - count(w_{i-N+1}^{i}) represents the frequency of the n-gram w_{i-N+1} \dots w_i,
  - count(w_{i-N+1}^{i-1}) represents the frequency of the (n-1)-gram w_{i-N+1} \dots w_{i-1},
  - \lambda is a fixed weight, typically chosen empirically.

- Termination: The backoff process terminates at the unigram level, where the score is estimated as S(w_i) = count(w_i) / N. Here, N is the total number of tokens in the training corpus.

- Advantages and Limitations:
  - Advantages:
    - Simplicity: Easy to implement and understand.
    - Efficiency: Requires less computational overhead compared to more complex smoothing techniques.
  - Limitations:
    - Not a Probability Distribution: Does not ensure that the estimated scores sum up to 1, hence not suitable for tasks requiring true probability distributions.
    - Context-Independence: Uses a fixed weight \lambda irrespective of the specific context, which may not capture nuanced language patterns effectively.

- Empirical Findings: According to Brants et al. (2007), a value of \lambda = 0.4 has been found to work well in practice for various language modeling tasks.
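A compact Python sketch of the Stupid Backoff score is given below. The toy corpus and function name are illustrative assumptions; the default \lambda = 0.4 follows the Brants et al. (2007) value quoted above:

```python
from collections import Counter

def stupid_backoff(ngram, counts, n_tokens, lam=0.4):
    """Relative-frequency score with backoff; not a true probability distribution.
    `counts` maps token tuples of any length to their corpus frequency."""
    if len(ngram) == 1:                              # unigram base case: count(w) / N
        return counts[ngram] / n_tokens
    full, context = counts[ngram], counts[ngram[:-1]]
    if full > 0 and context > 0:                     # n-gram observed: relative frequency
        return full / context
    return lam * stupid_backoff(ngram[1:], counts, n_tokens, lam)  # back off one order

# illustrative counts over a tiny corpus (1-, 2-, and 3-gram frequencies)
tokens = "i am sam sam i am i do not like ham".split()
counts = Counter()
for n in (1, 2, 3):
    counts.update(zip(*(tokens[i:] for i in range(n))))

print(stupid_backoff(("i", "am"), counts, len(tokens)))           # seen bigram: 2/3
print(stupid_backoff(("sam", "do", "not"), counts, len(tokens)))  # unseen trigram backs off: 0.4
```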
15.7 Kneser-Ney Smoothing

Kneser-Ney smoothing is an advanced technique used in language modeling to handle the problem of zero probabilities and better capture the distribution of n-grams. It is particularly effective for large-scale language models. Here is a detailed breakdown of the technique:

1. Absolute Discounting

Kneser-Ney smoothing builds on the concept of absolute discounting. The formula for interpolated absolute discounting for bigrams is:

P_{\text{AbsoluteDiscounting}}(w_i \mid w_{i-1}) = \frac{C(w_{i-1} w_i) - d}{C(w_{i-1})} + \lambda(w_{i-1}) P(w_i)    (12)

where:
- C(w_{i-1} w_i) is the count of the bigram (w_{i-1}, w_i),
- d is the discount value, typically in the range 0 \le d \le 1,
- C(w_{i-1}) is the count of the unigram w_{i-1},
- \lambda(w_{i-1}) is the interpolation weight applied to the unigram distribution P(w_i).

Setting the Discount Value: A common choice is d = 0.75 for all counts. Another approach is to set d from the numbers of n-grams seen once and twice. For example:

d = \frac{n_1}{n_1 + 2 n_2}    (13)

where n_1 is the number of n-grams with count 1 and n_2 is the number with count 2.

2. Kneser-Ney Discounting

Kneser-Ney smoothing enhances absolute discounting by introducing a sophisticated method to handle lower-order unigrams. The main idea is to improve the unigram model to better capture the likelihood of words appearing in new contexts.

Continuation Probability: Define P_{\text{CONTINUATION}}(w) as the probability of a word w being a novel continuation:

P_{\text{CONTINUATION}}(w) \propto \left| \{ v : C(v\,w) > 0 \} \right|    (14)

Normalize this count by the total number of bigram types:

P_{\text{CONTINUATION}}(w) = \frac{\left| \{ v : C(v\,w) > 0 \} \right|}{\left| \{ (u', w') : C(u'\,w') > 0 \} \right|}    (15)

Interpolated Kneser-Ney Smoothing:

P_{\text{KN}}(w_i \mid w_{i-1}) = \frac{\max(C(w_{i-1} w_i) - d, 0)}{C(w_{i-1})} + \lambda(w_{i-1}) P_{\text{CONTINUATION}}(w_i)    (16)

where:
- the first term is the discounted bigram probability,
- the second term is the interpolation weight \lambda(w_{i-1}) multiplied by the continuation probability.

Interpolation Weight:

\lambda(w_{i-1}) = \frac{d}{\sum_{v} C(w_{i-1} v)} \left| \{ w : C(w_{i-1} w) > 0 \} \right|    (17)

General Recursive Formulation:

P_{\text{KN}}(w_i \mid w_{i-N+1}^{i-1}) = \frac{\max(c_{\text{KN}}(w_{i-N+1}^{i}) - d, 0)}{c_{\text{KN}}(w_{i-N+1}^{i-1})} + \lambda(w_{i-N+1}^{i-1}) P_{\text{KN}}(w_i \mid w_{i-N+2}^{i-1})    (18)

where:
- c_{\text{KN}}(w_{i-N+1}^{i}) is the count of the full n-gram and c_{\text{KN}}(w_{i-N+1}^{i-1}) is the count of its (n-1)-word context,
- c_{\text{KN}}(\cdot) denotes the regular count for the highest-order n-gram and the continuation count for lower-order n-grams.

Termination at Unigrams:

P_{\text{KN}}(w) = \frac{\max(c_{\text{KN}}(w) - d, 0)}{\sum_{w'} c_{\text{KN}}(w')}    (19)

If an unknown word \langle UNK \rangle is included, its probability is calculated similarly, using a uniform distribution weighted by \lambda(\emptyset).

Modified Kneser-Ney Smoothing:
- Uses different discount values d_1, d_2, and d_3 for n-grams with counts of 1, 2, and 3 or more, respectively.
- Provides improved performance over basic Kneser-Ney by adjusting discount values based on n-gram counts.

16 Problem Exercises

Problem 1: Bigrams and Maximum Likelihood Estimation

Suppose we have a corpus of text containing the following sentences:

1. "I love natural language processing."
2. "Natural language understanding is fascinating."
3. "Processing text is a fundamental task."

We want to compute the probabilities of bigrams (2-grams) using maximum likelihood estimation.

Solution:

1. Tokenization and Counting: Tokenize the corpus into words and count the occurrences of each word and each bigram.

2. Calculate Probabilities: Compute the probabilities of each bigram using maximum likelihood estimation:

   P(w_i \mid w_{i-1}) = \frac{\text{Count}(w_{i-1}, w_i)}{\text{Count}(w_{i-1})}

   where:
   - w_{i-1} is the first word of the bigram,
   - w_i is the second word of the bigram,
   - Count(w_{i-1}, w_i) is the number of times the bigram w_{i-1} followed by w_i occurs,
   - Count(w_{i-1}) is the number of times the word w_{i-1} occurs.

3. Example Calculation:

   For the bigram "natural language":
   - Count("natural language") = 2 (from sentences 1 and 2)
   - Count("natural") = 2
   - P("language" | "natural") = 2/2 = 1

   For the bigram "language processing":
   - Count("language processing") = 1 (from sentence 1)
   - Count("language") = 2
   - P("processing" | "language") = 1/2 = 0.5
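The counts in Problem 1 can be reproduced with a few lines of Python. Lowercasing the corpus and stripping the final period (so that "Natural" and "natural" are counted together, as the solution assumes) is the only preprocessing applied in this sketch:

```python
from collections import Counter

corpus = [
    "I love natural language processing.",
    "Natural language understanding is fascinating.",
    "Processing text is a fundamental task.",
]

unigrams, bigrams = Counter(), Counter()
for sentence in corpus:
    # lowercase and drop the trailing period before tokenizing
    toks = sentence.lower().rstrip(".").split()
    unigrams.update(toks)
    bigrams.update(zip(toks, toks[1:]))

def p_mle(w_prev, w):
    """Unsmoothed MLE bigram probability: Count(w_prev, w) / Count(w_prev)."""
    return bigrams[(w_prev, w)] / unigrams[w_prev]

print(p_mle("natural", "language"))     # 2 / 2 = 1.0
print(p_mle("language", "processing"))  # 1 / 2 = 0.5
```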
Problem 2:

We are given the following corpus, modified from the one in the chapter:

I am Sam
Sam I am
I am Sam
I do not like green eggs and Sam

Using a bigram language model with add-one smoothing, what is P(Sam | j, am)? Include ⟨s⟩ and ⟨/s⟩ in your counts just like any other token.

Solution:

Step 1: Tokenize the corpus and create counts

Tokenized sentences:

⟨s⟩ I am Sam ⟨/s⟩
⟨s⟩ Sam I am ⟨/s⟩
⟨s⟩ I am Sam ⟨/s⟩
⟨s⟩ I do not like green eggs and Sam ⟨/s⟩

Unigrams and their counts:

⟨s⟩: 4, I: 4, am: 3, Sam: 4, ⟨/s⟩: 4, do: 1, not: 1, like: 1, green: 1, eggs: 1, and: 1

Vocabulary size V = 11 (total unique tokens)

Step 2: Calculate bigram counts and probabilities with add-one smoothing

Bigram counts:

(⟨s⟩, I): 3, (I, am): 3, (am, Sam): 2, (am, ⟨/s⟩): 1, (Sam, ⟨/s⟩): 3, (Sam, I): 1, (I, do): 1, (do, not): 1, (not, like): 1, (like, green): 1, (green, eggs): 1, (eggs, and): 1, (and, Sam): 1

With add-one smoothing, the bigram probability P(w_i | w_{i-1}) is:

P(w_i \mid w_{i-1}) = \frac{C(w_{i-1}, w_i) + 1}{C(w_{i-1}) + V}

Step 3: Calculate P(Sam | j, am)

A bigram model conditions each word only on the single preceding token, so the history "j am" reduces to the preceding word "am":

P(\text{Sam} \mid \text{j, am}) = P(\text{Sam} \mid \text{am}) = \frac{C(\text{am, Sam}) + 1}{C(\text{am}) + V} = \frac{2 + 1}{3 + 11} = \frac{3}{14} \approx 0.214

(For a context word that never appears in the corpus, such as "j" on its own, add-one smoothing would give (0 + 1)/(0 + 11) = 1/11 for every following word.)

The probability P(Sam | j, am) is 3/14 ≈ 0.214 using add-one smoothing in a bigram language model.

Problem 3:

Suppose we train a trigram language model with add-one smoothing on a given corpus. The corpus contains V word types. Express a formula for estimating P(w_3 | w_1, w_2), where w_3 is a word which follows the bigram (w_1, w_2), in terms of various n-gram counts and V. Use the notation C(w_1, w_2, w_3) to denote the number of times that trigram (w_1, w_2, w_3) occurs in the corpus, and so on for bigrams and unigrams.

Solution:

To estimate P(w_3 | w_1, w_2) using a trigram language model with add-one smoothing, we can use the following formula. Add-one smoothing (also known as Laplace smoothing) adjusts the counts to ensure that no probability is zero, even for trigrams that do not appear in the training corpus.

Let:
- C(w_1, w_2, w_3) be the count of the trigram (w_1, w_2, w_3) in the corpus,
- C(w_1, w_2) be the count of the bigram (w_1, w_2) in the corpus,
- V be the number of unique word types (vocabulary size).

With add-one smoothing, the estimated probability P(w_3 | w_1, w_2) is given by:

P(w_3 \mid w_1, w_2) = \frac{C(w_1, w_2, w_3) + 1}{C(w_1, w_2) + V}

Explanation:
- Numerator: C(w_1, w_2, w_3) + 1 adds 1 to the count of the trigram (w_1, w_2, w_3) to ensure that every possible trigram has a non-zero count, even if it did not appear in the corpus.
- Denominator: C(w_1, w_2) + V adjusts the total count of the bigram (w_1, w_2) by adding V, the number of unique word types in the corpus. This accounts for the smoothing applied to all possible next words w_3.

This formula ensures that the probability distribution is smoothed, preventing zero probabilities for unseen trigrams and maintaining a valid probability distribution across the vocabulary.
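As a numerical check on Problems 2 and 3, here is a small Python sketch of the add-one smoothed estimate at the bigram level; the helper name is illustrative, and the trigram formula of Problem 3 follows the same pattern with one more word of context:

```python
from collections import Counter

sentences = [
    "<s> I am Sam </s>",
    "<s> Sam I am </s>",
    "<s> I am Sam </s>",
    "<s> I do not like green eggs and Sam </s>",
]

unigrams, bigrams = Counter(), Counter()
for s in sentences:
    toks = s.split()
    unigrams.update(toks)
    bigrams.update(zip(toks, toks[1:]))

V = len(unigrams)  # 11 unique tokens, with <s> and </s> counted like any other token

def p_add_one(w_prev, w):
    """Add-one smoothed bigram estimate: (C(w_prev, w) + 1) / (C(w_prev) + V)."""
    return (bigrams[(w_prev, w)] + 1) / (unigrams[w_prev] + V)

print(p_add_one("am", "Sam"))  # (2 + 1) / (3 + 11) = 3/14 ~ 0.214
print(p_add_one("j", "Sam"))   # unseen context word: (0 + 1) / (0 + 11) = 1/11 ~ 0.091
```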
Problem 4:

We are given the following corpus, modified from the one in the chapter:

I am Sam
Sam I am
I am Sam
I do not like green eggs and Sam

If we use linear interpolation smoothing between a maximum-likelihood bigram model and a maximum-likelihood unigram model with \lambda_1 = \frac{1}{2} and \lambda_2 = \frac{1}{2}, what is P(Sam | am)? Include ⟨s⟩ and ⟨/s⟩ in your counts just like any other token.

Solution:

To calculate P(Sam | am) using linear interpolation smoothing, we follow these steps:

Step 1: Calculate the Maximum-Likelihood Bigram Probability

The maximum-likelihood estimate for the bigram probability P_{\text{ML}}(Sam | am) is given by:

P_{\text{ML}}(\text{Sam} \mid \text{am}) = \frac{C(\text{am, Sam})}{C(\text{am})}

Count the occurrences:
- C(am, Sam) = 2 (from "I am Sam", which occurs twice)
- C(am) = 3 (from "I am Sam" twice and "Sam I am")

Thus,

P_{\text{ML}}(\text{Sam} \mid \text{am}) = \frac{2}{3}

Step 2: Calculate the Maximum-Likelihood Unigram Probability

The maximum-likelihood estimate for the unigram probability P_{\text{ML}}(Sam) is given by:

P_{\text{ML}}(\text{Sam}) = \frac{C(\text{Sam})}{N}

where N is the total number of tokens in the corpus, counting ⟨s⟩ and ⟨/s⟩ like any other token.

Count the occurrences:
- C(Sam) = 4 (one occurrence in each of the four sentences)
- N = 25 (three sentences of 5 tokens each and one sentence of 10 tokens, including ⟨s⟩ and ⟨/s⟩)

Thus,

P_{\text{ML}}(\text{Sam}) = \frac{4}{25}

Step 3: Combine the Probabilities Using Linear Interpolation

Using the given interpolation weights \lambda_1 = \frac{1}{2} and \lambda_2 = \frac{1}{2}, we combine the probabilities:

P(\text{Sam} \mid \text{am}) = \lambda_1 \cdot P_{\text{ML}}(\text{Sam} \mid \text{am}) + \lambda_2 \cdot P_{\text{ML}}(\text{Sam})

Substitute the values and simplify:

P(\text{Sam} \mid \text{am}) = \frac{1}{2} \cdot \frac{2}{3} + \frac{1}{2} \cdot \frac{4}{25} = \frac{1}{3} + \frac{2}{25} = \frac{25}{75} + \frac{6}{75} = \frac{31}{75} \approx 0.413

The probability P(Sam | am) using linear interpolation smoothing is 31/75 ≈ 0.413.

Problem 5:

You are given a training set of 100 numbers that consists of 91 zeros and 1 each of the other digits 1-9. Now we see the following test set: 0 0 0 0 0 3 0 0 0 0. What is the unigram perplexity?

Solution:

Step 1: Calculate the Unigram Probabilities from the Training Set

The training set consists of:
- 91 zeros
- 1 each of the digits 1 through 9

The total number of digits in the training set is:

N = 91 + 1 \times 9 = 100

The probability of each digit d is:

P(d) = \frac{C(d)}{N}

where C(d) is the count of digit d.

For d = 0:

P(0) = \frac{91}{100} = 0.91

For d = 1, 2, \dots, 9:

P(d) = \frac{1}{100} = 0.01

Step 2: Compute the Probability of the Test Set

The test set is: 0, 0, 0, 0, 0, 3, 0, 0, 0, 0. The probability of the test set W = w_1, w_2, \dots, w_{10} is:

P(W) = \prod_{i=1}^{10} P(w_i)

From the test set, 0 appears 9 times and 3 appears 1 time. Thus:

P(W) = P(0)^9 \times P(3) = 0.91^9 \times 0.01

Step 3: Calculate the Perplexity

Perplexity is calculated as:

\text{Perplexity}(W) = P(W)^{-\frac{1}{N}}

Here, N = 10 (the number of digits in the test set).

First, calculate 0.91^9:

0.91^9 \approx 0.42793

Then:

P(W) = 0.42793 \times 0.01 = 0.0042793

Finally:

\text{Perplexity}(W) = (0.0042793)^{-\frac{1}{10}} = \frac{1}{(0.0042793)^{\frac{1}{10}}}

Compute the 10th root:

(0.0042793)^{\frac{1}{10}} \approx 0.5796

Thus:

\text{Perplexity}(W) \approx \frac{1}{0.5796} \approx 1.725

The unigram perplexity of the test set is approximately 1.73.
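As a numerical check on Problem 5, the perplexity can be computed directly; this sketch simply re-implements the formulas above, working in log space to avoid underflow:

```python
import math

# training distribution from Problem 5: 91 zeros and one each of the digits 1-9
train_counts = {0: 91, **{d: 1 for d in range(1, 10)}}
N_train = sum(train_counts.values())             # 100
p = {d: c / N_train for d, c in train_counts.items()}

test = [0, 0, 0, 0, 0, 3, 0, 0, 0, 0]
log_prob = sum(math.log(p[d]) for d in test)     # log P(W)
perplexity = math.exp(-log_prob / len(test))     # P(W)^(-1/N)
print(perplexity)                                # ~ 1.73
```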