FINAL NLP PDF
Summary
This document provides notes on natural language processing topics, including parts-of-speech (POS) tagging, probabilistic models like Hidden Markov Models (HMMs), and algorithms like the Viterbi and Forward algorithms. It also includes examples and applications.
TOPIC 6 & TOPIC 7

1. Part-of-Speech (POS) Tagging:
- The process of assigning lexical class markers (tags) to words in a corpus.
- Useful for speech recognition, word sense disambiguation, and other NLP tasks.

2. Word Classes:
- Closed Class: A fixed set of grammatical function words (e.g., prepositions, conjunctions).
- Open Class: An expanding set of content words (e.g., nouns, verbs, adjectives).

3. Tagsets:
- Penn Treebank Tagset: The most commonly used, with 45 tags.
- C5 Tagset: Used for the British National Corpus (BNC), with 61 tags.

4. Ambiguities in POS Tagging:
- Words like "book" or "like" can take multiple POS tags depending on context.
- Context-based algorithms resolve these ambiguities.

5. Approaches to POS Tagging:
- Rule-Based: Uses handcrafted linguistic rules.
- Learning-Based: Relies on annotated corpora and machine learning (e.g., Naïve Bayes, Neural Networks, HMMs).

6. Probabilistic Models:
- Hidden Markov Models (HMMs): Assume the next state depends only on the current state.
- Conditional Random Fields (CRFs): Consider global dependencies for sequence labeling.

7. Training and Evaluation:
- The training phase estimates word-tag (emission) and tag-transition probabilities.
- Evaluation involves calculating metrics such as precision, recall, and F-measure.

8. Sequence Labeling Problem:
- Classifies each token in a sequence, considering dependencies between neighboring tokens.

FILL IN THE BLANK
1. The process of assigning a lexical class marker to each word in a corpus is called _______. Answer: Part-of-Speech Tagging
2. Words like "in" and "on" are part of the _______ class, which has a fixed membership. Answer: Closed
3. The _______ tagset consists of 45 tags and is widely used in NLP. Answer: Penn Treebank
4. Rule-based POS tagging relies on rules that are _______-crafted based on linguistic knowledge. Answer: Human
5. In probabilistic sequence models, the _______ assumes the next state depends only on the current state. Answer: Hidden Markov Model (HMM)
6. Training data is typically split into _______ for model training and _______ for testing. Answer: 90%, 10%
7. The metric that calculates the harmonic mean of precision and recall is called the _______. Answer: F-measure
8. The _______ problem arises when contexts to be tagged do not appear in the training data. Answer: Sparse data
9. The _______ model uses probabilities of word-tag pairs and tag transitions. Answer: Bigram Tagger
10. When using a sliding-window approach, the classification of a token considers its _______ tokens. Answer: Neighboring

TOPIC 8

1. Markov Chain:
- A model representing transitions between states with associated probabilities.
- The transition probabilities leaving a state must sum to 1.
- It cannot represent ambiguity.

2. Hidden Markov Model (HMM):
- An extension of Markov Chains in which the states are hidden (non-observable).
- Example: POS tagging, where words are observed but tags (states) are hidden.
- Components:
  - States (Q): Hidden variables (e.g., Hot/Cold, POS tags).
  - Observations (O): Observable data (e.g., ice creams eaten, words in a sentence).
  - Transition Probabilities (A): Likelihood of moving from one state to another.
  - Emission Probabilities (B): Likelihood of observations given a state.
  - Initial Probabilities (π): Probability of starting in each state.

3. Fundamental Problems in HMMs:
- Likelihood Problem: Compute the probability of an observation sequence (solved with the Forward algorithm).
- Decoding Problem: Find the most probable sequence of hidden states (solved with the Viterbi algorithm).
- Learning Problem: Estimate the HMM parameters from observed sequences.

4. Forward Algorithm:
- Purpose: Calculate the probability of an observation sequence efficiently.
- Approach: Dynamic programming over a forward trellis, summing probabilities over all possible paths leading to each state.

5. Viterbi Algorithm:
- Purpose: Determine the most probable sequence of hidden states.
- Approach: Similar to the Forward algorithm but takes the max instead of the sum, and uses backtracking pointers to reconstruct the best path through the trellis (see the sketch after this list).

6. Applications of HMMs:
- Weather Prediction: Infer temperature (Hot/Cold) from ice cream consumption.
- POS Tagging: Assign grammatical tags to words in a sentence.
- Activity Recognition: Predict activities (e.g., walking, shopping) from observed behavior.
- Health Monitoring: Infer health conditions from symptoms or activities.

7. Variants of HMMs:
- Bakis HMM: Left-to-right transitions (e.g., modeling speech).
- Ergodic HMM: Fully connected; transitions are allowed between all states.
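The Forward and Viterbi recurrences differ only in replacing a sum with a max, so both fit in a short plain-Python sketch. The version below uses the Hot/Cold ice-cream example from the notes; the specific probability values are illustrative assumptions, not numbers from the lecture.

```python
# Forward and Viterbi for a 2-state HMM (Hot/Cold weather, observations =
# ice creams eaten per day). All probability values here are made up.

states = ["Hot", "Cold"]
pi = {"Hot": 0.8, "Cold": 0.2}                    # initial probabilities (π)
A  = {"Hot":  {"Hot": 0.6, "Cold": 0.4},          # transition probabilities
      "Cold": {"Hot": 0.5, "Cold": 0.5}}
B  = {"Hot":  {1: 0.2, 2: 0.4, 3: 0.4},           # emission probabilities
      "Cold": {1: 0.5, 2: 0.4, 3: 0.1}}

def forward(obs):
    """Likelihood problem: P(obs), summing over all state paths."""
    trellis = [{s: pi[s] * B[s][obs[0]] for s in states}]
    for o in obs[1:]:
        prev = trellis[-1]
        trellis.append({s: sum(prev[r] * A[r][s] for r in states) * B[s][o]
                        for s in states})
    return sum(trellis[-1].values())

def viterbi(obs):
    """Decoding problem: most probable state path (max instead of sum)."""
    trellis = [{s: (pi[s] * B[s][obs[0]], None) for s in states}]
    for o in obs[1:]:
        prev = trellis[-1]
        trellis.append({s: max((prev[r][0] * A[r][s] * B[s][o], r)
                               for r in states)
                        for s in states})
    best = max(states, key=lambda s: trellis[-1][s][0])
    path = [best]
    for column in reversed(trellis[1:]):          # follow backtracking pointers
        path.append(column[path[-1]][1])
    return list(reversed(path))

print(forward([3, 1, 3]))   # probability of observing 3, 1, 3 ice creams
print(viterbi([3, 1, 3]))   # most probable weather sequence, e.g. Hot-Cold-Hot
```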
FILL IN THE BLANK
1. A Markov Chain cannot represent _______, as it uniquely determines the path through the states. Answer: Ambiguity
2. The extension of Markov Chains that includes hidden states is called a _______. Answer: Hidden Markov Model (HMM)
3. In an HMM, the probability of observing a specific output given a state is known as the _______. Answer: Emission Probability
4. The _______ algorithm is used to compute the probability of an observation sequence. Answer: Forward
5. The _______ algorithm is used to find the most probable sequence of hidden states. Answer: Viterbi
6. A left-to-right HMM commonly used in speech recognition is called a _______ HMM. Answer: Bakis
7. Transition probabilities describe the likelihood of moving from one _______ to another. Answer: State
8. The dynamic programming structure used in the Forward and Viterbi algorithms is called a _______. Answer: Trellis
9. In the Forward algorithm, probabilities are computed by _______ over all possible paths leading to a state. Answer: Summing
10. In the Viterbi algorithm, probabilities are computed by taking the _______ over all possible paths leading to a state. Answer: Maximum
11. The HMM component that specifies the probability of starting in each state is called the _______. Answer: Initial Probability (π)
12. The _______ pointers in the Viterbi algorithm trace the best path through the states. Answer: Backtracking
13. The Forward and Viterbi algorithms both use _______ programming to improve computational efficiency. Answer: Dynamic

TOPIC 9

STATISTICAL PARSING

1. Overview:
- Statistical parsing uses probabilistic models to assign probabilities to parse trees.
- Helps resolve syntactic ambiguity and supports supervised and unsupervised parser learning.

2. Probabilistic Context-Free Grammar (PCFG):
- A CFG variant in which each production rule has an associated probability.
- The probabilities of the rules expanding a given non-terminal form a distribution.
- Example grammars include rules for English sentence structures with associated probabilities.

3. Treebanks:
- Corpora annotated with parse trees, e.g., the Penn Treebank.
- Treebanks provide the foundation for supervised learning of PCFGs.

4. Parsing Techniques with PCFGs:
- Use NLTK classes (e.g., InsideChartParser, ViterbiParser) for probabilistic parsing.
- Steps include defining the grammar, generating parse trees, and calculating their probabilities (a sketch follows at the end of this section).

5. Evaluation Metrics:
- PARSEVAL metrics: Recall, Precision, and F1-score (F1 = 2PR / (P + R)) measure how well parse trees align with gold-standard trees.
- Example calculations for labeled precision and recall were provided.

6. Dependency Grammar:
- Represents syntactic structure through dependencies between words rather than phrases.
- Uses directed graphs between words, which suits free word-order languages.
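To make the steps in item 4 concrete, here is a minimal probabilistic-parsing sketch with NLTK's ViterbiParser; the toy grammar and its rule probabilities are assumptions for illustration, not the example grammar from the lecture.

```python
# Parse a sentence with a toy PCFG and report the most probable derivation.
import nltk

grammar = nltk.PCFG.fromstring("""
    S   -> NP VP    [1.0]
    NP  -> Det N    [0.6]
    NP  -> 'I'      [0.4]
    VP  -> V NP     [1.0]
    Det -> 'a'      [1.0]
    N   -> 'book'   [1.0]
    V   -> 'read'   [1.0]
""")

parser = nltk.ViterbiParser(grammar)            # most probable parse only
for tree in parser.parse("I read a book".split()):
    print(tree)             # the bracketed parse tree
    print(tree.prob())      # probability of this derivation (0.24 here)
```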
SYNTACTIC PARSING

1. Phrase Structure Grammar (PSG):
- Introduced by Noam Chomsky; sentences are generated via rewrite rules.
- Focuses on deriving correct syntactic parse trees for sentences.

2. Parsing as Search:
- Explores the derivations that could produce a given string:
  - Top-Down Parsing: Starts from the root (start symbol).
  - Bottom-Up Parsing: Begins with the terminal symbols and works toward the root.

3. Parsing Strategies:
- Top-down parsers explore inconsistent options early and may generate trees that do not match the input.
- Bottom-up parsers avoid inconsistent options but may build partial structures that never reach a complete parse.

4. Examples of Parsing:
- Examples illustrate parsing given sentences with CFG rules and NLTK tools.

5. Comparison of Parsing Approaches:
- Efficiency and error tendencies are compared for the top-down and bottom-up methods.

FILL-IN-THE-BLANK QUESTIONS
1. Statistical parsing uses a _______ model to assign probabilities to parse trees. Answer: Probabilistic
2. A probabilistic version of a CFG is called a _______. Answer: Probabilistic Context-Free Grammar (PCFG)
3. A _______ is a corpus annotated with parse trees, commonly used for supervised learning. Answer: Treebank
4. In statistical parsing, the probability of a sentence is the _______ of the probabilities of all its derivations. Answer: Sum
5. The _______ algorithm efficiently determines the most probable derivation of a sentence under a PCFG. Answer: Viterbi
6. _______ parsing starts from the root of the parse tree and applies grammar rules to generate possible trees. Answer: Top-down
7. _______ parsing starts from the terminal symbols and works backward to the root. Answer: Bottom-up
8. The F1 score is the harmonic mean of _______ and _______. Answer: Precision, Recall
9. Dependency grammar represents syntactic structure as _______ between words. Answer: Dependencies
10. An alternative to phrase structure grammar is _______ grammar, which focuses on labeled binary relations between words. Answer: Dependency

TOPIC 11

Text Analytics & Sentiment Analysis

1. Definition of Sentiment Analysis:
- Focuses on analyzing people's opinions, sentiments, and emotions in text.
- Uses NLP, statistics, and machine learning to identify sentiment content.
- Also known as opinion mining.

2. Key Sentiment Analysis Concepts:
- Semantic Orientation and Polarity: Indicate positive, negative, or neutral sentiment.
- Subjective Impressions: Based on personal judgments, emotional state, or contextual polarity.

3. Levels of Sentiment Analysis:
- Document Level: Analyzes the overall sentiment of a document.
- Sentence Level: Identifies the sentiment of each sentence, distinguishing objective from subjective content.
- Entity and Aspect Level: Examines finer detail, such as opinions on specific product features.

4. Challenges in Sentiment Analysis:
- The complexity of opinions expressed in text.
- Issues such as negation, sarcasm, and other rhetorical devices.

5. Steps in Sentiment Analysis Using NLTK:
- Train classifiers with labeled data.
- Use feature extraction (e.g., the Bag of Words model) to classify sentiment (see the first sketch after this list).

6. Evaluation Techniques:
- Sentiment lexicons and Pointwise Mutual Information (PMI) (see the second sketch after this list).
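A minimal sketch of the NLTK workflow in item 5, assuming a tiny invented training set; a real system would train on a labeled corpus of reviews or posts.

```python
# Bag-of-words sentiment classification with NLTK's Naive Bayes classifier.
import nltk

# Invented labeled data: (text, sentiment) pairs.
train = [("I love this amazing phone", "pos"),
         ("helpful staff and great service", "pos"),
         ("terrible battery and bad sound", "neg"),
         ("awful screen, I hate it", "neg")]

def bag_of_words(text):
    # Bag of Words: each word is a binary feature; word order is ignored.
    return {word: True for word in text.lower().split()}

train_set = [(bag_of_words(text), label) for text, label in train]
classifier = nltk.NaiveBayesClassifier.train(train_set)

print(classifier.classify(bag_of_words("what an amazing helpful product")))  # pos
```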
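For item 6, PMI measures how much more often two words co-occur than independence would predict: PMI(x, y) = log2(P(x, y) / (P(x) P(y))). A small worked sketch with invented counts:

```python
# PMI for a candidate collocation, computed from raw corpus counts (made up).
import math

N        = 100_000   # total bigrams in the corpus
count_x  = 150       # occurrences of "ice"
count_y  = 120       # occurrences of "cream"
count_xy = 80        # occurrences of the bigram "ice cream"

pmi = math.log2((count_xy / N) / ((count_x / N) * (count_y / N)))
print(f"PMI(ice, cream) = {pmi:.2f}")   # high positive PMI -> strong association
```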
Information Extraction & Named Entity Recognition (NER)

1. Definition of NER:
- Locates and classifies entities in text into categories such as names of persons, organizations, and locations.
- Enhances the meaning of text documents through information extraction.

2. Applications of NER:
- Customer support systems, for identifying issues.
- Resume filtering, by extracting required skills.
- Healthcare data analysis, for identifying symptoms and diseases.

3. Types of NER Systems:
- Dictionary-Based: Uses predefined vocabulary lists.
- Rule-Based: Relies on morphological and contextual patterns.
- Machine Learning-Based: Trains models on annotated data.
- Deep Learning-Based: Leverages non-linear feature representations.

4. NER Implementation:
- Techniques include tokenization, part-of-speech tagging, noun phrase chunking, and IOB tagging.
- Libraries such as NLTK and spaCy are used for implementation.

5. spaCy Highlights:
- Pre-trained on the OntoNotes 5 corpus.
- Supports multiple entity types and requires minimal setup (see the sketch after this list).
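A minimal NER sketch with spaCy, assuming the small English pipeline en_core_web_sm (pre-trained on OntoNotes 5) has been installed via `python -m spacy download en_core_web_sm`; the sample sentence is invented.

```python
# Extract named entities with a pre-trained spaCy pipeline.
import spacy

nlp = spacy.load("en_core_web_sm")   # pipeline trained on OntoNotes 5
doc = nlp("Apple opened a new office in Kuala Lumpur, said Tim Cook.")

for ent in doc.ents:
    # e.g. Apple ORG, Kuala Lumpur GPE, Tim Cook PERSON
    print(ent.text, ent.label_)
```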
Fill in the blanks:
1. Question: Sentiment analysis, also known as ________, uses natural language processing to identify and classify emotions in text. Answer: Opinion mining
2. Question: The three levels of sentiment analysis are ________, ________, and ________. Answer: Document level, sentence level, and entity and aspect level
3. Question: Sentiment analysis aims to classify text into three categories: ________, ________, and ________. Answer: Positive, negative, neutral
4. Question: The ________ level of sentiment analysis evaluates the overall sentiment of a document. Answer: Document
5. Question: ________ analysis examines specific attributes of an entity to determine sentiment, such as a product's features. Answer: Aspect
6. Question: The ________ model represents text as a collection of unordered words, sorted by frequency of occurrence. Answer: Bag of Words
7. Question: One major challenge in sentiment analysis is understanding ________, such as "not bad," which can invert the sentiment. Answer: Negation
8. Question: Challenges in sentiment analysis include ________, ________, and other rhetorical devices. Answer: Negation, sarcasm
9. Question: Sentiment analysis is often applied to sources like social media posts, ________, and ________. Answer: Product reviews, news articles
10. Question: ________ is an information-theoretic measure used to identify word associations or collocations in text. Answer: Pointwise Mutual Information (PMI)
11. Question: ________ is a Python library commonly used for implementing sentiment analysis with tools like classifiers and feature extraction. Answer: NLTK
12. Question: Positive sentiment words include ________, ________, and ________. Answer: Love, amazing, helpful
13. Question: Named Entity Recognition seeks to classify entities in text into predefined categories like ________, ________, and ________. Answer: Persons, locations, organizations
14. Question: The ________ approach to NER uses predefined vocabulary lists to match entities in text. Answer: Dictionary-based
15. Question: ________ NER models rely on statistical techniques and feature-based representations to detect entities. Answer: Machine learning-based
16. Question: The ________ approach to NER uses non-linear data representations to capture complex relationships in text. Answer: Deep learning-based
17. Question: IOB tagging represents tokens as ________, ________, or ________ in a chunking structure. Answer: Inside, outside, beginning
18. Question: Tools like ________ and ________ are commonly used for implementing Named Entity Recognition. Answer: NLTK, spaCy
19. Question: A rule-based NER system uses ________ rules and ________ rules to identify entities in text. Answer: Morphological, contextual
20. Question: In spaCy, NER is pre-trained on the ________ corpus, which supports multiple entity types. Answer: OntoNotes 5
21. Question: The process of grouping words into chunks like noun phrases is called ________. Answer: Chunking
22. Question: The spaCy command to extract named entities involves using the function ________. Answer: nlp()
23. Question: Named Entity Recognition seeks to locate and classify entities into categories such as ________, ________, and ________. Answer: Names of persons, organizations, locations
24. Question: ________ is a Python library widely used for NER and natural language processing. Answer: spaCy

TOPIC 12

1. Definition of Speech Recognition:
- An interdisciplinary field combining Computer Science and Computational Linguistics.
- Converts human speech into text using algorithms and related technologies.
- Known as Automatic Speech Recognition (ASR) or Speech-to-Text.

2. Trends in Speech Recognition (2024 and Beyond):
- Replacement of chat-based AI interfaces with voice input.
- Improved AI-powered voice assistants (e.g., integration of Large Language Models).
- Accessibility improvements, such as automatic captions for social media.
- Enhanced collaboration tools such as Google Duet AI.

3. Applications:
- Voice Assistants: Phones, smart devices, and cars.
- Speech-to-Text tools: Automated transcription for meetings.
- Accessibility tools: Benefiting people with disabilities.
- Security: Speaker recognition for authentication.

4. How It Works:
- Speech is digitized using a microphone and an analog-to-digital converter.
- Core techniques involve Neural Networks, Hidden Markov Models (HMMs), and Voice Activity Detectors (VADs).
- Speech signals are analyzed at 10-millisecond intervals to generate cepstral coefficients (vectors representing signal features).

5. Challenges in Speech Recognition:
- Variability in pronunciation (e.g., dialects, accents).
- Homophones (e.g., "bear" vs. "bare").
- The impact of noise and emotion.
- Difficulty in identifying pauses and prosody.

6. Data and Formats:
- Common audio formats: WAV, MP3, M4A, WMA.
- Sampling rates: telephony systems use 8 kHz; human hearing ranges from 20 Hz to 20,000 Hz.

7. Speech Analysis Applications:
- Speaker Diarization: Identifying "who spoke when."
- Emotion Classification: Detecting speech emotions such as happiness or anger.
- Text-to-Speech: Generating natural-sounding speech.

8. Python Packages for Speech Recognition:
- SpeechRecognition (includes a wrapper for the Google Web Speech API).
- Pocketsphinx (offline recognition).
- Other APIs: Google Cloud Speech, IBM Speech to Text, Whisper (OpenAI).

9. Self-Exercise and Implementation:
- Record sentences as .wav files.
- Use Python libraries such as SpeechRecognition to recognize the speech (see the sketch after this list).
- Measure transcription accuracy.
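A minimal sketch of the self-exercise in item 9 using the SpeechRecognition package; the file name sentence.wav is a placeholder.

```python
# Transcribe a recorded .wav file with the SpeechRecognition package
# (pip install SpeechRecognition).
import speech_recognition as sr

recognizer = sr.Recognizer()
with sr.AudioFile("sentence.wav") as source:     # placeholder file name
    audio = recognizer.record(source)            # read the whole file

try:
    text = recognizer.recognize_google(audio)    # Google Web Speech API wrapper
    print("Transcript:", text)
except sr.UnknownValueError:
    print("Speech was unintelligible")
except sr.RequestError as e:
    print("API request failed:", e)
```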
FILL IN THE BLANK QUESTIONS
1. Speech recognition is an interdisciplinary subfield combining __________ and __________. Answer: Computer Science, Computational Linguistics
2. Speech signals can be represented in two domains: __________ and __________. Answer: Time, frequency
3. The reverse of speech recognition, converting text to speech, is known as __________. Answer: Speech synthesis
4. The __________ is commonly used in modern speech recognition systems for decoding audio into text. Answer: Hidden Markov Model (HMM)
5. AI-driven speech recognition is expected to replace traditional __________ interfaces. Answer: Chat-based AI
6. The vector representation of speech signal fragments is known as __________ coefficients. Answer: Cepstral
7. The typical sampling rate for telephony systems is __________ kHz. Answer: 8
8. Tools like __________ are used for offline speech recognition. Answer: Pocketsphinx
9. In Python, the __________ package provides a wrapper for the Google Web Speech API. Answer: SpeechRecognition
10. Speaker diarization involves identifying __________ within a speech signal. Answer: "Who spoke when"