Document Details

Uploaded by EffusiveCelebration6888

Tags

natural language processing, parts-of-speech tagging, hidden Markov model, speech recognition

Summary

This document provides study notes on natural language processing topics: part-of-speech (POS) tagging, Hidden Markov Models (HMMs) with the Forward and Viterbi algorithms, statistical and syntactic parsing, sentiment analysis, named entity recognition, and speech recognition. Each topic includes fill-in-the-blank review questions.

Full Transcript

TOPIC 6 / TOPIC 7

1. Part-of-Speech (POS) Tagging:
- The process of assigning lexical class markers (tags) to words in a corpus.
- Useful for speech recognition, word sense disambiguation, and other NLP tasks.
2. Word Classes:
- Closed class: a fixed set of grammatical function words (e.g., prepositions, conjunctions).
- Open class: an expanding set of content words (e.g., nouns, verbs, adjectives).
3. Tagsets:
- Penn Treebank tagset: the most commonly used, with 45 tags.
- C5 tagset: used for the British National Corpus (BNC), with 61 tags.
4. Ambiguities in POS Tagging:
- Words like "book" or "like" can take multiple POS tags depending on context.
- Context-based algorithms resolve these ambiguities (see the sketch after this topic's questions).
5. Approaches to POS Tagging:
- Rule-based: uses handcrafted linguistic rules.
- Learning-based: relies on annotated corpora and machine learning (e.g., Naïve Bayes, neural networks, HMMs).
6. Probabilistic Models:
- Hidden Markov Models (HMMs): assume the next state depends only on the current state.
- Conditional Random Fields (CRFs): consider global dependencies for sequence labeling.
7. Training and Evaluation:
- The training phase estimates word-tag and tag-transition probabilities.
- Evaluation involves calculating metrics such as precision, recall, and F-measure.
8. Sequence Labeling Problem:
- Classifies each token in a sequence, considering dependencies between neighboring tokens.

FILL IN THE BLANK
1. The process of assigning a lexical class marker to each word in a corpus is called _______.
Answer: Part-of-Speech Tagging
2. Words like "in" and "on" are part of the _______ class, which has a fixed membership.
Answer: Closed
3. The _______ tagset consists of 45 tags and is widely used in NLP.
Answer: Penn Treebank
4. Rule-based POS tagging relies on _______-crafted rules based on linguistic knowledge.
Answer: Human
5. In probabilistic sequence models, the _______ assumes the next state depends only on the current state.
Answer: Hidden Markov Model (HMM)
6. Training data is typically split into _______ for model training and _______ for testing.
Answer: 90%, 10%
7. The metric that calculates the harmonic mean of precision and recall is called the _______.
Answer: F-measure
8. _______ is the problem where contexts to be tagged do not appear in the training data.
Answer: Sparse data
9. The _______ model uses probabilities of word-tag pairs and tag transitions.
Answer: Bigram Tagger
10. When using a sliding window approach, the classification of a token considers its _______ tokens.
Answer: Neighboring
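To make the tag ambiguity in item 4 concrete, here is a minimal sketch using NLTK's pre-trained tagger, which emits Penn Treebank tags. The example sentence is an illustrative assumption, and the resource names passed to nltk.download() may vary slightly across NLTK versions.

```python
# Minimal POS-tagging sketch with NLTK's pre-trained tagger
# (Penn Treebank tagset). Resource names may differ by NLTK version.
import nltk

nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

tokens = nltk.word_tokenize("I like to book a flight")

# pos_tag returns (word, tag) pairs; the ambiguous words "like" and
# "book" are resolved from context.
print(nltk.pos_tag(tokens))
# e.g. [('I', 'PRP'), ('like', 'VBP'), ('to', 'TO'),
#       ('book', 'VB'), ('a', 'DT'), ('flight', 'NN')]
```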
TOPIC 8

1. Markov Chain:
- A model representing transitions between states with associated probabilities.
- The transition probabilities leaving a state must sum to 1.
- It cannot represent ambiguity.
2. Hidden Markov Model:
- An extension of Markov chains in which the states are hidden (non-observable).
- Example: POS tagging, where words are observed but tags (states) are hidden.
- Components:
- States (Q): hidden variables (e.g., Hot/Cold, POS tags).
- Observations (O): observable data (e.g., ice creams eaten, words in a sentence).
- Transition probabilities (A): likelihood of moving from one state to another.
- Emission probabilities (B): likelihood of observations given a state.
- Initial probabilities (π): probability of starting in each state.
3. Fundamental Problems in HMMs:
- Likelihood problem: compute the probability of an observation sequence (solved using the Forward algorithm).
- Decoding problem: find the most probable sequence of hidden states (solved using the Viterbi algorithm).
- Learning problem: estimate the HMM parameters given observed sequences.
4. Forward Algorithm:
- Purpose: calculate the probability of an observation sequence efficiently.
- Approach: dynamic programming with a forward trellis; sums probabilities over all possible paths leading to each state.
5. Viterbi Algorithm:
- Purpose: determine the most probable sequence of hidden states.
- Approach: similar to the Forward algorithm but uses the max function instead of summation.
- Includes backtracking pointers to reconstruct the best path through the trellis (see the sketch after this list).
6. Applications of HMMs:
- Weather prediction: infer temperature (Hot/Cold) from ice cream consumption.
- POS tagging: assign grammatical tags to words in a sentence.
- Activity recognition: predict activities (e.g., walking, shopping) from observed behavior.
- Health monitoring: infer health conditions from symptoms or activities.
7. Variants of HMMs:
- Bakis HMM: left-to-right transitions (e.g., modeling speech).
- Ergodic HMM: fully connected; transitions allowed between all states.
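Since the notes walk through both algorithms on the Hot/Cold ice-cream example, here is a from-scratch sketch of the Viterbi decoder; the probability values below are illustrative assumptions, not numbers from the notes. Replacing max with a sum over predecessors (and dropping the backpointers) turns it into the Forward algorithm.

```python
# Viterbi decoding for a toy Hot/Cold ice-cream HMM; all probabilities
# below are illustrative assumptions. Observations are ice creams eaten.
states = ["Hot", "Cold"]
pi = {"Hot": 0.8, "Cold": 0.2}                 # initial probabilities (pi)
A = {"Hot": {"Hot": 0.7, "Cold": 0.3},         # transition probabilities (A)
     "Cold": {"Hot": 0.4, "Cold": 0.6}}
B = {"Hot": {1: 0.2, 2: 0.4, 3: 0.4},          # emission probabilities (B)
     "Cold": {1: 0.5, 2: 0.4, 3: 0.1}}

def viterbi(obs):
    # trellis[t][s]: probability of the best path ending in state s at time t
    trellis = [{s: pi[s] * B[s][obs[0]] for s in states}]
    backptr = [{}]
    for t in range(1, len(obs)):
        col, ptrs = {}, {}
        for s in states:
            # max over predecessors (the Forward algorithm would sum here)
            best = max(states, key=lambda p: trellis[t - 1][p] * A[p][s])
            col[s] = trellis[t - 1][best] * A[best][s] * B[s][obs[t]]
            ptrs[s] = best
        trellis.append(col)
        backptr.append(ptrs)
    # backtrack from the most probable final state to recover the path
    last = max(states, key=lambda s: trellis[-1][s])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(backptr[t][path[-1]])
    return list(reversed(path)), trellis[-1][last]

print(viterbi([3, 1, 3]))  # -> (['Hot', 'Hot', 'Hot'], ~0.0125)
```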
FILL IN THE BLANK
1. A Markov chain cannot represent _______, as it uniquely determines the path through states.
Answer: Ambiguity
2. The extension of Markov chains that includes hidden states is called the _______.
Answer: Hidden Markov Model (HMM)
3. In an HMM, the probability of observing a specific output given a state is known as the _______.
Answer: Emission Probability
4. The _______ algorithm is used to compute the probability of an observation sequence.
Answer: Forward
5. The _______ algorithm is used to find the most probable sequence of hidden states.
Answer: Viterbi
6. A left-to-right HMM commonly used in speech recognition is called a _______ HMM.
Answer: Bakis
7. Transition probabilities describe the likelihood of moving from one _______ to another.
Answer: State
8. The dynamic programming structure used in the Forward and Viterbi algorithms is called a _______.
Answer: Trellis
9. In the Forward algorithm, probabilities are computed by _______ over all possible paths leading to a state.
Answer: Summing
10. In the Viterbi algorithm, probabilities are computed by taking the _______ over all possible paths leading to a state.
Answer: Maximum
11. The HMM component that specifies the probability of starting in each state is called the _______.
Answer: Initial Probability (π)
12. The _______ pointers in the Viterbi algorithm trace the best path through the states.
Answer: Backtracking
13. The Forward and Viterbi algorithms both use _______ programming to improve computational efficiency.
Answer: Dynamic

TOPIC 9: STATISTICAL PARSING

1. Overview:
- Statistical parsing uses probabilistic models to assign probabilities to parse trees.
- Helps resolve syntactic ambiguity and supports supervised and unsupervised parser learning.
2. Probabilistic Context-Free Grammar (PCFG):
- A CFG variant in which each production rule has an associated probability.
- The probabilities define a distribution over the expansions of each non-terminal.
- An example grammar includes rules for English sentence structures with associated probabilities.
3. Treebanks:
- Corpora annotated with parse trees, e.g., the Penn Treebank.
- Treebanks provide a foundation for supervised learning of PCFGs.
4. Parsing Techniques with PCFGs:
- Use of NLTK classes (e.g., InsideChartParser, ViterbiParser) for probabilistic parsing (see the sketch after this list).
- Steps include defining a grammar, generating parse trees, and calculating their probabilities.
5. Evaluation Metrics:
- PARSEVAL metrics: recall, precision, and F1-score measure how well parse trees align with gold standards.
- Example calculations for labeled precision and recall were provided.
6. Dependency Grammar:
- Represents syntactic structure through dependencies between words rather than phrases.
- Uses directed graphs between words, making it suitable for free word-order languages.
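A minimal sketch of the steps in item 4 using NLTK's ViterbiParser; the toy grammar and its rule probabilities are illustrative assumptions. Note how the probabilities of all expansions of each non-terminal sum to 1, mirroring how a PCFG defines a distribution per non-terminal.

```python
# Probabilistic parsing with NLTK: define a toy PCFG, then use
# ViterbiParser to find the most probable parse tree.
import nltk

grammar = nltk.PCFG.fromstring("""
    S -> NP VP [1.0]
    NP -> Det N [0.6] | 'John' [0.4]
    VP -> V NP [1.0]
    Det -> 'the' [1.0]
    N -> 'dog' [1.0]
    V -> 'saw' [1.0]
""")

parser = nltk.ViterbiParser(grammar)
for tree in parser.parse("John saw the dog".split()):
    print(tree)           # the most probable parse tree
    print(tree.prob())    # its probability (0.4 * 0.6 = 0.24)
```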
SYNTACTIC PARSING

1. Phrase Structure Grammar (PSG):
- Introduced by Noam Chomsky; sentences are generated via rewrite rules.
- Focuses on deriving correct syntactic parse trees for sentences.
2. Parsing as Search:
- Explores the space of derivations for one that yields the given string:
- Top-down parsing: starts from the root (start symbol).
- Bottom-up parsing: begins with the terminal symbols and moves toward the root.
3. Parsing Strategies:
- Top-down parsers explore inconsistent options early and may generate invalid trees.
- Bottom-up parsers avoid inconsistent options but might not reach complete parses.
4. Examples of Parsing:
- Examples illustrate parsing structures for given sentences, with CFG rules and NLTK tools.
5. Comparison of Parsing Approaches:
- Efficiency and error tendencies are compared for top-down and bottom-up methods.

FILL-IN-THE-BLANK QUESTIONS
1. Statistical parsing uses a _______ model to assign probabilities to parse trees.
Answer: Probabilistic
2. A probabilistic version of a CFG is called a _______.
Answer: Probabilistic Context-Free Grammar (PCFG)
3. A _______ is a corpus annotated with parse trees, commonly used for supervised learning.
Answer: Treebank
4. In statistical parsing, the probability of a sentence is the _______ of the probabilities of all its derivations.
Answer: Sum
5. The _______ algorithm efficiently determines the most probable derivation of a sentence under a PCFG.
Answer: Viterbi
6. _______ parsing starts from the root of the parse tree and applies grammar rules to generate possible trees.
Answer: Top-down
7. _______ parsing starts from the terminal symbols and works backward toward the root.
Answer: Bottom-up
8. The F1 score is the harmonic mean of _______ and _______.
Answer: Precision, Recall
9. Dependency grammar represents syntactic structure as _______ between words.
Answer: Dependencies
10. An alternative to phrase structure grammar is _______ grammar, which focuses on labeled binary relations.
Answer: Dependency

TOPIC 11: Text Analytics & Sentiment Analysis

1. Definition of Sentiment Analysis:
- Focuses on analyzing people's opinions, sentiments, and emotions in text.
- Uses NLP, statistics, and machine learning to identify sentiment content.
- Also known as opinion mining.
2. Key Sentiment Analysis Concepts:
- Semantic orientation and polarity: indicate positive, negative, or neutral sentiment.
- Subjective impressions: based on personal judgments, emotional state, or contextual polarity.
3. Levels of Sentiment Analysis:
- Document level: analyzes the overall sentiment of a document.
- Sentence level: identifies the sentiment of each sentence, distinguishing objective from subjective content.
- Entity and aspect level: examines finer details, such as opinions on specific product features.
4. Challenges in Sentiment Analysis:
- Complexity of opinions in text.
- Issues such as negation, sarcasm, and rhetorical devices.
5. Steps in Sentiment Analysis Using NLTK:
- Train classifiers with labeled data.
- Use feature extraction (e.g., the Bag of Words model) to classify sentiment (see the sketch after this list).
6. Evaluation Techniques:
- Sentiment lexicons and Pointwise Mutual Information (PMI).
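A minimal sketch of the workflow in item 5, assuming a tiny hand-made training set (illustrative, not real data): bag-of-words features feed NLTK's NaiveBayesClassifier.

```python
# Train a Naive Bayes sentiment classifier on bag-of-words features.
# The four labeled sentences are an illustrative, made-up training set.
from nltk.classify import NaiveBayesClassifier

def bag_of_words(text):
    # Bag of Words: unordered word-presence features
    return {word: True for word in text.lower().split()}

train_set = [
    (bag_of_words("I love this amazing phone"), "positive"),
    (bag_of_words("the staff were helpful and friendly"), "positive"),
    (bag_of_words("terrible battery and awful screen"), "negative"),
    (bag_of_words("not worth the money, very disappointing"), "negative"),
]

classifier = NaiveBayesClassifier.train(train_set)
print(classifier.classify(bag_of_words("an amazing and helpful product")))
# expected: 'positive'
```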
Information Extraction & Named Entity Recognition (NER)

1. Definition of NER:
- Locates and classifies entities in text into categories like names, organizations, locations, etc.
- Enhances the meaning of text documents through information extraction.
2. Applications of NER:
- Customer support systems for identifying issues.
- Resume filtering by extracting required skills.
- Healthcare data analysis for identifying symptoms and diseases.
3. Types of NER Systems:
- Dictionary-based: uses predefined lists.
- Rule-based: relies on morphological and contextual patterns.
- Machine learning-based: trains models with annotated data.
- Deep learning-based: leverages non-linear feature representations.
4. NER Implementation:
- Techniques include tokenization, part-of-speech tagging, noun phrase chunking, and IOB tagging.
- Libraries like NLTK and spaCy are used for implementation.
5. spaCy Highlights:
- Pre-trained on the OntoNotes 5 corpus.
- Supports multiple entity types and requires minimal setup (see the sketch after this list).
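A minimal spaCy sketch of item 5; it assumes the small English pipeline has been installed beforehand (python -m spacy download en_core_web_sm), and the sentence is an illustrative assumption.

```python
# Named Entity Recognition with spaCy's small English pipeline,
# which is trained on the OntoNotes 5 corpus.
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is opening a new office in Singapore, said Tim Cook.")

for ent in doc.ents:
    print(ent.text, ent.label_)
# e.g. Apple ORG / Singapore GPE / Tim Cook PERSON
```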
Fill in the blanks:
1. Question: Sentiment analysis, also known as ________, uses natural language processing to identify and classify emotions in text.
Answer: Opinion mining
2. Question: The three levels of sentiment analysis are ________, ________, and ________.
Answer: Document level, sentence level, and entity and aspect level
3. Question: Sentiment analysis aims to classify text into three categories: ________, ________, and ________.
Answer: Positive, negative, neutral
4. Question: The ________ level of sentiment analysis evaluates the overall sentiment of a document.
Answer: Document
5. Question: ________ analysis examines specific attributes of an entity to determine sentiment, such as a product's features.
Answer: Aspect
6. Question: The ________ model represents text as a collection of unordered words, sorted by frequency of occurrence.
Answer: Bag of Words
7. Question: One major challenge in sentiment analysis is understanding ________, such as "not bad," which can invert the sentiment.
Answer: Negation
8. Question: Challenges in sentiment analysis include ________, ________, and rhetorical devices such as sarcasm.
Answer: Negation, sarcasm
9. Question: Sentiment analysis is often applied to sources like social media posts, ________, and ________.
Answer: Product reviews, news articles
10. Question: ________ is an information-theoretic measure used to identify word associations or collocations in text.
Answer: Pointwise Mutual Information (PMI)
11. Question: ________ is a Python library commonly used for implementing sentiment analysis using tools like classifiers and feature extraction.
Answer: NLTK
12. Question: Positive sentiment words include ________, ________, and ________.
Answer: Love, amazing, helpful
13. Question: Named Entity Recognition seeks to classify entities in text into predefined categories like ________, ________, and ________.
Answer: Persons, locations, organizations
14. Question: The ________ approach to NER uses predefined vocabulary lists to match entities in text.
Answer: Dictionary-based
15. Question: ________ NER models rely on statistical techniques and feature-based representations to detect entities.
Answer: Machine learning-based
16. Question: The ________ approach to NER uses non-linear data representations to capture complex relationships in text.
Answer: Deep learning-based
17. Question: IOB tagging represents tokens as ________, ________, or ________ in a chunking structure.
Answer: Inside, outside, beginning
18. Question: Tools like ________ and ________ are commonly used for implementing Named Entity Recognition.
Answer: NLTK, spaCy
19. Question: A rule-based NER system uses ________ rules and ________ rules to identify entities in text.
Answer: Morphological, contextual
20. Question: In spaCy, NER is pre-trained on the ________ corpus, which supports multiple entity types.
Answer: OntoNotes 5
21. Question: The process of grouping words into chunks like noun phrases is called ________.
Answer: Chunking
22. Question: The spaCy command to extract named entities involves using the function ________.
Answer: nlp()
23. Question: Named Entity Recognition seeks to locate and classify entities into categories such as ________, ________, and ________.
Answer: Names of persons, organizations, locations
24. Question: ________ is a Python library widely used for NER and natural language processing.
Answer: spaCy

TOPIC 12

1. Definition of Speech Recognition:
- An interdisciplinary field combining computer science and computational linguistics.
- Converts human speech into text using algorithms and technologies.
- Known as Automatic Speech Recognition (ASR) or Speech-to-Text.
2. Trends in Speech Recognition (2024 and Beyond):
- Replacement of chat-based AI interfaces with voice input.
- Improved AI-powered voice assistants (e.g., integration of Large Language Models).
- Accessibility improvements: automatic captions for social media.
- Enhanced collaboration tools such as Google Duet AI.
3. Applications:
- Voice assistants: phones, smart devices, and cars.
- Speech-to-text tools: automated transcription for meetings.
- Accessibility tools: benefiting those with disabilities.
- Security: speaker recognition for authentication.
4. How It Works:
- Speech is digitized using a microphone and an analog-to-digital converter.
- Core techniques involve neural networks, Hidden Markov Models (HMMs), and Voice Activity Detectors (VADs).
- Speech signals are analyzed at 10-millisecond intervals to generate cepstral coefficients (vectors representing signal features).
5. Challenges in Speech Recognition:
- Variability in pronunciation (e.g., dialects, accents).
- Homophones (e.g., "bear" vs. "bare").
- Impact of noise and emotion.
- Difficulty in identifying pauses or prosody.
6. Data and Formats:
- Common audio formats: WAV, MP3, M4A, WMA.
- Sampling rates:
- Telephony systems use 8 kHz.
- Human hearing ranges between 20 Hz and 20,000 Hz.
7. Speech Analysis Applications:
- Speaker diarization: identifying "who spoke when."
- Emotion classification: detecting speech emotions such as happiness or anger.
- Text-to-speech: generating natural-sounding speech.
8. Python Packages for Speech Recognition:
- SpeechRecognition (Google Web Speech API wrapper).
- Pocketsphinx (offline recognition).
- Other APIs: Google Cloud Speech, IBM Speech to Text, Whisper (OpenAI).
9. Self-Exercise and Implementation:
- Record sentences as .wav files.
- Use Python libraries like SpeechRecognition to recognize the speech (see the sketch at the end of this topic).
- Measure transcription accuracy.

FILL IN THE BLANK QUESTIONS
1. Speech recognition is an interdisciplinary subfield combining __________ and __________.
Answer: Computer Science, Computational Linguistics
2. Speech signals can be represented in two domains: __________ and __________.
Answer: Time, frequency
3. The reverse of speech recognition, converting text to speech, is known as __________.
Answer: Speech synthesis
4. The __________ is commonly used in modern speech recognition systems for decoding audio into text.
Answer: Hidden Markov Model (HMM)
5. AI-driven speech recognition is expected to replace traditional __________ interfaces.
Answer: Chat-based AI
6. The vector representation of speech signal fragments is known as __________ coefficients.
Answer: Cepstral
7. The typical sampling rate for telephony systems is __________ kHz.
Answer: 8
8. Tools like __________ are used for offline speech recognition.
Answer: Pocketsphinx
9. In Python, the __________ package provides a wrapper for the Google Web Speech API.
Answer: SpeechRecognition
10. Speaker diarization involves identifying __________ within a speech signal.
Answer: "Who spoke when"
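A minimal sketch of the self-exercise in item 9 using the SpeechRecognition package; "sentence.wav" is an assumed local recording, and recognize_google() needs an internet connection since it calls the Google Web Speech API.

```python
# Transcribe a recorded .wav file with the SpeechRecognition package.
# "sentence.wav" is an assumed local file name.
import speech_recognition as sr

recognizer = sr.Recognizer()
with sr.AudioFile("sentence.wav") as source:
    audio = recognizer.record(source)  # read the entire file

try:
    # recognize_google() wraps the Google Web Speech API (online)
    print(recognizer.recognize_google(audio))
except sr.UnknownValueError:
    print("Speech was unintelligible")
except sr.RequestError as err:
    print(f"API request failed: {err}")
```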
