Document Details


Uploaded by EffusiveCelebration6888

Tags

Natural Language Processing, text summarization, NLP, machine learning

Summary

This document provides an overview of Topic 6, Text Summarization, encompassing definitions, types, deep learning methods, and associated algorithms. It also covers large language models (LLMs) and word embeddings, classification and clustering, part-of-speech tagging, hidden Markov models, statistical and syntactic parsing, sentiment analysis, named entity recognition, and speech recognition, with fill-in-the-blank and short-answer questions for each topic.

Full Transcript


TOPIC 6 TEXT SUMMARIZATION

1. Definition:
- Text summarization generates concise versions of large texts without losing essential information.
- Common applications include news aggregators like Google News and Inshorts.

2. Types of Text Summarization:
- Extractive summarization:
  - Selects key sentences or phrases directly from the source.
  - Methods include frequency-based techniques, TF-IDF, and tools like Sumy.
- Abstractive summarization:
  - Generates summaries using deep learning models (e.g., BERT, GPT).
  - Paraphrases content and conveys meaning instead of copying phrases.

3. Deep Learning Methods:
- Seq2Seq model with attention:
  - Encoder (bi-directional LSTM): extracts features from the input text.
  - Decoder (uni-directional LSTM): generates the summary one word at a time.
  - Attention mechanism: improves performance by addressing the "information bottleneck."
- Transformers:
  - Utilize self-attention for higher-quality summaries.
  - Example: PEGASUS pre-trains by masking key sentences and reconstructing them.

4. Algorithms and Tools:
- Frequency method: selects sentences containing high-frequency terms.
- Sumy library: implements various summarization algorithms (a short usage sketch follows this topic's outline).
- LSA (Latent Semantic Analysis): projects data into a low-dimensional space while preserving semantics.
- LexRank (cosine similarity): measures sentence similarity to create summaries.

LARGE LANGUAGE MODELS (LLM) & WORD EMBEDDINGS

1. Word Embeddings:
- Represent words as dense vectors in a low-dimensional space (25–1000 dimensions).
- Examples:
  - Word2Vec: predicts the surrounding words in a window for each word.
  - GloVe: combines global co-occurrence statistics with vector representations.

2. Limitations of Traditional Representations:
- "One-hot" vectors are high-dimensional, sparse, and lack semantic meaning.
- Dense embeddings solve the sparsity and dimensionality issues.

3. Semantic Patterns:
- Word embeddings capture relationships (e.g., king – man ≈ queen – woman).
- Analogies can be solved through vector subtraction (see the sketch below).

4. Large Language Models:
- LLaMA:
  - Trained on trillions of tokens in multiple languages.
  - Features billions of parameters for improved text generation.
- Other foundational models include GPT, BERT, and Meta AI's LLaMA.

5. Advantages:
- Pretrained word vectors improve downstream NLP tasks.
- Self-attention in models enhances understanding of context.
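As a concrete illustration of the semantic patterns above, the short sketch below solves the king/queen analogy with pretrained GloVe vectors. It assumes the gensim package is installed; the model name "glove-wiki-gigaword-50" and the example words are illustrative choices, not part of the lecture material.

    # Word-analogy sketch with pretrained embeddings (assumes gensim is installed;
    # the first call downloads the GloVe vectors).
    import gensim.downloader as api

    vectors = api.load("glove-wiki-gigaword-50")   # 50-dimensional GloVe vectors

    # king - man + woman: vector arithmetic should rank "queen" near the top.
    print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=3))

    # Related words sit closer together than unrelated ones (cosine similarity).
    print(vectors.similarity("king", "queen"), vectors.similarity("king", "carrot"))

Any other set of pretrained word vectors would behave the same way; only the dimensionality and vocabulary differ.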
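Returning to the extractive tools listed under Topic 6, here is a minimal Sumy sketch using its LexRank summarizer. It assumes sumy and NLTK's "punkt" tokenizer data are installed; the sample text and the choice of two output sentences are arbitrary.

    # Extractive summarization sketch with the Sumy library.
    from sumy.parsers.plaintext import PlaintextParser
    from sumy.nlp.tokenizers import Tokenizer
    from sumy.summarizers.lex_rank import LexRankSummarizer

    text = (
        "Text summarization generates concise versions of large texts. "
        "Extractive methods select key sentences directly from the source. "
        "Abstractive methods paraphrase the content with deep learning models. "
        "News aggregators use summarization to speed up reading."
    )

    parser = PlaintextParser.from_string(text, Tokenizer("english"))
    summarizer = LexRankSummarizer()               # cosine-similarity based LexRank

    # Keep the two most central sentences as the summary.
    for sentence in summarizer(parser.document, 2):
        print(sentence)

Sumy ships other summarizers (e.g., an LSA-based one) that follow the same call pattern.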
FILL IN THE BLANK QUESTIONS

1. Text summarization creates __________ versions of texts for quicker consumption. Answer: shorter
2. Extractive summarization involves selecting __________ from the original text. Answer: key sentences or phrases
3. Abstractive summarization uses deep learning models like __________ or __________. Answer: BERT, GPT
4. The __________ model uses attention mechanisms to avoid information bottlenecks. Answer: Seq2Seq
5. __________ is a library that implements summarization algorithms like TextRank and LSA. Answer: Sumy
6. Word embeddings represent words as __________ vectors. Answer: dense
7. __________ combines global word-word co-occurrence with dense vectors. Answer: GloVe
8. "One-hot" vectors are problematic because they are __________ and __________. Answer: high-dimensional, sparse
9. The transformer mechanism introduces __________ for high-quality outputs. Answer: self-attention
10. __________ is a foundational language model trained on trillions of tokens by Meta AI. Answer: LLaMA

CLASSIFICATION

1. Introduction to Machine Learning and NLP Integration:
- Machine learning learns relations from features in data.
- Classification (supervised): predicts classes based on labeled data.
- Clustering (unsupervised): groups data without predefined labels.

2. Text Representation for Computers:
- Human-readable text must be transformed into numbers for computational processing.
- NLP applies patterns learned from training data to predict or analyze new texts.

3. Types of Machine Learning:
- Supervised learning: uses labeled data to infer relationships.
- Unsupervised learning: infers structure from unlabeled data.
- Semi-supervised learning: combines a small labeled dataset with a larger unlabeled one.

4. Applications:
- Examples include healthcare (predicting high-risk patients), inventory categorization, voice assistants, translation, self-driving cars, etc.

5. Steps in Machine Learning:
- Data collection, cleaning, shuffling, splitting, model training, evaluation, and application.

6. Deep Learning and NLP:
- Learning representations through successive layers.
- Applications include transformers like Google BERT, word embeddings (Word2Vec, GloVe), and reinforcement learning for tasks like NLG.

7. Python Libraries:
- Key libraries: NumPy, SciPy, NLTK, Scikit-learn, Pandas, and Matplotlib.

8. Naïve Bayes Classification:
- Based on Bayes' theorem; assumes feature independence.
- Works well for categorical variables; requires feature extraction.

9. Text Classification Example:
- German name gender classification using the last letter as a feature (a short NLTK sketch appears after the clustering pre-processing list below).
- Testing involves unseen names; accuracy evaluation considers informative features.

CLUSTERING

1. Text Clustering:
- Groups texts into clusters with similar characteristics.
- Useful for analyzing large, unstructured datasets.

2. K-Means Clustering Algorithm:
- Finds groups in data, with K representing the number of clusters (an end-to-end sketch with tf-idf vectorization appears below, after the pre-processing list).
- Steps:
  - Initialize centroids (randomly or from the data).
  - Assign data points to the nearest centroid (based on squared Euclidean distance).
  - Update centroids iteratively.

3. Clustering with Texts:
- Example: news articles categorized based on term similarity.
- Visual representation through 2D scatterplots using cosine distance.

4. Pre-processing Text Data:
- Stop-word removal: eliminates common words that add no value (e.g., "the," "and").
- Normalization: converts text to lowercase and removes punctuation.
- Tokenization: splits text into words and counts occurrences.
- Stemming: reduces words to root forms.
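Picking up the text-classification example from the Classification section above, the sketch below trains an NLTK Naïve Bayes classifier on the last letter of a name. The tiny name list is made up for illustration; the lecture's German-name dataset is not reproduced here.

    # Last-letter gender classifier with NLTK's NaiveBayesClassifier.
    import nltk

    def gender_features(name):
        # Single feature: the last letter of the name.
        return {"last_letter": name[-1].lower()}

    labeled_names = [
        ("Anna", "female"), ("Maria", "female"), ("Sabine", "female"), ("Petra", "female"),
        ("Hans", "male"), ("Peter", "male"), ("Karl", "male"), ("Stefan", "male"),
    ]

    train_set = [(gender_features(n), g) for (n, g) in labeled_names]
    classifier = nltk.NaiveBayesClassifier.train(train_set)

    # Classify unseen names and inspect the most informative features.
    print(classifier.classify(gender_features("Julia")))
    print(classifier.classify(gender_features("Markus")))
    classifier.show_most_informative_features(3)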
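And here is a compact sketch of the clustering pipeline just outlined: normalize and vectorize the texts with tf-idf (described in more detail in the next item), then group them with K-Means. It assumes scikit-learn is installed; the toy documents and the choice of two clusters are illustrative.

    # Text clustering sketch: tf-idf vectorization followed by K-Means.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.cluster import KMeans

    docs = [
        "The match ended with a late goal and a penalty shootout.",
        "The striker scored twice before the final whistle.",
        "New smartphone models ship with faster processors this year.",
        "Chipmakers unveiled processors for laptops and phones.",
    ]

    # TfidfVectorizer lowercases, tokenizes, removes English stop words, and
    # weights terms by tf-idf in one step.
    vectorizer = TfidfVectorizer(stop_words="english")
    X = vectorizer.fit_transform(docs)          # rows = documents, columns = terms

    kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
    for doc, label in zip(docs, kmeans.fit_predict(X)):
        print(label, doc)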
5. Vectorization and tf-idf:
- Text-to-numeric transformation using TfIdfVectorizer.
- Assigns importance scores to words based on their frequency in a document and across the corpus.

6. Implementation:
- Processes text files, performing custom tokenization and stemming.
- Creates a matrix in which rows represent files and columns represent terms with tf-idf scores.

FILL-IN-THE-BLANK

1. Machine learning is the process of _______ from data. Answer: learning
2. In supervised learning, the model uses _______ data to infer relationships. Answer: labeled
3. The goal of clustering is to group data points based on _______ similarity. Answer: feature
4. _______ is a widely used Python library for numerical and scientific computation. Answer: NumPy
5. Stop words are commonly used words that are _______ during data preprocessing. Answer: removed
6. Tf-idf stands for _______. Answer: term frequency-inverse document frequency

SHORT ANSWER

1. What is the main difference between classification and clustering?
Answer: Classification assigns labels to data points using pre-labeled data (supervised), whereas clustering groups data points based on similarities without predefined labels (unsupervised).
2. Why is text vectorization important in NLP?
Answer: It transforms text into numerical data, enabling machine learning algorithms to process and analyze it.
3. What are the steps involved in K-means clustering?
Answer: Initialize centroids; assign data points to the nearest centroid; update centroids iteratively until convergence.
4. How does stemming contribute to NLP preprocessing?
Answer: Stemming reduces words to their base form, helping to group similar words and simplifying text analysis.
5. What role does tf-idf play in text clustering?
Answer: It quantifies the importance of words in a document relative to a corpus, helping to distinguish the key terms used for clustering.

TOPIC 7

1. Part-of-Speech (POS) Tagging:
- The process of assigning lexical class markers (tags) to words in a corpus.
- Useful for speech recognition, word sense disambiguation, and other NLP tasks.

2. Word Classes:
- Closed class: a fixed set of grammatical function words (e.g., prepositions, conjunctions).
- Open class: an expanding set of content words (e.g., nouns, verbs, adjectives).

3. Tagsets:
- Penn Treebank tagset: the most commonly used, with 45 tags.
- C5 tagset: used for the British National Corpus (BNC), with 61 tags.

4. Ambiguities in POS Tagging:
- Words like "book" or "like" can take multiple POS tags depending on context.
- Context-based algorithms resolve these ambiguities (see the tagging sketch below).

5. Approaches to POS Tagging:
- Rule-based: uses handcrafted linguistic rules.
- Learning-based: relies on annotated corpora and machine learning (e.g., Naïve Bayes, neural networks, HMMs).

6. Probabilistic Models:
- Hidden Markov Models (HMMs): assume the next state depends only on the current state.
- Conditional Random Fields (CRFs): consider global dependencies for sequence labeling.

7. Training and Evaluation:
- The training phase estimates probabilities for word-tag pairs and tag transitions.
- Evaluation involves calculating metrics like precision, recall, and F-measure.

8. Sequence Labeling Problem:
- Classifies each token in a sequence, taking into account dependencies between neighboring tokens.
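A quick way to see Penn Treebank tags and the kind of ambiguity described above is NLTK's pre-trained tagger. The sketch assumes the "punkt" and "averaged_perceptron_tagger" data have been downloaded (e.g., via nltk.download); the example sentence is made up.

    # POS tagging sketch with NLTK's pre-trained tagger (Penn Treebank tagset).
    import nltk

    sentence = "I like to book a flight and read a good book."
    tokens = nltk.word_tokenize(sentence)
    print(nltk.pos_tag(tokens))
    # The two occurrences of "book" should receive different tags (verb vs. noun),
    # which is exactly the context-dependent ambiguity discussed above.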
FILL IN THE BLANK

1. The process of assigning a lexical class marker to each word in a corpus is called _______. Answer: Part-of-Speech Tagging
2. Words like "in" and "on" are part of the _______ class, which has a fixed membership. Answer: Closed
3. The _______ tagset consists of 45 tags and is widely used in NLP. Answer: Penn Treebank
4. Rule-based POS tagging relies on _______ crafted based on linguistic knowledge. Answer: Rules (handcrafted by humans)
5. In probabilistic sequence models, _______ assumes the next state depends only on the current state. Answer: Hidden Markov Model (HMM)
6. Training data is typically split into _______ for model training and _______ for testing. Answer: 90%, 10%
7. The metric that calculates the harmonic mean of precision and recall is called _______. Answer: F-measure
8. _______ is a problem where the contexts to be tagged do not appear in the training data. Answer: Sparse data
9. The _______ model uses probabilities of word-tag pairs and tag transitions. Answer: Bigram tagger
10. When using a sliding window approach, the classification of a token considers its _______ tokens. Answer: Neighboring

TOPIC 8

1. Markov Chain:
- A model representing transitions between states with associated probabilities.
- The transition probabilities leaving a state must sum to 1.
- It cannot represent ambiguity.

2. Hidden Markov Model:
- An extension of Markov chains in which the states are hidden (non-observable).
- Example: POS tagging, where words are observed but tags (states) are hidden.
- Components:
  - States (Q): hidden variables (e.g., Hot/Cold, POS tags).
  - Observations (O): observable data (e.g., ice creams eaten, words in a sentence).
  - Transition probabilities (A): likelihood of moving from one state to another.
  - Emission probabilities (B): likelihood of observations given a state.
  - Initial probabilities (π): probability of starting in each state.

3. Fundamental Problems in HMM:
- Likelihood problem: compute the probability of an observation sequence (solved using the Forward algorithm).
- Decoding problem: find the most probable sequence of hidden states (solved using the Viterbi algorithm).
- Learning problem: estimate the HMM parameters given observed sequences.

4. Forward Algorithm:
- Purpose: calculate the probability of an observation sequence efficiently.
- Approach:
  - Dynamic programming with a forward trellis.
  - Sums probabilities over all possible paths leading to each state.

5. Viterbi Algorithm:
- Purpose: determine the most probable sequence of hidden states (a small worked sketch follows this outline).
- Approach:
  - Similar to the Forward algorithm but uses the max function instead of summation.
  - Includes backtracking to reconstruct the best path.
  - Backtracking pointers are used to trace the best path through the trellis.

6. Applications of HMM:
- Weather prediction: infer the temperature (Hot/Cold) based on ice cream consumption.
- POS tagging: assign grammatical tags to words in a sentence.
- Activity recognition: predict activities (e.g., walking, shopping) from observed behavior.
- Health monitoring: infer health conditions based on symptoms or activities.

7. Variants of HMM:
- Bakis HMM: left-to-right transitions (e.g., modeling speech).
- Ergodic HMM: fully connected; transitions allowed between all states.
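To make the decoding problem concrete, the sketch below runs Viterbi on the Hot/Cold ice-cream HMM. The transition, emission, and initial probabilities are illustrative values, not numbers taken from the slides.

    # Viterbi decoding sketch for a two-state (Hot/Cold) ice-cream HMM.
    states = ["Hot", "Cold"]
    start_p = {"Hot": 0.8, "Cold": 0.2}                      # initial probabilities (pi)
    trans_p = {"Hot": {"Hot": 0.7, "Cold": 0.3},             # transition probabilities (A)
               "Cold": {"Hot": 0.4, "Cold": 0.6}}
    emit_p = {"Hot": {1: 0.2, 2: 0.4, 3: 0.4},               # emission probabilities (B)
              "Cold": {1: 0.5, 2: 0.4, 3: 0.1}}

    def viterbi(observations):
        # trellis[t][s] = (best probability of reaching state s at time t, backpointer)
        trellis = [{s: (start_p[s] * emit_p[s][observations[0]], None) for s in states}]
        for obs in observations[1:]:
            column = {}
            for s in states:
                # max (not sum, as in the Forward algorithm) over predecessor states
                prob, prev = max(
                    (trellis[-1][p][0] * trans_p[p][s] * emit_p[s][obs], p) for p in states
                )
                column[s] = (prob, prev)
            trellis.append(column)
        # Backtrack from the best final state to recover the hidden state sequence.
        best_last = max(states, key=lambda s: trellis[-1][s][0])
        path = [best_last]
        for column in reversed(trellis[1:]):
            path.append(column[path[-1]][1])
        return list(reversed(path)), trellis[-1][best_last][0]

    print(viterbi([3, 1, 3]))   # most probable Hot/Cold sequence and its probability

Replacing the max over predecessor states with a sum turns this into the Forward algorithm.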
FILL IN THE BLANK

1. A Markov Chain cannot represent _______ as it uniquely determines the path through states. Answer: Ambiguity
2. The extension of Markov Chains that includes hidden states is called _______. Answer: Hidden Markov Model (HMM)
3. In HMM, the probability of observing a specific output given a state is known as _______. Answer: Emission probability
4. The _______ algorithm is used to compute the probability of an observation sequence. Answer: Forward
5. The _______ algorithm is used to find the most probable sequence of hidden states. Answer: Viterbi
6. A left-to-right HMM commonly used in speech recognition is called a _______ HMM. Answer: Bakis
7. Transition probabilities describe the likelihood of moving from one _______ to another. Answer: State
8. The dynamic programming structure used in the Forward and Viterbi algorithms is called a _______. Answer: Trellis
9. In the Forward algorithm, probabilities are computed by _______ over all possible paths leading to a state. Answer: Summing
10. In the Viterbi algorithm, probabilities are computed by taking the _______ over all possible paths leading to a state. Answer: Maximum
11. The HMM component that specifies the probability of starting in each state is called _______. Answer: Initial probability (π)
12. The _______ pointers in the Viterbi algorithm trace the best path through the states. Answer: Backtracking
13. The Forward and Viterbi algorithms both use _______ programming to improve computational efficiency. Answer: Dynamic

TOPIC 9 STATISTICAL PARSING

1. Overview:
- Statistical parsing uses probabilistic models to assign probabilities to parse trees.
- Helps resolve syntactic ambiguity and supports supervised and unsupervised parser learning.

2. Probabilistic Context-Free Grammar (PCFG):
- A CFG variant in which each production rule has an associated probability.
- The probabilities define distributions over the rules for each non-terminal.
- Example grammars include rules for English sentence structures with associated probabilities.

3. Treebanks:
- Corpora annotated with parse trees, e.g., the Penn Treebank.
- Treebanks provide a foundation for supervised learning of PCFGs.

4. Parsing Techniques with PCFG:
- Use of NLTK classes (e.g., InsideChartParser, ViterbiParser) for probabilistic parsing (a short sketch follows the Syntactic Parsing outline below).
- Steps include defining the grammar, generating parse trees, and calculating their probabilities.

5. Evaluation Metrics:
- PARSEVAL metrics: recall, precision, and F1-score measure how well parse trees align with a gold standard.
- Example calculations for labeled precision and recall were provided.

6. Dependency Grammar:
- Represents syntactic structure through dependencies between words rather than phrases.
- Uses directed graphs between words, making it suitable for free word-order languages.

SYNTACTIC PARSING

1. Phrase Structure Grammar (PSG):
- Introduced by Noam Chomsky; sentences are generated via rewrite rules.
- Focuses on deriving correct syntactic parse trees for sentences.

2. Parsing as Search:
- Explores the space of derivations that could produce a given string:
  - Top-down parsing: starts from the root (start symbol).
  - Bottom-up parsing: begins with the terminal symbols and works toward the root.

3. Parsing Strategies:
- Top-down parsers may expand options that are inconsistent with the input and can generate invalid trees.
- Bottom-up parsers avoid such inconsistent options but may never reach a complete parse.

4. Examples of Parsing:
- Examples illustrate parsing of given sentences using CFG rules and NLTK tools.

5. Comparison of Parsing Approaches:
- Efficiency and error tendencies are compared for the top-down and bottom-up methods.
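Tying the PCFG and parsing sections together, the sketch below defines a toy probabilistic grammar and finds the most probable tree with NLTK's ViterbiParser. The grammar, probabilities, and sentence are made up for illustration; the rule probabilities for each non-terminal sum to 1 as required.

    # Probabilistic parsing sketch with a toy PCFG and NLTK's ViterbiParser.
    from nltk import PCFG
    from nltk.parse import ViterbiParser

    grammar = PCFG.fromstring("""
        S   -> NP VP    [1.0]
        NP  -> Det N    [0.6]
        NP  -> 'she'    [0.4]
        VP  -> V NP     [0.7]
        VP  -> V        [0.3]
        Det -> 'the'    [1.0]
        N   -> 'book'   [0.5]
        N   -> 'flight' [0.5]
        V   -> 'books'  [0.5]
        V   -> 'reads'  [0.5]
    """)

    parser = ViterbiParser(grammar)
    for tree in parser.parse("she books the flight".split()):
        print(tree)   # the most probable parse, annotated with its probability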
FILL-IN-THE-BLANK QUESTIONS

1. Statistical parsing uses a _______ model to assign probabilities to parse trees. Answer: Probabilistic
2. A probabilistic version of CFG is called _______. Answer: Probabilistic Context-Free Grammar (PCFG)
3. A _______ is a corpus annotated with parse trees, commonly used for supervised learning. Answer: Treebank
4. In statistical parsing, the probability of a sentence is the _______ of the probabilities of all its derivations. Answer: Sum
5. The _______ algorithm helps efficiently determine the most probable derivation for a sentence in a PCFG. Answer: Viterbi
6. _______ parsing starts from the root of the parse tree and applies grammar rules to generate possible trees. Answer: Top-down
7. _______ parsing starts from terminal symbols and works backward to find the root. Answer: Bottom-up
8. The F1 score is the harmonic mean of _______ and _______. Answer: Precision, Recall
9. Dependency grammar represents syntactic structure as _______ between words. Answer: Dependencies
10. An alternative to phrase structure grammar is _______ grammar, which focuses on labeled binary relations. Answer: Dependency

TOPIC 11 Text Analytics & Sentiment Analysis

1. Definition of Sentiment Analysis:
- Focuses on analyzing people's opinions, sentiments, and emotions in text.
- Uses NLP, statistics, and machine learning to identify sentiment content.
- Also known as opinion mining.

2. Key Sentiment Analysis Concepts:
- Semantic orientation and polarity: indicate positive, negative, or neutral sentiment.
- Subjective impressions: based on personal judgments, emotional state, or contextual polarity.

3. Levels of Sentiment Analysis:
- Document level: analyzes the overall sentiment of a document.
- Sentence level: identifies the sentiment of each sentence, distinguishing objective from subjective content.
- Entity and aspect level: examines finer details, such as opinions on specific product features.

4. Challenges in Sentiment Analysis:
- The complexity of opinions expressed in text.
- Issues like negation, sarcasm, and other rhetorical devices.

5. Steps in Sentiment Analysis Using NLTK:
- Train classifiers with labeled data.
- Use feature extraction (e.g., the Bag of Words model) to classify sentiment (a small sketch appears after the NER outline below).

6. Evaluation Techniques:
- Sentiment lexicons and Pointwise Mutual Information (PMI).

Information Extraction & Named Entity Recognition (NER)

1. Definition of NER:
- Locates and classifies entities in text into categories such as names, organizations, locations, etc.
- Enhances the meaning of text documents through information extraction.

2. Applications of NER:
- Customer support systems for identifying issues.
- Resume filtering by extracting required skills.
- Healthcare data analysis for identifying symptoms and diseases.

3. Types of NER Systems:
- Dictionary-based: uses predefined lists.
- Rule-based: relies on morphological and contextual patterns.
- Machine learning-based: trains models with annotated data.
- Deep learning-based: leverages non-linear feature representations.

4. NER Implementation:
- Techniques include tokenization, part-of-speech tagging, noun phrase chunking, and IOB tagging.
- Libraries like NLTK and spaCy are used for implementation.

5. spaCy Highlights:
- Pre-trained on the OntoNotes 5 corpus.
- Supports multiple entity types and requires minimal setup.
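A minimal spaCy NER sketch, assuming spaCy is installed and the small English pipeline has been downloaded (python -m spacy download en_core_web_sm); the example sentence is made up.

    # Named entity recognition sketch with spaCy's pre-trained English pipeline.
    import spacy

    nlp = spacy.load("en_core_web_sm")
    doc = nlp("Apple is looking at buying a U.K. startup in London for $1 billion.")

    for ent in doc.ents:
        print(ent.text, ent.label_)   # entity span and its type (ORG, GPE, MONEY, ...)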
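Looking back at the sentiment-analysis steps listed earlier in this topic, here is a small NLTK sketch that extracts bag-of-words features and trains a Naïve Bayes classifier. The four labeled sentences are made up; a real setup would train on a corpus such as movie reviews.

    # Bag-of-words sentiment classification sketch with NLTK.
    import nltk

    def bag_of_words(text):
        # Unordered word-presence features (the Bag of Words model).
        return {word: True for word in text.lower().split()}

    train_data = [
        ("I love this amazing and helpful product", "positive"),
        ("What a wonderful great experience", "positive"),
        ("This is terrible and disappointing", "negative"),
        ("I hate the awful service", "negative"),
    ]

    train_set = [(bag_of_words(text), label) for text, label in train_data]
    classifier = nltk.NaiveBayesClassifier.train(train_set)

    print(classifier.classify(bag_of_words("a great and helpful tool")))         # positive
    print(classifier.classify(bag_of_words("such an awful disappointing day")))  # negative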
Fill in the blanks:

1. Sentiment analysis, also known as ________, uses natural language processing to identify and classify emotions in text. Answer: Opinion mining
2. The three levels of sentiment analysis are ________, ________, and ________. Answer: document level, sentence level, and entity and aspect level
3. Sentiment analysis aims to classify text into three categories: ________, ________, and ________. Answer: Positive, negative, neutral
4. The ________ level of sentiment analysis evaluates the overall sentiment of a document. Answer: Document
5. ________ analysis examines specific attributes of an entity to determine sentiment, such as a product's features. Answer: Aspect
6. The ________ model represents text as a collection of unordered words, sorted by frequency of occurrence. Answer: Bag of Words
7. One major challenge in sentiment analysis is understanding ________, such as "not bad," which can invert the sentiment. Answer: Negation
8. Challenges in sentiment analysis include ________, ________, and rhetorical devices such as sarcasm. Answer: negation, sarcasm
9. Sentiment analysis is often applied to sources like social media posts, ________, and ________. Answer: Product reviews, news articles
10. ________ is an information-theoretic measure used to identify word associations or collocations in text. Answer: Pointwise Mutual Information (PMI)
11. ________ is a Python library commonly used for implementing sentiment analysis using tools like classifiers and feature extraction. Answer: NLTK
12. Positive sentiment words include ________, ________, and ________. Answer: Love, amazing, helpful
13. Named Entity Recognition seeks to classify entities in text into predefined categories like ________, ________, and ________. Answer: Persons, locations, organizations
14. The ________ approach to NER uses predefined vocabulary lists to match entities in text. Answer: Dictionary-based
15. ________ NER models rely on statistical techniques and feature-based representations to detect entities. Answer: Machine learning-based
16. The ________ approach to NER uses non-linear data representations to capture complex relationships in text. Answer: Deep learning-based
17. IOB tagging represents tokens as ________, ________, or ________ in a chunking structure. Answer: Inside, outside, beginning
18. Tools like ________ and ________ are commonly used for implementing Named Entity Recognition. Answer: NLTK, spaCy
19. A rule-based NER system uses ________ rules and ________ rules to identify entities in text. Answer: Morphological, contextual
20. In spaCy, NER is pre-trained on the ________ corpus, which supports multiple entity types. Answer: OntoNotes 5
21. The process of grouping words into chunks like noun phrases is called ________. Answer: Chunking
22. The spaCy command to extract named entities involves using the function ________. Answer: nlp()
23. Named Entity Recognition seeks to locate and classify entities into categories such as ________, ________, and ________. Answer: names of persons, organizations, locations
24. ________ is a Python library widely used for NER and natural language processing. Answer: spaCy

TOPIC 12

1. Definition of Speech Recognition:
- An interdisciplinary field combining Computer Science and Computational Linguistics.
- Converts human speech into text using algorithms and technologies.
- Known as Automatic Speech Recognition (ASR) or Speech-to-Text.

2. Trends in Speech Recognition (2024 and Beyond):
- Replacement of chat-based AI interfaces with voice input.
- Improved AI-powered voice assistants (e.g., integration of large language models).
- Accessibility improvements: automatic captions for social media.
- Enhanced collaboration tools like Google Duet AI.

3. Applications:
- Voice assistants: phones, smart devices, and cars.
- Speech-to-text tools: automated transcription for meetings.
- Accessibility tools: benefiting people with disabilities.
- Security: speaker recognition for authentication.

4. How It Works:
- Speech is digitized using a microphone and an analog-to-digital converter.
- Core techniques involve neural networks, Hidden Markov Models (HMMs), and Voice Activity Detectors (VADs).
- Speech signals are analyzed at 10-millisecond intervals to generate cepstral coefficients (vectors representing signal features).

5. Challenges in Speech Recognition:
- Variability in pronunciation (e.g., dialects, accents).
- Homophones (e.g., "bear" vs. "bare").
- The impact of noise and emotion.
- Difficulty in identifying pauses or prosody.

6. Data and Formats:
- Common audio formats: WAV, MP3, M4A, WMA.
- Sampling rates:
  - Telephony systems use 8 kHz.
  - Human hearing ranges between 20 Hz and 20,000 Hz.

7. Speech Analysis Applications:
- Speaker diarization: identifying "who spoke when."
- Emotion classification: detecting speech emotions like happiness or anger.
- Text-to-speech: generating natural-sounding speech.

8. Python Packages for Speech Recognition:
- SpeechRecognition (Google Web Speech API wrapper).
- Pocketsphinx (offline recognition).
- Other APIs: Google Cloud Speech, IBM Speech to Text, Whisper (OpenAI).

9. Self-Exercise and Implementation:
- Record sentences as .wav files.
- Use Python libraries like SpeechRecognition to recognize the speech (a minimal sketch follows the fill-in-the-blank questions below).
- Measure the transcription accuracy.

FILL IN THE BLANK QUESTIONS

1. Speech recognition is an interdisciplinary subfield combining __________ and __________. Answer: Computer Science, Computational Linguistics
2. Speech signals can be represented in two domains: __________ and __________. Answer: time, frequency
3. The reverse of speech recognition, converting text to speech, is known as __________. Answer: speech synthesis
4. The __________ is commonly used in modern speech recognition systems for decoding audio into text. Answer: Hidden Markov Model (HMM)
5. AI-driven speech recognition is expected to replace traditional __________ interfaces. Answer: chat-based AI
6. The vector representation of speech signal fragments is known as __________ coefficients. Answer: cepstral
7. The typical sampling rate for telephony systems is __________ kHz. Answer: 8
8. Tools like __________ are used for offline speech recognition. Answer: Pocketsphinx
9. In Python, the __________ package provides a wrapper for the Google Web Speech API. Answer: SpeechRecognition
10. Speaker diarization involves identifying __________ within a speech signal. Answer: "who spoke when"
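To close the Topic 12 self-exercise, here is a minimal transcription sketch with the SpeechRecognition package. It assumes the package is installed; "sample.wav" is a placeholder file name, and the Google Web Speech API call requires an internet connection.

    # Speech-to-text sketch using the SpeechRecognition package.
    import speech_recognition as sr

    recognizer = sr.Recognizer()

    with sr.AudioFile("sample.wav") as source:      # a previously recorded .wav file
        audio = recognizer.record(source)           # read the whole file into memory

    try:
        text = recognizer.recognize_google(audio)   # Google Web Speech API wrapper
        print("Transcription:", text)
    except sr.UnknownValueError:
        print("Speech was unintelligible")
    except sr.RequestError as err:
        print("API request failed:", err)

Comparing the printed transcription against the sentence that was actually recorded gives a rough measure of accuracy, as suggested in the self-exercise.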
