NLP-Unit-3 PDF
Document Details
Government Polytechnic, Nagpur
Dr. S. S. Gharde
Summary
This document covers Naïve Bayes methods for text classification. It includes topics such as classifier training, evaluation metrics (precision, recall, F-measure), and applications such as spam detection and sentiment analysis. The document is presented as a series of slides, with examples and formulations included.
Full Transcript
UNIT III: Naïve Bayes and Text Classification
By Dr. S. S. Gharde, Dept. of Information Technology / AIML, Government Polytechnic Nagpur

Contents
3.1 Naive Bayes Classifiers
3.2 Training the Naive Bayes Classifier, Worked example
3.3 Naive Bayes for other text classification tasks
3.4 Naive Bayes as a Language Model
3.5 Evaluation: Confusion Matrix, Accuracy, Precision, Recall, F-measure
3.6 Test sets and Cross-validation
3.7 Statistical Significance Testing
3.8 Avoiding Harms in Classification

Introduction
Classification lies at the heart of both human and machine intelligence. Examples:
- Deciding what letter, word, or image has been presented to our senses
- Recognizing faces or voices
- Sorting mail
- Assigning grades to homework
Classification for text: text categorization, sentiment analysis.

Sentiment Analysis
Positive or negative movie review?
+ ...zany characters and richly applied satire, and some great plot twists...
− It was pathetic. The worst part about it was the boxing scenes...
+ ...awesome caramel sauce and sweet toasty almonds. I love this place!...
− ...awful pizza and ridiculously overpriced...

Why sentiment analysis?
- Movie: is this review positive or negative?
- Products: what do people think about the new iPhone?
- Public sentiment: how is consumer confidence?
- Politics: what do people think about this candidate or issue?
- Prediction: predict election outcomes or market trends from sentiment

Scherer Typology of Affective States
- Emotion: brief, organically synchronized evaluation of a major event (angry, sad, joyful, fearful, ashamed, proud, elated)
- Mood: diffuse, non-caused, low-intensity, long-duration change in subjective feeling (cheerful, gloomy, irritable, listless, depressed, buoyant)
- Interpersonal stances: affective stance toward another person in a specific interaction (friendly, flirtatious, distant, cold, warm, supportive, contemptuous)
- Attitudes: enduring, affectively colored beliefs and dispositions towards objects or persons (liking, loving, hating, valuing, desiring)
- Personality traits: stable personality dispositions and typical behavior tendencies (nervous, anxious, reckless, morose, hostile, jealous)

Basic Sentiment Classification
Sentiment analysis is the detection of attitudes. The simple task we focus on in this chapter: is the attitude of this text positive or negative? We return to affect classification in later chapters.

Summary: Text Classification
Sentiment analysis, spam detection, authorship identification, language identification, assigning subject categories, topics, or genres, ...

Text Classification: definition
Input: a document d and a fixed set of classes C = {c_1, c_2, ..., c_J}
Output: a predicted class c ∈ C

Classification Methods: Supervised Machine Learning
Input: a document d, a fixed set of classes C = {c_1, c_2, ..., c_J}, and a training set of m hand-labeled documents (d_1, c_1), ..., (d_m, c_m)
Output: a learned classifier γ: d → c
Any kind of classifier can be used: Naïve Bayes, logistic regression, neural networks, k-nearest neighbors, ...

Naive Bayes Classifiers
A simple ("naive") classification method based on Bayes' rule. It relies on a very simple representation of the document: the bag of words.

The Bag of Words Representation
(Figure: a document is reduced to an unordered collection of word counts, e.g. seen 2, sweet 1, whimsical 1, recommend 1, happy 1, ..., and the classifier γ maps this bag of counts to a class c.)
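As a minimal sketch of the bag-of-words idea (in Python, with a made-up review used purely for illustration), a document is reduced to word counts, discarding order and position:

    from collections import Counter

    def bag_of_words(document):
        # Lowercase and split on whitespace; word order and position are discarded,
        # only the count of each word type is kept.
        tokens = document.lower().split()
        return Counter(tokens)

    # Hypothetical toy review, for illustration only.
    review = "I loved it I have seen it twice and I would recommend it it is sweet and whimsical"
    print(bag_of_words(review))
    # Counter({'it': 4, 'i': 3, 'and': 2, ...})  -- the classifier sees only these counts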
Bayes' Rule Applied to Documents and Classes
For a document d and a class c:
P(c | d) = P(d | c) P(c) / P(d)

Naive Bayes Classifier (I)
c_MAP = argmax_{c ∈ C} P(c | d)              (MAP is "maximum a posteriori", i.e. the most likely class)
      = argmax_{c ∈ C} P(d | c) P(c) / P(d)  (Bayes' rule)
      = argmax_{c ∈ C} P(d | c) P(c)         (dropping the denominator)

Naive Bayes Classifier (II)
c_MAP = argmax_{c ∈ C} P(d | c) P(c) = argmax_{c ∈ C} P(x_1, x_2, ..., x_n | c) P(c)
where the document d is represented as features x_1, ..., x_n; P(x_1, ..., x_n | c) is the "likelihood" and P(c) is the "prior".

Naive Bayes assumption
This is the conditional independence assumption: the probabilities P(x_i | c) are independent given the class c and hence can be "naively" multiplied:
P(x_1, ..., x_n | c) = P(x_1 | c) · P(x_2 | c) · P(x_3 | c) · ... · P(x_n | c)

Multinomial Naive Bayes Classifier
c_MAP = argmax_{c ∈ C} P(x_1, x_2, ..., x_n | c) P(c)
c_NB = argmax_{c_j ∈ C} P(c_j) ∏_{x ∈ X} P(x | c)

Applying Multinomial Naive Bayes Classifiers to Text Classification
Let positions be all word positions in the test document. Then:
c_NB = argmax_{c_j ∈ C} P(c_j) ∏_{i ∈ positions} P(x_i | c_j)

Training the Naïve Bayes Classifier
For the class prior P(c): let N_c be the number of documents in our training data with class c and N_doc be the total number of documents. Then:
P̂(c) = N_c / N_doc
For the likelihood:
P̂(w_i | c_j) = count(w_i, c_j) / Σ_{w ∈ V} count(w, c_j)
Since naive Bayes naively multiplies all the feature likelihoods together, a zero probability in the likelihood term for any class will cause the probability of that class to be zero. The simplest solution is add-one (Laplace) smoothing.
While Laplace smoothing is usually replaced by more sophisticated smoothing algorithms in language modeling, it is commonly used in naive Bayes text categorization:
P̂(w_i | c) = (count(w_i, c) + 1) / (Σ_{w ∈ V} count(w, c) + |V|)
The vocabulary V consists of the union of all the word types in all classes.

Training procedure:
- Ignore unknown words in the test data (words not seen in the training data).
- Ignore stop words, i.e. very frequent words like "the" and "a".
- Compute the prior probability P(c).
- Compute the conditional probability (likelihood) estimates. Since naive Bayes multiplies all the feature likelihoods together, zero probabilities would make a class probability zero.
- Apply add-one (Laplace) smoothing.

Worked example
1. Priors from training: P̂(c_j) = N_{c_j} / N_total, giving P(−) = 3/5 and P(+) = 2/5.
2. Drop "with" (it does not appear in the training vocabulary).
3. Likelihoods from training: P̂(w_i | c) = (count(w_i, c) + 1) / (Σ_{w ∈ V} count(w, c) + |V|)
4. Score the test document under each class and pick the higher-scoring class.
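A minimal Python sketch of training and applying a multinomial naive Bayes classifier with add-one smoothing, following the formulas above. The toy training set below is hypothetical, chosen only so that its priors match the worked example (three negative and two positive documents) and the test document contains an unseen word ("with"):

    import math
    from collections import Counter, defaultdict

    def train_naive_bayes(documents):
        # documents: list of (list_of_tokens, class_label)
        n_doc = len(documents)
        classes = set(label for _, label in documents)
        vocab = set(w for tokens, _ in documents for w in tokens)
        log_prior, log_likelihood = {}, defaultdict(dict)
        for c in classes:
            docs_c = [tokens for tokens, label in documents if label == c]
            log_prior[c] = math.log(len(docs_c) / n_doc)        # P(c) = N_c / N_doc
            counts = Counter(w for tokens in docs_c for w in tokens)
            total = sum(counts.values())
            for w in vocab:                                      # add-one (Laplace) smoothing
                log_likelihood[c][w] = math.log((counts[w] + 1) / (total + len(vocab)))
        return log_prior, log_likelihood, vocab

    def classify(tokens, log_prior, log_likelihood, vocab):
        # Sum of log probabilities instead of a product; unknown test words are ignored.
        scores = {c: log_prior[c] + sum(log_likelihood[c][w] for w in tokens if w in vocab)
                  for c in log_prior}
        return max(scores, key=scores.get)

    # Hypothetical toy data, for illustration only (3 negative, 2 positive documents).
    train = [("just plain boring".split(), "-"),
             ("entirely predictable and lacks energy".split(), "-"),
             ("no surprises and very few laughs".split(), "-"),
             ("very powerful".split(), "+"),
             ("the most fun film of the summer".split(), "+")]
    prior, likelihood, vocab = train_naive_bayes(train)
    print(classify("predictable with no fun".split(), prior, likelihood, vocab))
    # prints "-" ; the word "with" is ignored because it is not in the training vocabulary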
Naive Bayes for Other Text Classification Tasks
Spam detection: deciding whether a particular piece of email is spam (unsolicited bulk email). This was one of the first applications of naive Bayes to text classification. A common solution here, rather than using all the words as individual features, is to predefine likely sets of words or phrases as features. For example, the open-source SpamAssassin tool predefines features like the phrase "one hundred percent guaranteed", or the feature "mentions millions of dollars".
More sample SpamAssassin features:
- Email subject line is all capital letters
- Contains phrases of urgency like "urgent reply"
- Email subject line contains "online pharmaceutical"
- HTML has unbalanced "head" tags
- Claims you can be removed from the list

Language ID: determining what language a given piece of text is written in. The most effective naive Bayes features are character n-grams or byte n-grams. A widely used naive Bayes system is langid.py, which begins with all possible n-grams of lengths 1-4. Language ID systems are trained on multilingual text, such as Wikipedia.

Naive Bayes as a Language Model
A naive Bayes model can be viewed as a set of class-specific unigram language models, in which the model for each class instantiates a unigram language model. The model also assigns a probability to each sentence.
Each class = a unigram language model:
- Each word is assigned P(word | c).
- Each sentence is assigned P(s | c) = ∏ P(word | c).
Example (class +): P(I | +) = 0.1, P(love | +) = 0.1, P(this | +) = 0.01, P(fun | +) = 0.05, P(film | +) = 0.1, so
P("I love this fun film" | +) = 0.1 × 0.1 × 0.01 × 0.05 × 0.1 = 0.0000005

Which class assigns the higher probability to s?
Model +: P(I) = 0.1, P(love) = 0.1,   P(this) = 0.01, P(fun) = 0.05,  P(film) = 0.1
Model −: P(I) = 0.2, P(love) = 0.001, P(this) = 0.01, P(fun) = 0.005, P(film) = 0.1
P("I love this fun film" | +) = 0.1 × 0.1 × 0.01 × 0.05 × 0.1 = 0.0000005
P("I love this fun film" | −) = 0.2 × 0.001 × 0.01 × 0.005 × 0.1 = 0.000000001
So P(s | +) > P(s | −).
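A small Python sketch that treats each class as a unigram language model and scores the sentence with the per-word probabilities from the example above:

    pos_model = {"I": 0.1, "love": 0.1,   "this": 0.01, "fun": 0.05,  "film": 0.1}
    neg_model = {"I": 0.2, "love": 0.001, "this": 0.01, "fun": 0.005, "film": 0.1}

    def sentence_prob(sentence, model):
        # P(s | c) is the product of the unigram probabilities P(word | c).
        p = 1.0
        for word in sentence.split():
            p *= model[word]
        return p

    s = "I love this fun film"
    print(sentence_prob(s, pos_model))  # 5e-07, i.e. 0.0000005
    print(sentence_prob(s, neg_model))  # 1e-09, i.e. 0.000000001 -> the + model assigns higher probability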
Evaluation: Confusion Matrix
We now turn to methods for evaluating text classification. The human-defined labels for each document that we are trying to match are referred to as the gold labels. A confusion matrix is a table for visualizing how an algorithm performs with respect to the human gold labels, using two dimensions (system output and gold labels), with each cell labeling a set of possible outcomes. For binary classification it is a 2×2 matrix, with actual values on one axis and predicted values on the other.

Let's understand the terms in the confusion matrix (true positive, true negative, false negative, and false positive) with an example. A machine learning model is trained to predict tumors in patients, and the test dataset consists of 100 people.
- True Positive (TP): the model correctly predicts the positive class (prediction and actual are both positive). In the example, 10 people who have tumors are predicted positive by the model.
- True Negative (TN): the model correctly predicts the negative class (prediction and actual are both negative). In the example, 60 people who don't have tumors are predicted negative by the model.
- False Positive (FP): the model wrongly predicts positive for an actual negative (predicted positive, actually negative). In the example, 22 people are predicted as having a tumor although they don't. FP is also called a Type I error.
- False Negative (FN): the model wrongly predicts negative for an actual positive (predicted negative, actually positive). In the example, 8 people who have tumors are predicted as negative. FN is also called a Type II error.

Evaluation: Accuracy
Accuracy is the number of correctly classified data instances over the total number of data instances:
Accuracy = (TP + TN) / (TP + TN + FP + FN)
Accuracy may not be a good measure if the dataset is not balanced; in that case precision and recall are preferred.

Evaluation: Precision
Precision measures the percentage of items the system detected (i.e., items the system labeled as positive) that are in fact positive (according to the human gold labels):
Precision = TP / (TP + FP)

Evaluation: Recall
Recall measures the percentage of items actually present in the input that were correctly identified by the system:
Recall = TP / (TP + FN)
There are many ways to define a single metric that incorporates aspects of both precision and recall. The simplest of these combinations is the F-measure.

Evaluation: F-measure
The F-measure is a single number that combines P and R:
F_β = (β² + 1) P R / (β² P + R)
The β parameter differentially weights the importance of recall and precision, based perhaps on the needs of an application. Values of β > 1 favor recall, while values of β < 1 favor precision. We almost always use the balanced F1 measure (i.e., β = 1):
F1 = 2 P R / (P + R)
The F1 score becomes 1 only when precision and recall are both 1, and it is high only when both precision and recall are high. F1 is the harmonic mean of precision and recall and is a better measure than accuracy.
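A short Python sketch computing these metrics from the confusion-matrix counts of the tumor example (TP = 10, TN = 60, FP = 22, FN = 8):

    tp, tn, fp, fn = 10, 60, 22, 8   # counts from the 100-patient tumor example above

    accuracy  = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall    = tp / (tp + fn)
    f1        = 2 * precision * recall / (precision + recall)   # harmonic mean of P and R

    print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
    # accuracy=0.70 precision=0.31 recall=0.56 f1=0.40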
Test Sets and Cross-validation
We use the training set to train the model, then use the development test set (also called a devset) to tune parameters and decide which model is best. Finally, we run the chosen model on the test set to report its performance.
Cross-validation lets us use all our data for both training and testing. We choose a number k and partition our data into k disjoint subsets called folds. We choose one of those k folds as a test set, train our classifier on the remaining k − 1 folds, and compute the error rate on the test set; then we repeat with another fold as the test set. If we choose k = 10, we train 10 different models (each on 90% of our data), test each of them, and average these 10 values. This is called 10-fold cross-validation.

Statistical Significance Testing
We often need to compare the performance of two systems, A and B; let δ(x) denote the observed performance difference between them on a test set x. In the paradigm of statistical hypothesis testing, we perform the test by formalizing two hypotheses:
H0: δ(x) ≤ 0
H1: δ(x) > 0
The hypothesis H0, called the null hypothesis, supposes that δ(x) is actually negative or zero, meaning that A is not better than B. We would like to know if we can confidently rule out this hypothesis and instead support H1, that A is better. We formalize this as the p-value: the probability, assuming the null hypothesis H0 is true, of seeing a δ(x) as large as the one we saw, or larger:
p-value = P(δ(X) ≥ δ(x) | H0 is true)
A very small p-value means that the difference we observed is very unlikely under the null hypothesis, and we can reject the null hypothesis. A threshold of 0.01 means that if the p-value is less than 0.01, we reject the null hypothesis and conclude that A is indeed better than B. We say that a result (e.g., "A is better than B") is statistically significant if the observed δ has a probability below the threshold and we therefore reject the null hypothesis.
To compute p-values in NLP we usually use non-parametric tests based on sampling. There are two common non-parametric tests used in NLP: approximate randomization and the bootstrap test.

Avoiding Harms in Classification
It is important to avoid harms that may result from classifiers. One class of harms is representational harms: harms caused by a system that demeans a social group, for example by perpetuating negative stereotypes about them. In other tasks, classifiers may lead to both representational harms and other harms, such as censorship. For example, the important text classification task of toxicity detection is the task of detecting hate speech, abuse, harassment, or other kinds of toxic language.

End of UNIT III