Sentiment and Text Classification Techniques
University of Rijeka
Sanda Martinčić-Ipšić
Summary
This document is a presentation on sentiment analysis and text classification with the Naive Bayes method, discussing techniques for dealing with negation, sentiment lexicons, and the harms that classifiers can cause.
Full Transcript
Sentiment Text Classification
Sanda Martinčić-Ipšić, Full professor
[email protected]

Is this spam?

Who wrote which Federalist papers?
1787-8: anonymous essays try to convince New York to ratify the U.S. Constitution: Jay, Madison, Hamilton. The authorship of 12 of the letters was in dispute. 1963: solved by Mosteller and Wallace using Bayesian methods.

What is the subject of this medical article?
A MEDLINE article is assigned a category from the MeSH subject category hierarchy: Antagonists and Inhibitors, Blood Supply, Chemistry, Drug Therapy, Embryology, Epidemiology, …

Positive or negative movie review?
+ ...zany characters and richly applied satire, and some great plot twists
− It was pathetic. The worst part about it was the boxing scenes...
+ ...awesome caramel sauce and sweet toasty almonds. I love this place!
− ...awful pizza and ridiculously overpriced...

Why sentiment analysis?
Movies: is this review positive or negative?
Products: what do people think about the new iPhone?
Public sentiment: how is consumer confidence?
Politics: what do people think about this candidate or issue?
Prediction: predict election outcomes or market trends from sentiment

Scherer Typology of Affective States
Emotion: brief, organically synchronized evaluation of a major event (angry, sad, joyful, fearful, ashamed, proud, elated)
Mood: diffuse, non-caused, low-intensity, long-duration change in subjective feeling (cheerful, gloomy, irritable, listless, depressed, buoyant)
Interpersonal stances: affective stance toward another person in a specific interaction (friendly, flirtatious, distant, cold, warm, supportive, contemptuous)
Attitudes: enduring, affectively colored beliefs, dispositions towards objects or persons (liking, loving, hating, valuing, desiring)
Personality traits: stable personality dispositions and typical behavior tendencies (nervous, anxious, reckless, morose, hostile, jealous)

Basic Sentiment Classification
Sentiment analysis is the detection of attitudes. The simple task: is the attitude of this text positive or negative?

Summary: Text Classification
Sentiment analysis, spam detection, authorship identification, language identification, assigning subject categories, topics, or genres, …

Text Classification: definition
Input: a document d and a fixed set of classes C = {c_1, c_2, …, c_J}
Output: a predicted class c ∈ C

Classification Methods: Supervised Machine Learning
Input: a document d, a fixed set of classes C = {c_1, c_2, …, c_J}, and a training set of m hand-labeled documents (d_1, c_1), …, (d_m, c_m)
Output: a learned classifier γ: d → c
Any kind of classifier can be used: Naïve Bayes, logistic regression, neural networks, k-nearest neighbors, …

The Naive Bayes Classifier
Naive Bayes intuition: a simple ("naive") classification method based on Bayes' rule that relies on a very simple representation of the document: the bag of words.

The Bag of Words Representation
[Figure: a review document is reduced to unordered word counts, e.g., seen 2, sweet 1, whimsical 1, recommend 1, happy 1, …, and the classifier γ maps this bag of counts to a class c.]
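To make the bag-of-words representation concrete, here is a minimal sketch in Python; the crude regex tokenizer and the function name are illustrative assumptions, not from the slides:

```python
import re
from collections import Counter

def bag_of_words(text: str) -> Counter:
    """Reduce a document to unordered word counts: word position is discarded."""
    tokens = re.findall(r"[a-z']+", text.lower())  # crude tokenizer (an assumption)
    return Counter(tokens)

print(bag_of_words("I love this movie. It's sweet, and I love the whimsical humor."))
# Counter({'i': 2, 'love': 2, 'this': 1, 'movie': 1, "it's": 1, 'sweet': 1, ...})
```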
Bayes' Rule Applied to Documents and Classes
For a document d and a class c:
$P(c|d) = \frac{P(d|c)\,P(c)}{P(d)}$

Naive Bayes Classifier (I)
MAP is "maximum a posteriori", i.e., the most likely class:
$c_{MAP} = \arg\max_{c \in C} P(c|d) = \arg\max_{c \in C} \frac{P(d|c)\,P(c)}{P(d)} = \arg\max_{c \in C} P(d|c)\,P(c)$
(first applying Bayes' rule, then dropping the denominator, which is the same for every class)

Naive Bayes Classifier (II)
$c_{MAP} = \arg\max_{c \in C} P(d|c)\,P(c)$, where $P(d|c)$ is the "likelihood" and $P(c)$ is the "prior". Document d is represented as features $x_1, \ldots, x_n$.

Multinomial Naive Bayes Independence Assumptions
Bag of words assumption: assume word position doesn't matter.
Conditional independence: assume the feature probabilities $P(x_i|c_j)$ are independent given the class $c_j$:
$P(x_1, \ldots, x_n | c) = P(x_1|c) \cdot P(x_2|c) \cdots P(x_n|c)$

Applying Multinomial Naive Bayes Classifiers to Text Classification
$c_{NB} = \arg\max_{c_j \in C} P(c_j) \prod_{i \in positions} P(w_i|c_j)$
where positions = all word positions in the test document.

Problems with multiplying lots of probabilities
Multiplying lots of probabilities can result in floating-point underflow: 0.0006 × 0.0007 × 0.0009 × 0.01 × 0.5 × 0.000008 × …
Idea: use logs, because log(ab) = log(a) + log(b). We sum logs of probabilities instead of multiplying probabilities, so we actually do everything in log space:
$c_{NB} = \arg\max_{c_j \in C} \Big[ \log P(c_j) + \sum_{i \in positions} \log P(w_i|c_j) \Big]$
Notes:
1) Taking the log doesn't change the ranking of classes: the class with the highest probability also has the highest log probability.
2) It's a linear model: just a max of a sum of weights, a linear function of the inputs. So naive Bayes is a linear classifier.

Learning the Multinomial Naive Bayes Model
First attempt: maximum likelihood estimates, which simply use the frequencies in the data.
Prior probability of class $c_j$: $\hat{P}(c_j) = \frac{N_{c_j}}{N_{total}}$
Parameter estimation: create a mega-document for topic j by concatenating all documents in this topic, and use the frequency of w in the mega-document:
$\hat{P}(w_i|c_j) = \frac{count(w_i, c_j)}{\sum_{w \in V} count(w, c_j)}$
i.e., the fraction of times word $w_i$ appears among all words in documents of topic $c_j$.

Problem with Maximum Likelihood
What if we have seen no training documents containing the word fantastic classified in the topic positive (thumbs-up)? Then $\hat{P}(\text{fantastic}|\text{positive}) = 0$, and zero probabilities cannot be conditioned away, no matter the other evidence!
Laplace (add-1) smoothing for Naïve Bayes:
$\hat{P}(w_i|c_j) = \frac{count(w_i, c_j) + 1}{\sum_{w \in V}\big(count(w, c_j) + 1\big)} = \frac{count(w_i, c_j) + 1}{\big(\sum_{w \in V} count(w, c_j)\big) + |V|}$

Multinomial Naïve Bayes: Learning
From the training corpus, extract the Vocabulary, then calculate the $P(c_j)$ and $P(w_k|c_j)$ terms:
For each $c_j$ in C do:
  $docs_j$ ← all docs with class = $c_j$
  $P(c_j) \leftarrow \frac{|docs_j|}{|\text{total \# documents}|}$
  $Text_j$ ← single doc containing all of $docs_j$
  For each word $w_k$ in Vocabulary:
    $n_k$ ← # of occurrences of $w_k$ in $Text_j$
    $P(w_k|c_j) \leftarrow \frac{n_k + \alpha}{n + \alpha\,|Vocabulary|}$   (where n is the total number of word tokens in $Text_j$)

Unknown words
What about unknown words that appear in our test data but not in our training data or vocabulary? We ignore them: remove them from the test document and don't include any probability for them at all. Why don't we build an unknown word model? It doesn't help: knowing which class has more unknown words is not generally helpful.

Stop words
Some systems ignore stop words: very frequent words like the and a. They sort the vocabulary by word frequency in the training set, call the top 10 or 50 words the stopword list, and remove all stop words from both training and test sets, as if they were never there. But removing stop words doesn't usually help, so in practice most NB algorithms use all words and don't use stopword lists. A minimal implementation of this training and log-space scoring procedure is sketched below.
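The following sketch puts the pieces together: maximum-likelihood priors, add-1 smoothed likelihoods, log-space scoring, and silent dropping of unknown words. The function names and the five toy training documents are illustrative assumptions; the documents are chosen to be consistent with the priors P(−) = 3/5, P(+) = 2/5 and |V| = 20 used in the worked example that follows.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs):
    """docs: list of (tokens, label). Returns log priors, log likelihoods, vocab."""
    n_docs = len(docs)
    class_counts = Counter(label for _, label in docs)
    word_counts = defaultdict(Counter)              # per-class word frequencies
    for tokens, label in docs:
        word_counts[label].update(tokens)
    vocab = {w for counts in word_counts.values() for w in counts}
    log_prior = {c: math.log(n / n_docs) for c, n in class_counts.items()}
    log_lik = {}
    for c in class_counts:                          # add-1 (Laplace) smoothing
        total = sum(word_counts[c].values())
        log_lik[c] = {w: math.log((word_counts[c][w] + 1) / (total + len(vocab)))
                      for w in vocab}
    return log_prior, log_lik, vocab

def classify(tokens, log_prior, log_lik, vocab):
    """Sum log probabilities; unknown words are dropped from the test document."""
    scores = {c: lp + sum(log_lik[c][w] for w in tokens if w in vocab)
              for c, lp in log_prior.items()}
    return max(scores, key=scores.get)

# Assumed toy corpus: 3 negative and 2 positive reviews, vocabulary size 20.
train = [("just plain boring".split(), "-"),
         ("entirely predictable and lacks energy".split(), "-"),
         ("no surprises and very few laughs".split(), "-"),
         ("very powerful".split(), "+"),
         ("the most fun film of the summer".split(), "+")]
log_prior, log_lik, vocab = train_nb(train)
print(classify("predictable with no fun".split(), log_prior, log_lik, vocab))  # "-"
```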
Sentiment and Binary Naive Bayes
Let's do a worked sentiment example with add-1 smoothing.
1. Priors from training, using $\hat{P}(c_j) = \frac{N_{c_j}}{N_{total}}$: $P(-) = 3/5$, $P(+) = 2/5$
2. Drop "with": it is an unknown word that does not occur in the training vocabulary.
3. Likelihoods from training, with |V| = 20.
4. Scoring the test set:
$\hat{P}(w_i|c) = \frac{count(w_i, c) + 1}{\big(\sum_{w \in V} count(w, c)\big) + |V|}$

Optimizing for sentiment analysis
For tasks like sentiment, word occurrence seems to be more important than word frequency: the occurrence of the word fantastic tells us a lot, while the fact that it occurs 5 times may not tell us much more.
Binary multinomial naive Bayes (binary NB): clip the word counts at 1.
Note: this is different from Bernoulli naive Bayes, where features are independent Booleans (binary variables) describing the inputs and P(w|c) is a fraction of documents.

Binary Multinomial Naïve Bayes: Learning
From the training corpus, extract the Vocabulary, then calculate the $P(c_j)$ and $P(w_k|c_j)$ terms. The only change from ordinary multinomial NB is that duplicates are first removed within each document:
For each $c_j$ in C do:
  $docs_j$ ← all docs with class = $c_j$
  In each doc, for each word type w, retain only a single instance of w
  $P(c_j) \leftarrow \frac{|docs_j|}{|\text{total \# documents}|}$
  $Text_j$ ← single doc containing all of $docs_j$
  For each word $w_k$ in Vocabulary:
    $n_k$ ← # of occurrences of $w_k$ in $Text_j$
    $P(w_k|c_j) \leftarrow \frac{n_k + \alpha}{n + \alpha\,|Vocabulary|}$

Binary Multinomial Naive Bayes on a test document d
First remove all duplicate words from d, then compute NB using the same equation. Note that counts can still be 2: binarization is within-document, so a word type counts at most once per document but can appear in several documents of a class.

An example
Compute all probabilities required for classification from 10 training examples and 2 classes (C = t and C = f), for a test instance with attributes A = m and B = q.
For C = t, we have
$\Pr(C = t) \prod_{j=1}^{2} \Pr(A_j = a_j \mid C = t) = \frac{1}{2} \cdot \frac{2}{5} \cdot \frac{2}{5} = \frac{2}{25}$
For C = f, we have
$\Pr(C = f) \prod_{j=1}^{2} \Pr(A_j = a_j \mid C = f) = \frac{1}{2} \cdot \frac{1}{5} \cdot \frac{2}{5} = \frac{1}{25}$
C = t is more probable, so t is the final class.

More on Sentiment Classification

Sentiment Classification: Dealing with Negation
"I really like this movie" vs. "I really don't like this movie": negation changes the meaning of "like" to negative. Negation can also change negative to positive-ish:
◦ Don't dismiss this film
◦ Doesn't let us get bored
Das, Sanjiv and Mike Chen. 2001. Yahoo! for Amazon: Extracting market sentiment from stock message boards. In Proceedings of the Asia Pacific Finance Association Annual Conference (APFA).
Bo Pang, Lillian Lee, and Shivakumar Vaithyanathan. 2002. Thumbs up? Sentiment Classification using Machine Learning Techniques. EMNLP-2002, 79-86.
Simple baseline method: add NOT_ to every word between a negation word and the following punctuation:
didn't like this movie , but I → didn't NOT_like NOT_this NOT_movie , but I
A minimal sketch of this tagging step follows below.
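This is a sketch of the NOT_ baseline on pre-tokenized text; the small negation list and punctuation set are illustrative assumptions (real systems use longer lists):

```python
NEGATIONS = {"not", "no", "never", "didn't", "don't", "doesn't", "isn't"}
PUNCTUATION = {".", ",", ";", ":", "!", "?"}

def mark_negation(tokens):
    """Prefix NOT_ to every token between a negation word and the next punctuation."""
    out, negating = [], False
    for tok in tokens:
        if tok in PUNCTUATION:
            negating = False            # punctuation ends the negated span
            out.append(tok)
        elif negating:
            out.append("NOT_" + tok)
        else:
            out.append(tok)
            if tok.lower() in NEGATIONS:
                negating = True         # start prefixing after the negation word
    return out

print(mark_negation("didn't like this movie , but I".split()))
# ["didn't", 'NOT_like', 'NOT_this', 'NOT_movie', ',', 'but', 'I']
```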
Sentiment Classification: Lexicons
Sometimes we don't have enough labeled training data. In that case, we can make use of pre-built word lists, called lexicons. There are various publicly available lexicons.

MPQA Subjectivity Cues Lexicon
Theresa Wilson, Janyce Wiebe, and Paul Hoffmann (2005). Recognizing Contextual Polarity in Phrase-Level Sentiment Analysis. Proc. of HLT-EMNLP-2005.
Riloff and Wiebe (2003). Learning extraction patterns for subjective expressions. EMNLP-2003.
Home page: https://mpqa.cs.pitt.edu/lexicons/subj_lexicon/
6885 words from 8221 lemmas, annotated for intensity (strong/weak): 2718 positive, 4912 negative.
+ : admirable, beautiful, confident, dazzling, ecstatic, favor, glee, great
− : awful, bad, bias, catastrophe, cheat, deny, envious, foul, harsh, hate

The General Inquirer
Philip J. Stone, Dexter C. Dunphy, Marshall S. Smith, Daniel M. Ogilvie. 1966. The General Inquirer: A Computer Approach to Content Analysis. MIT Press.
Home page: http://www.wjh.harvard.edu/~inquirer
List of Categories: http://www.wjh.harvard.edu/~inquirer/homecat.htm
Spreadsheet: http://www.wjh.harvard.edu/~inquirer/inquirerbasic.xls
Categories: Positiv (1915 words) and Negativ (2291 words); Strong vs. Weak, Active vs. Passive, Overstated vs. Understated; Pleasure, Pain, Virtue, Vice, Motivation, Cognitive Orientation, etc. Free for research use.

VADER: Valence Aware Dictionary and sEntiment Reasoner
Hutto, C.J. & Gilbert, E.E. (2014). VADER: A Parsimonious Rule-based Model for Sentiment Analysis of Social Media Text. Eighth International Conference on Weblogs and Social Media (ICWSM-14). Ann Arbor, MI, 2014.
https://github.com/cjhutto/vaderSentiment
VADER is a lexicon and rule-based sentiment analysis tool that is specifically attuned to sentiments expressed in social media. The lexicon is sensitive to both the polarity and the intensity of sentiments expressed in social media contexts. Sentiment ratings come from 10 independent human raters (all pre-screened, trained, and quality-checked for optimal inter-rater reliability). Over 9,000 token features were rated on a scale from "[–4] Extremely Negative" to "[+4] Extremely Positive", with allowance for "[0] Neutral (or Neither, N/A)". The lexicon keeps every lexical feature that had a non-zero mean rating and whose standard deviation was less than 2.5, as determined by the aggregate of those ten independent raters: in total, over 7,500 lexical features with validated valence scores indicating both the sentiment polarity (positive/negative) and the sentiment intensity on a scale from –4 to +4.
Example: "okay" has a positive valence of 0.9, "good" is 1.9, and "great" is 3.1; "horrible" is –2.5, the frowning emoticon :( is –2.2, and "sucks" and its slang derivative "sux" are both –1.5.
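VADER ships as a small Python package, so a minimal usage sketch looks like the following (assuming `pip install vaderSentiment`; the example sentences are our own):

```python
# Assumes: pip install vaderSentiment
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()
for sentence in ["The food was okay.", "The food was great!", "The service sux :("]:
    scores = analyzer.polarity_scores(sentence)   # keys: neg, neu, pos, compound
    print(f"{sentence!r:25} compound={scores['compound']:+.3f}")
```

The `compound` score is a normalized sum of the valence scores of the words in the sentence, mapped to the range [−1, +1].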
Using Lexicons in Sentiment Classification
Add a feature that gets a count whenever a word from the lexicon occurs, e.g., a feature called "this word occurs in the positive lexicon" or "this word occurs in the negative lexicon". Now all positive words (good, great, beautiful, wonderful) or all negative words count for that feature. Using 1-2 features isn't as good as using all the words, but when the training data is sparse or not representative of the test set, dense lexicon features can help.

Naive Bayes in Other Tasks: Spam Filtering
SpamAssassin features:
Mentions millions of dollars (e.g., $NN,NNN,NNN.NN)
From: starts with many numbers
Subject is all capitals
HTML has a low ratio of text to image area
"One hundred percent guaranteed"
Claims you can be removed from the list

Naive Bayes in Language ID
Determining what language a piece of text is written in. Features based on character n-grams do very well. It is important to train on many varieties of each language (e.g., American English varieties like African-American English, or English varieties around the world like Indian English).

Summary: Naive Bayes is Not So Naive
Very fast, low storage requirements.
Works well with very small amounts of training data.
Robust to irrelevant features: irrelevant features cancel each other without affecting results.
Very good in domains with many equally important features (decision trees suffer from fragmentation in such cases, especially with little data).
Optimal if the independence assumptions hold: if the assumed independence is correct, it is the Bayes optimal classifier for the problem.
A good, dependable baseline for text classification, but we will see other classifiers that give better accuracy. (Slide from Chris Manning)

Naïve Bayes: Relationship to Language Modeling
Generative model for multinomial naïve Bayes: the class (e.g., c = +) generates the words in sequence: X1 = I, X2 = love, X3 = this, X4 = fun, X5 = film.
Naïve Bayes classifiers can use any sort of feature: URLs, email addresses, dictionaries, network features. But if, as in the previous slides, we use only word features and we use all of the words in the text (not a subset), then naive Bayes has an important similarity to language modeling.

Each class = a unigram language model
Assigning each word: P(word|c). Assigning each sentence: $P(s|c) = \prod_i P(word_i|c)$.
Class pos: P(I|pos) = 0.1, P(love|pos) = 0.1, P(this|pos) = 0.01, P(fun|pos) = 0.05, P(film|pos) = 0.1
P("I love this fun film" | pos) = 0.1 × 0.1 × 0.01 × 0.05 × 0.1 = 0.0000005

Naïve Bayes as a Language Model
Which class assigns the higher probability to s = "I love this fun film"?
Model pos: I 0.1, love 0.1, this 0.01, fun 0.05, film 0.1
Model neg: I 0.2, love 0.001, this 0.01, fun 0.005, film 0.1
P(s|pos) = 5 × 10⁻⁷ > P(s|neg) = 1 × 10⁻⁹, so the positive class assigns the higher probability.
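The same toy calculation, done in log space; a short sketch where the per-class unigram models are copied from the example above and the function name is our own:

```python
import math

# Per-class unigram models from the example above
models = {
    "pos": {"I": 0.1, "love": 0.1,   "this": 0.01, "fun": 0.05,  "film": 0.1},
    "neg": {"I": 0.2, "love": 0.001, "this": 0.01, "fun": 0.005, "film": 0.1},
}

def log_prob(tokens, model):
    """log P(s|c): the sum of per-word log unigram probabilities."""
    return sum(math.log(model[w]) for w in tokens)

s = "I love this fun film".split()
for c, model in models.items():
    print(c, math.exp(log_prob(s, model)))
# pos ~5e-07, neg ~1e-09 -> the pos model assigns the higher probability
```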
Probabilistic framework
Generative model: each document is generated by a parametric distribution governed by a set of hidden parameters. The generative model makes two assumptions:
The data (the text documents) are generated by a mixture model.
There is a one-to-one correspondence between mixture components and document classes.

Mixture model
A mixture model models the data with a number of statistical distributions. Intuitively, each distribution corresponds to a data cluster, and the parameters of the distribution provide a description of the corresponding cluster. Each distribution in a mixture model is also called a mixture component. The distribution/component can be of any kind.
Example: the probability density function of a 1-dimensional data set (with two classes) generated by a mixture of two Gaussian distributions, one per class, whose parameters $\theta_i$ are the mean $\mu_i$ and the standard deviation $\sigma_i$, i.e., $\theta_i = (\mu_i, \sigma_i)$.
Let the number of mixture components (distributions) in a mixture model be K, and let the j-th distribution have the parameters $\theta_j$. Let $\Theta = \{\alpha_1, \ldots, \alpha_K, \theta_1, \ldots, \theta_K\}$ be the set of parameters of all components, where $\alpha_j$ is the mixture weight (or mixture probability) of mixture component j and $\theta_j$ is the set of parameters of component j. How does the model generate documents?

Document generation
Due to the one-to-one correspondence, each class corresponds to a mixture component, and the mixture weights are class prior probabilities: $\alpha_j = \Pr(c_j|\Theta)$. The mixture model generates each document $d_i$ by first selecting a mixture component (class) according to the class prior probabilities (mixture weights) $\alpha_j = \Pr(c_j|\Theta)$, and then having this selected mixture component $c_j$ generate the document $d_i$ according to its parameters, with distribution $\Pr(d_i|c_j;\Theta)$, or more precisely $\Pr(d_i|c_j;\theta_j)$. Thus:
$\Pr(d_i|\Theta) = \sum_{j=1}^{|C|} \Pr(c_j|\Theta)\,\Pr(d_i|c_j;\Theta)$   (23)

Model text documents
The naïve Bayesian classification treats each document as a "bag of words". The generative model makes the following further assumptions:
Words of a document are generated independently of context given the class label (the naïve Bayes assumption).
The probability of a word is independent of its position in the document.
The document length is chosen independently of its class.

Multinomial distribution
With these assumptions, each document can be regarded as generated by a multinomial distribution. In other words, each document is drawn from a multinomial distribution of words with as many independent trials as the length of the document. The words are from a given vocabulary V = {w_1, w_2, …, w_|V|}. Using the probability function of the multinomial distribution:
$\Pr(d_i|c_j;\Theta) = \Pr(|d_i|)\;|d_i|!\,\prod_{t=1}^{|V|} \frac{\Pr(w_t|c_j;\Theta)^{N_{ti}}}{N_{ti}!}$   (24)
where $N_{ti}$ is the number of times that word $w_t$ occurs in document $d_i$, and
$\sum_{t=1}^{|V|} N_{ti} = |d_i|, \qquad \sum_{t=1}^{|V|} \Pr(w_t|c_j;\Theta) = 1$   (25)

Parameter estimation
The parameters are estimated based on empirical counts:
$\Pr(w_t|c_j;\hat{\Theta}) = \frac{\sum_{i=1}^{|D|} N_{ti}\,\Pr(c_j|d_i)}{\sum_{s=1}^{|V|} \sum_{i=1}^{|D|} N_{si}\,\Pr(c_j|d_i)}$   (26)
In order to handle zero counts for infrequently occurring words that do not appear in the training set but may appear in the test set, we need to smooth the probability, using Laplace (add-one, λ = 1) smoothing:
$\Pr(w_t|c_j;\hat{\Theta}) = \frac{\lambda + \sum_{i=1}^{|D|} N_{ti}\,\Pr(c_j|d_i)}{\lambda|V| + \sum_{s=1}^{|V|} \sum_{i=1}^{|D|} N_{si}\,\Pr(c_j|d_i)}$   (27)
Class prior probabilities, which are the mixture weights $\alpha_j$, can be easily estimated from the training data:
$\Pr(c_j|\hat{\Theta}) = \frac{\sum_{i=1}^{|D|} \Pr(c_j|d_i)}{|D|}$   (28)

Classification
Given a test document $d_i$, from Eqs. (23), (27), and (28):
$\Pr(c_j|d_i;\hat{\Theta}) = \frac{\Pr(c_j|\hat{\Theta})\,\Pr(d_i|c_j;\hat{\Theta})}{\Pr(d_i|\hat{\Theta})} = \frac{\Pr(c_j|\hat{\Theta}) \prod_{k=1}^{|d_i|} \Pr(w_{d_i,k}|c_j;\hat{\Theta})}{\sum_{r=1}^{|C|} \Pr(c_r|\hat{\Theta}) \prod_{k=1}^{|d_i|} \Pr(w_{d_i,k}|c_r;\hat{\Theta})}$

Discussions
Most assumptions made by naïve Bayesian learning are violated to some degree in practice. Despite such violations, researchers have shown that naïve Bayesian learning produces very accurate models. The main problem is the mixture model assumption: when this assumption is seriously violated, the classification performance can be poor. Naïve Bayesian learning is extremely efficient. A minimal sketch of the smoothed estimates in Eqs. (27) and (28) follows below.
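This is a sketch of Eqs. (27) and (28), not the slides' own code; the function name is an assumption, and the labels are given as (possibly soft) posteriors $\Pr(c_j|d_i)$, which reduce to 0/1 indicators for hard-labeled training data:

```python
from collections import Counter

def estimate_parameters(docs, posteriors, classes, vocab, lam=1.0):
    """
    docs: list of token lists; posteriors: list of dicts giving Pr(c|d_i)
    (use 1.0/0.0 for hard labels). lam = 1.0 gives Laplace (add-one) smoothing.
    """
    # Eq. (28): class priors are averaged posteriors
    prior = {c: sum(p[c] for p in posteriors) / len(docs) for c in classes}
    cond = {}
    for c in classes:
        soft = Counter()                       # soft counts: N_ti * Pr(c|d_i)
        for tokens, p in zip(docs, posteriors):
            for w, n in Counter(tokens).items():
                soft[w] += n * p[c]
        denom = lam * len(vocab) + sum(soft.values())
        # Eq. (27): smoothed conditional word probabilities
        cond[c] = {w: (lam + soft[w]) / denom for w in vocab}
    return prior, cond

# Hard-labeled toy usage
docs = ["good good fun".split(), "bad boring".split()]
posteriors = [{"+": 1.0, "-": 0.0}, {"+": 0.0, "-": 1.0}]
vocab = {w for d in docs for w in d}
prior, cond = estimate_parameters(docs, posteriors, ["+", "-"], vocab)
print(prior)                # {'+': 0.5, '-': 0.5}
print(cond["+"]["good"])    # (1 + 2) / (1*4 + 3) = 3/7
```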
Avoiding Harms in Classification

Harms in sentiment classifiers
Kiritchenko and Mohammad (2018) found that most sentiment classifiers assign lower sentiment and more negative emotion to sentences with African American names in them. This perpetuates negative stereotypes that associate African Americans with negative emotions.

Harms in toxicity classification
Toxicity detection is the task of detecting hate speech, abuse, harassment, or other kinds of toxic language. But some toxicity classifiers incorrectly flag as toxic sentences that are non-toxic but simply mention identities like blind people, women, or gay people. This could lead to censorship of discussion about these groups.

What causes these harms?
They can be caused by:
Problems in the training data; machine learning systems are known to amplify the biases in their training data.
Problems in the human labels.
Problems in the resources used (like lexicons).
Problems in the model architecture (like what the model is trained to optimize).
Mitigation of these harms is an open research area. Meanwhile: model cards.

Model Cards (Mitchell et al., 2019)
For each algorithm you release, document:
training algorithms and parameters
training data sources, motivation, and preprocessing
evaluation data sources, motivation, and preprocessing
intended use and users
model performance across different demographic or other groups and environmental situations