Web and Text Analytics 2024-25 Week 5 PDF
Document Details
University of Macedonia
2024
Evangelos Kalampokis
Summary
This document is a Week 5 lecture on web and text analytics, focusing on supervised machine learning and sentiment analysis of Twitter data. It explains how to preprocess tweets, extract features, and train a logistic regression classifier.
Full Transcript
Web and Text Analytics 2024-25 Week 5
Evangelos Kalampokis
https://kalampokis.github.io
http://islab.uom.gr

Supervised Machine Learning
▪ In supervised machine learning, we usually have an input X, which goes into our prediction function to get a prediction Ŷ.
▪ We can then compare the prediction Ŷ with the true value Y.
▪ This gives us our cost, which we use to update the parameters θ.

Sentiment analysis
▪ In the case of sentiment analysis we have some text, e.g. the tweet “I am happy because I am learning NLP”,
▪ and we need to classify it as either positive (1) or negative (0) using a classification algorithm.

Supervised ML & Sentiment Analysis
▪ To perform sentiment analysis on a tweet, we first represent the text (i.e. "I am happy because I am learning NLP") as features, then train our classifier (e.g. logistic regression), and then use it to classify the text.
▪ Note that in this case we classify either 1, for positive sentiment, or 0, for negative sentiment.

Vocabulary & Feature Extraction
▪ A vocabulary is just a collection of words.
▪ Given a tweet, or some text, we can represent it as a vector of dimension V, where V corresponds to our vocabulary size.
▪ Given a set of tweets, the vocabulary is created using all the tweets in the corpus.

Feature Extraction
▪ If we had the tweet "I am happy because I am learning NLP", we would put a 1 at the corresponding index for every word in the tweet, and a 0 otherwise.
▪ This encodes the tweet as a vector.
▪ Remember that the vocabulary has been created using all (maybe millions of) tweets in the corpus.

Sparse vectors
▪ As V gets larger, the vector we created becomes sparser. Furthermore, we end up with many more features, and have to train V parameters θ.
▪ This results in longer training times and longer prediction times.

Feature Extraction with Frequencies
▪ Given a corpus with positive and negative tweets, we can create a vocabulary as follows.

Positive and negative tweets
▪ We know that the corpus contains positive and negative tweets.

Feature Extraction with Frequencies
▪ We still have to encode each tweet as a vector.
▪ Previously, this vector was of dimension V.
▪ Now, we will represent it with a vector of dimension 3.
▪ To do so, we create a dictionary that maps each word, together with the class it appeared in (positive or negative), to the number of times that word appeared in its corresponding class.
▪ Each row of the resulting table pairs a token with a sentiment class (1 is positive sentiment); for example, the value 23 records the number of times the token “followfriday” appears in positive tweets.

Positive counts
▪ How do we create the dictionary entries for the positive counts?

Negative counts
▪ How do we create the dictionary entries for the negative counts?

Frequency dictionary
▪ In the table above, we can see how words like “happy” and “sad” tend to take clear sides, while other words like “I” and “am” tend to be more neutral.

Feature extraction
▪ Then, based on the frequency dictionary that we have created, we calculate the features of every tweet, as sketched below.
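A minimal Python sketch of this counting and feature-extraction idea on a toy corpus (the corpus, the whitespace tokenization, and the tweet_features name are illustrative, not the course's utils.py):

```python
from collections import defaultdict

# A toy labelled corpus: label 1 = positive, label 0 = negative.
tweets = [
    ("i am happy because i am learning nlp", 1),
    ("i am happy", 1),
    ("i am sad i am not learning nlp", 0),
    ("i am sad", 0),
]

# Map each (word, label) pair to the number of times the word
# appears in tweets of that class.
freqs = defaultdict(int)
for text, label in tweets:
    for word in text.split():
        freqs[(word, label)] += 1

print(freqs[("happy", 1)], freqs[("happy", 0)])  # 2 0

# A tweet is then represented by just three features:
# [bias, sum of positive counts, sum of negative counts].
def tweet_features(text):
    words = text.split()
    pos = sum(freqs.get((w, 1), 0) for w in words)
    neg = sum(freqs.get((w, 0), 0) for w in words)
    return [1, pos, neg]

print(tweet_features("i am sad i am not learning nlp"))  # [1, 14, 17]
```

The counts, and therefore the features, depend entirely on the training corpus, which is why the lecture's numbers below differ from this toy example.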
Feature extraction
▪ Given the dictionary and the tweet "I am sad, I am not learning NLP", you can create the corresponding feature vector as follows:
▪ You end up with the feature vector [1, 8, 11].
▪ 1 corresponds to the bias, 8 to the positive feature, and 11 to the negative feature.

Feature extraction
▪ Feature creation is repeated for every tweet.

Preprocessing: Stop words and punctuation
▪ Before we calculate the feature vector, we need to go through the preprocessing steps:
– Remove stop words and punctuation marks
– Remove handles and URLs
– Perform stemming and lowercasing
▪ Let’s see a new tweet.

Preprocessing: Stop words and punctuation
▪ Preprocessing includes removing stop words and punctuation.

Preprocessing: Stop words and punctuation
▪ We first remove stop words, e.g. “and”, “are”, “a”, “at”.

Preprocessing: Stop words and punctuation
▪ We next remove punctuation marks, e.g. “!!!”.

Preprocessing: handles and URLs
▪ We then remove handles (e.g., @YMourri, @AndrewYNg) and URLs (e.g., https://deeplearning.ai).

Preprocessing: stemming and lowercasing
▪ Stemming produces “tun” from “tuning”.
▪ Lowercasing produces “great” from “GREAT”.

Preprocessing
▪ After all these steps, the initial tweet has been transformed into a set of four tokens.

Preprocessing and Feature Extraction
▪ We go back to our initial case and perform the preprocessing steps.

Feature extraction
▪ (The slide shows the feature vectors computed for the preprocessed tweets.)

Hands-on Example
▪ Sentiment analysis of Twitter data
– Use the Jupyter notebook in the Week 5 folder
– Also download the utils.py file

Twitter data
▪ We will use the twitter_samples dataset.
– NLTK's Twitter corpus contains a sample of 20k tweets (named 'twitter_samples') retrieved from the Twitter Streaming API.
▪ twitter_samples contains subsets of 5,000 positive tweets, 5,000 negative tweets, and the full set of 10,000 tweets.
▪ If we used all three subsets, we would introduce duplicates of the positive tweets and negative tweets.
▪ We will therefore select just the five thousand positive tweets and five thousand negative tweets.

Create train and test datasets
▪ Train/test split: 20% will be in the test set, and 80% in the training set.
▪ Create the numpy arrays of positive labels and negative labels.

Preprocessing and Frequency - Implementation
▪ We will use two functions from the utils.py file (download it from openeclass); minimal sketches of both are given below.
▪ process_tweet(): cleans the text, tokenizes it into separate words, removes stop words, and converts words to stems.
▪ build_freqs(): counts how often a word in the corpus (the entire set of tweets) was associated with a positive label '1' or a negative label '0', then builds the freqs dictionary, where each key is a (word, label) tuple and the value is the count of its frequency within the corpus of tweets.

Build_freqs()
▪ (The slide shows the function’s code.)

Create the frequency dictionary
▪ We create the frequency dictionary using the imported build_freqs() function.
▪ The freqs dictionary is the frequency dictionary that is being built.
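The exact process_tweet() is provided in utils.py on openeclass; the following is a hedged sketch of what such a function can look like, assuming NLTK (TweetTokenizer, Porter stemmer, English stop word list):

```python
import re
import string

from nltk.corpus import stopwords          # requires nltk.download('stopwords')
from nltk.stem import PorterStemmer
from nltk.tokenize import TweetTokenizer

def process_tweet(tweet):
    """Clean a tweet, tokenize it, remove stop words, and stem the tokens."""
    # Remove old-style retweet markers, URLs, handles, and the '#' sign.
    tweet = re.sub(r'^RT[\s]+', '', tweet)
    tweet = re.sub(r'https?://[^\s]+', '', tweet)
    tweet = re.sub(r'@\w+', '', tweet)
    tweet = re.sub(r'#', '', tweet)

    # Tokenize and lowercase.
    tokenizer = TweetTokenizer(preserve_case=False, strip_handles=True,
                               reduce_len=True)
    tokens = tokenizer.tokenize(tweet)

    # Drop stop words and punctuation, then stem what remains.
    stemmer = PorterStemmer()
    stop = set(stopwords.words('english'))
    return [stemmer.stem(tok) for tok in tokens
            if tok not in stop and tok not in string.punctuation]

print(process_tweet("@YMourri and @AndrewYNg are tuning a GREAT AI model at https://deeplearning.ai"))
# e.g. ['tune', 'great', 'ai', 'model'] (exact stems depend on the stemmer)
```

Similarly, a minimal sketch of build_freqs(), consistent with the description above (it relies on the process_tweet() sketch; names like train_pos and train_x follow the notebook's conventions but are assumptions here):

```python
import numpy as np

def build_freqs(tweets, ys):
    """Map each (word, sentiment_label) pair to its count over all tweets."""
    freqs = {}
    # np.squeeze turns an (m, 1) label array into a flat sequence of 1.0/0.0.
    for tweet, y in zip(tweets, np.squeeze(ys).tolist()):
        for word in process_tweet(tweet):
            pair = (word, y)
            freqs[pair] = freqs.get(pair, 0) + 1
    return freqs

# Usage with the 80/20 split described above:
# train_x = train_pos + train_neg
# train_y = np.append(np.ones((len(train_pos), 1)),
#                     np.zeros((len(train_neg), 1)), axis=0)
# freqs = build_freqs(train_x, train_y)
```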
Frequency dictionary
▪ The key is the tuple (word, label), such as ("happy", 1) or ("happy", 0).
▪ The value stored for each key is the count of how many times the word "happy" was associated with a positive label, or how many times it was associated with a negative label.

Process_tweet()
▪ (The slide shows the function’s code.)

Process tweets
▪ The given function process_tweet() tokenizes the tweet into individual words, removes stop words, handles, etc., and applies stemming.

Logistic Regression
▪ The sigmoid function, h(z) = 1 / (1 + e^(−z)), maps the input z to a value between 0 and 1, so it can be treated as a probability.
▪ Part 1 of the Jupyter notebook describes the implementation of logistic regression.

Create the feature matrix (X)
▪ Given a list of tweets, we extract the features and store them in a matrix.
▪ To this end, we use the extract_features() function, which takes as input:
– a list of words for one tweet, and
– a dictionary corresponding to the frequencies of each (word, label) tuple.

Implement the extract_features function
▪ The extract_features function takes in a single tweet.
▪ Process the tweet using the imported process_tweet() function and save the list of tweet words.
▪ Loop through each word in the list of processed words:
– For each word, check the freqs dictionary for the count when that word has a positive '1' label (check for the key (word, 1.0)).
– Do the same for the count when the word is associated with the negative label '0' (check for the key (word, 0.0)).
▪ A sketch of this function is given at the end of the transcript.

Extract_features()
▪ (The slide shows the function’s code.)

Create the feature matrix (X) and train the model
▪ To create the feature matrix X, we stack the features for all training examples into a matrix `X`.
▪ Then we train a logistic regression model.

Check performance using the test set
▪ We can test our logistic regression model on new input that the model has not seen before.

© Information Systems Lab
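A sketch of sigmoid() and extract_features() matching the description above (a hedged reconstruction, not the notebook's exact code; it assumes the process_tweet() sketch and the float 1.0/0.0 labels produced by the build_freqs() sketch):

```python
import numpy as np

def sigmoid(z):
    """Map z to (0, 1) so the output can be read as a probability."""
    return 1 / (1 + np.exp(-z))

def extract_features(tweet, freqs):
    """Return a (1, 3) row: [bias, positive-count sum, negative-count sum]."""
    words = process_tweet(tweet)  # from the earlier sketch / utils.py
    x = np.zeros((1, 3))
    x[0, 0] = 1  # bias term
    for word in words:
        x[0, 1] += freqs.get((word, 1.0), 0)  # counts under the positive label
        x[0, 2] += freqs.get((word, 0.0), 0)  # counts under the negative label
    return x
```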
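Finally, a compact sketch of the remaining steps: stacking the features into X, fitting θ by plain batch gradient descent, and checking accuracy on the test set. Here train_x, train_y, test_x, and test_y are assumed to come from the 80/20 split described earlier, and alpha and num_iters are illustrative hyperparameters, not the notebook's.

```python
# Stack the features of all training tweets into the matrix X.
X = np.vstack([extract_features(t, freqs) for t in train_x])
Y = train_y  # shape (m, 1), labels 1.0 / 0.0

# Plain batch gradient descent on the logistic regression cost.
theta = np.zeros((3, 1))
alpha, num_iters = 1e-9, 1500
m = X.shape[0]
for _ in range(num_iters):
    h = sigmoid(X @ theta)                  # predicted probabilities
    theta -= (alpha / m) * (X.T @ (h - Y))  # gradient step

# Check performance on the held-out test set: predict positive
# whenever the predicted probability exceeds 0.5.
X_test = np.vstack([extract_features(t, freqs) for t in test_x])
preds = (sigmoid(X_test @ theta) > 0.5).astype(float)
print("accuracy:", (preds == test_y).mean())
```

The very small learning rate compensates for the large raw frequency counts in the second and third features, which would otherwise make the gradient steps diverge.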