Web and Text Analytics 2024-25 Week 5 PDF
Document Details
University of Macedonia
2024
Evangelos Kalampokis
Summary
This document is a Week 5 lecture on web and text analytics, focusing on supervised machine learning and sentiment analysis of Twitter data. It explains how to preprocess tweets, extract features, and train a logistic regression classifier.
Full Transcript
Web and Text Analytics 2024-25 Week 5
Evangelos Kalampokis
https://kalampokis.github.io
http://islab.uom.gr

Supervised Machine Learning
▪ In supervised machine learning, we usually have an input X, which goes into our prediction function to get a prediction Ŷ.
▪ We can then compare the prediction Ŷ with the true value Y.
▪ This gives us our cost, which we use to update the parameters θ.

Sentiment analysis
▪ In the case of sentiment analysis we have some text, e.g. the tweet “I am happy because I am learning NLP”,
▪ and we need to classify it as either positive (1) or negative (0) using a classification algorithm.

Supervised ML & Sentiment Analysis
▪ To perform sentiment analysis on a tweet, we first represent the text (i.e. "I am happy because I am learning NLP") as features, then train our classifier (e.g. logistic regression), and then use it to classify the text.
▪ Note that in this case we classify either 1, for positive sentiment, or 0, for negative sentiment.

Vocabulary & Feature Extraction
▪ A vocabulary is just a collection of words.
▪ Given a tweet, or some text, we can represent it as a vector of dimension V, where V corresponds to our vocabulary size.
▪ Given a set of tweets, the vocabulary is created using all the tweets in the corpus.

Feature Extraction
▪ If we had the tweet "I am happy because I am learning NLP", we would put a 1 at the corresponding index for every word in the tweet, and a 0 otherwise.
▪ This encodes the tweet as a vector.
▪ Remember that the vocabulary has been created using all (maybe millions of) tweets in the corpus.

Sparse vectors
▪ As V gets larger, the vector we created becomes sparser. Furthermore, we end up with many more features, and have to train V parameters θ.
▪ This results in longer training times and longer prediction times.

Feature Extraction with Frequencies
▪ Given a corpus with positive and negative tweets, we can create a vocabulary as follows.

Positive and negative tweets
▪ We know that the corpus contains positive and negative tweets.

Feature Extraction with Frequencies
▪ We still have to encode each tweet as a vector.
▪ Previously, this vector was of dimension V.
▪ Now, we will represent it with a vector of dimension 3.
▪ To do so, we create a dictionary that maps each word, together with the class it appeared in (positive or negative), to the number of times that word appeared in its corresponding class.
▪ Each row of the resulting table pairs a token with a sentiment class (1 is positive sentiment); for example, the value 23 records the number of times the token “followfriday” appears in positive tweets.

Positive counts
▪ How do we create the dictionary entries for the positive counts?

Negative counts
▪ How do we create the dictionary entries for the negative counts?

Frequency dictionary
▪ In the table above, we can see how words like “happy” and “sad” tend to take clear sides, while other words like “I” and “am” tend to be more neutral.

Feature extraction
▪ Then, based on the frequency dictionary that we have created, we calculate the features of every tweet, as sketched below.
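A minimal Python sketch of this counting and feature-extraction idea on a toy corpus (the corpus, the whitespace tokenization, and the tweet_features name are illustrative, not the course's utils.py):

```python
from collections import defaultdict

# A toy labelled corpus: label 1 = positive, label 0 = negative.
tweets = [
    ("i am happy because i am learning nlp", 1),
    ("i am happy", 1),
    ("i am sad i am not learning nlp", 0),
    ("i am sad", 0),
]

# Map each (word, label) pair to the number of times the word
# appears in tweets of that class.
freqs = defaultdict(int)
for text, label in tweets:
    for word in text.split():
        freqs[(word, label)] += 1

print(freqs[("happy", 1)], freqs[("happy", 0)])  # 2 0

# A tweet is then represented by just three features:
# [bias, sum of positive counts, sum of negative counts].
def tweet_features(text):
    words = text.split()
    pos = sum(freqs.get((w, 1), 0) for w in words)
    neg = sum(freqs.get((w, 0), 0) for w in words)
    return [1, pos, neg]

print(tweet_features("i am sad i am not learning nlp"))  # [1, 14, 17]
```

The counts, and therefore the features, depend entirely on the training corpus, which is why the lecture's numbers below differ from this toy example.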
Feature extraction
▪ Given the dictionary and the tweet "I am sad, I am not learning NLP", you can create the corresponding feature vector as follows:
▪ You end up with the feature vector [1, 8, 11].
▪ 1 corresponds to the bias, 8 to the positive feature, and 11 to the negative feature.

Feature extraction
▪ Feature creation is repeated for every tweet.

Preprocessing: Stop words and punctuation
▪ Before we calculate the feature vector, we need to go through the preprocessing steps:
– Remove stop words and punctuation marks
– Remove handles and URLs
– Perform stemming and lowercasing
▪ Let’s see a new tweet.

Preprocessing: Stop words and punctuation
▪ Preprocessing includes removing stop words and punctuation.

Preprocessing: Stop words and punctuation
▪ We first remove stop words, e.g. “and”, “are”, “a”, “at”.

Preprocessing: Stop words and punctuation
▪ We next remove punctuation marks, e.g. “!!!”.

Preprocessing: handles and URLs
▪ We then remove handles (e.g., @YMourri, @AndrewYNg) and URLs (e.g., https://deeplearning.ai).

Preprocessing: stemming and lowercasing
▪ Stemming produces “tun” from “tuning”.
▪ Lowercasing produces “great” from “GREAT”.

Preprocessing
▪ After all these steps, the initial tweet has been transformed into a set of four tokens.

Preprocessing and Feature Extraction
▪ We go back to our initial case and perform the preprocessing steps.

Feature extraction
▪ (The slide shows the feature vectors computed for the preprocessed tweets.)

Hands-on Example
▪ Sentiment analysis of Twitter data
– Use the Jupyter notebook in the Week 5 folder
– Also download the utils.py file

Twitter data
▪ We will use the twitter_samples dataset.
– NLTK's Twitter corpus contains a sample of 20k tweets (named 'twitter_samples') retrieved from the Twitter Streaming API.
▪ twitter_samples contains subsets of 5,000 positive tweets, 5,000 negative tweets, and the full set of 10,000 tweets.
▪ If we used all three subsets, we would introduce duplicates of the positive tweets and negative tweets.
▪ We will therefore select just the five thousand positive tweets and five thousand negative tweets.

Create train and test datasets
▪ Train/test split: 20% will be in the test set, and 80% in the training set.
▪ Create the numpy arrays of positive labels and negative labels.

Preprocessing and Frequency - Implementation
▪ We will use two functions from the utils.py file (download it from openeclass); minimal sketches of both are given below.
▪ process_tweet(): cleans the text, tokenizes it into separate words, removes stop words, and converts words to stems.
▪ build_freqs(): counts how often a word in the corpus (the entire set of tweets) was associated with a positive label '1' or a negative label '0', then builds the freqs dictionary, where each key is a (word, label) tuple and the value is the count of its frequency within the corpus of tweets.

Build_freqs()
▪ (The slide shows the function’s code.)

Create the frequency dictionary
▪ We create the frequency dictionary using the imported build_freqs() function.
▪ The freqs dictionary is the frequency dictionary that is being built.
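The exact process_tweet() is provided in utils.py on openeclass; the following is a hedged sketch of what such a function can look like, assuming NLTK (TweetTokenizer, Porter stemmer, English stop word list):

```python
import re
import string

from nltk.corpus import stopwords          # requires nltk.download('stopwords')
from nltk.stem import PorterStemmer
from nltk.tokenize import TweetTokenizer

def process_tweet(tweet):
    """Clean a tweet, tokenize it, remove stop words, and stem the tokens."""
    # Remove old-style retweet markers, URLs, handles, and the '#' sign.
    tweet = re.sub(r'^RT[\s]+', '', tweet)
    tweet = re.sub(r'https?://[^\s]+', '', tweet)
    tweet = re.sub(r'@\w+', '', tweet)
    tweet = re.sub(r'#', '', tweet)

    # Tokenize and lowercase.
    tokenizer = TweetTokenizer(preserve_case=False, strip_handles=True,
                               reduce_len=True)
    tokens = tokenizer.tokenize(tweet)

    # Drop stop words and punctuation, then stem what remains.
    stemmer = PorterStemmer()
    stop = set(stopwords.words('english'))
    return [stemmer.stem(tok) for tok in tokens
            if tok not in stop and tok not in string.punctuation]

print(process_tweet("@YMourri and @AndrewYNg are tuning a GREAT AI model at https://deeplearning.ai"))
# e.g. ['tune', 'great', 'ai', 'model'] (exact stems depend on the stemmer)
```

Similarly, a minimal sketch of build_freqs(), consistent with the description above (it relies on the process_tweet() sketch; names like train_pos and train_x follow the notebook's conventions but are assumptions here):

```python
import numpy as np

def build_freqs(tweets, ys):
    """Map each (word, sentiment_label) pair to its count over all tweets."""
    freqs = {}
    # np.squeeze turns an (m, 1) label array into a flat sequence of 1.0/0.0.
    for tweet, y in zip(tweets, np.squeeze(ys).tolist()):
        for word in process_tweet(tweet):
            pair = (word, y)
            freqs[pair] = freqs.get(pair, 0) + 1
    return freqs

# Usage with the 80/20 split described above:
# train_x = train_pos + train_neg
# train_y = np.append(np.ones((len(train_pos), 1)),
#                     np.zeros((len(train_neg), 1)), axis=0)
# freqs = build_freqs(train_x, train_y)
```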
Frequency dictionary
▪ The key is the tuple (word, label), such as ("happy", 1) or ("happy", 0).
▪ The value stored for each key is the count of how many times the word "happy" was associated with a positive label, or how many times it was associated with a negative label.

Process_tweet()
▪ (The slide shows the function’s code.)

Process tweets
▪ The given function process_tweet() tokenizes the tweet into individual words, removes stop words, handles, etc., and applies stemming.

Logistic Regression
▪ The sigmoid function, h(z) = 1 / (1 + e^(−z)), maps the input z to a value between 0 and 1, so it can be treated as a probability.
▪ Part 1 of the Jupyter notebook describes the implementation of logistic regression.

Create the feature matrix (X)
▪ Given a list of tweets, we extract the features and store them in a matrix.
▪ To this end, we use the extract_features() function, which takes as input:
– a list of words for one tweet, and
– a dictionary corresponding to the frequencies of each (word, label) tuple.

Implement the extract_features function
▪ The extract_features function takes in a single tweet.
▪ Process the tweet using the imported process_tweet() function and save the list of tweet words.
▪ Loop through each word in the list of processed words:
– For each word, check the freqs dictionary for the count when that word has a positive '1' label (check for the key (word, 1.0)).
– Do the same for the count when the word is associated with the negative label '0' (check for the key (word, 0.0)).
▪ A sketch of this function is given at the end of the transcript.

Extract_features()
▪ (The slide shows the function’s code.)

Create the feature matrix (X) and train the model
▪ To create the feature matrix X, we stack the features for all training examples into a matrix `X`.
▪ Then we train a logistic regression model.

Check performance using the test set
▪ We can test our logistic regression model on new input that the model has not seen before.

© Information Systems Lab
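A sketch of sigmoid() and extract_features() matching the description above (a hedged reconstruction, not the notebook's exact code; it assumes the process_tweet() sketch and the float 1.0/0.0 labels produced by the build_freqs() sketch):

```python
import numpy as np

def sigmoid(z):
    """Map z to (0, 1) so the output can be read as a probability."""
    return 1 / (1 + np.exp(-z))

def extract_features(tweet, freqs):
    """Return a (1, 3) row: [bias, positive-count sum, negative-count sum]."""
    words = process_tweet(tweet)  # from the earlier sketch / utils.py
    x = np.zeros((1, 3))
    x[0, 0] = 1  # bias term
    for word in words:
        x[0, 1] += freqs.get((word, 1.0), 0)  # counts under the positive label
        x[0, 2] += freqs.get((word, 0.0), 0)  # counts under the negative label
    return x
```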
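Finally, a compact sketch of the remaining steps: stacking the features into X, fitting θ by plain batch gradient descent, and checking accuracy on the test set. Here train_x, train_y, test_x, and test_y are assumed to come from the 80/20 split described earlier, and alpha and num_iters are illustrative hyperparameters, not the notebook's.

```python
# Stack the features of all training tweets into the matrix X.
X = np.vstack([extract_features(t, freqs) for t in train_x])
Y = train_y  # shape (m, 1), labels 1.0 / 0.0

# Plain batch gradient descent on the logistic regression cost.
theta = np.zeros((3, 1))
alpha, num_iters = 1e-9, 1500
m = X.shape[0]
for _ in range(num_iters):
    h = sigmoid(X @ theta)                  # predicted probabilities
    theta -= (alpha / m) * (X.T @ (h - Y))  # gradient step

# Check performance on the held-out test set: predict positive
# whenever the predicted probability exceeds 0.5.
X_test = np.vstack([extract_features(t, freqs) for t in test_x])
preds = (sigmoid(X_test @ theta) > 0.5).astype(float)
print("accuracy:", (preds == test_y).mean())
```

The very small learning rate compensates for the large raw frequency counts in the second and third features, which would otherwise make the gradient steps diverge.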