Sentiment Analysis Case Study

International Burch University, Stirling Education
Dželila Mehanović
Summary
This document is a case study on sentiment analysis, focusing on IMDB movie reviews. The study introduces the process of cleaning and preparing text data for sentiment analysis, and outlines methods including HTML removal, punctuation/number removal, and stop word removal using Python with the Beautiful Soup and NLTK libraries.
Full Transcript
Introduction to Natural Language Processing
Case Study: Sentiment Analysis
Assist. Prof. Dr. Dželila MEHANOVIĆ

Dataset
Datasets are available as a training set and a test set. The labeled data set consists of 50,000 IMDB movie reviews, selected for sentiment analysis. The sentiment of reviews is binary: an IMDB rating < 5 results in a sentiment score of 0, and a rating >= 7 results in a sentiment score of 1. The 25,000-review labeled training set does not include any of the same movies as the 25,000-review test set.

Data fields:
○ id - Unique ID of each review
○ sentiment - Sentiment of the review; 1 for positive reviews and 0 for negative reviews
○ review - Text of the review

Reading The Data
The first file that you'll need is labeledTrainData, which contains 25,000 IMDB movie reviews, each with a positive or negative sentiment label. Read the tab-delimited file into Python. Here, "header=0" indicates that the first line of the file contains column names, "delimiter=\t" indicates that the fields are separated by tabs, and quoting=3 tells Python to ignore double quotes; otherwise you may encounter errors trying to read the file.

To make sure that 25,000 rows and 3 columns are read, check the shape of the data frame. The three columns are called "id", "sentiment", and "review". Now, take a look at a few reviews (see the reading sketch after this section).

Data Cleaning And Text Preprocessing

Removing HTML Markup: The BeautifulSoup Package
First, we'll remove the HTML tags. For this purpose, we'll use the Beautiful Soup library. If you don't have Beautiful Soup installed, run pip install BeautifulSoup4 from the command line. Then, within Python, load the package and use it to extract the text from a review.

Dealing With Punctuation, Numbers And Stopwords: NLTK And Regular Expressions
When considering how to clean the text, you should think about the data problem you are trying to solve. For many problems, it makes sense to remove punctuation. On the other hand, in this case, you are working with a sentiment analysis problem, and it is possible that "!!!" or ":-(" could carry sentiment and should be treated as words. Here, for simplicity, we will remove the punctuation, but it is something you can play with on your own. Similarly, we will remove numbers. To remove punctuation and numbers, you will use a package for dealing with regular expressions, called re.

In other words, the re.sub() statement says, "Find anything that is NOT a lowercase letter (a-z) or an uppercase letter (A-Z), and replace it with a space." You will also convert reviews to lower case and split them into individual words (called "tokenization" in NLP).

The final step is to decide how to deal with frequently occurring words that don't carry much meaning. Such words are called "stop words"; in English they include words such as "a", "and", "is", and "the". Similarly to the previous exercise, you will import the Natural Language Toolkit (and download it if you haven't done so), then use nltk to get a list of stop words. This will allow you to view the list of English-language stop words. To remove stop words from a movie review, filter the tokenized words against this list.

The previous process was used to clean one review - but you need to clean 25,000 training reviews! To make the code reusable, create a function that can be called many times. Two elements here are new:
○ First, the stop word list is converted to a different data type, a set. This is for speed; since you will be calling this function many times, it needs to be fast, and searching sets in Python is much faster than searching lists.
○ Second, the words are joined back into one paragraph. This is to make the output easier to use in the Bag of Words, below.

After defining the above function, calling it for a single review should give you exactly the same output as all of the individual steps you did previously. Now, loop through and clean all of the training set at once (this might take a few minutes depending on your computer). Sketches of all of these steps follow.
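The code shown on the original slides is not preserved in this transcript. Below is a minimal sketch of the reading step, assuming the usual Kaggle file name labeledTrainData.tsv and the pandas library:

```python
import pandas as pd

# header=0: the first line of the file contains the column names.
# delimiter="\t": fields are separated by tabs.
# quoting=3 (csv.QUOTE_NONE): ignore double quotes to avoid read errors.
train = pd.read_csv("labeledTrainData.tsv", header=0, delimiter="\t", quoting=3)

# Confirm that 25,000 rows and 3 columns were read.
print(train.shape)           # expected: (25000, 3)
print(train.columns.values)  # expected: ['id' 'sentiment' 'review']

# Take a look at the first review.
print(train["review"][0])
```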
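Next, a sketch of the single-review cleaning steps described above (HTML removal with Beautiful Soup, punctuation/number removal with re, tokenization, and stop word removal with NLTK), under the same assumptions:

```python
import re
from bs4 import BeautifulSoup
import nltk
from nltk.corpus import stopwords

# nltk.download("stopwords")  # run once if the stop word list is not yet installed

# 1. Remove HTML markup and keep only the text of the review.
example_text = BeautifulSoup(train["review"][0], "html.parser").get_text()

# 2. Find anything that is NOT a lowercase (a-z) or uppercase (A-Z) letter
#    and replace it with a space; this removes punctuation and numbers.
letters_only = re.sub("[^a-zA-Z]", " ", example_text)

# 3. Convert to lower case and split into individual words (tokenization).
words = letters_only.lower().split()

# 4. View the English stop word list, then filter it out of the review.
print(stopwords.words("english"))
words = [w for w in words if w not in stopwords.words("english")]
print(words)
```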
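And a sketch of the reusable cleaning function and the loop over the full training set; the name review_to_words is illustrative, not from the original slides:

```python
def review_to_words(raw_review):
    """Convert a raw review to a single string of meaningful words."""
    # 1. Remove HTML markup.
    review_text = BeautifulSoup(raw_review, "html.parser").get_text()
    # 2. Remove non-letters (punctuation and numbers).
    letters_only = re.sub("[^a-zA-Z]", " ", review_text)
    # 3. Convert to lower case and split into individual words.
    words = letters_only.lower().split()
    # 4. Convert the stop word list to a set: searching a set is much
    #    faster than searching a list, and this function runs 25,000 times.
    stops = set(stopwords.words("english"))
    # 5. Remove stop words.
    meaningful_words = [w for w in words if w not in stops]
    # 6. Join the words back into one space-separated paragraph,
    #    ready for the Bag of Words step below.
    return " ".join(meaningful_words)

# Clean all 25,000 training reviews (this may take a few minutes).
clean_train_reviews = [review_to_words(r) for r in train["review"]]
```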
Creating Features From A Bag Of Words (Using Scikit-Learn)
Now that the training reviews are cleaned, we want to convert them to some kind of numeric representation for machine learning. One common approach is called a Bag of Words. The Bag of Words model learns a vocabulary from all of the documents, then models each document by counting the number of times each word appears. For example, consider the following two sentences:
○ Sentence 1: "The cat sat on the hat"
○ Sentence 2: "The dog ate the cat and the hat"

From these two sentences, our vocabulary is as follows:
{ the, cat, sat, on, hat, dog, ate, and }

To get the bags of words, count the number of times each word occurs in each sentence. In Sentence 1, "the" appears twice, and "cat", "sat", "on", and "hat" each appear once, so the features for Sentence 1 are:
{ 2, 1, 1, 1, 1, 0, 0, 0 }
Similarly, the features for Sentence 2 are:
{ 3, 1, 0, 0, 1, 1, 1, 1 }

In the IMDB data, there is a large number of reviews, which will give a large vocabulary. To limit the size of the feature vectors, you should choose some maximum vocabulary size. Below, you will use the 5000 most frequent words (remembering that stop words have already been removed). Use the feature_extraction module from scikit-learn to create bag-of-words features, then look at what the training data array now looks like.

CountVectorizer comes with its own options to automatically do preprocessing, tokenization, and stop word removal - for each of these, instead of specifying "None", you could have used a built-in method or specified your own function to use. Now that the Bag of Words model is trained, check the vocabulary. If you're interested, you can also print the counts of each word in the vocabulary.
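A sketch of this step with scikit-learn's CountVectorizer, assuming the cleaned reviews from the sketches above; note that get_feature_names_out() requires scikit-learn 1.0 or newer (older versions use get_feature_names()):

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Passing None for preprocessing, tokenization, and stop words keeps the
# cleaning we already did; max_features limits the vocabulary to the
# 5000 most frequent words.
vectorizer = CountVectorizer(analyzer="word",
                             tokenizer=None,
                             preprocessor=None,
                             stop_words=None,
                             max_features=5000)

# Learn the vocabulary and turn each review into a vector of word counts.
train_data_features = vectorizer.fit_transform(clean_train_reviews).toarray()
print(train_data_features.shape)  # expected: (25000, 5000)

# Check the learned vocabulary.
vocab = vectorizer.get_feature_names_out()
print(vocab)

# Optionally, print the total count of each word in the vocabulary.
dist = np.sum(train_data_features, axis=0)
for count, tag in zip(dist, vocab):
    print(count, tag)
```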
Apply ML Algorithm
At this point, you have numeric training features from the Bag of Words and the original sentiment labels for each feature vector, so you can apply an ML technique. You can use the Random Forest classifier that is included in scikit-learn (Random Forest uses many tree-based classifiers to make predictions, hence the "forest"). Below, set the number of trees to 100 as a reasonable default value. More trees may (or may not) perform better, but will certainly take longer to run. Likewise, the more features you include for each review, the longer this will take.

Assignment Task
Use the test dataset and run the trained Random Forest on it. This file contains another 25,000 reviews and ids; your task is to predict the sentiment label. Submit your file in the form of a dataframe (.csv or .xlsx format).

HINT: Do not use the test dataset to fit the model (as in training, with "forest.fit"). You need to predict the output: use "forest.predict" and pass the test data features as the argument.
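A sketch of the training step, assuming the feature array and labels from the sketches above:

```python
from sklearn.ensemble import RandomForestClassifier

# Initialize a Random Forest classifier with 100 trees.
forest = RandomForestClassifier(n_estimators=100)

# Fit the forest to the bag-of-words features and the sentiment labels
# (this may take a few minutes).
forest = forest.fit(train_data_features, train["sentiment"])
```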
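And a sketch of the assignment step, assuming the test file is named testData.tsv as in the usual Kaggle layout; the output file name is illustrative:

```python
# Read the test data: 25,000 reviews with ids but no sentiment labels.
test = pd.read_csv("testData.tsv", header=0, delimiter="\t", quoting=3)

# Clean the test reviews with the same function used for training.
clean_test_reviews = [review_to_words(r) for r in test["review"]]

# Use transform(), not fit_transform(): the vocabulary was already learned
# from the training set, and the model is never fit on the test data.
test_data_features = vectorizer.transform(clean_test_reviews).toarray()

# Predict the sentiment labels with forest.predict.
result = forest.predict(test_data_features)

# Submit as a dataframe in .csv format.
output = pd.DataFrame(data={"id": test["id"], "sentiment": result})
output.to_csv("sentiment_predictions.csv", index=False, quoting=3)
```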