Text Mining: Bag of Words


Questions and Answers

What is the primary function of text transformation in text mining?

  • To identify the sentiment of the text.
  • To remove stop words from the text.
  • To convert text into numerical data.
  • To monitor and control the capitalization of text. (correct)

Which of the following correctly describes the role of data preprocessing in text mining?

  • It derives valuable information from unstructured text data. (correct)
  • It evaluates the final results of mining.
  • It reduces the input of processing to essential information sources.
  • It combines conventional processes with data mining techniques.

What is another common term used for feature selection in the context of data mining?

  • Text preprocessing
  • Variable selection (correct)
  • Data transformation
  • Sentiment analysis

The primary goal of feature selection is to:

  • To find essential information sources or to reduce the input of processing. (correct)

What is the fundamental principle behind the Bag of Words (BoW) model in natural language processing?

  • Representing text as a collection of individual words, disregarding grammar and order. (correct)

How does the Bag of Words (BoW) model represent text?

  • As a string of numbers, where each number represents a word. (correct)

What is a potential drawback of using the Bag of Words (BoW) model for text representation?

  • It does not retain information on grammar or word order, potentially losing context. (correct)

What effect does adding new sentences with previously unseen words have on a Bag of Words model?

  • It increases the vocabulary size and the length of the vectors. (correct)

In the context of text analysis, what does Term Frequency (TF) measure?

  • How frequently a term appears in a document. (correct)

In the formula for Term Frequency (TF), $tf_{t,d} = \frac{n_{t,d}}{\text{Number of terms in the document}}$, what does $n_{t,d}$ represent?

  • The number of times term 't' appears in document 'd'. (correct)

How is the Term Frequency (TF) calculated for a word in a document?

  • By dividing the number of times the word appears in the document by the total number of terms in that document. (correct)

What does Inverse Document Frequency (IDF) measure?

  • The importance of a term based on its rarity across all documents. (correct)

In the context of text analysis, what does a high IDF value for a term indicate?

  • The term is highly informative and unique to specific documents. (correct)

Given the formula for IDF, $idf_t = \log(\frac{\text{number of documents}}{\text{number of documents with term 't'}})$, what happens to the IDF value of a term if it appears in every document?

  • It becomes zero. (correct)

How is the TF-IDF score calculated for a term in a document?

  • By multiplying the Term Frequency (TF) by the Inverse Document Frequency (IDF). (correct)

What does the GloVe model primarily use to understand the relationships between words?

  • Co-occurrence matrix. (correct)

What statistical information is considered most important in the GloVe model for word representation?

  • The co-occurrence of words. (correct)

According to the GloVe model, how can the relevance of a word to 'ice' versus 'steam' be determined using probability ratios?

  • By examining the ratio of co-occurrence probabilities of each word with 'ice' and 'steam'. (correct)

In the context of GloVe, if the ratio P(k|ice) / P(k|steam) is large for a word 'k', what does this indicate?

  • 'k' is likely related to ice and unrelated to steam. (correct)

What is the initial step for word vector learning in the GloVe model?

  • Computing the ratios of co-occurrence probabilities. (correct)

During sentiment analysis, which step involves assigning a category like 'positive', 'negative', or 'neutral' to a text?

  • Polarity Classification. (correct)

What is the role of 'Dictionary Matching' in sentiment analysis?

  • To identify the polarity based on predefined dictionaries. (correct)

Which of the following is NOT a typical data source for sentiment analysis?

  • Financial reports. (correct)

Which of the following tasks is commonly performed by sentiment analysis tools like NLTK?

  • Tokenization. (correct)

What is the purpose of stemming in data preprocessing?

  • To remove the suffix from words. (correct)

Which of the following best describes the purpose of tokenization in data preprocessing for sentiment analysis?

  • Removing extra spaces and handling emoticons. (correct)

Why is stop word removal important in sentiment analysis?

  • To improve analysis by removing frequently used, non-informative words. (correct)

What are the key features extracted during the feature extraction phase of sentiment analysis?

  • Term Frequency, Term Co-occurrence, and Part of Speech. (correct)

Which type of resource is typically used in a dictionary-based approach to sentiment analysis?

  • Predefined lists of positive and negative words. (correct)

What is SentiWordNet?

  • A standard dictionary used for sentiment analysis. (correct)

What major goal was envisioned with the Digital India initiative?

  • To digitally empower the people of the country. (correct)

What role does the Twitter API play in data collection for sentiment analysis, as discussed in the provided content?

  • It facilitates the collection of data directly from Twitter. (correct)

What is the purpose of code like dataset['text'] = dataset['text'].apply(lambda x: cleaning_punctuations(x)) in data preprocessing?

  • To remove punctuation from the text. (correct)

In text preprocessing, what does converting all text to lowercase achieve?

  • It standardizes the text to treat words the same regardless of capitalization. (correct)

What is the main purpose of tokenization in the context of text mining?

  • To split text into individual words or terms. (correct)

In the process of text mining, what is the role of stop word removal, and why is it performed?

  • To focus on main content words by excluding common, less informative words. (correct)

How does stemming contribute to text normalization?

  • By converting all words to their base form. (correct)

How does lemmatization differ from stemming?

  • Lemmatization uses a dictionary to convert words to their base form, while stemming operates by just removing the ends of words. (correct)

Flashcards

Text transformation

A method used to monitor and control the capitalization of text.

Data preprocessing

Used in text mining to derive valuable information and knowledge from unstructured text data.

Feature selection

Reducing input for processing or finding essential information sources.

Bag of Words (BoW)

A fundamental approach that represents text as an unordered set of words, disregarding grammar.

Term Frequency (TF)

A measure of how frequently a term appears in a document.

Inverse Document Frequency (IDF)

A measure of the importance of a word based on how rarely it appears across documents.

GloVe Matrix

A co-occurrence matrix that helps capture semantic relationships between words.

Sentiment analysis

Extracts opinions and sentiments to classify text based on polarity.

Data Collection

The part of sentiment analysis that involves gathering data from blogs, reviews, and social media.

NLTK toolkit

A Python toolkit that includes tokenization, stop word removal, stemming, and tagging.

Stemming

The process of removing affixes to reduce words to their root form.

Lemmatization

The process of converting a word to its base or dictionary form.

Data cleaning

Involves removing punctuation, URLs, and stopwords.

Term frequency

The frequency of a term in a document carries weight as a feature.

Lower casing data

Involves lower casing and tokenization.

Study Notes

Text Mining

  • Text transformation monitors and controls text capitalization.
  • Two main document representations are bag of words and vector space.
  • Data preprocessing derives knowledge from unstructured text data.
  • Feature selection reduces processing input and finds essential sources.
  • Feature selection is also called variable selection.
  • Data mining techniques are combined with the conventional process.
  • Final results are evaluated.

Bag of Words (BOW)

  • BoW is a fundamental approach in natural language processing.
  • BoW is a representation of text with numbers.
  • A sentence can be represented as a bag of words vector, i.e., a string of numbers.

BOW Examples

  • An example corpus includes three movie reviews:
    • Review 1: This movie is very scary and long
    • Review 2: This movie is not scary and is slow
    • Review 3: This movie is spooky and good
  • The vocabulary consists of these 11 words: 'This', 'movie', 'is', 'very', 'scary', 'and', 'long', 'not', 'slow', 'spooky', 'good'.
  • Vector of Review 1: [1 1 1 1 1 1 1 0 0 0 0]
  • Vector of Review 2: [1 1 2 0 1 1 0 1 1 0 0] ('is' appears twice)
  • Vector of Review 3: [1 1 1 0 0 1 0 0 0 1 1]
  • Vectors will be of length 11, based upon the total vocabulary
  • Adding new words will increase the vocabulary size and vector lengths.
  • Vectors may contain many 0s, resulting in a sparse matrix which may be undesirable.
  • Information on grammar or word order is not retained.
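
The vectors above can be reproduced in a few lines. Below is a minimal sketch using scikit-learn's CountVectorizer (our choice of tool, not one named in the lesson); note that CountVectorizer lowercases tokens and orders the vocabulary alphabetically, so the columns will not match the hand-built order exactly.

```python
from sklearn.feature_extraction.text import CountVectorizer

reviews = [
    "This movie is very scary and long",
    "This movie is not scary and is slow",
    "This movie is spooky and good",
]

# Build the vocabulary from the corpus and count word occurrences per review.
vectorizer = CountVectorizer()
bow = vectorizer.fit_transform(reviews)

print(vectorizer.get_feature_names_out())  # vocabulary (alphabetical, lowercased)
print(bow.toarray())                       # one count vector per review
```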

Term Frequency (TF)

  • Term Frequency (TF) is how often a term, t, appears in a document, d.
  • $tf_{t,d} = \frac{n_{t,d}}{\text{Number of terms in the document}}$, where $n_{t,d}$ is the number of times term "t" appears in document "d". Each document and term has its own TF value.
  • For the sample Review 2, "This movie is not scary and is slow", with its 11-word vocabulary, the number of words in Review 2 is 8.
  • The TF for the word 'this' = (number of times 'this' appears in Review 2) / (number of terms in Review 2) = 1/8.

TF- Term Frequency Examples

  • Term frequencies for the example corpus of three reviews (TF = term count / number of terms in the review):
    • Review 1 (7 terms): This 1/7, movie 1/7, is 1/7, very 1/7, scary 1/7, and 1/7, long 1/7; not, slow, spooky, good = 0
    • Review 2 (8 terms): This 1/8, movie 1/8, is 2/8, scary 1/8, and 1/8, not 1/8, slow 1/8; very, long, spooky, good = 0
    • Review 3 (6 terms): This 1/6, movie 1/6, is 1/6, and 1/6, spooky 1/6, good 1/6; very, scary, long, not, slow = 0
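
The TF values above can be computed directly from the formula. Here is a minimal sketch in plain Python; the helper name term_frequencies is our own, not from the lesson.

```python
from collections import Counter

def term_frequencies(document, vocabulary):
    """TF of each vocabulary term: count in the document / total terms in it."""
    tokens = document.lower().split()
    counts = Counter(tokens)
    total = len(tokens)
    return {term: counts[term] / total for term in vocabulary}

vocabulary = ["this", "movie", "is", "very", "scary", "and",
              "long", "not", "slow", "spooky", "good"]
review2 = "This movie is not scary and is slow"
print(term_frequencies(review2, vocabulary))  # 'this' -> 1/8 = 0.125, 'is' -> 2/8
```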

Inverse Document Frequency (IDF)

  • IDF measures the importance of a term based on how rarely it appears across documents.
  • $idf_t = \log\left(\frac{\text{number of documents}}{\text{number of documents with term 't'}}\right)$
  • IDF('this') = log(number of documents / number of documents containing the word 'this') = log(3/3) = log(1) = 0
  • Similarly, IDF('movie') = log(3/3) = 0
  • IDF('is') = log(3/3) = 0
  • IDF('not') = log(3/1) = log(3) = 0.48
  • IDF('scary') = log(3/2) = 0.18
  • IDF('and') = log(3/3) = 0
  • IDF('slow') = log(3/1) = 0.48

Inverse Document Frequency (IDF) Examples

  • Inverse Document Frequency (IDF) values computed for the sample movie reviews (IDF is one value per term, shared across all reviews):
    • This 0.00
    • movie 0.00
    • is 0.00
    • very 0.48
    • scary 0.18
    • and 0.00
    • long 0.48
    • not 0.48
    • slow 0.48
    • spooky 0.48
    • good 0.48

TF-IDF Score

  • The TF-IDF score for a term in a document is obtained by multiplying its TF and IDF scores
  • TF-IDF(t,d,D)=TF(t,d)×IDF(t,D)
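
Putting the two measures together, here is a hedged sketch of the full TF-IDF computation for the three movie reviews, using log base 10 to match the worked IDF numbers above; the function names are our own.

```python
import math

reviews = [
    "this movie is very scary and long",
    "this movie is not scary and is slow",
    "this movie is spooky and good",
]

def tf(term, doc):
    tokens = doc.split()
    return tokens.count(term) / len(tokens)

def idf(term, docs):
    n_with_term = sum(term in d.split() for d in docs)
    return math.log10(len(docs) / n_with_term)

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

# 'scary' appears in 2 of 3 reviews: IDF = log10(3/2) ≈ 0.18,
# so TF-IDF('scary', Review 2) = 1/8 × 0.18 ≈ 0.022.
print(round(tf_idf("scary", reviews[1], reviews), 3))
```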

TF-IDF Example

  • Corpus:
    • A = “The car is driven on the road”
    • B = “The truck is driven on the highway”
  • TF, IDF, and TF×IDF for this corpus, reconstructed from the two documents (each has 7 terms, with "the" appearing twice; log base 10):
    • the: TF(A) = 2/7, TF(B) = 2/7, IDF = log(2/2) = 0, TF-IDF = 0 in both
    • car: TF(A) = 1/7, TF(B) = 0, IDF = log(2/1) = 0.30, TF-IDF(A) ≈ 0.043
    • truck: TF(A) = 0, TF(B) = 1/7, IDF = 0.30, TF-IDF(B) ≈ 0.043
    • is: TF(A) = 1/7, TF(B) = 1/7, IDF = 0, TF-IDF = 0
    • driven: TF(A) = 1/7, TF(B) = 1/7, IDF = 0, TF-IDF = 0
    • on: TF(A) = 1/7, TF(B) = 1/7, IDF = 0, TF-IDF = 0
    • road: TF(A) = 1/7, TF(B) = 0, IDF = 0.30, TF-IDF(A) ≈ 0.043
    • highway: TF(A) = 0, TF(B) = 1/7, IDF = 0.30, TF-IDF(B) ≈ 0.043
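
The same kind of scores can be produced with scikit-learn's TfidfVectorizer, though its defaults (smoothed natural-log IDF and L2 normalization) differ from the classroom formula, so the numbers will not match the table exactly; a minimal sketch:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "The car is driven on the road",
    "The truck is driven on the highway",
]

vectorizer = TfidfVectorizer()
scores = vectorizer.fit_transform(corpus)

# Print each term's TF-IDF score in documents A and B.
for term, idx in sorted(vectorizer.vocabulary_.items()):
    print(term, scores.toarray()[:, idx])
```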

GloVe

  • Global Vectors (GloVe) incorporates global context information about words by using corpus-wide co-occurrence statistics.
  • In GloVe, the semantic relationship between words is obtained using a co-occurrence matrix.
  • Consider two sentences:
    • I am a data science enthusiast
    • I am looking for a data science job
  • The co-occurrence matrix GloVe would build for the above sentences, with window size = 1, is shown in the source document.
  • Each value in this matrix represents the count of co-occurrences with the corresponding word in the row/column. The matrix is created from global word co-occurrence counts (the number of times the words appear consecutively, for window size = 1).
  • If a text corpus has 1M unique words, the co-occurrence matrix would be 1M x 1M in shape. Word co-occurrence is the most important statistical information available for the model to 'learn' word representations.
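
A minimal sketch of building the window-size-1 co-occurrence counts for the two sentences; counting consecutive pairs symmetrically is one reasonable reading of the lesson's description, not the exact code behind the source matrix.

```python
from collections import defaultdict

sentences = [
    "I am a data science enthusiast",
    "I am looking for a data science job",
]

cooc = defaultdict(int)
for sentence in sentences:
    tokens = sentence.lower().split()
    for i in range(len(tokens) - 1):  # window size = 1: consecutive pairs only
        a, b = tokens[i], tokens[i + 1]
        cooc[(a, b)] += 1
        cooc[(b, a)] += 1             # count symmetrically

print(cooc[("data", "science")])  # 2: the pair appears in both sentences
```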

GloVe Examples

  • Consider probabilities for target words ice and steam with various probe words k from the vocabulary.
  • Example values for the ratio P(k|ice) / P(k|steam):
    • For words k related to ice but unrelated to steam, e.g. k = solid, the ratio will be large.
    • For words k related to steam but not to ice, e.g. k = gas, the ratio will be small.
    • For words like water or fashion, which are related to both ice and steam or to neither, respectively, the ratio should be approximately one.
  • The probability ratio can better distinguish relevant words (solid and gas) from irrelevant words (fashion and water) than the raw probability.
  • The probability ratio is able to better discriminate between two relevant words.
  • The starting point for word vector learning is ratios of co-occurrence probabilities rather than the probabilities themselves.
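
To see such relationships in trained vectors, here is a hedged sketch using pretrained GloVe embeddings through gensim's downloader; it assumes the gensim package and an internet connection, and "glove-wiki-gigaword-50" is one of gensim's published model names.

```python
import gensim.downloader as api

# Downloads ~66 MB of pretrained 50-dimensional GloVe vectors on first use.
glove = api.load("glove-wiki-gigaword-50")

print(glove.most_similar("ice", topn=3))  # nearest neighbours of 'ice'
print(glove.similarity("ice", "solid"))   # high: related pair
print(glove.similarity("steam", "solid")) # lower: unrelated pair
```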

Sentiment Analysis

  • Sentiment Analysis includes the following processing steps:
    • Data Collection
    • Preprocessing Data
    • Feature Extraction
    • Sentiment Analysis and Dictionary Matching
    • Polarity Classification of results as Positive, Negative, or Neutral

Data Collection for Sentiment Analysis

  • Collecting Data is a vital part of Sentiment Analysis.
  • Various data sources such as blogs, review sites, online posts, and micro-blogging platforms like Twitter and Facebook are used for data collection.

Sentiment Analysis Tools

  • NLTK toolkit is widely used, main features are Tokenization, Stop Word removal, Stemming and tagging; written in Python and available at www.nltk.org.
  • GATE (General Architecture for Text Engineering) is an information extraction system with modules like Tokenizer, Stemming and Part of speech tagger; written in Java and available at https://gate.ac.uk/.
  • Red Opal is widely used to analyze product reviews.
  • Opinion Finder is used for the analysis of different subjective sentences with classification based on polarity and is a platform-independent tool written in Java.

Data Preprocessing

  • Stemming is applied to remove suffixes such as "ing" and "tion" from words.
  • Tokenization: extra spaces are removed, and emoticons are replaced with their definitions using available data sets.
  • Abbreviations are replaced, and pragmatics are handled (e.g. "hapyyyyyyy" becomes "happy").
  • Stop word removal gets rid of low-information words such as articles (a, an) and conjunctions/prepositions (and, between); a minimal sketch of these steps follows below.
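
Here is a minimal sketch of the tokenization, stop word removal, and stemming steps with NLTK; the emoticon, abbreviation, and pragmatics handling described above would need extra lookup tables that are not shown here.

```python
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

nltk.download("punkt")       # tokenizer model
nltk.download("stopwords")   # stop word lists

text = "This movie is very scary and long"
tokens = word_tokenize(text.lower())

# Drop common, non-informative words (a, an, and, is, ...).
stop_words = set(stopwords.words("english"))
content = [t for t in tokens if t not in stop_words]

# Reduce the remaining words to their stems.
stemmer = PorterStemmer()
print([stemmer.stem(t) for t in content])  # ['movi', 'scari', 'long']
```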

Feature Extraction

  • Term Frequency: the frequency of a term in a document carries weight as a feature.
  • Term Co-occurrence: repeated occurrence of word sequences such as unigrams, bigrams, or n-grams. Part of Speech: for each tweet, features are generated from the counts of verbs, adjectives, and nouns (see the sketch below).
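
A hedged sketch of extracting these feature types: unigram/bigram counts with scikit-learn's CountVectorizer and part-of-speech counts with NLTK's tagger. The resource names in nltk.download are NLTK's standard identifiers; the tweet text is illustrative.

```python
import nltk
from collections import Counter
from sklearn.feature_extraction.text import CountVectorizer

nltk.download("punkt")
nltk.download("averaged_perceptron_tagger")

tweet = "Providing high speed internet is the ambitious plan"

# Term frequency / term co-occurrence features: unigram and bigram counts.
vectorizer = CountVectorizer(ngram_range=(1, 2))
counts = vectorizer.fit_transform([tweet])
print(vectorizer.get_feature_names_out())

# Part-of-speech features: counts of verbs (VB*), adjectives (JJ*), nouns (NN*).
tags = nltk.pos_tag(nltk.word_tokenize(tweet))
pos_counts = Counter(tag[:2] for _, tag in tags)
print(pos_counts)
```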

Sentiment Analysis & Polarity Classification

  • A dictionary-based approach uses a predefined dictionary of positive and negative words.
  • SentiWordNet is a standard dictionary used by most researchers today for sentiment analysis.
  • Polarity classification assigns collected reviews to Positive, Negative, or Neutral categories depending on the emotions expressed; a minimal sketch follows below.
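
A minimal sketch of dictionary-based polarity classification; the tiny positive/negative word sets below are illustrative stand-ins for a real lexicon such as SentiWordNet, and the classify_polarity helper is our own.

```python
# Illustrative mini-dictionaries; a real system would load SentiWordNet
# or a similar lexicon instead of hard-coding words.
POSITIVE = {"good", "ambitious", "great", "glory"}
NEGATIVE = {"dead", "letdown", "bad", "slow"}

def classify_polarity(text):
    tokens = text.lower().split()
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    if score > 0:
        return "Positive"
    if score < 0:
        return "Negative"
    return "Neutral"

print(classify_polarity("Good going DigitalIndia"))           # Positive
print(classify_polarity("48 hrs landline dead no internet"))  # Negative
```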

Digital India Case Study

  • The Digital India mission aims to digitally empower the people of the country and includes the following factors:
    • High Speed Internet services to Citizens
    • Business related Services
    • Free Wi-Fi in Trains & Railway Stations
    • Smart City Project.

Data Collection Example

  • Data is collected from Twitter by using the Twitter API (twitter4j).
  • Python and sample twitter data are shown in the source document.

Data Preprocessing & Feature Extraction Example

  • Python snippets and sample tweet data are shown in the source document.

Tweets Sentiment Analysis Example

  • Sample tweets analyzed for Sentiment Polarity are shown in the table:
    • Sample tweet: Providing high speed internet is the ambitious plan of Reliance Group. Good going # DigitalIndia
      • Polarity Dictionary Keywords: Ambitious, Good
      • Polarity Classification: Positive
    • Sample tweet: @UIDAI plz fix d Aadhar android app SMS verification issue otherwise this will be alet down issue 4 @ DigitalIndia @NarendraModi
      • Polarity Dictionary Keywords: Letdown, Dead
      • Polarity Classification: Negative
    • Sample tweet:@Airtel_Presence 48 hrs landline dead no Internet no action. Is this the #DigitalIndia
      • Polarity Dictionary Keywords: no
      • Polarity Classification: Negative

Sample of Tweets Retrieved Example

  • Sample of tweets retrieved and their polarity are shown in the table:
    • Postal department now enjoying a glory as never before, all due to #DigitalIndia initiatives & e-comm business. Now indispensable. - Negative
    • 48 hrs landline dead no Internet no action. Is this #DigitalIndia- Negative
    • Indian #ecommerce space may soon have a new giant, if #government has its way. #digitalIndia- Positive
    • Providing high speed internet connectivity is the ambitious plan of Reliance Group. Good going #Digital India- Positive
    • #DigitalIndia has new avenues in Future- Neutral

Polarity Classification Result Example

  • The example polarity classification results are:
    • Positive 50%
    • Neutral 30%
    • Negative 20%
  • (See source for example distribution pie chart)

Necessary Dependencies for Analysis (Python)

  • Utilities: import re, import numpy as np, import pandas as pd
  • Plotting: import seaborn as sns, from wordcloud import WordCloud, import matplotlib.pyplot as plt
  • nltk: from nltk.stem import WordNetLemmatizer
  • sklearn: from sklearn.svm import LinearSVC, from sklearn.naive_bayes import BernoulliNB, from sklearn.linear_model import LogisticRegression, from sklearn.model_selection import train_test_split, from sklearn.feature_extraction.text import TfidfVectorizer, from sklearn.metrics import confusion_matrix, classification_report

Reading and Loading Datasets in Python

  • Importing the dataset

  • DATASET_COLUMNS = ['target', 'ids', 'date', 'flag', 'user', 'text']
  • DATASET_ENCODING = "ISO-8859-1"
  • df = pd.read_csv('Project_Data.csv', encoding=DATASET_ENCODING, names=DATASET_COLUMNS)
  • df.sample(5)

Exploratory Data Analysis (Python)

  • The output of df.head() is a dataframe showing the header row with the feature names (target, ids, date, flag, user, text) and a sample of rows from the dataset.

Python Dataframe functions for Dataframe inspection

  • df.columns shows column names: Index(['target', 'ids', 'date', 'flag', 'user', 'text'])
  • print('length of data is', len(df))
  • length of data is 1048576
  • df.shape shows the shape of the dataframe (1048576, 6)
  • df.info() shows the dataframe info

Dataframe inspection properties

  • df.dtypes output:
    • target int64
    • ids int64
    • date object
    • flag object
    • user object
    • text object
    • dtype: object
  • Checking for null values:
    • np.sum(df.isnull().any(axis=1)) Output: 0
    • df['target'].unique() Output: array([0, 4], dtype=int64)
    • df['target'].nunique() Output: 2

Data Visualization of Target Variables

  • Python code:

        # Plotting the distribution for the dataset
        ax = df.groupby('target').count().plot(kind='bar', title='Distribution of data')
        ax.set_xticklabels(['Negative', 'Positive'], rotation=0)

        # Storing data in lists
        text, sentiment = list(df['text']), list(df['target'])

  • The distribution-of-data bar graph shows the relative sizes of the target classes.

Graph of target values

  • Python Code: import seaborn as sns
  • sns.countplot(x='target', data=df)
  • The bar graph shows the counts of the target variables

Data Preprocessing Steps and Python Code

  • Selecting the text and target columns: data = df[['text','target']]
  • Replacing the target value 4 with 1 to ease understanding: data['target'] = data['target'].replace(4, 1)
  • Printing the unique values of the target variable gives: array([0, 1], dtype=int64)

Preprocessing Example Code

  • data_pos = data[data['target'] == 1]
  • data_neg = data[data['target'] == 0]
  • A set containing all English stopwords is defined (see the source document for the full list).

Cleaning and removing punctuations sample python code

  • Python code:

        import string

        english_punctuations = string.punctuation
        punctuations_list = english_punctuations

        def cleaning_punctuations(text):
            translator = str.maketrans('', '', punctuations_list)
            return text.translate(translator)

        dataset['text'] = dataset['text'].apply(lambda x: cleaning_punctuations(x))
        dataset['text'].tail()
  • Output of sample text transformation is shown in the table.

Cleaning and removing URLs sample python code

  • Python code:

        def cleaning_URLs(data):
            return re.sub(r'((www\.[^\s]+)|(https?://[^\s]+))', ' ', data)

        dataset['text'] = dataset['text'].apply(lambda x: cleaning_URLs(x))
        dataset['text'].tail()
  • Output of sample text transformation is shown in the table.

Preprocessing steps

  • Removing punctuation like . , ! $ ( ) * % @
  • Removing URLs
  • Removing Stop words
  • Lower casing
  • Tokenization
  • Stemming
  • Lemmatization

Noise Reduction of Text

  • Text data often contains noise such as punctuation, special characters, and irrelevant symbols.
  • Preprocessing helps remove these elements, making the text cleaner and easier to analyze.

Normalization of Text

  • Different forms of words (e.g., "run," "running," "ran") can convey the same meaning but appear in different forms.
  • Preprocessing techniques like stemming and lemmatization help standardize these variations.

Tokenization of Text

  • Text data needs to be broken down into smaller units, such as words or phrases, for analysis.
  • Tokenization divides text into meaningful units, facilitating subsequent processing steps like feature extraction.

Additional Text Cleanup Methods

  • Stopword Removal: Stopwords are common words like "the," "is," and "and" that often occur frequently but convey little semantic meaning.
  • Removing stopwords can improve the efficiency of text analysis by reducing noise.
  • Feature Extraction: Preprocessing can involve extracting features from text, such as word frequencies, n-grams, or word embeddings, which are essential for building machine learning models.
  • Dimensionality Reduction: Text data often has a high dimensionality due to the presence of a large vocabulary.
  • Preprocessing techniques like term frequency-inverse document frequency (TF-IDF) or dimensionality reduction methods can help.

Pre-processing Example

  • A table with example data is shown in the source document; sample properties: v1 = labels, v2 = text content.
  • Output columns in the example: target, ids, date, flag, user, text.

Example Import Data, Show Metadata, Show Data Snippet

  • Python code:

        import pandas as pd

        data = pd.read_csv("spam.csv", encoding="ISO-8859-1")
        print(data.head())

    • A table with example data is shown; sample properties: v1 = labels, v2 = text content.

Output text content analysis

  • Python example code:

        # checking the count of the dependent variable
        data['v1'].value_counts()

  • Output: the count of messages for each label; the labels are 'ham' and 'spam'.

Additional Text Cleanup Methods (Python) and Examples

  • Python code:

        # library that contains punctuation
        import string

        # defining the function to remove punctuation
        def remove_punctuation(text):
            punctuationfree = "".join([i for i in text if i not in string.punctuation])
            return punctuationfree

        # storing the punctuation-free text
        data['clean_msg'] = data['v2'].apply(lambda x: remove_punctuation(x))
        data.head()

  • Sample output metadata is shown in the source document.

Lowercasing and Tokenization Examples

  • The text content is converted to lowercase; the example code and output are shown in the source document.
  • A tokenization step is then applied; the function and its output are shown in the source document.

Stop Word Removal Examples

  • Python example: import nltk and import nltk.corpus to access the stop word lists; descriptions are shown in the source document.
  • The stop words are then removed from the text; the example code and output are shown in the source document.

Applying Stemming Functions

  • Python example code is shown in the source document.

Example Stem Function

  • Crazy → Crazi
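
A minimal sketch of the stemming step with NLTK's PorterStemmer, reproducing the example above (Porter output is lowercased):

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
words = ["crazy", "running", "organization"]
print([stemmer.stem(w) for w in words])  # ['crazi', 'run', 'organ']
```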

Lemmatization

  • Lemmatization reduces a word to its base or dictionary form, taking the word's context into account.
  • Example descriptions and steps are shown in the source document.
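
A minimal sketch of lemmatization with NLTK's WordNetLemmatizer, matching the import listed in the dependencies earlier; the pos argument shows how grammatical context changes the result.

```python
import nltk
from nltk.stem import WordNetLemmatizer

nltk.download("wordnet")  # dictionary used for lemma lookup

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("running", pos="v"))  # 'run'   (as a verb)
print(lemmatizer.lemmatize("better", pos="a"))   # 'good'  (as an adjective)
print(lemmatizer.lemmatize("mice"))              # 'mouse' (noun is the default)
```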
    
