Questions and Answers
What is the primary function of text transformation in text mining?
- To identify the sentiment of the text.
- To remove stop words from the text.
- To convert text into numerical data.
- To monitor and control the capitalization of text. (correct)
Which of the following correctly describes the role of data preprocessing in text mining?
- It derives valuable information from unstructured text data. (correct)
- It evaluates the final results of mining.
- It reduces the input of processing to essential information sources.
- It combines conventional processes with data mining techniques.
What is another common term used for feature selection in the context of data mining?
- Text preprocessing
- Variable selection (correct)
- Data transformation
- Sentiment analysis
The primary goal of feature selection is to:
What is the fundamental principle behind the Bag of Words (BoW) model in natural language processing?
How does the Bag of Words (BoW) model represent text?
What is a potential drawback of using the Bag of Words (BoW) model for text representation?
What effect does adding new sentences with previously unseen words have on a Bag of Words model?
In the context of text analysis, what does Term Frequency (TF) measure?
In the formula for Term Frequency (TF), $tf_{t,d} = \frac{n_{t,d}}{\text{Number of terms in the document}}$, what does $n_{t,d}$ represent?
How is the Term Frequency (TF) calculated for a word in a document?
What does Inverse Document Frequency (IDF) measure?
In the context of text analysis, what does a high IDF value for a term indicate?
Given the formula for IDF, $idf_t = \log(\frac{\text{number of documents}}{\text{number of documents with term 't'}})$, what happens to the IDF value of a term if it appears in every document?
How is the TF-IDF score calculated for a term in a document?
What does the GloVe model primarily use to understand the relationships between words?
What statistical information is considered most important in the GloVe model for word representation?
According to the GloVe model, how can the relevance of a word to 'ice' versus 'steam' be determined using probability ratios?
In the context of GloVe, if the ratio P(k|ice) / P(k|steam) is large for a word 'k', what does this indicate?
What is the initial step for word vector learning in the GloVe model?
During sentiment analysis, which step involves assigning a category like 'positive', 'negative', or 'neutral' to a text?
What is the role of 'Dictionary Matching' in sentiment analysis?
Which of the following is NOT a typical data source for sentiment analysis?
Which of the following tasks is commonly performed by sentiment analysis tools like NLTK?
What is the purpose of stemming in data preprocessing?
Which of the following best describes the purpose of tokenization in data preprocessing for sentiment analysis?
Why is stop word removal important in sentiment analysis?
What are the key features extracted during the feature extraction phase of sentiment analysis?
Which type of resource is typically used in a dictionary-based approach to sentiment analysis?
What is SentiWordNet?
What major goal was envisioned with the Digital India initiative?
What role does the Twitter API play in data collection for sentiment analysis, as discussed in the provided content?
What is the purpose of code like dataset['text'] = dataset['text'].apply(lambda x: cleaning_punctuations(x)) in data preprocessing?
In text preprocessing, what does converting all text to lowercase achieve?
What is the main purpose of tokenization in the context of text mining?
In the process of text mining, what is the role of stop word removal, and why is it performed?
How does stemming contribute to text normalization?
How does lemmatization differ from stemming?
Flashcards
Text transformation
A method used to monitor and control the capitalization of text.
Data preprocessing
Used in text mining to derive valuable information and knowledge from unstructured text data.
Feature selection
Reducing input for processing or finding essential information sources.
Bag of Words (BoW)
A text representation in which a sentence is encoded as a vector of word counts over the vocabulary.
Term Frequency (TF)
How often a term appears in a document, divided by the total number of terms in that document.
Inverse Document Frequency (IDF)
A measure of how common or rare a term is across documents: log(number of documents / number of documents containing the term).
GloVe Matrix
A word co-occurrence matrix built from global co-occurrence counts, which GloVe uses to learn word representations.
Sentiment analysis
Classifying a text as positive, negative, or neutral based on the emotions it expresses.
Data Collection
Gathering text from sources such as blogs, review sites, online posts, and micro-blogging platforms like Twitter and Facebook.
NLTK toolkit
A widely used Python toolkit providing tokenization, stop word removal, stemming, and tagging.
Stemming
Removing suffixes such as "ing" or "tion" to reduce words to a root form.
Lemmatization
Reducing words to their dictionary base form while taking context into account, unlike stemming.
Data cleaning
Removing noise such as punctuation, URLs, and stop words from text before analysis.
Term frequency
The frequency with which a term appears in a document; frequent terms carry more weight.
Lower casing data
Converting all text to lowercase so that words such as "Good" and "good" are treated identically.
Study Notes
Text Mining
- Text transformation monitors and controls text capitalization.
- Two main document representations are bag of words and vector space.
- Data preprocessing derives knowledge from unstructured text data.
- Feature selection reduces processing input and finds essential sources.
- Feature selection is also called variable selection.
- Conventional text processing is combined with data mining techniques.
- Final results are evaluated.
Bag of Words (BOW)
- BoW is a fundamental approach in natural language processing.
- BoW represents text with numbers.
- A sentence can be represented as a bag of words vector, i.e., a vector of numbers.
BOW Examples
- An example corpus includes three movie reviews:
- Review 1: This movie is very scary and long
- Review 2: This movie is not scary and is slow
- Review 3: This movie is spooky and good
- The vocabulary consists of these 11 words: 'This', 'movie', 'is', 'very', 'scary', 'and', 'long', 'not', 'slow', 'spooky', 'good'.
- Vector of Review 1: [1 1 1 1 1 1 1 0 0 0 0]
- Vector of Review 2: [1 1 2 0 1 1 0 1 1 0 0]
- Vector of Review 3: [1 1 1 0 0 1 0 0 0 1 1]
- Vectors will be of length 11, based upon the total vocabulary
- Adding new words will increase the vocabulary size and vector lengths.
- Vectors may contain many 0s, resulting in a sparse matrix which may be undesirable.
- Information on grammar or word order is not retained.
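A minimal sketch of the BoW construction above, building the 11-word vocabulary and the count vectors for the three reviews in plain Python (illustrative only):

```python
# Bag of Words sketch for the three example reviews.
reviews = [
    "This movie is very scary and long",
    "This movie is not scary and is slow",
    "This movie is spooky and good",
]

# Build the vocabulary in order of first appearance.
vocabulary = []
for review in reviews:
    for word in review.split():
        if word not in vocabulary:
            vocabulary.append(word)

# Represent each review as a vector of word counts over the vocabulary.
for review in reviews:
    words = review.split()
    vector = [words.count(term) for term in vocabulary]
    print(vector)
```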
Term Frequency (TF)
- Term Frequency (TF) is how often a term, t, appears in a document, d.
- $tf_{t,d} = \frac{n_{t,d}}{\text{Number of terms in the document}}$, where $n_{t,d}$ is the number of times the term 't' appears in the document 'd'. Each document and term has its own TF value.
- For the sample Review 2 "This movie is not scary and is slow" and its 11 word vocabulary, the number of words in Review 2 equals 8.
- The TF for the word 'this' = (number of times 'this' appears in review 2)/(number of terms in review 2) = 1/8.
TF- Term Frequency Examples
- Term frequencies for the example corpus of three movie reviews are shown in the source table (see the source document for the expanded table). The values below are the TF scores for Review 1, which contains 7 terms.
- This 1/7
- movie 1/7
- is 1/7
- very 1/7
- scary 1/7
- and 1/7
- long 1/7
- not 0
- slow 0
- spooky 0
- good 0
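A short sketch of the TF calculation for Review 2, matching the 1/8 example above (illustrative):

```python
# Computing Term Frequency for each word of Review 2.
review_2 = "This movie is not scary and is slow"
words = review_2.split()

# tf(t, d) = count of t in d / number of terms in d
tf = {term: words.count(term) / len(words) for term in set(words)}
print(tf["This"])  # 1/8 = 0.125
print(tf["is"])    # 2/8 = 0.25
```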
Inverse Document Frequency (IDF)
- IDF measures how common or rare a term is across all the documents in the corpus.
- $idf_t = \log(\frac{\text{number of documents}}{\text{number of documents with term 't'}})$
- The logarithms in these examples are base 10.
- IDF('this') = log(number of documents / number of documents containing the word 'this') = log(3/3) = log(1) = 0
- Similarly, IDF('movie') = log(3/3) = 0
- IDF('is') = log(3/3) = 0
- IDF('not') = log(3/1) = log(3) = 0.48
- IDF('scary') = log(3/2) = 0.18
- IDF('and') = log(3/3) = 0
- IDF('slow') = log(3/1) = 0.48
Inverse Document Frequency (IDF) Examples
- Inverse Document Frequency (IDF) values for the 11-word vocabulary of the sample movie reviews are listed below (see the source document for the expanded table).
- This 0.00
- movie 0.00
- is 0.00
- very 0.48
- scary 0.18
- and 0.00
- long 0.48
- not 0.48
- slow 0.48
- spooky 0.48
- good 0.48
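A sketch of the IDF computation over the three-review corpus, using base-10 logarithms so the results match the worked values above (illustrative):

```python
import math

corpus = [
    "This movie is very scary and long",
    "This movie is not scary and is slow",
    "This movie is spooky and good",
]
documents = [set(doc.split()) for doc in corpus]
N = len(documents)

def idf(term):
    # idf(t) = log(number of documents / number of documents containing t)
    df = sum(1 for doc in documents if term in doc)
    return math.log10(N / df)

print(round(idf("This"), 2))   # 0.0  (appears in all 3 reviews)
print(round(idf("scary"), 2))  # 0.18 (appears in 2 reviews)
print(round(idf("not"), 2))    # 0.48 (appears in 1 review)
```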
TF-IDF Score
- The TF-IDF score for a term in a document is obtained by multiplying its TF and IDF scores
- $\text{TF-IDF}(t,d,D) = TF(t,d) \times IDF(t,D)$
TF-IDF Example
- Corpus:
- A = “The car is driven on the road”
- B = “The truck is driven on the highway”
- The example table shows TF, IDF and TF-IDF for this corpus (see the source document for the expanded table; only the word column is reproduced below).
- The
- Car
- Truck
- Is
- Driven
- On
- Road
- Highway
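A sketch of the TF-IDF calculation for the two-sentence corpus above, multiplying TF and IDF with the same base-10 log convention as the earlier examples (illustrative; library implementations such as scikit-learn use slightly different smoothing and normalization):

```python
import math

corpus = {
    "A": "The car is driven on the road",
    "B": "The truck is driven on the highway",
}
documents = {name: text.lower().split() for name, text in corpus.items()}
N = len(documents)

def tf(term, doc):
    return doc.count(term) / len(doc)

def idf(term):
    df = sum(1 for doc in documents.values() if term in doc)
    return math.log10(N / df)

# TF-IDF(t, d, D) = TF(t, d) * IDF(t, D)
for term in ["the", "car", "road"]:
    print(term, round(tf(term, documents["A"]) * idf(term), 3))
# 'the' scores 0 because it occurs in both documents; 'car' and 'road' get non-zero weight.
```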
GloVe
- 'Global Vectors' (GloVe) captures global context information about words, which means it can also generalize to unseen word combinations.
- In Glove, the semantic relationship between the words is obtained using a co-occurrence matrix.
- Consider two sentences:
- I am a data science enthusiast
- I am looking for a data science job
- The co-occurrence matrix involved in GloVe would look like this for the above sentences, with window size = 1 (see the source document for the co-occurrence matrix values).
- Each value in this matrix represents the count of co-occurrence with the corresponding word in row/column. This co-occurrence matrix is created using global word co-occurrence count (no. of times the words appeared consecutively; for window size=1).
- If a text corpus has 1m unique words, the co-occurrence matrix would be 1m x 1m in shape. The word co-occurrence is the most important statistical information available for the model to ‘learn' the word representation.
GloVe Examples
- Consider probabilities for target words ice and steam with various probe words from the vocabulary
- See the source document for the example table of P(k|ice)/P(k|steam) values.
- For words k related to ice but unrelated to steam (e.g. k = solid), the expected ratio P(k|ice)/P(k|steam) will be large.
- For words k related to steam but not to ice (e.g. k = gas), the ratio will be small.
- For words like water or fashion, which are related to both ice and steam or to neither, respectively, the ratio should be approximately one.
- The probability ratio can better distinguish relevant words (solid and gas) from irrelevant words (fashion and water) than the raw probability.
- The probability ratio is able to better discriminate between two relevant words.
- The starting point for word vector learning is ratios of co-occurrence probabilities rather than the probabilities themselves.
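A small sketch of the co-occurrence counting GloVe starts from, for the two example sentences with a symmetric window of size 1 (illustrative; real GloVe training then fits word vectors to these counts):

```python
from collections import defaultdict

sentences = [
    "I am a data science enthusiast",
    "I am looking for a data science job",
]
window = 1

# Count how often each word pair appears within the window, over the whole corpus.
cooccurrence = defaultdict(int)
for sentence in sentences:
    tokens = sentence.lower().split()
    for i, word in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if i != j:
                cooccurrence[(word, tokens[j])] += 1

print(cooccurrence[("data", "science")])  # 2: the pair occurs in both sentences
print(cooccurrence[("science", "job")])   # 1: only in the second sentence
```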
Sentiment Analysis
- Sentiment Analysis includes the following processing steps:
- Data Collection
- Preprocessing Data
- Feature Extraction
- Sentiment Analysis and Dictionary Matching
- Polarity Classification of results as Positive, Negative or Neutral
Data Collection for Sentiment Analysis
- Collecting Data is a vital part of Sentiment Analysis.
- Various data sources such as blogs, review sites, online posts, and micro-blogging platforms like Twitter and Facebook are used for data collection.
Sentiment Analysis Tools
- NLTK toolkit is widely used, main features are Tokenization, Stop Word removal, Stemming and tagging; written in Python and available at www.nltk.org.
- GATE (General Architecture for Text Engineering) is an information extraction system with modules like Tokenizer, Stemming and Part of speech tagger; written in Java and available at https://gate.ac.uk/.
- Red Opal is widely used to analyze product reviews.
- Opinion Finder is used for the analysis of different subjective sentences with classification based on polarity and is a platform-independent tool written in Java.
Data Preprocessing
- Applying stemming removes suffixes such as "ing" and "tion" from each word.
- Tokenization: removal of extra spaces; emoticons are replaced with definitions using available data sets.
- Abbreviations are replaced, and pragmatics are handled (e.g. hapyyyyyyy becomes happy).
- Stop word removal gets rid of words that carry little meaning, such as 'a', 'an', 'and', 'between'; a minimal NLTK sketch of these steps follows below.
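A minimal NLTK sketch of these preprocessing steps (tokenization, stop word removal, stemming), assuming the required NLTK data packages can be downloaded; the sample tweet is taken from the case study below:

```python
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

# Download tokenizer models and the stop word list (newer NLTK versions also need punkt_tab).
for resource in ("punkt", "punkt_tab", "stopwords"):
    nltk.download(resource, quiet=True)

tweet = "Providing high speed internet is the ambitious plan of Reliance Group"

tokens = word_tokenize(tweet.lower())                    # tokenization
stop_words = set(stopwords.words("english"))
filtered = [t for t in tokens if t not in stop_words]    # stop word removal
stemmer = PorterStemmer()
stems = [stemmer.stem(t) for t in filtered]              # stemming

print(stems)
```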
Feature Extraction
- Term frequency: the frequency of a term in a document carries weight.
- Term co-occurrence: repeated occurrence of word sequences such as unigrams, bigrams, or n-grams; for each tweet, features are also generated for counts of verbs, adjectives, and nouns (see the sketch below).
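As a sketch of n-gram feature extraction, the snippet below builds unigram and bigram count features with scikit-learn (illustrative; the two tweets are made-up examples):

```python
from sklearn.feature_extraction.text import CountVectorizer

tweets = [
    "high speed internet is the ambitious plan",
    "landline dead no internet no action",
]

# ngram_range=(1, 2) produces both unigram and bigram features.
vectorizer = CountVectorizer(ngram_range=(1, 2))
features = vectorizer.fit_transform(tweets)

print(vectorizer.get_feature_names_out()[:5])  # first few n-gram feature names
print(features.toarray().shape)                # (number of tweets, number of n-grams)
```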
Sentiment Analysis & Polarity Classification
- A dictionary-based approach uses a predefined dictionary of positive and negative words (a toy sketch follows below).
- SentiWordNet is a standard dictionary used by most researchers today for sentiment analysis.
- Polarity classification classifies the collected reviews as Positive, Negative, or Neutral depending on the emotions expressed.
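A toy sketch of the dictionary-based approach, using small hypothetical word lists rather than SentiWordNet (illustrative only):

```python
# Hypothetical polarity dictionaries (a real system would use SentiWordNet or similar).
positive_words = {"good", "ambitious", "great", "glory"}
negative_words = {"letdown", "dead", "issue", "bad"}

def classify(tweet):
    tokens = tweet.lower().split()
    score = sum(t in positive_words for t in tokens) - sum(t in negative_words for t in tokens)
    if score > 0:
        return "Positive"
    if score < 0:
        return "Negative"
    return "Neutral"

print(classify("Providing high speed internet is the ambitious plan Good going"))  # Positive
print(classify("48 hrs landline dead no Internet no action"))                      # Negative
```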
Digital India Case Study
- The Digital India mission aims to digitally empower the people of the country and includes the following factors:
- High Speed Internet services to Citizens
- Business related Services
- Free Wi-Fi in Trains & Railway Stations
- Smart City Project.
Data Collection Example
- Data is collected from Twitter by using the Twitter API (Twitter4J).
- Python and sample twitter data are shown in the source document.
Data Preprocessing & Feature Extraction Example
- Python snippets and sample tweet data are shown in the source document.
Tweets Sentiment Analysis Example
- Sample tweets analyzed for Sentiment Polarity are shown in the table:
- Sample tweet: Providing high speed internet is the ambitious plan of Reliance Group. Good going # DigitalIndia
- Polarity Dictionary Keywords: Ambitious, Good
- Polarity Classification: Positive
- Sample tweet: @UIDAI plz fix d Aadhar android app SMS verification issue otherwise this will be alet down issue 4 @ DigitalIndia @NarendraModi
- Polarity Dictionary Keywords: Letdown, Dead
- Polarity Classification: Negative
- Sample tweet:@Airtel_Presence 48 hrs landline dead no Internet no action. Is this the #DigitalIndia
- Polarity Dictionary Keywords: no
- Polarity Classification: Negative
Sample of Tweets Retrieved Example
- Sample of tweets retrieved and their polarity are shown in the table:
- Postal department now enjoying a glory as never before, all due to #DigitalIndia initiatives & e-comm business. Now indispensable. - Negative
- 48 hrs landline dead no Internet no action. Is this #DigitalIndia- Negative
- Indian #ecommerce space may soon have a new giant, if #government has its way. #digitalIndia- Positive
- Providing high speed internet connectivity is the ambitious plan of Reliance Group. Good going #Digital India- Positive
- #DigitalIndia has new avenues in Future- Neutral
Polarity Classification Result Example
- The example Polarity Classification Result are:
- Positive 50%
- Neutral 30%
- Negative 20%
- (See source for example distribution pie chart)
Necessary Dependencies for Analysis (Python)
- Utilities: import re, import numpy as np, import pandas as pd
- Plotting: import seaborn as sns, from wordcloud import WordCloud, import matplotlib.pyplot as plt
- nltk: from nltk.stem import WordNetLemmatizer
- sklearn: from sklearn.svm import LinearSVC, from sklearn.naive_bayes import BernoulliNB, from sklearn.linear_model import LogisticRegression, from sklearn.model_selection import train_test_split, from sklearn.feature_extraction.text import TfidfVectorizer, from sklearn.metrics import confusion_matrix, classification_report
Reading and Loading Datasets in Python
- # Importing the dataset
- DATASET_COLUMNS = ['target', 'ids', 'date', 'flag', 'user', 'text']
- DATASET_ENCODING = "ISO-8859-1"
- df = pd.read_csv('Project_Data.csv', encoding=DATASET_ENCODING, names=DATASET_COLUMNS)
- df.sample(5)
Exploratory Data Analysis (Python)
- df.head() shows the first rows of the dataframe, with the feature names target, ids, date, flag, user, text, and a sample of the data.
Python Dataframe functions for Dataframe inspection
- df.columns shows column names: Index(['target', 'ids', 'date', 'flag', 'user', 'text'])
- print('length of data is', len(df))
- length of data is 1048576
- df.shape shows the shape of the dataframe (1048576, 6)
- df.info() shows the dataframe info
Dataframe inspection properties
- df.dtypes output:
- target int64
- ids int64
- date object
- flag object
- user object
- text object
- dtype: object
- Checking for null values:
- np.sum(df.isnull().any(axis=1)) Output: 0
- df['target'].unique() Output: array([0, 4], dtype=int64)
- df['target'].nunique() Output: 2
Data Visualization of Target Variables
- Python code:
- # Plotting the distribution for dataset
- ax = df.groupby('target').count().plot(kind='bar', title='Distribution of data')
- ax.set_xticklabels(['Negative', 'Positive'], rotation=0)
- # Storing data in lists
- text, sentiment = list(df['text']), list(df['target'])
- The 'Distribution of data' bar graph shows the relative sizes of the target classes.
Graph of target values
- Python Code: import seaborn as sns
- sns.countplot(x='target', data=df)
- The bar graph shows the counts of the target variables
Data Preprocessing Steps and Python Code
- # Selecting the text and target columns
- data = df[['text', 'target']]
- # Replacing the value 4 with 1 to ease understanding
- data['target'] = data['target'].replace(4, 1)
- # Printing unique values of the target variable
- data['target'].unique()  # Output: array([0, 1], dtype=int64)
Preprocessing Example Code
- data_pos = data[data['target'] == 1]
- data_neg = data[data['target'] == 0]
- Defining the set containing all stopwords in English (see the source document for the full list).
Cleaning and removing punctuations sample python code
- Python code:
- import string
- english_punctuations = string.punctuation
- punctuations_list = english_punctuations
- def cleaning_punctuations(text):
-     translator = str.maketrans('', '', punctuations_list)
-     return text.translate(translator)
- dataset['text'] = dataset['text'].apply(lambda x: cleaning_punctuations(x))
- dataset['text'].tail()
- Output of sample text transformation is shown in the table.
Cleaning and removing URLs sample python code
- Python code:
- def cleaning_URLs(data):
-     return re.sub(r'((www\.[^\s]+)|(https?://[^\s]+))', ' ', data)
- dataset['text'] = dataset['text'].apply(lambda x: cleaning_URLs(x))
- dataset['text'].tail()
- Output of sample text transformation is shown in the table.
Preprocessing steps
- Removing punctuations like . , ! $ ( ) * % @
- Removing URLs
- Removing Stop words
- Lower casing
- Tokenization
- Stemming
- Lemmatization
Noise Reduction of Text
- Text data often contains noise such as punctuation, special characters, and irrelevant symbols.
- Preprocessing helps remove these elements, making the text cleaner and easier to analyze.
Normalization of Text
- Different forms of words (e.g., "run," "running," "ran") can convey the same meaning but appear in different forms.
- Preprocessing techniques like stemming and lemmatization help standardize these variations.
Tokenization of Text
- Text data needs to be broken down into smaller units, such as words or phrases, for analysis.
- Text Tokenization divides into meaningful units, facilitating subsequent processing steps like feature extraction.
Additional Text Cleanup Methods
- Stopword Removal: Stopwords are common words like "the,” “is,” and “and” that often occur frequently but convey little semantic meaning.
- Removing stopwords can improve the efficiency of text analysis by reducing noise.
- Feature Extraction: Preprocessing can involve extracting features from text, such as word frequencies, n-grams, or word embeddings, which are essential for building machine learning models.
- Dimensionality Reduction: Text data often has a high dimensionality due to the presence of a large vocabulary.
- Preprocessing techniques like term frequency-inverse document frequency (TF-IDF) or dimensionality reduction methods can help; a short sketch follows this list.
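A short sketch of TF-IDF feature extraction followed by a simple dimensionality reduction step using scikit-learn (illustrative; the documents and the choice of truncated SVD are assumptions, not part of the original example):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

documents = [
    "the car is driven on the road",
    "the truck is driven on the highway",
    "digital india aims to empower citizens",
]

tfidf = TfidfVectorizer(stop_words="english")
X = tfidf.fit_transform(documents)   # sparse document-term matrix
print(X.shape)

svd = TruncatedSVD(n_components=2)   # reduce the high-dimensional TF-IDF space to 2 dimensions
X_reduced = svd.fit_transform(X)
print(X_reduced.shape)               # (3, 2)
```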
Pre-processing Example
- A table with example data is shown in the source document; its columns are v1 (labels) and v2 (text content).
- Output properties in example columns are: target, ids, date, flag, user, text
Example Import Data, Show Metadata, Show Data Snippet
- Python code:
- import pandas as pd
- data = pd.read_csv("spam.csv", encoding="ISO-8859-1")
- print(data.head())
- Table with example data. Sample properties shown in table are: v1 = labels, v2 = text content.
Output text content analysis
- Python example code:
- # checking the count of the dependent variable
- data['v1'].value_counts()
- The output shows the number of messages per label; the labels are ham and spam.
Additional Text Cleanup Methods (Python) and Examples
- Python code to remove punctuation and store the result:
- # library that contains punctuation
- import string
- string.punctuation
- # defining the function to remove punctuation
- def remove_punctuation(text):
-     punctuationfree = "".join([i for i in text if i not in string.punctuation])
-     return punctuationfree
- # storing the punctuation-free text
- data['clean_msg'] = data['v2'].apply(lambda x: remove_punctuation(x))
- data.head()
- Sample output is shown in the source document.
Lower Casing, Tokenization and Stop Word Removal
- The source document shows further snippets that lower-case the text content, tokenize it, and remove stop words using nltk and nltk.corpus; only the labels and captions of those examples survive here.
Applying Stemming Functions
- Example stem produced by the stemming function: 'crazy' becomes 'crazi'.
Lemmatization
- Lemmatization reduces words to their base form while taking context into account.
- A combined sketch of these preprocessing steps, under stated assumptions, follows below.
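A hedged sketch that strings the steps above together for the spam.csv example (the column names v1/v2 follow the earlier snippet; the file path and the NLTK downloads are assumptions):

```python
import string
import pandas as pd
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer, WordNetLemmatizer

# Download the NLTK data the pipeline relies on.
for resource in ("punkt", "punkt_tab", "stopwords", "wordnet", "omw-1.4"):
    nltk.download(resource, quiet=True)

data = pd.read_csv("spam.csv", encoding="ISO-8859-1")

def remove_punctuation(text):
    return "".join(ch for ch in text if ch not in string.punctuation)

stop_words = set(stopwords.words("english"))
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

data["clean_msg"] = data["v2"].apply(remove_punctuation)              # remove punctuation
data["msg_lower"] = data["clean_msg"].str.lower()                     # lower casing
data["tokens"] = data["msg_lower"].apply(word_tokenize)               # tokenization
data["no_stopwords"] = data["tokens"].apply(
    lambda tokens: [t for t in tokens if t not in stop_words])        # stop word removal
data["stemmed"] = data["no_stopwords"].apply(
    lambda tokens: [stemmer.stem(t) for t in tokens])                 # stemming, e.g. crazy -> crazi
data["lemmatized"] = data["no_stopwords"].apply(
    lambda tokens: [lemmatizer.lemmatize(t) for t in tokens])         # lemmatization keeps dictionary forms

print(data[["v2", "stemmed", "lemmatized"]].head())
```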