Questions and Answers
What is one application of text classification mentioned in the content?
spam filtering
What technique involves breaking down text into a collection of words and disregarding grammar and word order?
Bag of Words (BoW) model
In the Bag of Words model, after tokenization and vocabulary building, what is the final step?
Vectorization
What are some challenges mentioned in the text regarding representing multiple words together?
Stemming is a preprocessing technique that involves expanding words to their root form. (True/False)
What does part-of-speech tagging involve?
What is sentiment analysis also known as?
Valence Aware Dictionary for sEntiment Reasoning (VADER) is an example of a __________.
Which technology captures word relationships using a co-occurrence matrix?
What is the trade-off offered by distributed databases?
Latent Dirichlet Allocation (LDA) is a generative probabilistic model used for sentiment analysis. (True/False)
According to the CAP Theorem, what are the three aspects that can be selected?
Match the measure for document similarity with its description:
NoSQL databases do not have to be write-consistent all the time. (True/False)
NoSQL databases allow more tailored database designs with an emphasis on application logic and optimization patterns, shifting from hard design rules to __________ designs.
What is the Hadoop Ecosystem known for?
What is Apache Spark primarily used for?
Study Notes
Text Mining and Natural Language Processing
- Text classification: assigns a discrete label to a text document, where the set of possible labels is Y.
- Applications: spam filtering, analysis of electronic health records, sentiment analysis, and more.
Bag of Words Model
- Simplifies text by breaking it down into a collection of words, disregarding grammar and word order, but maintaining multiplicity (frequency of each word's appearance).
- Steps:
- Tokenization: breaks text into individual words or tokens.
- Vocabulary building: creates a list of unique words.
- Vectorization: converts text into numerical vectors.
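The three steps above can be sketched in plain Python (a minimal illustration; production pipelines typically use a library such as scikit-learn's CountVectorizer):

```python
import re

def tokenize(text):
    # Lowercase and split on non-letter characters; real tokenizers
    # handle punctuation, contractions, and Unicode more carefully.
    return re.findall(r"[a-z]+", text.lower())

def build_vocabulary(documents):
    # Sorted list of unique tokens across all documents.
    return sorted({tok for doc in documents for tok in tokenize(doc)})

def vectorize(doc, vocab):
    # Count how often each vocabulary word appears: word order is
    # discarded, but multiplicity is kept.
    tokens = tokenize(doc)
    return [tokens.count(word) for word in vocab]

docs = ["the cat sat on the mat", "the dog sat"]
vocab = build_vocabulary(docs)
vectors = [vectorize(d, vocab) for d in docs]
```

Note how "the" in the first document gets a count of 2: the model keeps frequency even though it drops grammar and order.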
Tokenization
- Types of tokens:
- Word tokens: most common type, e.g., ["Natural", "Language", "Processing", "is", "fascinating"].
- Subword tokens: useful for languages with rich morphology, e.g., "unhappiness" -> ["un", "happiness"].
- Character tokens: beneficial for languages with large character sets, e.g., "你好" -> ["你", "好"].
- Sentence tokens: treats entire sentences as tokens, e.g., sentence segmentation.
Challenges
- Handling multiple words together (n-grams).
- Dealing with punctuation, which can affect word meaning.
- Stemming: reduces words to their root form, e.g., "running", "runner", and "ran" -> "run".
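A stemmer can be sketched as a suffix-stripping rule table (a deliberately tiny toy; real systems use e.g. the Porter stemmer, and the suffix list here is invented for illustration):

```python
# Ordered suffix rules: longer suffixes are tried first.
SUFFIXES = ["ning", "ing", "ner", "er", "ed", "s"]

def stem(word):
    # Strip the first matching suffix, as long as a stem of at least
    # three characters remains. Irregular forms such as "ran" -> "run"
    # cannot be handled by suffix rules; that requires lemmatization.
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word
```

With this table, "running" and "runner" both reduce to "run", while "ran" passes through unchanged, which is why the example in the text is usually handled by a lemmatizer rather than a stemmer.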
Part-of-Speech Tagging
- Labels each token with its part of speech, encoding information about the word's definition and use in context.
Sentiment Analysis
- Measures the sentiment of phrases or chunks of text.
- Applications: brand monitoring, crisis management, customer feedback analysis, market research, targeted marketing campaigns, and influencer marketing.
- Approaches:
- A rule-based algorithm whose rules are written by a human.
- Machine learning model learned from data.
- Valence Aware Dictionary for sEntiment Reasoning (VADER).
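A rule-based scorer in the spirit of VADER can be sketched as a valence lexicon plus simple rules (the lexicon entries and the single negation rule below are invented for illustration; VADER's actual lexicon and heuristics are richer):

```python
# Toy valence lexicon: positive words score > 0, negative words < 0.
LEXICON = {"good": 1.9, "great": 3.1, "bad": -2.5, "terrible": -3.4}
NEGATORS = {"not", "never", "no"}

def sentiment(text):
    tokens = text.lower().split()
    score = 0.0
    for i, tok in enumerate(tokens):
        valence = LEXICON.get(tok, 0.0)
        # Simple rule: a negator immediately before a word flips its valence.
        if i > 0 and tokens[i - 1] in NEGATORS:
            valence = -valence
        score += valence
    return score
```

Because the rules are written by hand, this kind of system needs no training data, which is its main contrast with the machine-learning approach above.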
Advanced Approaches
- Apply neural networks, e.g., decoder models like GPT.
Clustering Documents
- Needs a measure for document similarity.
- Measures:
- Jaccard Distance.
- Edit distance.
- Term Frequency-Inverse Document Frequency (TF-IDF), an extension of the Bag of Words model.
TF-IDF and Word Embeddings
- TF-IDF enhances the Bag of Words (BoW) model by weighing terms based on their importance, reducing the impact of frequently occurring but less informative words.
- Word embeddings are dense vector representations of words that capture semantic meanings and relationships.
- Unlike BoW, which represents words as sparse vectors with high dimensionality, word embeddings map words into a continuous vector space of lower dimensionality.
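Jaccard distance and TF-IDF weighting from the similarity measures above can be sketched as follows (using one common TF-IDF variant, tf × log(N/df); libraries such as scikit-learn use smoothed variants):

```python
import math

def tokenize(text):
    return text.lower().split()

def jaccard_distance(a, b):
    # 1 - |intersection| / |union| over the sets of tokens.
    sa, sb = set(tokenize(a)), set(tokenize(b))
    return 1 - len(sa & sb) / len(sa | sb)

def tf_idf(docs):
    # Term frequency weighted by inverse document frequency: a term that
    # appears in every document gets weight 0, rarer terms weigh more.
    n = len(docs)
    token_lists = [tokenize(d) for d in docs]
    df = {}  # document frequency of each term
    for toks in token_lists:
        for t in set(toks):
            df[t] = df.get(t, 0) + 1
    return [
        {t: toks.count(t) / len(toks) * math.log(n / df[t]) for t in set(toks)}
        for toks in token_lists
    ]

docs = ["the cat sat", "the dog sat", "the bird"]
weights = tf_idf(docs)
```

Here "the" occurs in all three documents, so its TF-IDF weight is zero, while "cat" keeps a positive weight: exactly the down-weighting of ubiquitous, uninformative words described above.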
Word Embeddings
- Word embeddings allow similar words to have similar vector representations, enabling the model to capture semantic relationships.
- Words with similar meanings are close to each other in the vector space (e.g., "king" and "queen" are closer to each other than to "dog").
- Word embeddings typically have lower dimensionality (e.g., 100 to 300 dimensions) compared to the size of the vocabulary in BoW.
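The "similar words are close in vector space" property is usually measured with cosine similarity; a minimal sketch with invented 3-dimensional vectors (real embeddings have 100 to 300 dimensions) looks like this:

```python
import math

def cosine_similarity(u, v):
    # Cosine of the angle between two vectors: 1 = same direction.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy embeddings, chosen so that "king" and "queen"
# point in a similar direction and "dog" does not.
king = [0.9, 0.8, 0.1]
queen = [0.85, 0.82, 0.15]
dog = [0.1, 0.2, 0.9]
```

In a trained embedding space the same comparison would show cosine_similarity(king, queen) exceeding cosine_similarity(king, dog), which is the relationship the text describes.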
Contextual Information
- Embeddings can capture context if trained on large corpora, allowing them to understand the meaning of words in context.
Word Embedding Models
- GloVe captures word relationships using a co-occurrence matrix, which counts how often words appear together in a corpus.
- FastText extends Word2Vec by representing words as bags of character n-grams, allowing it to handle out-of-vocabulary words better.
- Contextual Embeddings (e.g., BERT, ELMo) generate word embeddings that depend on the context in which the words appear, providing dynamic representations for words in different contexts.
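The co-occurrence matrix that GloVe starts from can be sketched as a windowed pair count (a minimal version; GloVe itself factorizes a weighted form of this matrix, and window handling varies by implementation):

```python
from collections import defaultdict

def cooccurrence(corpus, window=2):
    # Count how often each ordered word pair appears within `window`
    # tokens of each other across the corpus.
    counts = defaultdict(int)
    for sentence in corpus:
        tokens = sentence.lower().split()
        for i, w in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if i != j:
                    counts[(w, tokens[j])] += 1
    return counts

counts = cooccurrence(["the cat sat"])
```

Words that often appear together accumulate large counts, and it is these co-occurrence statistics that the embedding vectors are fitted to reproduce.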
Clustering and Topic Modeling
- Standard ML clustering: K-means (partitive), Hierarchical (agglomerative)
- Topic modeling: Latent Dirichlet Allocation (LDA) is a generative probabilistic model used for topic modeling.
- LDA assumes that documents are mixtures of topics, and topics are mixtures of words.
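The partitive K-means variant mentioned above can be sketched in a few lines (a toy on 2-D points for readability; for documents the points would be TF-IDF or embedding vectors):

```python
import math
import random

def kmeans(points, k, iterations=20, seed=0):
    # Minimal K-means: repeatedly assign each point to its nearest
    # centroid, then move each centroid to the mean of its members.
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            clusters[nearest].append(p)
        for c, members in enumerate(clusters):
            if members:
                centroids[c] = tuple(
                    sum(dim) / len(members) for dim in zip(*members)
                )
    return centroids, clusters

points = [(0, 0), (0, 1), (10, 10), (10, 11)]
centroids, clusters = kmeans(points, 2)
```

On these well-separated points the algorithm recovers the two obvious groups regardless of which points seed the centroids.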
Latent Dirichlet Allocation (LDA)
- The goal of LDA is to discover the hidden topic structure in a collection of documents.
- Each topic is characterized by a distribution over a fixed vocabulary of words.
- Document Representation: Each document is represented as a distribution over topics.
Distributed Databases
- Distributed databases offer scalability, but at a trade-off in consistency guarantees.
- NoSQL databases offer flexibility in design and scalability.
- CAP Theorem: a distributed system can guarantee at most two of Consistency, Availability, and Partition Tolerance.
Big Data and Analytics
- Big data leaves us with two big questions: how to store big data and how to process big data.
- Distributed computing is the key to processing big data.
- Apache Spark is a unified engine for big data processing.
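The distributed-processing model behind the Hadoop ecosystem can be illustrated with a single-process word count in the MapReduce style (a conceptual sketch only; real frameworks run the map, shuffle, and reduce phases across many machines):

```python
from collections import defaultdict
from itertools import chain

def map_phase(chunk):
    # Map: emit a (word, 1) pair for every word in this chunk of input.
    return [(word, 1) for word in chunk.lower().split()]

def shuffle(pairs):
    # Shuffle: group all emitted values by key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: aggregate each key's values independently.
    return {key: sum(values) for key, values in groups.items()}

chunks = ["big data big ideas", "data pipelines"]  # one chunk per "node"
pairs = chain.from_iterable(map_phase(c) for c in chunks)
counts = reduce_phase(shuffle(pairs))
```

Because each map call sees only its own chunk and each reduce call sees only one key's values, both phases parallelize naturally, which is the point of distributed computing for big data.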
Apache Spark
- Spark is a fast and general engine for large-scale data processing.
- Resilient Distributed Datasets (RDD) are Spark's primary data abstraction.
- DataFrames are like distributed in-memory tables with named columns and schemas.
Description
This quiz covers text mining, natural language processing, sentiment analysis, customer segmentation, and more in the context of marketing. It explores the application of big data in marketing analytics and optimization.