Questions and Answers
In computational linguistics, what primary role do pseudowords play in research?
- They enable controlled experiments by removing prior meaning associations. (correct)
- They serve as replacements for outdated vocabulary.
- They are used to test the processing speed of native speakers.
- They help in understanding the historical evolution of languages.
What is the main goal when using the 'Wuggy' algorithm to create pseudowords?
- To ensure the created pseudowords follow the phonotactic constraints of a given language. (correct)
- To create words that are as different as possible from real words.
- To generate words that have clear emotional valence.
- To produce words that are universally pronounceable across all languages.
Why is it important for computational models to account for the emotional valence of words?
- To optimize search engine rankings based on sentiment.
- To better predict stock market fluctuations.
- To simulate human-like understanding and processing of language nuances. (correct)
- To improve the accuracy of machine translation between languages.
What is the primary function of 'edit distance' in computational linguistics?
How does the 'Systematicity Hypothesis' explain the relationship between word forms and their meanings?
In the context of word embeddings, what is the significance of representing words as vectors?
What key advantage does the FastText model offer over Word2Vec when handling language data?
When evaluating word embeddings, what does 'intrinsic evaluation' primarily assess?
According to the experiment by Gatti et al. (2024), what can be inferred about how humans process the valence of novel vs. real words?
What is the main purpose of using Pointwise Mutual Information (PMI) when improving word representations?
When discussing N-gram language models, what is a major limitation related to 'data sparsity'?
What is the purpose of using an <UNK> token in handling unseen words in language modeling?
What key advantage do neural language models offer compared to N-gram models?
What is the main function of the 'self-attention' mechanism in transformer models?
What problem do contextualized word embeddings solve that fixed word embeddings do not?
Why is positional encoding necessary in transformer models?
What is a key motivation behind using subword tokenization in modern NLP models?
What is a key difference in word processing between humans and transformer models?
What aspect of language does the Jaccard distance primarily measure?
Which of the following describes 'lexicon' as defined in the text?
What does the distributional hypothesis propose about word meaning?
Why is 'valence' considered an important feature in computational linguistics?
What is a drawback of using trigram encoding in language models?
What is the purpose of text preprocessing in NLP before training a model?
How do language models use probabilities?
Flashcards
Pseudo Words
Made-up, word-like strings that follow a language's sound patterns but have no established meaning; through repeated use they can gain understanding and acceptance, effectively becoming real words.
Valence
A measure of how positive or negative the meaning conveyed by a word is.
Edit Distance
The smallest number of edits (insert, delete, replace) needed to transform one string into another.
Lexicon
A resource connecting words and expressions.
Distributional Hypothesis
The idea that a word's meaning is determined by its context: words that occur in similar contexts have similar meanings.
Jaccard's Distance
A measure of (dis)similarity between two sets, such as two words' contexts, based on the ratio of their intersection to their union.
Word Embeddings
Continuous numerical representations of words as points in an n-dimensional space, where similar words sit close together.
Word2Vec
A model that learns word vectors by predicting co-occurrences, either via CBOW (predict a word from its context) or Skip-Gram (predict the context from a word).
FastText
An extension of Word2Vec from Facebook AI Research that represents words as character n-grams (subwords), improving handling of rare and unseen words.
Why use pseudowords?
They enable controlled experiments by removing prior meaning associations.
"Wuggy" Algorithm
A pseudoword generator that ensures the created items follow the phonotactic constraints of a given language.
Bigram Encoding
Representing a word by its pairs of adjacent letters.
Trigram Encoding
Representing a word by its three-letter sequences; more context, but more sparsity.
Form-Meaning Mapping
The relationship between a word's form (spelling/sound) and its meaning.
Systematicity Hypothesis
The view that word form predicts meaning, making words easier to learn but more prone to confusion.
Arbitrariness Hypothesis
The view that word form is unrelated to meaning, making new words harder to learn but less prone to confusion.
What do language models do?
Assign probabilities to word sequences and predict the next word.
N-Gram Language Models
Models that predict a word from the previous n-1 words.
LSTM & GRU
Recurrent neural network architectures that retain longer-range context than n-gram models.
Positional Encoding
Information added to transformer inputs so the model knows the order of the words in a sequence.
Subword Tokenization
Splitting words into smaller units so that rare and unseen words can still be represented.
Contextualized word embeddings
Word vectors that depend on the surrounding sentence, so the same word gets different representations in different contexts.
Self-Attention
The transformer mechanism that lets each word weigh its relationship to every other word in the sequence.
Issue with Word2Vec & FastText
They assign a single fixed vector per word, regardless of the context it appears in.
Study Notes
Exams
- Group assignment answers count toward the grade, which incentivizes recognizing correct answers, a job-relevant skill
- Only 1 in 10 points is outcome-based
- Midterm: closed questions, 1 hour, open book (the book is meant to be relied on only when truly stuck)
- Final: mix of closed and open questions, open book, 2.5 hours
Lecture 1
- Pseudowords become words through enough usage and understanding
- Pseudowords possess semantics and some level of meaning
- A pseudoword is plausible if 70% of people agree on its meaning or emotional sentiment
- Plausibility of pseudowords varies by country
- Native English speakers struggle to recognize words with 3+ consonants in a row
- Valence indicates how positive or negative a word is
- Valence is computed from crowd-sourced human ratings collected with rating scales (sliders or Likert scales), alongside arousal and dominance
- Edit distance determines a word's neighbors based on the number of edits between strings
- Two strings are similar when few edits (insert, delete, replace) are needed to turn one into the other
- Form-meaning mapping is easier to learn when meaning follows form, but it increases the risk of confusing similar words; conversely, when form does not predict meaning, the mapping is arbitrary: harder to learn but less prone to confusion
- Terms that share meaning tend to have distinct forms but a higher likelihood of co-occurring
Lexicon
- A lexicon serves as a resource connecting words and expressions
Lecture 2
- The distributional hypothesis states that word meaning is determined by its context or "the company it keeps"
- Word similarity increases when their context overlaps
- The Jaccard index measures set overlap on a scale from 0 (no overlap) to 1 (complete overlap); Jaccard distance is 1 minus this value
- Formula: jacc(x, y) = |x ∩ y| / |x ∪ y|
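As a minimal sketch, the overlap computation can be written directly over Python sets; the two context sets below are invented purely for illustration:

```python
def jaccard_similarity(x: set, y: set) -> float:
    """Jaccard index: size of the intersection divided by size of the union."""
    if not x and not y:
        return 1.0  # convention: two empty sets overlap completely
    return len(x & y) / len(x | y)

# Hypothetical context sets for two words (illustration only).
contexts_cat = {"my", "is", "sleeping", "fur", "pet"}
contexts_dog = {"my", "is", "sleeping", "bark", "pet"}

sim = jaccard_similarity(contexts_cat, contexts_dog)
print(sim, 1 - sim)  # ~0.67 similarity, ~0.33 Jaccard distance
```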
Embeddings
- Continuous numerical representations of words are embedded in an n-dimensional space
- Embedding turns words into numbers, assigning them meaningful positions in space and clustering similar words together
- Word2Vec is a tool that helps computers understand words via their meanings, not just their letters
FastText Model
- An enhanced version of Word2Vec from Facebook AI Research (FAIR) that uses subwords to better handle rare words, misspellings, and word forms
- Treats words as pieces
- For example, "playing" is split into ["pla", "lay", "ayi", "yin", "ing"]
- It helps recognize similar words like "play" and "playing"
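A small sketch of this subword splitting idea (without the word-boundary markers that the real FastText model also adds):

```python
def char_ngrams(word: str, n: int = 3) -> list[str]:
    """Slide a window of size n over the word's letters."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]

print(char_ngrams("playing"))  # ['pla', 'lay', 'ayi', 'yin', 'ing']
print(char_ngrams("play"))     # ['pla', 'lay'] -- shares pieces with "playing"
```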
ChatGPT Summary: Lecture 1
- Title: That word sounds nice – Gauging semantic connotations for entirely novel words
- Authors: Giovanni Cassani & Afra Alishahi
- Topic: How humans interpret novel words (pseudowords) and whether they assign meaning based on existing linguistic patterns.
Introduction: Learning From Novel Words
- All words were once meaningless pseudowords
- People generalize meanings to new words based on form (spelling/sound)
Key question
- Do people recognize positive or negative pseudowords?
Example experiment
- Participants receive unfamiliar pseudowords and are asked, "Which word feels more positive?"
- The goal is to see how emotional valence is applied to meaningless words
Why use pseudowords?
- Pseudowords enable controlled experiments because they carry no prior meaning associations
- Employed in lexical decision, priming, and sentence processing tasks
- Some examples of pseudowords are Keex, Plufgok, Bixmel
Pseudoword creation
- A pseudoword must "sound real"
- "Wuggy" Algorithm flips real word letters
- Ensures phonotactic constraints are followed
Phonotactic constraints in English
- Valid: "Stray" – the onset cluster "str" is allowed in English
- Invalid: "Spfovik" – too many consonants together; "spf" is not a permissible English cluster
- "Trst" is a valid word in Croatian, but not in English
Valence: Emotional Meaning in Words
- How is it determined whether a word “feels” positive or negative?
Defining Valence
- A core aspect of meaning
- Evolutionarily important for survival
- Affects processing, learning, and memory of words
How to Compute Valence?
- Human ratings are determined via surveys using the Likert scale.
- Crowdsourcing Studies collect valence ratings for thousands of words.
How Computers Encode Words
- How can words be converted into numbers for analysis?
Basic Method: Letter Counting
- Each word becomes a 26-dimensional vector (one value per letter).
- The issue is that this is too simple: it ignores letter order and context
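A minimal sketch of this letter-count encoding in Python:

```python
from collections import Counter
import string

def letter_count_vector(word: str) -> list[int]:
    """Encode a word as a 26-dimensional vector of letter counts."""
    counts = Counter(word.lower())
    return [counts.get(letter, 0) for letter in string.ascii_lowercase]

# "banana": 3 x 'a', 1 x 'b', 2 x 'n'; every other dimension stays 0.
print(letter_count_vector("banana")[:5])  # [3, 1, 0, 0, 0]
```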
Using N-Grams for More Context
- Bigram encoding looks at letter pairs
- Trigram Encoding identifies three letter sequences
- The trade-off is that more context means more sparsity
Similarity Between Words: Nearest Neighbors
- How is it decided if two words “look alike”?
- Edit distance (Levenshtein distance) finds a word's nearest neighbors, e.g., which candidate is most similar to "minced"
Edit Distance (Levenshtein Distance)
- Counts changes (insert, delete, replace) needed to turn one word into another
- Lower edit distance equals more similarity
- Normalization Issues: Short words naturally have lower edit distance
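A compact dynamic-programming sketch of Levenshtein distance; the word pairs below are just examples:

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions, and replacements turning a into b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # replace (or keep) ca
        prev = curr
    return prev[-1]

print(levenshtein("minced", "minted"))  # 1: replace 'c' with 't'
print(levenshtein("cat", "cast"))       # 1: insert 's'
# A common normalization divides by the length of the longer word.
print(levenshtein("cat", "cast") / max(len("cat"), len("cast")))  # 0.25
```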
Add-ons to Edit Distance
- Weighted costs can make some edits cheaper than others (e.g., replacing one vowel with another)
- Transposition recognizes swapping two letters as a minor error
- External Features factor in keyboard layout and typing frequency
Form-Meaning Mapping in Language
- How do word forms (spellings) reflect their meanings?
Two Theories
- Systematicity Hypothesis (Dante's view): word form predicts meaning, so learning is easier but there is a risk of confusion
- Arbitrariness Hypothesis (Shakespeare's view): word form is unrelated to meaning, so there is less confusion but new words are harder to learn
- Language balances both principles: common words tend to be systematic and rare words arbitrary
Applying These Ideas to Pseudowords
- Determining whether people can guess the emotional valence of a new word just from its letters.
- Gatti et al. (2024) collected valence ratings, trained a model, and tested it on pseudoword valence
Findings
- Real Words: letter n-grams predict valence poorly (r² = 0.01)
- Novel Words: letter n-grams predict valence better (r² = 0.18)
Conclusion
- Despite the form-meaning mapping being arbitrary, people generalize valence from known to novel words!
Final Thoughts & Open Questions
- Humans continuously learn new words
- Meaning is instinctively assigned to unfamiliar words
- Whether computational models learn in a similar way remains an open question
Future Research Questions
- How do people generalize emotional meaning from form?
- What patterns in language influence this process?
- How can AI models be improved to handle new words better?
Key Takeaways
- Humans subconsciously assign emotional meaning to unfamiliar words.
- Pseudowords remove pre-existing meaning, helping researchers understand language learning
- A computational model can be trained to predict word valence
Lecture 2
- Title: This word might co-occur with nice words – A distributional approach to novel words
- Authors: Giovanni Cassani & Afra Alishahi
- Topic: Understanding how word meaning emerges from co-occurrence patterns in language, using distributional semantics and word embeddings.
Introduction: Words in Context
- In Class 1, words were treated as isolated strings
- In reality, words co-occur with other words
Key Ideas from Linguistics
- Meaning as Use (Wittgenstein, 1953): a word's meaning comes from how it is used, rather than from a fixed definition
- Distributional Hypothesis (Harris, 1954): "Words that occur in similar contexts have similar meanings."
- Firth's Principle (Firth, 1957): "You shall know a word by the company it keeps."
Example
- Cat and dog often appear in similar sentences (e.g., "My __ is sleeping").
- The prior statement suggests they have related meanings, even if not defining them explicitly
Measuring Word Similarity: Jaccard Distance
- How much do two words share the same context?
Limitation
- Jaccard distance ignores frequency
Word Embeddings: Representing Words as Vectors
- Turn words into numbers (vectors) to analyze relationships, moving away from treating them as isolated symbols
What are embeddings?
- Continuous, numerical representations of words as points embedded in an n-dimensional space
- Words as points in a multi-dimensional space
- Similar words have closer vectors
- The space itself depends on how words are used in a language
Example of Context-Based Word Representations
- First, define a target word (e.g., "car")
- Then define a context window of a fixed size around it and count which words appear inside the window
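A sketch of that counting step, using an invented mini-corpus and a window of two words on each side:

```python
from collections import Counter

def context_counts(tokens: list[str], target: str, window: int = 2) -> Counter:
    """Count words within `window` positions of each occurrence of `target`."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == target:
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[tokens[j]] += 1
    return counts

corpus = "the red car stopped because the old car broke down".split()
print(context_counts(corpus, "car"))  # 'the' co-occurs twice, the other neighbours once
```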
Improving Word Representations: Pointwise Mutual Information (PMI)
- A better measure is needed of how significant a given co-occurrence is
Why PMI is Useful
- Words that co-occur more than expected are given a higher weight
- Reduces interference from random co-occurrences
Example
- "Doctor" and "Hospital" shows a High PMI with strong semantic link
- "Doctor" and "Random" shows a Low PMI unlikely to co-occur
Reducing Dimensionality
- Raw co-occurrence matrices are too large (often 100,000+ dimensions), so the data is reduced while keeping the important information
- Singular Value Decomposition (SVD) captures the main patterns in the data
- t-SNE and PCA project high-dimensional data into 2D or 3D for visualization
Goal
- Making embeddings smaller, denser, and more informative
Predicting Word Meaning: Word2Vec
- Train the model to predict word occurrences
Word2Vec Model
- Two approaches: Continuous Bag of Words (CBOW), which predicts a word from its surrounding words, and Skip-Gram with Negative Sampling (SGNS), which predicts surrounding words from a given word
Training Process
- Start with random word vectors
- Adjust the vectors so that actual co-occurrences are predicted correctly
- The resulting embeddings end up capturing word meaning
Example Results
- King - Man + Woman ≈ Queen
- Paris - France + Italy ≈ Rome
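A minimal training sketch using the gensim library (assuming it is installed); the toy corpus is invented, and analogies like king − man + woman ≈ queen only emerge reliably when training on millions of sentences:

```python
from gensim.models import Word2Vec

# Tiny invented corpus; real embeddings need far more text.
sentences = [
    ["the", "king", "rules", "the", "kingdom"],
    ["the", "queen", "rules", "the", "kingdom"],
    ["the", "man", "walks", "in", "the", "city"],
    ["the", "woman", "walks", "in", "the", "city"],
]

# sg=1 selects Skip-Gram (with negative sampling); sg=0 would select CBOW.
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1, epochs=50)

# Vector arithmetic: king - man + woman, expected to land near "queen" on large corpora.
print(model.wv.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
```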
Handling Pseudowords: FastText Instead of Word2Vec
- How are embeddings created for words that don't appear in the training data?
FastText Model (Facebook AI Research)
- Key improvement: uses character-level n-grams instead of whole words
- For example, "windowist" is composed of n-grams like: win, ind, ndo, dow, ist
- The meaning of "windowist" can be inferred from similar subwords even though "windowist" was never seen before
Why is FastText Useful?
- Handles rare words better
- Works well for languages with complex morphology
- Captures subword patterns
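A minimal gensim sketch (assuming gensim is installed) showing how FastText can build a vector for the unseen word "windowist" from its character n-grams; the corpus is invented:

```python
from gensim.models import FastText

sentences = [
    ["she", "looked", "through", "the", "window"],
    ["the", "window", "was", "open"],
    ["he", "is", "a", "violinist"],
]

# min_n / max_n control the character n-gram lengths used for subword vectors.
model = FastText(sentences, vector_size=50, window=2, min_count=1, min_n=3, max_n=5, epochs=20)

# "windowist" never occurs in the corpus, but a vector is assembled from its n-grams.
print(model.wv["windowist"].shape)                 # (50,)
print(model.wv.similarity("windowist", "window"))  # nonzero, driven by shared subwords
```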
Evaluating Embeddings: How Do We Know They Work?
- Embeddings are tested by checking how well they match human intuition
Intrinsic Evaluation
- Includes word similarity, Analogy, and Clustering of similar words
- Word similarity tasks (e.g., WordSim-353, SimLex-999) are assessed.
- Analogy tasks (e.g., King - Man + Woman = Queen) are tested
- Finally, clustering similar words in a 2D space (e.g., via PCA or t-SNE).
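A sketch of the word-similarity flavour of intrinsic evaluation, comparing model similarities to human ratings with a rank correlation; the embeddings and ratings below are invented stand-ins for datasets like SimLex-999 (assumes numpy and scipy):

```python
import numpy as np
from scipy.stats import spearmanr

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical embeddings and human similarity ratings (illustration only).
rng = np.random.default_rng(0)
emb = {w: rng.random(50) for w in ["cat", "dog", "car", "banana"]}
pairs = [("cat", "dog", 8.5), ("cat", "car", 2.0), ("dog", "banana", 1.0)]

model_scores = [cosine(emb[a], emb[b]) for a, b, _ in pairs]
human_scores = [rating for _, _, rating in pairs]

rho, _ = spearmanr(model_scores, human_scores)  # rank correlation with human judgements
print(f"Spearman correlation: {rho:.2f}")
```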
Extrinsic Evaluation
- Utilizes embeddings in real tasks (e.g., sentiment analysis, machine translation).
- If embeddings improve task performance, they are considered useful
Conclusion: What About Pseudoword Valence?
- Do people judge pseudowords based on co-occurrence patterns or letter structure?
Findings
- For real words, FastText embeddings capture valence well (r² = 0.62).
- Letter n-grams work for pseudowords (r² = 0.12).
Takeaway
- Humans process known words differently from pseudowords
- When co-occurrence information is missing, meaning is inferred from form
Final Thoughts
- Words get their meanings from context and co-occurrence patterns, aided by embeddings
- Models can handle unknown words, but humans still rely on surface-level features
Lecture 3
- Title: A Replication – What Matters When Building on Another Study
- Authors: Giovanni Cassani & Afra Alishahi
- Topic: How to replicate, evaluate, and extend a study, focusing on valence prediction with different models
Introduction: Why Replication Matters
- A study's findings need independent confirmation to hold up
- Key steps in replication are reproducing existing results, modifying variables, and evaluating the findings
- Example: replicating Gatti et al. (2024), which tested how pseudowords encode emotional valence, using the Warriner et al. (2014) dataset
Replication Pipeline
- The experiment is carried out with a structured pipeline that predicts valence for pseudowords
- Steps: extract unigram (letter) vectors, train a linear regression model on rated real words, apply the model to pseudowords, and check whether the predicted valence aligns with human judgements
- Finally, compute r² to measure how well the model fits
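A sketch of that pipeline with scikit-learn, assuming letter-count (unigram) features; the word lists and valence ratings are invented stand-ins for the Warriner et al. (2014) lexicon:

```python
import string
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

def unigram_vector(word: str) -> np.ndarray:
    """26-dimensional letter-count (unigram) feature vector."""
    vec = np.zeros(26)
    for ch in word.lower():
        if ch in string.ascii_lowercase:
            vec[string.ascii_lowercase.index(ch)] += 1
    return vec

# Toy stand-in for a rated valence lexicon (ratings invented for illustration).
train = {"love": 8.7, "happy": 8.5, "war": 2.1, "grief": 2.0, "table": 5.2}
model = LinearRegression().fit(np.stack([unigram_vector(w) for w in train]),
                               np.array(list(train.values())))

# Model fit on held-out rated words (again invented), measured with r^2.
test = {"joy": 8.2, "pain": 2.4}
preds = model.predict(np.stack([unigram_vector(w) for w in test]))
print("r^2:", r2_score(list(test.values()), preds))

# Finally, apply the trained model to pseudowords, which have no prior ratings.
for pw in ["gorpeous", "tutoured"]:
    print(pw, float(model.predict(unigram_vector(pw).reshape(1, -1))[0]))
```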
Extending the Study: Testing Different Models & Stimuli
- Not all pseudowords behave the same way
- Testing Different Models by Comparing word2vec, FastText, and LLMs
- Pseudoword selection has to be carried out carefully, e.g., deciding whether stimuli like "Combatman" (built from recognizable parts) should be used
- Humans may categorize a pseudoword as a real word if it looks realistic enough
Pseudowords in Lexical Decision Tasks
- Lexical decision tasks test whether people can tell pseudowords apart from actual words
- Shown a mix of words and pseudowords, participants have to decide which are real words
- Stimuli include real words (e.g., MINE) and pseudowords (e.g., QWEFQK)
Notes
- This confirms that pseudowords too similar to a real word get mistaken for one
- Complicates form-meaning mapping studies
Pseudoword Valence Ratings
- The study examines whether people can judge the valence of pseudowords
- Participants rate how positive or negative a pseudoword feels
- For example, "Gorpeous" sounds positive while "Tutoured" sounds negative
Methodology
- Some pseudowords sound similar to existing words
- Others evoke little recognizable meaning
Morphology
- Words made from meaningful parts are easier to understand, for example "happiness" from the word "happy"
- If pseudowords are built using recognizable parts, they may be easier to interpret
Challenges
- Lack of meaning leads to regression to the mean
- Ratings cluster around the average because pseudowords lack strong associations
- Focusing on pseudowords with extreme predicted valence is one solution
Correlation Methods for Evaluation
- Statistical methods determine how well model predictions correlate with human ratings
- Different models emphasize different aspects, which can make some tests, like Pearson's r, less than ideal for pseudowords
Testing New Models
- Do new NLP models capture pseudoword valence?
- FastText was tried previously; LLMs are explored next
- LLMs are trained on massive datasets using transformer architectures and model meaning over large context windows
Challenges
- LLM embeddings require regularization
- Training on multiple languages could reveal universal patterns
Text Preprocessing
- Different NLP models require different text preprocessing methods
- Text processing is carried out using 3 steps
Key Concepts
- Tokenization is splitting the text into units
- Three main methods are applied: word-based splitting, FastText-style character n-grams, and subword/special tokens as used by LLMs
- The final two steps are lemmatization (converting a word to its base form) and stemming (removing affixes)
- Model performance will vary depending on the preprocessing
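A sketch of the three steps using NLTK (assuming the library and its "punkt" and "wordnet" resources are available); other toolkits such as spaCy would work similarly:

```python
import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download("punkt", quiet=True)    # tokenizer model
nltk.download("wordnet", quiet=True)  # lemmatizer dictionary

text = "The children were playing happily in the gardens"

tokens = nltk.word_tokenize(text)  # 1. tokenization (word-based splitting)

lemmatizer = WordNetLemmatizer()
lemmas = [lemmatizer.lemmatize(t.lower()) for t in tokens]  # 2. lemmatization (base forms)

stemmer = PorterStemmer()
stems = [stemmer.stem(t.lower()) for t in tokens]  # 3. stemming (strip affixes with rules)

print(tokens)  # ['The', 'children', 'were', 'playing', 'happily', 'in', 'the', 'gardens']
print(lemmas)  # e.g. 'children' -> 'child', 'gardens' -> 'garden'
print(stems)   # e.g. 'playing' -> 'play', 'happily' -> 'happili'
```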
Summary & Takeaways
Refining the study
- It refines the current understanding of pseudoword processing by testing new evaluation methods
- Replication matters; verification is needed when extending a study
- Different categories of pseudowords behave differently
- Form-meaning mappings are multi-faceted, and the input text has to be preprocessed appropriately for each model
- LLM embeddings introduce new dimensions that need normalization
Lecture 4
- Title: Modeling Language Through Prediction – Word After Word After Word After…
- Authors: Giovanni Cassani & Afra Alishahi
- Topic: Understanding how language models predict words, covering n-gram models, Markov chains, neural embeddings, and transformers
Introduction: Language Models & Prediction
- Language models rely on prediction and word sequences
- Words are not independent; they appear in patterns
- Language models assign probabilities to word sequences and can generate new sentences, which in turn aids speech recognition and spelling error correction
N-Gram Language Models
- They predict a word based on the previous n-1 words
- Bigram and trigram models follow the same idea with different history lengths
- However, data sparsity is an issue, and they are limited by the lack of long-term context
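A minimal sketch of a bigram model with maximum likelihood estimates, over an invented mini-corpus:

```python
from collections import Counter

corpus = "the cat sat on the mat the cat slept".split()

bigram_counts = Counter(zip(corpus, corpus[1:]))  # counts of adjacent word pairs
history_counts = Counter(corpus[:-1])             # counts of each history word

def bigram_prob(history: str, word: str) -> float:
    """MLE estimate: count(history, word) / count(history)."""
    if history_counts[history] == 0:
        return 0.0
    return bigram_counts[(history, word)] / history_counts[history]

print(bigram_prob("the", "cat"))  # 2/3: "the" is followed by "cat" twice and "mat" once
print(bigram_prob("the", "dog"))  # 0.0: never seen -> the data sparsity problem
```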
Probabilities in Language Models
- Predictions are made by assigning probabilities to possible next words
- Used in spelling correction and Machine Translation
Chain Rule of Probability
- The probability of a whole sentence is obtained by multiplying conditional word probabilities; this becomes intractable for long histories, so the solution is Markov approximations and word embeddings
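Written out, with the bigram (first-order Markov) approximation that makes it tractable:

```latex
P(w_1, \dots, w_n) = \prod_{i=1}^{n} P(w_i \mid w_1, \dots, w_{i-1})
                  \approx \prod_{i=1}^{n} P(w_i \mid w_{i-1})
```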
Markov Chains & Maximum Likelihood Estimation (MLE)
- The tactic of approximating a word's probability using only its recent history, with probabilities estimated from counts (maximum likelihood estimation)
- Has 2 main challenges
Challenges
- Unknown words
- Data sparsity
Handling Unseen Words
- Unseen words are handled by mapping them onto tokens the model already knows
- A common solution is a special <UNK> token that stands in for any out-of-vocabulary word
Smoothing Techniques
- Probabilities have to be adjusted so that unseen n-grams do not receive zero probability
- This is aided by Laplace smoothing, which adds a small count to every possible n-gram, or by backoff & interpolation, where a unigram estimate makes up for a missing bigram
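A sketch of add-one (Laplace) smoothing on the same kind of bigram counts as above; the corpus is again invented:

```python
from collections import Counter

corpus = "the cat sat on the mat the cat slept".split()
bigram_counts = Counter(zip(corpus, corpus[1:]))
history_counts = Counter(corpus[:-1])
vocab_size = len(set(corpus))

def laplace_bigram_prob(history: str, word: str, k: float = 1.0) -> float:
    """Add-k smoothing (Laplace when k=1): (count + k) / (history count + k * |V|)."""
    return (bigram_counts[(history, word)] + k) / (history_counts[history] + k * vocab_size)

print(laplace_bigram_prob("the", "cat"))  # 3/9 ≈ 0.33, down from the unsmoothed 2/3
print(laplace_bigram_prob("the", "dog"))  # 1/9 ≈ 0.11, non-zero even though never seen
```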
N-Grams to Word Embeddings
- Instead of storing raw probabilities, a language model maps words to vectors in a continuous space