5-Contextual-Information.pdf
N-grams For Word Context And Word Similarity

An n-gram is a subsequence of n contiguous items.

Example bigrams on the text "Introduction to text mining":
Introduction, Introduction to, to text, text mining, mining
(the first and last bigram appear to be padded with the text boundary, just like the character trigrams below)

Example trigrams on the characters of "example":
example → _ex, exa, xam, amp, mpl, ple, le_

Character n-grams are particularly beneficial for languages other than English:
▶ more tolerant to declension and cases
▶ compound words have a vector similar to those of their parts

Tisch → _Ti Tis isc sch ch_
Tennis → _Te Ten enn nni nis is_
Tischtennis → _Ti Tis isc sch cht hte ten enn nni nis is_

Word Embeddings – Motivation

Consider these two sentences (example taken from Kusner et al. [KSKW15]):

Obama speaks to the media in Illinois.
The President greets the press in Chicago.

Cosine similarity: 0, if stop words were removed.

Want to recognize:
Obama ~ President
speaks ~ greets
press ~ media
Illinois ~ Chicago

The bag-of-words representation does not capture word similarity. For example, the words Obama and President are represented by different unit vectors $e_{\text{Obama}}$ and $e_{\text{President}}$, which are orthogonal:

$e_{\text{Obama}} \cdot e_{\text{President}} = 0$

Because of this, the documents are completely dissimilar (except for stop words).

➜ We want a word representation where similar words have similar vectors, e.g., $\mathrm{sim}(\text{Obama}, \text{President}) > 0$.
Preferably also of lower dimensionality (100–500) than our vocabulary!

Vector Representations of Words

In the bag-of-words model, we can interpret our document vectors as

$\vec{d} = \sum_{w \in d} e_w$

where $e_w$ is a unit vector containing a 1 in the column corresponding to word $w$.

In "distributed representations", the word information is not in a single position anymore: each word vector has many non-zero entries.

Alternative Vector Representations of Words

We can obtain such representations with different approaches:
▶ Document occurrences (term-document matrix)
▶ Neighboring words (co-occurrence vectors)
▶ Neighboring words with positions
▶ Character trigrams (trigraphs)

Learned Vector Representations of Words

The previous examples were engineered, high-dimensional, and sparse features. Comparing words by the cosine similarity of such vectors already works to some extent.
➜ Can we learn lower-dimensional (dense) features from the data?
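To make the engineered baseline concrete before turning to learned features, here is a minimal sketch of the character-trigram representation compared with cosine similarity. It assumes `_` as the padding symbol, case-sensitive trigrams as in the Tisch/Tennis example, and raw trigram counts as vector entries; the function names `char_trigrams` and `cosine` are illustrative and not taken from the slides.

```python
from collections import Counter
from math import sqrt


def char_trigrams(word, pad="_"):
    """Return the padded character trigrams of a word, in order."""
    padded = pad + word + pad
    return [padded[i:i + 3] for i in range(len(padded) - 2)]


def cosine(grams_a, grams_b):
    """Cosine similarity of two n-gram lists, using raw counts as sparse vectors."""
    va, vb = Counter(grams_a), Counter(grams_b)
    dot = sum(va[g] * vb[g] for g in va.keys() & vb.keys())
    norm = sqrt(sum(c * c for c in va.values())) * sqrt(sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0


tisch = char_trigrams("Tisch")
tennis = char_trigrams("Tennis")
tischtennis = char_trigrams("Tischtennis")

print(char_trigrams("example"))
print("Tisch  vs Tischtennis:", round(cosine(tisch, tischtennis), 3))
print("Tennis vs Tischtennis:", round(cosine(tennis, tischtennis), 3))
print("Tisch  vs Tennis:     ", round(cosine(tisch, tennis), 3))
```

The compound shares four trigrams with each of its parts, so both similarities are clearly positive, while Tisch and Tennis share none. The vectors remain sparse and high-dimensional, though, which is what motivates the learned, dense alternatives.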
LSA can be seen as one approach to learning such dense features: it factorizes the term-document matrix.

Many methods can be seen as a variant of this:
▶ build a (large) explicit representation
▶ factorize it into a lower-dimensional approximation (not necessarily by SVD, but instead, e.g., similar to neural networks)
▶ use the approximation as the new feature vector instead

Word Context

▶ The context of a word is representative of the word.
▶ Similar words often have a similar context (e.g., girl).
▶ Statistics can often predict the word, based on the context.
▶ Context of a word ≈ a document: a, chasing, is, on, playground, the
➜ Try to model words based on their context!
But: many documents per word, same problems as with real documents, …

Learning Vector Representations

Can we learn a representation of words/text? [BDVJ03; RoGoPl06; RuHiWi86]
(Recently: word2vec [MCCD13], vLBL [MnKa13], HPCA [LeCo14], [LeGo14a], GloVe [PeSoMa14])

Basic idea:
1. Train a neural network (or a map function, or factorize a matrix) to either:
   ▶ predict a word, given the preceding and following words (Continuous Bag of Words, CBOW)
   ▶ predict the preceding and following words, given a word (Skip-Gram)
2. Configure one layer of the network to have $d$ dimensions (for small $d$). Usually: a one-layer network (not deep), 100 to 1000 dimensions.
3. Map every word to this layer, and use this as feature.
4. Predict words by similarity to a "target" vector.

Note: this maps words, not documents!

Word2Vec – Beware of Cherry Picking

Famous example (with the famous "Google News" model):

Berlin is to Germany as Paris is to France
Berlin − Germany ≈ Paris − France

Beware of cherry picking! Other queries of the same form (the answers returned by the model are not preserved in this transcript):

▶ Berlin is to Germany as Washington_D.C. is to ?
▶ Ottawa is to Canada as Washington_D.C. is to ?
▶ Germany is to Berlin as United_States is to ?
▶ Apple is to Microsoft as Volkswagen is to ?
▶ man is to king as boy is to ?
▶ king is to man as prince is to ?

Computed using https://rare-technologies.com/word2vec-tutorial/ (now defunct)

Word2Vec – Beware of Data Bias

Most similar words to Munich: Munich_Germany, Dusseldorf, Berlin, Cologne, Puchheim_westward
Most similar words to Berlin: Munich, BBC_Tristana_Moore, Hamburg, Frankfurt, Germany
Most similar words to Dortmund: Leverkusen, Borussia Dortmund, Bayern, Schalke, Bundesliga

Computed using https://rare-technologies.com/word2vec-tutorial/ (now defunct)

Word2Vec and Word Embeddings – Strengths & Limitations

▶ focused primarily on words, not on documents
▶ captures certain word semantics surprisingly well
▶ mostly preserves linguistic relations: plural, gender, language, … (and thus very useful for machine translation)
▶ requires massive training data (needs to learn projection matrices of size $|V| \times d$, i.e., vocabulary size times embedding dimensionality)
▶ only works for frequent-enough words, unreliable on low-frequency words
▶ does not distinguish homonyms, and is affected by training data bias
▶ how to fix, if it does not work as desired? [BCZS16; CaBrNa17]
▶ not completely understood yet [GoLe14; LeGo14b; LeGoDa15], but related to matrix factorization [PeSoMa14]

Summary

▶ a vector representation of documents is often needed for analysis
▶ text needs to be segmented into sentences, tokenized, stemmed, …
▶ sparse representation allows using text search techniques
▶ cosine similarity is often used (on sparse representations)
▶ TF-IDF normalization improves search and similarity results
▶ many heuristic choices (e.g., TF-IDF variant)
▶ dense models (e.g., word2vec) are a recent hype, with major successes
  but: they need huge training data, give no guarantees, and are hard to fix if they do not work right
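The CBOW and Skip-Gram objectives from "Learning Vector Representations" differ only in which side of a (word, context) pair is predicted. The following is a minimal sketch of how the training examples could be generated, assuming whitespace tokenization and a symmetric context window; the function names, the window size, and the example sentence are illustrative assumptions, and no actual network is trained here.

```python
def windows(tokens, window=2):
    """Yield (center word, list of context words) for every position."""
    for i, center in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        yield center, left + right


def skipgram_pairs(tokens, window=2):
    """Skip-Gram: given a word (input), predict each neighboring word (target)."""
    return [(center, ctx) for center, context in windows(tokens, window) for ctx in context]


def cbow_pairs(tokens, window=2):
    """CBOW: given all neighboring words (input), predict the center word (target)."""
    return [(tuple(context), center) for center, context in windows(tokens, window)]


# Hypothetical sentence, chosen only to echo the "girl ... playground" context example.
tokens = "the girl is chasing a ball on the playground".split()

for pair in skipgram_pairs(tokens, window=1)[:4]:
    print("skip-gram:", pair)
for pair in cbow_pairs(tokens, window=1)[:3]:
    print("cbow:     ", pair)
```

In both variants the learned word vector is the hidden-layer representation of the input word; the pairs only define what the network is asked to predict.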