Vector Space Modelling with Machine Learning

Document Details


Uploaded by AstonishedHyperbolic

Author: Leonard Johard

Tags

vector space modeling, machine learning, natural language processing, information retrieval

Summary

This presentation discusses vector space modelling techniques for machine learning applications, focusing on natural language processing tasks. Methods such as latent semantic analysis (LSA), word2vec, and doc2vec are covered, along with their strengths and limitations. The presentation also covers the deep structured semantic model (DSSM) developed by Microsoft and later extended by Yandex, and closes with transformer-based models such as BERT.

Full Transcript

Vector space modelling with ML
Leonard Johard

Agenda
- LSA — what important things are missing?
- ANNs solving the embedding task
  ○ word2vec, doc2vec
  ○ DSSM
  ○ Transformers

Criticism of LSA
- Speed issue. Even optimized SVD is slow and requires memory and CPU time.
  ○ Fast Randomized SVD (Facebook)
  ○ Alternating Least Squares (ALS): distributed and streaming versions
- Model issue. PCA assumes a normal data distribution, but life is complicated; SVD preserves angles, but angle != semantic similarity. Both dimension-reduction methods are global.
  ○ pLSA: statistical independence (any distribution) vs. linear orthogonality
- What about adding new words/texts?
- Can we take some model and not care about distributions, statistics, memory and so on?

word2vec (2013): a group of methods
- Do not compute, predict!
- Two architectures: continuous skip-gram and CBoW (architecture diagrams in the original slides).

CBoW: Continuous Bag of Words
- BoW: a multiset of objects, disregarding order (works for images, texts, …)
- CBoW: a continuous sample of text (a window)
- Input: one-hot context encoding (a dictionary-size vector)
- Output: a distribution over the dictionary (a dictionary-size vector)
- The softmax activation models a likelihood
- Where is the embedding? (In the input-to-hidden projection weights; see the sketch below.)
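To make the CBoW description above concrete, here is a minimal NumPy sketch of the forward pass. It is not from the slides; the vocabulary size, embedding dimension and context ids are arbitrary toy values. It also answers "Where is the embedding?": the rows of the input-to-hidden projection matrix are the word vectors.

    import numpy as np

    # Toy sizes (illustrative only): a vocabulary of V words, D-dimensional embeddings.
    V, D = 10, 4
    rng = np.random.default_rng(0)
    W_in = rng.normal(scale=0.1, size=(V, D))    # input->hidden projection: the word embeddings
    W_out = rng.normal(scale=0.1, size=(D, V))   # hidden->output weights

    def cbow_forward(context_ids):
        """Return a softmax distribution over the dictionary for the center word."""
        # Averaging one-hot context vectors and multiplying by W_in is the same as
        # averaging the corresponding rows of W_in (the context word embeddings).
        h = W_in[context_ids].mean(axis=0)       # hidden/projection layer, shape (D,)
        scores = h @ W_out                       # logits over the dictionary, shape (V,)
        exp = np.exp(scores - scores.max())      # numerically stable softmax
        return exp / exp.sum()

    probs = cbow_forward([1, 2, 4, 5])           # ids of the words in the context window
    print(probs.shape, float(probs.sum()))       # (10,) 1.0

Training adjusts both weight matrices with SGD so that the softmax assigns high probability to the true center word; after training, only the embedding table W_in is usually kept.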

Continuous skip-gram model
- "… instead of predicting the current word based on the context, it tries to maximize classification of a word based on another word in the same sentence" (from the paper).
- "Increasing the range improves quality of the resulting word vectors, but it also increases the computational complexity … we give less weight to the distant words by sampling less from those words in our training examples."

Bonus: vector space arithmetic
- Word vectors support analogy arithmetic; the canonical example from the paper is vector("king") - vector("man") + vector("woman") being closest to vector("queen").

Training and quality
- "… we used three training epochs with stochastic gradient descent and backpropagation. We chose starting learning rate 0.025 and decreased it linearly, so that it approaches zero at the end of the last training epoch."
- "… there are 8869 semantic and 10675 syntactic questions" (the analogy test set used to measure quality).

word2vec (and almost everyone's) problems
- OOV: out-of-vocabulary or even underrepresented words; today this is approached with subword tokenization.
- Grammar, abbreviations, word forms and homographs are out of scope (the paper does not discuss lexers or stemmers).
- Training cost depends on N (context size) × D (dictionary size).
- It still works with words, not paragraphs or sentences.

doc2vec (Paragraph Vectors)
- "… we concatenate the paragraph vector with several word vectors from a paragraph and predict the following word in the given context."
- "… paragraph vectors are unique among paragraphs, the word vectors are shared."
- No syntax: using a parse tree to combine word vectors has been shown to work only for sentences, because it relies on parsing.
- The paragraph token can be thought of as another word.
- The paragraph vector length can differ from the word vector length.
- "The inference stage": paragraph vectors for new, never-seen paragraphs are obtained by adding more columns to D and running gradient descent on D while holding W, U and b fixed.

Details
1. Paragraph Vector - Distributed Memory (PV-DM): concatenate the paragraph vector with the word vectors.
2. Paragraph Vector - Distributed Bag of Words (PV-DBOW): ignore the context words in the input, but force the model to predict words randomly sampled from the paragraph in the output. Sample a text window, then sample a random word from that window and form a classification task given the paragraph vector.

Training and quality
- SGD. NB: paragraph vector inference requires running gradient descent! (See the gensim sketch at the end of this section.)
- Sentiment analysis: 5-class and 2-class classification.
  ○ Special characters such as ,.!? are treated as normal words.

Go deeper?
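The word2vec and doc2vec slides above map almost directly onto the gensim library; gensim is not referenced in the presentation and is used here only as an illustration, with a toy corpus and placeholder hyperparameters. The sketch shows CBoW vs. skip-gram as a single flag, analogy arithmetic via most_similar, and the fact that doc2vec inference for an unseen paragraph runs extra gradient-descent steps, exactly as the NB above warns.

    # Assumes gensim >= 4.x; the corpus and hyperparameters are toy placeholders.
    from gensim.models import Word2Vec
    from gensim.models.doc2vec import Doc2Vec, TaggedDocument

    corpus = [["the", "king", "rules", "the", "land"],
              ["the", "queen", "rules", "the", "land"],
              ["a", "man", "and", "a", "woman", "walk"]]

    # sg=0 -> CBoW, sg=1 -> continuous skip-gram; window is the context size N.
    w2v = Word2Vec(corpus, vector_size=50, window=5, min_count=1, sg=1, epochs=20)

    # Vector space arithmetic (only meaningful on a large corpus):
    # w2v.wv.most_similar(positive=["king", "woman"], negative=["man"], topn=3)

    # doc2vec: dm=1 -> PV-DM (paragraph vector combined with word vectors),
    #          dm=0 -> PV-DBOW (predict words sampled from the paragraph).
    docs = [TaggedDocument(words, tags=[i]) for i, words in enumerate(corpus)]
    d2v = Doc2Vec(docs, vector_size=50, window=5, min_count=1, dm=1, epochs=20)

    # Inference for a never-seen paragraph runs gradient descent on a fresh
    # paragraph vector while the trained word vectors stay fixed.
    vec = d2v.infer_vector(["the", "king", "and", "the", "queen"], epochs=50)
    print(vec.shape)  # (50,)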
DSSM: Deep Structured Semantic Model (by Microsoft)
- Original architecture:
  ○ Trained to predict cosine similarity
  ○ Uses a bag of letter trigrams
- Important: initially created for search (uses a search-specific metric).
- Problem: a relatively small input size (33³/26³ possible letter trigrams) for a deep network.
- Training:
  ○ Positive examples: clicked headers
  ○ Negative examples: shown but not clicked
  ○ Not necessarily relevance!

DSSM update by Yandex
- Input layer:
  ○ Trigrams
  ○ +1M words
  ○ +1M word bigrams
- Training:
  ○ Failed on random negatives
  ○ Failed on fake negatives
  ○ Hard negative mining: similar to GANs, but simpler, since the network fights itself
  ○ Another target: dwell time

BERT (Bidirectional Encoder Representations from Transformers), YATI
- Created to learn a language model — and to solve general language tasks.
- Attention and self-attention (~ syntactic tree).
- Training modes:
  ○ Trained to predict 15% of masked words
  ○ Also trained to predict logical connections between phrases
- "The major limitation is that standard language models are unidirectional, and this limits the choice of architectures that can be used during pre-training."

BERT details
- Task 1: Masked LM
- Task 2: Next Sentence Prediction
- (A short fill-mask sketch follows the reading list below.)

Reading
- Papers and articles on this topic (links in the presentation).
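As noted under BERT details above, here is a short sketch of the masked-LM objective using the Hugging Face transformers library; the library, the checkpoint name and the example sentence are illustrative assumptions, not something the presentation references.

    # Assumes the transformers library is installed; bert-base-uncased is a stock
    # checkpoint used purely for illustration.
    from transformers import pipeline

    # BERT is pre-trained by masking ~15% of tokens and predicting them from
    # bidirectional context; the fill-mask pipeline exposes exactly that head.
    unmasker = pipeline("fill-mask", model="bert-base-uncased")

    for candidate in unmasker("Vector space models map words to [MASK] vectors."):
        print(f'{candidate["token_str"]:>12}  {candidate["score"]:.3f}')

Each candidate comes back with a score from the masked-LM head, which is the "predict the masked word from bidirectional context" task described above; next-sentence prediction is a separate classification head and is not shown here.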
