
23-Neural-Word-Embeddings.pdf


Full Transcript


Neural Networks for Language Modeling
Early approach by Bengio et al. [BDVJ03] (forward-pass sketch below):
▶ learn a feature vector $C(w) \in \mathbb{R}^d$ to represent the similarity between words
▶ look-up table $C$ for word vectors
▶ context by concatenating the previous $n$ words: $x = \big(C(w_{t-1}), C(w_{t-2}), \ldots, C(w_{t-n})\big)$
▶ optional direct connection layer $Wx$
▶ hidden layer with sigmoid activation: $\sigma(d + Hx)$
▶ hidden to output matrix $U$ and bias $b$
▶ $y = b + Wx + U\,\sigma(d + Hx)$
▶ Softmax to turn the output $y$ into probabilities: $P(w_t \mid w_{t-1}, \ldots, w_{t-n}) = e^{y_{w_t}} / \sum_i e^{y_i}$

Continuous Bag of Words (CBOW, word2vec [LeMi14; MCCD13])
One hidden layer neural network: use the context words $w_{t-c}, \ldots, w_{t-1}, w_{t+1}, \ldots, w_{t+c}$ to predict the word $w_t$.
Intuition: sum the rows of $W$ for every input word in the context, then find the most similar column in $W'$ as output (sketch below).

Skip-Gram with Negative Sampling (SGNS, word2vec [LeMi14; MCCD13])
One hidden layer neural network:
Every word $w$ corresponds to one row $\vec{w}$ in the “encoder matrix” $W$ (= word vectors).
Every word $w$ corresponds to one column $\vec{w}'$ in the “decoder matrix” $W'$ (usually discarded).
Weight matrices $W$ and $W'$ are iteratively optimized to best predict the neighbor words $w_{t+j}$ for $-c \le j \le c$ and $j \ne 0$.

Interpretation [GoLe14; LeGo14; LeGoDa15]
For skip-gram, we can interpret the product $W \cdot W'$ as a matrix $M$ as follows:
$M_{ij}$ is related to pointwise mutual information, $M_{ij} = \mathrm{PMI}(w_i, c_j) - \log k$ (for $k$ negative samples),
➜ word2vec is an (implicit) matrix factorization of $M$.
We can get similar (but not quite as good) results with SVD [LeGo14] (sketch below).

Loss Functions of word2vec [LeMi14; MCCD13]
Probability of word $w_O$ given $w_I$: $p(w_O \mid w_I) = \frac{\exp(\vec{w}_O'^{\top} \vec{w}_I)}{\sum_{w \in V} \exp(\vec{w}'^{\top} \vec{w}_I)}$
Loss function for CBOW: $E = -\frac{1}{T} \sum_{t=1}^{T} \log p(w_t \mid w_{t-c}, \ldots, w_{t-1}, w_{t+1}, \ldots, w_{t+c})$
Loss function for Skip-Gram: $E = -\frac{1}{T} \sum_{t=1}^{T} \sum_{-c \le j \le c,\, j \ne 0} \log p(w_{t+j} \mid w_t)$
where $T$ is the number of context windows.
For performance, approximate the softmax (testing all $w \in V$ is too expensive) with, e.g., hierarchical softmax or negative sampling.
Train with: backpropagation, stochastic gradient descent.

Optimizing Skip-Gram
We can optimize the weights using stochastic gradient descent and back-propagation (update sketch below).
The basic idea is to update the rows of $W'$ and $W$ with a learning rate $\eta$:
$\vec{w}_j' \leftarrow \vec{w}_j' - \eta\, e_j \vec{h}$ and $\vec{h} \leftarrow \vec{h} - \eta \sum_j e_j \vec{w}_j'$,
where $e_j = \sigma(\vec{w}_j'^{\top} \vec{h}) - t_j$ is the prediction error wrt. the $j$th target, and $t_j$ is the $j$th target (1 for the observed context word, 0 for negative samples).
Intuitively, in each iteration we
▶ make the “good” output vector(s) more similar to the output we computed
▶ make the “bad” output vector(s) less similar to the output we computed (use negative sampling: do not update all of them, only a sample)
▶ make the input vector(s) more similar to the vector of the desired output
▶ make the input vector(s) less similar to the vector of the undesired output

Global Vectors for Word Representation [PeSoMa14]
If we aggregate all word cooccurrences into a matrix $X$, the skip-gram objective
$J = -\sum_{t} \sum_{-c \le j \le c,\, j \ne 0} \log p(w_{t+j} \mid w_t)$
becomes
$J = -\sum_{i} \sum_{j} X_{ij} \log p(w_j \mid w_i)$.
This is similar to the loss function of GloVe (loss sketch below):
$J = \sum_{i,j} f(X_{ij}) \big(\vec{w}_i^{\top} \tilde{\vec{w}}_j + b_i + \tilde{b}_j - \log X_{ij}\big)^2$
where $b_i$ and $\tilde{b}_j$ are biases for the matrix, and $f$ is a weighting function that reduces the weight of frequent cooccurrences, e.g., $f(x) = \min\big((x / x_{\max})^{3/4}, 1\big)$.

From word2vec to Paragraph Vectors (“doc2vec”) [LeMi14]
The early approaches used the average word vector, but it did not work too well.
We can design the vector representation as we like. Idea: also include the document.
Concatenate the word vector with a document indicator ⇒ we also optimize a vector for each (training) paragraph/document.
Need to optimize $|D| \cdot d$ additional variables.

FastText [BGJM17; GMJB17]
Original word2vec only used the most frequent words ⇝ no word2vec vectors for rare or unseen words.
Idea: can we (also) use character n-grams to estimate vectors for new words (n-gram sketch below)?
example → _ex, exa, xam, amp, mpl, ple, le_
Beneficial in particular for other languages:
▶ more tolerant to declension and cases
▶ composed words get similar vectors:
Tisch → _Ti Tis isc sch ch_
Tennis → _Te Ten enn nni nis is_
Tischtennis → _Ti Tis isc sch cht hte ten enn nni nis is_
Models become so large that we also need to perform compression [JGBD16].
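To make the Bengio et al. model above concrete, here is a minimal NumPy sketch of the forward pass, assuming a sigmoid hidden layer and no direct connections; the names and sizes (C, H, U, V, d, n, h) are illustrative and not taken from the slides.

```python
import numpy as np

V, d, n, h = 10_000, 64, 3, 128               # vocabulary, embedding dim, context length, hidden units
rng = np.random.default_rng(0)

C = rng.normal(scale=0.1, size=(V, d))        # look-up table: one d-dim vector per word
H = rng.normal(scale=0.1, size=(n * d, h))    # hidden layer weights
d_bias = np.zeros(h)                          # hidden layer bias
U = rng.normal(scale=0.1, size=(h, V))        # hidden-to-output matrix
b = np.zeros(V)                               # output bias

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def predict_next(context_ids):
    """P(w_t | previous n words), given a list of n word ids."""
    x = np.concatenate([C[i] for i in context_ids])   # concatenate the previous word vectors
    a = sigmoid(x @ H + d_bias)                        # hidden layer with sigmoid activation
    y = a @ U + b                                      # unnormalized output scores
    y -= y.max()                                       # numerical stability for the softmax
    p = np.exp(y)
    return p / p.sum()                                 # softmax over the vocabulary
```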
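The CBOW intuition (sum the context rows of the encoder matrix, then pick the most similar output vector) fits in a few lines; W_in and W_out are hypothetical names, and W_out is stored row-wise here for convenience rather than as the column-wise decoder matrix of the slides.

```python
import numpy as np

def cbow_predict(context_ids, W_in, W_out):
    """Score every vocabulary word as the center word for a bag of context word ids.

    W_in:  (V, d) encoder matrix, one row per word.
    W_out: (V, d) decoder matrix, one row per word (transposed storage).
    """
    h = W_in[context_ids].sum(axis=0)   # sum the rows of the context words
    scores = W_out @ h                  # dot product with every output vector
    scores -= scores.max()              # numerically stable softmax
    p = np.exp(scores)
    return p / p.sum()                  # probability of each word being the center word
```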
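For the Optimizing Skip-Gram slide, a minimal sketch of one negative-sampling SGD step that follows the update rules above; sgns_step, W_in, W_out, and lr are assumed names, and the negative word ids are expected to be sampled beforehand.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_step(center, context, negatives, W_in, W_out, lr=0.025):
    """One negative-sampling update, modifying W_in and W_out in place."""
    v = W_in[center]                              # input ("encoder") row of the center word
    targets = [(context, 1.0)] + [(j, 0.0) for j in negatives]
    grad_v = np.zeros_like(v)
    for j, t in targets:
        u = W_out[j]                              # output ("decoder") vector of word j
        e = sigmoid(u @ v) - t                    # prediction error wrt. the j-th target
        grad_v += e * u
        W_out[j] -= lr * e * v                    # good outputs move closer, bad ones away
    W_in[center] -= lr * grad_v                   # move the input vector accordingly
```

In practice this step is applied to every (center, context) pair in the corpus, with a handful of negatives drawn from a smoothed unigram distribution.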
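The Levy & Goldberg interpretation suggests the SVD alternative mentioned above: factorize the shifted positive PMI matrix of cooccurrence counts directly. A sketch assuming a small, dense count matrix X; the parameter names (k negative samples, dim output dimensions) are made up.

```python
import numpy as np

def shifted_pmi_svd(X, k=5, dim=100):
    """Factorize max(PMI - log k, 0) with a truncated SVD to get word vectors."""
    total = X.sum()
    pw = X.sum(axis=1, keepdims=True) / total       # word marginals
    pc = X.sum(axis=0, keepdims=True) / total       # context marginals
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log((X / total) / (pw * pc))       # pointwise mutual information
    sppmi = np.maximum(pmi - np.log(k), 0.0)        # shifted positive PMI
    sppmi[~np.isfinite(sppmi)] = 0.0                # zero counts -> zero association
    U, S, Vt = np.linalg.svd(sppmi, full_matrices=False)
    return U[:, :dim] * np.sqrt(S[:dim])            # rows are the word vectors
```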
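To spell out the GloVe loss, a small sketch that evaluates it for given parameters; the weighting function uses the commonly used x_max = 100 and exponent 3/4, and the function and argument names are invented for the example.

```python
import numpy as np

def glove_weight(x, x_max=100.0, alpha=0.75):
    """Weighting function f that damps very frequent cooccurrences."""
    return np.minimum((x / x_max) ** alpha, 1.0)

def glove_loss(X, W, W_tilde, b, b_tilde):
    """Sum of f(X_ij) * (w_i . w~_j + b_i + b~_j - log X_ij)^2 over nonzero X_ij."""
    i, j = np.nonzero(X)
    diff = (W[i] * W_tilde[j]).sum(axis=1) + b[i] + b_tilde[j] - np.log(X[i, j])
    return (glove_weight(X[i, j]) * diff ** 2).sum()
```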
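The FastText character n-grams from the example can be reproduced with a short helper; char_ngrams is a hypothetical function, and real fastText additionally uses a range of n-gram lengths (typically 3 to 6) plus the word itself, which is omitted here.

```python
def char_ngrams(word, n=3):
    """Character n-grams with boundary markers, e.g. 'example' -> _ex, exa, ..., le_."""
    padded = f"_{word}_"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

# char_ngrams("example")     -> ['_ex', 'exa', 'xam', 'amp', 'mpl', 'ple', 'le_']
# char_ngrams("Tischtennis") -> ['_Ti', 'Tis', 'isc', 'sch', 'cht', 'hte', 'ten', 'enn', 'nni', 'nis', 'is_']
```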
