26-Encoder-Architectures-and-Sentence-Embeddings.pdf
Bidirectional Encoder Representations from Transformers (BERT) [DCLT19]
Encoder-only transformer, similar to ELMo but with transformers instead of LSTMs.
▶ advanced the state of the art for 11 NLP tasks
▶ uses 12/24 bidirectional self-attention transformer layers with 12/16 heads (base/large)
▶ trained with two unsupervised pre-training tasks:
▶ Masked Language Modelling (MLM) – predict a masked word
▶ Next Sentence Prediction (NSP) – does the given next sentence actually follow the previous one?
▶ task differentiation only in the last layer ⇒ most of the architecture can be reused
Pre-training BERT was computationally expensive at the time, but once pre-training is done, fine-tuning on a downstream task is relatively cheap. BERT had a massive impact!

Bidirectional Encoder Representations from Transformers [DCLT19] /2
Input = Sequence = Sentence(s) A + Sentence(s) B
▶ [CLS]: special classification token – holds the sequence representation for classification tasks
▶ [SEP]: special separation token
▶ Segment embeddings: two trained vectors, for the first and the second sentence of the task
▶ Token embeddings: WordPiece [WSCL16] – splits out-of-vocabulary (OOV) words into in-vocabulary pieces

Masked Language Modelling
Bidirectional conditioning is non-trivial: in a multi-layer bidirectional model, each word would indirectly see itself through the context, making its prediction trivial.
Solution: Masked Language Modelling (MLM) – randomly mask a fraction of the input tokens (15% in BERT) and predict only the masked words.

BERT Improvements
Improvements to BERT to make it more computationally efficient (during training or inference).
ALBERT [LCGG20]
▶ Parameter reduction:
▶ matrix factorization decouples the embeddings and the hidden layer
▶ cross-layer parameter sharing (also stabilizes training)
▶ Replaces NSP with Sentence-Order Prediction (SOP):
▶ negative samples are consecutive sentences in inverted order
DistilBERT [SDCW19]
▶ #Parameters: −40%, inference: 60% faster, GLUE performance: 97%
▶ Distillation = compression technique [BuCaNi06; HiViDe15]:
▶ the student model is trained to reproduce the teacher model's target probabilities
ELECTRA [CLLM20]
▶ Replaces MLM with Replaced Token Detection (RTD)
▶ MLM only utilizes the 15% masked tokens – RTD uses all input tokens
▶ a generator produces replacements; ELECTRA = the discriminator

ALBERT: Matrix Decoupling [LCGG20]
BERT ties the size E of the WordPiece embedding matrix to the size H of the hidden layer (E = H).
This is problematic from a modelling and a practical perspective:
▶ Modelling: WordPiece embeddings should be context-independent, BUT hidden layers learn context-dependent embeddings. Why tie them to the same dimensionality?
▶ Practical: the vocabulary size V generally needs to be large (here: 30k). If E = H, then increasing H also increases the embedding matrix (size: V × E). ⇒ billions of parameters, many of them updated only sparsely.
Solution: add a projection to a lower-dimensional, separately chosen embedding space.
▶ #Parameters BERT: O(V × H)
▶ #Parameters ALBERT: O(V × E + E × H)
⇒ a significant reduction when H ≫ E

Distillation [SDCW19]
Final training objective of DistilBERT: a linear combination of three losses, L = L_ce + L_mlm + L_cos.
The distillation loss L_ce forces DistilBERT to learn BERT's behavior, the MLM loss L_mlm remains, and the cosine embedding loss L_cos is added to align the directions of student and teacher hidden state vectors.

ELECTRA: Replaced Token Detection [CLLM20]
A different training objective than masked token prediction, better suited for classification, obtained by adapting the idea of adversarial networks to text: a small generator proposes replacement tokens, and ELECTRA (the discriminator) is trained to detect which input tokens were replaced.
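To make the MLM objective above concrete, here is a minimal sketch that queries the masked-token predictions of a pre-trained BERT. It assumes the Hugging Face transformers package, PyTorch, and the bert-base-uncased checkpoint, all of which are illustrative choices not taken from these slides.

import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

# load a pre-trained BERT together with its masked-language-modelling head
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")
model.eval()

text = "The capital of France is [MASK]."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits          # shape: (1, seq_len, vocab_size)

# find the [MASK] position and look at the five most likely fillers
mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]
top5 = logits[0, mask_pos].topk(5).indices[0].tolist()
print(tokenizer.convert_ids_to_tokens(top5))

The printed tokens are the model's best guesses for the masked word, predicted from the bidirectional context on both sides of the mask – exactly the task BERT is pre-trained on.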
Early Sentence Embeddings
Paragraph Vectors / doc2vec [LeMi14]: extension of word2vec: predict a word from its neighboring words and a learned paragraph vector that is unique to the paragraph.
➜ treat the document id as a "virtual word" available in the entire paragraph
Skip-thought vectors [KZSZ15]: encoder-decoder approach using Gated Recurrent Units. Learn to predict (by decoding) the previous sentence and the next sentence from the vector embedding (encoding) of a sentence.
➜ the output of the encoder is a sentence representation ("skip-thought vector")

Sequential Denoising Autoencoders and FastSent [HiChKo16]
Sequential Denoising Autoencoders [HiChKo16]:
▶ Autoencoder: encode a sentence and decode it from the vector again
▶ Denoising: corrupt the input to the encoder, recover the correct sentence ⇝ robust to noise
▶ originally used in image analysis, here with an LSTM on text
FastSent [HiChKo16]: a faster and simpler alternative to skip-thought vectors. Train on bag-of-words representations of subsequent sentences with a simple log-linear model.

Sentence Similarity with BERT
BERT can be used for sentence similarity by concatenating the two sentences, separated by the special [SEP] token, and predicting similarity from the embedding of the [CLS] token. But this only compares two sentences at a time; scoring all pairs of n sentences takes n(n−1)/2 expensive forward passes. We would like to have one vector per sentence, and use vector similarity!
BERT produces an embedding for each word, not for sentences nor documents. Two common approaches:
▶ average all word embeddings
▶ use the embedding of the starting token [CLS]
But both are often worse than averaging (non-context-sensitive) GloVe embeddings [ReGu19].

Further sentence embeddings
InferSent [CKSB17]
▶ compared many different architectures: LSTM, GRU; unidirectional and bidirectional, mean and max pooling, self-attentive encoder, hierarchical convolutional, …
▶ a bidirectional LSTM with max pooling worked best
▶ trained supervised, e.g., on entailment/contradiction tasks
Universal Sentence Encoder [CYKH18]
▶ Transformer-based and Deep-Averaging-Network-based variants
▶ the transformer-based variant using the product of word embedding vectors worked best, "at the cost of compute time and memory usage scaling dramatically with sentence length"

Sentence-BERT – Siamese BERT-Networks [ReGu19]
Siamese CBOW [KeBoRi16]: use word2vec-like vectors, averaged over the sentences. Optimize the cosine of two sentence vectors to predict whether the sentences are related (supervised, via neighboring sentences).
Sentence-BERT [ReGu19]: Siamese network based on BERT, fine-tuned via classification. Mean pooling of the word embeddings to obtain a sentence embedding worked best. During training, the two sentence embeddings u and v are concatenated with their element-wise difference |u − v|; also adding the element-wise product u ∗ v did not further improve results. For sentence similarity, the cosine of u and v is used.
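As a companion to the mean-pooling idea above, the following sketch averages BERT token embeddings (ignoring padding) and compares two sentences by cosine similarity. It again assumes Hugging Face transformers, PyTorch, and the bert-base-uncased checkpoint as illustrative choices; as the slides note, without Sentence-BERT-style fine-tuning such averaged embeddings are often no better than averaged GloVe vectors.

import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()

sentences = ["The cat sits on the mat.", "A cat is resting on a rug."]
enc = tokenizer(sentences, padding=True, return_tensors="pt")

with torch.no_grad():
    token_embs = model(**enc).last_hidden_state      # (batch, seq_len, hidden)

# mean pooling: average the token embeddings, masking out padding positions
mask = enc["attention_mask"].unsqueeze(-1).float()
sent_embs = (token_embs * mask).sum(dim=1) / mask.sum(dim=1)

# one vector per sentence -> compare by cosine similarity
cos = torch.nn.functional.cosine_similarity(sent_embs[0], sent_embs[1], dim=0)
print(float(cos))

Sentence-BERT starts from exactly this mean-pooling setup and fine-tunes the encoder in a Siamese fashion, so that the resulting cosine similarities become a meaningful measure of sentence similarity.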