Questions and Answers
Why is it problematic to tie the size of the WordPiece embedding matrix to the size of the hidden layer in BERT?
WordPiece embeddings are meant to learn context-independent representations, whereas the hidden layers learn context-dependent ones; tying the embedding size E to the hidden size H forces the V × E embedding matrix to grow whenever the hidden layer grows, which wastes parameters.
What is the solution proposed in ALBERT to address the issue of tying the size of the WordPiece embedding matrix to the size of the hidden layer?
Factorize the embedding parameters: first project the one-hot vocabulary into a lower-dimensional embedding space whose size E is chosen independently of the hidden size H, then project up into the hidden space, replacing V × H parameters with V × E + E × H.
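A back-of-the-envelope check of the saving (a minimal sketch; V and H follow BERT-base, and E = 128 is the value ALBERT chooses):

```python
# Parameter count: tied embeddings (BERT) vs. factorized embeddings (ALBERT).
V = 30_000   # WordPiece vocabulary size
H = 768      # Transformer hidden size
E = 128      # separately chosen embedding size (ALBERT)

tied = V * H                 # one V x H matrix
factorized = V * E + E * H   # V x E lookup, then an E x H projection

print(f"tied:       {tied:,}")        # 23,040,000
print(f"factorized: {factorized:,}")  # 3,938,304
print(f"saving:     {tied - factorized:,} parameters")
```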
What technique does ELECTRA use to replace Masked Language Modeling?
Replaced Token Detection (RTD): a small generator fills in the masked positions with plausible tokens, and the main model is trained as a discriminator that decides, for every input token, whether it is original or replaced.
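A minimal sketch of the RTD labeling step (the token ids are made up; in ELECTRA the replacements come from a small generator network trained with MLM):

```python
# Replaced Token Detection: the discriminator sees the corrupted sequence
# and predicts, per position, whether the token was replaced.
original  = [101, 7592, 2088, 2003, 4408, 102]   # hypothetical token ids
corrupted = [101, 7592, 4248, 2003, 4408, 102]   # generator swapped position 2

labels = [int(o != c) for o, c in zip(original, corrupted)]
print(labels)  # [0, 0, 1, 0, 0, 0]  -> binary targets for every position
```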
How does ALBERT achieve parameter reduction compared to BERT?
Through two techniques: factorized embedding parameterization (above) and cross-layer parameter sharing, where the same Transformer layer parameters are reused at every depth.
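A minimal PyTorch sketch of cross-layer parameter sharing, assuming BERT-base dimensions; the same layer object is applied at every depth, so twelve "layers" cost the parameters of one:

```python
import torch
import torch.nn as nn

# Cross-layer parameter sharing: one Transformer layer's weights are reused
# at every depth.
shared_layer = nn.TransformerEncoderLayer(d_model=768, nhead=12, batch_first=True)

def albert_style_encoder(x: torch.Tensor, depth: int = 12) -> torch.Tensor:
    for _ in range(depth):        # same module, applied repeatedly
        x = shared_layer(x)
    return x

x = torch.randn(2, 16, 768)       # (batch, sequence, hidden)
print(albert_style_encoder(x).shape)  # torch.Size([2, 16, 768])
```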
What is the main goal of knowledge distillation, as used in DistilBERT?
To compress a large teacher model into a smaller student by training the student to reproduce the teacher's output distribution, retaining most of the performance at a fraction of the size and inference cost.
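A minimal sketch of the soft-target distillation loss (the temperature T and the T² gradient rescaling follow Hinton et al.'s formulation; the logits here are random stand-ins):

```python
import torch
import torch.nn.functional as F

# Knowledge distillation: the student matches the teacher's softened
# output distribution. T > 1 smooths the logits ("dark knowledge").
def distillation_loss(student_logits, teacher_logits, T: float = 2.0):
    log_p_student = F.log_softmax(student_logits / T, dim=-1)
    p_teacher = F.softmax(teacher_logits / T, dim=-1)
    # T^2 rescales gradients back to the original magnitude
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * T * T

student = torch.randn(4, 30_000)  # hypothetical logits over the vocabulary
teacher = torch.randn(4, 30_000)
print(distillation_loss(student, teacher))
```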
What is the key difference between ALBERT and BERT in terms of the training objective?
ALBERT replaces BERT's Next Sentence Prediction (NSP) with Sentence Order Prediction (SOP): the model must decide whether two consecutive segments appear in their original order or have been swapped.
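A minimal sketch of how SOP training pairs could be constructed (the segments are made up; label 1 marks a swapped pair):

```python
import random

# Sentence Order Prediction: positives keep two consecutive segments in
# order, negatives simply swap them.
def make_sop_example(seg_a: str, seg_b: str):
    if random.random() < 0.5:
        return (seg_a, seg_b), 0   # original order
    return (seg_b, seg_a), 1       # swapped order

pair, label = make_sop_example("He opened the door.", "The room was dark.")
print(pair, label)
```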
What are the two unsupervised pre-training tasks used in BERT?
Masked Language Modeling (MLM) and Next Sentence Prediction (NSP).
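A minimal sketch of NSP pair sampling (the corpus is a made-up stand-in); half of the pairs are truly consecutive, half pair a segment with a random one from another document. MLM's masking recipe is sketched two cards below.

```python
import random

corpus = [["Sentence A1.", "Sentence A2.", "Sentence A3."],
          ["Sentence B1.", "Sentence B2."]]  # documents as sentence lists

# Next Sentence Prediction: 50% of the time B really follows A ("IsNext"),
# 50% of the time B is a random sentence from another document ("NotNext").
def make_nsp_example(doc, corpus):
    i = random.randrange(len(doc) - 1)
    a = doc[i]
    if random.random() < 0.5:
        return a, doc[i + 1], "IsNext"
    other = random.choice([d for d in corpus if d is not doc])
    return a, random.choice(other), "NotNext"

print(make_nsp_example(corpus[0], corpus))
```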
What is the purpose of the [CLS] token in BERT?
It is prepended to every input sequence, and its final hidden state serves as the aggregate representation of the whole sequence for classification tasks.
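A minimal PyTorch sketch of a classification head on [CLS], assuming BERT-base's hidden size; `hidden` stands in for the encoder's output:

```python
import torch
import torch.nn as nn

# A task head on top of [CLS]: the hidden state of position 0 stands in
# for the whole sequence and feeds a small classifier.
hidden = torch.randn(8, 128, 768)  # (batch, seq_len, hidden) from the encoder
cls_vector = hidden[:, 0, :]       # [CLS] is always the first token

classifier = nn.Linear(768, 2)     # e.g. binary sentiment
logits = classifier(cls_vector)
print(logits.shape)                # torch.Size([8, 2])
```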
How does BERT handle the issue of bidirectional conditioning being non-trivial?
With Masked Language Modeling: rather than conditioning strictly left-to-right or right-to-left, BERT masks 15% of the input tokens at random and predicts them from both directions, so no token can trivially "see itself" through the bidirectional attention.
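A minimal sketch of BERT's 80/10/10 masking recipe (the [MASK] id and vocabulary size are the usual BERT-base values; the input ids are made up):

```python
import random

MASK, VOCAB_SIZE = 103, 30_000  # [MASK] id and vocab size as in BERT-base

# BERT's masking recipe: pick 15% of positions; of those, 80% become
# [MASK], 10% a random token, 10% stay unchanged (so the model cannot
# rely on [MASK] always marking the prediction targets).
def mask_tokens(token_ids):
    targets = {}
    out = list(token_ids)
    for i, t in enumerate(token_ids):
        if random.random() < 0.15:
            targets[i] = t                 # position the model must predict
            r = random.random()
            if r < 0.8:
                out[i] = MASK
            elif r < 0.9:
                out[i] = random.randrange(VOCAB_SIZE)
            # else: keep the original token
    return out, targets

print(mask_tokens([7592, 2088, 2003, 4408, 2146]))
```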
What are the two types of special tokens used in BERT?
[CLS], which starts every sequence and feeds classification heads, and [SEP], which separates (and terminates) the segments of a sentence pair.
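A minimal sketch of the resulting input layout for a sentence pair:

```python
# Input layout, as in the BERT paper:
# [CLS] tokens-of-A [SEP] tokens-of-B [SEP]
def build_input(tokens_a, tokens_b=None):
    seq = ["[CLS]"] + tokens_a + ["[SEP]"]
    if tokens_b:
        seq += tokens_b + ["[SEP]"]
    return seq

print(build_input(["the", "cat"], ["it", "sleeps"]))
# ['[CLS]', 'the', 'cat', '[SEP]', 'it', 'sleeps', '[SEP]']
```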
What are the two types of embeddings used in BERT?
Segment embeddings, which mark whether a token belongs to sentence A or B, and position embeddings; both are added to the WordPiece token embeddings.
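A minimal PyTorch sketch of how the embedding tables combine (sizes follow BERT-base; the ids are made up):

```python
import torch
import torch.nn as nn

# BERT's input representation: token, segment (A/B) and position
# embeddings are summed element-wise.
V, H, MAX_LEN = 30_000, 768, 512
tok_emb = nn.Embedding(V, H)
seg_emb = nn.Embedding(2, H)        # segment A = 0, segment B = 1
pos_emb = nn.Embedding(MAX_LEN, H)  # learned positions

token_ids   = torch.tensor([[101, 7592, 102, 2088, 102]])
segment_ids = torch.tensor([[0, 0, 0, 1, 1]])
positions   = torch.arange(token_ids.size(1)).unsqueeze(0)

x = tok_emb(token_ids) + seg_emb(segment_ids) + pos_emb(positions)
print(x.shape)  # torch.Size([1, 5, 768])
```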
How did BERT impact the field of Natural Language Processing (NLP)?
It established the pre-train-then-fine-tune paradigm: a single bidirectional Transformer pre-trained on unlabeled text set new state-of-the-art results on eleven NLP tasks, with only a small task-specific layer added for each downstream task.
What is the final training objective of DistilBERT according to the text?
A linear combination of three losses: the distillation loss on the teacher's soft target probabilities, the standard masked language modeling loss, and a cosine embedding loss that aligns the directions of the student's and teacher's hidden states.
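A minimal sketch of that combination (the weights a, b, c are illustrative placeholders, not the paper's values; all tensors are random stand-ins):

```python
import torch
import torch.nn.functional as F

# DistilBERT's objective: distillation loss + MLM loss + cosine loss,
# linearly combined.
def distilbert_loss(s_logits, t_logits, mlm_logits, mlm_targets,
                    s_hidden, t_hidden, T=2.0, a=1.0, b=1.0, c=1.0):
    l_kd = F.kl_div(F.log_softmax(s_logits / T, -1),
                    F.softmax(t_logits / T, -1),
                    reduction="batchmean") * T * T
    l_mlm = F.cross_entropy(mlm_logits, mlm_targets)
    ones = torch.ones(s_hidden.size(0))
    l_cos = F.cosine_embedding_loss(s_hidden, t_hidden, ones)  # align directions
    return a * l_kd + b * l_mlm + c * l_cos

loss = distilbert_loss(torch.randn(4, 30_000), torch.randn(4, 30_000),
                       torch.randn(4, 30_000), torch.randint(0, 30_000, (4,)),
                       torch.randn(4, 768), torch.randn(4, 768))
print(loss)
```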
How does ELECTRA's training objective differ from masked token prediction?
Instead of predicting the identity of the ~15% masked tokens, ELECTRA trains a discriminator to classify every token of the corrupted input as original or replaced, so the model learns from all input positions.
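A schematic of how the two losses combine in ELECTRA (λ = 50 is the weight reported in the paper; the loss values here are made up):

```python
# ELECTRA's combined objective (schematic): the generator is trained with
# MLM on the masked positions only, while the discriminator's replaced-
# token-detection loss is computed over *all* input positions.
def electra_loss(mlm_loss, rtd_loss, lam: float = 50.0):
    return mlm_loss + lam * rtd_loss

print(electra_loss(2.31, 0.05))  # made-up loss values for illustration
```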
What is the main idea behind Skip-thought vectors?
Train an encoder-decoder model in which the encoding of a sentence is used to predict its surrounding sentences, so sentences appearing in similar contexts end up with similar vectors.
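A minimal sketch of how Skip-thought training triplets are formed from an ordered text:

```python
# Skip-thought training data: each sentence is paired with its neighbors,
# and the encoder's output must let two decoders predict s_{i-1} and s_{i+1}.
def skip_thought_triplets(sentences):
    return [(sentences[i - 1], sentences[i], sentences[i + 1])
            for i in range(1, len(sentences) - 1)]

book = ["I got back home.", "I could see the cat.", "It was hungry.", "I fed it."]
for prev, cur, nxt in skip_thought_triplets(book):
    print(f"encode: {cur!r}  ->  decode: {prev!r} and {nxt!r}")
```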
Explain the process of Sequential Denoising Autoencoders described in the text.
An input sentence is corrupted with noise, for example by deleting some words and swapping some adjacent bigrams, and an encoder-decoder is trained to reconstruct the original sentence from its corrupted version; unlike Skip-thought, this requires no ordered corpus of consecutive sentences.
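A minimal sketch of one plausible noise function (the deletion and bigram-swap probabilities are illustrative; they are hyperparameters of the method):

```python
import random

# SDAE-style corruption: delete each word with probability p_del and swap
# adjacent non-overlapping bigrams with probability p_swap; the model is
# then trained to reconstruct the original sentence.
def corrupt(words, p_del=0.1, p_swap=0.1):
    kept = [w for w in words if random.random() > p_del]
    i = 0
    while i < len(kept) - 1:
        if random.random() < p_swap:
            kept[i], kept[i + 1] = kept[i + 1], kept[i]
            i += 2            # bigrams don't overlap
        else:
            i += 1
    return kept

print(corrupt("the quick brown fox jumps over the lazy dog".split()))
```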
What is the extension of word2vec described in the text?
Paragraph Vectors (doc2vec), which extends word2vec to learn a fixed-length vector for an entire paragraph or document, trained jointly with the word vectors.
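A minimal usage sketch, assuming gensim 4.x's Doc2Vec implementation (the documents and tags are made up):

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Each document gets a tag; Doc2Vec learns one vector per tag alongside
# the word vectors.
docs = [TaggedDocument(words="the cat sat on the mat".split(), tags=["doc0"]),
        TaggedDocument(words="dogs chase cats in the park".split(), tags=["doc1"])]

model = Doc2Vec(docs, vector_size=50, min_count=1, epochs=40)
print(model.dv["doc0"][:5])  # first dimensions of the learned document vector
```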
What is the significance of treating the document id as a 'virtual word' in Paragraph Vectors?
The document id participates in every context window of its document, just like a word, so its learned vector acts as a memory of the document's topic, capturing what the local word context misses; after training, that vector serves as the document's embedding.
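A minimal sketch of PV-DM-style training pairs, where the document id joins every context window:

```python
# PV-DM training pairs: the document id behaves like a word that is
# present in every context window of its document.
def pv_dm_examples(doc_id, words, window=2):
    examples = []
    for i in range(window, len(words) - window):
        context = words[i - window:i] + words[i + 1:i + window + 1]
        examples.append(([doc_id] + context, words[i]))  # inputs -> target word
    return examples

for inputs, target in pv_dm_examples("doc0", "the cat sat on the mat".split()):
    print(inputs, "->", target)
```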