Transformer Models and RNN Limitations
38 Questions

Questions and Answers

Which of the following is NOT a problem associated with Recurrent Neural Networks (RNNs) for sequence modeling?

  • Architectural complexity
  • Vanishing/exploding gradients
  • Computational efficiency (can be parallelized) (correct)
  • Long-Term Dependency Issues

What is the primary function of the decoder in a Transformer model?

  • Handling long-term dependencies by remembering past inputs
  • Encoding the entire input sequence into a single vector
  • Handling the fixed size input and output constraints
  • Generating the output sequence based on the encoded input (correct)

What is the major advantage of using Transformers over RNNs for sequence-to-sequence tasks?

  • Transformers can handle variable length input and output sequences more efficiently.
  • Transformers are able to learn long-term dependencies more effectively.
  • Transformers can be parallelized, leading to faster training and inference.
  • All of the above (correct)

What is the purpose of the 'beginning of a sequence' (BOS) token in the Transformer decoder?

To signal the start of the output sequence to the decoder. (D)

Which of the following is NOT a limitation of using Long Short-Term Memory (LSTM) networks for sequence modeling?

Cannot handle variable input/output lengths (C)

Which of the following is a characteristic of the LLaMa model?

It comes in various sizes, ranging from 1B to 90B parameters. (A)

What is the primary motivation behind the "Masked LM" task in BERT?

To train the model to understand the relationships between words in a sentence. (A)

How many attention maps are produced in a single BERT layer with 12 heads for a sentence with 11 tokens?

132 (B)

Which sampling approach is most likely to produce repetitive or predictable text?

Greedy sampling (A)

Which BEST describes the key difference between BERT and GPT in terms of their primary focus?

BERT is primarily focused on understanding relationships between words in a sentence, while GPT is primarily focused on generating text. (B)

The content mentions that BERT uses bidirectional attention. What does this mean?

Each token in the input sequence can attend to all other tokens in the same sequence. (B)

What is a key characteristic of the "Top-p" sampling approach?

It samples tokens from the set of most probable tokens whose cumulative probability is below a threshold. (C)

Which of the following tasks is NOT mentioned as a supervised fine-tuning task for GPT-1?

Machine translation (B)

What is the primary advantage of using unsupervised pretraining for language models such as GPT-1?

It allows the model to be trained on a large amount of unlabeled data, which is often easier and cheaper to obtain. (D)

What is the primary advantage of scaling model size according to the content?

It allows competitive results without fine-tuning. (A)

In terms of architecture, what type of architecture is used in all versions from GPT-1 to GPT-3?

Decoder-only (A)

What is the main method by which GPT-3 learns tasks?

By using few-shot learning with prompts. (C)

Which GPT model features the highest number of parameters?

GPT-3 (B)

What limitation do larger models like Jurassic-1 and Gopher face, as mentioned in the content?

They are oversized but tend to be undertrained. (B)

What does the context length refer to in models like GPT?

The maximum number of tokens the model can process at once. (D)

What approach did DeepMind recommend for training large models effectively?

Utilizing fixed-sized models with defined parameters. (D)

What happens to the loss as model size and data increase, according to the provided information?

It decreases following a power law. (D)

What role does long-term memory play for an agent?

It stores user preferences for personalized assistance. (C)

What is the primary purpose of the Planning module in an agent?

To devise strategies for problem-solving. (A)

Which process allows an agent to evaluate its past decisions to identify improvements?

Reflection (C)

What does the Chain-of-Thoughts process involve?

Sequential reasoning for complex problems. (D)

Subgoal decomposition helps an agent to:

Break down complex problems into manageable tasks. (B)

How does self-criticism benefit an agent?

It critically analyzes performance for improvements. (D)

What is the function of memory retrieval in decision-making for an agent?

To extract relevant information from stored memory. (A)

Self-updating in agents refers to which process?

Automatically updating memory with new knowledge. (A)

Which disadvantage is associated with the BLEU metric?

It doesn't consider the semantic similarity of sentences. (A); It accepts garbage sentences as valid. (B)

What limitation does the BERT Score have?

It depends on an external model for token output. (B)

What does the Exact Match (EM) metric indicate?

A binary measure of correctness in matching. (A)

In natural language inference, what does 'entailment' mean?

The hypothesis is true based on the premise. (A)

What are ranking metrics used for in language tasks?

To assign relative importance to tokens. (C)

Which scenario describes a 'closed book' question-answering task?

The model answers questions based solely on prior knowledge. (A)

What is the primary goal of human evaluation in language tasks?

To measure coherence, creativity, and fluency. (D)

Which collection involves predicting a missing word in narrative passages?

LAMBADA (C)

Flashcards

Recurrent Neural Network (RNN)

A type of neural network that processes sequential data, like text. It has a hidden state that keeps track of the context, but can struggle with long sequences due to vanishing gradients.

Long Short-Term Memory (LSTM)

Specialized RNN that uses gates to control the flow of information, addressing the vanishing gradient problem and improving long-term dependency.

Seq2Seq

A sequence-to-sequence model that allows for inputs and outputs of different lengths. It consists of an encoder and a decoder, enabling complex tasks like machine translation.

Transformer

A powerful model that breaks the reliance on RNNs for sequence-to-sequence tasks. It encodes the entire input and uses attention to focus on relevant parts for decoding. It can handle longer sequences.

Encoding

The process of converting an entire input sequence into a compact representation that captures the essence of the input. This representation is then used for further processing.

Language Modeling

A type of unsupervised learning where a model learns to predict the next token in a sequence based on previous tokens.

Byte Pair Encoding (BPE)

A technique used to compress the vocabulary of a language model by merging common character pairs into new tokens. This reduces the vocabulary size and allows the model to learn more efficiently.

Model Size

The number of parameters in a language model, representing the model's complexity and ability to learn patterns.

Dataset Size

The amount of data used to train a language model. Larger datasets generally lead to better performance.

Computational Budget

The computational resources used to train a language model. These resources include compute power, memory, and time.

Decoder-Only Architecture

A type of language model architecture that uses only the Transformer decoder stack: the prompt and the generated tokens form a single sequence that is processed autoregressively. GPT models are decoder-only.

In-Context Learning

A technique for adapting a language model to a specific task without updating the model's weights. The task is described in the prompt.

Fine-tuning

Updating the weights of a language model on a task-specific dataset to improve performance on that task.

Perplexity

A metric that measures how uncertain a language model is about predicting the next word in a sequence. A lower perplexity indicates the model is more confident in its predictions, while a higher perplexity suggests the model is less certain.
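
A quick sketch of the definition, with invented per-token probabilities purely for illustration:

```python
import numpy as np

def perplexity(token_probs):
    """Perplexity = exp(mean negative log-likelihood of the observed tokens)."""
    nll = -np.log(np.asarray(token_probs, dtype=float))
    return float(np.exp(nll.mean()))

print(perplexity([0.9, 0.8, 0.95]))  # confident predictions -> low perplexity
print(perplexity([0.1, 0.05, 0.2]))  # uncertain predictions -> high perplexity
```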

BLEU (Bilingual Evaluation Understudy)

A metric used to evaluate the quality of a generated sequence against a reference sequence. It measures the precision of matching words between the generated and reference sequences. A higher BLEU score indicates a better match between the output and the reference.

Large Language Model (LLM)

A type of language model that is trained on a massive amount of text data and can generate realistic and coherent text. LLMs are often used for tasks like text summarization, translation, and chatbot development.

Geometric Mean

A statistical measure that penalizes low values more heavily than the arithmetic mean: a single near-zero component pulls the whole score down. It is used, for example, to combine per-n-gram precisions when evaluating language models (as in BLEU).

Generative Pre-trained Transformer (GPT)

A family of large language models developed by OpenAI, known for their impressive performance on various language tasks. GPT-2, GPT-3, and GPT-4 are examples of models in this family; GPT-Neo is an open-source replication by EleutherAI.

BERT

A natural language processing (NLP) model that uses a bidirectional encoder to represent language, trained on two self-supervised tasks: masked language modeling (MLM) and next sentence prediction. It can then be fine-tuned for specific downstream tasks like sentiment analysis, text classification, and question answering.

Masked Language Modeling (MLM)

BERT's task of predicting masked tokens in a sentence. It is a self-supervised learning method where the model learns to predict the missing words in a sentence based on the context of the surrounding words.

Next Sentence Prediction

BERT's task of predicting whether two sentences are related (e.g., next sentence in a document). It helps BERT understand the relationship between sentences, making it useful for tasks like question answering and summarization.

Contextual Embeddings

The ability of a language model to understand the meaning of a word based on its surrounding words. This is crucial for NLP tasks that require understanding the context of language.

Decoder Only Model

A type of language model that uses only a decoder, receiving input as a sequence and generating output in an autoregressive manner.

GPT (Generative Pre-trained Transformer)

A language model that is pretrained on a text generation task and can be fine-tuned for specific tasks. Unlike BERT, GPT uses a decoder-only architecture and generates text in an autoregressive way.

Greedy Sampling

A sampling method used in language models where the most probable token is chosen at each step of text generation, leading to repetitive or predictable text. Not ideal for creative or diverse text.

Memory in AI Agents

The ability of an agent to store and retrieve information, enabling it to learn and adapt over time.

Short-term Memory in AI Agents

A temporary storage space used by an agent to hold information relevant to the current task, like search results or conversation context.

Long-term Memory in AI Agents

A long-term storage space used by an agent to retain information about user preferences, historical interactions, and learned experiences.

Reflection in AI Agents

The process of an AI agent evaluating its past decisions and actions to identify areas for improvement, ensuring it learns from successes and failures.

Self-critics in AI Agents

The ability of an AI agent to analyze its own performance critically and suggest improvements, acting as an internal feedback mechanism.

Subgoal Decomposition

A strategy for solving complex problems by breaking them down into smaller, more manageable ones, ensuring a step-by-step approach.

Memory Retrieval in AI Agents

A system for enhancing decision-making by retrieving relevant information from an agent's memory, including environmental data, past interactions, and learned experiences.

Self-updating in AI Agents

The process of updating an agent's memory with new information and experiences, enabling it to learn and adapt to new environments.

BLEU

BLEU measures the overlap of n-grams (sequences of words) between generated and reference text. It's a simple, widely used metric, but often overestimates performance when there's a grammatical match, even if sentences don't convey the same meaning.

BERT score

BERT score measures semantic similarity by comparing word vectors from a pre-trained BERT model. It assesses how similar the meaning of the generated text is to the reference text.

Exact Match (EM)

Exact Match (EM) is a binary metric that indicates if the generated text perfectly matches the reference text. It's commonly used in question answering, especially for closed-book settings where the model needs to retrieve the exact answer.

Ranking

Ranking in text evaluation determines each word's position in the output sequence relative to its correct position in the reference text. Higher ranks indicate better word order and alignment, helping to evaluate the output's fluency.

Human evaluation

Human evaluation involves subjective assessments by people to evaluate aspects of language quality that machines struggle to capture, such as fluency, coherence, and creativity.

LAMBADA

LAMBADA is a benchmark dataset of narrative passages used to test a model's ability to predict the next word in a sentence. It's a challenging natural language understanding task.

Story completion tasks (ROCStories, HellaSwag, StoryCloze)

ROCStories, HellaSwag, and StoryCloze are story-completion benchmarks that test a model's ability to select the most appropriate ending for a given story. They're useful for understanding a model's ability to comprehend and reason about narratives.

Question Answering (QA)

Question answering (QA) tasks measure a model's ability to provide answers to questions. Closed-book QA assesses a model's knowledge base, while open book QA allows access to information sources during the task.

Study Notes

Language Models (LLMs)

  • Large language models (LLMs) are a type of artificial intelligence (AI) system that can understand and generate human language.
  • LLMs are trained on massive datasets of text and code.
  • Key aspects include introduction, n-grams, deep learning, multi-class classification, weights, loss, word embeddings, RNN, problems (seq2seq), transformers, tokenizer, BPE, positional encoding, sinusoidal positional encoding, attention, types of attention (encoder/decoder self-attention, cross-attention), residual connections, layer normalization, relative positional embeddings, encoder-decoder architectures (T5) and encoder-only (BERT).

Word Embeddings

  • Word embeddings represent words as dense vectors, capturing semantic relationships.
  • They overcome limitations of one-hot encoding.
  • Methods like Word2Vec (CBOW, Skip-gram), FastText use different techniques.
  • These techniques capture semantic similarity and relationships between words.
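
As a minimal, self-contained illustration of dense vectors and cosine similarity, the toy embeddings below are invented for the example; real embeddings would come from a trained model such as Word2Vec or FastText:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors (1.0 = same direction)."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 4-dimensional embeddings, made up purely for illustration.
emb = {
    "king":  np.array([0.80, 0.65, 0.10, 0.05]),
    "queen": np.array([0.78, 0.70, 0.12, 0.04]),
    "apple": np.array([0.05, 0.10, 0.90, 0.70]),
}

print(cosine_similarity(emb["king"], emb["queen"]))  # high: semantically related
print(cosine_similarity(emb["king"], emb["apple"]))  # low: unrelated
```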

Recurrent Neural Networks (RNNs)

  • RNNs process sequential data, maintaining a hidden state.
  • The same weights are applied at every time step as the model moves through the sequence.
  • They struggle with long-term dependencies and are not easily parallelizable.
  • LSTMs (Long Short Term Memory) are a type of RNN that address these problems with gates.
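
A minimal NumPy sketch of the vanilla recurrence (toy sizes, random weights) makes both limitations concrete: the same weights are reused at every step, and each hidden state depends on the previous one, so the loop over time cannot be parallelized:

```python
import numpy as np

def rnn_forward(x_seq, W_xh, W_hh, b_h):
    """Vanilla RNN: h_t = tanh(W_xh x_t + W_hh h_{t-1} + b_h).
    Each step needs the previous hidden state, so the loop is inherently sequential."""
    h = np.zeros(W_hh.shape[0])
    states = []
    for x_t in x_seq:                      # cannot be parallelized across time
        h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)
        states.append(h)
    return np.stack(states)

rng = np.random.default_rng(0)
x_seq = rng.normal(size=(5, 3))            # 5 time steps, input size 3
W_xh, W_hh, b_h = rng.normal(size=(4, 3)), rng.normal(size=(4, 4)), np.zeros(4)
print(rnn_forward(x_seq, W_xh, W_hh, b_h).shape)  # (5, 4): one hidden state per step
```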

Transformers

  • Transformers process all parts of a sequence in parallel.
  • They use attention mechanisms to consider relationships between words.
  • Attention considers how relevant each word is to the words around it, or context.
  • Key components include the encoder and decoder, positional encoding (sinusoidal), tokenization.
  • Different types of Transformers include BERT (encoder-only), T5 (encoder-decoder), and GPT (decoder-only).
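
The operation at the heart of this is scaled dot-product attention. A minimal single-head NumPy sketch (no masking, toy shapes) of softmax(QK^T / sqrt(d_k)) V:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # (n_queries, n_keys) relevance scores
    weights = softmax(scores, axis=-1)  # each query's distribution over the keys
    return weights @ V, weights

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))             # 4 query positions, d_k = 8
K = rng.normal(size=(6, 8))             # 6 key/value positions
V = rng.normal(size=(6, 8))
out, attn = scaled_dot_product_attention(Q, K, V)
print(out.shape, attn.shape)            # (4, 8) (4, 6)
```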

Fine-tuning

  • Fine-tuning is adapting a pre-trained LLM to a specific task or dataset.
  • It involves adjusting the model weights on a new dataset.
  • Fine-tuning is costly for large LLMs and can introduce bias.
  • Parameter-efficient alternatives exist (e.g., LoRA, adapters).

Data-efficient fine-tuning techniques

  • Techniques such as LoRA and adapters improve task performance while keeping the number of trainable parameters small (see the LoRA sketch below).
  • BitFit fine-tunes only the bias terms, which makes it efficient at large scale.
  • Prompt tuning learns a small set of soft prompt embeddings while keeping the base model's weights frozen.

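A minimal sketch of the LoRA idea under toy assumptions (NumPy, no training loop, shapes invented for illustration): the pretrained weight stays frozen and only a low-rank update B A is learned.

```python
import numpy as np

class LoRALinear:
    """y = W x + (alpha / r) * B (A x).
    W is the frozen pretrained weight; only the small factors A and B would be
    trained, so the number of trainable parameters stays low."""

    def __init__(self, W: np.ndarray, r: int = 4, alpha: float = 8.0, seed: int = 0):
        rng = np.random.default_rng(seed)
        d_out, d_in = W.shape
        self.W = W                                        # frozen
        self.A = rng.normal(scale=0.01, size=(r, d_in))   # trainable, small init
        self.B = np.zeros((d_out, r))                     # trainable, zero init so the
        self.scale = alpha / r                            # update starts as a no-op

    def __call__(self, x: np.ndarray) -> np.ndarray:
        return self.W @ x + self.scale * (self.B @ (self.A @ x))

layer = LoRALinear(W=np.random.default_rng(1).normal(size=(16, 32)))
print(layer(np.ones(32)).shape)  # (16,)
```
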
Quantization

  • Quantization reduces the numerical precision of weights or activations, e.g., from 32-bit floats to 8-bit integers (see the absmax sketch below).
  • It decreases model size and memory requirements.
  • Typical types of quantization include zero-point and absmax.
  • This can lower performance slightly.
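
A minimal sketch of the absmax variant mentioned above, mapping a float weight matrix onto the int8 grid and back:

```python
import numpy as np

def absmax_quantize(w: np.ndarray):
    """Symmetric 'absmax' int8 quantization: the largest absolute weight maps
    to 127 and everything else is rounded onto the int8 grid."""
    scale = 127.0 / np.max(np.abs(w))
    return np.round(w * scale).astype(np.int8), scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) / scale

w = np.random.default_rng(0).normal(size=(4, 4)).astype(np.float32)
q, scale = absmax_quantize(w)
print(q.dtype, np.abs(w - dequantize(q, scale)).max())  # int8, small rounding error
```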

Model Distillation

  • Model distillation is training a smaller model (student) from a larger model (teacher).
  • This helps to efficiently reproduce the behaviour of the teacher model in a smaller version with comparable performance.
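
One common way to set this up (a sketch of the usual recipe, not necessarily the exact one from the lesson) is to train the student to match the teacher's temperature-softened output distribution:

```python
import numpy as np

def softmax(z, T=1.0):
    z = z / T
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, T=2.0):
    """KL(teacher || student) on temperature-softened distributions: the student
    learns to match the teacher's full distribution, not just its top-1 label."""
    p_t = softmax(teacher_logits, T)
    p_s = softmax(student_logits, T)
    kl = np.sum(p_t * (np.log(p_t + 1e-12) - np.log(p_s + 1e-12)), axis=-1)
    return float(kl.mean() * T * T)     # T^2 keeps the loss scale comparable

rng = np.random.default_rng(0)
teacher = rng.normal(size=(8, 100))                       # batch of 8, vocab of 100
student = teacher + rng.normal(scale=0.5, size=(8, 100))  # imperfect student
print(distillation_loss(student, teacher))
```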

Mixture of Experts

  • Mixture of experts is a technique to improve model scale without significantly increasing the computational cost.
  • A routing network selects a small subset of expert sub-networks (experts) for each prediction, so only part of the model is active per token (see the sketch below).
  • This helps to address limitations of scaling.
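
A minimal sketch of sparse routing for a single token; the linear "experts", the gate, and all shapes below are toy stand-ins:

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def moe_forward(x, experts, W_gate, top_k=2):
    """The gate scores every expert, but only the top-k experts are evaluated,
    so compute per token grows far more slowly than total parameter count."""
    gate_logits = W_gate @ x
    top = np.argsort(gate_logits)[-top_k:]      # indices of the selected experts
    weights = softmax(gate_logits[top])         # renormalize over the chosen experts
    return sum(w * experts[i](x) for w, i in zip(weights, top))

rng = np.random.default_rng(0)
d = 16
experts = [lambda x, W=rng.normal(size=(d, d)): W @ x for _ in range(4)]  # toy experts
W_gate = rng.normal(size=(4, d))
print(moe_forward(rng.normal(size=d), experts, W_gate).shape)  # (16,)
```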

Contrastive Learning

  • Contrastive learning is a method to better align visual and textual data for multimodal tasks.
  • It aims to maximize the similarity between matching pairs (e.g., an image of a cat and a sentence describing a cat should be close under cosine similarity) and to minimize the similarity between unrelated pairs (see the sketch below).

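A CLIP-style sketch of this objective, with random vectors standing in for the outputs of real image and text encoders:

```python
import numpy as np

def log_softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Matching image/text pairs (the diagonal of the similarity matrix) should
    score higher than every mismatched pair, in both directions."""
    img = image_emb / np.linalg.norm(image_emb, axis=-1, keepdims=True)
    txt = text_emb / np.linalg.norm(text_emb, axis=-1, keepdims=True)
    sim = img @ txt.T / temperature            # (N, N) cosine similarities
    idx = np.arange(sim.shape[0])              # i-th image matches i-th caption
    loss_i2t = -log_softmax(sim)[idx, idx].mean()
    loss_t2i = -log_softmax(sim.T)[idx, idx].mean()
    return (loss_i2t + loss_t2i) / 2

rng = np.random.default_rng(0)
print(contrastive_loss(rng.normal(size=(8, 32)), rng.normal(size=(8, 32))))
```
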
Multi-modal Models

  • Multi-modal models (e.g., LLaVA) combine knowledge from different modalities.
  • They bridge the gap between text and visuals: an image is encoded into visual tokens that are projected into the language model's embedding space alongside the question, so the model can reason about the image and produce a useful answer.

Agent Architectures

  • LLM Chains allow LLMs to execute a series of actions.
  • An agent makes decisions regarding actions, tools, and when to finish a process.
  • Key aspects include how to combine memory, planning, and reflection to aid decision making for tasks.
  • Compare reactive (simple responses without contextual awareness) and agentic (proactive and decision-making) AI.
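
A deliberately toy, self-contained sketch of how memory, planning, and reflection can be wired into one loop; every name below (FakeLLM, the calculator tool, the list-based memory) is a made-up stand-in, not a real framework API:

```python
from dataclasses import dataclass

@dataclass
class Action:
    name: str
    argument: str

@dataclass
class FakeLLM:
    """Stands in for an LLM-based planner: plans one tool call, then finishes."""
    step: int = 0

    def plan_step(self, goal: str, context: list) -> Action:
        self.step += 1
        return Action("calculator", "2 + 2") if self.step == 1 else Action("finish", context[-1])

    def reflect(self, goal: str, action: Action, observation: str) -> str:
        return f"{action.name}({action.argument}) -> {observation}"

def run_agent(goal: str, llm: FakeLLM, tools: dict, memory: list, max_steps: int = 5):
    for _ in range(max_steps):
        context = memory or [goal]                              # memory retrieval
        action = llm.plan_step(goal, context)                   # planning
        if action.name == "finish":
            return action.argument
        observation = tools[action.name](action.argument)       # acting with a tool
        memory.append(llm.reflect(goal, action, observation))   # reflection + self-update
    return None

tools = {"calculator": lambda expr: str(eval(expr))}            # toy tool, eval for demo only
print(run_agent("What is 2 + 2?", FakeLLM(), tools, memory=[]))
```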

Evaluation Metrics

  • Evaluation of generated requirements, code, and design uses metrics such as precision/recall computed with cosine similarity (sketched below).
  • These evaluate performance on aspects such as correctness, understandability, structure, and runtime performance.
  • These metrics can be used iteratively to improve model output.
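
One way to read "precision/recall with cosine similarity" is BERTScore-style greedy matching over token embeddings; a sketch with random embeddings standing in for real ones:

```python
import numpy as np

def cosine_precision_recall(gen_emb: np.ndarray, ref_emb: np.ndarray):
    """Precision: how well each generated token is covered by the reference.
    Recall: how well each reference token is covered by the generation."""
    g = gen_emb / np.linalg.norm(gen_emb, axis=-1, keepdims=True)
    r = ref_emb / np.linalg.norm(ref_emb, axis=-1, keepdims=True)
    sim = g @ r.T                           # (n_generated, n_reference) cosines
    precision = sim.max(axis=1).mean()      # best reference match per generated token
    recall = sim.max(axis=0).mean()         # best generated match per reference token
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

rng = np.random.default_rng(0)
print(cosine_precision_recall(rng.normal(size=(7, 32)), rng.normal(size=(9, 32))))
```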

Other

  • Sampling strategies.
  • History on GPT evolution.
  • Use-case study for various LLMs (LLAMA, GPT).
  • Different approaches to finetuning are examined.
  • Prompt engineering used to improve performance by creating better input prompts.
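
For the sampling strategies mentioned above, a minimal nucleus (top-p) sampler over a toy next-token distribution; greedy decoding is the degenerate case of always taking the argmax:

```python
import numpy as np

def top_p_sample(probs: np.ndarray, p: float = 0.9, rng=None) -> int:
    """Nucleus (top-p) sampling: keep the smallest set of most probable tokens
    whose cumulative probability reaches p, renormalize, and sample from it."""
    if rng is None:
        rng = np.random.default_rng()
    order = np.argsort(probs)[::-1]              # tokens from most to least probable
    cumulative = np.cumsum(probs[order])
    cutoff = np.searchsorted(cumulative, p) + 1  # size of the nucleus
    nucleus = order[:cutoff]
    renormalized = probs[nucleus] / probs[nucleus].sum()
    return int(rng.choice(nucleus, p=renormalized))

vocab_probs = np.array([0.5, 0.2, 0.15, 0.1, 0.05])  # toy next-token distribution
print(top_p_sample(vocab_probs, p=0.9, rng=np.random.default_rng(0)))
```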


Related Documents

LLM Recap PDF

Description

Test your knowledge on modern neural network architectures, focusing on Transformers and Recurrent Neural Networks (RNNs). This quiz covers key concepts, functionalities, and advantages of different models including BERT, LSTM, and the LLaMa model. Assess your understanding of sequence modeling and attention mechanisms in AI.
