Transformer Network: Causal Self-Attention
18 Questions

Questions and Answers

What is the key advantage of using self-attention in sequence processing?

  • Reduced computational complexity
  • Ability to model long-range dependencies
  • Independence of computations at each time step (correct)
  • Ability to process sequences of varying lengths

In the context of self-attention, what is the output sequence used for?

  • Predicting the next token in a sequence
  • Generating a summary of the input sequence
  • Computing contextualized representations (correct)
  • Classifying the sequence according to a predefined category

What is the primary motivation behind using self-attention in language models?

  • To enable parallelization of computations
  • To reduce the number of parameters in the model
  • To improve the interpretability of the model
  • To model complex contextual relationships (correct)

What is the key difference between self-attention and recurrent neural networks (RNNs)?

Self-attention is parallelizable, while RNNs are not.
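To make that contrast concrete, here is a toy sketch (the shapes and weights are invented for illustration, not from the lesson): an RNN must advance one step at a time because each hidden state depends on the previous one, while all of the self-attention comparison scores fall out of a single matrix product.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(6, 4))   # 6 tokens, embedding dim 4 (made up)
W = rng.normal(size=(4, 4))   # a stand-in recurrence weight matrix

# RNN: inherently sequential -- h[t] cannot start until h[t-1] is done.
h = np.zeros(4)
for t in range(len(x)):
    h = np.tanh(x[t] + h @ W)

# Self-attention: every pairwise comparison comes from one matrix product,
# so each position's computation is independent of the others.
scores = x @ x.T              # scores[i, j] compares token i with token j
```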

What is the role of the context in self-attention?

To compute the relevance of each token to the current token.

What is the core intuition behind the attention mechanism?

Comparing an item of interest to a collection of other items.
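A minimal numeric sketch of that intuition, using dot products as the comparison (the vectors here are made up for illustration):

```python
import numpy as np

x = np.array([[1.0, 0.0],    # token 1
              [0.0, 1.0],    # token 2
              [1.0, 1.0]])   # token 3, the current item of interest

# Compare the item of interest against the collection of preceding items:
# a dot product scores higher the more similar two vectors are.
scores = x @ x[2]
print(scores)   # [1. 1. 2.] -- the current token matches itself most strongly
```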

What is the purpose of the α value in the attention-based approach?

To normalize the scores to provide a probability distribution.

What is the role of a query in the attention process?

As the current focus of attention when being compared to all of the other preceding inputs.

What is the result of the computation over the inputs in the attention-based approach?

The output vector a, an α-weighted sum of the inputs.

What is the purpose of the softmax function in the attention-based approach?

To normalize the scores to provide a probability distribution.
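Putting the last four answers together, here is a hedged sketch of the simplest attention computation (plain numpy, made-up dimensions): the query is the current focus, each score compares it to a preceding input, softmax turns the scores into the α distribution, and the output a is the α-weighted sum.

```python
import numpy as np

def simple_attention(inputs: np.ndarray, i: int) -> np.ndarray:
    """Return the output a for position i, attending over inputs[0..i]."""
    query = inputs[i]                        # the current focus of attention
    scores = inputs[: i + 1] @ query         # compare query to preceding inputs
    alphas = np.exp(scores - scores.max())   # softmax: exponentiate ...
    alphas /= alphas.sum()                   # ... then normalize to sum to 1
    return alphas @ inputs[: i + 1]          # a: weighted sum of the inputs

x = np.random.default_rng(0).normal(size=(4, 8))   # 4 tokens, dim 8 (made up)
a_2 = simple_attention(x, 2)                       # uses only x[0], x[1], x[2]
```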

What is the advantage of using transformers in attention-based models?

They create a more sophisticated way of representing how words contribute to the representation of longer inputs.

What is the role of a key in the attention process?

As a preceding input being compared to the current focus of attention.
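In the transformer version of this, each input embedding is projected into its three roles — query (the current focus), key (a preceding input compared against the query), and value (the content that gets summed). A sketch with randomly initialized matrices standing in for learned weights:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8                                       # assumed model dimension
W_Q, W_K, W_V = (rng.normal(size=(d, d)) for _ in range(3))

x = rng.normal(size=(5, d))                 # 5 token embeddings (made up)
Q, K, V = x @ W_Q, x @ W_K, x @ W_V         # project each input into its roles

i = 3                                       # current focus of attention
scores = K[: i + 1] @ Q[i] / np.sqrt(d)     # query i vs. keys 0..i, scaled
alphas = np.exp(scores - scores.max())
alphas /= alphas.sum()                      # softmax over preceding positions
a_i = alphas @ V[: i + 1]                   # output: weighted sum of values
```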

What is the primary purpose of self-attention in transformers?

To integrate the representation of words from the previous layer to build the current layer's representation.

What is the main difference between self-attention and traditional recurrent neural networks?

Self-attention can consider the entire context when computing a word's representation, whereas traditional RNNs consider only the previous words.
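Because each position's output depends only on the inputs (never on another position's output), the whole layer can be computed at once; causality is kept by masking scores for tokens that come after the query. A sketch under the same assumptions as above:

```python
import numpy as np

def causal_self_attention(Q, K, V):
    n, d = Q.shape
    scores = Q @ K.T / np.sqrt(d)                # every query/key pair at once
    mask = np.triu(np.ones((n, n)), k=1)         # 1s above diagonal = future
    scores = np.where(mask == 1, -np.inf, scores)
    alphas = np.exp(scores - scores.max(axis=-1, keepdims=True))
    alphas /= alphas.sum(axis=-1, keepdims=True) # row-wise softmax
    return alphas @ V                            # one output row per position
```

Each row of the result is one token's contextualized representation, produced in the same pass as all the others.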

What is the role of the self-attention weight distribution α in Figure 10.1?

It indicates the importance of each word at layer 5 when computing the representation of the word 'it' at layer 6.

What is the primary advantage of using self-attention in transformers?

It allows the model to consider the entire context when computing a word's representation.

What is the main difference between the representation of the word 'it' at layer 5 and layer 6?

The representation at layer 6 is computed based on the entire context, whereas the representation at layer 5 is computed based on local information.

What is the purpose of the neural circuitry architecture in transformers?

To integrate the representations of words from different layers to build the final representation.
