
25- Transformer Basics

Created by
@ThrillingTuba


Questions and Answers

What is the main weakness of sequential processing in RNNs?

Each hidden state depends on the previous one, making it hard to parallelize.
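
To make this dependency concrete, here is a minimal NumPy sketch (illustrative, not from the lesson) of a toy RNN recurrence: each hidden state needs the previous one, so the loop over time steps cannot be parallelized.

```python
import numpy as np

def rnn_forward(x_seq, h0, W_h, W_x):
    """Toy RNN: h_t = tanh(W_h @ h_{t-1} + W_x @ x_t).
    The loop is strictly sequential because h_t depends on h_{t-1}."""
    h = h0
    states = []
    for x_t in x_seq:                     # cannot be parallelized over time
        h = np.tanh(W_h @ h + W_x @ x_t)
        states.append(h)
    return np.stack(states)

# Example: 5 time steps, hidden size 4, input size 3
rng = np.random.default_rng(0)
x_seq = rng.normal(size=(5, 3))
states = rnn_forward(x_seq, np.zeros(4),
                     rng.normal(size=(4, 4)) * 0.1,
                     rng.normal(size=(4, 3)) * 0.1)
print(states.shape)  # (5, 4)
```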

How does the transformer architecture handle the issue of sequential processing in RNNs?

Transformer architecture uses attention mechanisms to process all tokens in parallel, without sequential dependencies.

What is one drawback of using attention mechanisms in the transformer architecture?

Attention mechanisms require additional computation, which grows quadratically with the context length.

Explain the concept of 'bottleneck' in the context of passing data along time in models.

In traditional sequential models, information about earlier time steps must be squeezed through the hidden state that is passed from one step to the next, which creates a bottleneck.

Why is it important for a language model to consider the importance of different input tokens?

Not all inputs are equally important, and their importance may vary across time steps.

How can a language model benefit from understanding the importance of specific input tokens?

By attending more to relevant tokens, the language model can make more accurate predictions.

What is the purpose of attention in the transformer architecture?

To provide the language model with weights for focusing on certain (relevant) inputs.

What are the matrices involved in Scaled Dot-Product Attention?

Q (the matrix of queries), K (the matrix of keys), and V (the matrix of values).
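
For reference, these three matrices are combined in the standard scaled dot-product attention formula (Vaswani et al., 2017):

```latex
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V
```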

How is the vanishing gradient problem weakened in Scaled Dot-Product Attention?

By scaling the dot products with the factor 1/√d_k, which keeps the softmax inputs from becoming so large that its gradients (nearly) vanish.
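
A minimal NumPy sketch (illustrative, not the lesson's code) showing where the 1/√d_k scaling factor enters:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Q: (n, d_k), K: (m, d_k), V: (m, d_v)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)   # scaling keeps scores moderate,
                                      # so the softmax gradients do not vanish
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V

rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(4, 8)), rng.normal(size=(6, 8)), rng.normal(size=(6, 8))
print(scaled_dot_product_attention(Q, K, V).shape)  # (4, 8)
```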

What is the purpose of masking in attention mechanisms?

To hide parts of the input, for example future tokens during training and decoding, so the model cannot attend to positions it should not see.
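
A sketch of causal masking, assuming the same NumPy setup as above: disallowed positions get a score of -inf before the softmax, so their attention weight becomes zero.

```python
import numpy as np

def causal_mask(n):
    """Boolean mask: True where attention is allowed (no peeking ahead)."""
    return np.tril(np.ones((n, n), dtype=bool))

def masked_attention(Q, K, V, mask):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    scores = np.where(mask, scores, -np.inf)   # masked positions get weight 0
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))
out = masked_attention(X, X, X, causal_mask(5))   # causal self-attention
print(out.shape)  # (5, 8)
```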

Why is Multi-Head Attention used in complex tasks?

To capture different views of the input simultaneously, such as nouns, pronouns, and verbs.
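
A minimal multi-head attention sketch (shapes and weight names are illustrative): the model dimension is split across heads, each head attends independently and can learn a different view, and the head outputs are concatenated and projected back.

```python
import numpy as np

def _attend(Q, K, V):
    """Scaled dot-product attention for one head."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V

def multi_head_attention(X, W_q, W_k, W_v, W_o, num_heads):
    """X: (n, d_model); W_q, W_k, W_v, W_o: (d_model, d_model)."""
    n, d_model = X.shape
    d_head = d_model // num_heads
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    # split the model dimension into num_heads independent "views"
    split = lambda M: M.reshape(n, num_heads, d_head).transpose(1, 0, 2)
    Qh, Kh, Vh = split(Q), split(K), split(V)
    heads = [_attend(Qh[h], Kh[h], Vh[h]) for h in range(num_heads)]
    return np.concatenate(heads, axis=-1) @ W_o   # concat heads, project back

rng = np.random.default_rng(0)
d_model, n, H = 16, 5, 4
W = lambda: rng.normal(size=(d_model, d_model)) * 0.1
X = rng.normal(size=(n, d_model))
print(multi_head_attention(X, W(), W(), W(), W(), H).shape)  # (5, 16)
```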

What is the challenge with naive attention implementations regarding memory?

They require significant memory for large context sizes, leading to bottlenecks.
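
A back-of-the-envelope illustration (example context sizes, fp16 scores assumed) of the n-by-n score matrix that a naive implementation materializes per head and per layer:

```python
# Memory for the n-by-n attention score matrix, in fp16 (2 bytes per entry).
def score_matrix_gib(context_len, bytes_per_elem=2):
    return context_len ** 2 * bytes_per_elem / 2 ** 30

for n in (2_048, 8_192, 32_768):
    print(n, f"{score_matrix_gib(n):.2f} GiB")
# 2048 -> ~0.01 GiB, 8192 -> ~0.12 GiB, 32768 -> ~2.00 GiB (per head, per layer)
```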

What are some optimizations implemented in FlashAttention 2?

Scaled dot-product attention fused into a single CUDA kernel, blockwise processing, and optimized use of the GPU cache and VRAM.
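
As a hedged usage sketch: recent PyTorch versions expose torch.nn.functional.scaled_dot_product_attention, which can dispatch to a fused FlashAttention-style kernel on supported GPUs (availability depends on version, hardware, dtype, and shape).

```python
import torch
import torch.nn.functional as F

device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32

# (batch, heads, seq_len, head_dim)
q = torch.randn(1, 8, 1024, 64, device=device, dtype=dtype)
k = torch.randn(1, 8, 1024, 64, device=device, dtype=dtype)
v = torch.randn(1, 8, 1024, 64, device=device, dtype=dtype)

# On a supported GPU, PyTorch can pick a fused FlashAttention-style kernel,
# so the full n-by-n score matrix is never materialized in VRAM.
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out.shape)  # torch.Size([1, 8, 1024, 64])
```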

Why does Full attention in LongFormer not scale to a large context?

Because full attention requires computation for every pair of tokens, so the cost grows quadratically with the context length.

What is the concept of Factorized Self-Attention in Sparse Transformer?

Different heads can use different attention access patterns (for example local and strided), and certain restricted access patterns are more efficient to compute.
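
An illustrative sketch (not the Sparse Transformer code) of two factorized access patterns, a local sliding window and a strided pattern, expressed as boolean masks that different heads could use:

```python
import numpy as np

def local_mask(n, window):
    """Each position attends to itself and the previous window-1 positions."""
    i, j = np.indices((n, n))
    return (j <= i) & (i - j < window)

def strided_mask(n, stride):
    """Each position attends to earlier positions that lie on its stride."""
    i, j = np.indices((n, n))
    return (j <= i) & ((i - j) % stride == 0)

n = 16
print(local_mask(n, 4).sum(), strided_mask(n, 4).sum())  # far fewer than n*n = 256 entries
```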

How does Positional Encoding help with the order of tokens in transformers?

Positional encoding is added to the embeddings so that the model can compute position differences and the order of the tokens is preserved.
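
A minimal NumPy sketch of the standard sinusoidal positional encoding, which is added to the token embeddings:

```python
import numpy as np

def sinusoidal_pe(max_len, dim):
    """PE[pos, 2i]   = sin(pos / 10000^(2i/dim))
       PE[pos, 2i+1] = cos(pos / 10000^(2i/dim))"""
    pos = np.arange(max_len)[:, None]          # (max_len, 1)
    i = np.arange(dim // 2)[None, :]           # (1, dim/2)
    angles = pos / np.power(10000.0, 2 * i / dim)
    pe = np.zeros((max_len, dim))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = sinusoidal_pe(max_len=128, dim=16)
# x = token_embeddings + pe[:seq_len]          # added to the embeddings
print(pe.shape)  # (128, 16)
```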

What is the purpose of Rotary Embeddings (RoPE) according to the text?

Rotary Embeddings aim to preserve relative, not absolute, location in the text.
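
A simplified RoPE sketch (layout and names are illustrative): each query/key vector is split into 2-D pairs that are rotated by position-dependent angles, so the dot product of two rotated vectors depends only on their relative offset.

```python
import numpy as np

def rope(x, pos, base=10000.0):
    """Rotate the pairs (x[2i], x[2i+1]) of a query/key vector by angles
    that grow with the position pos. x: (dim,) with even dim."""
    dim = x.shape[-1]
    i = np.arange(dim // 2)
    theta = pos / base ** (2 * i / dim)        # per-pair rotation angle
    cos, sin = np.cos(theta), np.sin(theta)
    x1, x2 = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x1 * cos - x2 * sin            # 2-D rotation of each pair
    out[1::2] = x1 * sin + x2 * cos
    return out

rng = np.random.default_rng(0)
q, k = rng.normal(size=8), rng.normal(size=8)
# The score depends only on the offset (5 - 2), not on the absolute positions:
print(np.isclose(rope(q, 5) @ rope(k, 2), rope(q, 13) @ rope(k, 10)))  # True
```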

Why does PE not extend well to a context longer than dim/2 to dim of the embeddings?

Because the encoding uses only dim/2 sinusoid frequencies, position differences beyond roughly dim/2 to dim can no longer be resolved reliably, so PE does not extrapolate well to such contexts.
