
NN-lecture-8-Attention-Mechanisms-2022-part2.pdf



Full Transcript


Neural Networks – Lecture 8: Attention Mechanisms in Neural Networks (part 2)
Slide contents partly based on the "High Performance Natural Language Processing" tutorial at EMNLP 2020.

Outline
● Computational Aspects of the Attention Mechanism
● Efficient Transformer Techniques
  – Data-Independent Attention Patterns
  – Data-Dependent Attention Patterns
  – Kernels and Alternative Attention Mechanisms
  – Recurrence in Transformer Architectures

Recap – Original Transformer Architecture
Key elements of the Transformer architecture:
● Input and positional embedding
● Multi-Head Attention for the encoder
● Masked Multi-Head Attention for the decoder
● Layer Normalization
● Residual connections – carry over the positional embeddings

Recap – Generalized Attention Operation
● Principle: a summary of the values (O_i) is computed based on the similarity (a_ij) between the value keys and the query
● The similarity is based on some function ϕ

Recap – Scaled Dot-Product Attention
● Dot-product similarity: ϕ(Q_i, K_j) = exp(Q_i K_j^T / √d), where d is the feature dimensionality
● softmax(x)_i = exp(x_i) / Σ_j exp(x_j)
● Notation: l → sequence length, d → feature dimension, h → number of attention heads

Transformer Architecture – Computational View
● Quadratic bottleneck in sequence length, due to multi-headed attention
● => Serious challenges when long sequences are required, e.g.:
  – Long-range dependencies (spanning paragraphs) in documents
  – Speech processing
  – High-resolution image processing
● => Search for solutions to make self-attention more efficient
  – Approximate the attention computation using more efficient operations (a minimal sketch of the quadratic baseline follows below)
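To make the quadratic bottleneck concrete, here is a minimal NumPy sketch of single-head scaled dot-product attention; the l × l score matrix is exactly what the techniques in the rest of the lecture try to avoid materializing in full. Function and variable names are illustrative, not from the slides.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Plain single-head attention: O(l^2 * d) time, O(l^2) memory for the scores."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                    # (l, l) -- the quadratic bottleneck
    scores -= scores.max(axis=-1, keepdims=True)     # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V                               # (l, d) summary of the values

l, d = 1024, 64
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((l, d)) for _ in range(3))
out = scaled_dot_product_attention(Q, K, V)
print(out.shape)   # (1024, 64); the intermediate score matrix was (1024, 1024)
```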
Efficient Attention
● A wide range of techniques exists – see Efficient Transformers: A Survey (Tay et al., 2022)
● Data-independent patterns
  – Based on identifying subsets of the full query-key attention matrix (e.g. forms of local attention)
● Data-dependent patterns
  – The key and query embeddings play a role in defining the attention pattern (e.g. hashing, clustering)
● Kernels and alternative attention mechanisms
  – The attention operation is approximated using kernel-based methods
● Recurrence
  – Additional recurrence relations are added to tackle very large context sizes

Efficient Attention – Data-Independent Patterns
Blockwise Patterns
● Divide the sequence into local blocks and restrict attention within them

Blockwise Attention [Blockwise Self-Attention for Long Document Understanding, Qiu et al., 2019]
● Q, K and V are split into block matrices, with π a permutation of {1, 2, …, n}:
  Q = [Q_1^T, Q_2^T, ⋯, Q_n^T]^T, K = [K_1^T, K_2^T, ⋯, K_n^T]^T, V = [V_1^T, V_2^T, ⋯, V_n^T]^T

Strided Patterns
● Skip some query/key pairs
● Reduce the time complexity to one quadratic in sequence length / stride

Diagonal (Sliding Window) Patterns
● Compute attention over the diagonal
● Reduce the time complexity to one linear in sequence length and window size

Random Patterns
● Compute attention over random query/key pairs
● Reduce the time complexity to one linear in the number of pairs

Global Attention Patterns
● Usually applied to one special token (e.g. [CLS]) or a few special tokens (e.g. sentence- or paragraph-level tokens) that are often prepended to the sequence
● Usually combined with other attention patterns

Efficient Attention – Data-Independent Patterns – Longformer
Example: Longformer [Longformer: The Long Document Transformer, Beltagy et al., 2020]
● Developed to address problems in handling large documents
● Alleviates the information loss caused by the chunking procedure, where documents have to be split into separately analysed sub-texts (e.g. chunks of 512 tokens)
● Showcases the use of sliding, strided and global attention patterns (a small mask-construction sketch follows below)
● Sliding Window Attention: each token attends to ½·w tokens on each side
  – the number of attention layers (l) ensures an eventual l × w receptive field
● Dilated Sliding Window: reaches an l × d × w receptive field (can be 10^4 tokens wide for small values of the dilation d)
  – Multi-headed attention – use different dilation configurations per head
  – Use the dilated sliding window only in the upper transformer layers
● Global Attention
  – Applied to a few preselected input locations (e.g. the [CLS] token for classification, question tokens for QA)
  – Each such query attends to all other tokens and is attended to by every token
  – A good way to incorporate problem-specific inductive bias
● Two sets of projections are learned: Qs, Ks, Vs for sliding-window attention; Qg, Kg, Vg for global attention
● Position embeddings: leveraged from RoBERTa's pretrained positional embeddings (the 512 position embeddings are copied over 8× → 4096 tokens for Longformer)
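As a rough illustration of the data-independent patterns above, the sketch below builds a Longformer-style boolean attention mask (a sliding window plus a handful of global tokens) and applies it inside masked softmax attention. The helper names and toy sizes are illustrative assumptions; real implementations such as Longformer or BigBird use custom banded/blocked kernels instead of a dense masked matrix.

```python
import numpy as np

def longformer_style_mask(seq_len, window, global_idx):
    """Boolean (seq_len, seq_len) mask, True = attention allowed: a sliding window of
    +/- window//2 around the diagonal, plus full rows/columns for 'global' positions."""
    idx = np.arange(seq_len)
    mask = np.abs(idx[:, None] - idx[None, :]) <= window // 2   # sliding-window pattern
    mask[global_idx, :] = True   # global tokens attend everywhere
    mask[:, global_idx] = True   # and are attended to by every token
    return mask

def masked_attention(Q, K, V, mask):
    """Dense reference implementation: disallowed pairs get -inf before the softmax."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)
    scores = np.where(mask, scores, -np.inf)
    scores -= scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

l, d, win = 16, 8, 4
rng = np.random.default_rng(1)
Q, K, V = (rng.standard_normal((l, d)) for _ in range(3))
mask = longformer_style_mask(l, window=win, global_idx=[0])   # token 0 plays the [CLS] role
out = masked_attention(Q, K, V, mask)
print(mask.sum(), "of", l * l, "query/key pairs are actually computed")
```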
Efficient Attention – Data-Independent Patterns – BigBird
Example: BigBird [BigBird: Transformers for Longer Sequences, Zaheer et al., 2020]
● The attention pattern composes global, sliding and random patterns over token blocks
● The role of the global and random attention patterns is to enable long-range connections between tokens at a lower memory cost
● BigBird has been applied to: long-document summarization and QA, contextual representations of genomics sequences, web search

Efficient Attention – Data-Dependent Patterns
● Define the "closeness" of a query/key pair by means of a function of their embeddings
● Create buckets / clusters → compute attention within them
  – Create the buckets such that they contain the highest attention weights of the attention matrix

Efficient Attention – Data-Dependent Patterns – Reformer
Example: Reformer [Reformer: The Efficient Transformer, Kitaev et al., 2020] – addresses the following problems:
● Memory in a model with N layers is N times larger than in a single-layer model, because the activations need to be stored for backpropagation → use reversible layers
● The depth d_ff of the intermediate feed-forward layers is often much larger than the depth d_model of the attention activations, so it accounts for a large fraction of memory use → use chunking
● Attention on sequences of length L is O(L²) in both computational and memory complexity → use locality-sensitive hashing

Reformer – Locality-Sensitive Hashing (LSH)
● Use hash functions to define closeness → two points q and p are close with good enough probability if hash(q) == hash(p)
● Map high-dimensional vectors to a set of discrete values (clusters/buckets) → approximate nearest-neighbour search
(Src: Efficient Transformers – A Naive Review)

Reformer – Angular Locality-Sensitive Hashing
● Use a single projection for both key and query onto a unit sphere, divided into predefined regions, each with a distinct code
● Group the shared keys/queries into buckets of at most a few hundred tokens
● Compute the attention matrices within each bucket
● The bucketing process is stochastic (random rotations) → compute several hashes to ensure that tokens with similar shared key-query embeddings end up in the same bucket

Reformer – Reversible Residual Layers
● Address the problem of storing the forward-pass activations (needed during backprop) for each transformer layer
● Idea: reconstruct each layer's activations exactly from the subsequent layer's activations → backprop without storing the activations in memory
● Combine the attention and feed-forward passes of a transformer layer into a RevNet block

Reformer – Chunking
● Use the fact that the computations in the feed-forward layers are independent across positions in a sequence → split them into chunks → process one chunk at a time in a batch

Reformer – Putting It All Together
● The Reformer architecture combines LSH attention, reversible residual layers and chunking
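The angular-LSH bucketing step can be sketched in a few lines: hash the (shared) queries/keys with a random rotation, then only allow attention between tokens that fall in the same bucket. This is a toy single-round version with illustrative names (`angular_lsh_buckets`, `n_buckets`); the actual Reformer uses several hash rounds and sorts/chunks the buckets so that attention really is computed per bucket.

```python
import numpy as np

def angular_lsh_buckets(x, n_buckets, rng):
    """Angular LSH: project onto random directions; the index of the most aligned
    direction (or its negation) is the bucket id. x has shape (seq_len, d)."""
    r = rng.standard_normal((x.shape[-1], n_buckets // 2))    # random rotation
    proj = x @ r                                              # (seq_len, n_buckets // 2)
    return np.argmax(np.concatenate([proj, -proj], axis=-1), axis=-1)

def lsh_bucket_mask(buckets):
    """Allow attention only between positions that landed in the same bucket."""
    return buckets[:, None] == buckets[None, :]

l, d = 16, 8
rng = np.random.default_rng(2)
qk = rng.standard_normal((l, d))                      # Reformer ties queries and keys
qk /= np.linalg.norm(qk, axis=-1, keepdims=True)      # project onto the unit sphere
buckets = angular_lsh_buckets(qk, n_buckets=4, rng=rng)
mask = lsh_bucket_mask(buckets)
print(buckets)                                        # bucket id per token
print(mask.sum(), "of", l * l, "pairs kept; attention is then computed per bucket")
```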
Efficient Attention – Data-Dependent Patterns – Linformer
Linformer [Linformer: Self-Attention with Linear Complexity, Wang et al., 2020]
● Based on the observation that the attention operation is low-rank => use a method that approximates an SVD of the context-mapping matrix (queries to keys) with a low-rank matrix
● Key idea: reduce the memory complexity by down-projecting the sequence length (the Key and Value embeddings):
  1) Project the n × d dimensional Key and Value layers into k × d dimensional projections
  2) Compute an n × k context-mapping matrix using scaled dot-product attention
● The projection matrices E_i and F_i can be shared head-wise or layer-wise

Efficient Attention – Kernel Interpretation
● Attention interpreted as a kernel function in an infinite feature space
  – Interpret softmax(Q K^T) as the Gram matrix of an exponential kernel
● => do the inverse of the "kernel trick" – explicitly compute ϕ(Q) ϕ(K)^T
  – ϕ is a feature transformation function
  – In practice, ϕ is a polynomial (instead of an exponential function)
● Reduces the computational complexity to O(n × d_k² × d_v), where d_k is the degree of the polynomial kernel
  – This is advantageous when d_v < n

Efficient Attention – Kernel Interpretation – Performer
Performer [Rethinking Attention with Performers, Choromanski et al., 2021]
● Uses an efficient (linear) generalized attention framework that allows a broad class of attention mechanisms based on different similarity measures (kernels)
● Rethinks attention as a_ij = g(Q_i^T) · K(Q_i^T, K_j^T) · h(K_j^T), with K(x, y) = E[ϕ(x)^T ϕ(y)], where ϕ(u) is a random feature map
● FAVOR+ mechanism: Fast Attention Via positive Orthogonal Random features
● Softmax-based dot-product attention can be approximated with this scheme by setting h(x) = exp(‖x‖² / 2), l = 2, f_1 = sin, f_2 = cos
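The trigonometric setting just quoted can be checked numerically: with enough random features, ϕ(q)^T ϕ(k) approximates the softmax kernel exp(q·k), and attention can then be computed in linear time by contracting ϕ(K)^T with V first. This is a minimal sketch with illustrative names (`trig_feature_map`, `linear_attention`), assuming the 1/√d scaling is already folded into Q and K; the real FAVOR+ mechanism uses positive orthogonal random features rather than sin/cos, precisely to avoid the instabilities trigonometric features can cause.

```python
import numpy as np

def trig_feature_map(x, omega):
    """phi(x) = exp(||x||^2 / 2) / sqrt(m) * [sin(omega^T x), cos(omega^T x)]."""
    m = omega.shape[1]
    proj = x @ omega                                              # (n, m)
    scale = np.exp((x ** 2).sum(-1, keepdims=True) / 2) / np.sqrt(m)
    return scale * np.concatenate([np.sin(proj), np.cos(proj)], axis=-1)

def linear_attention(Q, K, V, omega):
    """Approximate softmax attention in O(n * m * d) instead of O(n^2 * d)."""
    q_f, k_f = trig_feature_map(Q, omega), trig_feature_map(K, omega)
    kv = k_f.T @ V                                 # (2m, d) -- no n x n matrix is formed
    norm = q_f @ k_f.sum(axis=0)                   # row-wise softmax normalizer
    return (q_f @ kv) / norm[:, None]

n, d, m = 128, 8, 4096
rng = np.random.default_rng(3)
Q, K, V = (0.3 * rng.standard_normal((n, d)) for _ in range(3))
omega = rng.standard_normal((d, m))               # random feature directions

exact = np.exp(Q @ K.T)
exact /= exact.sum(-1, keepdims=True)
err = np.abs(linear_attention(Q, K, V, omega) - exact @ V).max()
print(err)   # approximation error; it shrinks as the number of random features m grows
```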
Efficient Attention – Adding Recurrence for Long Sequences
● A useful approach when dealing with very long sequences, or when the previous approaches still do not fit in the available hardware
● Naive approach: split the sequence into multiple smaller ones and process them separately

Transformer-XL [Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context, Dai et al., 2019]
● Adds a component that feeds the hidden states of the previous "segments" as inputs to the current segment's layers
● Adds a relative position encoding scheme to facilitate the recurrence strategy
● Explicitly factors out attention on content and attention on position; compared with typical (absolute-position) attention, the four terms become:
  a) query content to key content
  b) query content to key position – the absolute position embedding U_j is replaced with its relative-position counterpart
  c) query position to key content – U_i is substituted with a learnable parameter u → attend to some terms more than others
  d) query position to key position – U_i is substituted with a learnable parameter v → attend to some positions more than others

Efficient Attention – Comparing Approaches
● The Long-Range Arena Challenge [Long Range Arena: A Benchmark for Efficient Transformers, Tay et al., 2020b] compares the efficient-transformer variants above on tasks with long inputs (results shown on several figure slides)

References
[Qiu et al., 2019] Qiu, J., Ma, H., Levy, O., Yih, S. W. T., Wang, S., & Tang, J. (2019). Blockwise self-attention for long document understanding. arXiv preprint arXiv:1911.02972.
[Beltagy et al., 2020] Beltagy, I., Peters, M. E., & Cohan, A. (2020). Longformer: The long-document transformer. arXiv preprint arXiv:2004.05150.
[Zaheer et al., 2020] Zaheer, M., Guruganesh, G., Dubey, K. A., Ainslie, J., Alberti, C., Ontanon, S., ... & Ahmed, A. (2020). Big Bird: Transformers for longer sequences. Advances in Neural Information Processing Systems, 33, 17283–17297.
[Kitaev et al., 2020] Kitaev, N., Kaiser, Ł., & Levskaya, A. (2020). Reformer: The efficient transformer. arXiv preprint arXiv:2001.04451.
[Wang et al., 2020] Wang, S., Li, B. Z., Khabsa, M., Fang, H., & Ma, H. (2020). Linformer: Self-attention with linear complexity. arXiv preprint arXiv:2006.04768.
[Choromanski et al., 2021] Choromanski, K., Likhosherstov, V., Dohan, D., Song, X., Gane, A., Sarlos, T., ... & Weller, A. (2021). Rethinking attention with performers. arXiv preprint arXiv:2009.14794.
[Dai et al., 2019] Dai, Z., Yang, Z., Yang, Y., Carbonell, J., Le, Q. V., & Salakhutdinov, R. (2019). Transformer-XL: Attentive language models beyond a fixed-length context. arXiv preprint arXiv:1901.02860.
[Tay et al., 2020a] Tay, Y., Dehghani, M., Bahri, D., & Metzler, D. (2020). Efficient transformers: A survey. ACM Computing Surveys (CSUR).
[Tay et al., 2020b] Tay, Y., Dehghani, M., Abnar, S., Shen, Y., Bahri, D., Pham, P., ... & Metzler, D. (2020). Long range arena: A benchmark for efficient transformers. arXiv preprint arXiv:2011.04006.
Transformer Survey Blog: https://www.pragmatic.ml/a-survey-of-methods-for-incorporating-long-term-context/

End of part 2 – set 1 :-)
