In Defense of RAG in the Era of Long-Context Language Models

Tan Yu, Anbang Xu, Rama Akkiraju
NVIDIA, Santa Clara, California, United States
[email protected], [email protected], [email protected]

arXiv:2409.01666v1 [cs.CL] 3 Sep 2024

Abstract

Overcoming the context-length limitations of early-generation LLMs, retrieval-augmented generation (RAG) has been a reliable solution for context-based answer generation in the past. Recently, the emergence of long-context LLMs allows models to incorporate much longer text sequences, making RAG less attractive. Recent studies show that long-context LLMs significantly outperform RAG in long-context applications. Unlike the existing works favoring long-context LLMs over RAG, we argue that the extremely long context in LLMs suffers from a diminished focus on relevant information and leads to potential degradation in answer quality. This paper revisits RAG for long-context answer generation. We propose an order-preserve retrieval-augmented generation (OP-RAG) mechanism, which significantly improves the performance of RAG for long-context question-answer applications. With OP-RAG, as the number of retrieved chunks increases, the answer quality initially rises and then declines, forming an inverted U-shaped curve. There exist sweet spots where OP-RAG achieves higher answer quality with far fewer tokens than a long-context LLM taking the whole context as input. Extensive experiments on public benchmarks demonstrate the superiority of our OP-RAG.

Figure 1: Comparisons between the proposed order-preserve retrieval-augmented generation (OP-RAG) and approaches using long-context LLMs without RAG on the En.QA dataset of ∞Bench: (a) F1 score; (b) input token count. Our OP-RAG uses Llama3.1-70B as the generator and significantly outperforms its counterpart using Llama3.1-70B without RAG.

1 Introduction

Due to the limited context window length (e.g., 4096 tokens) of early-generation large language models (LLMs), retrieval-augmented generation (RAG) (Guu et al., 2020; Lewis et al., 2020) has been an indispensable choice for handling a large-scale context corpus. Since the answer quality is heavily dependent on the performance of the retrieval model, a lot of effort has been devoted to improving retrieval recall/precision when designing a RAG system.

Recently, state-of-the-art LLMs support much longer context windows. For example, GPT-4O (OpenAI, 2023), Claude-3.5 (Anthropic, 2024), Llama3.1 (Meta, 2024b), Phi-3 (Abdin et al., 2024), and Mistral-Large2 (AI, 2024) all support a 128K-token context. Gemini-1.5-Pro even supports a 1M-token context window. The recent emergence of long-context LLMs naturally leads to the question: is RAG necessary in the age of long-context LLMs? Li et al. (2024) recently systematically compares RAG with long-context (LC) LLMs (w/o RAG) and demonstrates that LC LLMs consistently outperform RAG in terms of answer quality.

In this work, we re-examine the effectiveness of RAG in long-context answer generation. We observe that the order of the retrieved chunks in the context of the LLM is vital for the answer quality. Different from traditional RAG, which places the retrieved chunks in a relevance-descending order, we propose to preserve the order of the retrieved chunks as in the original text. Our experiments show that the proposed order-preserving mechanism significantly improves the answer quality of RAG.
Meanwhile, using the proposed order-preserve RAG, as the number of retrieved chunks increases, the answer quality initially rises and then declines. This is because, with more retrieved chunks, the model has access to more potentially relevant information, which improves the chances of retrieving the correct context needed to generate a high-quality answer. However, as more chunks are retrieved, the likelihood of introducing irrelevant or distracting information also increases. This excess information can confuse the model, leading to a decline in answer quality. The trade-off, therefore, is between improving recall by retrieving more context and maintaining precision by limiting distractions. The optimal point is where the balance between relevant and irrelevant information maximizes the quality of the answer. Beyond this point, the introduction of too much irrelevant information degrades the model's performance. This explains the inferior performance of the approach that takes the whole long context as the input of the LLM.

Different from the conclusion of Li et al. (2024), with the proposed order-preserving mechanism, RAG achieves higher answer quality compared with its counterparts that rely solely on long-context LLMs. As shown in Figure 4a, on the En.QA dataset of ∞Bench (Zhang et al., 2024), using only 16K retrieved tokens, we achieve a 44.43 F1 score with Llama3.1-70B. In contrast, without RAG, Llama3.1-70B making full use of the 128K context only achieves a 34.32 F1 score, GPT-4O achieves only a 32.36 F1 score, and Gemini-1.5-Pro obtains only a 43.08 F1 score as evaluated by Li et al. (2024). That is, RAG achieves a higher F1 score even with a significant reduction in input length.

Figure 2: Vanilla RAG versus the proposed order-preserve RAG. As shown in the example, a long document is cropped into 13 chunks, {c_i}_{i=1}^{13}. The similarity score is appended to each chunk. We retrieve the top 4 chunks with the highest similarity scores. Vanilla RAG places the chunks in a score-descending order, whereas the proposed order-preserve RAG places the chunks based on their order in the original document.

2 Related Work

Retrieval-augmented generation. By incorporating external knowledge as context, retrieval-augmented generation (RAG) (Guu et al., 2020; Lewis et al., 2020; Mialon et al., 2023) allows language models to access up-to-date and specific information, reducing hallucinations and improving factual accuracy. Before the era of long-context LLMs, RAG was a promising solution for overcoming the limitation of a short context window.

Long-context LLM. To support long sequences in language models, many efforts have been devoted to improving the computational efficiency of self-attention (Choromanski et al., 2020; Zaheer et al., 2020; Tay et al., 2020; Dao et al., 2022; Dao, 2024) and boosting the extensibility of positional encoding (Press et al., 2021; Sun et al., 2022; Chen et al., 2023). Recently, flagship LLMs such as GPT-4O (OpenAI, 2023), Gemini-1.5-Pro (Reid et al., 2024), Claude-3.5 (Anthropic, 2024), Grok-2 (xAI, 2024), and Llama3.1 (Meta, 2024a) support extremely large contexts. With the existence of long-context LLMs, RAG is no longer an indispensable module for long-context question-answering tasks. Recently, Li et al. (2024) concluded that using long context without RAG can significantly outperform RAG. Different from the conclusion of Li et al. (2024), in this work, we demonstrate that the proposed order-preserve RAG can beat long-context LLMs without RAG.

3 Order-Preserve RAG

Let us denote the long textual context, e.g., a long document, by d. We split d into N chunks sequentially and uniformly, {c_i}_{i=1}^{N}. The index i implies the sequential order of the chunk c_i in d. That is, c_{i-1} denotes the chunk before c_i, whereas c_{i+1} denotes the chunk right after c_i. Given a query q, we obtain the relevance score of the chunk c_i by computing the cosine similarity between the embedding of q and that of c_i:

s_i = cos(emb(q), emb(c_i)),    (1)

where cos(·, ·) denotes the cosine similarity function and emb(·) denotes the embedding function. We retrieve the top k chunks with the highest similarity scores to the query and denote the indices of the top k chunks by J = {j_i}_{i=1}^{k}. We preserve the order of the chunks as in the original long context d; that is, we constrain

j_l > j_m ⇐⇒ l > m.    (2)

Figure 2 visualizes the difference between vanilla RAG and the proposed order-preserve RAG. Different from vanilla RAG, which places the chunks in similarity-descending order, the proposed order-preserve RAG keeps the order of the chunks as in the original document.
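The following is a minimal sketch of the mechanism described above, assuming a pre-chunked document and a generic embedding callable; the function names and the `embed` placeholder are illustrative, not part of the paper.

```python
import numpy as np

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    # Eq. (1): cosine similarity between two embedding vectors.
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def op_rag_context(chunks: list[str], query: str, embed, k: int) -> str:
    """Score every chunk against the query, keep the top k, and concatenate
    them in their original document order (Eq. (2)). `embed` is a placeholder
    callable mapping a string to a NumPy embedding vector."""
    q_emb = embed(query)
    scores = [cosine(q_emb, embed(c)) for c in chunks]            # s_i for each chunk c_i
    top_k = sorted(range(len(chunks)), key=scores.__getitem__, reverse=True)[:k]
    ordered = sorted(top_k)                                       # j_1 < j_2 < ... < j_k
    return "\n\n".join(chunks[i] for i in ordered)
```

The only difference from vanilla RAG is the final `sorted(top_k)`: vanilla RAG would concatenate the selected chunks in score-descending order, whereas order-preserve RAG restores their original positions in the document.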
4 Experiments

4.1 Datasets.

We conduct experiments on the EN.QA and EN.MC datasets of the ∞Bench benchmark (Zhang et al., 2024), specially designed for long-context QA evaluation. To be specific, En.QA consists of 351 human-annotated question-answer pairs. On average, the long context in En.QA contains 150,374 words. We use the F1 score as the metric for evaluation on En.QA. EN.MC consists of 224 question-answer pairs, which are annotated similarly to En.QA, but each question is provided with four answer choices. On average, the long context in En.MC contains 142,622 words. We use accuracy as the metric for evaluation on En.MC. We notice there is another benchmark termed LongBench (Bai et al., 2023). Nevertheless, the average context length of LongBench is below 20K words, which is not long enough to evaluate the recent long-context LLMs supporting a 128K-token window size.

4.2 Implementation details.

We set the chunk size to 128 tokens on all datasets. Chunks are non-overlapping. We use BGE-large-en-v1.5 (Xiao et al., 2023) to extract the embeddings of queries and chunks, by default.
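A sketch of how this setup could be implemented is shown below. The paper does not name a toolkit; loading BGE-large-en-v1.5 through sentence-transformers and reusing its tokenizer for the 128-token splitting are assumptions.

```python
from sentence_transformers import SentenceTransformer

# Assumption: sentence-transformers is one common way to load BGE-large-en-v1.5;
# the paper only states the embedding model, not the library used.
model = SentenceTransformer("BAAI/bge-large-en-v1.5")

def split_into_chunks(text: str, chunk_tokens: int = 128) -> list[str]:
    """Split a document into non-overlapping chunks of roughly 128 tokens,
    measured with the embedding model's own tokenizer (an assumption)."""
    tokenizer = model.tokenizer
    ids = tokenizer.encode(text, add_special_tokens=False)
    return [
        tokenizer.decode(ids[start:start + chunk_tokens])
        for start in range(0, len(ids), chunk_tokens)
    ]

def embed_texts(texts: list[str]):
    # Normalized embeddings make the dot product equal to cosine similarity.
    return model.encode(texts, normalize_embeddings=True)
```

With 128-token chunks, retrieving m chunks yields a context of 128m tokens, which is the context-length knob varied in the ablation below.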
4.3 Ablation Study

The influence of context length. We evaluate the influence of the context length on the performance of the proposed order-preserve RAG. Since each chunk contains 128 tokens, the context length is 128m, where m is the number of retrieved chunks used as the context for generating the answer. As shown in Figure 3, as the context length increases, the performance initially increases. This is because more context has a greater chance of covering the relevant chunk. Nevertheless, as the context length further increases, the answer quality drops, since more irrelevant chunks are used as distractions. To be specific, the Llama3.1-8B model achieves its performance peak when the context length is 16K on both the EN.QA and EN.MC datasets, whereas the best performance of the Llama3.1-70B model is achieved at 48K on EN.QA and 32K on EN.MC. The fact that the peak point of Llama3.1-70B comes later than that of Llama3.1-8B might be because the larger-scale model has a stronger capability to distinguish the relevant chunks from irrelevant distractions.

Figure 3: The influence of context length on the performance of RAG: (a) EN.QA; (b) EN.MC. The evaluations are conducted on the En.QA and EN.MC datasets of ∞Bench.

Order-preserve RAG versus vanilla RAG. As shown in Figure 4, when the number of retrieved chunks is small (e.g., 8), the advantage of the proposed order-preserve RAG over vanilla RAG is not considerable. In contrast, when the number of retrieved chunks is large, our order-preserve RAG significantly outperforms vanilla RAG. To be specific, on the EN.QA dataset, when the number of retrieved chunks is 128, vanilla RAG only achieves a 38.40 F1 score, whereas our order-preserve RAG achieves a 44.43 F1 score. On the EN.MC dataset, retrieving 192 chunks, vanilla RAG only achieves 81.22 accuracy, whereas our order-preserve RAG obtains 88.65 accuracy.

Figure 4: Comparisons between the proposed order-preserve RAG and vanilla RAG: (a) EN.QA; (b) EN.MC. The evaluations are conducted on the En.QA and EN.MC datasets of ∞Bench, using the Llama3.1-70B model.

4.4 Main Results

We compare the proposed order-preserve RAG with two types of baselines. The first category of approaches uses the long-context LLM without RAG. As shown in Table 1, without RAG, the LLM takes a huge number of tokens as input, which is inefficient and costly. In contrast, the proposed order-preserve RAG not only significantly reduces the number of tokens, but also significantly improves the answer quality. For instance, using the Llama3.1-70B model, the approach without RAG only achieves a 34.26 F1 score on EN.QA with an average of 117K tokens as input. In contrast, our OP-RAG with 48K tokens as input attains a 47.25 F1 score. The second category of baselines takes the SELF-ROUTE mechanism (Li et al., 2024), which routes queries to RAG or a long-context LLM based on the model's self-reflection. As shown in Table 1, ours significantly outperforms it while using far fewer tokens as the input of the LLM.

Method                                    EN.QA F1   Tokens   EN.MC Acc.   Tokens
Long-context LLM w/o RAG
  Llama3.1-70B                            34.26      117K     71.62        117K
  GPT-4O                                  32.36      117K     78.42        117K
  Gemini-1.5-Pro                          43.08      196K     85.57        188K
SELF-ROUTE (Li et al., 2024)
  GPT-4O                                  34.95      85K      77.29        62K
  Gemini-1.5-Pro                          37.51      83K      76.86        62K
Llama3.1-70B order-preserve RAG (ours)
  OP-RAG-16K                              44.43      16K      84.72        16K
  OP-RAG-24K                              45.45      24K      88.65        24K
  OP-RAG-48K                              47.25      48K      85.59        48K

Table 1: Comparisons among the long-context LLM without RAG, the SELF-ROUTE mechanism (Li et al., 2024), and the proposed order-preserve (OP) RAG.
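For reference, the En.QA numbers above are F1 scores between generated and gold answers. The paper does not spell out the exact formulation, so the sketch below assumes the standard token-overlap F1 used in SQuAD-style QA evaluation.

```python
from collections import Counter

def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between a predicted answer and a gold answer.
    Assumption: plain whitespace-token F1; the paper does not define the metric."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    common = Counter(pred_tokens) & Counter(ref_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```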
5 Conclusion

In this paper, we have revisited the role of retrieval-augmented generation (RAG) in the era of long-context language models (LLMs). While recent trends have favored long-context LLMs over RAG for their ability to incorporate extensive text sequences, our research challenges this perspective. We argue that extremely long contexts in LLMs can lead to a diminished focus on relevant information, potentially degrading answer quality in question-answering tasks. To address this issue, we proposed the order-preserve retrieval-augmented generation (OP-RAG) mechanism. Our extensive experiments on public benchmarks have demonstrated that OP-RAG significantly improves the performance of RAG for long-context question-answer applications. OP-RAG's superior performance suggests that efficient retrieval and focused context utilization can outperform the brute-force approach of processing extremely long contexts.

References

Marah Abdin, Sam Ade Jacobs, Ammar Ahmad Awan, Jyoti Aneja, Ahmed Awadallah, Hany Awadalla, Nguyen Bach, Amit Bahree, Arash Bakhtiari, Harkirat Behl, et al. 2024. Phi-3 technical report: A highly capable language model locally on your phone. arXiv preprint arXiv:2404.14219.

Mistral AI. 2024. Mistral Large 2.

Anthropic. 2024. Claude 3.5 Sonnet.

Yushi Bai, Xin Lv, Jiajie Zhang, Hongchang Lyu, Jiankai Tang, Zhidian Huang, Zhengxiao Du, Xiao Liu, Aohan Zeng, Lei Hou, Yuxiao Dong, Jie Tang, and Juanzi Li. 2023. LongBench: A bilingual, multitask benchmark for long context understanding. arXiv preprint arXiv:2308.14508.

Shouyuan Chen, Sherman Wong, Liangjian Chen, and Yuandong Tian. 2023. Extending context window of large language models via positional interpolation. arXiv preprint arXiv:2306.15595.

Krzysztof Choromanski, Valerii Likhosherstov, David Dohan, Xingyou Song, Andreea Gane, Tamas Sarlos, Peter Hawkins, Jared Davis, Afroz Mohiuddin, Lukasz Kaiser, et al. 2020. Rethinking attention with performers. arXiv preprint arXiv:2009.14794.

Tri Dao. 2024. FlashAttention-2: Faster attention with better parallelism and work partitioning. In International Conference on Learning Representations (ICLR).

Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. 2022. FlashAttention: Fast and memory-efficient exact attention with IO-awareness. In Advances in Neural Information Processing Systems (NeurIPS).

Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pasupat, and Mingwei Chang. 2020. Retrieval augmented language model pre-training. In International Conference on Machine Learning, pages 3929–3938. PMLR.

Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, et al. 2020. Retrieval-augmented generation for knowledge-intensive NLP tasks. Advances in Neural Information Processing Systems, 33:9459–9474.

Zhuowan Li, Cheng Li, Mingyang Zhang, Qiaozhu Mei, and Michael Bendersky. 2024. Retrieval augmented generation or long-context LLMs? A comprehensive study and hybrid approach. arXiv preprint arXiv:2407.16833.

Meta. 2024a. Introducing Llama 3.1: Our most capable models to date.

Meta. 2024b. Llama 3.1 models.

Grégoire Mialon, Roberto Dessì, Maria Lomeli, Christoforos Nalmpantis, Ram Pasunuru, Roberta Raileanu, Baptiste Rozière, Timo Schick, Jane Dwivedi-Yu, Asli Celikyilmaz, et al. 2023. Augmented language models: A survey. arXiv preprint arXiv:2302.07842.

OpenAI. 2023. GPT-4 technical report. arXiv preprint arXiv:2303.08774.

Ofir Press, Noah A. Smith, and Mike Lewis. 2021. Train short, test long: Attention with linear biases enables input length extrapolation. arXiv preprint arXiv:2108.12409.

Machel Reid, Nikolay Savinov, Denis Teplyashin, Dmitry Lepikhin, Timothy Lillicrap, Jean-Baptiste Alayrac, Radu Soricut, Angeliki Lazaridou, Orhan Firat, Julian Schrittwieser, et al. 2024. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv preprint arXiv:2403.05530.

Yutao Sun, Li Dong, Barun Patra, Shuming Ma, Shaohan Huang, Alon Benhaim, Vishrav Chaudhary, Xia Song, and Furu Wei. 2022. A length-extrapolatable transformer. arXiv preprint arXiv:2212.10554.

Yi Tay, Mostafa Dehghani, Dara Bahri, and Donald Metzler. 2020. Efficient transformers: A survey. arXiv preprint arXiv:2009.06732.

xAI. 2024. Grok-2 beta release.

Shitao Xiao, Zheng Liu, Peitian Zhang, and Niklas Muennighoff. 2023. C-Pack: Packaged resources to advance general Chinese embedding. arXiv preprint arXiv:2309.07597.

Manzil Zaheer, Guru Guruganesh, Kumar Avinava Dubey, Joshua Ainslie, Chris Alberti, Santiago Ontanon, Philip Pham, Anirudh Ravula, Qifan Wang, Li Yang, et al. 2020. Big Bird: Transformers for longer sequences. Advances in Neural Information Processing Systems, 33:17283–17297.

Xinrong Zhang, Yingfa Chen, Shengding Hu, Zihang Xu, Junhao Chen, Moo Khai Hao, Xu Han, Zhen Leng Thai, Shuo Wang, Zhiyuan Liu, and Maosong Sun. 2024. ∞Bench: Extending long context evaluation beyond 100K tokens. arXiv preprint arXiv:2402.13718.
