Long Context vs. RAG for LLMs Evaluation PDF
Document Details
Nanyang Technological University, Fudan University
Xinze Li, Yixin Cao, Yubo Ma, Aixin Sun
Related
- In Defense of RAG in the Era of Long-Context Language Models PDF
- Chapter 3 Introduction to AI, Machine Learning, Deep Learning, and Large Language Models (LLMs).pdf
- LLMs Explained PDF
- EU AI Act & AI Regulation - Augmented LLM
- Large Language Models (LLMs) PDF
- Leveraging LLMs for Email Processing in Customer Centres PDF
Summary
This paper revisits recent studies on enabling LLMs to incorporate external context, focusing on the trade-offs between Long Context (LC) and Retrieval-Augmented Generation (RAG) strategies. It highlights conflicting insights across studies and evaluates several retrieval methods on filtered and expanded datasets. The analysis shows that LC generally outperforms RAG on question-answering benchmarks, particularly for Wikipedia-based questions, while RAG retains advantages for dialogue-based and more general questions.
Full Transcript
Long Context vs. RAG for LLMs: An Evaluation and Revisits
Xinze Li1, Yixin Cao2†, Yubo Ma1, Aixin Sun1†
1 S-Lab, Nanyang Technological University  2 School of Computer Science, Fudan University
{xinze002, yubo001}@e.ntu.edu.sg  [email protected]  [email protected]
arXiv:2501.01880v1 [cs.CL] 27 Dec 2024

Abstract
Extending context windows (i.e., Long Context, LC) and using retrievers to selectively access relevant information (i.e., Retrieval-Augmented Generation, RAG) are the two main strategies to enable LLMs to incorporate extremely long external contexts. This paper revisits recent studies on this topic, highlighting their key insights and discrepancies. We then provide a more comprehensive evaluation by filtering out questions answerable without external context, identifying the most effective retrieval methods, and expanding the datasets. We show that LC generally outperforms RAG in question-answering benchmarks, especially for Wikipedia-based questions. Summarization-based retrieval performs comparably to LC, while chunk-based retrieval lags behind. However, RAG has advantages in dialogue-based and general question queries. These insights underscore the trade-offs between RAG and LC strategies, offering guidance for future optimization of LLMs with external knowledge sources. We also provide an in-depth discussion on this topic, highlighting the overlooked importance of context relevance in existing studies.

1 Introduction
Large Language Models (LLMs) (Brown et al., 2020) have demonstrated strong zero/few-shot capabilities in open-ended question answering (Yang et al., 2019). However, they face challenges such as hallucinations (Shuster et al., 2021; Ji et al., 2023), lacking real-time information and domain-specific knowledge (Su et al., 2024; Zhang et al., 2024), among others.

A common solution is to enhance LLMs with external memory to provide reliable and up-to-date data sources. Yet, incorporating additional content is constrained by the limited context window of LLMs. To address this, two main approaches are adopted: (i) building models with long context windows to read in more information (LC) (Fei et al., 2024; Chen et al., 2023; Wang et al., 2024c), and (ii) employing retrievers to include text segments relevant to the query (RAG) (Jiang et al., 2023; Asai et al., 2024; Gao et al., 2023).

As shown by the timeline in Figure 1a, there is a clear trend toward developing models that handle longer context windows and combining LC with RAG methods. The chronological overview of related studies highlights an increasing focus on both LC and RAG since mid-2023, as evidenced by a growing number of publications aimed at optimizing the efficient retrieval and utilization of long contexts. The development of models supporting longer context windows underscores the growing importance of handling extensive inputs effectively. Despite the broad consensus regarding the importance of LC and RAG, there remain disagreements and contradictory insights from different studies, summarized in Table 1. For example, while several studies agree on the effectiveness of combining LC and RAG (Xu et al., 2024b; Jiang et al., 2024b), others suggest that combining may not be beneficial (Bai et al., 2024a; Jin et al., 2024). Moreover, conflicting conclusions are reported regarding the benefits of RAG versus LC. Some papers find RAG advantageous in certain contexts (Xu et al., 2024a; Yu et al., 2024), while others highlight superior results from LC (Li et al., 2024; Xu et al., 2024b). These divergent insights showcase the complexity and ongoing debates in the field, suggesting that optimal strategies may vary depending on specific model architectures and benchmark conditions.

[Figure 1: Chronological overview of the development of RAG and LC. The sub-graphs respectively illustrate the timelines for (a) publications related to LC and RAG, each paper labeled by a character and one color (for instance, green and "L" represent "LongRAG"); (b) chronological progress of key LLMs from 2023 to 2024, focusing on the models used by the publications in 1a, with models supporting a context window length of ≥ 32K underlined; and (c) the history of frequently used retrievers from the 1980s until 2024, with retrievers that no publication in 1a uses shown in bold. Each model and retriever is labeled with the character and color block of the publication that uses it.]

To explore the underlying reasons, we conduct an in-depth investigation into the conditions that lead to disagreements among existing studies. During this process, we also identify key aspects that may have been overlooked in earlier research. Specifically, we revisit the evaluation process and implement the following changes. First, we filter out questions from existing datasets that can be correctly answered without external context, removing biases from the parametric knowledge of LLMs and focusing on questions requiring external knowledge. Second, we evaluate retrieval methods and baselines on a smaller filtered dataset (1,000+ questions) from 12 QA datasets to identify the best retriever. Third, we expand the dataset size by approximately 10 times by collecting additional data from the original sources of the 12 datasets¹.

…challenges for comparing and combining LC and RAG, reflecting on the key points that researchers tend to overlook in this field. Evaluation results indicate that LC models generally outperform RAG when processing self-contained information like stories, while RAG excels at handling fragmented information, particularly in dialogue-based contexts. These experiments deepen our understanding of the strengths and limitations of LC and RAG, offering
Lastly, valuable insights into optimizing retrieval strate- we compare the answers produced by the two set- gies and effectively integrating these approaches to tings, i.e., LC and RAG, and conduct an in-depth enhance performance in open-domain question an- analysis. Our results are based on the expanded swering. These findings also based on a systematic dataset using the long-context setting and the best survey of existing studies on this topic (see § 2). retrieval method identified earlier. Additionally, we discuss key aspects of comparing Our key contributions in this paper are as follows: LC and RAG in § 6, highlighting areas that have (i) Providing a comprehensive survey of existing been underexplored in prior research. studies on LC and RAG, analyzing their implemen- tations and key insights. (ii) Proposing a fair and 2 Related Work systematic evaluation framework, and performing detailed analyses to understand the strengths and Our primary focus is to evaluate and compare LC limitations of LC and RAG. (iii) Discussing chal- and RAG. To this end, we review papers with a 1 The experiment code and expanded datasets are available similar focus, and provide a detailed analysis of the at https://github.com/lixinze777/LC_VS_RAG retrievers and long-context settings they employ. 2.1 Retrievers Index-based Retrieval requires pre-processing Retrievers, as fundamental components of RAG on the documents with more complicated data struc- pipelines, focus on identifying and extracting con- tures (Gupta et al., 2018). With the development textually relevant segments of documents. We of LLM, Llama-Index (Liu, 2022) was proposed to categorize retrieval strategies into three main ap- facilitate interaction between the model and doc- proaches: chunk-based retrieval, which splits doc- uments more conveniently. The index provides a uments into smaller segments and then retrieves flexible interface to construct various data struc- those most relevant to a query; index-based re- tures, known as “indices” that store, organize, and trieval, which builds specialized index structures facilitate quick retrieval of context. Once created, to guide efficient and context-rich lookups; and these indices can be efficiently queried, guiding the summarization-based retrieval, which leverages hi- LLM to the most relevant information, improving erarchical summaries to capture a document’s key the accuracy of responses. Some classic indexing information at various levels of abstraction. methods include tree index which constructs a hi- erarchical tree from nodes, and knowledge graph Chunk-based Retrieval can be broadly cat- index, which builds a knowledge graph with la- egorized into sparse retrievers and dense re- beled nodes and relationships. trievers. Sparse retrievers, such as the classic Summarization-based Retrieval is built on top BM25 (Robertson and Zaragoza, 2009), operate on of chunk- and index-based approaches. It provides term frequency-based representations of text and comprehensive summaries for key points in a doc- rank chunks based on a similarity function, lever- ument. These summaries available for retrieval. aging exact matches and term weighting. With RAPTOR (Sarthi et al., 2024) improves retrieval the advent of word embeddings, dense retrievers by generating recursive summaries of text chunks have gained prominence. These models encode organized in a tree structure. 
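To make the sparse, term-frequency-based ranking that BM25 performs over chunks concrete, here is a minimal sketch in Python. This is not the paper's implementation: the whitespace tokenizer, the k1/b defaults, and the toy chunks are illustrative assumptions.

```python
import math
from collections import Counter

def bm25_scores(query, chunks, k1=1.5, b=0.75):
    """Score each chunk against the query with classic BM25 (Robertson & Zaragoza, 2009).
    Tokenization is naive lower-cased whitespace splitting, chosen for brevity."""
    docs = [chunk.lower().split() for chunk in chunks]
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / max(n_docs, 1)
    doc_freq = Counter()                      # number of chunks each term appears in
    for doc in docs:
        doc_freq.update(set(doc))
    scores = []
    for doc in docs:
        tf = Counter(doc)
        score = 0.0
        for term in set(query.lower().split()):
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            score += idf * tf[term] * (k1 + 1) / (tf[term] + k1 * (1 - b + b * len(doc) / avg_len))
        scores.append(score)
    return scores

# Rank chunks and keep the highest-scoring ones as the retrieved context.
chunks = ["BM25 ranks chunks by weighted term overlap.",
          "Dense retrievers embed text into vectors instead."]
scores = bm25_scores("how does BM25 rank chunks", chunks)
top = sorted(range(len(chunks)), key=scores.__getitem__, reverse=True)
```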
Instead of retrieving both queries and document chunks into dense vec- short, contiguous text snippets, RAPTOR clusters tor representations and calculate relevance using text segments, summarizes them at various levels, similarity metrics, such as cosine similarity. and forms a hierarchical tree that represents the Since text similarity is often defined by measur- document’s content at different levels of abstrac- ing the distance between embeddings, the quality tion. This allows retrieval models to extract context of these embeddings is particularly important. Con- at varying levels of detail, improving the ability to triever (Izacard et al., 2022) leverages contrastive handle complex questions that require synthesizing learning for training without supervision. By gen- information from multiple parts of the document. erating synthetic queries and pre-training on un- Such a summarization-based retrieval method en- labeled data, Contriever provides robust retrieval hances retrieval accuracy for tasks requiring long- capabilities especially in cross-lingual applications. range or multi-step reasoning. On a larger scale, BGE-Large (Xiao et al., 2023) employs diverse datasets and sophisticated training 2.2 Long-Context LLMs methods to outperform previous models on compre- Many research efforts focus on extending input and hensive benchmarks such as C-MTEB. E5Mistral- output windows to accommodate more context (see 7b (Wang et al., 2024b) combines open-source, Figure 1b), enabling applications such as extended decoder-only LLMs with synthetic data generation dialogues, large document processing, and complex pipelines. With minimal human annotations, the multimodal tasks. Thus, our analysis focuses on fine-tuning achieves SOTA performance on BEIR two dimensions: the model capabilities and the and MTEB. Dragon (Lin et al., 2023) also employs context length they can reach. data augmentation, including cropping and gener- ative queries, and integrates labels from multiple Model Ability. While most of the models dis- retrieval sources. This strategy ensures its effec- cussed here excel at understanding long docu- tiveness without increasing model complexity. An- ments, many emphasize specialized capabilities. other method of learning high-quality embeddings ChatGLM2-6B-32K (Zeng et al., 2024) employs is through strong generalization ability from LLMs. Multi-Query Attention to achieve high reason- For instance, OpenAI embeddings draw upon the ing efficiency with low memory usage, mak- GPT-3.5/4 family while Zhipu-embedding-3 lever- ing it suitable for tasks requiring deep reason- ages the GLM family (Zeng et al., 2024). ing. XGen-7B-8K (Nijkamp et al., 2023) en- hances long-context conversational understanding do DeepSeek-V2-Chat(DeepSeek-AI et al., 2024), and text summarization, enabling coherent and con- Qwen2-72B-Instruct(Yang et al., 2024), Qwen2.5- textually rich dialogues. InternLM-7B-8k (Cai 72B-Instruct (Qwen et al., 2024), GLM-4-9B- et al., 2024) is optimized for knowledge under- Chat (Zeng et al., 2024), GLM-4-Plus, Mistral-12b- standing, reading comprehension, and multilingual Instruct, and Llama-3.1-Instruct. Notably, Gemini- translation, supporting diverse linguistic applica- 1.5-flahs and Gemini-1.5-pro(Reid et al., 2024) tions. Models like DeepSeek-V2-Chat (DeepSeek- both support up to an unprecedented 10M tokens. 
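The dense, embedding-based variant of chunk retrieval described in § 2.1 can be sketched as: split the document into chunks, embed the query and the chunks, and rank by cosine similarity. The `embed` argument below is a placeholder for any of the encoders mentioned in this section (Contriever, BGE-Large, E5-Mistral, OpenAI embeddings); the chunk size, overlap, and top-k values are illustrative rather than the paper's settings.

```python
import numpy as np

def chunk_text(text, chunk_size=200, overlap=50):
    """Split a long document into overlapping word-level chunks."""
    words = text.split()
    step = max(chunk_size - overlap, 1)
    return [" ".join(words[i:i + chunk_size]) for i in range(0, max(len(words), 1), step)]

def retrieve_top_k(query, document, embed, k=5):
    """Rank chunks of `document` by cosine similarity to `query`.
    `embed` is any callable mapping a list of strings to an (n, d) array,
    e.g. a wrapper around a sentence encoder or an embedding-API client."""
    chunks = chunk_text(document)
    chunk_vecs = np.asarray(embed(chunks), dtype=float)
    query_vec = np.asarray(embed([query]), dtype=float)[0]
    # Cosine similarity = dot product of L2-normalised vectors.
    chunk_vecs /= np.linalg.norm(chunk_vecs, axis=1, keepdims=True) + 1e-12
    query_vec /= np.linalg.norm(query_vec) + 1e-12
    sims = chunk_vecs @ query_vec
    return [chunks[i] for i in np.argsort(-sims)[:k]]
```

In the paper's taxonomy this is the chunk-based dense retriever; the index- and summarization-based methods differ mainly in how the units being embedded and searched are constructed.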
AI et al., 2024), Qwen2-72B-Instruct (Yang et al., These ultra long-context models enable the process- 2024), Qwen2.5-72B-Instruct (Qwen et al., 2024), ing of exceptionally large documents, complex mul- Mixtral-7x8b (Jiang et al., 2024a), and DBRX- timodal tasks, and extensive multi-turn dialogues. Instruct excel in mathematical computations, log- ical reasoning, and coding, demonstrating strong 2.3 Comparing & Combining LC and RAG performance in technical and analytical tasks. Since the increase in LLMs’ context window Additionally, Claude-3-Opus, Sonnet, Haiku, lengths, some models can contain the entire docu- Gemini-1.5-flash, and Gemini-1.5-pro (Reid et al., ment, reducing the need to retrieve on documents. 2024) incorporate multi-modal capabilities, effec- Hence, more studies have begun comparing the tively handling both textual and visual informa- performance of long-context LLMs and RAG, as tion. GLM-4-9B-Chat (Zeng et al., 2024), Mistral- well as investigating ways to combine them. Long- 12b-Instruct, and Llama-3.1-Instruct (Dubey et al., Bench (Bai et al., 2024a) conducts early compari- 2024) offer robust multilingual abilities, strong son experiments on a 4K model with RAG and a instruction-following and multi-turn dialogue ca- 32K model. Xu et al. (2024b) systematically com- pabilities, increasing their utility in a wide range pare LC LLMs and RAG, and proposes their combi- of conversational scenarios. Finally, Claude-2 is nation. LongRAG (Jiang et al., 2024b) introduces notable for low hallucination rate when processing long retrievers and long readers, a successful appli- extra-long documents, ensuring high accuracy and cation of long retrieval units to RAG. ChatQA2 (Xu reliability in information retrieval and synthesis. et al., 2024a) instruction-tunes long-context LLMs to a 128K context window and tests their ability Context Length. As shown in Figure 1b, there is with long-context retrievers. Self-ROUTE (Li et al., a clear trend of increasing context length in newly 2024) enables the model to select either RAG or released models. Following the categorization ap- LC based on self-reflection to reduce costs. OP- proach proposed by ChatQA2 (Xu et al., 2024a), RAG (Yu et al., 2024) preserves the original order we classify these models into three categories based of retrieved chunks, and LC LLM meets RAG (Jin on their supported context windows: short (up to et al., 2024) investigates long-context LLMs in 4K), long (up to 32K), and ultra-long (more than RAG systems, proposing retrieval reordering meth- 32K) context models. ods. LC RAG Performance of LLM (Leng et al., Short context models, such as Llama2-70B and 2024) evaluates the effectiveness of RAG on long- llama2-7B-chat-4k (Touvron et al., 2023), support context LLMs across context lengths from 2K to up to 4K tokens and are typically employed as 2M tokens. Very recently, LongBench is updated baselines for retrieval and standard conversational to LongBench V2 (Bai et al., 2024b), which tests tasks. Long context models, including XGen-7B- LLMs on long context comprehension and reason- 8K(Nijkamp et al., 2023), InternLM-7B-8k(Cai ing with a more realistic and challenging setting. et al., 2024), Mixtral-7x8b (Jiang et al., 2024a), We summarize the key insights from these pa- DBRX-Instruct and Gemma2-9B (Mesnard et al., pers into three categories: (1) general insights 2024), offer context windows ranging from 8K to such as chunking strategies, (2) combining the two 32K tokens. 
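The short / long / ultra-long split used above is a simple threshold rule over the supported window size. A literal rendering, with the 4K and 32K boundaries interpreted as 4,096 and 32,768 tokens by convention (an assumption, since the text only says "up to 4K" and "up to 32K"):

```python
def context_category(window_tokens: int) -> str:
    """Classify a model by its supported context window, following the
    ChatQA2-style split described above: short (<=4K), long (<=32K), ultra-long (>32K)."""
    if window_tokens <= 4_096:
        return "short"
    if window_tokens <= 32_768:
        return "long"
    return "ultra-long"

assert context_category(128_000) == "ultra-long"   # e.g. GPT-4-Turbo's 128K window
assert context_category(8_192) == "long"           # e.g. an 8K model such as XGen-7B-8K
```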
These are ideal for extended con- strategies, and (3) comparing the performance be- versations, comprehensive text analysis, and de- tween LC and RAG (see Table 1). tailed summarization tasks. Ultra-long context Some papers reach consensus on chunking strat- models extend beyond 32K tokens. For example, egy that, retrieval units should be longer (Jiang Claude-2 provides a 100K token window, while et al., 2024b) and the number of chunks should Claude-3-Opus, Sonnet, and Haiku handle up to be kept low (Yu et al., 2024). According to (Xu 200K tokens. GPT-4-Turbo(OpenAI et al., 2023), et al., 2024b), selecting the top 5 to 10 chunks typ- GPT-4o, and GPT-o1 all support 128K tokens, as ically yields strong performance, while retrieving Paper Type Findings LongBench (B) Retrieval helps 4k model, but not 16k/32k models. (Bai et al., 2024a) + Models benefit from continuous training on long contexts. + Splitting context into shorter and more chunks is better. Ret-LC LLM (R) ⋆ LC is better for multi-hop benchmarks than 4k RAG. (Xu et al., 2024b) ○ RAG improves on 70B/43B models on all context lengths. + For LC model, best results are obtained from top-5 or top-10. LongRAG (L) ○ Retrieval benefits from long retrieval units. (Jiang et al., 2024b) ChatQA2 (C) ☆ For sequence lengths up to 32K, RAG outperforms LC. (Xu et al., 2024a) ○ From 3K to 24K, greater context window benefits RAG. Self-ROUTE (S) ⋆ LC consistently outperforms RAG, but RAG has lower cost. (Li et al., 2024) OP-RAG (O) ☆ Efficient retrieval can outperform brute-force LC. (Yu et al., 2024) + Too many chunks in RAG harms performance. + Preserving the original order is better than ordering by score. LC LLM-RAG (M) Retrieve more passages first improves performance then drops. (Jin et al., 2024) + Ordering higher score information to front and back helps. LC RAG ○ Most close models’ RAG improves up to 100k tokens. Performance (P) Most open models’ RAG peak at 16k-32k then performance drops. (Leng et al., 2024) LongBench v2 (V) ☆ GPT-4o performs better at 128k without RAG. (Bai et al., 2024b) ○ GPT-4o performance keeps increasing to 128k RAG context. Qwen2.5 & GLM-4-Plus drop with >32k RAG contexts. Table 1: Important findings from existing studies that compare or combine LC with RAG (label in brackets). We group the insights into three categories: 1) General strategies that improve performance marked by +. 2) Combining LC and RAG, where ○ indicates combining is good, and for combining is not helpful, and 3) Comparing LC and RAG, where ☆ indicates RAG outperforms LC, and ⋆ for LC outperforms RAG. more than 20 chunks leads to diminished results. et al., 2024) that RAG for close-source models can LongBench (Bai et al., 2024a) presents a different improve up to 100K input, whereas performance finding, suggesting that splitting a long context into for some open-source models peaks around 16K shorter and more numerous chunks is better. How- tokens. Hence, the varying behaviors might be due ever, at the time of its publication, LLMs generally to different model size and architecture. exhibited weaker long-context capabilities, and the There are even greater discrepancies in the direct study did not incorporate very long retrieval units comparisons between the two methods. Xu et al. (>1000 tokens). Consequently, LongBench’s find- (2024b) claims that long-context models outper- ings are not at odds with the broader consensus. form retrieval with short-context models in multi- Nonetheless, these papers present disagreement hop benchmarks. 
In contrast, ChatQA2 (Xu et al., regarding performance of retrieval on long-context 2024a) finds that RAG can outperform LC if a LLMs. For instance, LongBench (Bai et al., 2024a) sufficient number of top-k chunks are used. Self- finds that retrieval helps short-context models but ROUTE (Li et al., 2024) fully supports LC, arguing not 7B long-context models. In contrast, Xu et al. that it outperforms RAG in all benchmarks. Mean- (2024b) suggest that RAG improves 70B models while, OP-RAG (Yu et al., 2024) defends RAG, across all context lengths, attributing the discrep- demonstrating that efficient retrieval strategies can ancy to the difference between model sizes. Sim- outperform a brute-force approach of processing ilarly, ChatQA2 (Xu et al., 2024a) observes that extremely long contexts. increasing the context window from 3K to 24K The reasons for the differences among these stud- tokens consistently benefits RAG. Notably, Long- ies are manifold. For instance, There are three Bench V2 (Bai et al., 2024b) shows that GPT-4o categories of retrieval methods (i.e., chunk-based, continues to improve in RAG performance even index-based, and summarization-based retrieval), at 128K input, whereas Qwen2.5 and GLM-4-Plus but current studies rely predominantly on chunk- show performance deterioration beyond 32K input. based retrieval, leaving room for further optimiza- The observations align with findings from (Leng tion. Additionally, evaluation scores often repre- sent weighted averages across different datasets. 3.2 Question (and Context) Expansion Because each dataset has distinct characteristics, RAG and LC produce identical answers for about placing more emphasis on one dataset and less on 60% of the questions in existing evaluations (Li another can alter the final results. Finally, most ex- et al., 2024), leaving relatively few questions to isting studies use only a few datasets with around help us understand the differences between the two. 200 questions each. This small sample size creates To ensure robust statistical significance, we expand greater room for variability and reduces the general the dataset size to approximately 20,000 questions reliability of these findings. by collecting additional samples. To maintain a similar distribution as the origi- 3 Question Filtering and Expansion nal datasets, we follow two principles during data To ensure a fair and comprehensive comparison, collection. First, we collect questions only from we curate our evaluation dataset based on existing the original source of each dataset, avoiding arti- datasets, and apply necessary filtering (§ 3.1) and ficially generated or LLM-augmented questions. augmentation (§ 3.2). We select 12 long-context Second, we add distracting passages to the origi- QA datasets frequently used in studies comparing nal context for each question to extend the context LC and RAG: Natural Questions (Kwiatkowski length, following the implementation described in et al., 2019), 2WikiMultihopQA (Ho et al., 2020), LongBench. For NovelQA, we use all its available HotpotQA (Yang et al., 2018), MuSiQue (Trivedi questions. For Coursera, MultiFieldQA, and Multi- et al., 2022), MultiFieldQA (Bai et al., 2024a), Nar- Doc2Dial datasets, we do not further enlarge their rativeQA (Kočiský et al., 2018), QASPER (Dasigi sizes to avoid introducing artificial data. 
et al., 2021), QuALTY (Pang et al., 2022), Cours- Hereafter, we refer to the expanded dataset as the era, TOEFL-QA, and MultiDoc2Dial (An et al., full question set and the original, pre-expansion 2024). We also include the NovelQA (Wang et al., dataset as the sample question set. 2024a) dataset, a high-quality, human-annotated re- 3.3 Dataset Statistics source derived from long-form novels. We present an overview of these datasets in Table 2, including After expansion, we obtain 19,188 questions, of their type, context type (single-doc or multi-doc), which 13,651 require context to be answered using context source, average context length, and repre- the filtering method from § 3.1, as listed in Table 3. sentative studies that have utilized each dataset. Notably, questions grounded in factual knowledge, such as those from Coursera, show a high removal 3.1 Question Filtering rate. Similarly, questions drawn from well-known books or requiring multi-hop reasoning often ex- Given the strong capabilities of modern LLMs, hibit a higher likelihood of being directly answered many questions can be directly answered based on by LLMs without context. Comparing the 12 indi- knowledge encoded in their parameters (Basmova vidual datasets, we observe a similar filtering rate et al., 2024), reducing the need for external context between the sample and the full question sets (see in some cases. However, certain queries, such as Tables 2 and 3), indicating that both sets follow a those related to private conversations, will always similar distribution. require additional context. To determine which ap- proach more effectively enhances an LLM’s perfor- 4 Evaluation Methodology mance with long documents, we filter the datasets to include only questions that the LLM cannot 4.1 Evaluation Framework answer correctly without external context. This Our evaluation of RAG and LC is conducted in the ensures that any correct answers obtained subse- following three phases. quently must rely on external knowledge rather Phase 1: Empirical Study on Retrievers. We than the model’s built-in knowledge. evaluate five retrievers: BM25, Contriever, OpenAI For our implementation, we use GPT-4o for Embeddings, Llama-Index, and RAPTOR, on the question filtering due to its strong capabilities. We sample question set. The retriever yielding the employ a strict exact-match scoring metric to en- best performance is then selected for subsequent sure that the model not only provides the correct comparisons with LC on the full question set. answer but also demonstrates a complete under- Phase 2: Comparing RAG and LC. Using the standing of the required information. 
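The filtering step described above reduces to: ask the model each question with no context, and keep the question only if the closed-book answer fails a strict match against the reference. A minimal sketch, where `ask_llm` is a hypothetical wrapper around the GPT-4o call and the answer normalization is illustrative rather than the paper's exact metric:

```python
import re
import string

def normalize(answer: str) -> str:
    """Lower-case, strip punctuation and articles, and collapse whitespace."""
    answer = answer.lower().translate(str.maketrans("", "", string.punctuation))
    answer = re.sub(r"\b(a|an|the)\b", " ", answer)
    return " ".join(answer.split())

def needs_external_context(question: str, gold: str, ask_llm) -> bool:
    """True if the model cannot answer `question` correctly without any context.
    `ask_llm` is a hypothetical callable that sends a prompt to the LLM and
    returns its answer string."""
    closed_book = ask_llm(f"Answer the question briefly with no explanation: {question}")
    return normalize(closed_book) != normalize(gold)

# Keep only questions that genuinely require external knowledge:
# filtered = [q for q in dataset if needs_external_context(q["question"], q["answer"], ask_llm)]
```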
best retriever, RAG is compared with LC by an- Dataset T Doc Source Avg Len Used by Papers #Q # Kept % Kept Mode NQ K multi Wikipedia 18,164.7 M, P 109 22 20 Open Coursera K multi Coursera 7,934.3 NIL (L-eval) 172 54 32 MCQ NovelQA C single books 67,000.0 NIL (NovelQA) 210 109 52 MCQ 2WikiMHQA R multi Wikipedia 7,191.3 B, S, M 300 152 51 Open HotpotQA R multi Wikipedia 10,602.7 B, R, L, C, S, M 200 93 47 Open MuSiQue R multi Wikipedia 12,974.3 B, R, C, S 200 140 70 Open MultiFieldQA C single papers, reports 5,706.1 B, R, L, C, S 150 121 81 Open NarrativeQA C single books, films 25,274.2 B, R, S 200 171 86 Open QASPER C single papers 5,350.3 B, R, C 224 221 99 Open QuALTY C single stories 5,089.2 R, C 202 202 100 MCQ TOEFL-QA C single exams 729.1 NIL (L-eval) 121 121 100 MCQ MultiDoc2Dial C multi dialogue 3,076.9 NIL (L-eval) 158 158 100 Open Table 2: Overview of the original datasets (i.e., the pre-expanded sample question set) and their characteristics. The column “T” represents dataset type with values “K” for “Knowledge”, “R” for “reasoning”, and “C” for “reading comprehension”. For each dataset, we report the existing papers (with the label) about LC & RAG that use it. If no paper has used it, we report its source like L-eval (An et al., 2024). We also report number of questions in each set (# Q), number and percentage of questions retained after filtering (# Kept and % Kept) out questions needing no context, and mode of question. Dataset # Questions # Kept Q % Kept Q 4.2 Retriever Selection Coursera 172 54 32 NQ 1,109 373 34 Figure 1 shows that existing studies primarily select NovelQA 2,283 869 38 one or more chunk-based retrieval methods, while 2WikiMHQA 2,300 1,036 45 index- and summarization-based retrievers are less HotpotQA 2,200 1,113 51 MuSiQue 2,200 1,663 78 frequently evaluated. In our study, we evaluate MultiFieldQA 150 121 81 various retrieval methods to ensure that RAG is NarrativeQA 2,211 1,880 85 supported by the most effective retrievers. QASPER 2,718 2,674 98 QuALTY 2,725 2,725 100 For chunk-based retrieval, we use TOEFL-QA 962 962 100 BM25 (Robertson and Zaragoza, 2009), Con- MultiDoc2Dial 158 158 100 triever (Izacard et al., 2022), and OpenAI’s Total 19,188 13,628 71 text-embedding-3-Small. BM25 serves as a Table 3: Statistics of the full question set, ordered by classic baseline, while Contriever and text- increasing percentage of questions kept after filtering embedding-3-Small represent embeddings from out questions needing no context. well-performing closed-source and open-source models, respectively. swering questions on the full question set. Both For index-based retrieval, we employ Llama- methods use the same underlying LLM for ques- index and leverage two indexing methods that suit tion answering. For RAG, relevant documents or long documents. Specifically, tree-index organizes chunks are fetched from the available context and documents into a hierarchical tree structure, en- provided to the LLM as input to generate answers. abling efficient retrieval of context. The root node In contrast, for LC, the entire context available to contains a high-level summary, while subsequent the question is given to the LLM, with truncation child nodes store progressively finer-grained repre- from the back of the context applied if the context sentations. When queried, the retrieval process nav- exceeds the model’s context window. The evalua- igates through this hierarchy, starting from the top- tion metrics are explained in § 4.3. 
level summary and moving down to more specific Phase 3: In-depth Analysis. We focus on 4 spe- nodes as needed. Sentence Window Retriever cific subsets of questions: 1) those answered cor- focuses on local, sentence-level context rather than rectly only by RAG, 2) those answered correctly entire documents or large text chunks. It creates only by LC, 3) those RAG gives better answers, and smaller “windows” of a few sentences each. When 4) those LC gives better answers. These subsets a query arrives, the retriever searches these win- are analyzed to understand the types of questions dows to identify segments most semantically simi- each method excels at, providing insights into the lar to the query. By working at a finer granularity, strengths and limitations of both approaches in dif- the sentence window retriever provides more tar- ferent scenarios. geted and contextually accurate snippets of text, Match (EM) score strictly to all questions to de- termine the correctness of the answers. Excluding Questions Questions Correct Both Only LC Answers the overlap, the top right block indicates the ques- Answered Answered by LC Correctly Correctly (EM) tions that only LC answers correctly, and similarly, the bottom left block indicates the questions that LC answers better only RAG answers correctly. Questions Questions (F1) Only RAG Both The remaining gray block represents the ques- Answered Answered Correctly RAG Wrongly tions that both RAG and LC answer incorrectly, as answers better (F1) judged by Exact Match. Since many questions in- Correct Answers volve long open-ended responses, we calculate the by RAG (EM) F 1 scores of the answers provided by both meth- Figure 2: Evaluation Matrix for In-depth Analysis. ods against the ground truth. If RAG achieves a higher F 1 score than LC, we consider RAG to have answered the question better, and vice versa for LC. improving the model’s ability to answer specific A detailed explanation of F 1 score calculation is questions. provided in appendix A For summarization-based retrieval, we use The loose evaluation setting considers all cases RAPTOR (Sarthi et al., 2024). It constructs a hier- in which one method outperforms the other, includ- archical tree by recursively clustering text chunks ing 1) when one method obtains the correct answer based on semantic similarity, summarizing each and the other is wrong under EM, and 2) when cluster into a parent node, and continuing this pro- one method achieves a higher F 1 score. We adopt cess until no further clustering is possible. After this loose evaluation because references for some constructing the tree, we apply the collapsed tree datasets are long, open-ended answers, making it traversal approach, as previous work has demon- very unlikely to match them exactly under EM. In strated its superior performance. This approach addition, some short answers (about 5–6 words) flattens the hierarchical structure into a single layer may differ slightly from the reference while still and compares the query against all nodes across conveying the correct idea. Although these answers every level simultaneously. The top-k most rele- would be marked incorrect by EM, they might at- vant nodes are then selected based on a predefined tain a high F 1 score. Hence, comparing F 1 scores token limit, ensuring that the retrieved information helps compensate for the strictness of EM. maintains the appropriate level of granularity. 
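The strict-plus-loose scoring described here can be sketched as follows: exact match decides the "only LC" / "only RAG" blocks of the evaluation matrix, and a token-overlap F1 (the standard SQuAD-style formulation; the paper's exact calculation is given in its appendix A) breaks ties for open-ended answers. The normalization and tie handling below are illustrative assumptions.

```python
from collections import Counter

def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between a predicted answer and the reference."""
    pred_tokens, ref_tokens = prediction.lower().split(), reference.lower().split()
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

def compare_answers(lc_answer: str, rag_answer: str, gold: str) -> str:
    """Win/lose label for one question: strict exact match first, then F1 as tiebreaker."""
    lc_em = lc_answer.strip().lower() == gold.strip().lower()
    rag_em = rag_answer.strip().lower() == gold.strip().lower()
    if lc_em and not rag_em:
        return "LC only"
    if rag_em and not lc_em:
        return "RAG only"
    lc_f1, rag_f1 = token_f1(lc_answer, gold), token_f1(rag_answer, gold)
    if lc_f1 > rag_f1:
        return "LC better"
    if rag_f1 > lc_f1:
        return "RAG better"
    return "tie"
```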
Although RAPTOR’s implementation appears 5 Experiments similar to the Llama Tree Index, they differ in both construction and navigation. First, Llama Tree To obtain answers, we use the same prompt “From Index groups consecutive nodes, while RAPTOR the context: [context], answer the questions briefly freely clusters nodes from far positions, and even with no explanation.” for both retrieval and long allows a single node to appear in multiple clusters. context settings. For MCQ questions, we add one Second, Llama Tree Index navigates down the hier- sentence “Answer the question with the letters of archy to retrieve only leaf nodes, while RAPTOR the correct options (e.g. A, BC, C, ACD, etc.) with- evaluates all nodes from all layers simultaneously. out including text”. These prompts ensure LLMs Hence, RAPTOR can retrieve not only original to directly answer the questions, which makes eval- texts but also generated summaries. uation more convenient. 4.3 Evaluation Metric 5.1 Phase 1: Retrievers We use a win-lose rate system to compare LC and Evaluated on the sample question set, Ta- RAG, as illustrated in Figure 2. The horizontal ble 5 reports the results of chunk-, index-, and yellow block represents the questions that the LLM summarization-based retrievers. Among them, answers correctly using LC, while the vertical blue RAPTOR performs the best with a correct answer block represents the questions that the LLM an- rate of 38.5%, while Index-based retrievers outper- swers correctly using RAG. Their overlap in the form chunk-based retrievers. Within index-based top-left corner represents the questions that both retrievers, the “RAG Only” score for Tree Index methods answer correctly. We apply an Exact is much lower than that for Window Parsing (82 Dataset # Questions LC Correct RAG Correct LC Only RAG Only LC Better RAG Better Coursera 54 26 20 10 4 10 4 2WikiMHQA 1,036 594 431 242 79 265 107 HotpotQA 1,113 876 723 212 59 231 67 MultiFieldQA 121 63 60 14 11 44 21 NQ 373 189 138 75 24 104 35 NarrativeQA 1,880 558 405 276 123 685 281 QASPER 2,674 884 863 517 496 1,011 762 QuALITY 2,725 2,290 2,050 402 162 402 162 TOEFL-QA 962 895 884 26 15 26 15 MuiQue 1,663 821 663 344 186 426 225 MultiDoc2Dial 158 14 38 5 29 65 58 NovelQA 869 466 408 164 106 164 106 Overall 13,628 7676 6,683 2,287 1,294 3,433 1,843 Table 4: Performance of LC and RAG across different datasets. We report the number of questions answered correctly by each method, as well as the breakdown of questions where: only LC answers correctly (LC Only), only RAG answers correctly (RAG Only), LC outperforms RAG (LC Better), and RAG outperforms LC (RAG Better). Type Retriever Correct (%) RAG Only RAG Better swers 1,843 questions better than LC. The gap fur- BM25 319 (20.4) 50 141 Chunk Contriever 315 (20.1) 43 143 ther widens compared to strict setting, indicating Text-emb-3-small 338 (21.6) 47 151 long-context LLM’s ability to answer questions Tree Index 470 (30.1) 82 234 Index Window Parsing 555 (35.5) 91 237 with open long answers is also strong. Summarization RAPTOR 602 (38.5) 97 258 Looking at individual datasets, in Multi- Table 5: Comparison of different retrieval methods Doc2Dial, RAG exhibits better performance than LC in strict evaluation (5 vs 29), but is surpassed by LC in loose evaluation (65 vs 58). In contrast, on vs. 91), and their “RAG Better” scores are nearly datasets like NarrativeQA and QuaLITY, LC shows identical (234 vs. 237). 
This discrepancy suggests a strong lead not just in overall correctness but also that Tree Index may be undervalued in the “RAG in the number of questions that are answered better. Only” metric but still contributes in open question Collectively, the results show that both methods scenarios that require long answers. have unique strengths and limitations. We further observe the questions and contexts Although LC shows better overall results than that each retriever exclusively answers correctly. RAG, out of the 13,628 questions, almost 10% can RAPTOR shows stronger ability than other retriev- be only answered correctly by RAG, which is not ers, especially in scenarios that require an entire un- a small ratio. This shows that retrievers cannot be derstanding of the document, like research papers. simply replaced by long-context LLM in searching. Chunk-based methods struggle when required in- This also motivates us to further examine what kind formation is spread across multiple chunks. Index- of questions (and context) can be only answered based retrievers are not as strong in overall under- correctly by RAG (or LC). standing as RAPTOR, but they show good ability in interpreting dialogues. Therefore, we select RAP- 5.3 Phase 3: In-Depth Analysis TOR as the primary retriever for evaluation on the The overall results are influenced by the combined full question set. effects of different scenarios, so we need to sepa- rately analyze each scenario to see if more detailed 5.2 Phase 2: Comparing LC and RAG results can be obtained. We analyze the perfor- We compare LC and RAG on the filtered, full ques- mance of LC and RAG across different knowledge tion set. The results across 12 datasets are sum- sources (Figure 3) and question types (Figures 4). marized in Table 4. Overall, LC correctly answers Here, we use EM Scores only, for a strict evaluation 56.3% of the questions, while RAG provides cor- standard. We also report the results for loose evalu- rect answers to 49.0%. LC correctly answers more ation standard (i.e., EM Scores and F 1 Scores) in than 2,000 questions that RAG misses, while RAG appendix B, which shows similar trends. exclusively answers almost 1,300 questions. When From Figure 3, it is evident that LC excels with looking at the loose evaluation setting, LC answers knowledge sources such as Wikipedia and sto- 3,433 questions better than RAG, and RAG an- ries. However, the Wikipedia context is collected Dialogue 31 44 LC Who 268 101 LC RAG RAG Paper/Report 531 507 When 182 73 Story 842 391 Where 142 59 Which 250 110 WikiPedia 873 348 0 200 400 600 800 1000 1200 What 919 468 Word Count Why 139 63 Figure 3: Performance breakdown by knowledge source How 333 280 for LC Only and RAG Only. Other 54 140 0 200 400 600 800 1000 1200 by adding extensive noise to create long context, Word Count Figure 4: Performance breakdown by question type for which generally makes the context less relevant LC Only and RAG Only. to the question, with only a small portion being useful. This synthetic context formation partially 0.30 LC simulates the RAG process and may introduce an 0.25 RAG TF-IDF Score unfair bias against the RAG pipeline. In addi- 0.20 tion, summarization-based retrieval methods may 0.15 split Wikipedia articles unnaturally, generating less 0.10 meaningful summaries. LC’s strong performance sotnry firsg fil t no m citel pe y pe mees tas r neet onw time ea e mo rth l de da rme l v op demonstrates that long-context LLMs are robust to un t i rfo co noise in such forms of context. 
Words In contrast, RAG performs better with dialogue- Figure 5: Top 15 Words based on TF-IDF Score for LC related sources and achieves comparable perfor- Only vs. RAG Only. mance with papers or reports. The information in these sources is naturally segmented, conversations have turns, and papers and reports have clearly de- questions in the datasets where either LC or RAG fined sections or subsections, making the retrieval produced correct answers exclusively. Specifically, of key segments easier. all questions from each dataset are concatenated Figure 4 shows that LC performs better for fact- and treated as a single document for this analysis, based questions such as “Who”, “Where”, and meaning that the TF-IDF scores primarily reflect “Which”. These questions often benefit from having the term frequency within each dataset. Stopwords all the relevant context available in a dense region are removed and not shown in the plot. close to the answer. RAG, however, is largely com- Figure 5 presents the top 15 words that appear parable to LC for more open-ended questions such most frequently combined in both LC only and as “How”, which often require synthesizing infor- RAG only questions. Words such as ‘song’, ‘film’, mation from multiple sources and therefore benefit and ‘novel’ have higher TF-IDF scores for LC, from retrieval-based approaches. suggesting that LC performs better with narrative Furthermore, RAG outperforms LC in the topics. Conversely, words like ‘country’, ‘dataset’, “Other” questions, which consist mainly of general and ‘model’ have higher scores for RAG, indicating questions that can be answered with “Yes” or “No”. its strength in retrieving information on technical We hypothesize that the reason could be due to the or data-oriented topics. This analysis underscores training data. Long-context LLMs are more famil- the complementary strengths and limitations of LC iar with phrasing of common type questions than and RAG in handling different types of questions. general questions. Words like “Who” or “Where” act as keywords for long-context LLMs to search, 5.5 Impact of Generation Model in RAG while retrievers use these keywords not so well. We now evaluate the impact of different generation models on RAG’s performance. Table 6 shows the 5.4 Word Frequency Visualization results of using GPT-4o and GPT-4-Turbo as the To better understand the scenarios that LC and generator with three retrievers (BM25, Tree Index, RAG each excels at, we visualize the word fre- RAPTOR), each of which represents one retriever quencies by their TF-IDF scores, plotted in Fig- type. The results indicate that the performance of ure 5. The TF-IDF scores were calculated from different generation models remains largely con- Retriever Model Correct (%) RAG Only RAG Better Question: What is the debt-to-GDP GPT-4o 319 (20.4) 50 141 ratio of the country where Anthony BM25 Upko was formerly involved in the GPT-4-Turbo 310 (19.8) 51 152 government? GPT-4o 470 (30.1) 82 234 Tree-Index Wrong Answer: The context does not GPT-4-Turbo 458 (29.3) 81 229 provide the debt-to-GDP ratio for GPT-4o 602 (38.5) 97 258 Nigeria. RAPTOR GPT-4-Turbo 589 (37.7) 99 295 Gold: 11 percent Table 6: Results of using different generation models Relevant Sents: 1. Nigeria is the world’s 20th largest economy... the debt-to-GDP ratio is only 11 percent. 2. Anthony Ukpo was Minister of sistent regardless of the retriever used. 
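The word-frequency analysis in § 5.4 boils down to treating each group of exclusively-answered questions as a document and ranking its terms by TF-IDF with stopwords removed. A sketch with scikit-learn; note that the paper concatenates questions per dataset, whereas this simplified version, for brevity, treats the two exclusive-answer groups as one document each:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

def top_tfidf_terms(lc_only_questions, rag_only_questions, k=15):
    """Rank the most characteristic terms of the LC-only and RAG-only question groups."""
    docs = [" ".join(lc_only_questions), " ".join(rag_only_questions)]
    vectorizer = TfidfVectorizer(stop_words="english")
    weights = vectorizer.fit_transform(docs).toarray()
    vocab = vectorizer.get_feature_names_out()
    top = {}
    for label, row in zip(["LC only", "RAG only"], weights):
        best = row.argsort()[::-1][:k]
        top[label] = [(vocab[i], round(float(row[i]), 3)) for i in best]
    return top

# Toy example: narrative words should surface for LC, technical words for RAG.
print(top_tfidf_terms(["Who wrote the song?", "Which novel features Emily?"],
                      ["Which dataset did the model use?", "How was the model trained?"]))
```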
RAPTOR Information and Culture, and then performs the best across both generation models, Governor of Rivers State, Nigeria. though there is a slight decrease in performance Question: When is the performer of song Swing Down Sweet Chariot ’s when using GPT-4-Turbo compared to GPT-4o. birthday? While GPT-4o slightly outperforms GPT-4- Wrong Answer: May 8, 1940 Gold: January 8, 1935 Turbo across all retrievers, the differences are Relevant Sents: 1. Swing Down Sweet marginal. This implies that both generation models Chariot is a traditional song... are capable of generating high-quality responses, recorded by Elvis Presley. 2. Elvis Aaron Presley (January 8, and the choice between them may depend more 1935 - August 16, 1977), also known as on other factors such as efficiency or resource... availability. The consistency across retrievers also Table 7: Examples cases where RAG made mistakes demonstrates that the retrieval method plays a larger role in determining overall performance than Question: Do the tweets come from a specific region? the specific generation model used. We will report Wrong Answer: Yes, the tweets come the results from other models and the experiment from 16 different countries. is in progress. Gold: No Relevant Sents: This helped us narrow down our query space to 16 countries. 5.6 Case Study Question: Where did Valancourt lose his wealth? For a deeper understanding of the difference be- Wrong Answer: In Gambling. tween LC and RAG, we conduct a case study to Gold: Paris Relevant Sents: Returning to her analyze the frequent errors from each method, and aunt’s estate, Emily learns that present them in Tables 7 and 8. We manually ex- Valancourt has gone to Paris and lost his wealth. amine the questions that only RAG made mistakes, Table 8: Examples representing common cases where and those only LC made mistakes. only RAG answers correctly The most frequent mistake made by RAG is its failure to retrieve the relevant context, leading to its refusal to answer the question. As shown in tence ‘Swing Down Sweet Chariot is a traditional Table 7, the model correctly identifies that Anthony song... recorded by Elvis Presley’ spans too long, Upko was formerly involved in the government of creating ambiguity in linking the birthday to the cor- Nigeria but fails to retrieve the debt-to-GDP ratio as rect person. This type of retrieval failure highlights part of the context. This retrieval failure can arise a core limitation: RAG relies heavily on retriev- due to two possible reasons: the retriever might fail ing continuous text spans, and any fragmentation to locate the relevant sentences from documents, or overly long context can lead to an incomplete or the sentences may be split across two chunks, understanding. In contrast, LC tends to provide with the debt-to-GDP ratio lacking a clear subject. more holistic answers when processing longer con- Interestingly, when provided with the same prompt, texts directly, as it bypasses the dependency on a LC rarely reports a lack of context, suggesting its retrieval module. robustness in handling such cases. Wrong answers by LC are often caused by ques- Another error made by RAG is misinterpreting tion misinterpretation. For instance, as shown in partial context. In the second example, where RAG Table 8, when asked whether the tweets come from incorrectly answered the birthday, the model re- a specific region, LC answers ‘yes’, referencing trieved May 8, 1940, instead of the correct date, that the tweets originate from 16 countries. 
It fails January 8, 1935. This occurred because the sen- to interpret the relationship between ‘a specific region’ and ‘16 different countries’. In another with this principle in mind. The construction of example, when asked ‘where’ Valancourt lost his long-context datasets can generally be categorized wealth, the model identifies the correct sentence into two types: but answers ‘how’ instead of ‘where’. These exam- Realistic Long Texts: These datasets originate ples highlight that LC sometimes struggles to align from sources such as novels, research papers, or its semantic understanding with the required level other lengthy narratives, exemplified by datasets of specificity or perspective, resulting in answers like NovelQA. Such datasets typically pose chal- that are related but not addressing the question’s in- lenges that involve reading comprehension and re- tent. In both cases, the LLMs are able to locate the quire models to process and synthesize dense infor- related texts from the documents, but the reasoning mation spread across a cohesive, extended text. ability might be affected by the noise. Synthetic Long Texts: These datasets are often created by concatenating smaller, query-relevant 6 Discussion segments of text, such as Wikipedia-sourced 6.1 What is Long Context? datasets in LongBench. This construction process may involve stitching together Wikipedia excerpts, Although we have reviewed 9 studies that either injecting noise, or combining unrelated passages to directly or implicitly compare or integrate RAG simulate a long document. and Long Context, very few studies clearly define A critical observation is that realistic long con- what Long Context is. To this end, we separately texts align more closely with reading comprehen- interpret the two words ‘long’ and ‘context’. sion tasks, where models primarily absorb and rea- Long. Out of the 9 studies reviewed earlier, only son over information. Such datasets have high con- 2 studies, ChatQA2 and LongBench v2 explicitly textual relevance, since the questions are normally define Long Context as greater than 32k and greater based on the documents that users provided. In con- than 8k tokens respectively. For other studies, we trast, synthetic long contexts often resemble factual can only infer their definitions of “long” based on reasoning tasks, where models retrieve and verify the models and datasets they use. It seems that knowledge. Such datasets inherently incorporate three studies consider 8k as a minimum require- a pre-processing step like a RAG pipeline. They ment for long context, and another three studies set can assess the impact of information placement on this requirement at 16k. Lastly, OP-RAG regards model performance, such as the lost-in-the-middle 128k as long context. phenomenon. In short, each work defines ‘Long Context’ based On the other hand, realistic and synthetic long on its own criteria due to the lack of a clear stan- texts can only serve as proxies to reflect context dard. Moreover, as the context windows of lan- relevance to some extent. The scope of the context guage models continue to expand, the terms ‘long’ is question-dependent and difficult to define clearly. and ‘short’ are relative. For example, 4k tokens 6.2 How to Compare or Combine LC & are not considered ‘long context’ in any of the re- RAG? viewed studies but are extremely long for BERT- base models, which support only 512 tokens. 
As a The lack of a clear definition for long context also result, the definition of ‘long’ remains ambiguous, indicates the absence of a coherent framework for leading to inconsistent use of this concept among comparing or combining LC and RAG. We pro- researchers. In practice, the definition of ‘long’ is pose such a framework by examining three key per- complicated, depending on the context length of spectives: context length, context relevance, and latest LLMs, and the length of the documents in experiment design. targeted domain. Context Length. From the model’s perspective, Context In the English dictionary, ‘context’ is context length refers to the maximum number of defined as “the situation within which something tokens a model can process. From the dataset’s per- happens, and that can help explain it”. By this spective, it denotes the amount of text provided definition, the context of a question is expected to with a question. In synthetic datasets, context “help explain it”, implying that the context should length is flexible, but this introduces a trade-off have strong relevance to the question. However, between length and relevance. Adding irrelevant long-context datasets are not always constructed information as context may help to test a model’s robustness to noise, but such testing may not rep- to see whether chunking or filtering more relevant resent real-world use cases. Therefore, any frame- content through retrieval can outperform or com- work for comparing LC and RAG should clearly plement a fully integrated long-context approach define what is considered ‘long’, while indicating by truncating exceptionally long documents. whether this length criterion originates from the In the first setting, the retrieval pipeline naturally model’s capabilities, the dataset’s design, or both. reduces the number of tokens. In the second set- ting, the context length remains the same for both Context Relevance. An evaluation framework methods, with the only difference being how the must also address the relevance of the text pro- text is processed. vided as input to the model. It is crucial to dis- RAG over Increasing Context: Another possi- tinguish between realistic long contexts and syn- ble goal is understanding how RAG performance thetic long contexts. When benchmarks include changes with increasing context lengths. In this both types, separate evaluations are necessary, as scenario, the “LC” refers specifically to how many synthetic contexts often have low relevance and tokens a model can handle. This line of work can may not accurately reflect real-world scenarios. reveal how well RAG pipelines scale when models Interestingly, the construction of synthetic long absorb increasingly larger inputs. contexts often mirrors RAG pipelines. Providing an On the other hand, findings from evaluations entire curated text to an LLM as context essentially often serve as guidelines for settings that address represents a ‘long context RAG’ approach, given real-world problems. In this sense, RAG and LC that such text is assembled during dataset creation. may complement each other in real-world settings, Further chunking can introduce biases against RAG depending on the characteristics of the data source by disrupting the continuity of information within and the types of questions to be answered. each piece. 
Additionally, many benchmarks categorize tasks 6.3 Revisiting All Studies as ‘single-doc’ or ‘multi-doc’ based on whether the Based on the earlier discussion, the exploration of text originates from a single source or multiple doc- LC and RAG methods in LLMs highlights some uments. While convenient, this categorization does critical challenges that researchers often overlook. not perfectly align with ‘realistic’ or ‘synthetic’ contexts. A single document may sometimes be Trade-off between Context Length and Rele- artificially composed of smaller fragments, while a vance. Many studies hesitate between using flex- multi-sourced document might involve highly rel- ible synthetic context with noisy concatenated con- evant sources, such as a group of research papers texts, or realistic context with dense information discussing the same problem. but less availability. Among the 9 studies, 6 se- The key issue remains determining to what ex- lect synthetic context as part of the datasets. Our tent the context provided as input to LLMs contains own evaluation has also selected synthetic context sufficient and relevant content to answer the ques- datasets, but we consider the influence of synthetic tion, without introducing unnecessary or unrelated long context and separately evaluate their results information. by context source; e.g. a Wikipedia source with manually added noises represents low context rele- Experiment Settings. When investigating LC vance. and RAG, the experimental objectives can be Several studies have attempted to address this broadly grouped into two categories: comparison challenge. LongBench recently updated v2 which and combination. collects only realistic data. Despite a smaller scale, Short RAG v.s. Long Single Input: one might LongBench v2 shows substantial improvement in compare a short-context RAG pipeline against a context relevance compared to its first version. Lon- long-context single-input setup, analyzing both per- gRAG retrieves from a massive corpus for all ques- formance and computational cost. This provides tions, instead of assigning one context to each ques- insights into the trade-off between running an extra tion. This method avoids retrieving from a syn- retrieval pipeline for shorter contexts versus allow- thetic long context and is hence recommendable. ing the model to process a larger uninterrupted text. Long RAG v.s. Long Single Input: One may also Diversity in Retrieval Mechanisms. In the com- compare a long-context RAG pipeline with a long- parison of RAG and LC, RAG is often under- context single-input approach. Here, the goal is represented due to an over-reliance on traditional retrieval strategies. Among the 9 studies, 5 ex- the dataset to provide a statistically significant ba- periment with different retrievers, only 2 try dif- sis for analysis. The results indicate that LC gen- ferent chunking sizes, and none consider any erally outperforms RAG for tasks involving well- retrieval method beyond chunk-based retrievers. structured, dense contexts—such as Wikipedia ar- Although we experiment with index-based and ticles and books—and is better at answering ques- summarization-based retrievers, we cannot promise tions requiring specific information. By contrast, that our selected method outperforms all retrieval RAG demonstrates advantages in handling frag- strategies. mented information, particularly in dialogue-based For investigating RAG performance over increas- scenarios and for more general questions. 
6.3 Revisiting All Studies

Based on the earlier discussion, the exploration of LC and RAG methods in LLMs highlights some critical challenges that researchers often overlook.

Trade-off between Context Length and Relevance. Many studies hesitate between flexible synthetic contexts built from noisy concatenated passages and realistic contexts with dense information but limited availability. Among the 9 studies, 6 select synthetic contexts as part of their datasets. Our own evaluation also includes synthetic-context datasets, but we account for the influence of synthetic long contexts and evaluate results separately by context source; e.g., a Wikipedia source with manually added noise represents low context relevance.

Several studies have attempted to address this challenge. LongBench was recently updated to v2, which collects only realistic data. Despite a smaller scale, LongBench v2 shows substantial improvement in context relevance compared to its first version. LongRAG retrieves from a massive corpus for all questions, instead of assigning one context to each question. This method avoids retrieving from a synthetic long context and is hence recommendable.

Diversity in Retrieval Mechanisms. In the comparison of RAG and LC, RAG is often under-represented due to an over-reliance on traditional retrieval strategies. Among the 9 studies, 5 experiment with different retrievers, only 2 try different chunking sizes, and none consider any retrieval method beyond chunk-based retrievers. Although we experiment with index-based and summarization-based retrievers, we cannot guarantee that our selected methods outperform all retrieval strategies.

For investigating RAG performance over increasing context, some studies propose their own strategies for chunking and placing retrieved chunks. OP-RAG proposes preserving the original order of chunks from the context, while LC LLM-RAG proposes placing higher-scored chunks at the front and back. In addition to more advanced retrievers, certain information retrieval (IR) (Manning et al., 2008) techniques, such as relevance feedback (Harman, 1992) or query expansion (Carpineto and Romano, 2012), might further enhance RAG performance, yet these have been overlooked in existing frameworks.
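The two chunk-placement strategies mentioned above can be sketched as follows. This is only an illustration of the ideas as described here, not the released implementations of OP-RAG or the LC LLM-RAG work, and the (original_position, retrieval_score, text) tuple format is an assumption of this sketch.

from typing import List, Tuple

# Each retrieved chunk is represented as (original_position, retrieval_score, text).
Chunk = Tuple[int, float, str]

def order_preserving_placement(scored_chunks: List[Chunk], k: int) -> List[str]:
    # Keep the top-k chunks by score, then restore their original order in the document.
    top = sorted(scored_chunks, key=lambda c: c[1], reverse=True)[:k]
    return [text for _, _, text in sorted(top, key=lambda c: c[0])]

def ends_first_placement(scored_chunks: List[Chunk], k: int) -> List[str]:
    # Place higher-scored chunks at the front and back, leaving lower-scored ones in the middle.
    top = sorted(scored_chunks, key=lambda c: c[1], reverse=True)[:k]
    front, back = [], []
    for i, (_, _, text) in enumerate(top):
        (front if i % 2 == 0 else back).append(text)
    return front + back[::-1]

Both functions address the same question, namely where the selected chunks should sit in the prompt once retrieval has decided what to keep, which is orthogonal to the choice of retriever itself.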
Computational Cost. Most existing studies test on 6 to 8 datasets, and it becomes increasingly expensive to conduct experiments on many models, especially as new long-context LLMs are released at a very fast pace. Hence, any work may be questioned because its experimental results apply to only one or a few models. Among all works, LC RAG Performance includes the largest number of models (20). While their efforts are remarkable, they experiment on only 3 datasets: FinanceBench (Islam et al., 2023) targets the finance domain, Databricks DocsQA is based on the Databricks platform, and NQ, as shown in Table 2, has a very low rate of requiring external knowledge. This is not meant as criticism, but rather to show the trade-off between testing many models and having a comprehensive benchmark.

7 Conclusion

In this paper, we survey existing studies comparing or combining LC and RAG, analyzing why different implementations may lead to conflicting insights. We then present a thorough comparison of LC and RAG approaches by leveraging a diverse set of long-context QA datasets. We filtered out questions that could be answered from parametric knowledge, ensuring a fair comparison by focusing on questions that require external context. Along these lines, we have developed a systematic filtering and evaluation process, identified the best retrieval method, and expanded the dataset to provide a statistically significant basis for analysis. The results indicate that LC generally outperforms RAG for tasks involving well-structured, dense contexts, such as Wikipedia articles and books, and is better at answering questions that require specific information. By contrast, RAG demonstrates advantages in handling fragmented information, particularly in dialogue-based scenarios and for more general questions.

Beyond merely presenting the experimental results and findings, we delve deeper into the concept of long context and examine how LC and RAG should be compared. Our discussion aims to ensure that the insights gained are more impactful and applicable to real-world scenarios.

Limitations

While our study provides valuable insights into the comparative strengths and weaknesses of Long Context (LC) and Retrieval-Augmented Generation (RAG) approaches, it is important to acknowledge three limitations that may affect the generalizability and comprehensiveness of the findings.

First, our analysis is limited to text-based long contexts and neglects other modalities such as audio, video, or multi-modal contexts. The applicability of these insights to non-textual long-context scenarios remains unexplored, which may limit the broader applicability of the findings to multi-modal applications.

Second, our work focuses on existing papers that compare and combine RAG with long-context LLMs. Therefore, we mainly survey the retrievers and LLMs used in those papers, rather than all available retrievers and long-context LLMs.

Third, our experiments rely on existing LC and RAG implementations, including specific retrieval methods and strong long-context models. As the field continues to evolve, newer models or retrieval strategies may alter the comparative outcomes. However, our evaluation framework remains applicable to future evaluations.

Ethical Considerations

Advanced long-context LLMs equipped with strong RAG capabilities could be misused to generate misleading or harmful content, such as fake news or propaganda. Their long-context capability could amplify the scale and believability of such content. Researchers should prioritize safety and transparency in model usage to mitigate these risks.

References

Chenxin An, Shansan Gong, Ming Zhong, Xingjian Zhao, Mukai Li, Jun Zhang, Lingpeng Kong, and Xipeng Qiu. 2024. L-Eval: Instituting standardized evaluation for long context language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 14388–14411, Bangkok, Thailand. Association for Computational Linguistics.

Akari Asai, Zeqiu Wu, Yizhong Wang, Avirup Sil, and Hannaneh Hajishirzi. 2024. Self-RAG: Learning to retrieve, generate, and critique through self-reflection. In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net.

Yushi Bai, Xin Lv, Jiajie Zhang, Hongchang Lyu, Jiankai Tang, Zhidian Huang, Zhengxiao Du, Xiao Liu, Aohan Zeng, Lei Hou, Yuxiao Dong, Jie Tang, and Juanzi Li. 2024a. LongBench: A bilingual, multitask benchmark for long context understanding. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 3119–3137, Bangkok, Thailand. Association for Computational Linguistics.

Yushi Bai, Shangqing Tu, Jiajie Zhang, Hao Peng, Xiaozhi Wang, Xin Lv, Shulin Cao, Jiazheng Xu, Lei Hou, Yuxiao Dong, Jie Tang, and Juanzi Li. 2024b. LongBench v2: Towards deeper understanding and reasoning on realistic long-context multitasks. CoRR, abs/2412.15204.

Victoria Basmova, Yoav Goldberg, and Reut Tsarfaty. 2024. LLMs' reading comprehension is affected by parametric knowledge and struggles with hypothetical statements. CoRR, abs/2404.06283.

Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, and Christopher Hesse et al. 2020. Language models are few-shot learners. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual.

Zheng Cai, Maosong Cao, Haojiong Chen, Kai Chen, Keyu Chen, Xin Chen, Xun Chen, Zehui Chen, Zhi Chen, Pei Chu, Xiaoyi Dong, Haodong Duan, Qi Fan, Zhaoye Fei, Yang Gao, Jiaye Ge, Chenya Gu, Yuzhe Gu, and Tao Gui et al. 2024. InternLM2 technical report. CoRR, abs/2403.17297.

Claudio Carpineto and Giovanni Romano. 2012. A survey of automatic query expansion in information retrieval. ACM Comput. Surv., 44(1):1:1–1:50.

Shouyuan Chen, Sherman Wong, Liangjian Chen, and Yuandong Tian. 2023. Extending context window of large language models via positional interpolation. CoRR, abs/2306.15595.

Pradeep Dasigi, Kyle Lo, Iz Beltagy, Arman Cohan, Noah A. Smith, and Matt Gardner. 2021. A dataset of information-seeking questions and answers anchored in research papers. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 4599–4610, Online. Association for Computational Linguistics.

DeepSeek-AI, Aixin Liu, Bei Feng, Bin Wang, Bingxuan Wang, Bo Liu, Chenggang Zhao, Chengqi Deng, Chong Ruan, Damai Dai, Daya Guo, Dejian Yang, Deli Chen, Dongjie Ji, Erhang Li, Fangyun Lin, Fuli Luo, Guangbo Hao, Guanting Chen, and Guowei Li et al. 2024. DeepSeek-V2: A strong, economical, and efficient mixture-of-experts language model. CoRR, abs/2405.04434.

Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, and Angela Fan et al. 2024. The Llama 3 herd of models. CoRR, abs/2407.21783.

Weizhi Fei, Xueyan Niu, Pingyi Zhou, Lu Hou, Bo Bai, Lei Deng, and Wei Han. 2024. Extending context window of large language models via semantic compression. In Findings of the Association for Computational Linguistics, ACL 2024, Bangkok, Thailand and virtual meeting, August 11-16, 2024, pages 5169–5181. Association for Computational Linguistics.

Yunfan Gao, Yun Xiong, Xinyu Gao, Kangxiang Jia, Jinliu Pan, Yuxi Bi, Yi Dai, Jiawei Sun, Qianyu Guo, Meng Wang, and Haofen Wang. 2023. Retrieval-augmented generation for large language models: A survey. CoRR, abs/2312.10997.
Shweta Gupta, Sunita Yadav, and Rajesh Prasad. 2018. Document retrieval using efficient indexing techniques: A review. Information Retrieval and Management: Concepts, Methodologies, Tools, and Applications, pages 1745–1764.

Donna Harman. 1992. Relevance feedback revisited. In Proceedings of the 15th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Copenhagen, Denmark, June 21-24, 1992, pages 1–10. ACM.

Xanh Ho, Anh-Khoa Duong Nguyen, Saku Sugawara, and Akiko Aizawa. 2020. Constructing a multi-hop QA dataset for comprehensive evaluation of reasoning steps. In Proceedings of the 28th International Conference on Computational Linguistics, pages 6609–6625, Barcelona, Spain (Online). International Committee on Computational Linguistics.

Pranab Islam, Anand Kannappan, Douwe Kiela, Rebecca Qian, Nino Scherrer, and Bertie Vidgen. 2023. FinanceBench: A new benchmark for financial question answering. CoRR, abs/2311.11944.

Gautier Izacard, Mathilde Caron, Lucas Hosseini, Sebastian Riedel, Piotr Bojanowski, Armand Joulin, and Edouard Grave. 2022. Unsupervised dense information retrieval with contrastive learning. Trans. Mach. Learn. Res., 2022.

Ziwei Ji, Nayeon Lee, Rita Frieske, Tiezheng Yu, Dan Su, Yan Xu, Etsuko Ishii, Yejin Bang, Andrea Madotto, and Pascale Fung. 2023. Survey of hallucination in natural language generation. ACM Comput. Surv., 55(12):248:1–248:38.

Albert Q. Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de Las Casas, Emma Bou Hanna, Florian Bressand, Gianna Lengyel, Guillaume Bour, Guillaume Lample, Lélio Renard Lavaud, Lucile Saulnier, Marie-Anne Lachaux, Pierre Stock, Sandeep Subramanian, Sophia Yang, Szymon Antoniak, Teven Le Scao, Théophile Gervet, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed. 2024a. Mixtral of experts. CoRR, abs/2401.04088.

Zhengbao Jiang, Frank Xu, Luyu Gao, Zhiqing Sun, Qian Liu, Jane Dwivedi-Yu, Yiming Yang, Jamie Callan, and Graham Neubig. 2023. Active retrieval augmented generation. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 7969–7992, Singapore. Association for Computational Linguistics.

Ziyan Jiang, Xueguang Ma, and Wenhu Chen. 2024b. LongRAG: Enhancing retrieval-augmented generation with long-context LLMs. CoRR, abs/2406.15319.

Bowen Jin, Jinsung Yoon, Jiawei Han, and Sercan Ö. Arik. 2024. Long-context LLMs meet RAG: Overcoming challenges for long inputs in RAG. CoRR, abs/2410.05983.

Tomáš Kočiský, Jonathan Schwarz, Phil Blunsom, Chris Dyer, Karl Moritz Hermann, Gábor Melis, and Edward Grefenstette. 2018. The NarrativeQA reading comprehension challenge. Transactions of the Association for Computational Linguistics, 6:317–328.

Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, Kristina Toutanova, Llion Jones, Matthew Kelcey, Ming-Wei Chang, Andrew M. Dai, Jakob Uszkoreit, Quoc Le, and Slav Petrov. 2019. Natural questions: A benchmark for question answering research. Transactions of the Association for Computational Linguistics, 7:452–4

2024 Conference on Empirical Methods in Natural Language Processing: EMNLP 2024 - Industry Track, Miami, Florida, USA, November 12-16, 2024, pages 881–893. Association for Computational Linguistics.

Sheng-Chieh Lin, Akari Asai, Minghan Li, Barlas Oguz, Jimmy Lin, Yashar Mehdad, Wen-tau Yih, and Xilun Chen. 2023. How to train your dragon: Diverse augmentation towards generalizable dense retrieval. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 6385–6400, Singapore. Association for Computational Linguistics.

Jerry Liu. 2022. LlamaIndex. CoRR.

Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. 2008. Introduction to Information Retrieval. Cambridge University Press.
Thomas Mesnard, Cassidy Hardin, Robert Dadashi, Surya Bhupatiraju, Shreya Pathak, Laurent Sifre, Morgane Rivière, Mihir Sanjay Kale, Juliette Love, and Pouya Tafti et al. 2024. Gemma: Open models based on Gemini research and technology. CoRR, abs/2403.08295.

Erik Nijkamp, Tian Xie, Hiroaki Hayashi, Bo Pang, Congying Xia, Chen Xing, Jesse Vig, Semih Yavuz, Philippe Laban, Ben Krause, Senthil Purushwalkam, Tong Niu, Wojciech Kryscinski, Lidiya Murakhovs'ka, Prafulla Kumar Choubey, Alex Fabbri, Ye Liu, Rui Meng, Lifu Tu, Meghana Bhat, Chien-Sheng Wu, Silvio Savarese, Yingbo Zhou, Shafiq Rayhan Joty, and Caiming Xiong. 2023. XGen-7B technical report. CoRR, abs/2309.03450.

OpenAI, Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, Red Avila, Igor Babuschkin, Suchir Balaji, Valerie Balcom, Paul Baltescu, Haiming Bao, Mohammad Bavarian, and Jeff Belgum et al. 2023. GPT-4 technical report. CoRR, abs/2303.08774.

Richard Yuanzhe Pang, Alicia Parrish, Nitish Joshi, Nikita Nangia, Jason Phang, Angelica Chen, Vishakh Padmakumar, Johnny Ma, Jana Thompson, He He, and Samuel Bowman. 2022. QuALITY: Question answering with long input texts, yes! In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 5336–5358, Seattle, United States. Association for Computational Linguistics.