Full Transcript

ColPali: Efficient Document Retrieval with Vision Language Models

Manuel Faysse*1,3  Hugues Sibille*1,4  Tony Wu*1  Bilel Omrani1  Gautier Viaud1  Céline Hudelot3  Pierre Colombo2,3
1 Illuin Technology  2 Equall.ai  3 CentraleSupélec, Paris-Saclay  4 ETH Zürich
[email protected]
* Equal Contribution

arXiv:2407.01449v2 [cs.IR] 2 Jul 2024

Abstract

Documents are visually rich structures that convey information through text, as well as tables, figures, page layouts, or fonts. While modern document retrieval systems exhibit strong performance on query-to-text matching, they struggle to exploit visual cues efficiently, hindering their performance on practical document retrieval applications such as Retrieval Augmented Generation. To benchmark current systems on visually rich document retrieval, we introduce the Visual Document Retrieval Benchmark ViDoRe, composed of various page-level retrieving tasks spanning multiple domains, languages, and settings. The inherent shortcomings of modern systems motivate the introduction of a new retrieval model architecture, ColPali, which leverages the document understanding capabilities of recent Vision Language Models to produce high-quality contextualized embeddings solely from images of document pages. Combined with a late interaction matching mechanism, ColPali largely outperforms modern document retrieval pipelines while being drastically faster and end-to-end trainable. We release all project artifacts at https://huggingface.co/vidore.

Figure 1: For each term in a user query, ColPali identifies the most relevant document image patches (highlighted zones) and computes a query-to-page matching score. We can then swiftly retrieve the most relevant documents from a large pre-indexed corpus.

1 Introduction

Document Retrieval consists in matching a user query to relevant documents in a given corpus. It is central to many industrial applications, either as a standalone ranking system (search engines) or as part of more complex information extraction or Retrieval Augmented Generation (RAG) pipelines. Over recent years, pretrained language models have enabled large improvements in text embedding models. In practical industrial settings, however, the main performance bottleneck for efficient document retrieval is not in embedding model performance but in the prior data ingestion pipeline. To index a standard PDF document, many steps are required. First, PDF parsers or Optical Character Recognition (OCR) systems are used to extract words from the pages. Document layout detection models can then be run to segment paragraphs, titles, and other page objects such as tables, figures, and headers. A chunking strategy is then defined to group text passages with some semantic coherence, and modern retrieval setups may even integrate a captioning step to describe visually rich elements in a natural language form, more suitable for embedding models. In our experiments (Table 2), we typically find that optimizing the ingestion pipeline yields much greater performance on visually rich document retrieval than optimizing the text embedding model.

Contribution 1: ViDoRe. In this work, we argue that document retrieval systems should not be evaluated solely on the capabilities of text embedding models (Bajaj et al., 2016; Thakur et al., 2021; Muennighoff et al., 2022), but should also

Figure 2: ColPali simplifies document retrieval w.r.t. standard retrieval methods while achieving stronger performances with better latencies.
Latencies and results are detailed in section 5 and subsection B.5. consider the context and visual elements of the doc- with respect to a query q. Computing the similarity uments to be retrieved. To this end, we create and score s(q, d) ∈ R+ for each of the |D| documents openly release ViDoRe, a comprehensive bench- in the corpus creates a ranking we can use to ex- mark to evaluate systems on page-level document tract the most relevant documents. In this work, retrieval with a wide coverage of domains, visual we focus on page-level retrieval: given a query, is elements, and languages. ViDoRe targets practical the correct document page retrieved by the system? document retrieval settings, in which user queries For coherence with existing literature, we further may require both textual and visual understanding use the term document to refer to individual pages, to be correctly matched to relevant documents. We i.e. the atomic retrieved elements in our setting. As highlight the shortcomings of current text-centric we focus on practical industrial retrieval applica- systems in these settings.1 tions (RAG, search engines) with potentially large Contribution 2: ColPali. We propose a novel corpora sizes, latency constraints are imposed on model architecture and training strategy based on scoring systems. Most current retrieval systems Vision Language Models (VLMs) to efficiently in- can be decomposed into (1) an offline indexation dex documents purely from their visual features, phase in which a document index is built and (2) an allowing for subsequent fast query matching with online querying phase in which a query is matched late interaction mechanisms (Khattab and Zaharia, to documents from the index and where low latency 2020). Our method, ColPali, outperforms all other is vital to the user experience. retrieval systems on ViDoRe while being fast and Efficient document retrieval systems exhibit end-to-end trainable. We release models and code joint properties of high retrieval performance at https://huggingface.co/vidore. (R1), low latency during querying (R2), and high throughput during indexation (R3). 2 Problem Formulation & Related Work 2.1 Textual Retrieval Methods Problem Setting. In our setting, a retrieval system scores how relevant a document d from corpus D is Document Retrieval in Text Space. Statistical 1 methods based on word frequency like TF-IDF The benchmark leaderboard is hosted pub- licly at https://huggingface.co/spaces/vidore/ (Sparck Jones, 1972) and BM25 (Robertson et al., vidore-leaderboard to encourage further developments. 1994) are still widely used due to their simplicity 2 and efficiency. More recently, neural embedding (Yao et al., 2021) framework extends the late inter- models based on fine-tuned large language models action mechanism to cross-modal vision-language display state-of-the-art performance on a variety of models, relying on max similarity operations be- text embedding tasks and top the retrieval leader- tween text tokens and image patches. boards (Muennighoff et al., 2022). Visually Rich Document Understanding. To Neural Retrievers. In bi-encoder models (Reimers go beyond text, some document-focused models and Gurevych, 2019; Karpukhin et al., 2020; Wang jointly encode text tokens alongside visual or docu- et al., 2022), documents are independently mapped ment layout features (Appalaraju et al., 2021; Kim offline to a dense vector space. Queries are em- et al., 2021; Huang et al., 2022; Tang et al., 2022). 
bedded online and matched to documents through Large Language transformer Models (LLMs) with a fast cosine distance computation. A slower, but strong reasoning capabilities have recently been slightly more performant alternative, cross-encoder combined with Vision Transformers (ViTs) (Doso- systems (Wang et al., 2020; Cohere, 2024) concate- vitskiy et al., 2020) to create VLMs (Alayrac et al., nate query and document as a single input sequence 2022; Liu et al., 2023b; Bai et al., 2023; Laurençon and iteratively attribute matching scores to each et al., 2024) where image patch vectors from con- possible combination. This enables full attention trastively trained ViT models (Zhai et al., 2023) are computation between query and document terms fed as input embeddings to the language model and but comes at the cost of computational efficiency, concatenated with the text-token embeddings. as |D| encoding passes must be done online. PaliGemma. The PaliGemma-3B model (Lu- Multi-Vector retrieval via late interaction. In cas Beyer* et al., 2024) extends concepts the late interaction paradigm (Khattab and Zaharia, from Pali3 (Chen et al., 2023), and projects 2020), an embedding is pre-computed and indexed SigLIP-So400m/14 (Alabdulmohsin et al., 2023) per document token. At runtime, similarity can be patch embeddings into Gemma-2B’s text vector computed with individual query token embeddings. space (Gemma Team et al., 2024). Along with its The idea is to benefit from the rich interaction be- reasonable size w.r.t. other performant VLMs, an tween individual query and document terms while interesting property of PaliGemma’s text model is taking advantage of the offline computation and that it is fine-tuned with full-block attention on the fast query matching enabled by bi-encoders. prefix (instruction text and image tokens). Retrieval Evaluation. Although benchmarks and VLMs display enhanced capabilities in Visual Ques- leaderboards have been developed to evaluate text tion Answering, captioning, and document under- embedding models (Thakur et al., 2021; Muen- standing (Yue et al., 2023), but are not optimized nighoff et al., 2022), as previously stated, much for retrieval tasks. of the performance improvements in industrial use cases of embedding models stem from the prior 3 The ViDoRe Benchmark data ingestion pipeline. While documents often rely on visual elements to more efficiently convey Existing benchmarks for contrastive vision- information to human readers, text-only systems language models primarily evaluate retrieval for barely tap into these visual cues. natural images (Lin et al., 2014; Borchmann et al., To our knowledge, no benchmark evaluates docu- 2021; Thapliyal et al., 2022). On the other hand, ment retrieval methods by considering both textual textual retrieval benchmarks (Muennighoff et al., and visual document features like a human would. 2022) are evaluated at the textual passage level and are not tailored for document retrieval tasks. We fill 2.2 Integrating Visual features the gap with ViDoRe, a comprehensive benchmark Contrastive Vision Language Models. Mapping for document retrieval using visual features. latent representations of textual content to corre- sponding representations of visual content has been 3.1 Benchmark Design done by aligning disjoint visual and text encoders ViDoRe is designed to comprehensively evaluate through contrastive losses (Radford et al., 2021; retrieval systems on their capacity to match queries Zhai et al., 2023). 
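To ground the bi-encoder paradigm discussed in this section, the following is a minimal sketch of its two phases, offline indexation and online querying with cosine-similarity matching. It is an illustration only: the tensor sizes and helper names are placeholders, not taken from any released codebase, and any dense encoder (e.g. BGE-M3) could stand behind the embeddings.

```python
import torch

# Minimal bi-encoder retrieval sketch: documents are embedded once offline,
# queries are embedded online and matched with a cosine-similarity ranking.

def cosine_rank(query_emb: torch.Tensor, doc_embs: torch.Tensor, k: int = 5):
    """query_emb: (dim,); doc_embs: (n_docs, dim); returns top-k doc indices."""
    q = torch.nn.functional.normalize(query_emb, dim=-1)
    d = torch.nn.functional.normalize(doc_embs, dim=-1)
    scores = d @ q                      # (n_docs,) cosine similarities
    return scores.topk(k).indices       # ranking used to retrieve pages

# Offline indexation phase (R3): one vector per document chunk.
doc_embs = torch.randn(10_000, 1024)    # placeholder corpus embeddings
# Online querying phase (R2): a single forward pass + a fast matmul.
query_emb = torch.randn(1024)           # placeholder query embedding
top_pages = cosine_rank(query_emb, doc_embs, k=5)
```

The late interaction counterpart keeps one vector per token or patch instead of a single pooled vector and aggregates token-level maximum similarities; a sketch of that operator is given after Equation 1 in section 4.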
While some OCR capabilities to relevant documents at the page level. This bench- exist in these models, the visual component is often mark encompasses multiple orthogonal subtasks, not optimized for text understanding. The Fine- with focuses on various modalities - text, figures, grained Interactive Language-Image Pre-training infographics, tables; thematic domains - medical, 3 business, scientific, administrative; or languages - metrics from the retrieval literature (NDCG, Re- English (eng), French (fra). call@K, MRR). We report NDCG@5 values as the main performance metric in this work and release Dataset # Queries Domain the complete sets of results along with the models3. Academic Tasks To validate compliance with practical industrial DocVQA (eng) 500 (500) Industrial constraints, we also consider query latencies (R2) InfoVQA (eng) 500 (500) Infographics TAT-DQA (eng) 1600 (1600) Varied Modalities and indexing throughputs (R3). arXiVQA (eng) 500 (500) Scientific Figures TabFQuAD (fra) 210 (210) Tables 3.2 Assessing Current Systems Practical Tasks Energy (eng) 100 (1000) Scientific Unstructured. We evaluate retrieval systems rep- Government (eng) 100 (1000) Administrative resentative of those found in standard industrial Healthcare (eng) 100 (1000) Medical RAG pipelines. As is common practice, we rely on AI (eng) 100 (1000) Scientific Shift Project (fra) 100 (1000) Environment the Unstructured4 off-the-shelf tool in the high- est resolution settings to construct high-quality text Table 1: ViDoRe comprehensively evaluates multimodal chunks from PDF documents. Unstructured or- retrieval methods. The size of the document corpus is chestrates the document parsing pipeline, relying indicated in parentheses. on deep learning vision models to detect titles and document layouts (Ge et al., 2021), OCR engines Academic Tasks. We repurpose widely used visual (Smith, 2007) to extract text in non-native PDFs, question-answering benchmarks for retrieval tasks: specialized methods or models to detect and recon- for each page-question-answer triplet, we use the struct tables, and implements a chunking strategy question as the query, and the associated page as (by-title) that leverages the detected document the gold document (Table 1). These academic structure to preserve section boundaries when con- datasets either focus on single specific modalities catenating texts. As is common practice, in our (Mathew et al., 2020, 2021; Li et al., 2024) or simplest Unstructured configuration (text-only), target more varied visually rich documents (Zhu only textual elements are kept, and figures, images, et al., 2022). Moreover, we consider TabFQuAD, and tables are considered noisy information and a human-labeled dataset on tables extracted from are filtered out. French industrial PDF documents released with Unstructured + X. While Unstructured is a this work. Details can be found in subsection A.1. strong baseline by itself, we further augment Practical tasks. We construct topic-specific re- Unstructured’s output by integrating the visual trieval benchmarks spanning multiple domains to elements. In (+ OCR), tables, charts, and images go beyond repurposed QA datasets and evaluate are run through an OCR engine, processed by Un- retrieval in more realistic industrial situations (e.g. structured, and chunked independently. In (+ Cap- RAG). 
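As a reference for the main evaluation metric (R1), below is a minimal sketch of NDCG@5 under the benchmark's setting of a single gold page per query; it is the standard formulation with binary relevance, not the exact evaluation script used for the leaderboard.

```python
import math

def ndcg_at_k(ranked_page_ids, gold_page_id, k=5):
    """NDCG@k with binary relevance: one gold page per query, so the
    ideal DCG is 1 (gold page at rank 1)."""
    for rank, page_id in enumerate(ranked_page_ids[:k], start=1):
        if page_id == gold_page_id:
            return 1.0 / math.log2(rank + 1)   # DCG contribution of the hit
    return 0.0

# Averaged over all queries of a task to obtain the reported NDCG@5.
queries = [(["p3", "p7", "p1"], "p7"), (["p2", "p9", "p4"], "p8")]
score = sum(ndcg_at_k(r, g) for r, g in queries) / len(queries)
```

Recall@K and MRR, also released with the results (see Table 4 for Recall@1), follow analogously from the same per-query rankings.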
To achieve this, we collect publicly acces- tioning), we set up a fully-fledged captioning strat- sible PDF documents and generate queries per- egy (Zhao et al., 2023), in which we feed visual taining to document pages using Claude-3 Sonnet, elements to a strong proprietary Vision Language a high-quality proprietary vision-language model Model (Claude-3 Sonnet (Anthropic, 2024)) to ob- (Anthropic, 2024). In total, we collect 1,000 doc- tain highly detailed textual descriptions of the ele- ument pages per topic, which we associate with ments. Both strategies aim to integrate visual ele- 100 queries extensively filtered for quality and rele- ments in the retrieval pipeline but incur significant vance by human annotators. The corpus topics are latency and resource costs (subsection 5.2). intentionally specific to maximize syntactic prox- Embedding Model. To embed textual chunks, we imity between documents, creating challenging re- evaluate Okapi BM25, the de facto standard sparse trieval tasks and covering an array of orthogonal statistical retrieval method, and the dense encoder domains (Table 1). Query-page pair examples are of BGE-M3 (Chen et al., 2024), a multilingual shown in Appendix E.2 neural method with SOTA performance in its size Evaluation Metrics. We evaluate performance on category. Chunks are embedded and scored inde- our benchmark (Requirement R1) using standard pendently, and page-level scores are obtained by 2 Answers are generated alongside queries to (1) ground queries and improve their quality and (2) provide resources to 3 https://huggingface.co/vidore foster future work. 4 www.unstructured.io 4 max-pooling over the page’s chunk scores.5 and the promising performances on various doc- Contrastive VLMs. We also evaluate the strongest ument understanding benchmarks. We add a pro- available vision-language embedding models; Jina jection layer to map the output language model- CLIP (Koukounas et al., 2024), Nomic Embed Vi- ing embeddings to a vector space of reduced di- sion (Nomic, 2024), and SigLIP-So400m/14 (Al- mension D = 128 as used in the ColBERT paper abdulmohsin et al., 2023). (Khattab and Zaharia, 2020) to keep lightweight Results. From a performance perspective, best re- bag-of-embedding representations. sults are obtained by combining the Unstructured Late Interaction. Given query q and document parser with visual information, either from caption- d, we denote as Eq ∈ RNq ×D and Ed ∈ RNd ×D ing strategies or by running OCR on the visual ele- their respective multi-vector representation in the ments (Table 2). Little difference is seen between common embedding space RD. The late interaction BM25 and BGE-M3 embeddings highlighting the operator, LI (q, d), is the sum over all query vectors visual information bottleneck. Contrastive VLMs Ed (j) , of its maximum dot product ⟨·|·⟩ with each lag behind. Beyond retrieval performance (R1), the of the Nd document embedding vectors Ed(1:Nd ). indexing latencies (R2) reported in Figure 3 illus- trate that PDF parsing pipelines can be very lengthy, X especially when incorporating OCR or captioning LI (q, d) = max ⟨Eq (i) |Ed (j) ⟩ (1) j∈[|1,Nd |] strategies. Querying latencies at runtime (R3) are i∈[|1,Nq |] very good for all evaluated systems (≤ 22 ms on NVIDIA L4) due to fast query encoding and cosine Contrastive Loss. The Late Interaction opera- similarity matching. tion is fully differentiable, enabling backpropaga- tion. 
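For illustration, Equation 1 and the in-batch loss defined in the next paragraph (reformulated with a softplus in subsection B.2) can be sketched in PyTorch as follows. This is a minimal sketch under illustrative shapes, not the released implementation.

```python
import torch

def late_interaction(E_q: torch.Tensor, E_d: torch.Tensor) -> torch.Tensor:
    """Equation 1: E_q is (N_q, D), E_d is (N_d, D), both in the shared
    D=128 space. Sum over query tokens of the maximum dot product
    ("MaxSim") against all document patch embeddings."""
    sim = E_q @ E_d.T                    # (N_q, N_d) dot products
    return sim.max(dim=1).values.sum()

def in_batch_loss(query_embs, doc_embs):
    """Batch of b positive (query, page) pairs. s_pos[k] = LI(q_k, d_k);
    s_neg[k] = max over l != k of LI(q_k, d_l). Softplus form of Eq. 2."""
    b = len(query_embs)
    scores = torch.stack([
        torch.stack([late_interaction(query_embs[k], doc_embs[l]) for l in range(b)])
        for k in range(b)
    ])                                    # (b, b) score matrix
    s_pos = scores.diag()
    s_neg = scores.masked_fill(torch.eye(b, dtype=torch.bool), float("-inf")).max(dim=1).values
    return torch.nn.functional.softplus(s_neg - s_pos).mean()
```

In practice, query embeddings are padded to a fixed length and every page yields the same number of patch embeddings (subsection B.5), so the score matrix can be computed with batched matrix products rather than the explicit loops shown here.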
Let a batch {qk , dk }k∈[|1,b|] composed of b PDF Parser query-page pairs, where for all k ∈ [|1, b|], the (7.22s) document page dk is the document correspond- Siglip (0.12s) ing to query qk. Following Khattab and Zaharia ColPali (2020), we define our in-batch contrastive loss L as (0.39s) 0 1 2 3 4 5 6 7 the softmaxed cross-entropy of the positive scores Latency (s) Layout Detection OCR Captioning Page Encoding s+k = LI (dk , qk ) w.r.t. to the maximal negative scores s−k = max LI (qk , pl ). l,l̸=k Figure 3: Offline indexing with ColPali is much sim- pler and faster compared to standard retrieval methods. 4.2 Model training Indexing speeds reported are computed on Nvidia L4 Dataset. Our training dataset of 127,460 query- GPUs and detailed in subsection B.5. page pairs is comprised of train sets of openly available academic datasets (63%) and a synthetic 4 Late interaction based Vision Retrieval dataset made up of pages from web-crawled PDF documents and augmented with VLM-generated 4.1 Architecture (Claude-3 Sonnet) pseudo-questions (37%). Our Vision-Language Models. Encouraged by their training set is fully English by design, enabling us strong document understanding capabilities, we to study zero-shot generalization to non-English propose adapting recent VLMs for retrieval. The languages6. We explicitly verify no multi-page key concept is to leverage the alignment between PDF document is used both ViDoRe and in the output embeddings of text and image tokens ac- train set to prevent evaluation contamination. A quired during multi-modal finetuning. To this ex- validation set is created with 2% of the samples to tent, we introduce ColPali, a Paligemma-3B exten- tune hyperparameters. sion that is capable of generating ColBERT-style Parameters. All models are trained for 1 epoch on multi-vector representations of text and images the train set. Unless specified otherwise, we train (Figure 2). PaliGemma-3B is a strong candidate models in bfloat16 format, use low-rank adapters due to its small size, the many released checkpoints (LoRA, Hu et al. (2021)) with α = 32 and r = 32 fine-tuned for different image resolutions and tasks, on the transformer layers from the language model, 5 6 We empirically validated the max-pooling strategy over Multilingual data is present in the pretraining corpus of sub-page chunks to be more effective than concatenating all the language model (Gemma-2B) and potentially occurs dur- page chunks before embedding pagewise. ing PaliGemma-3B’s multimodal training. 5 as well as the final randomly initialized projection ColPali: Adding Late Interaction. One benefit layer, and use a paged_adamw_8bit optimizer. We of inputting image patch embeddings through a train on an 8 GPU setup with data parallelism, a language model is that they are natively mapped learning rate of 5e − 5 with linear decay with 2.5% to a latent space similar to textual input (query). warmup steps, and a batch size of 32. This enables leveraging the ColBERT strategy to Query Augmentation. As in Khattab and Za- compute interactions between text tokens and im- haria (2020), we append 5 tokens to age patches, which enables a step-change improve- the query tokens to serve as a soft, differentiable ment in performance compared to BiPali. Re- query expansion or re-weighting mechanism. sults in Table 2 show that our ColPali model also largely outperforms the strong baselines based on 5 Results Unstructured and captioning, as well as all evalu- ated text-image embedding models. 
The difference 5.1 Performance (R1) is particularly stark on the more visually complex We iteratively construct ColPali, starting from an benchmark tasks, such as InfographicVQA, Arx- off-the-shelf SigLIP model (Table 2). ivQA, and TabFQuAD representing respectively BiSigLIP: Improving a strong model. SigLIP7 infographics, figures, and tables. However, text- is a strong vision-language bi-encoder model, pre- centric documents are also better retrieved by the trained on the English split of WebLI (Chen et al., ColPali models across all evaluated domains and 2023), a corpus of billions of image-text pairs. languages, making our approach the overall best- We find that SigLIP largely outperforms both Jina performing document-retrieval model. CLIP and Nomic-vision on document retrieval Negative Results. For extensiveness, we also tasks. Further fine-tuning the textual component train ColSigLIP, a late interaction variant of the of this model on our document-oriented dataset BiSigLIP model but obtain abysmal performances. (BiSigLIP) yields clear improvements across the We attribute this to the large gaps w.r.t. SigLIP’s board, particularly on figure retrieval (ArxivQA) pre-training, in which only a pooled latent repre- and table retrieval tasks (TabFQuAD). sentation is used in the contrastive loss, which BiPali: Pairing with a language model. In the does not optimize the representations of individ- PaliGemma model architecture, SigLIP-generated ual patch and token embeddings. Similarly, we patch embeddings are fed to a text language model train a BiSigLIPP aliGemma variant, in which we to obtain LLM contextualized output patch embed- retrieve the image representations from the SigLIP dings.8 We average pool these representations to model that has been further updated by PaliGemma obtain a single dense vector, effectively creating a fine-tuning, and use the text representations from PaliGemma bi-encoder model (BiPali). After fine- PaliGemma’s text model. After fine-tuning on tuning on the training dataset, we obtain a model our dataset, performance is severely inferior to that performs slightly worse in English than the SigLIPV anilla which simply encodes with SigLIP’s tuned BiSigLIP variant. This can be explained original text and vision components. This indicates by the fact that contrary to SigLIP, the original a logical misalignment between SigLIP embed- PaliGemma is not trained on contrastive matching dings, and Gemma embeddings after PaliGemma tasks, but rather on next token prediction. Our training. We detail these results in Table 5. contrastive fine-tuning phase on 100K images to transform PaliGemma into a bi-encoder is 5 orders 5.2 Latencies & Memory Footprint of magnitude smaller than SigLIP’s original con- Online Querying. (R2) Logically, querying la- trastive training. However, we see notable improve- tencies differ between ColPali and a BGE-M3 em- ments in French tasks, indicating that BiPali’s LLM bedding model. For BGE, encoding takes about (Gemma 2B) helps multilingual text understanding. 22 ms for 15 tokens, while encoding a query with This is particularly notable as our training dataset ColPali’s language model takes about 30 ms9. For does not contain non-English samples. 
smaller corpus sizes, computing the late interaction 7 https://huggingface.co/google/ operation induces marginally small overheads (≈ 1 siglip-so400m-patch14-384 ms per 1000 pages in the corpus), and the cosine 8 Note that the SigLIP model used in PaliGemma slightly similarity computation between bi-encoder vectors differs in terms of number patches - 1024 patches for 9 PaliGemma’s vision encoder, and 729 for the standalone Computed for a batch size of 1 (online), and averaged SigLIP model. over 1000 queries. See subsection B.5 6 ArxivQ DocQ InfoQ TabF TATQ Shift AI Energy Gov. Health. Avg. Unstructured Text only - BM25 - 34.1 - - 44.0 59.6 90.4 78.3 78.8 82.6 - - BGE-M3 - 28.4↓5.7 - - 36.1↓7.9 68.5↑8.9 88.4↓2.0 76.8↓1.5 77.7↓1.1 84.6↑2.0 - Unstructured + OCR - BM25 31.6 36.8 62.9 46.5 62.7 64.3 92.8 85.9 83.9 87.2 65.5 - BGE-M3 31.4↓0.2 25.7↓11.1 60.1↓2.8 70.8↑24.3 50.5↓12.2 73.2↑8.9 90.2↓2.6 83.6↓2.3 84.9↑1.0 91.1↑3.9 66.1↑0.6 Unstructured + Captioning - BM25 40.1 38.4 70.0 35.4 61.5 60.9 88.0 84.7 82.7 89.2 65.1 - BGE-M3 35.7↓4.4 32.9↓5.4 71.9↑1.9 69.1↑33.7 43.8↓17.7 73.1↑12.2 88.8↑0.8 83.3↓1.4 80.4↓2.3 91.3↑2.1 67.0↑1.9 Contrastive VLMs Jina-CLIP 25.4 11.9 35.5 20.2 3.3 3.8 15.2 19.7 21.4 20.8 17.7 Nomic-vision 17.1 10.7 30.1 16.3 2.7 1.1 12.9 10.9 11.4 15.7 12.9 SigLIP (Vanilla) 43.2 30.3 64.1 58.1 26.2 18.7 62.5 65.7 66.1 79.1 51.4 Ours SigLIP (Vanilla) 43.2 30.3 64.1 58.1 26.2 18.7 62.5 65.7 66.1 79.1 51.4 BiSigLIP (+fine-tuning) 58.5↑15.3 32.9↑2.6 70.5↑6.4 62.7↑4.6 30.5↑4.3 26.5↑7.8 74.3↑11.8 73.7↑8.0 74.2↑8.1 82.3↑3.2 58.6↑7.2 BiPali (+LLM) 56.5↓-2.0 30.0↓-2.9 67.4↓-3.1 76.9↑14.2 33.4↑2.9 43.7↑17.2 71.2↓-3.1 61.9↓-11.7 73.8↓-0.4 73.6↓-8.8 58.8↑0.2 ColPali (+Late Inter.) 79.1↑22.6 54.4↑24.5 81.8↑14.4 83.9↑7.0 65.8↑32.4 73.2↑29.5 96.2↑25.0 91.0↑29.1 92.7↑18.9 94.4↑20.8 81.3↑22.5 Table 2: Comprehensive evaluation of baseline models and our proposed method on ViDoRe. Results are presented using NDCG@5 metrics, and illustrate the impact of different components. Text-only metrics are not computed for benchmarks with only visual elements. is even faster. Optimized late interaction engines nisms as proposed in the Performance-optimized (Santhanam et al., 2022; Lee et al., 2023) enable to Late Interaction Driver (Santhanam et al., 2022). easily scale corpus sizes to millions of documents with reduced latency degradations. 5.3 Interpretability Offline Indexing. (R3) Standard retrieval methods By superimposing the late interaction heatmap on using bi-encoders represent each chunk as a single top of the original image, we can visualize the most vector embedding, which is easy to store and fast salient image patches with respect to each term to compute. However, processing a PDF to get of the query, yielding interpretable insights into the different chunks is the most time-consuming model focus zones. As epitomized in Figure 1, we part (layout detection, OCR, chunking), and us- observe ColPali exhibits strong OCR capabilities as ing captioning to handle multimodal data will only both the words "hourly" and "hours" present a high exacerbate this already lengthy process. On the similarity score with the query token. We other hand, ColPali directly encodes pages from also note particular focus on other non-trivial image their image representation. Although the encoder features such as the x-axis representing hours being model is larger than standard retrieval encoders, salient. 
Other visualization examples with similar skipping the preprocessing allows large speedups trends of the model transcending pure OCR are at indexing10 (Figure 3). shown in Appendix C. Memory Footprint. Our method requires stor- ing a vector per image patch. We project each 6 Ablation study PaliGemma vector to a lower dimensional space (D=128) to maximize efficiency, leading to a mem- Should we scale models or patch numbers ? ory footprint of 256 KB per page (subsection B.4). We train a variant of PaliGemma with half the num- Importantly, the memory footprint of the naive ber of image patches (512). While there is a clear ColBERT indexing strategy can be drastically im- performance degradation w.r.t. to the 1024-patch proved through compression and clustering mecha- ColPali model (Figure 4), memory usage is much lower.11 As an alternative to PaliGemma, we train 10 Measures a NVIDIA L4 GPU, averaged on 100 pages, 11 with a batch size of 4 pages for ColPali and 8 text chunks for While another PaliGemma variant exists with 2048 Bi-Encoders. On average, a page is divided into 2.1 chunks. patches, the different training datamix and the large memory See subsection B.5. requirements make this model impractical for both training 7 10 Relative NDCG@5 (%) Is the Pairwise CE loss best? 0 Training with an in-batch negative contrastive loss, 10 instead of the pairwise CE loss that only considers 20 the hardest negative sample, leads to a slight per- 30 formance degradation (−2.4%) on the aggregated 40 ColPali Idefics2 No Mem. Full IB Train TabF benchmark. (512) (64) Tokens Loss Vision Tuning Can the model adapt to new tasks? Figure 4: Relative NDCG@5 performance gain w.r.t. Contrary to more complex multi-step retrieval the default ColPali (1024 patches). TabFQuAD fine- pipelines, ColPali can be trained end-to-end, di- tuning measures the performance difference on the rectly optimizing the downstream retrieval task TabFQuAD task after the introduction of targeted data in the training set. All other results refer to performance which greatly facilitates fine-tuning to boost per- deltas averaged on all ViDoRe tasks. formance on specialized domains, multilingual re- trieval, or specific visual elements the model strug- gles with. To demonstrate, we add 1552 samples Idefics2-8B (Laurençon et al., 2024), a VLM with representing French tables and associated queries a similar architecture and based on a Mistral-7B to the training set. This represents the only French (Jiang et al., 2023) language backbone and a SigLIP data in the training set, with all other examples be- vision encoder paired with a perceiver resampler. ing kept unchanged. We see significant NDCG@5 The most notable differences with PaliGemma lie improvements (Figure 4) and even starker Re- in the size of the language model (2B and 7B resp.) call@1 gains (+6.63%) on the TabFQuAD bench- and the number of image patches (between 512 and mark, with no performance degradation on the rest 2048 for PaliGemma, and 64 post-resampling for of the benchmark tasks (+0.34%). Idefics212 ). Our results (Figure 4) suggest language model size has a strong impact on performance, and 7 Conclusions along with the trained resampler enables more effi- cient representations for smaller numbers of image Through the conception of a new benchmark Vi- embeddings - ColIdefics2 with 64 patches edges DoRe, we established the limits of both modern out ColPali with 512 patches. 
Scaling the number industrial document retrieval pipelines and off-the- of patches of the smaller ColPali model from 512 shelf image-text contrastive models for visually to 1024, enables largely surpassing the 60-patch rich document retrieval. We introduced ColPali, a ColIdefics2 while being about twice as fast in terms novel retrieval model that leverages the latest gen- of training and inference latency. These results sug- erative Vision Language models to create highly gest there are tradeoffs between performance (R1), performing multi-vector embeddings purely from latencies during online querying (R2) and offline visual document features. ColPali largely outper- indexation phases (R3), and index memory size. forms the best existing document retrieval meth- ods while enabling faster corpus indexing time and maintaining low querying latencies, suggesting a Should we fine-tune the vision component? very high potential for industrial document retrieval We run our contrastive finetuning on a ColPali applications. We hope to encourage future work by model in which we also train the vision encoder publicly releasing the ViDoRe benchmark and all and the projection layer. Results in Figure 4 show models and baselines from our study. this leads to no significant improvements. Future Work. Further performance gains could Do "query augmentation" tokens help? be obtained by exploring sub-image decomposi- In ColBERT, special tokens are concatenated to the tion (Liu et al., 2023a), optimal image patch re- input query to serve as soft query augmentation sampling strategies (Laurençon et al., 2024), or buffers. Training without these tokens, we observe hard-negative mining. Subsequently, our vision is no significant performance difference (Figure 4) in to combine visual retrieval and visually grounded the English benchmarks. However, performance query answering to create RAG systems that purely on the French tasks seems to improve (Table 5) function from visual features. An interesting line of research could be attempting to generate answers and inference time. 12 With the option of adding 4 sub-image crops of 64 tokens leveraging information stored in the indexed multi- each to the sequence, for a total of 320 tokens vector patch embeddings. 8 Limitations storage. Overall, a training run represents about 40 hours of Mi250x AMD GPUs. Our experi- Focus. In this work, we evaluate models on doc- ments, in total, represent 1405 Mi250x GPU hours ument retrieval tasks, covering several modalities from highly efficient compute clusters running on (figures, text, tables, infographics). We however low-carbon nuclear energy, representing a total of primarily focus on PDF-type documents, and eval- around 15kg CO2 eq. uating systems on image retrieval with documents Impact. We believe our work could have a strong stemming from web page screenshots or hand- impact on improving industrial document retrieval written documents might be an interesting general- systems. Our method is efficient, performs well, ization. We also focus on high-resource languages and the additional support towards visually rich in- (English and French) and although we have shown formation from documents could go a long way in the capacity of the ColPali model to generalize to unlocking knowledge sources previously difficult languages outside of its fine-tuning set, it is un- to index or query. clear how the model would perform on languages Resource Release. 
For transparency, and to foster that are not as represented in the model’s language future work, we release our comprehensive bench- backbone. Finally, our setup assumes relevant doc- mark under open license and host a public leader- uments exist, but abstention methods for Informa- board14. Our models are released under the same tion Retrieval systems might be interesting to ex- usage license as the base model (Gemma Research plore in more practical settings in which confidence license for ColPali, Apache2.0 for ColIdefics2) and estimation might be important (Gisserot-Boukhlef should be used as intended by the VLM license. et al., 2024). Support. This work relies on multi-vector retriev- Acknowledgements ing derived from the ColBERT late interaction mechanism. Although some vector databases sup- This work is partially supported by Illuin Tech- port late interaction engines13 , many widely used nology, and by a grant from ANRT France. vector retrieval frameworks do not propose native This work was performed using HPC resources multi-vector support, and some engineering infras- from the CINES ADASTRA through Grant 2024- tructure efforts may be required to adapt them to AD011015443. We extend our warm thanks to work with ColPali (or ColBERT) models. Jonathan Dong, Caio Corro, Victor Pellegrain and Data. In the creation of ViDoRe, we partially rely Ender Konukoglu for their valuable feedback on on synthetic query generation based on a commer- the paper. cial large language model, which may induce some amount of bias in the generated queries. To com- pensate for this, we have iterated on the prompt- References ing strategy and given real query examples to the Ibrahim Alabdulmohsin, Xiaohua Zhai, Alexander models to help ground generation in realistic set- Kolesnikov, and Lucas Beyer. 2023. Getting ViT tings. We have further manually verified all syn- in Shape: Scaling Laws for Compute-Optimal Model thetic queries through a lengthy process to validate Design. Publisher: arXiv Version Number: 5. their relevance and their quality. Our benchmark Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, An- also includes many benchmark tasks with no syn- toine Miech, Iain Barr, Yana Hasson, Karel Lenc, thetic data, and result trends observed between all Arthur Mensch, Katie Millican, Malcolm Reynolds, tasks are correlated, further confirming the coher- Roman Ring, Eliza Rutherford, Serkan Cabi, Tengda Han, Zhitao Gong, Sina Samangooei, Marianne ence of our benchmark design. Monteiro, Jacob Menick, Sebastian Borgeaud, An- drew Brock, Aida Nematzadeh, Sahand Sharifzadeh, Ethical Considerations Mikolaj Binkowski, Ricardo Barreira, Oriol Vinyals, Andrew Zisserman, and Karen Simonyan. 2022. Carbon Footprint. Our work fully leverages prior Flamingo: a Visual Language Model for Few-Shot pretrained models and training is not particularly Learning. Publisher: arXiv Version Number: 2. compute-intensive. Furthermore, we rely on low- Anthropic. 2024. The Claude 3 Model Family: Opus, rank adapters to further reduce the computational Sonnet, Haiku. resources needed, both during training and for 14 https://huggingface.co/spaces/vidore/ 13 Vespa Engine, RAGatouille, QDrant, colbert.ai vidore-leaderboard 9 Srikar Appalaraju, Bhavan Jasani, Bhargava Urala Kota, Zheng Ge, Songtao Liu, Feng Wang, Zeming Li, and Yusheng Xie, and R. Manmatha. 2021. DocFormer: Jian Sun. 2021. YOLOX: Exceeding YOLO Series End-to-End Transformer for Document Understand- in 2021. arXiv preprint. Version Number: 2. ing. arXiv preprint. 
Version Number: 2. Gemma Team, Thomas Mesnard, Cassidy Hardin, Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Robert Dadashi, Surya Bhupatiraju, Shreya Pathak, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, Laurent Sifre, Morgane Rivière, Mihir Sanjay and Jingren Zhou. 2023. Qwen-VL: A Versatile Kale, Juliette Love, Pouya Tafti, Léonard Hussenot, Vision-Language Model for Understanding, Local- Pier Giuseppe Sessa, Aakanksha Chowdhery, Adam ization, Text Reading, and Beyond. Publisher: arXiv Roberts, Aditya Barua, Alex Botev, Alex Castro- Version Number: 3. Ros, Ambrose Slone, Amélie Héliou, Andrea Tac- chetti, Anna Bulanova, Antonia Paterson, Beth Payal Bajaj, Daniel Campos, Nick Craswell, Li Deng, Tsai, Bobak Shahriari, Charline Le Lan, Christo- Jianfeng Gao, Xiaodong Liu, Rangan Majumder, An- pher A. Choquette-Choo, Clément Crepy, Daniel Cer, drew McNamara, Bhaskar Mitra, Tri Nguyen, Mir Daphne Ippolito, David Reid, Elena Buchatskaya, Rosenberg, Xia Song, Alina Stoica, Saurabh Tiwary, Eric Ni, Eric Noland, Geng Yan, George Tucker, and Tong Wang. 2016. MS MARCO: A Human Gen- George-Christian Muraru, Grigory Rozhdestvenskiy, erated MAchine Reading COmprehension Dataset. Henryk Michalewski, Ian Tenney, Ivan Grishchenko, arXiv preprint. Version Number: 3. Jacob Austin, James Keeling, Jane Labanowski, Jean-Baptiste Lespiau, Jeff Stanway, Jenny Bren- Burton H. Bloom. 1970. Space/time trade-offs in nan, Jeremy Chen, Johan Ferret, Justin Chiu, Justin hash coding with allowable errors. Commun. ACM, Mao-Jones, Katherine Lee, Kathy Yu, Katie Milli- 13(7):422–426. Place: New York, NY, USA Pub- can, Lars Lowe Sjoesund, Lisa Lee, Lucas Dixon, lisher: Association for Computing Machinery. Machel Reid, Maciej Mikuła, Mateo Wirth, Michael Sharman, Nikolai Chinaev, Nithum Thain, Olivier Łukasz Borchmann, Michał Pietruszka, Tomasz Stanis- Bachem, Oscar Chang, Oscar Wahltinez, Paige Bai- lawek, Dawid Jurkiewicz, Michał Turski, Karolina ley, Paul Michel, Petko Yotov, Rahma Chaabouni, Szyndler, and Filip Graliński. 2021. DUE: End-to- Ramona Comanescu, Reena Jana, Rohan Anil, Ross End Document Understanding Benchmark. In Thirty- McIlroy, Ruibo Liu, Ryan Mullins, Samuel L Smith, fifth Conference on Neural Information Processing Sebastian Borgeaud, Sertan Girgin, Sholto Douglas, Systems Datasets and Benchmarks Track (Round 2). Shree Pandya, Siamak Shakeri, Soham De, Ted Kli- menko, Tom Hennigan, Vlad Feinberg, Wojciech Jianlv Chen, Shitao Xiao, Peitian Zhang, Kun Luo, Stokowiec, Yu-hui Chen, Zafarali Ahmed, Zhitao Defu Lian, and Zheng Liu. 2024. BGE M3- Gong, Tris Warkentin, Ludovic Peran, Minh Giang, Embedding: Multi-Lingual, Multi-Functionality, Clément Farabet, Oriol Vinyals, Jeff Dean, Koray Multi-Granularity Text Embeddings Through Self- Kavukcuoglu, Demis Hassabis, Zoubin Ghahramani, Knowledge Distillation. arXiv preprint. Version Douglas Eck, Joelle Barral, Fernando Pereira, Eli Number: 3. Collins, Armand Joulin, Noah Fiedel, Evan Senter, Alek Andreev, and Kathleen Kenealy. 2024. Gemma: Xi Chen, Xiao Wang, Lucas Beyer, Alexander Open Models Based on Gemini Research and Tech- Kolesnikov, Jialin Wu, Paul Voigtlaender, Basil nology. arXiv preprint. Version Number: 4. Mustafa, Sebastian Goodman, Ibrahim Alabdul- mohsin, Piotr Padlewski, Daniel Salz, Xi Xiong, Hippolyte Gisserot-Boukhlef, Manuel Faysse, Em- Daniel Vlasic, Filip Pavetic, Keran Rong, Tianli Yu, manuel Malherbe, Céline Hudelot, and Pierre Daniel Keysers, Xiaohua Zhai, and Radu Soricut. Colombo. 2024. Towards trustworthy reranking: A 2023. 
PaLI-3 Vision Language Models: Smaller, simple yet effective abstention mechanism. Preprint, Faster, Stronger. arXiv preprint. Version Number: 2. arXiv:2402.12997. Cohere. 2024. Introducing Rerank 3: A New Foun- Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan dation Model for Efficient Enterprise Search & Re- Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and trieval. Weizhu Chen. 2021. LoRA: Low-Rank Adaptation of Large Language Models. Publisher: arXiv Version Timothée Darcet, Maxime Oquab, Julien Mairal, and Number: 2. Piotr Bojanowski. 2023. Vision Transformers Need Registers. Publisher: [object Object] Version Num- Yupan Huang, Tengchao Lv, Lei Cui, Yutong Lu, and ber: 2. Furu Wei. 2022. LayoutLMv3: Pre-training for Doc- ument AI with Unified Text and Image Masking. Pub- Alexey Dosovitskiy, Lucas Beyer, Alexander lisher: arXiv Version Number: 3. Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Albert Q. Jiang, Alexandre Sablayrolles, Arthur Men- Minderer, Georg Heigold, Sylvain Gelly, Jakob sch, Chris Bamford, Devendra Singh Chaplot, Diego Uszkoreit, and Neil Houlsby. 2020. An Image de las Casas, Florian Bressand, Gianna Lengyel, Guil- is Worth 16x16 Words: Transformers for Image laume Lample, Lucile Saulnier, Lélio Renard Lavaud, Recognition at Scale. Publisher: arXiv Version Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Number: 2. Thibaut Lavril, Thomas Wang, Timothée Lacroix, 10 and William El Sayed. 2023. Mistral 7B. Publisher: Rong, Matthias Minderer, Ioana Bica, Ivana Balaze- arXiv Version Number: 1. vic, Joan Puigcerver, Julian Eisenschlos, Manoj Ku- mar, Matko Bošnjak, Matthias Bauer, Fangyu Liu, Vladimir Karpukhin, Barlas Oğuz, Sewon Min, Patrick Adam Grycner, Alexey Gritsenko, Paul Voigtlaender, Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Pinelopi Papalampidi, Olivier Henaff, Skanda Kop- Wen-tau Yih. 2020. Dense Passage Retrieval for pula, Xi Xiong, Radu Soricut, Model release contrib- Open-Domain Question Answering. arXiv preprint. utors and general support, Tris Warkentin, Kat Black, Version Number: 3. Luiz Gustavo Martins, Glenn Cameron, Raj Gund- luru, Manvinder Singh, Meg Risdal, Nilay Chauhan, Omar Khattab and Matei Zaharia. 2020. ColBERT: Nate Keating, Nesh Devanathan, Elisa Bandy, Joe Efficient and Effective Passage Search via Contextu- Fernandez, Antonia Paterson, Jenny Brennan, Tom alized Late Interaction over BERT. Eccles, Pankil Botadra, Ben Bariach, Lav Rai, Min- woo Park, Dustin Luong, Daniel Vlasic, Bo Wu, Wen- Geewook Kim, Teakgyu Hong, Moonbin Yim, ming Ye, Divyashree Sreepathihalli, Kiranbir Sodhia, Jeongyeon Nam, Jinyoung Park, Jinyeong Yim, Won- Alek Andreev, Armand Joulin, Surya Bhupatiraju, seok Hwang, Sangdoo Yun, Dongyoon Han, and Minh Giang, Joelle Barral, and Zoubin Ghahramani. Seunghyun Park. 2021. OCR-free Document Un- 2024. PaliGemma. derstanding Transformer. arXiv preprint. Version Number: 5. Minesh Mathew, Viraj Bagal, Rubèn Pérez Tito, Dimos- thenis Karatzas, Ernest Valveny, and C. V Jawahar. Andreas Koukounas, Georgios Mastrapas, Michael Gün- 2021. InfographicVQA. arXiv preprint. Version ther, Bo Wang, Scott Martens, Isabelle Mohr, Saba Number: 2. Sturua, Mohammad Kalim Akram, Joan Fontanals Martínez, Saahil Ognawala, Susana Guzman, Maxi- Minesh Mathew, Dimosthenis Karatzas, and C. V. Jawa- milian Werk, Nan Wang, and Han Xiao. 2024. Jina har. 2020. DocVQA: A Dataset for VQA on Docu- CLIP: Your CLIP Model Is Also Your Text Retriever. ment Images. arXiv preprint. Version Number: 1. 
Niklas Muennighoff, Nouamane Tazi, Loïc Magne, and Hugo Laurençon, Léo Tronchon, Matthieu Cord, and Nils Reimers. 2022. MTEB: Massive Text Embed- Victor Sanh. 2024. What matters when build- ding Benchmark. arXiv preprint. Version Number: ing vision-language models? arXiv preprint. 3. ArXiv:2405.02246 [cs]. Nomic. 2024. Nomic Embed Vision: Expanding The Jinhyuk Lee, Zhuyun Dai, Sai Meher Karthik Duddu, Nomic Latent Space. Tao Lei, Iftekhar Naim, Ming-Wei Chang, and Vin- Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya cent Y. Zhao. 2023. Rethinking the Role of Token Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sas- Retrieval in Multi-Vector Retrieval. arXiv preprint. try, Amanda Askell, Pamela Mishkin, Jack Clark, Version Number: 3. Gretchen Krueger, and Ilya Sutskever. 2021. Learn- ing Transferable Visual Models From Natural Lan- Lei Li, Yuqi Wang, Runxin Xu, Peiyi Wang, Xiachong guage Supervision. Publisher: arXiv Version Num- Feng, Lingpeng Kong, and Qi Liu. 2024. Multimodal ber: 1. arxiv: A dataset for improving scientific compre- hension of large vision-language models. Preprint, Nils Reimers and Iryna Gurevych. 2019. Sentence- arXiv:2403.00231. BERT: Sentence Embeddings using Siamese BERT- Networks. arXiv preprint. Version Number: 1. Tsung-Yi Lin, Michael Maire, Serge Belongie, Lubomir Bourdev, Ross Girshick, James Hays, Pietro Perona, Stephen E. Robertson, Steve Walker, Susan Jones, Deva Ramanan, C. Lawrence Zitnick, and Piotr Dol- Micheline Hancock-Beaulieu, and Mike Gatford. lár. 2014. Microsoft COCO: Common Objects in 1994. Okapi at TREC-3. In Proceedings of The Third Context. arXiv preprint. Version Number: 3. Text REtrieval Conference, TREC 1994, Gaithers- burg, Maryland, USA, November 2-4, 1994, volume Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae 500-225 of NIST Special Publication, pages 109– Lee. 2023a. Improved Baselines with Visual Instruc- 126. National Institute of Standards and Technology tion Tuning. arXiv preprint. Version Number: 2. (NIST). Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Keshav Santhanam, Omar Khattab, Christopher Potts, Lee. 2023b. Visual Instruction Tuning. Publisher: and Matei Zaharia. 2022. PLAID: An Efficient En- arXiv Version Number: 1. gine for Late Interaction Retrieval. arXiv preprint. Version Number: 1. Lucas Beyer*, Andreas Steiner*, André Susano Pinto*, Alexander Kolesnikov*, Xiao Wang*, Xiaohua R. Smith. 2007. An Overview of the Tesseract OCR Zhai*, Daniel Salz, Maxim Neumann, Ibrahim Al- Engine. In Ninth International Conference on Doc- abdulmohsin, Michael Tschannen, Jeremiah Harm- ument Analysis and Recognition (ICDAR 2007) Vol sen, Daniel Keysers, Neil Houlsby, Xi Chen, 2, pages 629–633, Curitiba, Parana, Brazil. IEEE. Emanuele Bugliarello, Thomas Unterthiner, Keran ISSN: 1520-5363. 11 Karen Sparck Jones. 1972. A STATISTICAL INTER- A Benchmark Datasets PRETATION OF TERM SPECIFICITY AND ITS APPLICATION IN RETRIEVAL. Journal of Docu- A.1 Academic Datasets mentation, 28(1):11–21. DocVQA (Mathew et al., 2020) includes collected Zineng Tang, Ziyi Yang, Guoxin Wang, Yuwei Fang, images from the UCSF Industry Documents Li- Yang Liu, Chenguang Zhu, Michael Zeng, Cha brary. Questions and answers were manually anno- Zhang, and Mohit Bansal. 2022. Unifying Vision, Text, and Layout for Universal Document Processing. tated. arXiv preprint. Version Number: 3. 
InfoVQA (Mathew et al., 2021) includes infograph- ics collected from the Internet using the search Nandan Thakur, Nils Reimers, Andreas Rücklé, Ab- hishek Srivastava, and Iryna Gurevych. 2021. BEIR: query “infographics”. Questions and answers were A Heterogenous Benchmark for Zero-shot Evalua- manually annotated. tion of Information Retrieval Models. arXiv preprint. TAT-DQA (Zhu et al., 2022) is a large-scale Docu- Version Number: 4. ment VQA dataset that was constructed from pub- Ashish V. Thapliyal, Jordi Pont-Tuset, Xi Chen, and licly available real-world financial reports. It fo- Radu Soricut. 2022. Crossmodal-3600: A Massively cuses on rich tabular and textual content requiring Multilingual Multimodal Evaluation Dataset. arXiv numerical reasoning. Questions and answers were preprint. Version Number: 2. manually annotated by human experts in finance. Liang Wang, Nan Yang, Xiaolong Huang, Binxing arXivQA (Li et al., 2024) is a VQA dataset based Jiao, Linjun Yang, Daxin Jiang, Rangan Majumder, on figures extracted from arXiv publications. The and Furu Wei. 2022. Text Embeddings by Weakly- Supervised Contrastive Pre-training. arXiv preprint. questions were generated synthetically using GPT- Version Number: 2. 4 Vision. TabFQuAD (Table French Question Answering Wenhui Wang, Furu Wei, Li Dong, Hangbo Bao, Nan Yang, and Ming Zhou. 2020. MiniLM: Deep Self- Dataset) is designed to evaluate TableQA models Attention Distillation for Task-Agnostic Compres- in realistic industry settings. We create additional sion of Pre-Trained Transformers. arXiv preprint. queries to augment the existing human-annotated ArXiv:2002.10957 [cs]. ones using the same method described in subsec- Lewei Yao, Runhui Huang, Lu Hou, Guansong Lu, tion A.2. Minzhe Niu, Hang Xu, Xiaodan Liang, Zhenguo Li, Xin Jiang, and Chunjing Xu. 2021. FILIP: Fine- A.2 Practical Datasets grained Interactive Language-Image Pre-Training. arXiv preprint. Version Number: 1. Methodology. Creating a relevant retrieval dataset close to real use cases is a major challenge as the Xiang Yue, Yuansheng Ni, Kai Zhang, Tianyu Zheng, dataset needs to be both sufficiently large for effec- Ruoqi Liu, Ge Zhang, Samuel Stevens, Dongfu tive fine-tuning and sufficiently diverse to cover a Jiang, Weiming Ren, Yuxuan Sun, Cong Wei, Botao Yu, Ruibin Yuan, Renliang Sun, Ming Yin, Boyuan broad range of modalities (full text, tables, charts, Zheng, Zhenzhu Yang, Yibo Liu, Wenhao Huang,...), domains (industry, healthcare,...), and query- Huan Sun, Yu Su, and Wenhu Chen. 2023. MMMU: document interactions (extractive questions, open- A Massive Multi-discipline Multimodal Understand- ended questions,...). Our approach to building this ing and Reasoning Benchmark for Expert AGI. arXiv preprint. Version Number: 3. dataset involves several steps: (1) we use a web crawler to collect publicly available documents on Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, various themes and sources, (2) we convert these and Lucas Beyer. 2023. Sigmoid Loss for Language Image Pre-Training. Publisher: [object Object] Ver- PDFs into a series of images, one per page, and (3) sion Number: 4. we generate queries related to each image using a VLM. Ruochen Zhao, Hailin Chen, Weishi Wang, Fangkai Jiao, Xuan Long Do, Chengwei Qin, Bosheng Ding, Web-Crawler. We implemented a web crawler to Xiaobao Guo, Minzhi Li, Xingxuan Li, and Shafiq efficiently collect large volumes of documents re- Joty. 2023. Retrieving Multimodal Information for lated to a given topic. 
The crawler is seeded with Augmented Generation: A Survey. arXiv preprint. a user-defined query (e.g. "artificial intelligence") Version Number: 3. and then uses GPT-3.5 Turbo to brainstorm related Fengbin Zhu, Wenqiang Lei, Fuli Feng, Chao Wang, topics and subtopics. This query augmentation Haozhou Zhang, and Tat-Seng Chua. 2022. Towards strategy aims at both broadening and deepening the Complex Document Understanding By Discrete Rea- soning. Publisher: arXiv Version Number: 3. search. GPT-3.5 Turbo is further used to generate diverse search queries from each subtopic. This 12 query set is then consumed by a pool of parallel You are an assistant specialized in workers whose job is to fetch the associated most ,→ Multimodal RAG tasks.\n relevant documents. We use SerpAPI15 along with The task is the following: given an image ,→ from a pdf page, you will have to a filetype filter (PDF documents only) to program- generate questions that can be asked by a matically scrape Google Search rankings. Each ,→ user to retrieve information from file is hashed and stored in a Bloom filter (Bloom, a large documentary corpus. The question should be relevant to the 1970) shared among workers to avoid duplicate doc- ,→ page, and should not be too specific uments in the final corpus. Unique scraped files are or too general. The question should be downloaded, and inserted into a SQLite database ,→ about the subject of the page, and the answer needs to be found in the page. along with additional metadata. ,→ \n Remember that the question is asked by a ,→ user to get some information from a large documentary corpus that contains ,→ multimodal data. Generate a question Datamix. Using the web crawler, we collected that could be asked by a user without ,→ knowing the existence and the content approximately 1,000 documents for each of the of the corpus. \n following four seeds: "energy", "government re- Generate as well the answer to the ports", "healthcare industry", and "artificial intel- ,→ question, which should be found in the page. And the format of the answer should ligence". These seeds were meticulously hand- ,→ be a list of words answering the picked to align with real-use cases for retrieval question. \n models and visually rich pages. We also removed Generate at most THREE pairs of questions all documents containing any private information. ,→ and answers per page in a At this stage, we randomly selected 900 files for dictionary with the following format, ,→ answer ONLY this dictionary the training set and 100 files for the test set, ensur- NOTHING ELSE: \n ing that data leakage into the test set was avoided during subsequent processing steps. { "questions": [ { "question": "XXXXXX", "answer": ["YYYYYY"] }, { Query Generation. To increase the efficiency of "question": "XXXXXX", our query generation scheme and to limit API calls, "answer": ["YYYYYY"] we generate at most 3 questions per image. From }, { all the documents collected, we randomly sample "question": "XXXXXX", 10,000 images per theme and call Claude-3 Sonnet "answer": ["YYYYYY"] with the following prompt: }, ] } where XXXXXX is the question and ,→ ['YYYYYY'] is the corresponding list ,→ of answers that could be as long as needed. \n Note: If there are no questions to ask ,→ about the page, return an empty list. Focus on making relevant questions ,→ concerning the page. \n Here is the page: \n Human Validation. 
We manually validate every single synthetically created query in ViDoRe to en- sure quality, query relevance, and consistency with the benchmark objective of evaluating retrieval in practical industrial settings. During this step, we 15 https://serpapi.com/ randomly assign document-pair queries to 4 vol- 13 unteer annotators and instruct them to filter out lions of documents. With this criterion in view, we queries that do not fit the above-listed criteria. We have compared the embedding sizes of the models also instruct annotators to flag any documents they in our study. As shown in Table 3, ColPali’s em- deem to contain PII information or content not bedding size is an order of magnitude larger than suited for an academic benchmark. No flag was BM25 and two orders of magnitude larger than raised during the entirety of the process, validating BGE-M3. However, this study is limited to the our prior PDF collection strategy. 100 queries per naive method of storing ColPali’s multi-vector em- topic are collected in this manner. Annotators are beddings. In practical scenarios, using cluster cen- colleagues and collaborators of the authors who troids can reduce the size of ColPali multi-vector volunteered to help. Each annotator spent approxi- embeddings by up to an order of magnitude (San- mately 3 hours filtering the larger query set down thanam et al., 2022) and make it a competitive to 100 high-quality queries per topic. retrieval system. B Implementation details Model Embedding size (KB) B.1 Codebase BGE-M3 8.60 The codebase is written in PyTorch16 and leverages BM25 (dense emb.) 3.00 HuggingFace tooling for model implementations BM25 (sparse emb.) 1.56 ± 0.51 and trainers17. ColPali (float16) 256 B.2 Pairwise CE loss Table 3: Comparison of the embedding sizes for the Our in-batch contrastive loss L is defined as the DocVQA test set from ViDoRe w.r.t. different retrieval models. The lower the size the smaller the storage softmaxed cross-entropy of the positive scores footprint of the model. The mean ± std size is given for s+ k = LI (dk , qk ) w.r.t. to the maximal negative the sparse embeddings. scores s− k = max LI (qk , pl ). l,l̸=k For numerical stability, we reformulate the loss with the softplus function, leading to: b B.5 Latency computations 1X softplus s− − s+  L= k k (2) b All latency computations are done on a NVIDIA k=1 L4 GPU. Queries are encoded independently (batch B.3 Hyperparameters size of 1) to simulate online querying, and pages Hyperparameters are tuned on a validation split are encoded with a batch size of 4 for PaliGemma composed of 2% of the training dataset. We find derived models, and 8 for BGE-M3. Reported bi-encoder methods to be more sensible to learning times include image and text processing time be- rate variations than late interaction-based models fore the model forward pass, as well as query-to- and achieve the best performance for all models index matching times. We note an interesting fea- with a learning rate of 5e − 5. We experiment with ture of ColPali is that all documents have the same LoRA rank and α values and do not notice partic- sequence length, leading to prior knowledge of run- ular improvements past r = α = 32. Per-device time and memory consumptions. Query latency batch sizes are kept small due to long sequence experiments are averaged over 1000 queries, and lengths that complicate scaling past b = 4. Simu- indexing times are measured for a 100 page docu- lating larger batch sizes for in-batch negative sam- ment. 
B.3 Hyperparameters

Hyperparameters are tuned on a validation split composed of 2% of the training dataset. We find bi-encoder methods to be more sensitive to learning rate variations than late interaction-based models, and achieve the best performance for all models with a learning rate of 5e-5. We experiment with LoRA rank and α values and do not notice particular improvements past r = α = 32. Per-device batch sizes are kept small due to long sequence lengths that complicate scaling past b = 4. Simulating larger batch sizes for in-batch negative sampling should enable even better results. We find the best results with a global batch size of b = 32 for 1 epoch on our training set.

B.4 Embedding size

Minimizing the storage footprint can be essential to industrial retrieval systems if databases contain millions of documents. With this criterion in view, we have compared the embedding sizes of the models in our study. As shown in Table 3, ColPali's embedding size is an order of magnitude larger than BM25's and two orders of magnitude larger than BGE-M3's. However, this study is limited to the naive method of storing ColPali's multi-vector embeddings. In practical scenarios, using cluster centroids can reduce the size of ColPali multi-vector embeddings by up to an order of magnitude (Santhanam et al., 2022) and make it a competitive retrieval system.

Model                 Embedding size (KB)
BGE-M3                8.60
BM25 (dense emb.)     3.00
BM25 (sparse emb.)    1.56 ± 0.51
ColPali (float16)     256

Table 3: Comparison of the embedding sizes for the DocVQA test set from ViDoRe w.r.t. different retrieval models. The lower the size, the smaller the storage footprint of the model. The mean ± std size is given for the sparse embeddings.
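As a rough back-of-the-envelope check on the ColPali entry in Table 3 (this calculation is ours; the exact per-page token count is an assumption), storing roughly 1,024 patch embeddings of dimension 128 in float16 per page gives about 256 KB:

```python
# Approximate per-page footprint of naive ColPali multi-vector storage.
# Assumptions: ~1,024 image-patch vectors per page, 128 dimensions, 2 bytes per float16 value.
n_vectors, dim, bytes_per_value = 1024, 128, 2
page_kb = n_vectors * dim * bytes_per_value / 1024
print(page_kb)  # 256.0, the order of magnitude reported in Table 3
```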
B.5 Latency computations

All latency computations are done on an NVIDIA L4 GPU. Queries are encoded independently (batch size of 1) to simulate online querying, and pages are encoded with a batch size of 4 for PaliGemma-derived models, and 8 for BGE-M3. Reported times include image and text processing time before the model forward pass, as well as query-to-index matching times. We note that an interesting feature of ColPali is that all documents have the same sequence length, giving prior knowledge of runtime and memory consumption. Query latency experiments are averaged over 1,000 queries, and indexing times are measured for a 100-page document. Per-page time is obtained by dividing the total time by 100, corresponding to the inverse page throughput.

B.6 Captioning

Examples of captions generated for visually rich document chunks with Claude-3 Sonnet are shown in Figure 5 and Figure 6. The prompt used for generating the description is the following:

You are an assistant specialized in document analysis. Given a table or a figure, you have to provide a detailed summary of the content in maximum 3000 characters. Your summary should be qualitative and not quantitative. Here is the table/figure to analyze: {image}. Answer ONLY with the caption of the table/figure.

Figure 5: Example from the "Energy" test set. Caption: The image depicts the hourly energy generation profile, illustrating the contributions of various energy sources over 24 hours. The data is presented as a stacked bar chart, with the x-axis representing the hours of the day from 1 to 24, and the y-axis showing the average hourly generation in MW. The bars are segmented into different colors, each representing a distinct energy source: nuclear, bio, geothermal, solar, wind, hydro, natural gas, and other imports. The chart provides insights into the temporal variations in energy generation across different sources, highlighting the interplay between baseload and intermittent sources throughout the day.

Figure 6: Example from the "Government Reports" test set. Caption: The image shows a table titled "System of Record" which outlines the different types of documents or records maintained across various systems or departments within an organization related to project management and construction. The rows list documents like project plans, budgets, schedules, contracts, purchase orders, invoices, change requests, bid submissions, drawings, manuals, meeting minutes, and reports. The columns indicate the system or department responsible for maintaining each record, such as County Servers, Project View, OnBase, CGI Advantage Financial System, and Purchasing Department. The table uses "W" and "T" markers to denote which system or department serves as the primary source (writer) or storage location (trailer) for each type of document.

C More similarity maps

In Figure 7, ColPali assigns a high similarity to all patches containing the word "Kazakhstan" when given the corresponding query token. Moreover, our model seems to exhibit world knowledge capabilities, as the patch around the word "Kashagan" (an offshore oil field in Kazakhstan) also shows a high similarity score. On the other hand, in Figure 8, we observe that ColPali is also capable of complex image understanding. Not only are the patches containing the word "formulations" highly similar to the query token _formula, but so is the upper-left molecule structure. It is also interesting to highlight that both similarity maps showcase a few white patches with high similarity scores. This behavior might at first seem surprising, as the white patches should not carry a meaningful signal from the original images. We believe the vectors associated with these patches share a similar role with the ViT registers (Darcet et al., 2023), i.e., these patches were repurposed for internal computations and store global information about the whole image.

Figure 7: Similarity of the image patches w.r.t. the underlined token in the user query. This example is from the Shift test set.

Figure 8: Similarity of the image patches w.r.t. the underlined token in the user query. This example is from the Healthcare Industry test set.
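Similarity maps such as those in Figures 7 and 8 can in principle be obtained by taking the dot product of a single query-token embedding with every image-patch embedding and reshaping the scores to the ViT patch grid; a minimal sketch follows, in which the 32x32 grid and the tensor names are assumptions.

```python
import torch


def token_similarity_map(query_token: torch.Tensor,
                         patch_embeddings: torch.Tensor,
                         grid: tuple[int, int] = (32, 32)) -> torch.Tensor:
    """Similarity of one query token to every page patch.

    query_token: (dim,) embedding of the underlined query token
    patch_embeddings: (n_patches, dim) page embeddings, with n_patches == grid[0] * grid[1]
    Returns a (rows, cols) map that can be upsampled and overlaid on the page image.
    """
    scores = patch_embeddings @ query_token  # dot product with each patch vector
    return scores.view(*grid)                # reshape to the ViT patch grid
```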
D Additional results

D.1 Other Metrics

                         ArxivQ      DocQ        InfoQ       TabF        TATQ        Shift       AI          Energy      Gov.        Health.     Avg.
Unstructured, text only
BM25                     -           26.6        -           -           34.6        45.0        86.0        70.0        68.0        74.0        -
BGE-M3                   -           22.8 ↓3.8   -           -           26.1 ↓8.5   51.0 ↑6.0   81.0 ↓5.0   72.0 ↑2.0   67.0 ↓1.0   77.0 ↑3.0   -
Unstructured + OCR
BM25                     26.7        28.9        54.0        30.4        50.0        52.0        86.0        77.0        74.0        80.0        55.9
BGE-M3                   28.1 ↑1.4   22.9 ↓6.0   53.8 ↓0.2   55.7 ↑25.3  38.6 ↓11.4  56.0 ↑4.0   82.0 ↓4.0   79.0 ↑2.0   76.0 ↑2.0   83.0 ↑3.0   57.5 ↑1.6
Unstructured + Captioning
BM25                     35.5        30.2        61.5        24.3        49.0        47.0        79.0        76.0        75.0        81.0        55.9
BGE-M3                   29.3 ↓6.2   26.0 ↓4.2   62.1 ↑0.6   58.6 ↑34.3  30.6 ↓18.4  55.0 ↑8.0   80.0 ↑1.0   78.0 ↑2.0   69.0 ↓6.0   83.0 ↑2.0   57.2 ↑1.3
Contrastive VLMs
Jina-CLIP                19.4        7.3         26.7        12.5        1.6         2.0         11.0        13.0        15.0        17.0        12.6
Nomic-vision             10.4        6.7         22.1        9.6         1.6         0.0         9.0         9.0         7.0         13.0        8.8
SigLIP (Vanilla)         34.2        21.3        51.8        46.1        17.9        13.0        50.0        51.0        47.0        65.0        39.7
Ours
SigLIP (Vanilla, copied) 34.2        21.3        51.8        46.1        17.9        13.0        50.0        51.0        47.0        65.0        39.7
BiSigLIP (+fine-tuning)  49.2 ↑15.0  23.8 ↑2.5   59.0 ↑7.2   52.1 ↑6.0   20.7 ↑2.8   16.0 ↑3.0   62.0 ↑12.0  61.0 ↑10.0  55.0 ↑8.0   72.0 ↑7.0   47.1 ↑7.4
BiPali (+LLM)            46.4 ↓2.8   20.0 ↓3.8   54.6 ↓4.4   63.2 ↑11.1  20.4 ↓0.4   34.0 ↑18.0  59.0 ↓3.0   45.0 ↓16.0  57.0 ↑2.0   56.0 ↓16.0  45.6 ↓1.5
ColPali (+Late Inter.)   72.4 ↑26.0  45.6 ↑25.6  74.6 ↑20.0  75.4 ↑12.1  53.1 ↑32.7  55.0 ↑21.0  93.0 ↑34.0  85.0 ↑40.0  85.0 ↑28.0  88.0 ↑32.0  72.7 ↑27.1

Table 4: Comprehensive evaluation of baseline models and our proposed method on ViDoRe. Results are presented using Recall@1 metrics. Text-only metrics are not computed for benchmarks with only visual elements.

D.2 Model Variants

                          ArxivQ  DocQ  InfoQ  TabF  TATQ  Shift  AI    Energy  Gov.  Health.  Avg.
ColSigLIP (PaliGemma)     3.1     3.0   5.1    6.2   2.5   1.0    3.4   3.4     2.3   2.2      3.2
BiSigLIP (PaliGemma)      18.5    14.6  33.4   39.5  16.1  5.2    27.6  32.6    36.6  35.7     26.0
ColSigLIP (Original)      2.6     2.2   2.3    5.7   1.8   1.0    2.6   4.1     1.4   1.5      2.5
ColPali (No Mem. Tokens)  80.4    53.2  82.4   77.4  65.7  63.4   97.0  89.9    93.6  92.4     79.6
ColPali (Best)            79.1    54.4  81.8   83.9  65.8  73.2   96.2  91.0    92.7  94.4     81.3

Table 5: Evaluation of some "negative results" and ablations on ViDoRe; ColPali is given for reference. Results are presented using NDCG@5 metrics. Text-only metrics are not computed for benchmarks with only visual elements.

E ViDoRe examples

Energy
- Query: What types of accounts or products allow investors to defer paying taxes?
- Query: What is the projected peak electricity demand in California for the year 2030?
- Query: What is the estimated total savings for a PV system in Durham under the net metering (flat rate) billing option over the system's useful life of 25 years?

Artificial Intelligence
- Query: What are some common outcome areas targeted by TAII for different age groups?
- Query: What did the robot monitor to determine when to activate or deactivate the blower motor and blinker?
- Query: What is the key approach used in the PDP architecture?

Healthcare Industry
- Query: What is the chemical formula for the ferroelectric material Lead Zirconium Titanate (PZT)?
- Query: What government entities are involved in public financing for healthcare in the US?
- Query: What does the AVPU scale stand for in assessing the level of consciousness of a seriously ill child?

Government Reports
- Query: What are some mandates for the EPA under the Pollution Prevention Act?
- Query: What is the strategy of KPMG Hazem Hassan?
- Query: What is the trust signal score for the consumer industry best-in-class archetype?

Shift (queries translated from French)
- Query: According to the chart, what are the projected import capacity and actual consumption of SAF (sustainable aviation biofuels) in 2050?
- Query: What share of Kazakhstan's oil production comes from offshore fields?
- Query: Which countries have the largest share of cumulative crude oil discoveries in 2020 (in thousands of barrels, excluding cumulative discoveries)?
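For reference, the metrics reported in Appendix D (Recall@1 in Table 4, NDCG@5 in Table 5) can be computed from a ranked list of pages as in the following minimal sketch, assuming binary relevance with a single gold page per query, as in the ViDoRe query-generation setup described above.

```python
import math


def recall_at_1(ranked_page_ids: list[str], gold_page_id: str) -> float:
    """1.0 if the relevant page is ranked first, else 0.0."""
    return float(ranked_page_ids[0] == gold_page_id)


def ndcg_at_5(ranked_page_ids: list[str], gold_page_id: str) -> float:
    """Binary-relevance NDCG@5 with a single gold page: the ideal DCG is 1,
    so the score is 1 / log2(rank + 1) if the page appears in the top 5."""
    for rank, page_id in enumerate(ranked_page_ids[:5], start=1):
        if page_id == gold_page_id:
            return 1.0 / math.log2(rank + 1)
    return 0.0
```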
