Abstract
Multimodal large language models (MLLMs) achieve strong performance on benchmarks that evaluate text, image, or video understanding in isolation. These settings, however, miss a critical real-world requirement: retrieving relevant evidence from large, heterogeneous multimodal corpora before reasoning. Most existing benchmarks restrict retrieval to small, single-modality candidate sets, which substantially simplifies the search space and overstates end-to-end reliability. To address this gap, we introduce MultiHaystack, the first benchmark designed to evaluate both retrieval and reasoning under large-scale, cross-modal conditions. MultiHaystack comprises over 46,000 multimodal retrieval candidates spanning documents, images, and videos, along with 747 open yet verifiable questions. Each question is grounded in a unique validated evidence item within the retrieval pool, requiring evidence localization across modalities and fine-grained reasoning. We find that models perform competitively when given the corresponding evidence but degrade sharply when required to retrieve it from the full corpus: even the strongest retriever, E5-V, achieves only 40.8% Recall@1, and state-of-the-art MLLMs such as GPT-5 drop from 80.86% reasoning accuracy with gold evidence to 51.4% under top-5 retrieval. These results indicate that multimodal retrieval over heterogeneous pools remains a primary bottleneck for MLLMs, positioning MultiHaystack as a testbed that exposes limitations obscured by small-scale evaluations and motivates retrieval-centric advances in multimodal systems.
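Because each question has exactly one gold evidence item, the retrieval metric reported above (Recall@k) reduces to checking whether that item appears among the top-k retrieved candidates. A minimal sketch, where the candidate IDs and rankings are purely illustrative and not drawn from the benchmark:

```python
def recall_at_k(ranked_ids, gold_id, k):
    """Single-gold Recall@k: 1 if the unique gold evidence item
    appears among the top-k retrieved candidates, else 0."""
    return int(gold_id in ranked_ids[:k])

# Hypothetical retriever outputs for three queries: (ranking, gold item).
rankings = [
    (["img_12", "doc_03", "vid_07"], "doc_03"),  # gold at rank 2
    (["vid_07", "img_44", "doc_19"], "vid_07"),  # gold at rank 1
    (["doc_01", "doc_02", "img_05"], "vid_99"),  # gold not retrieved
]

# Corpus-level Recall@k is the mean of the per-query scores.
recall1 = sum(recall_at_k(r, g, 1) for r, g in rankings) / len(rankings)
recall5 = sum(recall_at_k(r, g, 5) for r, g in rankings) / len(rankings)
```

Under this convention, reasoning "under top-5 retrieval" means the model answers from whatever the retriever placed in its top 5, so any query where Recall@5 is 0 cannot be answered from the provided context.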
Examples of six tasks in MultiHaystack: Visual Parsing & Positioning (spatial layouts), Contextual Understanding (embedded text), Video Temporal Reasoning (motion/order), Statistical Reasoning (charts/tables), Metadata Identification (affiliations/timestamps), and Factual Knowledge Retrieval (corpus-grounded facts).
Benchmark construction pipeline. MultiHaystack is built in four stages: collecting diverse multimodal sources, generating targeted QA pairs, filtering for unique, grounded answers, and enriching the candidate pool with additional data. This design ensures coverage of six tasks (Fig. 3) and avoids the unimodal, small-scale, or ambiguous-answer limitations of prior benchmarks.
Error cases. Retrieval errors include modality bias (retrieving images instead of video evidence) and semantic drift (violating temporal constraints); reasoning errors include visual numeracy (misreading numbers in charts) and layout-aware multi-step reasoning (failing to integrate structured cues across layouts).
Performance on MultiHaystack. "Gold in Top-1/5" directly provides answer-containing files; "Single-Modality" and "Cross-Modality" require retrieval within one or across multiple modalities.
Top-k ablation analysis for MLLMs integrated with E5-V.
Comparison across three modalities: (a) video, (b) image, and (c) document.
Pool-size controlled comparison. Recall of MM-Embed under single-modality retrieval and mixed-modality retrieval with an identical total pool size. Performance remains substantially lower in the mixed-modality condition, indicating that the cross-modal gap arises from modality heterogeneity rather than pool size.
Effect of data enrichment under varying candidate pool sizes, showing that recall consistently drops as the pool expands.
BibTeX
@article{YourPaperKey2024,
  title={Your Paper Title Here},
  author={First Author and Second Author and Third Author},
  journal={Conference/Journal Name},
  year={2024},
  url={https://your-domain.com/your-project-page}
}