by Dosung Lee, Sangwon Jung, Boyoung Kim, Minyoung Kim, Sungyeon Kim, Junyoung Sung, Paul Hongsuck Seo
This repository provides the RETINA benchmark, a novel and large-scale dataset for Multimodal Knowledge-Based Visual Question Answering (MKB-VQA).
RETINA was introduced to overcome a critical limitation of existing MKB-VQA datasets, the "visual shortcut": models could often succeed by simply matching the query image to the primary subject entity of the target document.
RETINA is explicitly constructed to eliminate this bias, forcing models to rely on genuine relational knowledge.
The benchmark's construction process ensures that the query image depicts a secondary, related entity mentioned in the document, rather than the document's main subject.
For instance, given the question "What animal mainly eats this fruit?" with an image of an 🍎Apple, the answer might be found in the document about "🐻Bears", not "🍎Apples".
This setup reflects complex, real-world scenarios where knowledge retrieval must go beyond direct visual matching.
TODO-List
- Release RETINA bench.
- Clean up directory structure and paths.
The RETINA benchmark, including the large training set and the human-curated test set, is available for download and use on Hugging Face:
For EVQA and Infoseek, including the images and the textual knowledge base, please refer to Weizhe Lin et al.:
For the document images, please refer to Lianghao Deng et al. and images.zip.
| Component | Size | Note |
|---|---|---|
| Training Set | 120k samples | Automatically generated via an LLM-driven pipeline. |
| Test Set | 2k samples | Human-curated. |
We build upon M2KR_Images, M2KR_passages, and lhdeng-gh/MuKA. For image verification we use the ImageHash toolkit.

The RETINA dataset is intended for non-commercial research purposes only. Users are solely responsible for any and all use of the dataset. The creators of this benchmark and their affiliated institutions shall not be held liable for any damages, consequences, or legal issues arising from its use.
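The duplicate-detection idea behind the ImageHash toolkit can be sketched without any dependencies. The toy version below implements an average hash on a plain 8x8 grid of grayscale values (the real library's `imagehash.average_hash` downscales a PIL image first) and treats a small Hamming distance between two hashes as a near-duplicate signal. The 8x8 input size and the distance threshold are illustrative assumptions, not values used in RETINA's pipeline.

```python
# Dependency-free sketch of a perceptual "average hash", the technique the
# ImageHash toolkit implements. Operates on a 2-D list of grayscale values
# instead of a PIL image (assumption: input is pre-downscaled to 8x8).

def average_hash(pixels):
    """Return a 64-bit hash: bit is 1 where a pixel is >= the mean, else 0."""
    flat = [p for row in pixels for p in row]
    mean = sum(flat) / len(flat)
    bits = 0
    for p in flat:
        bits = (bits << 1) | (1 if p >= mean else 0)
    return bits

def hamming(a, b):
    """Number of differing bits; a small distance suggests a near-duplicate."""
    return bin(a ^ b).count("1")

if __name__ == "__main__":
    # A synthetic gradient image and a slightly perturbed copy.
    img = [[(r * 8 + c) % 256 for c in range(8)] for r in range(8)]
    noisy = [row[:] for row in img]
    noisy[0][0] += 1  # tiny perturbation
    # Near-identical images yield near-identical hashes.
    assert hamming(average_hash(img), average_hash(noisy)) <= 2
```

In practice a pair of images would be flagged as duplicates when the Hamming distance between their hashes falls below a chosen threshold.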
@misc{lee2025breakingvisualshortcutsmultimodal,
title={Breaking the Visual Shortcuts in Multimodal Knowledge-Based Visual Question Answering},
author={Dosung Lee and Sangwon Jung and Boyoung Kim and Minyoung Kim and Sungyeon Kim and Junyoung Sung and Paul Hongsuck Seo},
year={2025},
eprint={2511.22843},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2511.22843},
}
