by Dosung Lee, Sangwon Jung, Boyoung Kim, Minyoung Kim, Sungyeon Kim, Junyoung Sung, Paul Hongsuck Seo
This repository provides the RETINA benchmark, a novel and large-scale dataset for Multimodal Knowledge-Based Visual Question Answering (MKB-VQA).
RETINA was introduced to overcome a critical limitation of existing MKB-VQA datasets, the "visual shortcut": models could often succeed by simply matching the query image to the primary subject entity of the target document.
RETINA is explicitly constructed to eliminate this bias, forcing models to rely on genuine relational knowledge.
The benchmark's construction process ensures that the query image depicts a secondary, related entity mentioned in the document, rather than the document's main subject.
For instance, given the question "What animal mainly eats this fruit?" with an image of an 🍎Apple, the answer might be found in the document about "🐻Bears", not "🍎Apples".
This setup reflects complex, real-world scenarios where knowledge retrieval must go beyond direct visual matching.
TODO-List
- Release RETINA bench.
- Clean up directory structure and paths.
The RETINA benchmark, including the large training set and the human-curated test set, is available for download and use on Hugging Face:
For EVQA and Infoseek, including the images and the textual knowledge base, please refer to Weizhe Lin et al.:
For the document images, please refer to Lianghao Deng et al. and images.zip.
| Component | Size | Note |
|---|---|---|
| Training Set | 120k samples | Automatically generated via an LLM-driven pipeline. |
| Test Set | 2k samples | Human-curated. |
We build upon M2KR_Images, M2KR_passages, and lhdeng-gh/MuKA. For image verification we use the ImageHash toolkit.

The RETINA dataset is intended for non-commercial research purposes only. Users are solely responsible for any and all use of the dataset. The creators of this benchmark and their affiliated institutions shall not be held liable for any damages, consequences, or legal issues arising from its use.
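The duplicate-detection idea behind the ImageHash toolkit can be sketched without any dependencies. The toy version below implements an average hash on a plain 8x8 grid of grayscale values (the real library's `imagehash.average_hash` downscales a PIL image first) and treats a small Hamming distance between two hashes as a near-duplicate signal. The 8x8 input size and the distance threshold are illustrative assumptions, not values used in RETINA's pipeline.

```python
# Dependency-free sketch of a perceptual "average hash", the technique the
# ImageHash toolkit implements. Operates on a 2-D list of grayscale values
# instead of a PIL image (assumption: input is pre-downscaled to 8x8).

def average_hash(pixels):
    """Return a 64-bit hash: bit is 1 where a pixel is >= the mean, else 0."""
    flat = [p for row in pixels for p in row]
    mean = sum(flat) / len(flat)
    bits = 0
    for p in flat:
        bits = (bits << 1) | (1 if p >= mean else 0)
    return bits

def hamming(a, b):
    """Number of differing bits; a small distance suggests a near-duplicate."""
    return bin(a ^ b).count("1")

if __name__ == "__main__":
    # A synthetic gradient image and a slightly perturbed copy.
    img = [[(r * 8 + c) % 256 for c in range(8)] for r in range(8)]
    noisy = [row[:] for row in img]
    noisy[0][0] += 1  # tiny perturbation
    # Near-identical images yield near-identical hashes.
    assert hamming(average_hash(img), average_hash(noisy)) <= 2
```

In practice a pair of images would be flagged as duplicates when the Hamming distance between their hashes falls below a chosen threshold.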
@misc{lee2025breakingvisualshortcutsmultimodal,
title={Breaking the Visual Shortcuts in Multimodal Knowledge-Based Visual Question Answering},
author={Dosung Lee and Sangwon Jung and Boyoung Kim and Minyoung Kim and Sungyeon Kim and Junyoung Sung and Paul Hongsuck Seo},
year={2025},
eprint={2511.22843},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2511.22843},
}
