Korea University · MIIL · KAIST


Breaking the Visual Shortcuts in Multimodal Knowledge-Based Visual Question Answering

arXiv | Project

by Dosung Lee, Sangwon Jung, Boyoung Kim, Minyoung Kim, Sungyeon Kim, Junyoung Sung, Paul Hongsuck Seo

✨ RETINA: Relational Entity Text-Image kNowledge Augmented Benchmark

This repository provides the RETINA benchmark, a novel and large-scale dataset for Multimodal Knowledge-Based Visual Question Answering (MKB-VQA).

RETINA was introduced to overcome a critical limitation in existing MKB-VQA datasets: the "visual shortcut." Models could often succeed by simply matching the query image to the target document's primary subject entity.

🚀 Key Feature: Breaking the Shortcut

RETINA is explicitly designed to eliminate this bias, forcing models to rely on true relational knowledge.

The benchmark's construction process ensures that the query image is of a secondary, related entity mentioned in the document, rather than the main subject.

For instance, given the question "What animal mainly eats this fruit?" with an image of an 🍎Apple, the answer might be found in the document about "🐻Bears", not "🍎Apples".

This setup reflects complex, real-world scenarios where knowledge retrieval must go beyond direct visual matching.
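The shortcut-breaking property described above can be illustrated with a toy sample. Note that the field names below (`query_image_entity`, `document_subject`, etc.) are illustrative assumptions for this sketch, not the actual RETINA schema:

```python
# Illustrative sketch only: field names are assumptions, not the real RETINA schema.
# In RETINA, the entity shown in the query image (a secondary, related entity) must
# differ from the primary subject of the document that answers the question.

def breaks_visual_shortcut(sample: dict) -> bool:
    """True if the query image's entity differs from the document's main subject."""
    return sample["query_image_entity"] != sample["document_subject"]

# Toy sample mirroring the Apple/Bears example from the text.
sample = {
    "question": "What animal mainly eats this fruit?",
    "query_image_entity": "Apple",   # the image shows the secondary entity
    "document_subject": "Bear",      # the answer lives in the Bears document
    "answer": "Bear",
}

print(breaks_visual_shortcut(sample))  # → True: image-to-document matching alone fails
```

A model that only matches the query image to a document's subject entity would retrieve the "Apples" document here and miss the answer, which is exactly the behavior RETINA is built to penalize.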

Examples

[Figure: example RETINA samples]

💾 Dataset Access

TODO List

  • Release the RETINA benchmark.
  • Clean up directory structure and paths.

The RETINA benchmark, including the large automatically generated training set and the human-curated test set, is available for download on Hugging Face:

Access the RETINA Dataset

For EVQA and InfoSeek, including the images and the textual knowledge base, please refer to Weizhe Lin et al.:

Access the M2KR Dataset

For the document images, please refer to Lianghao Deng et al. and images.zip.

Statistics

| Component | Size | Note |
| --- | --- | --- |
| Training Set | 120k samples | Automatically generated via an LLM-driven pipeline. |
| Test Set | 2k samples | Human-curated. |

⚖️ Acknowledgements

We build upon M2KR_Images, M2KR_passages, and lhdeng-gh/MuKA. For image verification we use the ImageHash toolkit. The RETINA dataset is intended for non-commercial research purposes. Users are solely responsible for any and all utilization of the dataset. The creators of this benchmark and their affiliated institutions shall not be held liable for any damages, consequences, or legal issues that may arise from its use.
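The ImageHash toolkit credited above implements perceptual hashes such as average hash (aHash), which make near-duplicate image detection robust to re-encoding. A minimal dependency-free sketch of the idea (the real library first downscales images with PIL; here tiny grayscale grids stand in for already-downscaled images):

```python
# Minimal sketch of average-hash (aHash) image verification, the idea behind
# the ImageHash toolkit. Not the library's API: a pure-Python illustration.

def average_hash(pixels: list[list[int]]) -> list[int]:
    """Hash bit is 1 where a pixel is above the mean brightness, else 0."""
    flat = [p for row in pixels for p in row]
    mean = sum(flat) / len(flat)
    return [1 if p > mean else 0 for p in flat]

def hamming(a: list[int], b: list[int]) -> int:
    """Number of differing hash bits; a small distance suggests the same image."""
    return sum(x != y for x, y in zip(a, b))

img = [[10, 10, 200, 200],
       [10, 10, 200, 200]]
near_dup = [[12, 9, 198, 205],
            [11, 10, 201, 199]]   # same image, slightly re-encoded
other = [[200, 200, 10, 10],
         [200, 200, 10, 10]]      # a different image

print(hamming(average_hash(img), average_hash(near_dup)))  # → 0: verified duplicate
print(hamming(average_hash(img), average_hash(other)))     # → 8: clearly different
```

Because the hash thresholds against the image's own mean brightness, small pixel-level perturbations rarely flip bits, while structurally different images diverge sharply.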

@misc{lee2025breakingvisualshortcutsmultimodal,
      title={Breaking the Visual Shortcuts in Multimodal Knowledge-Based Visual Question Answering}, 
      author={Dosung Lee and Sangwon Jung and Boyoung Kim and Minyoung Kim and Sungyeon Kim and Junyoung Sung and Paul Hongsuck Seo},
      year={2025},
      eprint={2511.22843},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2511.22843}, 
}
