- [2025-09-19] 🎉 KRIS-Bench is accepted at NeurIPS 2025!
KRIS-Bench (Knowledge-based Reasoning in Image-editing Systems Benchmark) is a cognitively-informed diagnostic benchmark for instruction-based image editing. It organizes editing tasks into three foundational knowledge types—Factual, Conceptual, and Procedural—spanning 7 reasoning dimensions and 22 representative tasks with 1,267 high-quality annotated instances.
Beyond Visual Consistency, Visual Quality, and Instruction Following, KRIS-Bench introduces a novel Knowledge Plausibility metric, augmented with knowledge hints and calibrated by human studies. Empirical results across a broad set of state-of-the-art models reveal clear reasoning gaps, underscoring the need for knowledge-centric evaluation to advance intelligent image editing systems.
- Python 3.8+
Set your OpenAI API key as an environment variable before running:

```bash
export OPENAI_API_KEY=your_openai_api_key
```

First, download the benchmark and place it in the ./KRIS_Bench directory. You can fetch the full dataset from the Hugging Face dataset KRIS-Bench. For convenience, we also keep the benchmark in this repository.
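The evaluation scripts read this key from the environment. If you want to fail fast with a clear error instead of hitting an authentication failure mid-run, a small check like the following can help (the `get_api_key` helper is illustrative, not part of the repo):

```python
import os

def get_api_key():
    """Return the OpenAI API key, raising a clear error if it is unset."""
    key = os.environ.get("OPENAI_API_KEY")
    if not key:
        raise RuntimeError("OPENAI_API_KEY is not set; export it before running.")
    return key
```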
Evaluate a list of models across all categories using the main script:
```bash
python metrics_common.py --models doubao gpt gemini
```

Arguments:

- `--models`: Space-separated model names to evaluate.
The script iterates over models × categories, calls GPT-4o (when applicable) for automated judging, and writes:

`results/{model}/{category}/metrics.json`
If you want to run specific task families, use:

```bash
# Viewpoint change with ground-truth image
python metrics_view_change.py --models your_model

# Knowledge plausibility tasks
python metrics_knowldge.py --models your_model

# Multi-element composition
python metrics_multi_element.py --models your_model

# Temporal prediction
python metrics_temporal_prediction.py --models your_model
```

Note: Ensure model-generated images exist under `results/{model}/{category}/` and are named `{image_id}`, which corresponds to the index of the input sample.
Each category produces a `metrics.json` like:

```json
{
  "1": {
    "instruction": "...",
    "explain": "...",
    "consistency_score": 5,
    "consistency_reasoning": "...",
    "instruction_score": 5,
    "instruction_reasoning": "...",
    "quality_score": 4,
    "quality_reasoning": "..."
  },
  "2": {
    ...
  }
}
```

- Ensure `KRIS_Bench/{category}/annotation.json` and original images are present.
- Check that your generated images are correctly named and placed in `results/{model}/{category}/`.
- OpenAI API usage is subject to rate limits and costs. Adjust `max_workers` and batch size as needed.
ByteDance-Seed/Bagel: The Bagel team has evaluated their models on KRIS-Bench and released the evaluation code.
If you find KRIS-Bench helpful, please cite:
```bibtex
@article{wu2025kris,
  title={KRIS-Bench: Benchmarking Next-Level Intelligent Image Editing Models},
  author={Wu, Yongliang and Li, Zonghui and Hu, Xinting and Ye, Xinyu and Zeng, Xianfang and Yu, Gang and Zhu, Wenbo and Schiele, Bernt and Yang, Ming-Hsuan and Yang, Xu},
  journal={arXiv preprint arXiv:2505.16707},
  year={2025}
}
```

For questions or submissions, please open an issue or email [email protected].
