- [2025-09-19] 🎉 KRIS-Bench is accepted at NeurIPS 2025!
KRIS-Bench (Knowledge-based Reasoning in Image-editing Systems Benchmark) is a cognitively-informed diagnostic benchmark for instruction-based image editing. It organizes editing tasks into three foundational knowledge types—Factual, Conceptual, and Procedural—spanning 7 reasoning dimensions and 22 representative tasks with 1,267 high-quality annotated instances.
Beyond Visual Consistency, Visual Quality, and Instruction Following, KRIS-Bench introduces a novel Knowledge Plausibility metric, augmented with knowledge hints and calibrated by human studies. Empirical results across a broad set of state-of-the-art models reveal clear reasoning gaps, underscoring the need for knowledge-centric evaluation to advance intelligent image editing systems.
- Python 3.8+
Set your OpenAI API key as an environment variable before running:

```bash
export OPENAI_API_KEY=your_openai_api_key
```

First, download the benchmark and place it in the ./KRIS_Bench directory. You can fetch the full dataset from the Hugging Face dataset KRIS-Bench. For convenience, we also keep the benchmark in this repository.
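The evaluation scripts read this key from the environment. If you want to fail fast with a clear error instead of hitting an authentication failure mid-run, a small check like the following can help (the `get_api_key` helper is illustrative, not part of the repo):

```python
import os

def get_api_key():
    """Return the OpenAI API key, raising a clear error if it is unset."""
    key = os.environ.get("OPENAI_API_KEY")
    if not key:
        raise RuntimeError("OPENAI_API_KEY is not set; export it before running.")
    return key
```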
Evaluate a list of models across all categories using the main script:
```bash
python metrics_common.py --models doubao gpt gemini
```

Arguments:

- `--models`: Space-separated model names to evaluate.
The script iterates over models × categories, calls GPT-4o (when applicable) for automated judging, and writes:

`results/{model}/{category}/metrics.json`
If you want to run specific task families, use:

```bash
# Viewpoint change with ground-truth image
python metrics_view_change.py --models your_model

# Knowledge plausibility tasks
python metrics_knowldge.py --models your_model

# Multi-element composition
python metrics_multi_element.py --models your_model

# Temporal prediction
python metrics_temporal_prediction.py --models your_model
```

Note: Ensure model-generated images exist under `results/{model}/{category}/` and are named `{image_id}`, which corresponds to the index of the input sample.
Each category produces a `metrics.json` like:

```json
{
  "1": {
    "instruction": "...",
    "explain": "...",
    "consistency_score": 5,
    "consistency_reasoning": "...",
    "instruction_score": 5,
    "instruction_reasoning": "...",
    "quality_score": 4,
    "quality_reasoning": "..."
  },
  "2": {
    ...
  }
}
```

- Ensure `KRIS_Bench/{category}/annotation.json` and original images are present.
- Check that your generated images are correctly named and placed in `results/{model}/{category}/`.
- OpenAI API usage is subject to rate limits and costs. Adjust `max_workers` and batch size as needed.
ByteDance-Seed/Bagel: The Bagel team has evaluated their models on KRIS-Bench and released the evaluation code.
If you find KRIS-Bench helpful, please cite:
```bibtex
@article{wu2025kris,
  title={KRIS-Bench: Benchmarking Next-Level Intelligent Image Editing Models},
  author={Wu, Yongliang and Li, Zonghui and Hu, Xinting and Ye, Xinyu and Zeng, Xianfang and Yu, Gang and Zhu, Wenbo and Schiele, Bernt and Yang, Ming-Hsuan and Yang, Xu},
  journal={arXiv preprint arXiv:2505.16707},
  year={2025}
}
```

For questions or submissions, please open an issue or email [email protected].
