ImageDoctor is a unified evaluation framework for text-to-image (T2I) generation.
It produces both multi-aspect scalar scores (semantic alignment, aesthetics, plausibility, overall) and spatially grounded heatmaps, following a novel “look–think–predict” paradigm inspired by human diagnosis.
Recent advances in text-to-image (T2I) generation have yielded increasingly realistic and instruction-following images.
However, evaluating such results remains challenging — most existing evaluators output a single scalar score, which fails to capture localized flaws or provide interpretable feedback.
ImageDoctor fills this gap by introducing dense, grounded evaluation:
- It scores each image across multiple dimensions,
- Localizes artifacts and misalignments using heatmaps,
- And explains its reasoning step-by-step using grounded image reasoning.
🎯 Multi-Aspect Evaluation
Predicts four interpretable quality dimensions:
Semantic Alignment · Aesthetics · Plausibility · Overall Quality
🗺️ Spatially Grounded Feedback
Generates heatmaps highlighting artifact and misalignment regions, providing fine-grained supervision and interpretability.
🧠 Grounded Image Reasoning
Follows a look–think–predict paradigm:
- Look: Identify potential flaw regions
- Think: Analyze and reason about these regions
- Predict: Produce final scores and diagnostic heatmaps
The model can zoom in on localized regions when reasoning, mimicking human evaluators.
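The three steps above can be sketched as a simple pipeline. This is only a toy illustration: `look`, `think`, and `predict` are hypothetical stand-ins for what the vision-language model does end-to-end in a single pass.

```python
# Toy sketch of the look-think-predict loop. The function names are
# hypothetical stand-ins: the real model performs all three steps
# end-to-end inside one vision-language model.
def evaluate(image, prompt, look, think, predict):
    regions = look(image, prompt)      # Look: find candidate flaw regions
    analysis = think(image, regions)   # Think: reason about each region
    return predict(analysis)           # Predict: scores + diagnostic heatmaps
```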
⚙️ GRPO Fine-Tuning
ImageDoctor is refined through Group Relative Policy Optimization (GRPO) with a grounding reward, improving spatial awareness and preference alignment.
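As a rough illustration of the group-relative objective: rewards are normalized within each sampled group to form advantages. The additive `grounding_weight` term below is an assumption for illustration, not the paper's exact reward formulation.

```python
# Sketch of group-relative advantages as used in GRPO. Combining the
# preference reward with a grounding reward via `grounding_weight` is an
# assumed simplification, not ImageDoctor's exact recipe.
from statistics import mean, pstdev

def grpo_advantages(preference_rewards, grounding_rewards, grounding_weight=0.5):
    """Combine scalar preference rewards with a grounding (heatmap-quality)
    reward, then normalize within the sampled group."""
    combined = [p + grounding_weight * g
                for p, g in zip(preference_rewards, grounding_rewards)]
    mu, sigma = mean(combined), pstdev(combined)
    if sigma == 0:
        # All samples tied: no learning signal from this group.
        return [0.0 for _ in combined]
    return [(r - mu) / sigma for r in combined]
```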
🧩 Versatile Applications
- ✅ Evaluation metric
- ✅ Reward function in RL for T2I models (DenseFlow-GRPO)
- ✅ Verifier for test-time scaling and re-ranking
```bash
git clone https://github.com/EthanG97/ImageDoctor.git
cd ImageDoctor

# Create a new conda environment from environment.yaml
conda env create -f environment.yaml

# Activate it
conda activate imagedoctor
```
For the cold-start stage, we use LLaMA-Factory to perform supervised fine-tuning (SFT).
In Stage 1, we train Qwen2.5-VL on the final answer only, so that the model first learns to capture human preferences.
```bash
llamafactory-cli train training/configs/cold_start_1.yaml
```

To enable Qwen2.5-VL to generate heatmaps, we initialize the model from the Stage 1 checkpoint and attach a heatmap decoder.
Please replace `transformers/models/qwen2_5_vl/modeling_qwen2_5_vl.py` with `modify/transformers/modeling_qwen2_5_vl.py`, and copy the `segment_anything` directory from SAM into the same directory.
To enable supervision on the heatmap prediction, we also provide the corresponding modifications for LLaMA-Factory.
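As a shape-level sketch of what such a decoder head does — an illustrative assumption only, since ImageDoctor's actual decoder builds on SAM rather than this toy linear head:

```python
import numpy as np

# Toy heatmap head (illustrative assumption, not ImageDoctor's SAM-based
# decoder): project each vision token's hidden state to one logit, arrange
# the logits on the token grid, and squash to [0, 1] flaw probabilities.
def heatmap_head(hidden_states, W, grid=(8, 8)):
    # hidden_states: (num_vision_tokens, dim), one token per grid cell
    logits = (hidden_states @ W).reshape(grid)
    return 1.0 / (1.0 + np.exp(-logits))
```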
Then, run the following command to initialize the model:
```bash
python add_token.py --stage1_model /path/to/stage1/model --model_init /path/to/model_init.pt --output /path/to/output
```

In Stage 2, we further optimize the model to generate reasoning in the desired format together with heatmap predictions.
```bash
llamafactory-cli train training/configs/cold_start_2.yaml
```

Our reinforcement fine-tuning stage is built upon VLM-R1. Before running, please modify the dataset path and instruction configuration in `modify/VLM-R1/run_scripts/run_grpo_np_grounding.sh`.
The necessary training data can be found on Google Drive.
Then, run the following command to start reinforcement fine-tuning:
```bash
bash run_grpo_np_grounding.sh
```

To run inference with the released checkpoint:

```bash
python inference.py \
  --checkpoint GYX97/ImageDoctor \
  --image_path ./examples/cat.png \
  --prompt "a close-up photo of a fluffy orange cat with green eyes" \
  --output_dir ./outputs
```

| Role | Description |
|---|---|
| Metric | Evaluate text-image alignment and visual plausibility with interpretable scores. |
| Verifier | Select best image among candidates in test-time scaling setups. |
| Reward Model | Provide dense spatial feedback for RLHF in diffusion/flow models (DenseFlow-GRPO). |
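For the verifier role, best-of-N selection reduces to scoring each candidate and keeping the top one. In the sketch below, `score_image` is a hypothetical callable standing in for a scoring call to the ImageDoctor checkpoint.

```python
# Minimal best-of-N verifier sketch. `score_image` is a hypothetical
# stand-in for scoring a candidate with the ImageDoctor checkpoint.
def best_of_n(candidates, score_image):
    """Return the candidate with the highest overall score."""
    return max(candidates, key=score_image)

def rerank(candidates, score_image):
    """Return candidates sorted best-first, e.g. for top-k re-ranking."""
    return sorted(candidates, key=score_image, reverse=True)
```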
To demonstrate the effectiveness of spatial feedback, we further develop DenseFlow-GRPO based on Flow-GRPO. Specifically, we modify the log-probability computation to enable the transition from image-level feedback to dense spatial feedback.
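Conceptually, the change can be sketched as replacing one scalar reward with per-location heatmap weights. This is a flat-list toy for clarity; the actual modification operates on flow-model log-probabilities.

```python
# Toy contrast between image-level and dense spatial feedback. The real
# DenseFlow-GRPO change acts on flow-model log-probabilities; flat lists
# are used here only to show the weighting difference.
def image_level_objective(logprobs, reward):
    # One scalar reward scales the whole image's log-probability.
    return reward * sum(logprobs)

def dense_objective(logprobs, heatmap):
    # Each spatial location's log-probability is weighted by its own
    # heatmap value, giving dense per-region feedback.
    return sum(h * lp for h, lp in zip(heatmap, logprobs))
```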
If you use ImageDoctor in your research, please cite:
```bibtex
@misc{guo2025imagedoctordiagnosingtexttoimagegeneration,
      title         = {ImageDoctor: Diagnosing Text-to-Image Generation via Grounded Image Reasoning},
      author        = {Yuxiang Guo and Jiang Liu and Ze Wang and Hao Chen and Ximeng Sun and Yang Zhao and Jialian Wu and Xiaodong Yu and Zicheng Liu and Emad Barsoum},
      year          = {2025},
      eprint        = {2510.01010},
      archivePrefix = {arXiv},
      url           = {https://arxiv.org/abs/2510.01010},
}
```

ImageDoctor builds upon:
- Qwen2.5-VL – Vision-Language foundation
- RichHF-18K – Multi-aspect human preference dataset
- Flow-GRPO – Reinforcement Learning base framework
- VLM-R1 – Reinforcement fine-tuning framework for vision-language models
Released under the Apache 2.0 License for research and non-commercial use.
