ImageDoctor is a unified evaluation framework for text-to-image (T2I) generation.
It produces both multi-aspect scalar scores (semantic alignment, aesthetics, plausibility, overall) and spatially grounded heatmaps, following a novel “look–think–predict” paradigm inspired by human diagnosis.
Recent advances in text-to-image (T2I) generation have yielded increasingly realistic and instruction-following images.
However, evaluating such results remains challenging — most existing evaluators output a single scalar score, which fails to capture localized flaws or provide interpretable feedback.
ImageDoctor fills this gap by introducing dense, grounded evaluation:
- It scores each image across multiple dimensions,
- Localizes artifacts and misalignments using heatmaps,
- And explains its reasoning step-by-step using grounded image reasoning.
🎯 Multi-Aspect Evaluation
Predicts four interpretable quality dimensions:
Semantic Alignment · Aesthetics · Plausibility · Overall Quality
🗺️ Spatially Grounded Feedback
Generates heatmaps highlighting artifact and misalignment regions, providing fine-grained supervision and interpretability.
🧠 Grounded Image Reasoning
Follows a look–think–predict paradigm:
- Look: Identify potential flaw regions
- Think: Analyze and reason about these regions
- Predict: Produce final scores and diagnostic heatmaps
The model can zoom in on localized regions when reasoning, mimicking human evaluators.
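The three steps above can be sketched as a simple pipeline. This is only a toy illustration: `look`, `think`, and `predict` are hypothetical stand-ins for what the vision-language model does end-to-end in a single pass.

```python
# Toy sketch of the look-think-predict loop. The function names are
# hypothetical stand-ins: the real model performs all three steps
# end-to-end inside one vision-language model.
def evaluate(image, prompt, look, think, predict):
    regions = look(image, prompt)      # Look: find candidate flaw regions
    analysis = think(image, regions)   # Think: reason about each region
    return predict(analysis)           # Predict: scores + diagnostic heatmaps
```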
⚙️ GRPO Fine-Tuning
ImageDoctor is refined through Group Relative Policy Optimization (GRPO) with a grounding reward, improving spatial awareness and preference alignment.
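As a rough illustration of the group-relative objective: rewards are normalized within each sampled group to form advantages. The additive `grounding_weight` term below is an assumption for illustration, not the paper's exact reward formulation.

```python
# Sketch of group-relative advantages as used in GRPO. Combining the
# preference reward with a grounding reward via `grounding_weight` is an
# assumed simplification, not ImageDoctor's exact recipe.
from statistics import mean, pstdev

def grpo_advantages(preference_rewards, grounding_rewards, grounding_weight=0.5):
    """Combine scalar preference rewards with a grounding (heatmap-quality)
    reward, then normalize within the sampled group."""
    combined = [p + grounding_weight * g
                for p, g in zip(preference_rewards, grounding_rewards)]
    mu, sigma = mean(combined), pstdev(combined)
    if sigma == 0:
        # All samples tied: no learning signal from this group.
        return [0.0 for _ in combined]
    return [(r - mu) / sigma for r in combined]
```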
🧩 Versatile Applications
- ✅ Evaluation metric
- ✅ Reward function in RL for T2I models (DenseFlow-GRPO)
- ✅ Verifier for test-time scaling and re-ranking
```bash
git clone https://github.com/EthanG97/ImageDoctor.git
cd ImageDoctor

# Create a new conda environment from environment.yaml
conda env create -f environment.yaml

# Activate it
conda activate imagedoctor
```
For the cold-start stage, we use LLaMA-Factory to perform supervised fine-tuning (SFT).
In Stage 1, we train Qwen2.5-VL on the final answer only, so that the model first learns to capture human preferences.
```bash
llamafactory-cli train training/configs/cold_start_1.yaml
```

To enable Qwen2.5-VL to generate heatmaps, we initialize the model from the Stage 1 checkpoint and attach a heatmap decoder.
Please replace `transformers/models/qwen2_5_vl/modeling_qwen2_5_vl.py` with `modify/transformers/modeling_qwen2_5_vl.py`, and copy the `segment_anything` directory from SAM into the same directory.
To enable supervision on the heatmap prediction, we also provide the corresponding modifications for LLaMA-Factory.
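As a shape-level sketch of what such a decoder head does — an illustrative assumption only, since ImageDoctor's actual decoder builds on SAM rather than this toy linear head:

```python
import numpy as np

# Toy heatmap head (illustrative assumption, not ImageDoctor's SAM-based
# decoder): project each vision token's hidden state to one logit, arrange
# the logits on the token grid, and squash to [0, 1] flaw probabilities.
def heatmap_head(hidden_states, W, grid=(8, 8)):
    # hidden_states: (num_vision_tokens, dim), one token per grid cell
    logits = (hidden_states @ W).reshape(grid)
    return 1.0 / (1.0 + np.exp(-logits))
```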
Then, run the following command to initialize the model:
```bash
python add_token.py --stage1_model /path/to/stage1/model --model_init /path/to/model_init.pt --output /path/to/output
```

In Stage 2, we further optimize the model to generate reasoning in the desired format together with heatmap predictions.
```bash
llamafactory-cli train training/configs/cold_start_2.yaml
```

Our reinforcement fine-tuning stage is built upon VLM-R1. Before running, please modify the dataset path and instruction configuration in `modify/VLM-R1/run_scripts/run_grpo_np_grounding.sh`.
The necessary training data can be found on Google Drive.
Then, run the following command to start reinforcement fine-tuning:
```bash
bash run_grpo_np_grounding.sh
```

To run inference with the released checkpoint:

```bash
python inference.py \
  --checkpoint GYX97/ImageDoctor \
  --image_path ./examples/cat.png \
  --prompt "a close-up photo of a fluffy orange cat with green eyes" \
  --output_dir ./outputs
```

| Role | Description |
|---|---|
| Metric | Evaluate text-image alignment and visual plausibility with interpretable scores. |
| Verifier | Select best image among candidates in test-time scaling setups. |
| Reward Model | Provide dense spatial feedback for RLHF in diffusion/flow models (DenseFlow-GRPO). |
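For the verifier role, best-of-N selection reduces to scoring each candidate and keeping the top one. In the sketch below, `score_image` is a hypothetical callable standing in for a scoring call to the ImageDoctor checkpoint.

```python
# Minimal best-of-N verifier sketch. `score_image` is a hypothetical
# stand-in for scoring a candidate with the ImageDoctor checkpoint.
def best_of_n(candidates, score_image):
    """Return the candidate with the highest overall score."""
    return max(candidates, key=score_image)

def rerank(candidates, score_image):
    """Return candidates sorted best-first, e.g. for top-k re-ranking."""
    return sorted(candidates, key=score_image, reverse=True)
```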
To demonstrate the effectiveness of spatial feedback, we further develop DenseFlow-GRPO based on Flow-GRPO. Specifically, we modify the log-probability computation to enable the transition from image-level feedback to dense spatial feedback.
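Conceptually, the change can be sketched as replacing one scalar reward with per-location heatmap weights. This is a flat-list toy for clarity; the actual modification operates on flow-model log-probabilities.

```python
# Toy contrast between image-level and dense spatial feedback. The real
# DenseFlow-GRPO change acts on flow-model log-probabilities; flat lists
# are used here only to show the weighting difference.
def image_level_objective(logprobs, reward):
    # One scalar reward scales the whole image's log-probability.
    return reward * sum(logprobs)

def dense_objective(logprobs, heatmap):
    # Each spatial location's log-probability is weighted by its own
    # heatmap value, giving dense per-region feedback.
    return sum(h * lp for h, lp in zip(heatmap, logprobs))
```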
If you use ImageDoctor in your research, please cite:
```bibtex
@misc{guo2025imagedoctordiagnosingtexttoimagegeneration,
      title         = {ImageDoctor: Diagnosing Text-to-Image Generation via Grounded Image Reasoning},
      author        = {Yuxiang Guo and Jiang Liu and Ze Wang and Hao Chen and Ximeng Sun and Yang Zhao and Jialian Wu and Xiaodong Yu and Zicheng Liu and Emad Barsoum},
      year          = {2025},
      eprint        = {2510.01010},
      archivePrefix = {arXiv},
      url           = {https://arxiv.org/abs/2510.01010},
}
```

ImageDoctor builds upon:
- Qwen2.5-VL – Vision-Language foundation
- RichHF-18K – Multi-aspect human preference dataset
- Flow-GRPO – Reinforcement Learning base framework
- VLM-R1 – Reinforcement fine-tuning framework for vision-language models
Released under the Apache 2.0 License for research and non-commercial use.
