
ImageDoctor: Rich Feedback for Text-to-Image Generation through Grounded Image Reasoning

arXiv · Model · Website

ImageDoctor is a unified evaluation framework for text-to-image (T2I) generation.
It produces both multi-aspect scalar scores (semantic alignment, aesthetics, plausibility, overall) and spatially grounded heatmaps, following a novel “look–think–predict” paradigm inspired by human diagnosis.

ImageDoctor Teaser



📘 Overview

Recent advances in text-to-image (T2I) generation have yielded increasingly realistic and instruction-following images.
However, evaluating such results remains challenging: most existing evaluators output a single scalar score, which fails to capture localized flaws or provide interpretable feedback.

ImageDoctor fills this gap by introducing dense, grounded evaluation:

  • Scores each image across multiple quality dimensions,
  • Localizes artifacts and misalignments using heatmaps,
  • And explains its reasoning step by step through grounded image reasoning.

🚀 Key Features

  • 🎯 Multi-Aspect Evaluation
    Predicts four interpretable quality dimensions:
    Semantic Alignment · Aesthetics · Plausibility · Overall Quality

  • 🗺️ Spatially Grounded Feedback
    Generates heatmaps highlighting artifact and misalignment regions, providing fine-grained supervision and interpretability.

  • 🧠 Grounded Image Reasoning
    Follows a look–think–predict paradigm:

    • Look: Identify potential flaw regions
    • Think: Analyze and reason about these regions
    • Predict: Produce final scores and diagnostic heatmaps
      The model can zoom in on localized regions when reasoning, mimicking human evaluators.
  • ⚙️ GRPO Fine-Tuning
    ImageDoctor is refined through Group Relative Policy Optimization (GRPO) with a grounding reward, improving spatial awareness and preference alignment.

  • 🧩 Versatile Applications

    • ✅ Evaluation metric
    • ✅ Reward function in RL for T2I models (DenseFlow-GRPO)
    • ✅ Verifier for test-time scaling and re-ranking
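The group-relative update at the heart of GRPO can be sketched in a few lines: a group of sampled responses is scored, and each reward is normalized against its own group's mean and standard deviation, so the policy is rewarded for beating its siblings rather than hitting an absolute score. This is a minimal illustrative sketch, not the actual training code; the reward values below are made up.

```python
import statistics

def grpo_advantages(rewards, eps=1e-6):
    """Group-relative advantages: normalize each reward against the
    mean and standard deviation of its own group of samples."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Toy group of 4 sampled evaluations, each scored by a combined
# preference + grounding reward (values are illustrative).
advantages = grpo_advantages([0.9, 0.4, 0.7, 0.2])
print([round(a, 2) for a in advantages])
```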

🧱 Environments

git clone https://github.com/EthanG97/ImageDoctor.git
cd ImageDoctor
# Create a new conda environment from environment.yaml
conda env create -f environment.yaml

# Activate it
conda activate imagedoctor

Training

Cold Start

For the cold-start stage, we follow LLaMA-Factory to perform supervised fine-tuning (SFT).

In Stage 1, we train Qwen2.5-VL on the final answers only, so that the model first learns to capture human preferences.

llamafactory-cli train training/configs/cold_start_1.yaml

To enable Qwen2.5-VL to generate heatmaps, we initialize the model from the Stage 1 checkpoint and attach a heatmap decoder.

Please replace transformers/models/qwen2_5_vl/modeling_qwen2_5_vl.py with modify/transformers/modeling_qwen2_5_vl.py, and copy the segment_anything directory from SAM to the same directory.

To enable supervision on the heatmap prediction, we also provide the corresponding modifications for LLaMA-Factory.

Then, run the following command to initialize the model:

python add_token.py --stage1_model /path/to/stage1/model --model_init /path/to/model_init.pt --output /path/to/output

In Stage 2, we further optimize the model to generate reasoning in the desired format together with heatmap predictions.

llamafactory-cli train training/configs/cold_start_2.yaml

Reinforcement Fine-Tuning

Our reinforcement fine-tuning stage is built upon VLM-R1. Before running, please modify the dataset path and instruction configuration in modify/VLM-R1/run_scripts/run_grpo_np_grounding.sh.

The data needed for training can be found on Google Drive.

Then, run the following command to start reinforcement fine-tuning:

bash run_grpo_np_grounding.sh

🧠 Inference

python inference.py \
  --checkpoint GYX97/ImageDoctor \
  --image_path ./examples/cat.png \
  --prompt "a close-up photo of a fluffy orange cat with green eyes" \
  --output_dir ./outputs
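If you want to consume the scores programmatically, the model's textual response can be parsed for the four aspect scores. The response string and score format below are purely hypothetical (the released checkpoint's exact output format may differ); this is only a sketch of the parsing step.

```python
import re

# Hypothetical look-think-predict response; the actual output format
# of the released checkpoint may differ (illustration only).
response = (
    "<look>blurry region around the cat's left ear</look>"
    "<think>the artifact lowers plausibility; alignment is good</think>"
    "Semantic Alignment: 4.5, Aesthetics: 4.0, "
    "Plausibility: 3.5, Overall: 4.0"
)

def parse_scores(text):
    """Extract the four named aspect scores from a response string."""
    pattern = r"(Semantic Alignment|Aesthetics|Plausibility|Overall):\s*([0-9.]+)"
    return {name: float(val) for name, val in re.findall(pattern, text)}

print(parse_scores(response))
```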

🧩 Applications

Role           Description
Metric         Evaluate text-image alignment and visual plausibility with interpretable scores.
Verifier       Select the best image among candidates in test-time scaling setups.
Reward Model   Provide dense spatial feedback for RLHF in diffusion/flow models (DenseFlow-GRPO).
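The verifier role amounts to best-of-N re-ranking: score every candidate image and keep the highest-scoring one. A minimal sketch, where `score_fn` is a stand-in for running ImageDoctor and reading its Overall score (the file names and scores below are made up):

```python
def best_of_n(candidates, score_fn):
    """Verifier-style re-ranking for test-time scaling:
    return the candidate with the highest score."""
    return max(candidates, key=score_fn)

# Stand-in scorer: in practice this would invoke ImageDoctor on each
# candidate and use its 'Overall' score (values are illustrative).
toy_scores = {"img_a.png": 3.2, "img_b.png": 4.6, "img_c.png": 4.1}
winner = best_of_n(list(toy_scores), toy_scores.get)
print(winner)  # img_b.png
```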

DenseFlow-GRPO

To demonstrate the effectiveness of spatial feedback, we further develop DenseFlow-GRPO based on Flow-GRPO. Specifically, we modify the log-probability computation to enable the transition from image-level feedback to dense spatial feedback.
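One way to read the move from image-level to dense spatial feedback is that, instead of scaling the whole sequence log-probability by a single scalar reward, each spatial location's log-probability is weighted by its own value from a heatmap-derived reward map. The sketch below illustrates that idea only; it is not the Flow-GRPO code, and the 2x2 maps are made up.

```python
def scalar_objective(logprobs, reward):
    """Image-level feedback: one scalar scales the summed log-prob."""
    return reward * sum(sum(row) for row in logprobs)

def dense_objective(logprobs, reward_map):
    """Dense spatial feedback: each location's log-prob is weighted
    by its own reward from the heatmap-derived map."""
    return sum(
        r * lp
        for lp_row, r_row in zip(logprobs, reward_map)
        for lp, r in zip(lp_row, r_row)
    )

# Toy 2x2 per-location log-probs and reward map
logprobs = [[-0.1, -0.5], [-0.3, -0.2]]
reward_map = [[1.0, 0.2], [0.8, 1.0]]  # low reward where flaws are localized
print(dense_objective(logprobs, reward_map))
```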

🧾 Citation

If you use ImageDoctor in your research, please cite:

@misc{guo2025imagedoctordiagnosingtexttoimagegeneration,
  author        = {Yuxiang Guo and Jiang Liu and Ze Wang and Hao Chen and Ximeng Sun and Yang Zhao and Jialian Wu and Xiaodong Yu and Zicheng Liu and Emad Barsoum},
  title         = {ImageDoctor: Diagnosing Text-to-Image Generation via Grounded Image Reasoning},
  eprint        = {2510.01010},
  archivePrefix = {arXiv},
  year          = {2025},
  url           = {https://arxiv.org/abs/2510.01010},
}

🙏 Acknowledgement

ImageDoctor builds upon:

  • Qwen2.5-VL – Vision-Language foundation
  • RichHF-18K – Multi-aspect human preference dataset
  • Flow-GRPO – Reinforcement Learning base framework
  • VLM-R1 – Reinforcement fine-tuning framework for vision-language models

📄 License

Released under the Apache 2.0 License for research and non-commercial use.

About

The official implementation for "ImageDoctor: Diagnosing Text-to-Image Generation via Grounded Image Reasoning"
