Official Implementation | ICCV 2025
This repository provides the official implementation of ICT and HP reward models from our paper "Enhancing Reward Models for High-quality Image Generation: Beyond Text-Image Alignment".
Current reward models exhibit an inherent bias: they inappropriately assign low scores to images with rich details and high aesthetic value, creating a significant discrepancy with actual human aesthetic preferences. Our dual-component framework addresses these limitations through:
- **ICT (Image-Contained-Text) Reward Model**: a novel training objective that quantifies how well images contain textual information, without penalizing visual richness
- **HP (High-Preference) Reward Model**: pure image-modality assessment of aesthetic quality and human preference
We provide pre-trained reward models and the Pick-High dataset to facilitate research and enable reproducible results.
The Image-Contained-Text (ICT) Reward Model employs a novel contrastive learning objective with hierarchical prompt structures to assess text-image alignment quality. By learning from both basic and refined prompt-image pairs, ICT mitigates the bias against visually rich content that plagues existing alignment metrics.
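As a rough illustration of the idea (not the paper's exact objective), an ICT-style score can be thought of as asking how much of a prompt's information an image embedding "contains", so that detail beyond the basic prompt is not punished. The embeddings and the `ict_style_score` helper below are hypothetical stand-ins for CLIP-style features:

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def ict_style_score(img_emb, basic_emb, refined_emb):
    """Hypothetical sketch: score the image against both the basic and the
    refined prompt embedding and keep the better match, so an image that
    fits the detail-rich refined prompt is not penalized for its extra
    detail. The actual ICT objective in the paper is more involved."""
    return max(cosine(img_emb, basic_emb), cosine(img_emb, refined_emb))

rng = np.random.default_rng(0)
img, basic, refined = rng.normal(size=(3, 8))  # toy 8-dim embeddings
print(round(ict_style_score(img, basic, refined), 4))
```

The `max` over the two prompt granularities is the illustrative point: an image is judged by its best-matching description rather than by its distance from the shortest one.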
The High-Preference (HP) Reward Model leverages a fine-tuned CLIP backbone coupled with a specialized Multi-Layer Perceptron to predict human aesthetic preferences from pure visual modality. Trained with margin ranking loss on preference triplets, HP provides orthogonal quality assessment that captures aesthetic nuances beyond semantic alignment.
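Since HP is trained with a margin ranking loss on preference triplets, the loss term for one preferred/non-preferred pair can be sketched as follows (a minimal illustration; the margin value and toy scores are assumptions, not the paper's training code):

```python
import numpy as np

def margin_ranking_loss(score_pref, score_nonpref, margin=1.0):
    """Standard margin ranking loss (cf. torch.nn.MarginRankingLoss):
    zero once the preferred score beats the non-preferred one by at
    least `margin`, with a linear penalty otherwise."""
    return float(np.maximum(0.0, margin - (score_pref - score_nonpref)))

# Toy HP scores for one preference pair (values are made up)
print(margin_ranking_loss(2.5, 1.0))  # well separated -> 0.0
print(margin_ranking_loss(1.2, 1.0))  # gap of 0.2 < margin -> 0.8
```

Because the loss only cares about score gaps, not absolute values, the MLP head is free to calibrate its output range however training finds convenient.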
Pick-High is a high-quality dataset of 360,000 images generated by SD3.5-Large from refined prompts produced via Claude-3.5-Sonnet chain-of-thought reasoning. Combined with Pick-a-Pic, it forms image triplets with comprehensive preference annotations for training and evaluating reward models.
- Python 3.8+
- PyTorch 1.12+
- CUDA-compatible GPU (8GB+ VRAM recommended)
- 16GB+ RAM for training
git clone https://github.com/BarretBa/ICTHP.git
cd ICTHP
pip install -r requirements.txt

Download ICT and HP model weights:
# Download ICT model weights
git clone https://huggingface.co/8y/ICT ./ICTHP_models/ICT
# Download HP model weights
git clone https://huggingface.co/8y/HP ./ICTHP_models/HP

Evaluate the models on sample images:
# Execute evaluation pipeline with demo image sets
./eval.sh

Follow these steps to train ICT and HP reward models from scratch:
# Method 1: Using huggingface-cli (Recommended)
pip install "huggingface_hub[cli]"
huggingface-cli download 8y/Pick-High-Dataset --repo-type dataset
# Method 2: Using Git LFS
git lfs install
git clone https://huggingface.co/datasets/8y/Pick-High-Dataset
# Method 3: Using datasets library
pip install datasets
python -c "from datasets import load_dataset; load_dataset('8y/Pick-High-Dataset')"

Update trainer/conf/experiment/train_ict.yaml and trainer/conf/experiment/train_hp.yaml:
dataset:
  dataset_name: ./Pick-High-Dataset/Pick-High/
  easy_folder: ./Pick-High-Dataset/pick_easy_img
  refine_folder: ./Pick-High-Dataset/pick_refine_img

Train the ICT (Image-Contained-Text) model:
# Multi-GPU training
./train_ict.sh

Train the HP (High-Preference) model:
# Multi-GPU training
./train_hp.sh

To explore the dataset structure, use our provided data loading script:
# Run the data loading example
cd trainer/datasets
python pick-high.py

Dataset Structure:
Pick-High-Dataset/
├── Pick-High/
│   ├── train.pkl          # Training annotations
│   ├── val.pkl            # Validation annotations
│   └── test.pkl           # Test annotations
├── pick_easy_img/         # Basic quality images
│   └── train/val/test/
└── pick_refine_img/       # High-quality refined images
    └── train/val/test/
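To peek at the annotation files without running the full pipeline, the .pkl files can be opened with pickle. The record layout below is a made-up stand-in (the real schema is whatever trainer/datasets/pick-high.py reads), so treat the keys as hypothetical:

```python
import os
import pickle
import tempfile

# Hypothetical triplet record; the real fields in train.pkl may differ.
toy_annotations = [{
    "prompt": "a cat sitting on a windowsill",
    "easy_image": "train/000001.jpg",    # relative to pick_easy_img/
    "refine_image": "train/000001.jpg",  # relative to pick_refine_img/
}]

# Round-trip through a temp file, mimicking Pick-High/train.pkl
path = os.path.join(tempfile.mkdtemp(), "train.pkl")
with open(path, "wb") as f:
    pickle.dump(toy_annotations, f)
with open(path, "rb") as f:
    annotations = pickle.load(f)

print(len(annotations), sorted(annotations[0]))
```

Swapping `path` for the real `Pick-High-Dataset/Pick-High/train.pkl` and inspecting the loaded object is a quick way to confirm the actual field names before training.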
If you find this work helpful for your research or use our ICT (Image-Contained-Text) reward model, HP (High-Preference) reward model, or Pick-High dataset, we would appreciate it if you could cite our paper:
@misc{ba2025enhancingrewardmodelshighquality,
  title={Enhancing Reward Models for High-quality Image Generation: Beyond Text-Image Alignment},
  author={Ying Ba and Tianyu Zhang and Yalong Bai and Wenyi Mo and Tao Liang and Bing Su and Ji-Rong Wen},
  year={2025},
  eprint={2507.19002},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2507.19002},
}

This project is licensed under the MIT License - see the LICENSE file for details.
- 🤗 ICT Model: 8y/ICT - Text-Image Alignment Model
- 🤗 HP Model: 8y/HP - Aesthetic Quality Model
- 🤗 Pick-High Dataset: 8y/Pick-High-Dataset - High-Quality Dataset
- 📄 Paper: Enhancing Reward Models for High-quality Image Generation: Beyond Text-Image Alignment
- 🔗 Base Project: PickScore - Pick-a-Pic Dataset and PickScore Model
- Pick-a-Pic Dataset: Foundation for our Pick-High dataset
- OpenAI CLIP: Base architecture for our reward models
- Hugging Face: Transformers and Accelerate libraries
- Community: Contributors and researchers in the field
Star ⭐ this repository to stay updated with the latest developments!