Leigang Qu, Haochuan Li, Wenjie Wang, Xiang Liu, Juncheng Li, Liqiang Nie, and Tat-Seng Chua
This repository contains the code and links for SILMM, a method for compositional text-to-image (T2I) generation that improves text-image alignment. In this work, we introduce a model-agnostic iterative self-improvement framework that enables large multimodal models (LMMs) to provide helpful and scalable self-feedback and to optimize text-image alignment via Direct Preference Optimization (DPO). To adapt SILMM to LMMs with continuous features, we propose a diversity mechanism to obtain diverse representations and a kernel-based continuous DPO (KC-DPO) for alignment. Extensive experiments on three compositional text-to-image generation benchmarks validate the effectiveness and superiority of SILMM, showing improvements exceeding 30% on T2I-CompBench++ and around 20% on DPG-Bench.
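For readers unfamiliar with DPO, the following is a minimal sketch of the standard DPO loss for a single preference pair, written in plain Python. It is not taken from this repository's training code, and it does not cover KC-DPO, whose kernel-based formulation is described in the paper. The function name `dpo_loss`, its argument names, and the default `beta=0.1` are illustrative assumptions.

```python
import math


def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair.

    All inputs are sequence log-probabilities. `beta` is an assumed
    default, not a value taken from the SILMM paper or code.
    """
    # Implicit reward margin: how much more the policy prefers the chosen
    # sample over the rejected one, relative to the frozen reference model.
    margin = beta * (
        (policy_logp_chosen - ref_logp_chosen)
        - (policy_logp_rejected - ref_logp_rejected)
    )
    # -log(sigmoid(margin)), computed in a numerically stable way.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

When the policy and the reference agree, the margin is zero and the loss equals log 2; the loss falls below that value as the policy shifts probability toward the preferred sample.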
Schematic illustration of SILMM, comprising five steps: 1) LMMs generate compositional prompts by sampling based on provided instructions. 2) Diverse representations and images are generated using either discrete nucleus sampling or the proposed continuous DivDrop. 3) LMMs divide each compositional prompt into semantic units and generate questions for each unit. 4) VQA is conducted to answer these questions, with the answers and likelihoods aggregated into alignment scores as self-feedback. 5) For alignment tuning, DPO is applied for discrete LMMs, while the proposed KC-DPO is used for continuous LMMs.
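Steps 4 and 5 above can be illustrated with a small sketch of how per-question VQA results might be turned into an alignment score and then into a preference pair. The aggregation rule used here (averaging the probability of a "yes" answer across questions) and the best-versus-worst pairing are assumptions made for illustration; the exact scoring and pairing used by SILMM are defined in the paper and the released code. The names `alignment_score` and `build_preference_pair` are hypothetical.

```python
from statistics import mean


def alignment_score(vqa_results):
    """Aggregate VQA outputs for one image into a score in [0, 1].

    `vqa_results` is a list of (answer, likelihood) tuples, where `answer`
    is "yes" or "no" and `likelihood` is the model's probability for that
    answer. Averaging P("yes") is an assumed aggregation rule.
    """
    per_question = [p if ans == "yes" else 1.0 - p for ans, p in vqa_results]
    return mean(per_question)


def build_preference_pair(candidates):
    """Pick the highest- and lowest-scoring images for one prompt.

    `candidates` maps an image id to its VQA results. Returns
    (chosen_id, rejected_id, scores). Best-vs-worst pairing is an
    assumption, not necessarily the strategy used by SILMM.
    """
    scores = {image_id: alignment_score(r) for image_id, r in candidates.items()}
    chosen = max(scores, key=scores.get)
    rejected = min(scores, key=scores.get)
    return chosen, rejected, scores


if __name__ == "__main__":
    candidates = {
        "img_a": [("yes", 0.9), ("yes", 0.8)],
        "img_b": [("no", 0.7), ("yes", 0.6)],
    }
    chosen, rejected, scores = build_preference_pair(candidates)
    print(chosen, rejected, scores)
```

The resulting (chosen, rejected) pair is the kind of input a preference-optimization objective such as DPO consumes.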
- Release the training code.
- Release the inference code for compositional text-to-image synthesis.
- Release the paper of SILMM on arXiv.
If you find our work useful in your research, please consider citing SILMM:
@article{qu2024silmm,
title={SILMM: Self-Improving Large Multimodal Models for Compositional Text-to-Image Generation},
author={Qu, Leigang and Li, Haochuan and Wang, Wenjie and Liu, Xiang and Li, Juncheng and Nie, Liqiang and Chua, Tat-Seng},
journal={arXiv preprint arXiv:2412.05818},
year={2024}
}

We thank the authors of SEED-LLaMA and DreamLLM for making their code available.
