Mengping Yang (杨孟平)

Email     Google Scholar   Github   Zhihu   CV   Blogs

I am currently a researcher at SAIS-FUXI and a Postdoctoral Researcher at Fudan University, focusing on the research and application of multi-modal generative modeling. Before that, I obtained my Ph.D. degree with dozens of honors (Sep. 2019 - Jun. 2024) from East China University of Science and Technology (ECUST). My research interests mainly include multi-modal learning/AIGC, e.g., content generation of 2D images and videos, using Generative Adversarial Networks, diffusion models, and auto-regressive models.

I am always open to long-term collaboration/full-time job opportunities, working on fundamental research and application problems of multi-modal learning and generative models. Here is my CV (last updated: May 2025); feel free to reach out about any potential opportunities!

Looking for self-motivated research interns on visual generation; email me ([email protected]) if interested.

News
  • [02/2026] One paper about representation learning for diffusion transformers got accepted by CVPR, thanks to my co-authors.
  • [01/2026] We released Omni-Video 2, a flexible framework for unified video modeling.
  • [01/2026] Two papers about video generation post-training and unified CoT got accepted by ICLR, congrats to co-authors.
  • [06/2025] One paper got accepted by ICCV, congrats to co-authors.
  • [06/2025] I was recognized as one of the Outstanding Reviewers for CVPR 2025 (711/12593).
  • [03/2025] Check out our video captioning work Cockatiel, which ranks first on the VDC leaderboard.
  • [01/2025] Finally, our survey paper on image synthesis under limited data got accepted by IJCV.
  • [12/2024] One collaborative paper got accepted by TNNLS, congrats to co-authors.
  • [10/2024] I recently began writing personal blogs about multi-modal learning and visual generation; check out the details Here, and some random thoughts at tinymind.
  • [07/2024] One paper about evaluating text-to-image diffusion models released to arXiv.
  • [07/2024] One paper about LLM-driven text-to-image diffusion models got accepted by ECCV-2024.
  • [06/2024] Honored to express our gratitude to ECUST on behalf of all graduates (News).
  • [05/2024] Finally completed my Ph.D. degree at ECUST (got all As in blind review and an average of 92.8 in the thesis defence), and was named an Outstanding Graduate of Shanghai.
  • [03/2024] Three collaborative papers got accepted by CVPR 2024 (Oral, 3%), PR, and EAAI, respectively; congrats to co-authors.
  • [01/2024] Honored to win the grand prize of the President's Scholarship (one student per year).
  • [10/2023] One collaborative paper got accepted by KBS, congrats to co-authors.
  • [09/2023] One paper got accepted by the NeurIPS Datasets and Benchmarks Track. Many thanks to my collaborators!
  • [09/2023] One paper got accepted by EAAI.
  • [08/2023] One survey paper on image synthesis under limited data released to arXiv.
  • [07/2023] Two papers got accepted by ACM Multimedia 2023.
  • [04/2023] One paper on evaluating synthesis quality released to arXiv.
  • [03/2023] One paper got accepted by Information Sciences.
  • [11/2022] I was honored to present our blessings at ECUST's 70th anniversary celebration. Happy birthday!
  • [10/2022] One paper got accepted by EAAI.
  • [09/2022] One paper got accepted by NeurIPS 2022.
  • [07/2022] One paper got accepted by ECCV 2022. This is my first first-authored top-tier conference paper!
  • [05/2022] One paper got accepted by IJCAI 2022.
Publications
* Corresponding author, † Equal contributions.
DiverseDiT: Towards Diverse Representation Learning in Diffusion Transformers
Mengping Yang, Zhiyu Tan, Binglei Li, Xiaomeng Yang, Hesen Chen, Hao Li*
CVPR 2026,
[PDF] [BibTeX] [Code]

We propose DiverseDiT, a novel and efficient framework explicitly designed to enhance representation diversity, using long residual connections to diversify inputs and a representation diversity loss to encourage distinct features across blocks, without relying on external guidance.

Unraveling MMDiT Blocks: Training-free Analysis and Enhancement of Text-conditioned Diffusion
Binglei Li, Mengping Yang (Project Lead), Zhiyu Tan, Junping Zhang*, Hao Li*
arXiv Preprint 2026,
[PDF] [BibTeX]

We systematically analyze block-wise contributions and their interactions with text conditions, offering a better understanding of the internal mechanisms within MMDiT-based generative models. Meanwhile, our analysis reveals several valuable findings that unlock new possibilities for improving synthesis quality. Based on these findings, we propose training-free techniques for improved text alignment, precise semantic editing, and accelerated inference.

Diff-Aid: Inference-time Adaptive Interaction Denoising for Rectified Text-to-Image Generation
Binglei Li, Mengping Yang (Project Lead), Zhiyu Tan, Junping Zhang*, Hao Li*
arXiv Preprint 2026,
[PDF] [BibTeX]

We present Diff-Aid, a plug-in module for rectified text-to-image diffusion transformers that learns to adaptively adjust block-wise interactions with token-level textual conditions across denoising trajectories.

Omni-Video 2: Scaling MLLM-Conditioned Diffusion for Unified Video Generation and Editing
Hao Yang, Zhiyu Tan, Jia Gong, Luozheng Qin, Hesen Chen, Xiaomeng Yang, Yuqing Sun, Yuetan Lin, Mengping Yang*, Hao Li*
arXiv Preprint 2026,
[PDF] [BibTeX] [Code]

We present Omni-Video 2, a scalable and computationally efficient model that connects pretrained multimodal large-language models (MLLMs) with video diffusion models for unified video generation and editing. Our key idea is to exploit the understanding and reasoning capabilities of MLLMs to produce explicit target captions to interpret user instructions.

Omni-Video: Democratizing Unified Video Understanding and Generation
Zhiyu Tan, Hao Yang, Luozheng Qin, Jia Gong, Mengping Yang*, Hao Li*
arXiv Preprint 2026,
[PDF] [BibTeX] [Code]

We present Omni-Video, an efficient and effective unified framework for video understanding, generation, as well as instruction-based editing. Our key insight is to teach existing multimodal large language models (MLLMs) to produce continuous visual clues that are used as the input of diffusion decoders, which produce high-quality videos conditioned on these visual clues.

Uni-CoT: Towards Unified Chain-of-Thought Reasoning Across Text and Vision
Luozheng Qin, Jia Gong, Yuqing Sun, Tianjiao Li, Haoyu Pan, Mengping Yang, Xiaomeng Yang, Chao Qu, Zhiyu Tan*, Hao Li*
ICLR 2026,
[PDF] [BibTeX] [Code]

We introduce Uni-CoT, a unified reasoning framework that extends CoT principles to the multimodal domain, empowering Multimodal Large Language Models (MLLMs) to perform interpretable, step-by-step reasoning across both text and vision.

Dual-IPO: Dual-Iterative Preference Optimization for Text-to-Video Generation
Xiaomeng Yang, Mengping Yang, Jia Gong, Luozheng Qin, Zhiyu Tan*, Hao Li*
ICLR 2026,
[PDF] [BibTeX] [Project & Code]

We introduce Dual-Iterative Preference Optimization (Dual-IPO), an iterative paradigm that sequentially optimizes both the reward model and the video generation model for improved synthesis quality and human preference alignment. For the reward model, our framework ensures reliable and robust reward signals via CoT-guided reasoning, voting-based self-consistency, and preference certainty estimation.

FreeCus: Free Lunch Subject-driven Customization in Diffusion Transformers
Yanbing Zhang, Zhe Wang, Qin Zhou*, Mengping Yang (Project Lead)
ICCV 2025,
[PDF] [BibTeX] [Code]

We propose FreeCus, a training-free framework for zero-shot subject-driven image generation using diffusion transformers (DiTs). The method introduces three innovations: (1) pivotal attention sharing (PAS) to preserve subject layout while maintaining editability, (2) adjusted noise shifting (ANS) to enhance fine-grained detail extraction, and (3) semantic feature compensation (SFC) via multimodal LLMs to address missing attributes.

Cockatiel: Ensembling Synthetic and Human Preferenced Training for Detailed Video Caption
Luozheng Qin, Zhiyu Tan, Mengping Yang (Co-led the work with ZY), Xiaomeng Yang, Hao Li*
arXiv 2025,
[PDF] [Project & Code] [BibTeX]

We develop Cockatiel, a three-stage training pipeline that assembles the advantages of multiple video detailed captioning (VDC) models and human preferences by conducting ensembled synthetic and human-preference training on VDC models, based on our caption quality scorer.

An Empirical Study and Analysis of Text-to-Image Generation Using Large Language Model-Powered Textual Representation
Zhiyu Tan, Mengping Yang, Luozheng Qin, Hao Yang, Ye Qian, Qiang Zhou, Cheng Zhang, Hao Li*
ECCV 2024,
[PDF] [Project] [BibTeX]

We propose an effective approach for incorporating LLMs into text-to-image diffusion models, improving LLMs' awareness of the CLIP visual and textual space and thus facilitating more expressive language understanding. Moreover, we devise an efficient three-stage training pipeline that accomplishes fast adaptation of LLM textual features with a small amount of resources, serving as a strong baseline for integrating LLMs into diffusion models and paving the way for this important topic.

Attention Calibration for Disentangled Text-to-Image Personalization
Yanbing Zhang, Mengping Yang (Project Lead), Qin Zhou, Zhe Wang*
CVPR 2024 (Oral Presentation),
[PDF] [BibTeX]

We propose an attention calibration mechanism to improve the concept-level understanding of the T2I model. Specifically, we first introduce new learnable modifiers bound with classes to capture attributes of multiple concepts. Then, the classes are separated and strengthened following the activation of the cross-attention operation, ensuring comprehensive and self-contained concepts. Additionally, we suppress the attention activation of different classes to mitigate mutual influence among concepts.

Revisiting the Evaluation of Image Synthesis with GANs
Mengping Yang, Ceyuan Yang, Yichi Zhang, Qingyan Bai, Yujun Shen, Bo Dai
NeurIPS Datasets and Benchmarks 2023,
[PDF] [BibTeX]

We conduct in-depth analyses of how to represent a data point in the feature space, how to calculate a fair distance using selected samples, and how many instances to use from each set. Together with these analyses, we build a comprehensive system for synthesis comparison, providing reliable and consistent rankings for unsupervised image generation models, including GANs and diffusion models.

Image Synthesis under Limited Data: A Survey and Taxonomy
Mengping Yang, Zhe Wang*
IJCV 2025,
[PDF] [Project] [BibTeX]

We provide a comprehensive survey on image synthesis under limited data, covering data-efficient generative modeling, few-shot generative adaptation, and few-shot and one-shot image synthesis.

Improving Few-shot Image Generation by Structural Discrimination and Textural Modulation
Mengping Yang, Zhe Wang*, Wenyi Feng, Qian Zhang, Ting Xiao
ACM MM 2023,
[PDF] [Project] [BibTeX]

We propose textural modulation (TexMod) and a structural discriminator (StructD) to improve the performance of few-shot image generation.

Semantic-Aware Generator and Low-level Feature Augmentation for Few-shot Image Generation
Zhe Wang*, Jiaoyan Guan, Mengping Yang (Project Lead), Ting Xiao, Ziqiu Chi
ACM MM 2023,
[PDF] [BibTeX]

We propose a semantic-aware generator (SAG) and low-level feature augmentation (LFA) to improve the performance of few-shot image generation.

ProtoGAN: Towards high diversity and fidelity image synthesis under limited data
Mengping Yang, Zhe Wang, Ziqiu Chi, Wenli Du
Information Sciences 2023,
[PDF] [BibTeX]

We propose ProtoGAN, a GAN that incorporates a metric-learning-based prototype mechanism into adversarial learning by aligning the prototypes and features of the synthesized and real distributions.

DFSGAN: Introducing editable and representative attributes for few-shot image generation
Mengping Yang, Saisai Niu, Zhe Wang, Dongdong Li, Wenli Du
EAAI 2023,
[PDF] [BibTeX]

We propose DFSGAN for few-shot image generation, which takes dynamic Gaussian mixture (DGM) latent codes as the generator's input.

FreGAN: Exploiting Frequency Components for Training GANs under Limited Data
Mengping Yang, Zhe Wang, Ziqiu Chi, Yanbing Zhang
NeurIPS 2022,
[PDF] [Project] [BibTeX]

We propose a frequency-aware model for training GANs under limited data, facilitating high-quality few-shot image synthesis.

WaveGAN: Frequency-Aware GAN for High-Fidelity Few-Shot Image Generation
Mengping Yang, Zhe Wang, Ziqiu Chi, Yanbing Zhang
ECCV 2022,
[PDF] [Project] [BibTeX]

We propose a frequency-aware model for few-shot image generation, enabling high-fidelity synthesis for downstream tasks.

Better Embedding and More Shots for Few-shot Learning
Ziqiu Chi, Zhe Wang, Mengping Yang, Wei Guo, Xinlei Xu
IJCAI 2022,
[PDF]

We develop Better Embedding and More Shots to address the distorted embedding of target data in few-shot learning.

Experiences
Research intern on generative models
Mentors: Dr. Ceyuan Yang and Dr. Bo Dai
Working on fundamental research problems and potential applications of deep generative models, mainly GANs and Diffusion Models.
Published one paper about evaluating generative models at NeurIPS D&B 2023, and presented a text-to-video generation demo at WAIC.
2022.07 —— 2023.07
Research intern on large-scale generative models
Mentor: Prof. Hao Li
Training and evaluating large-scale text-to-image/video diffusion models from scratch.
Published one paper about large language model-powered T2I diffusion models at ECCV 2024, and one paper about fine-tuning multimodal models for evaluating T2I models with human alignment (arXiv).
2023.11 —— 2024.05
Algorithm Researcher on Multimodal Learning
Developing multimodal models to reliably identify real/fake goods.
2024.07 —— 2025.06
Researcher on Multi-Modal Generative Modeling
Conducting research on fundamental problems and potential applications of multi-modal generative modeling, including text-to-image, text-to-video, and multimodal generation.
2025.06 —— Present (Internship available)
Selected Honors & Awards
  • Outstanding Graduates of Shanghai,
    2024
  • Grand prize of president's scholarship (校长奖学金特等奖, one student per year),
    2023
  • First-Class Scholarship for Graduate Students,
    2019-2024
  • Shanghai Sparkling Youth (上海市闪光青年),
    2022
  • Jiangxi Building Material Scholarship,
    2021
  • Suzhou Industrial Park Scholarship,
    2022
  • Chinese University Student of the Year (中国大学生年度人物),
    2020
  • Shanghai University Student of the Year (上海市大学生年度人物),
    2020
  • ECUST University Student of the Year (校大学生年度人物),
    2019
  • Outstanding Student,
    2016-2024
  • Second Prize in the Mathematics Competition for Chinese Graduate Students,
    2020
Professional activities
  • Conference Reviewer for CVPR (2023, 2024, 2025), NeurIPS (2023, 2024, 2025), IJCAI (2022, 2023), ACMMM (2023, 2024, 2025), ICCV (2025)
  • Journal Reviewer for TPAMI, TCSVT, PR, SI & VP
Blogs
Misc
  • I like reading books (mostly Social Sciences and Philosophy) and watching movies (mainly Sci-Fi and wuxia martial-arts films) in my free time.
  • I used to play basketball a lot (once a week at present); Kobe Bryant is my favorite player, always the GOAT in my heart.

This page has been accessed several times since Feb 2022.


The website template was adapted from Jon Barron.