Mengping Yang (杨孟平)

Email     Google Scholar   Github   Zhihu   CV   Blogs

I am currently a researcher at SAIS-FUXI and a Postdoctoral Researcher at Fudan University, focusing on the research and application of multi-modal generative modeling. Before that, I obtained my Ph.D. degree with dozens of honors (Sep. 2019 - Jun. 2024) from East China University of Science and Technology (ECUST). My research interests mainly include multi-modal learning/AIGC, e.g., content generation of 2D images and videos, using Generative Adversarial Networks, diffusion models, and auto-regressive models.

I am always open to long-term collaboration/full-time job opportunities, working on fundamental research and application problems of multi-modal learning and generative models. Here is my CV (last updated: May 2025); feel free to reach out about any potential opportunities!

Looking for self-motivated research interns on visual generation; email me ([email protected]) if interested.

News
  • [02/2026] One paper about representation learning for diffusion transformers got accepted by CVPR, thanks to my co-authors.
  • [01/2026] We released Omni-Video 2, a flexible framework for unified video modeling.
  • [01/2026] Two papers about video generation post-training and unified CoT got accepted by ICLR, congrats to co-authors.
  • [06/2025] One paper got accepted by ICCV, congrats to co-authors.
  • [06/2025] I was recognized as one of the Outstanding Reviewers for CVPR 2025 (711/12593).
  • [03/2025] Check out our video captioning work Cockatiel, which ranks first on the VDC leaderboard.
  • [01/2025] Finally, our survey paper on image synthesis under limited data got accepted by IJCV.
  • [12/2024] One collaborative paper got accepted by TNNLS, congrats to co-authors.
  • [10/2024] I recently began writing personal blogs about multi-modal learning and visual generation; check out the details Here, and some random thoughts at tinymind.
  • [07/2024] One paper about evaluating text-to-image diffusion models released to arXiv.
  • [07/2024] One paper about LLM-driven text-to-image diffusion models got accepted by ECCV-2024.
  • [06/2024] Honored to express our gratitude to ECUST on behalf of all graduates (News).
  • [05/2024] Finally completed my Ph.D. degree at ECUST (got all As in blind review and an average of 92.8 in the thesis defence), and was named an Outstanding Graduate of Shanghai.
  • [03/2024] Three collaborative papers got accepted by CVPR 2024 (Oral, 3%), PR, and EAAI, respectively; congrats to co-authors.
  • [01/2024] Honored to win the grand prize of the President's Scholarship (one student per year).
  • [10/2023] One collaborative paper got accepted by KBS, congrats to co-authors.
  • [09/2023] One paper got accepted by the NeurIPS Datasets and Benchmarks Track. Many thanks to my collaborators!
  • [09/2023] One paper got accepted by EAAI.
  • [08/2023] One survey paper on image synthesis under limited data released to arXiv.
  • [07/2023] Two papers got accepted by ACM Multimedia 2023.
  • [04/2023] One paper on evaluating synthesis quality released to arXiv.
  • [03/2023] One paper got accepted by Information Sciences.
  • [11/2022] I was honored to present our blessings at ECUST's 70th anniversary celebration. Happy birthday!
  • [10/2022] One paper got accepted by EAAI.
  • [09/2022] One paper got accepted by NeurIPS 2022.
  • [07/2022] One paper got accepted by ECCV 2022. This is my first first-authored top-tier conference paper!
  • [05/2022] One paper got accepted by IJCAI 2022.
Publications
* Corresponding author, † Equal contributions.
DiverseDiT: Towards Diverse Representation Learning in Diffusion Transformers
Mengping Yang, Zhiyu Tan, Binglei Li, Xiaomeng Yang, Hesen Chen, Hao Li*
CVPR 2026,
[PDF] [BibTeX] [Code]

We propose DiverseDiT, a novel and efficient framework explicitly designed to enhance representation diversity, using long residual connections to diversify inputs and a representation diversity loss to encourage distinct features across blocks, without relying on external guidance.

Unraveling MMDiT Blocks: Training-free Analysis and Enhancement of Text-conditioned Diffusion
Binglei Li, Mengping Yang (Project Lead), Zhiyu Tan, Junping Zhang*, Hao Li*
arXiv Preprint 2026,
[PDF] [BibTeX]

We systematically analyze block-wise contributions and their interactions with text conditions, offering a better understanding of the internal mechanisms within MMDiT-based generative models. Meanwhile, our analysis reveals several valuable findings that unlock new possibilities for improving synthesis quality. Based on these findings, we propose training-free techniques for improved text alignment, precise semantic editing, and accelerated inference.

Diff-Aid: Inference-time Adaptive Interaction Denoising for Rectified Text-to-Image Generation
Binglei Li, Mengping Yang (Project Lead), Zhiyu Tan, Junping Zhang*, Hao Li*
arXiv Preprint 2026,
[PDF] [BibTeX]

We present Diff-Aid, a plug-in module for rectified text-to-image diffusion transformers that learns to adaptively adjust block-wise interactions with token-level textual conditions across denoising trajectories.

Omni-Video 2: Scaling MLLM-Conditioned Diffusion for Unified Video Generation and Editing
Hao Yang, Zhiyu Tan, Jia Gong, Luozheng Qin, Hesen Chen, Xiaomeng Yang, Yuqing Sun, Yuetan Lin, Mengping Yang*, Hao Li*
arXiv Preprint 2026,
[PDF] [BibTeX] [Code]

We present Omni-Video 2, a scalable and computationally efficient model that connects pretrained multimodal large-language models (MLLMs) with video diffusion models for unified video generation and editing. Our key idea is to exploit the understanding and reasoning capabilities of MLLMs to produce explicit target captions to interpret user instructions.

Omni-Video: Democratizing Unified Video Understanding and Generation
Zhiyu Tan, Hao Yang, Luozheng Qin, Jia Gong, Mengping Yang*, Hao Li*
arXiv Preprint 2026,
[PDF] [BibTeX] [Code]

We present Omni-Video, an efficient and effective unified framework for video understanding, generation, as well as instruction-based editing. Our key insight is to teach existing multimodal large language models (MLLMs) to produce continuous visual clues that are used as the input of diffusion decoders, which produce high-quality videos conditioned on these visual clues.

Uni-CoT: Towards Unified Chain-of-Thought Reasoning Across Text and Vision
Luozheng Qin, Jia Gong, Yuqing Sun, Tianjiao Li, Haoyu Pan, Mengping Yang, Xiaomeng Yang, Chao Qu, Zhiyu Tan*, Hao Li*
ICLR 2026,
[PDF] [BibTeX] [Code]

We introduce Uni-CoT, a unified reasoning framework that extends CoT principles to the multimodal domain, empowering Multimodal Large Language Models (MLLMs) to perform interpretable, step-by-step reasoning across both text and vision.

Dual-IPO: Dual-Iterative Preference Optimization for Text-to-Video Generation
Xiaomeng Yang, Mengping Yang, Jia Gong, Luozheng Qin, Zhiyu Tan*, Hao Li*
ICLR 2026,
[PDF] [BibTeX] [Project & Code]

We introduce Dual-Iterative Preference Optimization (Dual-IPO), an iterative paradigm that sequentially optimizes both the reward model and the video generation model for improved synthesis quality and human preference alignment. For the reward model, our framework ensures reliable and robust reward signals via CoT-guided reasoning, voting-based self-consistency, and preference certainty estimation.

FreeCus: Free Lunch Subject-driven Customization in Diffusion Transformers
Yanbing Zhang, Zhe Wang, Qin Zhou*, Mengping Yang (Project Lead)
ICCV 2025,
[PDF] [BibTeX] [Code]

We propose FreeCus, a training-free framework for zero-shot subject-driven image generation using diffusion transformers (DiTs). The method introduces three innovations: (1) pivotal attention sharing (PAS) to preserve subject layout while maintaining editability, (2) adjusted noise shifting (ANS) to enhance fine-grained detail extraction, and (3) semantic feature compensation (SFC) via multimodal LLMs to address missing attributes.

Cockatiel: Ensembling Synthetic and Human Preferenced Training for Detailed Video Caption
Luozheng Qin, Zhiyu Tan, Mengping Yang (Co-led the work with ZY), Xiaomeng Yang, Hao Li*
arXiv 2025,
[PDF] [Project & Code] [BibTeX]

We develop Cockatiel, a three-stage training pipeline that assembles the advantages of multiple video detailed captioning (VDC) models and human preferences by conducting ensembled synthetic and human-preference training on VDC models, based on our caption quality scorer.

An Empirical Study and Analysis of Text-to-Image Generation Using Large Language Model-Powered Textual Representation
Zhiyu Tan, Mengping Yang, Luozheng Qin, Hao Yang, Ye Qian, Qiang Zhou, Cheng Zhang, Hao Li*
ECCV 2024,
[PDF] [Project] [BibTeX]

We propose an effective approach for incorporating LLMs into text-to-image diffusion models, improving LLMs' awareness of the CLIP visual and textual space and thus facilitating more expressive language understanding. Moreover, we devise an efficient three-stage training pipeline that accomplishes fast adaptation of LLM textual features with a small amount of resources, serving as a strong baseline for integrating LLMs into diffusion models and paving the way for this important topic.

Attention Calibration for Disentangled Text-to-Image Personalization
Yanbing Zhang, Mengping Yang (Project Lead), Qin Zhou, Zhe Wang*
CVPR 2024 (Oral Presentation),
[PDF] [BibTeX]

We propose an attention calibration mechanism to improve the concept-level understanding of the T2I model. Specifically, we first introduce new learnable modifiers bound with classes to capture attributes of multiple concepts. Then, the classes are separated and strengthened following the activation of the cross-attention operation, ensuring comprehensive and self-contained concepts. Additionally, we suppress the attention activation of different classes to mitigate mutual influence among concepts.

Revisiting the Evaluation of Image Synthesis with GANs
Mengping Yang, Ceyuan Yang, Yichi Zhang, Qingyan Bai, Yujun Shen, Bo Dai
NeurIPS Datasets and Benchmarks 2023,
[PDF] [BibTeX]

We conduct in-depth analyses of how to represent a data point in the feature space, how to calculate a fair distance using selected samples, and how many instances to use from each set. Together with these analyses, we build a comprehensive system for synthesis comparison, providing reliable and consistent rankings for unsupervised image generation models, including GANs and diffusion models.

Image Synthesis under Limited Data: A Survey and Taxonomy
Mengping Yang, Zhe Wang*
IJCV 2025,
[PDF] [Project] [BibTeX]

We provide a comprehensive survey on image synthesis under limited data, covering data-efficient generative modeling, few-shot generative adaptation, and few-shot and one-shot image synthesis.

Improving Few-shot Image Generation by Structural Discrimination and Textural Modulation
Mengping Yang, Zhe Wang*, Wenyi Feng, Qian Zhang, Ting Xiao
ACM MM 2023,
[PDF] [Project] [BibTeX]

We propose textural modulation (TexMod) and a structural discriminator (StructD) to improve the performance of few-shot image generation.

Semantic-Aware Generator and Low-level Feature Augmentation for Few-shot Image Generation
Zhe Wang*, Jiaoyan Guan, Mengping Yang (Project Lead), Ting Xiao, Ziqiu Chi
ACM MM 2023,
[PDF] [BibTeX]

We propose a semantic-aware generator (SAG) and low-level feature augmentation (LFA) to improve the performance of few-shot image generation.

ProtoGAN: Towards high diversity and fidelity image synthesis under limited data
Mengping Yang, Zhe Wang, Ziqiu Chi, Wenli Du
Information Sciences 2023,
[PDF] [BibTeX]

We propose ProtoGAN, a GAN that incorporates a metric-learning-based prototype mechanism into adversarial learning by aligning the prototypes and features of the synthesized and real distributions.

DFSGAN: Introducing editable and representative attributes for few-shot image generation
Mengping Yang, Saisai Niu, Zhe Wang, Dongdong Li, Wenli Du
EAAI 2023,
[PDF] [BibTeX]

We propose DFSGAN for few-shot image generation, which takes dynamic Gaussian mixture (DGM) latent codes as the generator's input.

FreGAN: Exploiting Frequency Components for Training GANs under Limited Data
Mengping Yang, Zhe Wang, Ziqiu Chi, Yanbing Zhang
NeurIPS 2022,
[PDF] [Project] [BibTeX]

We propose a frequency-aware model for training GANs under limited data, facilitating high-quality few-shot image synthesis.

WaveGAN: Frequency-Aware GAN for High-Fidelity Few-Shot Image Generation
Mengping Yang, Zhe Wang, Ziqiu Chi, Yanbing Zhang
ECCV 2022,
[PDF] [Project] [BibTeX]

We propose a frequency-aware model for few-shot image generation, enabling high-fidelity synthesis for downstream tasks.

Better Embedding and More Shots for Few-shot Learning
Ziqiu Chi, Zhe Wang, Mengping Yang, Wei Guo, Xinlei Xu
IJCAI 2022,
[PDF]

We develop Better Embedding and More Shots to address the distorted embedding of target data in few-shot learning.

Experiences
Research intern on generative models
Mentors: Dr. Ceyuan Yang and Dr. Bo Dai
Working on fundamental research problems and potential applications of deep generative models, mainly GANs and Diffusion Models.
Published one paper about evaluating generative models at NeurIPS D&B 2023, and presented a text-to-video generation demo at WAIC.
2022.07 —— 2023.07
Research intern on large-scale generative models
Mentor: Prof. Hao Li
Training and evaluating large-scale text-to-image/video diffusion models from scratch.
Published one paper about large language model-powered T2I diffusion models at ECCV 2024, and one paper about fine-tuning multimodal models for evaluating T2I models with human alignment (arXiv).
2023.11 —— 2024.05
Algorithm Researcher on Multimodal Learning
Developing multimodal models to reliably identify real/fake goods.
2024.07 —— 2025.06
Researcher on Multi-Modal Generative Modeling
Conducting research on fundamental problems and potential applications of multi-modal generative modeling, including text-to-image, text-to-video, and multimodal generation.
2025.06 —— Present (Internship available)
Selected Honors & Awards
  • Outstanding Graduates of Shanghai,
    2024
  • Grand prize of president's scholarship (校长奖学金特等奖, one student per year),
    2023
  • First-Class Scholarship for Graduate Students,
    2019-2024
  • Shanghai Sparkling Youth (上海市闪光青年),
    2022
  • Jiangxi Building Material Scholarship,
    2021
  • Suzhou Industrial Park Scholarship,
    2022
  • Chinese University Student of the Year (中国大学生年度人物),
    2020
  • Shanghai University Student of the Year (上海市大学生年度人物),
    2020
  • ECUST University Student of the Year (校大学生年度人物),
    2019
  • Outstanding Student,
    2016-2024
  • Second Prize in the Mathematics Competition for Chinese Graduate Students,
    2020
Professional activities
  • Conference Reviewer for CVPR (2023, 2024, 2025), NeurIPS (2023, 2024, 2025), IJCAI (2022, 2023), ACMMM (2023, 2024, 2025), ICCV (2025)
  • Journal Reviewer for TPAMI, TCSVT, PR, SI & VP
Blogs
Misc
  • I like reading books (mostly Social Sciences and Philosophy) and watching movies (mainly Sci-Fi and wuxia martial-arts films) in my free time.
  • I used to play basketball a lot (once a week at present); Kobe Bryant is my favorite player, always the GOAT in my heart.

This page has been accessed several times since Feb 2022.


The website template was adapted from Jon Barron.