NeurIPS 2025

Improving Progressive Generation with
Decomposable Flow Matching

Snap Research Rice University
TL;DR

Decomposable Flow Matching (DFM) is a simple framework that progressively generates visual modalities scale by scale, converging up to 50% faster than standard Flow Matching. Read the paper and explore the code for more details.

Decomposable Flow Matching (DFM): A generative model combining multiscale decomposition with Flow Matching. DFM progressively synthesizes different representation scales, generating the coarse-structure scale first and incrementally refining it with finer scales.

🔬 Method

DFM Architecture Overview

DFM Architecture

Our framework (DFM) progressively synthesizes images by combining multiscale decomposition with Flow Matching. We equip the DiT backbone with per-scale patchification and timestep-embedding layers while leaving the core transformer blocks untouched.
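The coarse-to-fine recipe above can be sketched in a few lines. This is an illustrative stand-in, not the paper's implementation: the decomposition here is a simple average-pool Laplacian-style pyramid and the interpolant is the standard linear (rectified-flow-style) Flow Matching interpolant; `decompose` and `flow_matching_target` are hypothetical helper names.

```python
import numpy as np

def decompose(x, num_scales=3):
    """Split an image into a coarse-to-fine pyramid (hypothetical sketch).

    Uses 2x average pooling as a stand-in for the true decomposition.
    Returns [coarsest, detail, ..., finest_detail]; upsampling and
    summing the entries reconstructs the original image exactly.
    """
    scales = []
    current = x
    for _ in range(num_scales - 1):
        h, w = current.shape
        coarse = current.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))
        up = np.repeat(np.repeat(coarse, 2, axis=0), 2, axis=1)
        scales.append(current - up)   # detail residual at this scale
        current = coarse
    scales.append(current)            # coarsest structure
    return scales[::-1]               # coarse first, matching DFM's order

def flow_matching_target(x1_scale, t, rng):
    """Linear Flow Matching interpolant for one scale.

    x_t = (1 - t) * noise + t * data; the regression target is the
    velocity (data - noise).
    """
    x0 = rng.standard_normal(x1_scale.shape)
    x_t = (1.0 - t) * x0 + t * x1_scale
    velocity = x1_scale - x0
    return x_t, velocity
```

During training, each scale of the decomposition would get its own interpolant and timestep embedding; at sampling time the coarse scale is generated first and the finer residuals are added on top.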

📊 Results

Training Convergence

Across image and video generation, DFM outperforms the best-performing baselines, reaching the same Fréchet DINO Distance (FDD) as Flow Matching baselines with up to 2x less training compute.

DFM convergence comparison

๐Ÿ–ผ๏ธ Qualitative Results

Large-Scale Finetuning

Finetuning FLUX-dev with DFM (FLUX-DFM) yields better results than standard full finetuning (DFM-FT) for the same training compute.

FLUX-DFM vs DFM-FT comparison

Training From Scratch for Image Generation

When trained from scratch on ImageNet-1k 512px, DFM achieves better quality than baselines using the same training resources.

DFM ImageNet-1k results

Training From Scratch for Video Generation

DFM is also well suited for video generation, achieving better structural and visual quality than baselines when trained on the Kinetics-700 dataset with the same compute budget.

Ablations

We find that DFM benefits from more sampling steps in the coarse-structure stage while needing only a few in the high-frequency stage, and that it remains largely insensitive to the choice of per-stage noise threshold at sampling time, especially at high CFG values.
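As a rough illustration of this finding, a fixed sampling budget could be front-loaded onto the coarse stage. The `allocate_steps` helper and its `coarse_ratio` heuristic below are hypothetical, not the paper's actual schedule:

```python
def allocate_steps(total_steps, num_stages, coarse_ratio=0.6):
    """Split a sampling budget across stages, front-loading the coarse stage.

    Hypothetical heuristic: the first (coarse-structure) stage gets
    `coarse_ratio` of the budget; the remainder is split evenly among
    the finer stages, reflecting the observation that high-frequency
    stages need only a few steps.
    """
    coarse = max(1, round(total_steps * coarse_ratio))
    remaining = total_steps - coarse
    fine_stages = num_stages - 1
    per_fine = max(1, remaining // fine_stages) if fine_stages else 0
    steps = [coarse] + [per_fine] * fine_stages
    # Give any leftover steps to the coarse stage.
    steps[0] += total_steps - sum(steps)
    return steps
```

For example, a 50-step budget over three stages would yield `[30, 10, 10]` with the default ratio, spending most of the compute where the ablation suggests it matters.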

Sampling steps ablation

CFG threshold ablation

๐Ÿ“ Citation

If you find this work useful in your research, please consider citing:

@inproceedings{dfm,
  title={Improving Progressive Generation with Decomposable Flow Matching},
  author={Moayed Haji-Ali and Willi Menapace and Ivan Skorokhodov and 
    Arpit Sahni and Sergey Tulyakov and Vicente Ordonez and 
    Aliaksandr Siarohin},
  booktitle={Advances in Neural Information Processing Systems},
  year={2025}
}