Decomposable Flow Matching (DFM): A generative model combining multiscale decomposition with Flow Matching. DFM progressively synthesizes different representation scales by generating coarse-structure scale first and incrementally refining it with finer scales.
Our framework (DFM) progressively synthesizes images by combining multiscale decomposition with Flow Matching. We build on the DiT architecture, adding per-scale patchification and timestep-embedding layers while leaving the core Transformer blocks untouched.
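To make the coarse-to-fine idea concrete, here is a minimal, hypothetical sketch of a Laplacian-style two-scale decomposition on a 1D signal, together with the linear interpolant used by Flow Matching. The helper names (`decompose`, `fm_interpolant`, etc.) are illustrative; the paper's actual decomposition and number of scales may differ.

```python
# Hypothetical sketch: two-scale (coarse + detail) decomposition of a 1D signal.
# DFM generates the coarse scale first and then refines it with finer scales;
# the exact decomposition in the paper may differ from this illustration.

def downsample(x):
    # average-pool by a factor of 2
    return [(x[i] + x[i + 1]) / 2 for i in range(0, len(x), 2)]

def upsample(x):
    # nearest-neighbour upsample by a factor of 2
    out = []
    for v in x:
        out.extend([v, v])
    return out

def decompose(x):
    # coarse scale plus the fine-scale residual (detail)
    coarse = downsample(x)
    detail = [a - b for a, b in zip(x, upsample(coarse))]
    return coarse, detail

def reconstruct(coarse, detail):
    # exact inverse of decompose
    return [a + b for a, b in zip(upsample(coarse), detail)]

def fm_interpolant(x0, x1, t):
    # linear Flow Matching path between noise x0 and data x1;
    # the regression target (velocity) along this path is x1 - x0
    return [(1 - t) * a + t * b for a, b in zip(x0, x1)]

signal = [1.0, 3.0, 2.0, 6.0]
coarse, detail = decompose(signal)
assert reconstruct(coarse, detail) == signal
```

In DFM each scale gets its own patchification and timestep embedding, so the model can denoise the coarse representation before the detail residuals are introduced.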
Across image and video generation, DFM outperforms the best-performing baselines, matching the Fréchet DINO Distance (FDD) of Flow Matching baselines with up to 2x less training compute.
Finetuning FLUX-dev with DFM (FLUX-DFM) yields better results than standard full finetuning (FLUX-FT) for the same training compute.
When trained from scratch on ImageNet-1k 512px, DFM achieves better quality than baselines using the same training resources.
DFM is also suited for video generation, achieving better structural and visual quality than baselines when trained on the Kinetics-700 dataset with the same compute budget.
We find that DFM benefits from more sampling steps in the coarse-structure stage while needing only a few in the high-frequency stage, and that it remains largely insensitive to the per-stage noise thresholds used at sampling time, especially at high CFG values.
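The step-allocation finding above can be sketched as a simple per-stage timestep schedule. This is an illustrative assumption, not the paper's sampler: a single noise threshold `tau` splits integration into a coarse-structure stage (many steps) and a high-frequency stage (few steps), and all names and default values are hypothetical.

```python
# Hypothetical per-stage sampling schedule for a two-stage coarse-to-fine sampler.
# A noise threshold tau splits the trajectory: the coarse-structure stage runs
# from t=1 down to tau with many steps, the high-frequency stage from tau to 0
# with only a few. Values of tau and the step counts are illustrative.

def stage_schedule(t_hi, t_lo, n_steps):
    # n_steps evenly spaced integration steps from t_hi down to t_lo
    # (returns n_steps + 1 timestep endpoints, inclusive)
    step = (t_hi - t_lo) / n_steps
    return [t_hi - i * step for i in range(n_steps + 1)]

def dfm_schedule(tau=0.3, coarse_steps=40, fine_steps=8):
    coarse = stage_schedule(1.0, tau, coarse_steps)  # coarse-structure stage
    fine = stage_schedule(tau, 0.0, fine_steps)      # high-frequency stage
    return coarse, fine

coarse, fine = dfm_schedule()
```

Because quality is largely insensitive to `tau`, the threshold can be treated as a coarse hyperparameter rather than something that needs careful tuning.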
If you find this work useful in your research, please consider citing:
@inproceedings{dfm,
  title={Improving Progressive Generation with Decomposable Flow Matching},
  author={Moayed Haji-Ali and Willi Menapace and Ivan Skorokhodov and
          Arpit Sahni and Sergey Tulyakov and Vicente Ordonez and
          Aliaksandr Siarohin},
  booktitle={Advances in Neural Information Processing Systems},
  year={2025}
}