Simon Coessens
CentraleSupélec • Paris, France
Arijit Samal
CentraleSupélec • Paris, France
|
This project began during our Master's at CentraleSupélec, and we continue to develop it as a side project.
Diffusion models have become a leading approach to generative image modeling, but many still operate in dense pixel space, a representation that is computationally intensive and lacks geometric structure. We propose GaussianDiffusion, a framework that performs the denoising process entirely in a latent space composed of 2D Gaussians. Each image is encoded as a set of 150 anisotropic Gaussian splats, parameterized by position, covariance, and color. To model their dynamics, we introduce GaussianTransformer, a permutation-equivariant transformer that serves as the denoising network. Evaluated on the MNIST and Sprites datasets, our method achieves visual quality comparable to a pixel-space U-Net baseline while reducing the number of sampling steps from 1000 to 200 and the per-step cost from 11.4 GFLOPs to 4 GFLOPs, yielding an overall 22× speedup in generation time on an A100 GPU. In contrast to latent diffusion models, our approach requires no auxiliary autoencoder and preserves full editability of the latent representation. These findings suggest that structured geometric representations can offer efficient and interpretable alternatives to latent- and pixel-based diffusion.
GaussianDiffusion is an ongoing research effort. Early evidence suggests that learning in a structured Gaussian latent delivers competitive visual quality with significantly fewer sampling steps and lower compute. We are actively expanding experiments and ablations across datasets and architectural choices, and have submitted to the ICCV 2025 Structural Priors for Vision (SP4V) workshop to gather feedback and iterate; see the reviews and workshop link below. We have intentionally not included quantitative evaluations yet, as we are prioritizing core improvements; we will release comprehensive metrics (e.g., FID/IS/KID) as we finalize the model.
- Paper (preprint): GaussianDiffusion — Learning Image Generation Process in Gaussian Representation Space

- Review PDF (SP4V workshop): ICCV Structural Priors for Vision (SP4V) reviews

- SP4V workshop website: sp4v.github.io
- HAL submission: hal-05243514 (in progress)
- arXiv: submission in progress
- Gaussian latent space: Represent each image as a set of K Gaussians with per-point parameters (e.g., σx, σy, ρ, RGB/alpha, x, y)
- Diffusion over sets: Transformer-based noise predictor with timestep conditioning (DDPM-style schedule)
- Differentiable rendering: Reconstruct images from Gaussians via splatting
- Datasets: MNIST (K=70, 7-dim), Sprites (K≈150–500, 8–9 dim), CIFAR-10 (K=200, 8-dim)
- Metrics: FID, Inception Score, KID
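To make the latent concrete, here is a minimal NumPy sketch of the DDPM forward process over a set of Gaussians. The per-point parameter ordering and the linear-schedule endpoints are assumptions for illustration, not the repo's exact conventions; only K=70, 7 dimensions, and 200 steps come from the description above.

```python
import numpy as np

# One MNIST image as a set of K = 70 Gaussians with 7 parameters each,
# e.g. (x, y, sigma_x, sigma_y, rho, intensity, alpha) -- ordering assumed.
K, D = 70, 7
rng = np.random.default_rng(0)
x0 = rng.standard_normal((K, D)).astype(np.float32)  # clean latent

# DDPM forward process: q(x_t | x_0) = N(sqrt(abar_t) x_0, (1 - abar_t) I).
# Noise is added independently per element, so the set structure (and hence
# permutation equivariance of the denoiser) is preserved.
T = 200                                  # sampling steps, as in the abstract
betas = np.linspace(1e-4, 0.02, T)       # assumed linear schedule endpoints
alphas_bar = np.cumprod(1.0 - betas)

t = 100
eps = rng.standard_normal((K, D)).astype(np.float32)
x_t = np.sqrt(alphas_bar[t]) * x0 + np.sqrt(1.0 - alphas_bar[t]) * eps
print(x_t.shape)  # (70, 7)
```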
This project uses precomputed Gaussian representations:
- MNIST: directory `mnist_gaussian_representations/`; each file contains an array with shape `[70, 7]`.
- Sprites (32×32 / 64×64): HDF5 files `sprite_*` with `scaling`, `rotation`, `opacity`, `features`, and `xyz` datasets.
- CIFAR-10: HDF5 files `img_*` with the same per-parameter datasets.
See loaders in src/dataset.py for exact formats.
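As a rough illustration of the MNIST format, here is a sketch of a batch loader. The `.npy` serialization, the file naming, and the helper name are assumptions; `src/dataset.py` remains the authoritative reference.

```python
import tempfile
from pathlib import Path

import numpy as np

def load_mnist_gaussians(root) -> np.ndarray:
    """Stack every per-image latent under `root` into one [N, 70, 7] array.

    Each file in mnist_gaussian_representations/ holds one image's
    70 Gaussians with 7 parameters apiece.
    """
    files = sorted(Path(root).glob("*.npy"))  # .npy serialization is assumed
    return np.stack([np.load(f) for f in files], axis=0)

# Hypothetical round trip, just to show the expected shapes:
with tempfile.TemporaryDirectory() as d:
    np.save(Path(d) / "img_000.npy", np.zeros((70, 7), dtype=np.float32))
    batch = load_mnist_gaussians(d)
print(batch.shape)  # (1, 70, 7)
```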
- `src/models/` — Transformer architectures (e.g., `transformer_model.py`, `flash_transformer_model.py`)
- `src/train/` — Training scripts for MNIST, Sprites, and CIFAR-10
- `src/metrics/` — Sampling and evaluation utilities (FID/IS/KID)
- `src/utils/` — Normalization/denormalization and rendering helpers
- `src/ddpm.py` — DDPM schedules (linear/cosine)
- `src/dataset.py` — Dataset readers for Gaussian latents
- `Preliminary work/` — Notebooks and dataset preparation drafts
- Model API: `src/models/transformer_model.py`
- Schedules: `src/ddpm.py`
- Rendering: `src/utils/gaussian_to_image.py`
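The actual renderer lives in `src/utils/gaussian_to_image.py`; as an illustration only, here is a minimal NumPy sketch of additive splatting for a grayscale image. The 7-parameter ordering `(x, y, sigma_x, sigma_y, rho, intensity, alpha)` is an assumption, and a real implementation would vectorize over Gaussians.

```python
import numpy as np

def render_gaussians(params: np.ndarray, size: int = 28) -> np.ndarray:
    """Splat K Gaussians (x, y, sx, sy, rho, intensity, alpha) onto a grid.

    Additive splatting: each Gaussian adds its correlated 2D density,
    weighted by intensity * alpha, to every pixel. Every operation is
    smooth in the parameters, which is what makes the renderer
    differentiable.
    """
    ys, xs = np.mgrid[0:size, 0:size] / (size - 1)   # pixel grid in [0, 1]
    img = np.zeros((size, size), dtype=np.float32)
    for x, y, sx, sy, rho, c, a in params:
        dx, dy = xs - x, ys - y
        # Quadratic form of the inverse covariance of a correlated Gaussian.
        q = (dx / sx) ** 2 - 2 * rho * (dx / sx) * (dy / sy) + (dy / sy) ** 2
        img += c * a * np.exp(-0.5 * q / (1.0 - rho ** 2))
    return np.clip(img, 0.0, 1.0)

# One tilted anisotropic blob near the image center:
one = np.array([[0.5, 0.5, 0.15, 0.08, 0.3, 1.0, 1.0]], dtype=np.float32)
frame = render_gaussians(one)
print(frame.shape)  # (28, 28)
```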
- Ensure your dataset paths are correct and match the expected structure (`src/dataset.py`).
- For module imports to work consistently, prefer `python -m src.<...>` from the repo root, or set `PYTHONPATH=$(pwd)`.
- Some scripts contain HPC paths; replace them with your local absolute paths.
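The import tip can be summarized as a short shell sketch. Run it from the repository root; the module placeholder is left elided, as in the note above.

```shell
# Option 1: run scripts as modules so intra-package imports resolve:
#   python -m src.<...>
# Option 2: make the repo root importable explicitly:
export PYTHONPATH="$(pwd)"
echo "PYTHONPATH=$PYTHONPATH"
```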

