
KlingAIResearch/Send-VAE


Boosting Latent Diffusion Models via Disentangled Representation Alignment

John Page¹* · Xuesong Niu¹* · Kai Wu¹† · Kun Gai¹

¹Kolors Team, Kuaishou Technology
*Equal Contribution   †Project Lead

📃 Paper

Key Highlights of Send-VAE

  • Bridging the Representational Gap: As shown above, unlike previous direct alignment-based methods, our non-linear mapper effectively bridges the gap between the VAE’s local structures and the VFM’s dense semantics.
  • Semantic Disentanglement: Send-VAE facilitates the seamless injection of contextual knowledge while actively preserving the VAE’s native structured semantics, encouraging emergent disentanglement without requiring explicit regularization constraints.
  • Remarkable Acceleration: When integrated with flow-based transformers (SiTs), Send-VAE significantly accelerates the convergence of diffusion models, clearly outperforming both REPA and the enhanced REPA-E baselines.

Overview

What characteristics make VAE-based generation more “friendly”? To answer this question, we conduct experiments with three recently proposed evaluation methods for the VAE latent space and show the results above. We reveal a surprising finding: the richness and accessibility of structured, fine-grained semantics is a more fundamental prerequisite for VAE latents than high-level semantic alignment. Building on this finding, we introduce Send-VAE, which leverages the rich representations of VFMs through a carefully designed non-linear mapping architecture.
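The exact mapper architecture is defined in the training code rather than in this overview; as a rough illustration of the idea, the sketch below uses a hypothetical two-layer MLP (toy sizes `d_vae=4`, `d_vfm=768`, which are assumptions, not the paper's settings) to map pooled VAE latents into the VFM feature space, scored with a cosine-similarity alignment loss:

```python
import numpy as np

rng = np.random.default_rng(0)
d_vae, d_vfm, hidden = 4, 768, 256  # hypothetical dimensions for illustration

# A small two-layer MLP: a *non-linear* mapper from VAE latent channels
# to VFM feature space, in contrast to direct (linear) alignment.
W1 = rng.standard_normal((d_vae, hidden)) * 0.02
W2 = rng.standard_normal((hidden, d_vfm)) * 0.02

def map_latents(z):
    """z: (batch, d_vae) pooled VAE latents -> (batch, d_vfm) predicted VFM features."""
    h = np.maximum(z @ W1, 0.0)  # ReLU non-linearity
    return h @ W2

def alignment_loss(pred, target, eps=1e-8):
    """1 - cosine similarity between mapped latents and VFM features, averaged over the batch."""
    pred = pred / (np.linalg.norm(pred, axis=-1, keepdims=True) + eps)
    target = target / (np.linalg.norm(target, axis=-1, keepdims=True) + eps)
    return float(1.0 - (pred * target).sum(-1).mean())

z = rng.standard_normal((8, d_vae))   # stand-in for pooled VAE latents
f = rng.standard_normal((8, d_vfm))   # stand-in for VFM (e.g. DINOv2) features
loss = alignment_loss(map_latents(z), f)
print(loss)
```

Minimizing such a loss pulls the mapped latents toward the VFM's dense semantics while leaving the VAE's own latent structure untouched, which is the intuition behind the disentanglement described above.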

Getting Started

1. Environment Setup

To set up our environment, please run:

git clone https://github.com/REPA-E/REPA-E.git
cd REPA-E
conda env create -f environment.yml -y
conda activate repa-e

2. Prepare the training data

Download and extract the training split of the ImageNet-1K dataset. Once it's ready, run the following command to preprocess the dataset:

python preprocessing.py --imagenet-path /PATH/TO/IMAGENET_TRAIN

Replace /PATH/TO/IMAGENET_TRAIN with the actual path to the extracted training images.

3. Train the Send-VAE model

Use the following script to train Send-VAE:

./train_sendvae.sh

You can adjust the following options:

  • --output-dir: Directory to save checkpoints and logs
  • --exp-name: Experiment name (a subfolder will be created under output-dir)
  • --vae: Choose between [f8d4, f16d32]
  • --vae-ckpt: Path to a provided or custom VAE checkpoint
  • --disc-pretrained-ckpt: Path to a provided or custom VAE discriminator checkpoint
  • --enc-type: [dinov2-vit-b, dinov2-vit-l, dinov2-vit-g, clip-vit-L, jepa-vit-h]
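For example, a run with the f8d4 VAE and a DINOv2-B encoder might look like the following (the flag values and checkpoint paths here are illustrative placeholders, not recommended settings):

```shell
./train_sendvae.sh \
  --output-dir ./output \
  --exp-name sendvae-f8d4-dinov2b \
  --vae f8d4 \
  --vae-ckpt /PATH/TO/VAE_CKPT \
  --disc-pretrained-ckpt /PATH/TO/DISC_CKPT \
  --enc-type dinov2-vit-b
```

Checkpoints and logs for this run would land under `./output/sendvae-f8d4-dinov2b`.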

4. Use Send-VAE for Accelerated Training and Better Generation

First, cache the latents for fast training:

./cache_latent.sh

Configure the VAE checkpoint path based on your setup. Then, train the latent diffusion models using:

./train_ldm.sh

You can adjust the following options:

  • --output-dir: Directory to save checkpoints and logs
  • --exp-name: Experiment name (a subfolder will be created under output-dir)
  • --vae: Choose between [f8d4, f16d32]
  • --vae-ckpt: Path to a provided or custom VAE checkpoint
  • --models: Choose from [SiT-B/2, SiT-L/2, SiT-XL/2, SiT-B/1, SiT-L/1, SiT-XL/1]. The number indicates the patch size. Select a model compatible with your VAE architecture.
  • --enc-type: [dinov2-vit-b, dinov2-vit-l, dinov2-vit-g, clip-vit-L, jepa-vit-h]
  • --encoder-depth: Any integer from 1 up to the full depth of the selected encoder
  • --proj-coeff: REPA-E projection coefficient for SiT alignment (float > 0)
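Putting the options together, a diffusion-training run might look like the following (model choice, depth, and coefficient here are illustrative placeholders; pick values compatible with your VAE and encoder):

```shell
./train_ldm.sh \
  --output-dir ./output \
  --exp-name sit-xl2-f8d4 \
  --vae f8d4 \
  --vae-ckpt /PATH/TO/SENDVAE_CKPT \
  --models SiT-XL/2 \
  --enc-type dinov2-vit-b \
  --encoder-depth 8 \
  --proj-coeff 0.5
```

Note that `SiT-*/2` models pair with the f8d4 VAE's patching, while `SiT-*/1` variants target the f16d32 architecture; match `--models` to `--vae` accordingly.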

5. Generate samples and run evaluation

Generate samples and save them as .npz files using the following script. Set the --exp-path and --train-steps based on your setup.

./sample.sh
