John Page1* · Xuesong Niu1* · Kai Wu1 † · Kun Gai1
1 Kolors Team, Kuaishou Technology
*Equal Contribution †Project Leads
✨ Key Highlights of Send-VAE
- Bridging the Representational Gap: As shown above, unlike previous direct alignment-based methods, our non-linear mapper effectively bridges the gap between the VAE’s local structures and the VFM’s dense semantics.
- Semantic Disentanglement: Send-VAE facilitates the seamless injection of contextual knowledge while actively preserving the VAE’s native structured semantics, encouraging emergent disentanglement without requiring explicit regularization constraints.
- Remarkable Acceleration: When integrated with flow-based transformers (SiTs), Send-VAE significantly accelerates the convergence of diffusion models, clearly outperforming both REPA and the enhanced REPA-E baselines.
What characteristics make a VAE latent space more “friendly” for generation? To answer this question, we conduct experiments with three recently proposed evaluation methods for VAE latent spaces and report the results above. We reveal a surprising finding: the richness and accessibility of structured, fine-grained semantics are a more fundamental prerequisite for VAE latents than high-level semantic alignment. Building on this insight, we introduce Send-VAE, which leverages the rich representations of VFMs through a carefully designed non-linear mapping architecture.
To set up our environment, please run:

```shell
git clone https://github.com/REPA-E/REPA-E.git
cd REPA-E
conda env create -f environment.yml -y
conda activate repa-e
```

Download and extract the training split of the ImageNet-1K dataset. Once it's ready, run the following command to preprocess the dataset:

```shell
python preprocessing.py --imagenet-path /PATH/TO/IMAGENET_TRAIN
```

Replace `/PATH/TO/IMAGENET_TRAIN` with the actual path to the extracted training images.
Use the following script to train Send-VAE:

```shell
./train_sendvae.sh
```

You can adjust the following options:

- `--output-dir`: Directory to save checkpoints and logs
- `--exp-name`: Experiment name (a subfolder will be created under `output-dir`)
- `--vae`: Choose between `[f8d4, f16d32]`
- `--vae-ckpt`: Path to a provided or custom VAE checkpoint
- `--disc-pretrained-ckpt`: Path to a provided or custom VAE discriminator checkpoint
- `--enc-type`: Choose from `[dinov2-vit-b, dinov2-vit-l, dinov2-vit-g, clip-vit-L, jepa-vit-h]`
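As a concrete starting point, an invocation with the `f8d4` VAE and a DINOv2-B encoder might look like the sketch below; the output directory, experiment name, and checkpoint path are illustrative placeholders, not values from the repository.

```shell
# Illustrative example -- adjust paths and flag values to your setup.
./train_sendvae.sh \
  --output-dir exps \
  --exp-name sendvae-f8d4-dinov2b \
  --vae f8d4 \
  --vae-ckpt /PATH/TO/VAE_CKPT \
  --enc-type dinov2-vit-b
```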
Cache latents first for fast training:

```shell
./cache_latent.sh
```

Configure the VAE checkpoint path based on your setup. Then, train the latent diffusion models using:

```shell
./train_ldm.sh
```

You can adjust the following options:

- `--output-dir`: Directory to save checkpoints and logs
- `--exp-name`: Experiment name (a subfolder will be created under `output-dir`)
- `--vae`: Choose between `[f8d4, f16d32]`
- `--vae-ckpt`: Path to a provided or custom VAE checkpoint
- `--models`: Choose from `[SiT-B/2, SiT-L/2, SiT-XL/2, SiT-B/1, SiT-L/1, SiT-XL/1]`. The number indicates the patch size; select a model compatible with your VAE architecture.
- `--enc-type`: Choose from `[dinov2-vit-b, dinov2-vit-l, dinov2-vit-g, clip-vit-L, jepa-vit-h]`
- `--encoder-depth`: Any integer from 1 up to the full depth of the selected encoder
- `--proj-coeff`: REPA-E projection coefficient for SiT alignment (float > 0)
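For example, training a SiT-XL/2 on `f8d4` latents with DINOv2-B alignment might be launched as follows; the experiment name, checkpoint path, alignment depth, and projection coefficient are illustrative placeholders, not recommended settings.

```shell
# Illustrative example -- adjust values to match your VAE and hardware.
./train_ldm.sh \
  --output-dir exps \
  --exp-name sit-xl2-f8d4 \
  --vae f8d4 \
  --vae-ckpt /PATH/TO/VAE_CKPT \
  --models SiT-XL/2 \
  --enc-type dinov2-vit-b \
  --encoder-depth 8 \
  --proj-coeff 0.5
```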
Generate samples and save them as `.npz` files using the following script. Set the `--exp-path` and `--train-steps` options based on your setup.

```shell
./sample.sh
```
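The generated `.npz` file can be inspected with NumPy. The snippet below assumes the samples are stored as a single array under NumPy's default key `arr_0`; check your file's keys if the sampling script uses a different convention. The dummy file and its shape are purely for illustration.

```python
import numpy as np

# Write a dummy .npz in the assumed layout (one uint8 image array under
# the default key "arr_0"); replace this with your actual sample file.
np.savez("samples.npz", np.zeros((4, 32, 32, 3), dtype=np.uint8))

data = np.load("samples.npz")
print(list(data.keys()))             # arrays stored in the file
samples = data["arr_0"]
print(samples.shape, samples.dtype)  # (4, 32, 32, 3) uint8
```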
