🚀 VQRAE: Representation Quantization Autoencoders for Multimodal Understanding, Generation and Reconstruction

🎉 CVPR 2026 Paper

Sinan Du¹ ³*, Jiahao Guo² ³*, Bo Li³✉, Shuhao Cui³, Zhengzhuo Xu¹, Yifu Luo¹, Yongxian Wei¹,
Kun Gai³, Xinggang Wang², Kai Wu³†, Chun Yuan¹✉

¹Tsinghua University
²Huazhong University of Science and Technology (HUST)
³Kolors Team, Kuaishou Technology

*Equal Contribution | †Project Lead | ✉Corresponding Author


📰 Overview

This repository contains the official PyTorch implementation of our paper:

VQRAE: Representation Quantization Autoencoders for Multimodal Understanding, Generation and Reconstruction.

We propose VQRAE, a vector-quantized variant of Representation AutoEncoders, which to our knowledge is the first unified tokenizer to simultaneously produce continuous semantic features for image understanding and discrete tokens for visual generation.

🚀 Environment

```bash
conda create -n vqrae python=3.11 -y
conda activate vqrae

pip install -r requirements.txt
```

📦 Data & Model Preparation

```bash
cd VQRAE
pip install huggingface_hub
```

1. InternViT

```bash
hf download OpenGVLab/InternVL3-8B \
    --local-dir huggingface_hub/InternVL3-8B
```

2. BLIP3o (training data)

```bash
hf download BLIP3o/BLIP3o-Pretrain-Long-Caption \
    --local-dir huggingface_hub/BLIP3o/BLIP3o-Pretrain-Long-Caption

hf download BLIP3o/BLIP3o-Pretrain-JourneyDB \
    --local-dir huggingface_hub/BLIP3o/BLIP3o-Pretrain-JourneyDB

hf download BLIP3o/BLIP3o-Pretrain-Short-Caption \
    --local-dir huggingface_hub/BLIP3o/BLIP3o-Pretrain-Short-Caption
```

3. ImageNet (eval data)

```bash
hf download ILSVRC/imagenet-1k \
    --local-dir huggingface_hub/ILSVRC/imagenet-1k
```

4. VQRAE-InternViT (optional, for reproducing our results)

```bash
hf download Kwai-Kolors/VQRAE \
    --local-dir huggingface_hub/VQRAE
```

5. Ta-Tok (optional, for ablation)

```bash
hf download ByteDance-Seed/Tar-TA-Tok \
    --local-dir huggingface_hub/Tar/TA-Tok
```

🎓 Training Stage 1

`bash cmds/vqrae/vqrae_internvit_stage1.sh` trains the ViT decoder and the VQ codebook while keeping the encoder frozen.

`bash cmds/vqrae/vqrae_internvit_stage1_wigan.sh` enables GAN training for a lower rFID.

`bash cmds/vqrae/vqrae_internvit_stage1_randinit_1e-4.sh` skips the text-aligned codebook initialization, which slightly degrades the results (using 36M samples in stage 1):

| Metrics | rFID | PSNR | SSIM |
| --- | --- | --- | --- |
| randinit | 2.94 | 18.49 | 0.63 |
| text-aligned | 2.46 | 18.91 | 0.65 |

`bash cmds/vqrae/vqrae_internvit_stage1_wo_semantic_1e-4.sh` trains without pretrained ViT encoder weights.
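At the heart of the VQ codebook trained in this stage is a nearest-neighbor lookup that turns continuous encoder features into discrete token ids. A minimal NumPy sketch of that lookup step (illustrative shapes and names only, not the repository's implementation):

```python
import numpy as np

def quantize(features, codebook):
    """Map each continuous feature vector to its nearest codebook entry.

    features: (N, D) continuous encoder outputs
    codebook: (K, D) code vectors
    Returns (indices, quantized): discrete token ids and their code vectors.
    """
    # Squared L2 distance between every feature and every code:
    # ||f - c||^2 = ||f||^2 - 2 f.c + ||c||^2
    d = (
        (features ** 2).sum(axis=1, keepdims=True)
        - 2.0 * features @ codebook.T
        + (codebook ** 2).sum(axis=1)
    )
    indices = d.argmin(axis=1)      # discrete tokens for generation
    quantized = codebook[indices]   # vectors passed on to the ViT decoder
    return indices, quantized

# Toy example: two features snap to the nearest of three codes.
codebook = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]])
feats = np.array([[0.1, -0.1], [1.9, 2.1]])
idx, q = quantize(feats, codebook)  # idx -> [0, 2]
```

In actual training the lookup is non-differentiable, so a straight-through estimator (copying gradients from the quantized vectors back to the encoder features) is the standard way to train through it.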

🔬 Training Stage 2

`bash cmds/vqrae/vqrae_internvit_stage2.sh` trains the ViT decoder, the VQ codebook, and the ViT encoder with self-distillation constraints.

`bash cmds/vqrae/vqrae_internvit_stage2_wigan.sh` continues training with GAN enabled for a lower rFID.

`bash cmds/vqrae/vqrae_internvit_stage2_wo_semantic_1e-4.sh` trains without pretrained ViT encoder weights.
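The self-distillation constraint in stage 2 can be pictured as penalizing drift between the now-trainable encoder and its frozen pretrained features. A toy NumPy sketch of one common form of such a loss, mean (1 − cosine similarity) per patch (illustrative only, not necessarily the paper's exact objective):

```python
import numpy as np

def distill_loss(student, teacher, eps=1e-8):
    """Mean (1 - cosine similarity) between student and frozen teacher features.

    student, teacher: (N, D) per-patch feature matrices.
    The loss is ~0 when the student matches the teacher's feature directions.
    """
    s = student / (np.linalg.norm(student, axis=1, keepdims=True) + eps)
    t = teacher / (np.linalg.norm(teacher, axis=1, keepdims=True) + eps)
    return float(np.mean(1.0 - (s * t).sum(axis=1)))

teacher = np.array([[1.0, 0.0], [0.0, 2.0]])
no_drift = distill_loss(teacher, teacher)   # ≈ 0: encoder unchanged
```

Keeping this term small is what lets the encoder be fine-tuned for reconstruction without destroying the semantic features needed for understanding.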

🔎 Evaluation

`bash scripts/run_eval.sh` evaluates the officially released VQRAE-InternViT checkpoint; you can adapt it for other trained checkpoints.
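Of the three metrics reported below, PSNR has the simplest definition. A minimal NumPy sketch, assuming images scaled to [0, 1] (not the repository's evaluation code):

```python
import numpy as np

def psnr(ref, recon, max_val=1.0):
    """Peak signal-to-noise ratio in dB for images scaled to [0, max_val]."""
    mse = np.mean((ref - recon) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return float(10.0 * np.log10(max_val ** 2 / mse))

ref = np.zeros((4, 4))
recon = ref + 0.1            # uniform 0.1 error -> MSE = 0.01
print(psnr(ref, recon))      # ≈ 20.0 dB
```

rFID and SSIM are more involved (Inception features and local luminance/contrast statistics, respectively); libraries such as scikit-image and standard FID implementations are typically used for those.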

Update

After releasing our paper, we reproduced the results with the new codebase, so the metrics may differ slightly from those reported in the paper. You can obtain the results below with VQRAE-InternViT:

| Metrics | rFID | PSNR | SSIM |
| --- | --- | --- | --- |
| reported | 1.31 | 22.23 | 0.762 |
| reproduced | 0.61 | 23.41 | 0.795 |

Others

During the CVPR rebuttal period we also found that VQRAE can be trained successfully without the semantic prior (pretrained ViT encoder weights); we thank the reviewer for prompting this experiment. GAN training was not enabled for this run, so the rFID can be improved further.

You can reproduce this with `bash cmds/vqrae/vqrae_internvit_stage1_wo_semantic_1e-4.sh` and `bash cmds/vqrae/vqrae_internvit_stage2_wo_semantic_1e-4.sh`.

| Metrics | rFID | PSNR | SSIM |
| --- | --- | --- | --- |
| w/o semantic prior | 1.51 | 22.77 | 0.778 |

🙏 Acknowledgement

We gratefully acknowledge the following open-source projects:

  • Ta-Tok - for the insightful semantically aligned codebook.
  • TiTok - for the well-structured codebase.
  • BLIP-3o - for the open-sourced data.

📝 Citation

If you find our work useful, please cite our paper:

```bibtex
@article{du2025vqrae,
  title={VQRAE: Representation Quantization Autoencoders for Multimodal Understanding, Generation and Reconstruction},
  author={Du, Sinan and Guo, Jiahao and Li, Bo and Cui, Shuhao and Xu, Zhengzhuo and Luo, Yifu and Wei, Yongxian and Gai, Kun and Wang, Xinggang and Wu, Kai and others},
  journal={arXiv preprint arXiv:2511.23386},
  year={2025}
}
```

📄 License

This project is released under the MIT License.
