🚀 VQRAE: Representation Quantization Autoencoders for Multimodal Understanding, Generation and Reconstruction

🎉 CVPR 2026 Paper

Sinan Du¹ ³*, Jiahao Guo² ³*, Bo Li³✉, Shuhao Cui³, Zhengzhuo Xu¹, Yifu Luo¹, Yongxian Wei¹,
Kun Gai³, Xinggang Wang², Kai Wu³†, Chun Yuan¹✉

¹Tsinghua University
²Huazhong University of Science and Technology (HUST)
³Kolors Team, Kuaishou Technology

*Equal Contribution | †Project Lead | ✉Corresponding Author


📰 Overview

This repository contains the official PyTorch implementation of our paper:

VQRAE: Representation Quantization Autoencoders for Multimodal Understanding, Generation and Reconstruction.

We propose VQRAE, a vector-quantized variant of Representation AutoEncoders, which to our knowledge is the first unified tokenizer to simultaneously produce continuous semantic features for image understanding and discrete tokens for visual generation.

🚀 Environment

```bash
conda create -n vqrae python=3.11 -y
conda activate vqrae

pip install -r requirements.txt
```

📦 Data & Model Preparation

```bash
cd VQRAE
pip install huggingface_hub
```

1. InternViT

```bash
hf download OpenGVLab/InternVL3-8B \
    --local-dir huggingface_hub/InternVL3-8B
```

2. BLIP3o (training data)

```bash
hf download BLIP3o/BLIP3o-Pretrain-Long-Caption \
    --local-dir huggingface_hub/BLIP3o/BLIP3o-Pretrain-Long-Caption

hf download BLIP3o/BLIP3o-Pretrain-JourneyDB \
    --local-dir huggingface_hub/BLIP3o/BLIP3o-Pretrain-JourneyDB

hf download BLIP3o/BLIP3o-Pretrain-Short-Caption \
    --local-dir huggingface_hub/BLIP3o/BLIP3o-Pretrain-Short-Caption
```

3. ImageNet (eval data)

```bash
hf download ILSVRC/imagenet-1k \
    --local-dir huggingface_hub/ILSVRC/imagenet-1k
```

4. VQRAE-InternViT (optional, for reproducing our results)

```bash
hf download Kwai-Kolors/VQRAE \
    --local-dir huggingface_hub/VQRAE
```

5. Ta-Tok (optional, for ablation)

```bash
hf download ByteDance-Seed/Tar-TA-Tok \
    --local-dir huggingface_hub/Tar/TA-Tok
```

🎓 Training Stage 1

`bash cmds/vqrae/vqrae_internvit_stage1.sh` trains the ViT decoder and the VQ codebook while keeping the encoder frozen.

`bash cmds/vqrae/vqrae_internvit_stage1_wigan.sh` enables GAN training for a lower rFID.

`bash cmds/vqrae/vqrae_internvit_stage1_randinit_1e-4.sh` skips the text-aligned codebook initialization, which slightly degrades the results (using 36M samples in stage 1):

| Metrics | rFID | PSNR | SSIM |
| --- | --- | --- | --- |
| randinit | 2.94 | 18.49 | 0.63 |
| text-aligned | 2.46 | 18.91 | 0.65 |

`bash cmds/vqrae/vqrae_internvit_stage1_wo_semantic_1e-4.sh` trains without pretrained ViT encoder weights.
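At the heart of the VQ codebook trained in this stage is a nearest-neighbor lookup that turns continuous encoder features into discrete token ids. A minimal NumPy sketch of that lookup step (illustrative shapes and names only, not the repository's implementation):

```python
import numpy as np

def quantize(features, codebook):
    """Map each continuous feature vector to its nearest codebook entry.

    features: (N, D) continuous encoder outputs
    codebook: (K, D) code vectors
    Returns (indices, quantized): discrete token ids and their code vectors.
    """
    # Squared L2 distance between every feature and every code:
    # ||f - c||^2 = ||f||^2 - 2 f.c + ||c||^2
    d = (
        (features ** 2).sum(axis=1, keepdims=True)
        - 2.0 * features @ codebook.T
        + (codebook ** 2).sum(axis=1)
    )
    indices = d.argmin(axis=1)      # discrete tokens for generation
    quantized = codebook[indices]   # vectors passed on to the ViT decoder
    return indices, quantized

# Toy example: two features snap to the nearest of three codes.
codebook = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]])
feats = np.array([[0.1, -0.1], [1.9, 2.1]])
idx, q = quantize(feats, codebook)  # idx -> [0, 2]
```

In actual training the lookup is non-differentiable, so a straight-through estimator (copying gradients from the quantized vectors back to the encoder features) is the standard way to train through it.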

🔬 Training Stage 2

`bash cmds/vqrae/vqrae_internvit_stage2.sh` trains the ViT decoder, the VQ codebook, and the ViT encoder with self-distillation constraints.

`bash cmds/vqrae/vqrae_internvit_stage2_wigan.sh` continues training with GAN enabled for a lower rFID.

`bash cmds/vqrae/vqrae_internvit_stage2_wo_semantic_1e-4.sh` trains without pretrained ViT encoder weights.
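The self-distillation constraint in stage 2 can be pictured as penalizing drift between the now-trainable encoder and its frozen pretrained features. A toy NumPy sketch of one common form of such a loss, mean (1 − cosine similarity) per patch (illustrative only, not necessarily the paper's exact objective):

```python
import numpy as np

def distill_loss(student, teacher, eps=1e-8):
    """Mean (1 - cosine similarity) between student and frozen teacher features.

    student, teacher: (N, D) per-patch feature matrices.
    The loss is ~0 when the student matches the teacher's feature directions.
    """
    s = student / (np.linalg.norm(student, axis=1, keepdims=True) + eps)
    t = teacher / (np.linalg.norm(teacher, axis=1, keepdims=True) + eps)
    return float(np.mean(1.0 - (s * t).sum(axis=1)))

teacher = np.array([[1.0, 0.0], [0.0, 2.0]])
no_drift = distill_loss(teacher, teacher)   # ≈ 0: encoder unchanged
```

Keeping this term small is what lets the encoder be fine-tuned for reconstruction without destroying the semantic features needed for understanding.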

🔎 Evaluation

`bash scripts/run_eval.sh` evaluates the officially released VQRAE-InternViT checkpoint; you can adapt it for other trained checkpoints.
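Of the three metrics reported below, PSNR has the simplest definition. A minimal NumPy sketch, assuming images scaled to [0, 1] (not the repository's evaluation code):

```python
import numpy as np

def psnr(ref, recon, max_val=1.0):
    """Peak signal-to-noise ratio in dB for images scaled to [0, max_val]."""
    mse = np.mean((ref - recon) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return float(10.0 * np.log10(max_val ** 2 / mse))

ref = np.zeros((4, 4))
recon = ref + 0.1            # uniform 0.1 error -> MSE = 0.01
print(psnr(ref, recon))      # ≈ 20.0 dB
```

rFID and SSIM are more involved (Inception features and local luminance/contrast statistics, respectively); libraries such as scikit-image and standard FID implementations are typically used for those.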

Update

After releasing our paper, we reproduced the results with the new codebase, so the metrics may differ slightly from those reported in the paper. You can obtain the results below with VQRAE-InternViT:

| Metrics | rFID | PSNR | SSIM |
| --- | --- | --- | --- |
| reported | 1.31 | 22.23 | 0.762 |
| reproduced | 0.61 | 23.41 | 0.795 |

Others

During the CVPR rebuttal period we also found that VQRAE can be trained successfully without the semantic prior (pretrained ViT encoder weights); we thank the reviewer for prompting this experiment. GAN training was not enabled for this run, so the rFID can be improved further.

You can reproduce this with `bash cmds/vqrae/vqrae_internvit_stage1_wo_semantic_1e-4.sh` and `bash cmds/vqrae/vqrae_internvit_stage2_wo_semantic_1e-4.sh`.

| Metrics | rFID | PSNR | SSIM |
| --- | --- | --- | --- |
| w/o semantic prior | 1.51 | 22.77 | 0.778 |

🙏 Acknowledgement

We gratefully acknowledge the following open-source projects:

  • Ta-Tok - for the insightful semantically aligned codebook.
  • TiTok - for the well-structured codebase.
  • BLIP-3o - for the open-sourced data.

📝 Citation

If you find our work useful, please cite our paper:

```bibtex
@article{du2025vqrae,
  title={VQRAE: Representation Quantization Autoencoders for Multimodal Understanding, Generation and Reconstruction},
  author={Du, Sinan and Guo, Jiahao and Li, Bo and Cui, Shuhao and Xu, Zhengzhuo and Luo, Yifu and Wei, Yongxian and Gai, Kun and Wang, Xinggang and Wu, Kai and others},
  journal={arXiv preprint arXiv:2511.23386},
  year={2025}
}
```

📄 License

This project is released under the MIT License.
