🚀 VQRAE: Representation Quantization Autoencoders for Multimodal Understanding, Generation and Reconstruction
Sinan Du¹ ³*, Jiahao Guo² ³*, Bo Li³✉, Shuhao Cui³, Zhengzhuo Xu¹, Yifu Luo¹, Yongxian Wei¹,
Kun Gai³, Xinggang Wang², Kai Wu³†, Chun Yuan¹✉,
¹Tsinghua University
²Huazhong University of Science and Technology (HUST)
³Kolors Team, Kuaishou Technology
*Equal Contribution | †Project Lead | ✉Corresponding Author
This repository contains official PyTorch/GPU implementations of our paper:
VQRAE: Representation Quantization Autoencoders for Multimodal Understanding, Generation and Reconstruction.
We propose VQRAE, a Vector Quantization version of Representation AutoEncoders and the first unified tokenizer that simultaneously produces continuous semantic features for image understanding and discrete tokens for visual generation.
```shell
conda create -n vqrae python=3.11 -y
conda activate vqrae
cd VQRAE
pip install -r requirements.txt
pip install huggingface_hub
```
1. InternViT
```shell
hf download OpenGVLab/InternVL3-8B \
    --local-dir huggingface_hub/InternVL3-8B
```
2. BLIP3o (training data)
```shell
hf download BLIP3o/BLIP3o-Pretrain-Long-Caption \
    --local-dir huggingface_hub/BLIP3o/BLIP3o-Pretrain-Long-Caption
hf download BLIP3o/BLIP3o-Pretrain-JourneyDB \
    --local-dir huggingface_hub/BLIP3o/BLIP3o-Pretrain-JourneyDB
hf download BLIP3o/BLIP3o-Pretrain-Short-Caption \
    --local-dir huggingface_hub/BLIP3o/BLIP3o-Pretrain-Short-Caption
```
3. ImageNet (eval data)
```shell
hf download ILSVRC/imagenet-1k \
    --local-dir huggingface_hub/ILSVRC/imagenet-1k
```
4. VQRAE-InternViT (optional, for reproducing our results)
```shell
hf download Kwai-Kolors/VQRAE \
    --local-dir huggingface_hub/VQRAE
```
5. Ta-Tok (optional, for ablation)
```shell
hf download ByteDance-Seed/Tar-TA-Tok \
    --local-dir huggingface_hub/Tar/TA-Tok
```
`bash cmds/vqrae/vqrae_internvit_stage1.sh` trains the ViT decoder and the VQ codebook while keeping the encoder frozen.
`bash cmds/vqrae/vqrae_internvit_stage1_wigan.sh` enables GAN training for a lower rFID.
`bash cmds/vqrae/vqrae_internvit_stage1_randinit_1e-4.sh` disables the text-aligned codebook initialization, which may slightly degrade the result (using 36M samples in stage 1).
| Metrics | rFID | PSNR | SSIM |
|---|---|---|---|
| randinit | 2.94 | 18.49 | 0.63 |
| text-aligned | 2.46 | 18.91 | 0.65 |
`bash cmds/vqrae/vqrae_internvit_stage1_wo_semantic_1e-4.sh` trains stage 1 without pretrained ViT encoder weights.
`bash cmds/vqrae/vqrae_internvit_stage2.sh` trains the ViT decoder, the VQ codebook, and the ViT encoder with self-distillation constraints.
`bash cmds/vqrae/vqrae_internvit_stage2_wigan.sh` continues training with GAN enabled for a lower rFID.
`bash cmds/vqrae/vqrae_internvit_stage2_wo_semantic_1e-4.sh` trains stage 2 without pretrained ViT encoder weights.
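The VQ codebook trained in these stages maps each continuous encoder feature to its nearest code vector, yielding the discrete tokens used for generation. A minimal NumPy sketch of that nearest-neighbor lookup (illustrative only; function and array names are hypothetical, not the repo's implementation):

```python
import numpy as np

def vq_quantize(features, codebook):
    """Nearest-neighbor vector quantization (illustrative sketch).

    features: (N, D) continuous encoder outputs.
    codebook: (K, D) learned code vectors.
    Returns discrete token indices (N,) and quantized vectors (N, D).
    """
    # Squared Euclidean distance from every feature to every code vector.
    d2 = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    indices = d2.argmin(axis=1)    # discrete tokens for generation
    quantized = codebook[indices]  # code vectors fed to the decoder
    return indices, quantized

# Toy example: features close to codes 3 and 7 snap back onto them.
rng = np.random.default_rng(0)
codes = rng.normal(size=(16, 8))  # toy codebook: 16 codes, dim 8
feats = codes[[3, 7]] + 0.01 * rng.normal(size=(2, 8))
idx, q = vq_quantize(feats, codes)
print(idx)
```

In training, the quantization step is non-differentiable, which is why the decoder and codebook can be learned in stage 1 with the encoder frozen before the encoder is unfrozen in stage 2.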
`bash scripts/run_eval.sh` evaluates the officially released VQRAE-InternViT checkpoint; you can adapt it for other trained checkpoints.
After releasing our paper, we reproduced the results with the new codebase, so the metrics may differ slightly from those reported in the paper. With VQRAE-InternViT you can obtain the results below.
| Metrics | rFID | PSNR | SSIM |
|---|---|---|---|
| reported | 1.31 | 22.23 | 0.762 |
| reproduced | 0.61 | 23.41 | 0.795 |
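PSNR in the tables above measures per-pixel reconstruction fidelity in decibels. As a reference for how the metric is typically computed for images scaled to [0, 1] (an illustrative sketch, not the repo's exact evaluation code):

```python
import numpy as np

def psnr(ref, rec, max_val=1.0):
    """Peak signal-to-noise ratio in dB for images in [0, max_val]."""
    mse = np.mean((ref.astype(np.float64) - rec.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(max_val ** 2 / mse)

# Toy example: a reference image and a mildly noisy reconstruction.
rng = np.random.default_rng(0)
img = rng.random((64, 64, 3))
noisy = np.clip(img + 0.05 * rng.normal(size=img.shape), 0.0, 1.0)
print(round(psnr(img, noisy), 2))
```

Higher is better; the ~1 dB gap between the reported and reproduced runs corresponds to a noticeably lower reconstruction error.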
During our CVPR rebuttal period we also found that VQRAE can be trained successfully without the semantic prior (pretrained ViT encoder weights); we thank the reviewer for prompting this experiment. GAN training was not enabled for this run, so the rFID can be further improved.
You can reproduce this with `bash cmds/vqrae/vqrae_internvit_stage1_wo_semantic_1e-4.sh` followed by `bash cmds/vqrae/vqrae_internvit_stage2_wo_semantic_1e-4.sh`.
| Metrics | rFID | PSNR | SSIM |
|---|---|---|---|
| w/o semantic prior | 1.51 | 22.77 | 0.778 |
We gratefully acknowledge the following open-source projects:
- Ta-Tok - for the insightful semantic aligned codebook.
- TiTok - for the skillful codebase.
- BLIP-3o - for the open-sourced data.
If you find our work useful, please cite our paper:
```bibtex
@article{du2025vqrae,
  title={VQRAE: Representation Quantization Autoencoders for Multimodal Understanding, Generation and Reconstruction},
  author={Du, Sinan and Guo, Jiahao and Li, Bo and Cui, Shuhao and Xu, Zhengzhuo and Luo, Yifu and Wei, Yongxian and Gai, Kun and Wang, Xinggang and Wu, Kai and others},
  journal={arXiv preprint arXiv:2511.23386},
  year={2025}
}
```
This project is released under the MIT License.