Xianjin Wu1, Dingkang Liang1†, Tianrui Feng1, Kui Xia2, Yumeng Zhang2, Xiaofan Li2, Xiao Tan2, Xiang Bai1
This repository contains the official implementation of VEGA-3D for the paper Generation Models Know Space: Unleashing Implicit 3D Priors for Scene Understanding.
While Multimodal Large Language Models (MLLMs) demonstrate strong semantic capabilities, they often suffer from spatial blindness, struggling with fine-grained geometric reasoning and physical dynamics. Existing solutions usually depend on explicit 3D modalities or heavy geometric scaffolding, which are costly to scale and often limited by data availability and generalization.
This work explores a different direction: instead of adding explicit 3D supervision, we leverage the implicit spatial prior learned inside large-scale video generation models. We introduce VEGA-3D (Video Extracted Generative Awareness), a plug-and-play framework that repurposes a pre-trained video diffusion model as a latent world simulator. By extracting spatiotemporal features from intermediate noise levels and fusing them with semantic representations through token-level adaptive gated fusion, VEGA-3D enriches MLLMs with dense geometric cues for 3D scene understanding, spatial reasoning, and embodied decision making.
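To make the token-level adaptive gated fusion idea concrete, here is a minimal NumPy sketch: each token gets its own learned gate that mixes the semantic features with the generative (spatiotemporal) features. This is an illustrative simplification under assumed names and a single linear projection for the gate, not the released implementation:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_fusion(semantic, generative, w, b):
    """Fuse per-token semantic and generative features with a learned gate.

    semantic, generative: (num_tokens, dim) aligned feature sequences.
    w: (2 * dim, dim) gate projection weights; b: (dim,) bias.
    Each token gets its own gate in (0, 1), so the mixing ratio adapts per token.
    """
    concat = np.concatenate([semantic, generative], axis=-1)  # (T, 2*dim)
    gate = sigmoid(concat @ w + b)                            # (T, dim)
    # Convex combination: the output stays between the two inputs elementwise.
    return gate * semantic + (1.0 - gate) * generative

rng = np.random.default_rng(0)
T, D = 4, 8
sem = rng.standard_normal((T, D))   # stand-in for MLLM vision-encoder tokens
gen = rng.standard_normal((T, D))   # stand-in for video-diffusion features
w = rng.standard_normal((2 * D, D)) * 0.1
b = np.zeros(D)
fused = gated_fusion(sem, gen, w, b)
print(fused.shape)  # (4, 8)
```

Because the gate is a sigmoid, the fused token is always an elementwise convex combination of the two inputs, which lets the model fall back to pure semantic features where the generative cues are unhelpful.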
2026.03.20: Released the paper, training and evaluation code.
- Clone this repository and navigate to the VEGA-3D folder:

```shell
git clone https://github.com/H-EmbodVis/VEGA-3D.git
cd VEGA-3D
```

- Create the conda environment:

```shell
conda create -n vega3d python=3.10 -y
conda activate vega3d
pip install --upgrade pip
pip install -e ".[train]"
pip install flash-attn --no-build-isolation  # install flash attention
```

Please download the required datasets from the sources below and organize them following the same directory structure as Video-3D-LLM. The repository expects a data/ root with the following structure:
```
data/
├── benchmark/
├── embodiedscan/
├── mask/
├── metadata/
├── models/
├── processed/
└── scannet/
```
All checkpoints should be placed under data/models/ and named exactly as expected by the training scripts.
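As a quick sanity check before launching training, a small script like the following can verify that the two required checkpoints are in place. This helper is illustrative and not part of the repository; extend `REQUIRED` with any optional backbones you plan to use:

```python
from pathlib import Path

# Required checkpoints for all released settings, relative to the data/ root
# described above.
REQUIRED = [
    "data/models/LLaVA-Video-7B-Qwen2",
    "data/models/siglip-so400m-patch14-384",
]

def missing_checkpoints(root="."):
    """Return the required checkpoint paths that do not exist under root."""
    root = Path(root)
    return [p for p in REQUIRED if not (root / p).is_dir()]

if __name__ == "__main__":
    missing = missing_checkpoints()
    if missing:
        print("Missing checkpoints:", ", ".join(missing))
    else:
        print("All required checkpoints found.")
```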
Required checkpoints for all released settings:
| Model | Hugging Face | Expected local path | Used by |
|---|---|---|---|
| LLaVA-Video-7B-Qwen2 | lmms-lab/LLaVA-Video-7B-Qwen2 | data/models/LLaVA-Video-7B-Qwen2 | All training scripts |
| SigLIP | google/siglip-so400m-patch14-384 | data/models/siglip-so400m-patch14-384 | All training scripts |
Download them first:

```shell
mkdir -p data/models
huggingface-cli download lmms-lab/LLaVA-Video-7B-Qwen2 \
    --local-dir data/models/LLaVA-Video-7B-Qwen2
huggingface-cli download google/siglip-so400m-patch14-384 \
    --local-dir data/models/siglip-so400m-patch14-384
```

Optional generative backbones and auxiliary checkpoints. Download only the ones required by the script you plan to run:
| Model | Hugging Face | Expected local path | Used by |
|---|---|---|---|
| Wan2.1-T2V-1.3B | Wan-AI/Wan2.1-T2V-1.3B | data/models/Wan2.1-T2V-1.3B | scripts/3d/train/train_wan_t2v_online.sh |
| Wan2.1-VACE-1.3B | Wan-AI/Wan2.1-VACE-1.3B | data/models/Wan2.1-VACE-1.3B | scripts/3d/train/train_wan_vace_online.sh |
| Stable Diffusion 2.1 | Manojb/stable-diffusion-2-1-base | data/models/stable-diffusion-2-1-base | scripts/3d/train/train_sd21_online.sh, scripts/3d/train/train_vae_online.sh, and SEVA/Vmem preprocessing |
| Stable Video Diffusion | stabilityai/stable-video-diffusion-img2vid | data/models/stable-video-diffusion-img2vid | scripts/3d/train/train_svd_online.sh |
| DINOv3 Large | timm/vit_large_patch16_dinov3.lvd1689m | data/models/vit_large_patch16_dinov3.sat493m | scripts/3d/train/train_dinov3_online.sh |
| V-JEPA V2 | facebook/vjepa2-vitg-fpc64-384-ssv2 | data/models/vjepa2-vitg-fpc64-384-ssv2 | scripts/3d/train/train_vjepa_online.sh |
| VGGT | facebook/VGGT-1B | data/models/VGGT-1B | scripts/3d/train/train_vggt_online.sh |
| SEVA | stabilityai/stable-virtual-camera | data/models/stable-virtual-camera | scripts/3d/train/train_seva_offline.sh |
| Vmem | liguang0115/vmem | data/models/vmem | scripts/3d/train/train_vmem_offline.sh |
Additional CLIP checkpoint for models that require a separate text/conditioning encoder:

| Model | Hugging Face | Expected local path | Required by |
|---|---|---|---|
| CLIP-ViT-H-14-laion2B-s32B-b79K | laion/CLIP-ViT-H-14-laion2B-s32B-b79K | data/models/CLIP-ViT-H-14-laion2B-s32B-b79K | scripts/3d/train/train_svd_online.sh and SEVA/Vmem preprocessing |
Notes:
- See the detailed checkpoint setup for auxiliary files such as empty_prompt_embeds.pt, WAN prompt embeddings, VGGT checkpoint placement, and SEVA/Vmem preprocessing dependencies.
This README keeps only the three main script examples used to present the released VEGA-3D settings:

```shell
bash scripts/3d/train/train_wan_t2v_online.sh
bash scripts/3d/train/train_wan_vace_online.sh
bash scripts/3d/train/train_sd21_online.sh
```

All three scripts train first and then run the five downstream evaluations by default. Set RUN_EVAL=0 inside the script if you want training only.
Evaluation wrappers remain in scripts/3d/eval/ and follow the standard pattern:

```shell
bash scripts/3d/eval/eval_<task>.sh <run_name> uniform 32 <generative_model_id>
```
For online settings, use None as the last argument. For offline settings, pass the offline feature id.
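When scripting many evaluations, the calling convention above can be captured in a small helper. This is only a sketch of the argument order: the `scanqa` task name and `my_run` run name below are hypothetical placeholders, not identifiers guaranteed by the repository:

```python
def build_eval_command(task, run_name, generative_model_id=None,
                       sampling="uniform", num_frames=32):
    """Assemble an eval wrapper invocation following the pattern above.

    Online settings pass the literal string "None" as the last argument;
    offline settings pass their precomputed offline feature id instead.
    """
    last = "None" if generative_model_id is None else generative_model_id
    return (f"bash scripts/3d/eval/eval_{task}.sh "
            f"{run_name} {sampling} {num_frames} {last}")

# Online setting (hypothetical task/run names):
print(build_eval_command("scanqa", "my_run"))
# -> bash scripts/3d/eval/eval_scanqa.sh my_run uniform 32 None
```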
We build upon the following great works and open-source repositories:
- Video-3D-LLM: the codebase our repository is built upon.
- VG-LLM, Wan2.1, 3DRS
If you find this repository useful in your research, please consider giving a star ⭐ and a citation.
@article{wu2026vega,
title={Generation Models Know Space: Unleashing Implicit 3D Priors for Scene Understanding},
author={Xianjin Wu and Dingkang Liang and Tianrui Feng and Kui Xia and Yumeng Zhang and Xiaofan Li and Xiao Tan and Xiang Bai},
journal={arXiv preprint arXiv:2603.19235},
year={2026}
}