Project page | Paper | Data
Tong Wu*, Shuai Yang*, Ryan Po, Yinghao Xu, Ziwei Liu, Dahua Lin, Gordon Wetzstein
* Equal Contribution
conda create -n spmem python=3.10 -y
conda activate spmem
pip install -r requirements.txt
- Install PyTorch3D:
pip install "git+https://github.com/facebookresearch/pytorch3d.git"
- Depth-Anything-3 (submodule):
cd Depth-Anything-3
pip install -e .
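After the install steps above, a quick sanity check that the key packages resolve can save a failed run later. This is a minimal sketch; the module name `depth_anything_3` is an assumption based on the submodule install, so adjust it if the package exposes a different import name.

```python
import importlib.util

def check(packages):
    """Map each package name to whether it can be imported in this env."""
    return {p: importlib.util.find_spec(p) is not None for p in packages}

# "depth_anything_3" is an assumed module name for the submodule install above.
for name, ok in check(["torch", "pytorch3d", "depth_anything_3"]).items():
    print(f"{name}: {'ok' if ok else 'MISSING'}")
```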
- We processed web videos (from MiraData) into ~80K video clips and annotated the original videos with MegaSAM, yielding images, depth maps, and camera poses.
- The resulting dataset is available at ysmikey/spmem_megadata.
- To further convert our data into the TSDF (dynamic/static separation) training format, similar to datasets/train_data_example, please refer to tsdf/data_process.sh.
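As a rough illustration of the clip-splitting step above, fixed-length windowing over frame indices might look like the following sketch. The clip length and stride are made-up defaults for illustration, not the settings used to build the ~80K clips.

```python
def split_into_clips(num_frames, clip_len=49, stride=49):
    """Return (start, end) frame index ranges for fixed-length clips.

    clip_len and stride are illustrative defaults, not the project's settings.
    """
    clips = []
    start = 0
    while start + clip_len <= num_frames:
        clips.append((start, start + clip_len))
        start += stride
    return clips

print(split_into_clips(100))  # two non-overlapping 49-frame clips
```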
Download the required weights from Qwen2.5-VL-7B-Instruct and spmem_ckpt, and place them as follows:
- Qwen2.5-VL-7B-Instruct → ckpt/Qwen2.5-VL-7B-Instruct/
- spmem_ckpt → ckpt/spmem_ckpt/
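A small pre-flight check that both weight directories landed where the demos expect them (paths taken from the layout above) can catch path mistakes before launching:

```python
from pathlib import Path

# Expected checkpoint layout from the README; extend if you relocate weights.
EXPECTED = ["ckpt/Qwen2.5-VL-7B-Instruct", "ckpt/spmem_ckpt"]

def missing_checkpoints(root="."):
    """Return the expected checkpoint directories that are absent under root."""
    base = Path(root)
    return [p for p in EXPECTED if not (base / p).is_dir()]

print(missing_checkpoints())  # an empty list means the demos can find the weights
```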
Run the example:
bash run_demo.sh
Run streaming control demo:
bash run_stream.sh
We provide an example training script that uses the example training data format.
- Script (8x GPU): bash train_example.sh
- Example data: datasets/train_data_example
- Example config: datasets/train_data_example_config
Run:
bash train_example.sh
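Since the example script assumes 8 GPUs, a quick guard like the following hypothetical helper (not part of the repo) can verify the visible device count before launching:

```python
import os

def visible_gpu_count():
    """Count GPUs visible to this process, preferring CUDA_VISIBLE_DEVICES."""
    env = os.environ.get("CUDA_VISIBLE_DEVICES")
    if env is not None:
        return len([d for d in env.split(",") if d.strip()])
    try:
        import torch  # fall back to torch only if the env var is unset
        return torch.cuda.device_count()
    except ImportError:
        return 0

# train_example.sh expects 8 GPUs; warn early if fewer are visible.
if visible_gpu_count() < 8:
    print(f"warning: only {visible_gpu_count()} GPU(s) visible, expected 8")
```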
If you find our work helpful for your research, please consider giving a star ⭐ and a citation 📝.
@article{wu2025video,
title={Video world models with long-term spatial memory},
author={Wu, Tong and Yang, Shuai and Po, Ryan and Xu, Yinghao and Liu, Ziwei and Lin, Dahua and Wetzstein, Gordon},
journal={arXiv preprint arXiv:2506.05284},
year={2025}
}