Ruicheng Zhang1,2*, Jun Zhou1*, Zunnan Xu1*, Zihao Liu1, Jiehui Huang3, Mingyang Zhang4, Yu Sun2, Xiu Li1†
1 Tsinghua University, 2 Sun Yat-sen University
3 The Hong Kong University of Science and Technology, 4 China University of Geosciences
* Equal contribution. † Corresponding author.
An overview of our zero-shot trajectory-guided video generation framework.
Our method optimizes a pre-trained video diffusion model at specific denoising timesteps via two key stages.
Test-Time Training (TTT) adapts the latent state and an ephemeral adapter to maintain semantic consistency along the trajectory.
Guidance Field Rectification refines the denoising direction using a one-step lookahead optimization to ensure precise path execution.
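As a rough illustration of the one-step lookahead idea only (this is not the Zo3T implementation — the function name, the quadratic objective, and all values below are invented for exposition): predict the next latent from the current denoising direction, measure how far that prediction deviates from the trajectory-consistent target, and nudge the direction along the gradient of that error.

```python
import numpy as np

def rectify_direction(z, d, target, step=0.5, lr=0.1, iters=10):
    """One-step lookahead rectification (illustrative sketch only).

    z      : current latent state (np.ndarray)
    d      : proposed denoising direction, same shape as z
    target : desired next latent state encoding the trajectory constraint
    """
    for _ in range(iters):
        z_next = z + step * d   # one-step lookahead prediction
        err = z_next - target   # deviation from the trajectory target
        grad = step * err       # gradient of 0.5 * ||z + step*d - target||^2 w.r.t. d
        d = d - lr * grad       # rectify the denoising direction
    return d

# toy check: the rectified direction brings the lookahead closer to the target
z = np.zeros(4)
target = np.ones(4)
d0 = np.zeros(4)
d = rectify_direction(z, d0, target)
before = np.linalg.norm(z + 0.5 * d0 - target)
after = np.linalg.norm(z + 0.5 * d - target)
```

In the actual method this optimization runs only at selected denoising timesteps, jointly with the test-time-trained latent state and ephemeral adapter described above.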
To run Zo3T, ensure you have the following dependencies installed:
| Requirement | Description |
|---|---|
| Python | Version 3.12 |
| Pre-trained Model | Stable Video Diffusion model |
Follow these steps to set up the environment and install dependencies.
Clone the Zo3T repository to your local machine:
```bash
git clone https://github.com/your-username/Zo3T-main.git
cd Zo3T-main
```

We recommend using conda to manage dependencies. Create and activate a new environment:
```bash
conda create -n zo3t python=3.12 -y
conda activate zo3t
```

Install all required packages listed in `requirements.txt`:
```bash
pip install -r requirements.txt
```

The pipeline requires the `stable-video-diffusion-img2vid` model checkpoint.
- Download it from the official Hugging Face repository.
- Place the model folder in a convenient location.
Update the `svd_dir` variable in `inference.py` to point to the model directory:

```python
# in inference.py
# Load pre-trained image-to-video diffusion models
print("Loading Stable Video Diffusion from local path...")
svd_dir = "/path/to/your/stable-video-diffusion-img2vid"  # ⬅️ Update this path
```

You are now ready to run the inference script.
Prepare your input directory with the following structure:
```
/path/to/your/input_dir/
├── img.png
└── traj.npy
```
- `img.png`: The first frame of the video.
- `traj.npy`: A NumPy array of shape `[N, (2+F), 2]`, where:
  - `N`: Number of objects to track.
  - First slice `[:, :2, :]`: Top-left and bottom-right coordinates `[[w1, h1], [w2, h2]]` of each initial bounding box.
  - Second slice `[:, 2:, :]`: Trajectory of the center point of each bounding box over `F` frames.
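As a concrete example, a trajectory file for a single object moving diagonally could be built as follows (all coordinates here are illustrative, and `F = 14` is just an example frame count — match it to your configured number of frames):

```python
import numpy as np

N, F = 1, 14  # one object, 14 trajectory frames

traj = np.zeros((N, 2 + F, 2), dtype=np.float32)

# initial bounding box: top-left (w1, h1) and bottom-right (w2, h2)
traj[0, 0] = [100, 120]  # [w1, h1]
traj[0, 1] = [200, 220]  # [w2, h2]

# center-point trajectory over F frames: drift diagonally from the box center
cx, cy = 150, 170
for f in range(F):
    traj[0, 2 + f] = [cx + 10 * f, cy + 5 * f]

np.save("traj.npy", traj)
```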
Run the inference script:

```bash
python inference.py --input_dir /path/to/your/input_dir/ --output_dir /path/to/your/output_dir/
```

Adjust hyperparameters in the `Config` class in `inference.py`:
- `seed`: Random seed for reproducibility.
- `height`, `width`: Resolution of the generated video.
- `num_frames`: Number of frames to generate.
- `num_inference_steps`: Total number of denoising steps.
- `optimize_latent_time`: List of timesteps at which optimization is applied.
- `optimize_latent_iter`: Number of optimization iterations per timestep.
- `optimize_latent_lr`: Learning rate for latent optimization.
- `enable_lora`: Set to `True` to use LoRA during optimization.
- `enable_depth_scaling`: Set to `True` to enable depth-aware trajectory scaling.
- `enable_control_force_optimization`: Set to `True` to enable control force optimization.
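For orientation, a `Config` holding the fields above might be shaped like this dataclass (a sketch only — the default values here are placeholders, not the repository's actual defaults):

```python
from dataclasses import dataclass, field

@dataclass
class Config:
    # placeholder defaults for illustration; check inference.py for real values
    seed: int = 42
    height: int = 576
    width: int = 1024
    num_frames: int = 14
    num_inference_steps: int = 25
    optimize_latent_time: list = field(default_factory=lambda: [30, 40, 50])
    optimize_latent_iter: int = 5
    optimize_latent_lr: float = 0.01
    enable_lora: bool = True
    enable_depth_scaling: bool = True
    enable_control_force_optimization: bool = True

# override any field when instantiating
cfg = Config(seed=0, num_frames=14)
```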
If you find our work useful for your research, please consider citing our paper:
```bibtex
@misc{zhang2025zeroshot3dawaretrajectoryguidedimagetovideo,
      title={Zero-shot 3D-Aware Trajectory-Guided image-to-video generation via Test-Time Training},
      author={Ruicheng Zhang and Jun Zhou and Zunnan Xu and Zihao Liu and Jiehui Huang and Mingyang Zhang and Yu Sun and Xiu Li},
      year={2025},
      eprint={2509.06723},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2509.06723},
}
```