Ruicheng Zhang1,2*, Jun Zhou1*, Zunnan Xu1*, Zihao Liu1, Jiehui Huang3, Mingyang Zhang4, Yu Sun2, Xiu Li1†
1 Tsinghua University, 2 Sun Yat-sen University
3 The Hong Kong University of Science and Technology, 4 China University of Geosciences
* Equal contribution. † Corresponding author.
An overview of our zero-shot trajectory-guided video generation framework.
Our method optimizes a pre-trained video diffusion model at specific denoising timesteps via two key stages.
Test-Time Training (TTT) adapts the latent state and an ephemeral adapter to maintain semantic consistency along the trajectory.
Guidance Field Rectification refines the denoising direction using a one-step lookahead optimization to ensure precise path execution.
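As a rough illustration of the one-step lookahead idea only (this is not the Zo3T implementation — the function name, the quadratic objective, and all values below are invented for exposition): predict the next latent from the current denoising direction, measure how far that prediction deviates from the trajectory-consistent target, and nudge the direction along the gradient of that error.

```python
import numpy as np

def rectify_direction(z, d, target, step=0.5, lr=0.1, iters=10):
    """One-step lookahead rectification (illustrative sketch only).

    z      : current latent state (np.ndarray)
    d      : proposed denoising direction, same shape as z
    target : desired next latent state encoding the trajectory constraint
    """
    for _ in range(iters):
        z_next = z + step * d   # one-step lookahead prediction
        err = z_next - target   # deviation from the trajectory target
        grad = step * err       # gradient of 0.5 * ||z + step*d - target||^2 w.r.t. d
        d = d - lr * grad       # rectify the denoising direction
    return d

# toy check: the rectified direction brings the lookahead closer to the target
z = np.zeros(4)
target = np.ones(4)
d0 = np.zeros(4)
d = rectify_direction(z, d0, target)
before = np.linalg.norm(z + 0.5 * d0 - target)
after = np.linalg.norm(z + 0.5 * d - target)
```

In the actual method this optimization runs only at selected denoising timesteps, jointly with the test-time-trained latent state and ephemeral adapter described above.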
To run Zo3T, ensure you have the following dependencies installed:
| Requirement | Description |
|---|---|
| Python | Version 3.12 |
| Pre-trained Model | Stable Video Diffusion model |
Follow these steps to set up the environment and install dependencies.
Clone the Zo3T repository to your local machine:
```bash
git clone https://github.com/your-username/Zo3T-main.git
cd Zo3T-main
```

We recommend using conda to manage dependencies. Create and activate a new environment:
```bash
conda create -n zo3t python=3.12 -y
conda activate zo3t
```

Install all required packages listed in `requirements.txt`:
```bash
pip install -r requirements.txt
```

The pipeline requires the `stable-video-diffusion-img2vid` model checkpoint.
- Download it from the official Hugging Face repository.
- Place the model folder in a convenient location.
Update the `svd_dir` variable in `inference.py` to point to the model directory:

```python
# in inference.py
# Load pre-trained image-to-video diffusion models
print("Loading Stable Video Diffusion from local path...")
svd_dir = "/path/to/your/stable-video-diffusion-img2vid"  # ⬅️ Update this path
```

You are now ready to run the inference script.
Prepare your input directory with the following structure:
```
/path/to/your/input_dir/
├── img.png
└── traj.npy
```
- `img.png`: The first frame of the video.
- `traj.npy`: A NumPy array of shape `[N, (2+F), 2]`, where:
  - `N`: Number of objects to track.
  - First slice `[:, :2, :]`: Top-left and bottom-right coordinates `[[w1, h1], [w2, h2]]` of each initial bounding box.
  - Second slice `[:, 2:, :]`: Trajectory of the center point of each bounding box over `F` frames.
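As a concrete example, a trajectory file for a single object moving diagonally could be built as follows (all coordinates here are illustrative, and `F = 14` is just an example frame count — match it to your configured number of frames):

```python
import numpy as np

N, F = 1, 14  # one object, 14 trajectory frames

traj = np.zeros((N, 2 + F, 2), dtype=np.float32)

# initial bounding box: top-left (w1, h1) and bottom-right (w2, h2)
traj[0, 0] = [100, 120]  # [w1, h1]
traj[0, 1] = [200, 220]  # [w2, h2]

# center-point trajectory over F frames: drift diagonally from the box center
cx, cy = 150, 170
for f in range(F):
    traj[0, 2 + f] = [cx + 10 * f, cy + 5 * f]

np.save("traj.npy", traj)
```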
Run the inference script:

```bash
python inference.py --input_dir /path/to/your/input_dir/ --output_dir /path/to/your/output_dir/
```

Adjust hyperparameters in the `Config` class in `inference.py`:
- `seed`: Random seed for reproducibility.
- `height`, `width`: Resolution of the generated video.
- `num_frames`: Number of frames to generate.
- `num_inference_steps`: Total number of denoising steps.
- `optimize_latent_time`: List of timesteps at which optimization is applied.
- `optimize_latent_iter`: Number of optimization iterations per timestep.
- `optimize_latent_lr`: Learning rate for latent optimization.
- `enable_lora`: Set to `True` to use LoRA during optimization.
- `enable_depth_scaling`: Set to `True` to enable depth-aware trajectory scaling.
- `enable_control_force_optimization`: Set to `True` to enable control force optimization.
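For orientation, a `Config` holding the fields above might be shaped like this dataclass (a sketch only — the default values here are placeholders, not the repository's actual defaults):

```python
from dataclasses import dataclass, field

@dataclass
class Config:
    # placeholder defaults for illustration; check inference.py for real values
    seed: int = 42
    height: int = 576
    width: int = 1024
    num_frames: int = 14
    num_inference_steps: int = 25
    optimize_latent_time: list = field(default_factory=lambda: [30, 40, 50])
    optimize_latent_iter: int = 5
    optimize_latent_lr: float = 0.01
    enable_lora: bool = True
    enable_depth_scaling: bool = True
    enable_control_force_optimization: bool = True

# override any field when instantiating
cfg = Config(seed=0, num_frames=14)
```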
If you find our work useful for your research, please consider citing our paper:
```bibtex
@misc{zhang2025zeroshot3dawaretrajectoryguidedimagetovideo,
      title={Zero-shot 3D-Aware Trajectory-Guided image-to-video generation via Test-Time Training},
      author={Ruicheng Zhang and Jun Zhou and Zunnan Xu and Zihao Liu and Jiehui Huang and Mingyang Zhang and Yu Sun and Xiu Li},
      year={2025},
      eprint={2509.06723},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2509.06723},
}
```