
Zo3T: Zero-shot 3D-Aware Trajectory-Guided Image-to-Video Generation via Test-Time Training

Ruicheng Zhang1,2*, Jun Zhou1*, Zunnan Xu1*, Zihao Liu1, Jiehui Huang3, Mingyang Zhang4, Yu Sun2, Xiu Li1†

1 Tsinghua University, 2 Sun Yat-sen University
3 The Hong Kong University of Science and Technology, 4 China University of Geosciences

* Equal contribution. † Corresponding author.

ArXiv | Demo Page

🎉 Our paper has been accepted to AAAI 2026! 🤖


Framework Overview

Framework Diagram

An overview of our zero-shot trajectory-guided video generation framework.
Our method optimizes a pre-trained video diffusion model at specific denoising timesteps via two key stages.
Test-Time Training (TTT) adapts the latent state and an ephemeral adapter to maintain semantic consistency along the trajectory.
Guidance Field Rectification refines the denoising direction using a one-step lookahead optimization to ensure precise path execution.
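As a rough intuition for the lookahead idea, a generic one-step-lookahead latent update can be sketched as below. This is a schematic toy with made-up function names (`lookahead_rectify`, `denoise_step`, `loss_fn`), not the implementation from the paper: it only illustrates optimizing a correction to the current latent by differentiating through one approximate denoising step.

```python
import torch

def lookahead_rectify(latent, denoise_step, loss_fn, lr=0.05, iters=3):
    """Toy one-step-lookahead rectification (schematic, not the paper's code).

    latent:       current noisy latent z_t
    denoise_step: callable approximating one denoising step z_t -> z_{t-1}
    loss_fn:      trajectory loss evaluated on the looked-ahead latent
    """
    # Optimize an additive correction to the latent, not the latent itself
    delta = torch.zeros_like(latent, requires_grad=True)
    opt = torch.optim.Adam([delta], lr=lr)
    for _ in range(iters):
        z_next = denoise_step(latent + delta)  # one-step lookahead
        loss = loss_fn(z_next)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return (latent + delta).detach()

# Toy demo: steer z_t so the looked-ahead latent moves toward a target
z0 = torch.zeros(2, 2)
target = torch.ones(2, 2)
z_opt = lookahead_rectify(z0, lambda z: 0.9 * z,
                          lambda z: ((z - target) ** 2).mean())
```

In the actual method this correction is applied only at selected denoising timesteps, with the loss measuring agreement with the guiding trajectory.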


Getting Started

Prerequisites

To run Zo3T, ensure you have the following dependencies installed:

  • Python: version 3.12
  • Pre-trained model: the Stable Video Diffusion (stable-video-diffusion-img2vid) checkpoint

Installation

Follow these steps to set up the environment and install dependencies.

1. Clone the Repository

Clone the Zo3T repository to your local machine:

git clone https://github.com/your-username/Zo3T-main.git
cd Zo3T-main

2. Create and Activate a Conda Environment

We recommend using conda to manage dependencies. Create and activate a new environment:

conda create -n zo3t python=3.12 -y
conda activate zo3t

3. Install Dependencies

Install all required packages listed in requirements.txt:

pip install -r requirements.txt

4. Download the Stable Video Diffusion Model

The pipeline requires the stable-video-diffusion-img2vid model checkpoint.

Update the svd_dir variable in inference.py to point to the model directory:

# in inference.py
# Load pre-trained image-to-video diffusion models
print("Loading Stable Video Diffusion from local path...")
svd_dir = "/path/to/your/stable-video-diffusion-img2vid"  # ⬅️ Update this path

You are now ready to run the inference script.


Usage

Prepare your input directory with the following structure:

/path/to/your/input_dir/
├── img.png
└── traj.npy
  • img.png: The first frame of the video.
  • traj.npy: A NumPy array of shape [N, (2+F), 2], where:
    • N: Number of objects to track.
    • First slice [:, :2, :]: Top-left and bottom-right coordinates [[w1, h1], [w2, h2]] of the initial bounding boxes.
    • Second slice [:, 2:, :]: Trajectory of the center point for each bounding box over F frames.
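As a concrete illustration, the following sketch builds a valid traj.npy for a single object whose bounding-box center drifts to the right. The box coordinates, frame count, and step size are placeholder values for demonstration, not values from the repository:

```python
import numpy as np

N, F = 1, 14  # one object, 14 trajectory frames (example values)
traj = np.zeros((N, 2 + F, 2), dtype=np.float32)

# First slice: initial bounding box, top-left [w1, h1] and bottom-right [w2, h2]
traj[0, 0] = [100, 150]
traj[0, 1] = [200, 250]

# Second slice: center-point trajectory over F frames (drift 10 px right per frame)
cx, cy = 150, 200
for f in range(F):
    traj[0, 2 + f] = [cx + 10 * f, cy]

np.save("traj.npy", traj)
print(traj.shape)  # (1, 16, 2)
```

The resulting file, placed next to img.png in the input directory, matches the [N, (2+F), 2] layout described above.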

Run the inference script:

python inference.py --input_dir /path/to/your/input_dir/ --output_dir /path/to/your/output_dir/

Configuration

Adjust hyperparameters in the Config class in inference.py:

  • seed: Random seed for reproducibility.
  • height, width: Resolution of the generated video.
  • num_frames: Number of frames to generate.
  • num_inference_steps: Total number of denoising steps.
  • optimize_latent_time: List of timesteps for optimization.
  • optimize_latent_iter: Number of optimization iterations per timestep.
  • optimize_latent_lr: Learning rate for latent optimization.
  • enable_lora: Set to True to use LoRA during optimization.
  • enable_depth_scaling: Set to True to enable depth-aware trajectory scaling.
  • enable_control_force_optimization: Set to True to enable control force optimization.
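To see how these knobs fit together, the Config class can be imagined roughly as the dataclass below. The default values here are illustrative placeholders only; check inference.py for the actual settings and field types:

```python
from dataclasses import dataclass, field

@dataclass
class Config:
    # Illustrative defaults -- the real values live in inference.py.
    seed: int = 42
    height: int = 576
    width: int = 1024
    num_frames: int = 14
    num_inference_steps: int = 25
    optimize_latent_time: list = field(default_factory=lambda: [30, 40])
    optimize_latent_iter: int = 5
    optimize_latent_lr: float = 0.01
    enable_lora: bool = True
    enable_depth_scaling: bool = True
    enable_control_force_optimization: bool = True

# Override individual fields without touching the rest
cfg = Config(seed=0, num_frames=14)
```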

Citation

If you find our work useful for your research, please consider citing our paper:

@misc{zhang2025zeroshot3dawaretrajectoryguidedimagetovideo,
      title={Zero-shot 3D-Aware Trajectory-Guided Image-to-Video Generation via Test-Time Training},
      author={Ruicheng Zhang and Jun Zhou and Zunnan Xu and Zihao Liu and Jiehui Huang and Mingyang Zhang and Yu Sun and Xiu Li},
      year={2025},
      eprint={2509.06723},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2509.06723}, 
}

About

Code for the paper "Zo3T: Zero-shot 3D-Aware Trajectory-Guided Image-to-Video Generation via Test-Time Training".
