TTOM: Test-Time Optimization and Memorization for Compositional Video Generation

[Figure: TTOM framework overview]

Leigang Qu1*, Ziyang Wang1*, Na Zheng1, Wenjie Wang2, Liqiang Nie3, Tat-Seng Chua1
1NExT++ Lab, National University of Singapore    2University of Science and Technology of China    3Harbin Institute of Technology (Shenzhen)
*Equal Contribution

📑 arXiv   |   🌐 Project Page   |   🖥️ GitHub

[ICLR 2026] Official repository


🔥 News

  • 2026-01: 🎉 TTOM has been accepted to ICLR 2026!
  • 2025-12: 🔥 Released inference code.

📖 Overview

TTOM is a training-free, test-time optimization and memorization framework for compositional video generation. It addresses the challenge of generating videos with multiple objects, attributes, and motions that faithfully follow complex text prompts, without any additional training or fine-tuning.

The framework operates in two phases:

  1. Meta Extraction & Layout Generation – Uses GPT-4o to extract object metadata and generate spatial-temporal layouts from text prompts.
  2. Video Generation with TTOM – Generates videos conditioned on the extracted metadata and layouts, with iterative test-time optimization of cross-attention via LoRA.

Built on top of DiffSynth-Studio, an efficient diffusion inference engine.
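The two phases above can be sketched in plain Python. This is a purely illustrative mock (function names, the layout schema, and the loop body are assumptions, not the repository's actual API):

```python
# Hypothetical sketch of the two-phase TTOM pipeline; names and data shapes
# are illustrative only -- see generation/gen_cache.py and
# generation/gen_benchmarks.py for the real entry points.

def extract_layout(prompt: str) -> dict:
    """Phase 1: ask an LLM (GPT-4o in the paper) for object metadata and
    per-frame bounding boxes. Here we return a canned layout instead."""
    return {
        "objects": ["a red ball", "a blue cube"],
        # frame index -> list of (x0, y0, x1, y1), normalized to [0, 1]
        "boxes": {0: [(0.1, 0.4, 0.3, 0.6), (0.6, 0.4, 0.8, 0.6)]},
    }

def generate_video(prompt: str, layout: dict, max_iter: int = 8) -> str:
    """Phase 2: denoise while iteratively optimizing cross-attention via LoRA."""
    for step in range(max_iter):
        # 1) run a denoising step, capturing cross-attention maps
        # 2) compare attention maps against the target layout (e.g. a JSD loss)
        # 3) update LoRA parameters on the cross-attention projections
        pass
    return f"video for: {prompt}"

layout = extract_layout("a red ball rolls past a blue cube")
video = generate_video("a red ball rolls past a blue cube", layout)
```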

🎥 Qualitative Results

[Figure: Qualitative results on T2V-CompBench]

[Figure: Memorization qualitative results]

πŸ† T2V-CompBench Results

Evaluation on T2V-CompBench across 7 compositional categories.

| Model | Avg. | Motion | Num | Spatial | Con-attr | Dyn-attr | Action | Interact |
|---|---|---|---|---|---|---|---|---|
| Kling-1.0 | 0.4630 | 0.2562 | 0.4413 | 0.5690 | 0.6931 | 0.0098 | 0.5787 | 0.7128 |
| Dreamina 1.2 | 0.4689 | 0.2361 | 0.4380 | 0.5773 | 0.6913 | 0.0051 | 0.5924 | 0.6824 |
| CogVideoX-5B | 0.4189 | 0.2658 | 0.3706 | 0.5172 | 0.6164 | 0.0219 | 0.5333 | 0.6069 |
| + DyST-XL | 0.5081 | 0.2712 | 0.3969 | 0.6110 | 0.8696 | 0.0221 | 0.7321 | 0.6536 |
| + LVD | 0.4739 | 0.3291 | 0.3825 | 0.5274 | 0.7534 | 0.0219 | 0.6826 | 0.6204 |
| + Ours | 0.5632 | 0.4351 | 0.5081 | 0.6173 | 0.8782 | 0.0341 | 0.7191 | 0.7502 |
| %Improve. | 🟢+34.4 | 🟢+63.7 | 🟢+37.1 | 🟢+19.4 | 🟢+42.5 | 🟢+55.7 | 🟢+34.8 | 🟢+23.6 |
| Wan2.1-14B | 0.5314 | 0.2696 | 0.5113 | 0.5709 | 0.8369 | 0.0570 | 0.7504 | 0.7239 |
| + LVD | 0.5439 | 0.2864 | 0.4707 | 0.5753 | 0.8610 | 0.0829 | 0.8107 | 0.7201 |
| + Ours | 0.6155 | 0.4922 | 0.5881 | 0.6275 | 0.8982 | 0.1182 | 0.8152 | 0.7691 |
| %Improve. | 🟢+15.8 | 🟢+82.6 | 🟢+15.0 | 🟢+9.9 | 🟢+7.3 | 🟢+107.4 | 🟢+8.6 | 🟢+6.2 |

💡 TTOM is training-free: it optimizes at inference time on top of frozen pre-trained models without any additional training or fine-tuning.

📂 Project Structure

TTOM/
├── generation/
│   ├── gen_cache.py              # Phase 1: Meta extraction and layout generation
│   ├── gen_benchmarks.py         # Phase 2: Video generation with TTOM
│   └── get_attnmap.py            # Attention map generation
├── utils/
│   ├── gdino_detection_video.py  # GroundingDINO object detection + SAM2 segmentation
│   ├── evaluate_miou.py          # mIoU evaluation between attention and masks
│   ├── visualize_attn_maps.py    # Attention map visualization
│   └── ...                       # Other utility functions
├── ttom/                         # Core TTOM implementation
└── scripts/                      # Batch processing scripts

πŸ› οΈ Installation

Prerequisites

  • Python 3.10+
  • CUDA-capable GPU (≥ 24 GB VRAM recommended for Wan2.1-T2V-14B)
  • pip and a virtualenv/conda environment

💡 Note: GroundingDINO and SAM2 are only required for attention-layout overlap evaluation. For basic video generation, skip them.

1. Install TTOM

git clone https://github.com/LgQu/TTOM.git
cd TTOM
pip install -r requirements.txt
pip install -e .

2. Download Wan2.1-T2V-14B

pip install "huggingface_hub[cli]"
huggingface-cli download Wan-AI/Wan2.1-T2V-14B --local-dir ./models/Wan2.1-T2V-14B

💡 The download is large. The model will be saved to ./models/Wan2.1-T2V-14B/.

3. Optional: GroundingDINO & SAM2


GroundingDINO

git clone https://github.com/IDEA-Research/GroundingDINO.git
cd GroundingDINO && pip install -e . && cd ..

SAM2

git clone https://github.com/facebookresearch/segment-anything-2.git sam2
cd sam2 && pip install -e . && cd ..

4. Configuration

Set up your OpenAI API key for GPT-4o prompt processing:

export OPENAI_API_KEY="your-api-key-here"

On Windows PowerShell:

$env:OPENAI_API_KEY = "your-api-key-here"
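Both commands set the same environment variable; a quick, purely illustrative way to confirm that Python can see it before running the pipeline:

```python
import os

# Prints whether OPENAI_API_KEY is visible to the current process,
# without revealing its value.
key = os.environ.get("OPENAI_API_KEY")
print("OPENAI_API_KEY is", "set" if key else "missing")
```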

🚀 Quickstart

Step 1: Build Layout Cache

python generation/gen_cache.py \
    --benchmark_source t2vcompbench \
    --benchmark_type 1_consistent_attr \
    --start_idx 0 \
    --end_idx 199 \
    --skip_if_exists

Outputs:

  • cache/{benchmark_source}_{benchmark_type}-gpt_4o.json – enriched prompts, object metadata, and layouts.
  • data/layout/boxes_{cache_name}/layout_{pid}.gif – layout visualization GIFs.
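The cache is plain JSON, so it can be inspected directly. The entry below is made up to illustrate the idea (prompt-ID keys mapping to objects and per-frame boxes); check a file actually produced by gen_cache.py for the real schema:

```python
import json

# A fabricated cache entry with an assumed schema; the real file is written
# by generation/gen_cache.py and its field names may differ.
cache_json = """
{
  "0": {
    "prompt": "a red ball rolls past a blue cube",
    "objects": ["a red ball", "a blue cube"],
    "layout": [[[0.1, 0.4, 0.3, 0.6], [0.6, 0.4, 0.8, 0.6]]]
  }
}
"""
cache = json.loads(cache_json)
for pid, entry in cache.items():
    print(pid, entry["objects"], len(entry["layout"]), "frame(s)")
```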

Step 2: Generate Videos with TTOM

python generation/gen_benchmarks.py \
    --pid 0 \
    --cache_type t2vcompbench_motion_binding_gpt-4o \
    --guidance_type lora \
    --target_layers [3] \
    --max_iter 8 \
    --max_guidance_step 5 \
    --max_lora_step 5 \
    --target_modules "cross_attn.q,cross_attn.k,cross_attn.v,cross_attn.o" \
    --jsd_loss_weight 1.0 \
    --min_loss_value 0.06 \
    --save_lora_weight False \
    --skip_existed_prompt False \
    --prefix "test" \
    --strat_id 0

Outputs:

  • data/benchmarks/{cache_type}/.../{pid}_{tag}.mp4 – generated videos.

🔧 Advanced Usage

Meta Extraction & Layout Generation

Uses GPT-4o to enrich prompts, extract object instances/attributes, and generate spatial-temporal layouts:

python generation/gen_cache.py \
    --benchmark_source t2vcompbench \
    --benchmark_type 1_consistent_attr \
    --start_idx 0 \
    --end_idx 199 \
    --skip_if_exists

Video Generation with TTOM Strategies

gen_benchmarks.py generates videos using Wan2.1 with TTOM-style test-time optimization and memorization:

python generation/gen_benchmarks.py \
    --pid 0 \
    --cache_type t2vcompbench_motion_binding_gpt-4o \
    --guidance_type lora \
    --target_layers [3] \
    --max_iter 8 \
    --max_guidance_step 5 \
    --max_lora_step 5 \
    --target_modules "cross_attn.q,cross_attn.k,cross_attn.v,cross_attn.o" \
    --jsd_loss_weight 1.0 \
    --com_loss_weight 0.0 \
    --min_loss_value 0.03 \
    --save_lora_weight False \
    --save_mask False \
    --skip_existed_prompt False \
    --prefix "" \
    --strat_id 0
📋 Full argument reference

| Argument | Description | Default |
|---|---|---|
| --pid | Sample ID to generate | required |
| --cache_type | Cache file name | t2vcompbench_motion_binding_gpt-4o |
| --guidance_type | Guidance type: lora, lvd, none | lora |
| --target_layers | Transformer layers for guidance | [3] |
| --max_iter | Number of TTOM iterations | 8 |
| --max_guidance_step | Max guidance steps per iteration | 5 |
| --max_lora_step | Max LoRA update steps per iteration | 5 |
| --target_modules | Modules to apply LoRA | cross_attn.q,cross_attn.k,cross_attn.v,cross_attn.o |
| --jsd_loss_weight | JSD loss weight | 1.0 |
| --com_loss_weight | Composition loss weight | 0.0 |
| --min_loss_value | Min loss value threshold | 0.03 |
| --save_lora_weight | Save LoRA weights | False |
| --save_mask | Save attention masks | False |
| --skip_existed_prompt | Skip existing outputs | False |
| --prefix | Output directory prefix | "" |
| --strat_id | TTOM strategy (0 = update, 1 = load, 2 = load+update) | 0 |
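The strat_id options correspond to the memorization side of TTOM: update optimizes and stores LoRA weights, load reuses previously stored ones, and load+update does both. A hypothetical sketch of this bookkeeping (the dict-based "memory" and function below are illustrative, not the ttom/ package's actual interface):

```python
# Hypothetical sketch of the three memorization strategies selected by
# --strat_id; the real implementation lives in the ttom/ package.
STRATEGIES = {0: "update", 1: "load", 2: "load_update"}

def run_ttom(strat_id: int, memory: dict, prompt: str) -> dict:
    strategy = STRATEGIES[strat_id]
    # "load" strategies first look up LoRA weights memorized for this prompt.
    lora = memory.get(prompt) if strategy in ("load", "load_update") else None
    if lora is None:
        lora = {"init": True}  # fresh LoRA parameters
    if strategy in ("update", "load_update"):
        lora = {**lora, "optimized": True}  # test-time optimization step
        memory[prompt] = lora               # memorize for later reuse
    return lora

memory = {}
first = run_ttom(0, memory, "a cat chases a dog")   # optimize and store
reused = run_ttom(1, memory, "a cat chases a dog")  # reuse stored weights
```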

Attention Map Analysis


1. Generate attention maps:

bash scripts/run_attnmap_batch.sh

💡 Before running, configure CONDA_PYTHON and BASE_DIR in the script.

2. Run GroundingDINO detection + SAM2 segmentation:

python utils/gdino_detection_video.py

3. Compute mIoU:

python utils/evaluate_miou.py \
    --attn_dir data/attn_maps/wan21_lora/ \
    --dino_dir data/dino_results_batch/ \
    --output_dir data/miou_summary/
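The metric here is the overlap between thresholded attention maps and the GroundingDINO+SAM2 masks. A self-contained sketch of a per-object IoU (the relative thresholding rule and array shapes are assumptions, not necessarily what evaluate_miou.py does):

```python
import numpy as np

def iou(attn: np.ndarray, mask: np.ndarray, thresh: float = 0.5) -> float:
    """IoU between a binarized attention map and a binary object mask.
    The relative threshold (fraction of the map's max) is an assumption."""
    pred = attn >= thresh * attn.max()
    inter = np.logical_and(pred, mask).sum()
    union = np.logical_or(pred, mask).sum()
    return float(inter / union) if union else 0.0

# Toy example: attention concentrated in the top-left 4x4 quadrant,
# object mask covering a 4x6 region -> IoU = 16 / 24.
attn = np.zeros((8, 8)); attn[:4, :4] = 1.0
mask = np.zeros((8, 8), dtype=bool); mask[:4, :6] = True
print(round(iou(attn, mask), 3))
```

Averaging this over objects, frames, and prompts gives an mIoU-style summary.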

4. Visualize attention maps:

python utils/visualize_attn_maps.py \
    --pid 0 \
    --step_id 40 \
    --layer_id 3 \
    --inst_id 0 \
    --save

πŸ“ Output Structure

data/
├── attn_maps/            # Attention map files (.pt)
├── dino_results_batch/   # GroundingDINO + SAM2 detection/segmentation results
├── miou_summary/         # mIoU evaluation results and summary stats
├── benchmarks/           # Generated videos
└── layout/               # Layout visualizations (GIFs)

πŸ™ Acknowledgements

We thank the authors and maintainers of the following projects:

  • Wan2.1 – Open and advanced large-scale video generative models.
  • DiffSynth-Studio – Efficient diffusion model inference engine.
  • GroundingDINO – Open-set object detection with language grounding.
  • SAM2 – Segment Anything 2 for high-quality segmentation.

⭐ Citation

If you find TTOM useful, please consider giving this repository a star ⭐ and citing our paper:

@article{qu2025ttom,
  title   = {TTOM: Test-Time Optimization and Memorization for Compositional Video Generation},
  author  = {Leigang Qu and Ziyang Wang and Na Zheng and Wenjie Wang and Liqiang Nie and Tat-Seng Chua},
  journal = {arXiv preprint arXiv:2510.07940},
  year    = {2025},
  url     = {https://arxiv.org/abs/2510.07940}
}

About

ICLR'26 submission 3050: TTOM: Test-Time Optimization and Memorization for Compositional Video Generation
