TTOM: Test-Time Optimization and Memorization for Compositional Video Generation
Leigang Qu1*, Ziyang Wang1*, Na Zheng1, Wenjie Wang2, Liqiang Nie3, Tat-Seng Chua1
1NExT++ Lab, National University of Singapore   2University of Science and Technology of China   3Harbin Institute of Technology (Shenzhen)
*Equal Contribution
📄 arXiv | 🌐 Project Page | 🖥️ GitHub
[ICLR 2026] Official repository
- 2026-01: 🎉 TTOM has been accepted to ICLR 2026!
- 2025-12: 🔥 Released inference code.
TTOM is a training-free, test-time optimization and memorization framework for compositional video generation. It addresses the challenge of generating videos with multiple objects, attributes, and motions that faithfully follow complex text prompts, without any additional training or fine-tuning.
The framework operates in two phases:
- Meta Extraction & Layout Generation – Uses GPT-4o to extract object metadata and generate spatial-temporal layouts from text prompts.
- Video Generation with TTOM – Generates videos conditioned on extracted metadata and layouts, with iterative test-time optimization of cross-attention via LoRA.
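Phase 1's spatial-temporal layouts can be pictured as one bounding box per object per frame. The sketch below is purely illustrative (the function name, box format, and keyframe schema are assumptions, not the repository's cache format): it linearly interpolates sparse keyframe boxes into a per-frame layout.

```python
# Illustrative sketch only: the actual layout schema produced by gen_cache.py
# may differ. Boxes are (x0, y0, x1, y1) in normalized [0, 1] coordinates.

def interpolate_boxes(keyframes, num_frames):
    """Linearly interpolate sparse keyframe boxes into one box per frame.

    keyframes: dict mapping frame index -> box tuple, e.g. {0: (...), 15: (...)}.
    """
    frames = sorted(keyframes)
    layout = []
    for t in range(num_frames):
        if t <= frames[0]:          # before the first keyframe: hold first box
            layout.append(keyframes[frames[0]])
            continue
        if t >= frames[-1]:         # after the last keyframe: hold last box
            layout.append(keyframes[frames[-1]])
            continue
        # Find the surrounding keyframes and blend linearly between them.
        lo = max(f for f in frames if f <= t)
        hi = min(f for f in frames if f >= t)
        if lo == hi:
            layout.append(keyframes[lo])
            continue
        w = (t - lo) / (hi - lo)
        box = tuple((1 - w) * a + w * b
                    for a, b in zip(keyframes[lo], keyframes[hi]))
        layout.append(box)
    return layout

# A hypothetical "cat" moving left to right across 16 frames:
layout = interpolate_boxes({0: (0.0, 0.4, 0.2, 0.6), 15: (0.8, 0.4, 1.0, 0.6)}, 16)
```

Each frame's box can then serve as the spatial target that the cross-attention maps are optimized toward during generation.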
Built on top of DiffSynth-Studio, an efficient diffusion inference engine.
Evaluation on T2V-CompBench across 7 compositional categories. Bold = best, underline = second best.
| Model | Avg. | Motion | Num | Spatial | Con-attr | Dyn-attr | Action | Interact |
|---|---|---|---|---|---|---|---|---|
| Kling-1.0 | 0.4630 | 0.2562 | 0.4413 | 0.5690 | 0.6931 | 0.0098 | 0.5787 | 0.7128 |
| Dreamina 1.2 | 0.4689 | 0.2361 | 0.4380 | 0.5773 | 0.6913 | 0.0051 | 0.5924 | 0.6824 |
| CogVideoX-5B | 0.4189 | 0.2658 | 0.3706 | 0.5172 | 0.6164 | 0.0219 | 0.5333 | 0.6069 |
| &nbsp;&nbsp;+ DyST-XL | 0.5081 | 0.2712 | 0.3969 | 0.6110 | 0.8696 | 0.0221 | 0.7321 | 0.6536 |
| &nbsp;&nbsp;+ LVD | 0.4739 | 0.3291 | 0.3825 | 0.5274 | 0.7534 | 0.0219 | 0.6826 | 0.6204 |
| &nbsp;&nbsp;+ Ours | <u>0.5632</u> | <u>0.4351</u> | 0.5081 | <u>0.6173</u> | <u>0.8782</u> | 0.0341 | 0.7191 | <u>0.7502</u> |
| &nbsp;&nbsp;%Improve. | 🟢 +34.4 | 🟢 +63.7 | 🟢 +37.1 | 🟢 +19.4 | 🟢 +42.5 | 🟢 +55.7 | 🟢 +34.8 | 🟢 +23.6 |
| Wan2.1-14B | 0.5314 | 0.2696 | <u>0.5113</u> | 0.5709 | 0.8369 | 0.0570 | 0.7504 | 0.7239 |
| &nbsp;&nbsp;+ LVD | 0.5439 | 0.2864 | 0.4707 | 0.5753 | 0.8610 | <u>0.0829</u> | <u>0.8107</u> | 0.7201 |
| &nbsp;&nbsp;+ Ours | **0.6155** | **0.4922** | **0.5881** | **0.6275** | **0.8982** | **0.1182** | **0.8152** | **0.7691** |
| &nbsp;&nbsp;%Improve. | 🟢 +15.8 | 🟢 +82.6 | 🟢 +15.0 | 🟢 +9.9 | 🟢 +7.3 | 🟢 +107.4 | 🟢 +8.6 | 🟢 +6.2 |
💡 TTOM is training-free – it optimizes at inference time on top of frozen pre-trained models without any additional training or fine-tuning.
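To unpack the training-free claim with a toy: only a low-rank LoRA delta on top of the frozen base weights is optimized during sampling, and memorization amounts to caching the learned delta keyed by the prompt (cf. `--strat_id`: 0 = update, 1 = load, 2 = load + update). This is a hedged pure-Python sketch, not TTOM's actual implementation; every name below is invented for illustration.

```python
# Toy illustration of test-time LoRA updates over frozen weights and a
# prompt-keyed memory of learned deltas. Not the actual TTOM code.

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def add(A, B):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(A, B)]

class LoRALinear:
    """Frozen weight W plus trainable low-rank delta B @ A (rank r)."""

    def __init__(self, W, r=1):
        self.W = W                                 # frozen base weight (d x d)
        d = len(W)
        self.A = [[0.0] * d for _ in range(r)]     # r x d, trainable
        self.B = [[0.0] * r for _ in range(d)]     # d x r, trainable

    def effective_weight(self):
        # The base model never changes; only the low-rank delta does.
        return add(self.W, matmul(self.B, self.A))

memory = {}  # prompt -> (A, B): test-time "memorization" of optimized deltas

layer = LoRALinear([[1.0, 0.0], [0.0, 1.0]], r=1)
layer.B[0][0], layer.A[0][1] = 0.5, 1.0             # pretend these were optimized
memory["a cat chasing a dog"] = (layer.A, layer.B)  # like strat_id 0: update + store

reloaded = LoRALinear([[1.0, 0.0], [0.0, 1.0]], r=1)
reloaded.A, reloaded.B = memory["a cat chasing a dog"]  # like strat_id 1: load
```

Discarding the delta after generation restores the pristine base model, which is why no fine-tuning of the underlying weights ever occurs.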
```
TTOM/
├── generation/
│   ├── gen_cache.py              # Phase 1: Meta extraction and layout generation
│   ├── gen_benchmarks.py         # Phase 2: Video generation with TTOM
│   └── get_attnmap.py            # Attention map generation
├── utils/
│   ├── gdino_detection_video.py  # GroundingDINO object detection + SAM2 segmentation
│   ├── evaluate_miou.py          # mIoU evaluation between attention and masks
│   ├── visualize_attn_maps.py    # Attention map visualization
│   └── ...                       # Other utility functions
├── ttom/                         # Core TTOM implementation
└── scripts/                      # Batch processing scripts
```
- Python 3.10+
- CUDA-capable GPU (≥ 24 GB VRAM recommended for Wan2.1-T2V-14B)
- pip and a virtualenv/conda environment
💡 Note: GroundingDINO and SAM2 are only required for attention-layout overlap evaluation. For basic video generation, skip them.
```bash
git clone https://github.com/LgQu/TTOM.git
cd TTOM
pip install -r requirements.txt
pip install -e .
pip install "huggingface_hub[cli]"
```
```bash
huggingface-cli download Wan-AI/Wan2.1-T2V-14B --local-dir ./models/Wan2.1-T2V-14B
```
💡 The download is large. The model will be saved to `./models/Wan2.1-T2V-14B/`.
Click to expand (only needed for attention-layout evaluation)
GroundingDINO

```bash
git clone https://github.com/IDEA-Research/GroundingDINO.git
cd GroundingDINO && pip install -e . && cd ..
```

SAM2

```bash
git clone https://github.com/facebookresearch/segment-anything-2.git sam2
cd sam2 && pip install -e . && cd ..
```

Set up your OpenAI API key for GPT-4o prompt processing:

```bash
export OPENAI_API_KEY="your-api-key-here"
```

On Windows PowerShell:

```powershell
$env:OPENAI_API_KEY = "your-api-key-here"
```

Run Phase 1 to extract object metadata and generate layouts:

```bash
python generation/gen_cache.py \
    --benchmark_source t2vcompbench \
    --benchmark_type 1_consistent_attr \
    --start_idx 0 \
    --end_idx 199 \
    --skip_if_exists
```

Outputs:
- `cache/{benchmark_source}_{benchmark_type}-gpt_4o.json` – enriched prompts, object metadata, and layouts.
- `data/layout/boxes_{cache_name}/layout_{pid}.gif` – layout visualization GIFs.
Run Phase 2 to generate videos with TTOM:

```bash
python generation/gen_benchmarks.py \
    --pid 0 \
    --cache_type t2vcompbench_motion_binding_gpt-4o \
    --guidance_type lora \
    --target_layers [3] \
    --max_iter 8 \
    --max_guidance_step 5 \
    --max_lora_step 5 \
    --target_modules "cross_attn.q,cross_attn.k,cross_attn.v,cross_attn.o" \
    --jsd_loss_weight 1.0 \
    --min_loss_value 0.06 \
    --save_lora_weight False \
    --skip_existed_prompt False \
    --prefix "test" \
    --strat_id 0
```

Outputs:
- `data/benchmarks/{cache_type}/.../{pid}_{tag}.mp4` – generated videos.
Uses GPT-4o to enrich prompts, extract object instances/attributes, and generate spatial-temporal layouts:
```bash
python generation/gen_cache.py \
    --benchmark_source t2vcompbench \
    --benchmark_type 1_consistent_attr \
    --start_idx 0 \
    --end_idx 199 \
    --skip_if_exists
```

`gen_benchmarks.py` generates videos using Wan2.1 with TTOM-style test-time optimization and memorization:
```bash
python generation/gen_benchmarks.py \
    --pid 0 \
    --cache_type t2vcompbench_motion_binding_gpt-4o \
    --guidance_type lora \
    --target_layers [3] \
    --max_iter 8 \
    --max_guidance_step 5 \
    --max_lora_step 5 \
    --target_modules "cross_attn.q,cross_attn.k,cross_attn.v,cross_attn.o" \
    --jsd_loss_weight 1.0 \
    --com_loss_weight 0.0 \
    --min_loss_value 0.03 \
    --save_lora_weight False \
    --save_mask False \
    --skip_existed_prompt False \
    --prefix "" \
    --strat_id 0
```

📖 Full argument reference
| Argument | Description | Default |
|---|---|---|
| `--pid` | Sample ID to generate | required |
| `--cache_type` | Cache file name | `t2vcompbench_motion_binding_gpt-4o` |
| `--guidance_type` | Guidance type: `lora`, `lvd`, `none` | `lora` |
| `--target_layers` | Transformer layers for guidance | `[3]` |
| `--max_iter` | Number of TTOM iterations | `8` |
| `--max_guidance_step` | Max guidance steps per iteration | `5` |
| `--max_lora_step` | Max LoRA update steps per iteration | `5` |
| `--target_modules` | Modules to apply LoRA | `cross_attn.q,cross_attn.k,cross_attn.v,cross_attn.o` |
| `--jsd_loss_weight` | JSD loss weight | `1.0` |
| `--com_loss_weight` | Composition loss weight | `0.0` |
| `--min_loss_value` | Min loss value threshold | `0.03` |
| `--save_lora_weight` | Save LoRA weights | `False` |
| `--save_mask` | Save attention masks | `False` |
| `--skip_existed_prompt` | Skip existing outputs | `False` |
| `--prefix` | Output directory prefix | `""` |
| `--strat_id` | TTOM strategy (0 = update, 1 = load, 2 = load + update) | `0` |
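Given `--jsd_loss_weight` and `--min_loss_value`, the optimization objective plausibly includes a Jensen-Shannon divergence between normalized cross-attention maps and the layout masks, minimized until it drops below the threshold. Below is a minimal stdlib sketch of such a loss; it is illustrative only (the real loss acts on attention tensors inside the diffusion model, and the variable names here are invented).

```python
import math

def jsd(p, q, eps=1e-12):
    """Jensen-Shannon divergence between two discrete distributions.

    Symmetric, non-negative, and bounded above by ln(2) with natural logs.
    """
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]

    def kl(a, b):
        # eps guards against log(0) when a distribution has zero entries.
        return sum(ai * math.log((ai + eps) / (bi + eps)) for ai, bi in zip(a, b))

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def normalize(xs):
    s = sum(xs)
    return [x / s for x in xs]

# Flattened toy "attention map" vs. a binary layout mask over 4 spatial cells.
attn = normalize([0.1, 0.6, 0.2, 0.1])
mask = normalize([0.0, 1.0, 0.0, 0.0])   # object box covers cell 1 only

loss = jsd(attn, mask)
# Early stopping in the spirit of --min_loss_value: stop once loss < threshold.
converged = loss < 0.03
```

Since the divergence shrinks as attention mass concentrates inside the layout box, driving it below a small threshold is a natural stopping criterion for the per-step LoRA updates.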
Click to expand full evaluation pipeline
1. Generate attention maps:
```bash
bash scripts/run_attnmap_batch.sh
```
💡 Before running, configure `CONDA_PYTHON` and `BASE_DIR` in the script.
2. Run GroundingDINO detection + SAM2 segmentation:
```bash
python utils/gdino_detection_video.py
```
3. Compute mIoU:
```bash
python utils/evaluate_miou.py \
    --attn_dir data/attn_maps/wan21_lora/ \
    --dino_dir data/dino_results_batch/ \
    --output_dir data/miou_summary/
```
4. Visualize attention maps:
```bash
python utils/visualize_attn_maps.py \
    --pid 0 \
    --step_id 40 \
    --layer_id 3 \
    --inst_id 0 \
    --save
```

```
data/
├── attn_maps/            # Attention map files (.pt)
├── dino_results_batch/   # GroundingDINO + SAM2 detection/segmentation results
├── miou_summary/         # mIoU evaluation results and summary stats
├── benchmarks/           # Generated videos
└── layout/               # Layout visualizations (GIFs)
```
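As a rough picture of what the mIoU step measures, an IoU can be computed between a thresholded attention map and a segmentation mask. The functions and the 0.5 threshold below are assumptions for illustration, not `evaluate_miou.py`'s actual logic.

```python
def binarize(attn, threshold):
    """Turn a flattened attention map into a binary foreground mask."""
    return [1 if a >= threshold else 0 for a in attn]

def iou(mask_a, mask_b):
    """Intersection-over-union of two flat binary masks."""
    inter = sum(a & b for a, b in zip(mask_a, mask_b))
    union = sum(a | b for a, b in zip(mask_a, mask_b))
    return inter / union if union else 0.0

def mean_iou(pairs, threshold=0.5):
    """Average IoU over (attention_map, ground_truth_mask) pairs."""
    return sum(iou(binarize(attn, threshold), gt) for attn, gt in pairs) / len(pairs)

attn_map = [0.9, 0.8, 0.1, 0.0]   # toy flattened attention over 4 cells
gt_mask  = [1, 1, 1, 0]           # toy GroundingDINO + SAM2 segmentation
score = iou(binarize(attn_map, 0.5), gt_mask)
```

A higher mIoU then indicates that the optimized cross-attention actually concentrates on the detected object regions.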
We thank the authors and maintainers of the following projects:
- Wan2.1 – Open and advanced large-scale video generative models.
- DiffSynth-Studio – Efficient diffusion model inference engine.
- GroundingDINO – Open-set object detection with language grounding.
- SAM2 – Segment Anything 2 for high-quality segmentation.
If you find TTOM useful, please consider giving this repository a star ⭐ and citing our paper:
```bibtex
@article{qu2025ttom,
  title   = {TTOM: Test-Time Optimization and Memorization for Compositional Video Generation},
  author  = {Leigang Qu and Ziyang Wang and Na Zheng and Wenjie Wang and Liqiang Nie and Tat-Seng Chua},
  journal = {arXiv preprint arXiv:2510.07940},
  year    = {2025},
  url     = {https://arxiv.org/abs/2510.07940}
}
```

