TTOM: Test-Time Optimization and Memorization for Compositional Video Generation
Leigang Qu1*, Ziyang Wang1*, Na Zheng1, Wenjie Wang2, Liqiang Nie3, Tat-Seng Chua1
1NExT++ Lab, National University of Singapore   2University of Science and Technology of China   3Harbin Institute of Technology (Shenzhen)
*Equal Contribution
📄 arXiv | 🌐 Project Page | 🖥️ GitHub
[ICLR 2026] Official repository
- 2026-01: 🎉 TTOM has been accepted to ICLR 2026!
- 2025-12: 🔥 Released inference code.
TTOM is a training-free, test-time optimization and memorization framework for compositional video generation. It addresses the challenge of generating videos with multiple objects, attributes, and motions that faithfully follow complex text prompts, without any additional training or fine-tuning.
The framework operates in two phases:
- Meta Extraction & Layout Generation – Uses GPT-4o to extract object metadata and generate spatial-temporal layouts from text prompts.
- Video Generation with TTOM – Generates videos conditioned on extracted metadata and layouts, with iterative test-time optimization of cross-attention via LoRA.
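Phase 1's spatial-temporal layouts can be pictured as one bounding box per object per frame. The sketch below is purely illustrative (the function name, box format, and keyframe schema are assumptions, not the repository's cache format): it linearly interpolates sparse keyframe boxes into a per-frame layout.

```python
# Illustrative sketch only: the actual layout schema produced by gen_cache.py
# may differ. Boxes are (x0, y0, x1, y1) in normalized [0, 1] coordinates.

def interpolate_boxes(keyframes, num_frames):
    """Linearly interpolate sparse keyframe boxes into one box per frame.

    keyframes: dict mapping frame index -> box tuple, e.g. {0: (...), 15: (...)}.
    """
    frames = sorted(keyframes)
    layout = []
    for t in range(num_frames):
        if t <= frames[0]:          # before the first keyframe: hold first box
            layout.append(keyframes[frames[0]])
            continue
        if t >= frames[-1]:         # after the last keyframe: hold last box
            layout.append(keyframes[frames[-1]])
            continue
        # Find the surrounding keyframes and blend linearly between them.
        lo = max(f for f in frames if f <= t)
        hi = min(f for f in frames if f >= t)
        if lo == hi:
            layout.append(keyframes[lo])
            continue
        w = (t - lo) / (hi - lo)
        box = tuple((1 - w) * a + w * b
                    for a, b in zip(keyframes[lo], keyframes[hi]))
        layout.append(box)
    return layout

# A hypothetical "cat" moving left to right across 16 frames:
layout = interpolate_boxes({0: (0.0, 0.4, 0.2, 0.6), 15: (0.8, 0.4, 1.0, 0.6)}, 16)
```

Each frame's box can then serve as the spatial target that the cross-attention maps are optimized toward during generation.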
Built on top of DiffSynth-Studio, an efficient diffusion inference engine.
Evaluation on T2V-CompBench across 7 compositional categories. Bold = best, underline = second best.
| Model | Avg. | Motion | Num | Spatial | Con-attr | Dyn-attr | Action | Interact |
|---|---|---|---|---|---|---|---|---|
| Kling-1.0 | 0.4630 | 0.2562 | 0.4413 | 0.5690 | 0.6931 | 0.0098 | 0.5787 | 0.7128 |
| Dreamina 1.2 | 0.4689 | 0.2361 | 0.4380 | 0.5773 | 0.6913 | 0.0051 | 0.5924 | 0.6824 |
| CogVideoX-5B | 0.4189 | 0.2658 | 0.3706 | 0.5172 | 0.6164 | 0.0219 | 0.5333 | 0.6069 |
| &nbsp;&nbsp;+ DyST-XL | 0.5081 | 0.2712 | 0.3969 | 0.6110 | 0.8696 | 0.0221 | 0.7321 | 0.6536 |
| &nbsp;&nbsp;+ LVD | 0.4739 | 0.3291 | 0.3825 | 0.5274 | 0.7534 | 0.0219 | 0.6826 | 0.6204 |
| &nbsp;&nbsp;+ Ours | <u>0.5632</u> | <u>0.4351</u> | 0.5081 | <u>0.6173</u> | <u>0.8782</u> | 0.0341 | 0.7191 | <u>0.7502</u> |
| &nbsp;&nbsp;%Improve. | 🟢 +34.4 | 🟢 +63.7 | 🟢 +37.1 | 🟢 +19.4 | 🟢 +42.5 | 🟢 +55.7 | 🟢 +34.8 | 🟢 +23.6 |
| Wan2.1-14B | 0.5314 | 0.2696 | <u>0.5113</u> | 0.5709 | 0.8369 | 0.0570 | 0.7504 | 0.7239 |
| &nbsp;&nbsp;+ LVD | 0.5439 | 0.2864 | 0.4707 | 0.5753 | 0.8610 | <u>0.0829</u> | <u>0.8107</u> | 0.7201 |
| &nbsp;&nbsp;+ Ours | **0.6155** | **0.4922** | **0.5881** | **0.6275** | **0.8982** | **0.1182** | **0.8152** | **0.7691** |
| &nbsp;&nbsp;%Improve. | 🟢 +15.8 | 🟢 +82.6 | 🟢 +15.0 | 🟢 +9.9 | 🟢 +7.3 | 🟢 +107.4 | 🟢 +8.6 | 🟢 +6.2 |
💡 TTOM is training-free – it optimizes at inference time on top of frozen pre-trained models without any additional training or fine-tuning.
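To unpack the training-free claim with a toy: only a low-rank LoRA delta on top of the frozen base weights is optimized during sampling, and memorization amounts to caching the learned delta keyed by the prompt (cf. `--strat_id`: 0 = update, 1 = load, 2 = load + update). This is a hedged pure-Python sketch, not TTOM's actual implementation; every name below is invented for illustration.

```python
# Toy illustration of test-time LoRA updates over frozen weights and a
# prompt-keyed memory of learned deltas. Not the actual TTOM code.

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def add(A, B):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(A, B)]

class LoRALinear:
    """Frozen weight W plus trainable low-rank delta B @ A (rank r)."""

    def __init__(self, W, r=1):
        self.W = W                                 # frozen base weight (d x d)
        d = len(W)
        self.A = [[0.0] * d for _ in range(r)]     # r x d, trainable
        self.B = [[0.0] * r for _ in range(d)]     # d x r, trainable

    def effective_weight(self):
        # The base model never changes; only the low-rank delta does.
        return add(self.W, matmul(self.B, self.A))

memory = {}  # prompt -> (A, B): test-time "memorization" of optimized deltas

layer = LoRALinear([[1.0, 0.0], [0.0, 1.0]], r=1)
layer.B[0][0], layer.A[0][1] = 0.5, 1.0             # pretend these were optimized
memory["a cat chasing a dog"] = (layer.A, layer.B)  # like strat_id 0: update + store

reloaded = LoRALinear([[1.0, 0.0], [0.0, 1.0]], r=1)
reloaded.A, reloaded.B = memory["a cat chasing a dog"]  # like strat_id 1: load
```

Discarding the delta after generation restores the pristine base model, which is why no fine-tuning of the underlying weights ever occurs.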
```
TTOM/
├── generation/
│   ├── gen_cache.py              # Phase 1: Meta extraction and layout generation
│   ├── gen_benchmarks.py         # Phase 2: Video generation with TTOM
│   └── get_attnmap.py            # Attention map generation
├── utils/
│   ├── gdino_detection_video.py  # GroundingDINO object detection + SAM2 segmentation
│   ├── evaluate_miou.py          # mIoU evaluation between attention and masks
│   ├── visualize_attn_maps.py    # Attention map visualization
│   └── ...                       # Other utility functions
├── ttom/                         # Core TTOM implementation
└── scripts/                      # Batch processing scripts
```
- Python 3.10+
- CUDA-capable GPU (≥ 24 GB VRAM recommended for Wan2.1-T2V-14B)
- pip and a virtualenv/conda environment
💡 Note: GroundingDINO and SAM2 are only required for attention-layout overlap evaluation. For basic video generation, skip them.
```bash
git clone https://github.com/LgQu/TTOM.git
cd TTOM
pip install -r requirements.txt
pip install -e .
pip install "huggingface_hub[cli]"
```
```bash
huggingface-cli download Wan-AI/Wan2.1-T2V-14B --local-dir ./models/Wan2.1-T2V-14B
```
💡 The download is large. The model will be saved to `./models/Wan2.1-T2V-14B/`.
Click to expand (only needed for attention-layout evaluation)
GroundingDINO

```bash
git clone https://github.com/IDEA-Research/GroundingDINO.git
cd GroundingDINO && pip install -e . && cd ..
```

SAM2

```bash
git clone https://github.com/facebookresearch/segment-anything-2.git sam2
cd sam2 && pip install -e . && cd ..
```

Set up your OpenAI API key for GPT-4o prompt processing:

```bash
export OPENAI_API_KEY="your-api-key-here"
```

On Windows PowerShell:

```powershell
$env:OPENAI_API_KEY = "your-api-key-here"
```

Run Phase 1 to extract object metadata and generate layouts:

```bash
python generation/gen_cache.py \
    --benchmark_source t2vcompbench \
    --benchmark_type 1_consistent_attr \
    --start_idx 0 \
    --end_idx 199 \
    --skip_if_exists
```

Outputs:
- `cache/{benchmark_source}_{benchmark_type}-gpt_4o.json` – enriched prompts, object metadata, and layouts.
- `data/layout/boxes_{cache_name}/layout_{pid}.gif` – layout visualization GIFs.
Run Phase 2 to generate videos with TTOM:

```bash
python generation/gen_benchmarks.py \
    --pid 0 \
    --cache_type t2vcompbench_motion_binding_gpt-4o \
    --guidance_type lora \
    --target_layers [3] \
    --max_iter 8 \
    --max_guidance_step 5 \
    --max_lora_step 5 \
    --target_modules "cross_attn.q,cross_attn.k,cross_attn.v,cross_attn.o" \
    --jsd_loss_weight 1.0 \
    --min_loss_value 0.06 \
    --save_lora_weight False \
    --skip_existed_prompt False \
    --prefix "test" \
    --strat_id 0
```

Outputs:
- `data/benchmarks/{cache_type}/.../{pid}_{tag}.mp4` – generated videos.
Uses GPT-4o to enrich prompts, extract object instances/attributes, and generate spatial-temporal layouts:
```bash
python generation/gen_cache.py \
    --benchmark_source t2vcompbench \
    --benchmark_type 1_consistent_attr \
    --start_idx 0 \
    --end_idx 199 \
    --skip_if_exists
```

`gen_benchmarks.py` generates videos using Wan2.1 with TTOM-style test-time optimization and memorization:
```bash
python generation/gen_benchmarks.py \
    --pid 0 \
    --cache_type t2vcompbench_motion_binding_gpt-4o \
    --guidance_type lora \
    --target_layers [3] \
    --max_iter 8 \
    --max_guidance_step 5 \
    --max_lora_step 5 \
    --target_modules "cross_attn.q,cross_attn.k,cross_attn.v,cross_attn.o" \
    --jsd_loss_weight 1.0 \
    --com_loss_weight 0.0 \
    --min_loss_value 0.03 \
    --save_lora_weight False \
    --save_mask False \
    --skip_existed_prompt False \
    --prefix "" \
    --strat_id 0
```

📖 Full argument reference
| Argument | Description | Default |
|---|---|---|
| `--pid` | Sample ID to generate | required |
| `--cache_type` | Cache file name | `t2vcompbench_motion_binding_gpt-4o` |
| `--guidance_type` | Guidance type: `lora`, `lvd`, `none` | `lora` |
| `--target_layers` | Transformer layers for guidance | `[3]` |
| `--max_iter` | Number of TTOM iterations | `8` |
| `--max_guidance_step` | Max guidance steps per iteration | `5` |
| `--max_lora_step` | Max LoRA update steps per iteration | `5` |
| `--target_modules` | Modules to apply LoRA | `cross_attn.q,cross_attn.k,cross_attn.v,cross_attn.o` |
| `--jsd_loss_weight` | JSD loss weight | `1.0` |
| `--com_loss_weight` | Composition loss weight | `0.0` |
| `--min_loss_value` | Min loss value threshold | `0.03` |
| `--save_lora_weight` | Save LoRA weights | `False` |
| `--save_mask` | Save attention masks | `False` |
| `--skip_existed_prompt` | Skip existing outputs | `False` |
| `--prefix` | Output directory prefix | `""` |
| `--strat_id` | TTOM strategy (0 = update, 1 = load, 2 = load + update) | `0` |
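Given `--jsd_loss_weight` and `--min_loss_value`, the optimization objective plausibly includes a Jensen-Shannon divergence between normalized cross-attention maps and the layout masks, minimized until it drops below the threshold. Below is a minimal stdlib sketch of such a loss; it is illustrative only (the real loss acts on attention tensors inside the diffusion model, and the variable names here are invented).

```python
import math

def jsd(p, q, eps=1e-12):
    """Jensen-Shannon divergence between two discrete distributions.

    Symmetric, non-negative, and bounded above by ln(2) with natural logs.
    """
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]

    def kl(a, b):
        # eps guards against log(0) when a distribution has zero entries.
        return sum(ai * math.log((ai + eps) / (bi + eps)) for ai, bi in zip(a, b))

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def normalize(xs):
    s = sum(xs)
    return [x / s for x in xs]

# Flattened toy "attention map" vs. a binary layout mask over 4 spatial cells.
attn = normalize([0.1, 0.6, 0.2, 0.1])
mask = normalize([0.0, 1.0, 0.0, 0.0])   # object box covers cell 1 only

loss = jsd(attn, mask)
# Early stopping in the spirit of --min_loss_value: stop once loss < threshold.
converged = loss < 0.03
```

Since the divergence shrinks as attention mass concentrates inside the layout box, driving it below a small threshold is a natural stopping criterion for the per-step LoRA updates.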
Click to expand full evaluation pipeline
1. Generate attention maps:
```bash
bash scripts/run_attnmap_batch.sh
```
💡 Before running, configure `CONDA_PYTHON` and `BASE_DIR` in the script.
2. Run GroundingDINO detection + SAM2 segmentation:
```bash
python utils/gdino_detection_video.py
```
3. Compute mIoU:
```bash
python utils/evaluate_miou.py \
    --attn_dir data/attn_maps/wan21_lora/ \
    --dino_dir data/dino_results_batch/ \
    --output_dir data/miou_summary/
```
4. Visualize attention maps:
```bash
python utils/visualize_attn_maps.py \
    --pid 0 \
    --step_id 40 \
    --layer_id 3 \
    --inst_id 0 \
    --save
```

```
data/
├── attn_maps/            # Attention map files (.pt)
├── dino_results_batch/   # GroundingDINO + SAM2 detection/segmentation results
├── miou_summary/         # mIoU evaluation results and summary stats
├── benchmarks/           # Generated videos
└── layout/               # Layout visualizations (GIFs)
```
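As a rough picture of what the mIoU step measures, an IoU can be computed between a thresholded attention map and a segmentation mask. The functions and the 0.5 threshold below are assumptions for illustration, not `evaluate_miou.py`'s actual logic.

```python
def binarize(attn, threshold):
    """Turn a flattened attention map into a binary foreground mask."""
    return [1 if a >= threshold else 0 for a in attn]

def iou(mask_a, mask_b):
    """Intersection-over-union of two flat binary masks."""
    inter = sum(a & b for a, b in zip(mask_a, mask_b))
    union = sum(a | b for a, b in zip(mask_a, mask_b))
    return inter / union if union else 0.0

def mean_iou(pairs, threshold=0.5):
    """Average IoU over (attention_map, ground_truth_mask) pairs."""
    return sum(iou(binarize(attn, threshold), gt) for attn, gt in pairs) / len(pairs)

attn_map = [0.9, 0.8, 0.1, 0.0]   # toy flattened attention over 4 cells
gt_mask  = [1, 1, 1, 0]           # toy GroundingDINO + SAM2 segmentation
score = iou(binarize(attn_map, 0.5), gt_mask)
```

A higher mIoU then indicates that the optimized cross-attention actually concentrates on the detected object regions.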
We thank the authors and maintainers of the following projects:
- Wan2.1 – Open and advanced large-scale video generative models.
- DiffSynth-Studio – Efficient diffusion model inference engine.
- GroundingDINO – Open-set object detection with language grounding.
- SAM2 – Segment Anything 2 for high-quality segmentation.
If you find TTOM useful, please consider giving this repository a star ⭐ and citing our paper:
```bibtex
@article{qu2025ttom,
  title   = {TTOM: Test-Time Optimization and Memorization for Compositional Video Generation},
  author  = {Leigang Qu and Ziyang Wang and Na Zheng and Wenjie Wang and Liqiang Nie and Tat-Seng Chua},
  journal = {arXiv preprint arXiv:2510.07940},
  year    = {2025},
  url     = {https://arxiv.org/abs/2510.07940}
}
```

