Qianxun Xu1, 2 • Chenxi Song1 • Yujun Cai3 • Chi Zhang1*
1AGI Lab, Westlake University • 2Duke Kunshan University • 3The University of Queensland
Recent advances in text-to-video diffusion models have enabled high-fidelity and temporally coherent video synthesis. However, current models are predominantly optimized for single-event generation. When handling multi-event prompts without explicit temporal grounding, such models often produce blended or collapsed scenes that break the intended narrative. To address this limitation, we present SwitchCraft, a training-free framework for multi-event video generation. Our key insight is that uniform prompt injection across time ignores the correspondence between events and frames. To this end, we introduce Event-Aligned Query Steering (EAQS), which steers frame-level attention to align with the relevant event prompts. Furthermore, we propose the Auto-Balance Strength Solver (ABSS), which adaptively balances steering strength to preserve temporal consistency and visual fidelity. Extensive experiments demonstrate that SwitchCraft substantially improves prompt alignment, event clarity, and scene consistency compared with existing baselines, offering a simple yet effective solution for multi-event video generation.
🎯 Event Alignment: Aligns each event description with its intended time span during inference, fundamentally addressing event blending and event collapse.
⚡ Training-Free Query Steering: Employs a novel projection-based query steering framework to map frames to their intended event prompts.
⚖️ Adaptive Balancing: Dynamically balances steering strength throughout the generation process to ensure smooth, coherent multi-event videos.
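The two mechanisms above can be sketched in a few lines. Note this is a minimal, hypothetical illustration rather than the actual SwitchCraft implementation: the function names, the specific projection rule, the frame-to-event mapping, and the displacement-based strength heuristic are all assumptions for exposition; the real EAQS and ABSS operate inside the diffusion model's attention layers during denoising.

```python
import numpy as np

def event_aligned_query_steering(queries, event_keys, frame_to_event, strength):
    """Steer each frame's attention queries toward its assigned event prompt.

    queries:        (num_frames, d) frame-level query vectors
    event_keys:     (num_events, d) text-key vectors, one per event prompt
    frame_to_event: (num_frames,)   index of the event each frame belongs to
    strength:       scalar in [0, 1] blending original and steered queries
    """
    steered = queries.copy()
    for f, e in enumerate(frame_to_event):
        k = event_keys[e]
        # Project the frame's query onto its event key's direction, so the
        # frame attends preferentially to its own event prompt tokens.
        proj = (queries[f] @ k) / (k @ k) * k
        steered[f] = (1.0 - strength) * queries[f] + strength * proj
    return steered

def adaptive_strength(queries, steered, max_shift=0.5):
    """Toy stand-in for adaptive balancing: cap the steering strength so the
    average query displacement stays below max_shift, trading steering for
    temporal consistency."""
    shift = np.linalg.norm(steered - queries, axis=-1).mean()
    return min(1.0, max_shift / (shift + 1e-8))
```

With strength = 0 the queries are untouched; with strength = 1 each frame's query is fully projected onto its event's key direction, so interior strengths interpolate between fidelity to the original attention and alignment with the event schedule.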
# Clone the repository
git clone https://github.com/CeciliaTheBirb/SwitchCraft.git
cd SwitchCraft
# Create environment
conda create -n switchcraft python=3.10
conda activate switchcraft
# Install dependencies
pip install -r requirements.txt
Generate multi-event videos with 👇
bash gen_multi.sh
This project is built upon the Wan 2.1 pipeline.
If you find this code or our paper useful for your research, please consider citing:
@inproceedings{xu2026switchcraft,
title={SwitchCraft: Training-Free Multi-Event Video Generation with Attention Controls},
author={Xu, Qianxun and Song, Chenxi and Cai, Yujun and Zhang, Chi},
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
year={2026}
}