🚀 [NeurIPS 2025] Temporal In-Context Fine-Tuning with Temporal Reasoning for Versatile Control of Video Diffusion Models ✨
- [2025.09.19] 🏆 TIC-FT officially accepted to NeurIPS 2025! 🎤 Stay tuned for our presentation at the conference.
We recommend using conda to manage the environment:

```bash
# Create a new conda environment
conda create -n tic-ft python=3.10

# Activate the environment
conda activate tic-ft

# Install Python dependencies
pip install -r requirements.txt

# Install the ftfy package
pip install ftfy

# Install ffmpeg via conda-forge
conda install -c conda-forge ffmpeg

# Additional packages for video processing
pip install imageio imageio-ffmpeg
```

Follow these steps to test the I2V pipeline:
1. **Prepare Your Image**
   Convert your face image into either Cartoon or 3D Animation style with a white background, using an image generation tool such as ChatGPT.

2. **Save the Image**
   Save your generated image to `dataset/custom/{mode}/images`, where `{mode}` is either `Cartoon` or `3DAnimation`.
   - By default, an example `1.png` is provided. You can add new images as `2.png`, `3.png`, etc., or replace `1.png` directly.
3. **Convert Image to Reference Video**
   Use the following script to duplicate the image into 49 frames and generate a condition video:

   ```bash
   python dataset/utils/make_video_by_copying_image.py {image_path}
   ```

   Save the generated condition video into `dataset/custom/{mode}/videos`.
4. **Prepare Dataset Files**
   - In `dataset/custom/{mode}/videos.txt`, list the relative video paths (one per line).
   - In `dataset/custom/{mode}/prompt.txt`, write the corresponding text prompts (one per line).
5. **Download Pretrained Weights**
   Download the safetensors weights for your selected mode from Google Drive.
6. **Run Inference**
   Example command:

   ```bash
   python validate_repeat.py \
     --model_name wan \
     --model_id Wan-AI/Wan2.1-T2V-14B-Diffusers \
     --lora_weight_path /data/kinamkim/TIC-FT/outputs/wan/3DAnimation/pytorch_lora_weights.safetensors \
     --latent_partition_mode c1b3t9 \
     --dataset_dir /data/kinamkim/TIC-FT/dataset/custom/Cartoon
   ```

   This command generates multiple videos with different random seeds and saves them under `validation_videos/` in your weight directory.
⚠️ **Note (Wan Model Specific)**
- Due to a known issue, the first generated sample may appear noisy.
- Valid results typically start from the second sample.
7. Now you have your own video featuring your character!
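The image-to-reference-video conversion above simply repeats one still image for 49 frames. As an illustration of the idea only (this is our own sketch, not the repo's `make_video_by_copying_image.py`; the frame rate and output naming are assumptions), the equivalent ffmpeg invocation can be built like this:

```python
# Sketch: turn a single image into a 49-frame condition video via ffmpeg.
# Hypothetical helper -- the repo's script may implement this differently.
from pathlib import Path

NUM_FRAMES = 49  # matches the 49-frame videos the dataset expects
FPS = 8          # assumed frame rate

def ffmpeg_command(image_path: str, out_dir: str = "dataset/custom/Cartoon/videos") -> list:
    """Build an ffmpeg command that loops one still image into NUM_FRAMES frames."""
    out_path = Path(out_dir) / (Path(image_path).stem + ".mp4")
    return [
        "ffmpeg", "-y",
        "-loop", "1",                  # repeat the still image indefinitely
        "-i", image_path,
        "-frames:v", str(NUM_FRAMES),  # stop after exactly 49 frames
        "-r", str(FPS),
        "-pix_fmt", "yuv420p",
        str(out_path),
    ]

cmd = ffmpeg_command("dataset/custom/Cartoon/images/1.png")
print(" ".join(cmd))
```

Executing the returned command (e.g. via `subprocess.run(cmd, check=True)`) requires ffmpeg on PATH, which the conda-forge install step above provides.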
cartoon.mp4
3DAnimation.mp4
- Implement I2V code on both CogVideoX and Wan
- Prepare model weights for various I2V applications
- Implement V2V code for CogVideoX
- Implement remaining features: Multiple Conditions, Action Transfer, and Video Interpolation
- Download pretrained weights from here: Drive
To prepare your dataset, follow the structure provided in `dataset/example/`.

- Each video should have 49 frames in total:
  - 13 condition image frames
  - 36 target video frames

When the 49 video frames are converted into latent representations, they are compressed into 13 latent frames, which are split into:
- The first 4 latent frames → condition frames
- The next 9 latent frames → target frames

During training:
- Only the first condition frame is kept as a pure condition.
- The remaining 3 condition frames are progressively noised and used as buffer frames.

In the training scripts, you will find that `latent_partition_mode` is set to `c1b3t9`, which means:
- `c1b3t9` → 1 pure condition frame, 3 buffer frames, and 9 target frames.
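As a concrete illustration of the `c1b3t9` layout, the spec string can be parsed into frame counts and each latent frame assigned a noise level. This is a minimal sketch with hypothetical helper names; in particular, the linear buffer schedule is our own assumption for illustration, not necessarily the repo's exact progressive-noising rule:

```python
import re

def parse_partition_mode(mode: str) -> dict:
    """Parse a spec like 'c1b3t9' into counts:
    c = pure condition frames, b = buffer frames, t = target frames."""
    return {k: int(v) for k, v in re.findall(r"([cbt])(\d+)", mode)}

def frame_noise_levels(c: int, b: int, t: int) -> list:
    """Per-latent-frame noise level in [0, 1]: condition frames stay clean,
    buffer frames are progressively noised (linear here, as an illustration),
    and target frames are fully noised."""
    return ([0.0] * c
            + [(i + 1) / (b + 1) for i in range(b)]
            + [1.0] * t)

parts = parse_partition_mode("c1b3t9")
print(parts)                      # {'c': 1, 'b': 3, 't': 9}
assert sum(parts.values()) == 13  # all 13 latent frames are accounted for

levels = frame_noise_levels(parts["c"], parts["b"], parts["t"])
print(levels[:5])                 # [0.0, 0.25, 0.5, 0.75, 1.0]
```

The assertion makes the bookkeeping explicit: 1 + 3 + 9 covers exactly the 13 latent frames produced from a 49-frame video.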
- For CogVideoX, see `scripts/cogvideox/I2V/train.sh`.
- For Wan, see `scripts/wan/I2V/train.sh`.
```bash
python validate.py \
  --model_name wan \
  --model_id {checkpoint path} \
  --lora_weight_path {safetensors path} \
  --latent_partition_mode c1b3t9 \
  --dataset_dir {dataset dir}
```

Below are example videos showcasing various applications of TIC-FT.
We emphasize that our I2V approach is not limited to continuing from the first frame; instead, it leverages the identity of the first frame to generate diverse and coherent videos.
i2v-1.mp4
i2v-2.mp4
i2v-3.mp4
i2v-4.mp4
v2v-1.mp4
v2v-2.mp4
MC-1.mp4
MC-2.mp4
ActionTransfer-1.mp4
Interpolation-1.mp4
This project is built upon the following works:
```bibtex
@article{kim2025temporal,
  title={Temporal In-Context Fine-Tuning for Versatile Control of Video Diffusion Models},
  author={Kim, Kinam and Hyung, Junha and Choo, Jaegul},
  journal={arXiv preprint arXiv:2506.00996},
  year={2025}
}
```