kinam0252/TIC-FT

🚀[NeurIPS 2025] Temporal In-Context Fine-Tuning with Temporal Reasoning for Versatile Control of Video Diffusion Models✨

📑Paper

🌐Project Page

📰 News

  • [2025.09.19] 🏆 TIC-FT officially accepted to NeurIPS 2025!
    🎤 Stay tuned for our presentation at the conference.

⚙️ Requirements

We recommend using conda to manage the environment:

# Create a new conda environment
conda create -n tic-ft python=3.10

# Activate the environment
conda activate tic-ft

# Install python dependencies
pip install -r requirements.txt

# Install ftfy package
pip install ftfy

# Install ffmpeg via conda-forge
conda install -c conda-forge ffmpeg

# Additional packages for video processing
pip install imageio imageio-ffmpeg
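After installing, a quick sanity check can confirm the extra dependencies are importable and ffmpeg is on the PATH. This is a hypothetical helper, not part of the repo:

```python
import importlib.util
import shutil

def check_env():
    """Return availability of the extra dependencies this README installs."""
    pkgs = ["ftfy", "imageio", "imageio_ffmpeg"]
    status = {p: importlib.util.find_spec(p) is not None for p in pkgs}
    # ffmpeg installed via conda-forge should be discoverable on the PATH
    status["ffmpeg"] = shutil.which("ffmpeg") is not None
    return status

if __name__ == "__main__":
    for name, ok in check_env().items():
        print(f"{name}: {'ok' if ok else 'MISSING'}")
```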

🚀 Try It Yourself!

Follow these steps to easily test the I2V pipeline:

  1. Prepare Your Image
    Convert your face image into either Cartoon or 3D Animation style with a white background using an image generation tool such as ChatGPT.

  2. Save the Image
    Save your generated image to:
    dataset/custom/{mode}/images

    • {mode} could be either Cartoon or 3DAnimation.
    • By default, an example 1.png is provided. You can:
      • Add new images as 2.png, 3.png, etc.
      • Or replace 1.png directly.
  3. Convert Image to Reference Video
    Use the following script to duplicate the image into 49 frames and generate a condition video:

    python dataset/utils/make_video_by_copying_image.py {image_path}

    Save the generated condition video into: dataset/custom/{mode}/videos
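    The duplication step presumably amounts to reading one image and repeating it 49 times before encoding. A minimal sketch of that logic — the imageio calls, paths, and fps value are assumptions, not the repo's actual code:

    ```python
    def repeat_frames(frame, num_frames=49):
        """Duplicate a single frame; 49 matches the video length TIC-FT expects."""
        return [frame] * num_frames

    # With imageio installed (pip install imageio imageio-ffmpeg), the script
    # could encode the repeated frames roughly like this (paths/fps hypothetical):
    #   import imageio
    #   frame = imageio.imread("dataset/custom/Cartoon/images/1.png")
    #   imageio.mimwrite("dataset/custom/Cartoon/videos/1.mp4",
    #                    repeat_frames(frame), fps=8)
    ```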

  4. Prepare Dataset Files

    • In dataset/custom/{mode}/videos.txt, list the relative video paths (one per line).
    • In dataset/custom/{mode}/prompt.txt, write the corresponding text prompts (one per line).
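    The two files must stay aligned line-by-line: line N of prompt.txt describes line N of videos.txt. A small, hypothetical validation helper (not part of the repo) that catches mismatches before training:

    ```python
    from pathlib import Path

    def load_pairs(dataset_dir):
        """Zip each line of videos.txt with the same-numbered line of prompt.txt."""
        root = Path(dataset_dir)
        videos = root.joinpath("videos.txt").read_text().splitlines()
        prompts = root.joinpath("prompt.txt").read_text().splitlines()
        if len(videos) != len(prompts):
            raise ValueError("videos.txt and prompt.txt must have the same number of lines")
        return list(zip(videos, prompts))
    ```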
  5. Download Pretrained Weights
    Download the safetensors weights for your selected mode from:
    Google Drive

  6. Run Inference
    Example command:

    python validate_repeat.py \
    --model_name wan \
    --model_id Wan-AI/Wan2.1-T2V-14B-Diffusers \
    --lora_weight_path /data/kinamkim/TIC-FT/outputs/wan/Cartoon/pytorch_lora_weights.safetensors \
    --latent_partition_mode c1b3t9 \
    --dataset_dir /data/kinamkim/TIC-FT/dataset/custom/Cartoon

    This command will generate multiple videos with different random seeds and save them under validation_videos/ in your weight directory.

  7. ⚠ Note (Wan Model Specific)

    • Due to a known issue, the first generated sample may appear noisy.
    • Valid results typically start from the second sample.
  8. Now you have your own video featuring your character!

cartoon.mp4
3DAnimation.mp4

🚧 Progress

✅ Completed

  • Implement I2V code for both CogVideoX and Wan

🔄 In Progress

  • Prepare model weights for various I2V applications
  • Implement V2V code for CogVideoX

🔜 Upcoming

  • Implement remaining features: Multiple Conditions, Action Transfer, and Video Interpolation

🗺️Start Guide

🔗 Weights

  • Download pretrained weights from here: Drive

📂 Dataset

To prepare your dataset, follow the structure provided in dataset/example/.

  • Each video should have 49 frames in total:
    • 13 condition image frames
    • 36 target video frames

When the 49 frames are encoded into latent representations (the VAE applies 4× temporal compression, so 49 pixel frames become 13 latent frames), the 13 latent frames are split into:

  • The first 4 latent frames → condition frames (covering the 13 condition image frames)
  • The next 9 latent frames → target frames

During training:

  • Only the first condition frame is kept as a pure condition.
  • The remaining 3 condition frames are progressively noised and used as buffer frames.

In the training scripts, you will find that latent_partition_mode is set to c1b3t9, which means:

  • c1b3t9 → 1 pure condition frame, 3 buffer frames, and 9 target frames.
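The arithmetic behind c1b3t9 can be sketched as follows. `parse_partition_mode` and `latent_frame_count` are hypothetical helpers, not the repo's actual functions; the 4× factor is the temporal compression used by the CogVideoX/Wan VAEs:

```python
import re

def parse_partition_mode(mode):
    """Split a mode string like 'c1b3t9' into (condition, buffer, target) counts."""
    m = re.fullmatch(r"c(\d+)b(\d+)t(\d+)", mode)
    if m is None:
        raise ValueError(f"unrecognized partition mode: {mode!r}")
    return tuple(int(g) for g in m.groups())

def latent_frame_count(pixel_frames, temporal_compression=4):
    """The VAE keeps the first frame and compresses the rest 4x in time,
    so 49 pixel frames become (49 - 1) // 4 + 1 = 13 latent frames."""
    return (pixel_frames - 1) // temporal_compression + 1

cond, buf, tgt = parse_partition_mode("c1b3t9")
assert (cond, buf, tgt) == (1, 3, 9)
assert cond + buf + tgt == latent_frame_count(49)  # 13 latent frames total
```

The same formula explains the split above: the 13 condition image frames compress to (13 − 1) // 4 + 1 = 4 latent frames (1 condition + 3 buffer), and the 36 target frames fill the remaining 9.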

🚀 Train

  • For CogVideoX:
    scripts/cogvideox/I2V/train.sh

  • For Wan:
    scripts/wan/I2V/train.sh

🔎 Inference

python validate.py \
  --model_name wan \
  --model_id {checkpoint path} \
  --lora_weight_path {safetensors path} \
  --latent_partition_mode c1b3t9 \
  --dataset_dir {dataset dir}

🎥 Video Examples

Below are example videos showcasing various applications of TIC-FT.


🖼️ I2V

We emphasize that our I2V approach does not simply animate the first frame; it leverages the identity captured in the first frame to generate diverse yet coherent videos.

i2v-1.mp4
i2v-2.mp4
i2v-3.mp4
i2v-4.mp4

🔁 V2V

v2v-1.mp4
v2v-2.mp4

🖼️ Multiple Conditions

MC-1.mp4
MC-2.mp4

🎯 Action Transfer

ActionTransfer-1.mp4

🕰️ Keyframe Interpolation

Interpolation-1.mp4

🙏Acknowledgements

This project is built upon the following works:

📖 BibTeX

@article{kim2025temporal,
  title={Temporal In-Context Fine-Tuning for Versatile Control of Video Diffusion Models},
  author={Kim, Kinam and Hyung, Junha and Choo, Jaegul},
  journal={arXiv preprint arXiv:2506.00996},
  year={2025}
}
