RealGeneral: Unifying Visual Generation via Temporal In-Context Learning with Video Models
Yijing Lin, Mengqi Huang, Shuhan Zhuang, Zhendong Mao†
University of Science and Technology of China
†corresponding author
🎉🎉 Our paper, “RealGeneral: Unifying Visual Generation via Temporal In-Context Learning with Video Models”, has been accepted to ICCV 2025! See our project page.
Training is conducted on 2 A100 GPUs (80 GB VRAM); inference is tested on 1 A100 GPU.
git clone https://github.com/Lyne1/Realgeneral.git
cd RealGeneral
All tests were conducted on Linux, and we recommend running our code on Linux. To set up the environment, run:
conda create -n RealGeneral python=3.10 -y
conda activate RealGeneral
bash env.sh
- **CogVideoX-1.5 T2V Checkpoint**: Download the pre-trained CogVideoX-1.5 (Text-to-Video) checkpoint from this link, and place the entire folder under the `pretrained_weights` directory. The resulting structure should look like this: `./pretrained_weights/CogVideoX1.5-5B`
- **RealGeneral Checkpoints**: Download the pre-trained RealGeneral checkpoints from this link. We provide two LoRA checkpoints for different tasks:
  - Subject-driven generation
  - Canny-to-image translation

  Place them under `pretrained_weights`: `./pretrained_weights/IP-LoRA` and `./pretrained_weights/Canny2image-LoRA`
cd inference
bash run_ip.sh # For subject-driven image generation
# For other tasks:
# bash run_canny.sh

Note: For tasks other than subject-driven generation, a task-specific description is automatically appended to the prompt during inference; this is handled internally according to the task type. For example, for a canny2image task, “The image has the specific canny map” is appended to your original prompt.
The supported task types and their corresponding additions are:
| Task Type | Appended Description |
| --- | --- |
| canny2image | The image has the specific canny map |
| depth2image | The image has the specific depth map |
| image2depth | The image has the specific depth map |
| deblurring | The image has a blur map |
| filling | The image has the specific filling map |
| coloring | The image has the specific grey map |
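Conceptually, this prompt augmentation can be sketched as the short Python snippet below. The dictionary and function names here are hypothetical, for illustration only; the actual implementation in the repo may differ.

```python
# Illustrative sketch of the automatic prompt augmentation described above.
# TASK_SUFFIXES and augment_prompt are hypothetical names, not repo code.
TASK_SUFFIXES = {
    "canny2image": "The image has the specific canny map",
    "depth2image": "The image has the specific depth map",
    "image2depth": "The image has the specific depth map",
    "deblurring": "The image has a blur map",
    "filling": "The image has the specific filling map",
    "coloring": "The image has the specific grey map",
}

def augment_prompt(prompt: str, task_type: str) -> str:
    """Append the task-specific description, if any, to the user prompt."""
    suffix = TASK_SUFFIXES.get(task_type)
    return f"{prompt}. {suffix}" if suffix else prompt

print(augment_prompt("A cat on a sofa", "canny2image"))
# → A cat on a sofa. The image has the specific canny map
```

For subject-driven generation (no entry in the mapping), the prompt passes through unchanged.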
Your dataset directory should be structured as follows:
.
├── videos/ # Folder containing video files
├── videos.txt # List of video file paths
├── prompts.txt # Text prompts for each video
└── instance.txt # (Optional) Subject words for subject-driven generation
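Before training, it can be worth verifying that `videos.txt` and `prompts.txt` line up and that every listed video exists on disk. The helper below is a hypothetical sketch, not part of the RealGeneral codebase:

```python
from pathlib import Path

def check_dataset(root_dir: str) -> int:
    """Sanity-check the dataset layout described above and return the
    number of samples. Hypothetical helper, not part of the repo."""
    root = Path(root_dir)
    videos = (root / "videos.txt").read_text().splitlines()
    prompts = (root / "prompts.txt").read_text().splitlines()
    if len(videos) != len(prompts):
        raise ValueError("videos.txt and prompts.txt must have the same number of lines")
    for rel in videos:
        if not (root / rel).exists():
            raise FileNotFoundError(f"missing video file: {rel}")
    return len(videos)
```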
💡 Tip: You can adjust the number of GPUs used for training by modifying the `num_processes` value in `finetune/accelerate_config_machine_single.yaml`.
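For example, training on the 2 GPUs mentioned above corresponds to a config entry roughly like the following (excerpt only; the other fields in the shipped config are elided here):

```yaml
# finetune/accelerate_config_machine_single.yaml (illustrative excerpt)
num_processes: 2   # one process per GPU
```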
To start training:
cd finetune
bash finetune_ip.sh # For subject-driven generation
# For other tasks:
# bash finetune_other_task.sh

Note: For tasks beyond subject-driven generation, you’ll need to modify the `--purpose` argument to specify the task type.
If you find our work useful in your research, please cite:
@misc{lin2025realgeneralunifyingvisualgeneration,
title={RealGeneral: Unifying Visual Generation via Temporal In-Context Learning with Video Models},
author={Yijing Lin and Mengqi Huang and Shuhan Zhuang and Zhendong Mao},
year={2025},
eprint={2503.10406},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2503.10406},
}