RealGeneral: Unifying Visual Generation via Temporal In-Context Learning with Video Models
Yijing Lin, Mengqi Huang, Shuhan Zhuang, Zhendong Mao†
University of Science and Technology of China
†corresponding author
🎉🎉 Our paper, “RealGeneral: Unifying Visual Generation via Temporal In-Context Learning with Video Models”, has been accepted to ICCV 2025! See our project page.
Training is conducted on 2 A100 GPUs (80 GB VRAM); inference is tested on 1 A100 GPU.
git clone https://github.com/Lyne1/Realgeneral.git
cd RealGeneral
All tests were conducted on Linux, and we recommend running our code on Linux. To set up the environment, run:
conda create -n RealGeneral python=3.10 -y
conda activate RealGeneral
bash env.sh
- **CogVideoX-1.5 T2V Checkpoint**: Download the pre-trained CogVideoX-1.5 (Text-to-Video) checkpoint from this link, and place the entire folder under the `pretrained_weights` directory. The resulting structure should look like this: `./pretrained_weights/CogVideoX1.5-5B`
- **RealGeneral Checkpoints**: Download the pre-trained RealGeneral checkpoints from this link. We provide two LoRA checkpoints for different tasks:
  - Subject-driven generation
  - Canny-to-image translation

  Place them under `pretrained_weights`: `./pretrained_weights/IP-LoRA` and `./pretrained_weights/Canny2image-LoRA`
cd inference
bash run_ip.sh # For subject-driven image generation
# For other tasks:
# bash run_canny.sh

Note: For tasks other than subject-driven generation, a task-specific description is automatically appended to the prompt during inference; this is handled internally according to the task type. For example, for a canny2image task, “The image has the specific canny map” is appended to your original prompt.
The supported task types and their corresponding additions are:
| Task Type | Appended Description |
| --- | --- |
| canny2image | The image has the specific canny map |
| depth2image | The image has the specific depth map |
| image2depth | The image has the specific depth map |
| deblurring | The image has a blur map |
| filling | The image has the specific filling map |
| coloring | The image has the specific grey map |
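Conceptually, this prompt augmentation can be sketched as the short Python snippet below. The dictionary and function names here are hypothetical, for illustration only; the actual implementation in the repo may differ.

```python
# Illustrative sketch of the automatic prompt augmentation described above.
# TASK_SUFFIXES and augment_prompt are hypothetical names, not repo code.
TASK_SUFFIXES = {
    "canny2image": "The image has the specific canny map",
    "depth2image": "The image has the specific depth map",
    "image2depth": "The image has the specific depth map",
    "deblurring": "The image has a blur map",
    "filling": "The image has the specific filling map",
    "coloring": "The image has the specific grey map",
}

def augment_prompt(prompt: str, task_type: str) -> str:
    """Append the task-specific description, if any, to the user prompt."""
    suffix = TASK_SUFFIXES.get(task_type)
    return f"{prompt}. {suffix}" if suffix else prompt

print(augment_prompt("A cat on a sofa", "canny2image"))
# → A cat on a sofa. The image has the specific canny map
```

For subject-driven generation (no entry in the mapping), the prompt passes through unchanged.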
Your dataset directory should be structured as follows:
.
├── videos/ # Folder containing video files
├── videos.txt # List of video file paths
├── prompts.txt # Text prompts for each video
└── instance.txt # (Optional) Subject words for subject-driven generation
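Before training, it can be worth verifying that `videos.txt` and `prompts.txt` line up and that every listed video exists on disk. The helper below is a hypothetical sketch, not part of the RealGeneral codebase:

```python
from pathlib import Path

def check_dataset(root_dir: str) -> int:
    """Sanity-check the dataset layout described above and return the
    number of samples. Hypothetical helper, not part of the repo."""
    root = Path(root_dir)
    videos = (root / "videos.txt").read_text().splitlines()
    prompts = (root / "prompts.txt").read_text().splitlines()
    if len(videos) != len(prompts):
        raise ValueError("videos.txt and prompts.txt must have the same number of lines")
    for rel in videos:
        if not (root / rel).exists():
            raise FileNotFoundError(f"missing video file: {rel}")
    return len(videos)
```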
💡 Tip: You can adjust the number of GPUs used for training by modifying the `num_processes` value in `finetune/accelerate_config_machine_single.yaml`.
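For example, training on the 2 GPUs mentioned above corresponds to a config entry roughly like the following (excerpt only; the other fields in the shipped config are elided here):

```yaml
# finetune/accelerate_config_machine_single.yaml (illustrative excerpt)
num_processes: 2   # one process per GPU
```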
To start training:
cd finetune
bash finetune_ip.sh # For subject-driven generation
# For other tasks:
# bash finetune_other_task.sh

Note: For tasks beyond subject-driven generation, you’ll need to modify the `--purpose` argument to specify the task type.
If you find our work useful in your research, please cite:
@misc{lin2025realgeneralunifyingvisualgeneration,
title={RealGeneral: Unifying Visual Generation via Temporal In-Context Learning with Video Models},
author={Yijing Lin and Mengqi Huang and Shuhan Zhuang and Zhendong Mao},
year={2025},
eprint={2503.10406},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2503.10406},
}