Intelligent Creation, ByteDance
Video stylization, an important downstream task of video generation models, has not yet been thoroughly explored. Its input style conditions typically include text, a style image, and a stylized first frame. Each condition has a characteristic advantage: text is the most flexible, a style image provides a more accurate visual anchor, and a stylized first frame makes long-video stylization feasible. However, existing methods are largely confined to a single type of style condition, which limits their scope of application. Additionally, the lack of high-quality training data leads to style inconsistency and temporal flicker. To address these limitations, we introduce DreamStyle, a unified framework for video stylization that supports (1) text-guided, (2) style-image-guided, and (3) first-frame-guided video stylization, accompanied by a carefully designed data curation pipeline for acquiring high-quality paired video data. DreamStyle is built on a vanilla Image-to-Video (I2V) model and trained with a Low-Rank Adaptation (LoRA) whose token-specific up matrices reduce confusion among the different condition tokens. Both qualitative and quantitative evaluations demonstrate that DreamStyle is competent in all three video stylization tasks and outperforms competing methods in style consistency and video quality.
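The token-specific up matrices mentioned above can be sketched as follows: a standard LoRA with a shared down-projection, but a separate up-projection per condition-token type, so the adapter delta for each token depends on whether it is a video, text, first-frame, or style-image token. This is a minimal illustrative sketch; the function name, the token-type ids, and the matrix shapes are assumptions, not DreamStyle's actual interface.

```python
import numpy as np

def token_specific_lora(x, token_types, A, B_list):
    """LoRA delta with a per-token-type up matrix (illustrative sketch).

    x:           (seq, dim) token activations
    token_types: (seq,) integer ids, e.g. 0=video, 1=text,
                 2=first-frame, 3=style-image (hypothetical assignment)
    A:           (dim, rank) shared down matrix
    B_list[k]:   (rank, dim) up matrix for token type k
    """
    h = x @ A                          # shared down-projection
    out = np.zeros_like(x)
    for k, B in enumerate(B_list):
        mask = (token_types == k)[:, None]
        out += mask * (h @ B)          # out[i] = x[i] @ A @ B_{token_types[i]}
    return out
```

Routing each token through its own up matrix keeps the adapter small (one shared down matrix) while still letting the model treat the condition streams differently, which is how it reduces cross-condition confusion.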
Data Curation Pipeline. We generate the training data in two key steps: image stylization followed by image-to-video generation. Considering the characteristics of different image stylization techniques, we construct a CT dataset and an SFT dataset, with SDXL (equipped with ControlNet, InstantStyle, and an ID plugin) and Seedream 4.0 selected as their respective stylization models. For image-to-video generation, we utilize ControlNets to enhance motion consistency between the generated stylized videos and the raw videos. To ensure data quality, we additionally apply automatic filtering to the CT data and manual filtering to the SFT data.
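The two-step pipeline above can be outlined as: stylize one frame of a raw video, then animate the stylized frame while borrowing the raw video's motion. The sketch below is a schematic with placeholder callables; `stylize_image` and `image_to_video` stand in for the actual models (e.g. SDXL with plugins or Seedream 4.0, and a ControlNet-guided I2V model), and the `motion_reference` parameter is an assumption for illustration.

```python
def build_stylized_pair(raw_video, style_prompt, stylize_image, image_to_video):
    """Produce one (raw, stylized) paired training sample (schematic sketch).

    raw_video:      list of frames
    style_prompt:   text or style-image condition for the stylizer
    stylize_image:  callable(frame, prompt) -> stylized frame
    image_to_video: callable(frame, motion_reference=...) -> stylized video
                    whose motion follows the raw video (e.g. via ControlNet)
    """
    first_frame = raw_video[0]
    styled_frame = stylize_image(first_frame, style_prompt)   # step 1: image stylization
    styled_video = image_to_video(styled_frame,
                                  motion_reference=raw_video) # step 2: image-to-video
    return raw_video, styled_video
```

A filtering stage (automatic for CT data, manual for SFT data) would then discard pairs whose style or motion consistency is poor before training.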
Overview of DreamStyle Framework. DreamStyle is built on the Wan14B-I2V model, integrating the text and raw-video conditions through the cross-attention and image channels of the base model, while the first-frame and style-image conditions serve as additional frames concatenated to the start and end of the frame sequence. We train it with a standard flow matching loss and a token-specific LoRA that helps distinguish the different condition tokens.
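The standard flow matching objective used for training can be written compactly: sample noise, linearly interpolate it toward the data, and regress the model's output onto the constant velocity along that path. Below is a minimal NumPy sketch; the `model(x_t, t, cond)` signature is a placeholder, not Wan's actual interface.

```python
import numpy as np

def flow_matching_loss(model, x1, cond, rng):
    """Rectified-flow / flow matching loss (minimal sketch).

    x1:   data batch, shape (batch, ...)
    cond: conditioning passed through to the model (placeholder)
    """
    x0 = rng.standard_normal(x1.shape)                    # Gaussian noise sample
    t = rng.uniform(size=(x1.shape[0],) + (1,) * (x1.ndim - 1))
    x_t = (1.0 - t) * x0 + t * x1                         # linear interpolation path
    v_target = x1 - x0                                    # constant target velocity
    v_pred = model(x_t, t, cond)
    return np.mean((v_pred - v_target) ** 2)              # MSE on the velocity
```

In DreamStyle this loss would be computed on the full concatenated token sequence, with the token-specific LoRA providing the trainable parameters.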
- Jan 7, 2026: We release the Technical Report of DreamStyle
- Release technical report
- Release inference code
- Release models
- Release training code
If you find DreamStyle useful in your research, please cite our paper:
@misc{li2026dreamstyle,
      title={DreamStyle: A Unified Framework for Video Stylization},
      author={Mengtian Li and Jinshu Chen and Songtao Zhao and Wanquan Feng and Pengqi Tu and Qian He},
      year={2026},
      eprint={2601.02785},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2601.02785},
}