
【CVPR 2025】MoVE-KD: Knowledge Distillation for VLMs with Mixture of Visual Encoders
Jiajun Cao, Yuan Zhang, Tao Huang, Ming Lu, Qizhe Zhang, Ruichuan An, Ningning Ma, Shanghang Zhang

Overview

(Figure: framework overview)

Visual encoders are fundamental components in vision-language models (VLMs), each showcasing unique strengths derived from various pre-trained visual foundation models. To leverage the various capabilities of these encoders, recent studies incorporate multiple encoders within a single VLM, leading to a considerable increase in computational cost. In this paper, we present Mixture-of-Visual-Encoder Knowledge Distillation (MoVE-KD), a novel framework that distills the unique proficiencies of multiple vision encoders into a single, efficient encoder model. Specifically, to mitigate conflicts and retain the unique characteristics of each teacher encoder, we employ low-rank adaptation (LoRA) and mixture-of-experts (MoEs) to selectively activate specialized knowledge based on input features, enhancing both adaptability and efficiency. To regularize the KD process and enhance performance, we propose an attention-based distillation strategy that adaptively weighs the different visual encoders and emphasizes valuable visual tokens, reducing the burden of replicating comprehensive but distinct features from multiple teachers.
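The mixture-of-LoRA-experts idea described above can be sketched roughly as follows. This is a minimal, hypothetical PyTorch sketch, not the repository's actual implementation; class names, the router design, and the zero-initialization choice are all illustrative assumptions:

```python
import torch
import torch.nn as nn

class LoRAExpert(nn.Module):
    """One low-rank adapter: x -> up(down(x)), with rank << dim."""
    def __init__(self, in_dim: int, out_dim: int, rank: int = 8):
        super().__init__()
        self.down = nn.Linear(in_dim, rank, bias=False)
        self.up = nn.Linear(rank, out_dim, bias=False)
        nn.init.zeros_(self.up.weight)  # zero-init so training starts from the base mapping

    def forward(self, x):
        return self.up(self.down(x))

class MixtureOfLoRA(nn.Module):
    """A router softly combines per-teacher LoRA experts on top of a frozen base layer."""
    def __init__(self, base: nn.Linear, num_experts: int = 3, rank: int = 8):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)  # the base encoder weights stay frozen
        self.experts = nn.ModuleList(
            LoRAExpert(base.in_features, base.out_features, rank)
            for _ in range(num_experts)
        )
        self.router = nn.Linear(base.in_features, num_experts)

    def forward(self, x):
        gates = torch.softmax(self.router(x), dim=-1)                # (..., E)
        delta = torch.stack([e(x) for e in self.experts], dim=-1)    # (..., out_dim, E)
        return self.base(x) + (delta * gates.unsqueeze(-2)).sum(-1)
```

Because the `up` projections are zero-initialized, the mixture initially reproduces the frozen base layer exactly, and each expert can then specialize toward one teacher during distillation.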

News

  • [2025/2/26] 🔥 The code of MoVE-KD is released.

  • [2025/2/26] 🔥 MoVE-KD is accepted by CVPR 2025.

Code License Usage and License Notices: This project utilizes certain datasets and checkpoints that are subject to their respective original licenses. Users must comply with all terms and conditions of these original licenses, including but not limited to the OpenAI Terms of Use for the dataset and the specific licenses for base language models for checkpoints trained using the dataset (e.g. Llama community license for LLaMA-2 and Vicuna-v1.5). This project does not impose any additional constraints beyond those stipulated in the original licenses. Furthermore, users are reminded to ensure that their use of the dataset and checkpoints is in compliance with all applicable laws and regulations.


Install

If you are not using Linux, do NOT proceed; see the instructions for macOS and Windows.

  1. Clone this repository and navigate to the MoVE-KD folder
git clone https://github.com/hey-cjj/MoVE-KD.git
cd MoVE-KD
  2. Install the package
conda create -n move-kd python=3.10 -y
conda activate move-kd
pip install --upgrade pip  # enable PEP 660 support
pip install -e .
  3. Install additional packages for training
pip install -e ".[train]"
pip install flash-attn --no-build-isolation
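After installing, a quick sanity check (a hypothetical snippet, not part of the repository) confirms that the key packages resolve; `flash_attn` is only required for training:

```python
import importlib.util

# Check that each package can be found without importing it (importing
# flash_attn on a CPU-only machine can fail even when it is installed).
for pkg in ("torch", "transformers", "flash_attn"):
    found = importlib.util.find_spec(pkg) is not None
    print(f"{pkg}: {'OK' if found else 'missing'}")
```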

MoVE-KD Weights

| Method | LLM | VQAv2 | GQA | TextVQA | VizWiz | POPE | SQA | MME | MMB |
|---|---|---|---|---|---|---|---|---|---|
| LLaVA-v1.5 | Vicuna-7B | 78.5 | 62.0 | 58.2 | 50.0 | 85.9 | 66.8 | 1510.7 | 64.3 |
| MoVE-KD-v1.0 | Vicuna-7B | 79.5 | 63.2 | 58.3 | 52.3 | 86.9 | 69.3 | 1524.5 | 66.3 |
| MoVE-KD-v1.1 | Vicuna-7B | 79.9 | 63.9 | 59.6 | 52.7 | 86.3 | 69.8 | 1509.1 | 67.4 |
| LLaVA-v1.5 | Vicuna-13B | 80.0 | 63.3 | 61.3 | 53.6 | 85.9 | 71.6 | 1531.3 | 67.7 |
| MoVE-KD-v1.0 | Vicuna-13B | 80.6 | 64.2 | 59.7 | 55.7 | 85.7 | 73.2 | 1568.1 | 70.2 |
| MoVE-KD-v1.1 | Vicuna-13B | 80.8 | 63.9 | 61.1 | 57.5 | 86.3 | 71.8 | 1568.3 | 69.7 |

Train

Our training procedure follows LLaVA and consists of two stages: pre-training and fine-tuning. In the pre-training stage, we update only the weights of the student encoder's MoLE, the encoder adapters, and the projector, freezing all other weights. In the fine-tuning stage, all weights are updated except those of the teacher encoders.
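The stage-wise freezing described above could be toggled with a helper like the one below. This is a hypothetical sketch; the module-name substrings `mole`, `adapter`, `projector`, and `teacher` are illustrative assumptions, not the repository's actual parameter names:

```python
import torch.nn as nn

def set_stage(model: nn.Module, stage: str) -> None:
    """Set requires_grad per training stage by matching parameter names."""
    for name, p in model.named_parameters():
        if stage == "pretrain":
            # Only the MoLE, encoder adapters, and projector are tuned.
            p.requires_grad = any(k in name for k in ("mole", "adapter", "projector"))
        else:
            # Fine-tuning: everything is tuned except the teacher encoders.
            p.requires_grad = "teacher" not in name
```

Name-based matching keeps the helper independent of the exact model class, at the cost of requiring consistent submodule naming.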

Pre-training

Training script with DeepSpeed ZeRO-2: pretrain-v1.0.sh and pretrain-v1.1.sh.

Fine-tuning

Training script with DeepSpeed ZeRO-2: finetune-v1.0.sh and finetune-v1.1.sh.

Evaluation

See Evaluation.md.

Citation

If you find MoVE-KD useful for your research and applications, please cite using this BibTeX:

@InProceedings{Cao_2025_CVPR,
    author    = {Cao, Jiajun and Zhang, Yuan and Huang, Tao and Lu, Ming and Zhang, Qizhe and An, Ruichuan and Ma, Ningning and Zhang, Shanghang},
    title     = {MoVE-KD: Knowledge Distillation for VLMs with Mixture of Visual Encoders},
    booktitle = {Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR)},
    month     = {June},
    year      = {2025},
    pages     = {19846-19856}
}

Acknowledgement
