# DREAM: Drafting with Refined Target Features and Entropy-Adaptive Cross-Attention Fusion for Multimodal Speculative Decoding
An open-source framework to accelerate Vision Language Model (VLM) inference by up to 3x with no quality loss.
DREAM is a cutting-edge framework designed to significantly accelerate the inference speed of Vision Language Models (VLMs), such as LLaVA. By employing a novel speculative decoding mechanism, DREAM achieves up to a 3x speedup over traditional autoregressive methods without compromising the quality of the output.
At its core, DREAM drafts candidate tokens using refined features from the target model and fuses visual and textual context through entropy-adaptive cross-attention. This lets it propose multiple tokens in parallel and verify them efficiently against the base model, leading to substantial gains in performance.
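The draft-then-verify mechanism behind speculative decoding can be sketched in a few lines. The following is a generic toy illustration over a tiny vocabulary, not DREAM's actual implementation: `p_target` and `q_draft` stand in for the target VLM and the draft model, and the acceptance rule `min(1, p/q)` with residual resampling is what keeps the output distribution identical to the target's.

```python
import random

def speculative_step(p_target, q_draft, k, rng):
    """One draft-then-verify step of speculative sampling (toy version).

    p_target / q_draft map a context (tuple of token ids) to a dict
    {token: probability}. k is the number of draft tokens per step.
    Returns the tokens emitted this step (a rejection truncates it).
    """
    # 1. Draft phase: the cheap model proposes k tokens autoregressively.
    drafts = []
    for _ in range(k):
        q = q_draft(tuple(drafts))
        toks, probs = zip(*sorted(q.items()))
        drafts.append(rng.choices(toks, weights=probs)[0])
    # 2. Verify phase: accept draft x with probability min(1, p(x) / q(x)).
    out = []
    for x in drafts:
        ctx = tuple(out)
        p, q = p_target(ctx), q_draft(ctx)
        if rng.random() < min(1.0, p.get(x, 0.0) / q[x]):
            out.append(x)  # accepted: statistically identical to sampling from p
        else:
            # Rejected: resample from the residual (p - q)+, renormalized.
            resid = {t: max(0.0, p[t] - q.get(t, 0.0)) for t in p}
            z = sum(resid.values())
            toks, weights = zip(*sorted(resid.items()))
            out.append(rng.choices(toks, weights=[w / z for w in weights])[0])
            break
    return out
```

When draft and target agree, all `k` drafts are accepted and the step emits `k` tokens at the cost of a single target-model verification pass; that is the source of the speedup.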
## Key Features

- High-Performance Inference: Up to 3x faster inference for Vision Language Models (VLMs) compared to standard autoregressive decoding.
- Zero Quality Loss: Maintains the same output distribution as the original model.
- Multimodal Support: Fully compatible with multimodal models like LLaVA.
- Efficient Training: Includes scripts for training the auto-regression head using DeepSpeed.
- Interactive Web UI: Comes with a Gradio-based web interface for easy testing and demonstration.
- Comprehensive Tooling: Provides scripts for training data generation and performance evaluation.
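The zero-quality-loss property is not specific to DREAM; it is the standard speculative-sampling guarantee. If a draft token $x \sim q$ is accepted with probability $\min(1, p(x)/q(x))$ and, on rejection, a replacement is drawn from the normalized residual $(p - q)_+$, the emitted token is distributed exactly according to the target distribution $p$:

```math
P(\text{emit } x)
  = q(x)\min\!\Big(1, \tfrac{p(x)}{q(x)}\Big)
  + \Big(1 - \sum_y \min\big(p(y), q(y)\big)\Big)
    \frac{\big(p(x) - q(x)\big)_+}{\sum_y \big(p(y) - q(y)\big)_+}
  = \min\big(p(x), q(x)\big) + \big(p(x) - q(x)\big)_+
  = p(x)
```

The middle step uses the identity $1 - \sum_y \min(p(y), q(y)) = \sum_y (p(y) - q(y))_+$, so the rejection mass exactly cancels the normalizer of the residual distribution.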
## Demo

| Vanilla | DREAM |
|---|---|
| ![]() | ![]() |
## Installation

1. Clone the repository:

   ```shell
   git clone https://github.com/SAI-Lab-NYU/DREAM.git
   cd DREAM
   ```

2. Install dependencies (we recommend creating a virtual environment first):

   ```shell
   pip install -e .
   ```

   Note: `-e` installs the project in editable mode.

3. Download model weights: see the Model Weights section below for links to the available models.
## Web UI

Run our Gradio-based web interface for an interactive experience. The command automatically handles model allocation across multiple GPUs.

```shell
python -m dream.application.webui \
    --ea-model-path [PATH_TO_DREAM_WEIGHTS] \
    --base-model-path [PATH_TO_BASE_MODEL]
```

- `[PATH_TO_DREAM_WEIGHTS]`: Path to the downloaded DREAM weights (e.g., `./DREAM-llava-v1.6-vicuna-7b`).
- `[PATH_TO_BASE_MODEL]`: Path to the original base model weights (e.g., the original `vicuna-7b-v1.3`).
- `total-token`: Number of draft tokens. Adjust this based on your hardware for optimal performance. Set to `-1` for auto-configuration.

Once the model is loaded, a URL will be displayed in the terminal; open it in a browser to use the demo.
## Training

First, generate the necessary training data (see `./ge_data` for detailed instructions and generation scripts):

```shell
python -m dream.ge_data.allocation_mix665
```

Then, use the following DeepSpeed command to start training:

```shell
cd dream/train
deepspeed main_deepspeed.py \
    --deepspeed_config ./ds_config.json \
    --tmpdir [PATH_TO_TRAINING_DATA] \
    --cpdir [PATH_TO_SAVE_CHECKPOINTS] \
    --configpath ./vicuna_7B_config.json
```

## Evaluation

Test the inference speed of DREAM on benchmarks such as MT-Bench:

```shell
python -m dream.evaluation.eval_llava \
    --ea-model-path [PATH_TO_DREAM_WEIGHTS] \
    --base-model-path [PATH_TO_BASE_MODEL]
```

This will generate a `.jsonl` file containing the generation results and wall time.
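Once the `.jsonl` is produced, a short script can aggregate the timing numbers. A minimal sketch, assuming each line is a JSON object with a numeric `wall_time` field in seconds (the field name is an assumption; adapt it to the actual output schema):

```python
import json
import statistics

def summarize_wall_time(jsonl_path):
    """Aggregate per-sample wall-clock times from an evaluation .jsonl file.

    Assumes each non-empty line is a JSON object carrying a numeric
    "wall_time" field (hypothetical name; check the real schema).
    """
    times = []
    with open(jsonl_path) as f:
        for line in f:
            line = line.strip()
            if line:
                times.append(float(json.loads(line)["wall_time"]))
    return {
        "samples": len(times),
        "total_s": sum(times),
        "mean_s": statistics.mean(times),
    }
```

Comparing `mean_s` between a DREAM run and a vanilla baseline run on the same benchmark gives the effective wall-clock speedup.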
## Model Weights

| Model | Base Model | Download |
|---|---|---|
| `DREAM-llava-v1.6-vicuna-7b` | `vicuna-7b-v1.6` | 🤗 HideonBed12138/DREAM-llava-v1.6-vicuna-7b |
## Citation

If you find our work useful for your research, please consider citing our paper:

```bibtex
@misc{hu2025dreamdraftingrefinedtarget,
      title={DREAM: Drafting with Refined Target Features and Entropy-Adaptive Cross-Attention Fusion for Multimodal Speculative Decoding},
      author={Yunhai Hu and Tianhua Xia and Zining Liu and Rahul Raman and Xingyu Liu and Bo Bao and Eric Sather and Vithursan Thangarasa and Sai Qian Zhang},
      year={2025},
      eprint={2505.19201},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2505.19201},
}
```

## Acknowledgements

This project is built upon the incredible work of the open-source community. We are especially grateful to the developers of Medusa, EAGLE, and FastChat.
## License

DREAM is licensed under the Apache 2.0 License.