Breaking the "Memory Wall" for MLLMs with Adaptive Video Compression (ACL 2025)
Xiao Wang1,2,‡, Qingyi Si2,‡, Jianlong Wu1*, Shiyu Zhu3, Li Cao2, Liqiang Nie1*
1 Harbin Institute of Technology, Shenzhen
2 Huawei Technologies Co., Ltd.
3 Shandong University
‡ Equal contribution
* Corresponding authors
Have a coding agent (Claude Code, Cursor, etc.) reproduce all paper results end-to-end with a single prompt:
```
Read AGENTS.md and reproduce the AdaReTaKe paper results end-to-end.
```
AGENTS.md contains everything the agent needs: environment setup, dataset preparation, eval commands, expected scores, and common failure modes.
AdaReTaKe is an advanced video compression framework designed for Multimodal Large Language Models (MLLMs). By adaptively reducing uneven visual redundancy across timestamps and model layers, it:
✅ Extends context capacity from 256 to 2048 frames
✅ Theoretically minimizes compression loss via adaptive ratio allocation
✅ Outperforms SOTA by +2.3% (7B) and +2.8% (72B) on four benchmarks
| Feature | Innovation |
|---|---|
| Adaptive Redundancy Reduction | Layer-wise + timestamp-wise compression for maximal context retention |
| Scalability | Validated on 7B to 72B MLLMs with consistent gains |
| Theoretical Guarantee | Compression ratio allocation minimizes the loss upper bound |
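To make the allocation idea concrete, here is a minimal sketch of budget allocation under a fixed token budget. This is an illustrative simplification, not the paper's actual algorithm: `allocate_budgets`, the per-chunk redundancy estimates, and the proportional rule are hypothetical stand-ins for the theoretically derived allocation that minimizes the loss upper bound.

```python
# Illustrative sketch only: the paper derives its allocation from a loss
# upper bound; here we simply retain more tokens where redundancy is low.

def allocate_budgets(redundancy, total_budget):
    """Split a fixed token budget across video chunks (or layers),
    giving larger shares to chunks with lower estimated redundancy.

    redundancy: per-chunk redundancy estimates in [0, 1)
    total_budget: total number of visual tokens to retain
    """
    importance = [1.0 - r for r in redundancy]
    scale = total_budget / sum(importance)
    return [w * scale for w in importance]

# A highly redundant chunk (0.9) keeps far fewer tokens than a dense one (0.2).
budgets = allocate_budgets([0.9, 0.5, 0.2], total_budget=1000)
```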
```shell
# For GPU users
conda create -n retake python=3.11
pip install -r requirements.txt

# For NPU users (e.g., Ascend)
conda env create -f environment_npu.yaml

# Additional dependencies
apt-get install ffmpeg  # Required for full video processing
pip install flash-attn==2.6.3 --no-build-isolation
```

Edit demo.py:

```python
hf_qwen2vl7b_path = "your/local/path/to/Qwen2-VL-7B-Instruct"
# NPU users: config_path = 'configs/demo_npu.yaml'
```

To use LLaVA-Video checkpoints, first convert the weights to the Hugging Face format:

```shell
python scripts/utils/convert_llava_video_weights_to_hf.py \
    --text_model_id /path_to/Qwen2-7B-Instruct \
    --vision_model_id /path_to/siglip-so400m-patch14-384 \
    --output_hub_path /path_to/llava-video-qwen2-7b-hf \
    --old_state_dict_id /path_to/LLaVAVideoQwen2_7B
```

Then run the demo:

```shell
python demo.py
```

```shell
# Main results (paper configuration: temporal + AdaKV, 2048 frames)
bash main_results.sh

# Ablation study (1024 frames, 4 configs × 4 datasets)
bash ablation.sh
```

Results are saved in ./results.
| Benchmark | Frames | FPS | Score |
|---|---|---|---|
| MLVU (M-AVG) | 2048 | 2 | 75.2 |
| LongVideoBench | 2048 | 2 | 61.6 |
| LVBench | 2048 | 2 | 50.4 |
| Video-MME | 2048 | 4 | 64.8 |
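For reference, the unweighted mean over the four benchmark scores above can be computed with a few lines (the averaging itself is just a convenience; the paper reports scores per benchmark):

```python
# Unweighted mean over the four benchmark scores reported above.
scores = {"MLVU": 75.2, "LongVideoBench": 61.6, "LVBench": 50.4, "Video-MME": 64.8}
average = sum(scores.values()) / len(scores)
print(round(average, 2))  # 63.0
```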
We conduct ablation experiments at 1024 frames (4× the 256-frame setting used in the paper) to study how each component behaves when scaling to more frames. Four configurations are compared:
| Config | Temporal | Layer Allocation | Description |
|---|---|---|---|
| no_both | ✗ | Even | Baseline |
| no_layer | ✓ | Even | Temporal adaptation only |
| no_temporal | ✗ | AdaKV | Layer allocation only |
| full | ✓ | AdaKV | Full method (paper) |
Results (overall accuracy):
| Config | LVBench | LongVideoBench | MLVU | VideoMME | Avg |
|---|---|---|---|---|---|
| Baseline (no_both) | 49.19 | 61.40 | 75.63 | 66.67 | 63.22 |
| Temporal only (no_layer) | 49.97 | 61.48 | 75.94 | 66.63 | 63.51 |
| AdaKV only (no_temporal) | 48.55 | 61.65 | 75.50 | 66.52 | 63.06 |
| Full (full) | 48.48 | 62.22 | 75.41 | 66.19 | 63.08 |
Key observations at 1024-frame scale:
- Temporal adaptation remains consistently beneficial: it improves performance on LVBench (+0.78) and MLVU (+0.31), with neutral impact on the other two benchmarks. This confirms the generalizability of the temporal adaptation mechanism.
- Layer allocation shows dataset-dependent behavior: AdaKV layer allocation benefits LongVideoBench (+0.82 when combined with temporal), where subtitle-rich prompts create distinct cross-modal attention patterns across layers. However, it has negative impact on LVBench (−0.64) and VideoMME (−0.48). This divergence at higher frame counts warrants further investigation — potentially through more fine-grained layer-wise budget strategies or dataset-adaptive allocation.
- LongVideoBench is unique: its questions include full subtitle transcripts (~3000 tokens avg), creating a fundamentally different attention landscape compared to purely visual benchmarks.
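The per-benchmark deltas quoted in the observations can be reproduced directly from the ablation table (a quick sketch using the numbers above; column order is LVBench, LongVideoBench, MLVU, VideoMME):

```python
# Overall accuracy from the ablation table (LVBench, LongVideoBench,
# MLVU, VideoMME).
acc = {
    "no_both":     [49.19, 61.40, 75.63, 66.67],
    "no_layer":    [49.97, 61.48, 75.94, 66.63],
    "no_temporal": [48.55, 61.65, 75.50, 66.52],
    "full":        [48.48, 62.22, 75.41, 66.19],
}

# Temporal adaptation alone vs. baseline: +0.78 on LVBench, +0.31 on MLVU.
temporal_gain = [round(a - b, 2) for a, b in zip(acc["no_layer"], acc["no_both"])]

# Full method vs. baseline: +0.82 on LongVideoBench, -0.48 on VideoMME.
full_gain = [round(a - b, 2) for a, b in zip(acc["full"], acc["no_both"])]
```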
Pending final release
