Breaking the "Memory Wall" for MLLMs with Adaptive Video Compression (ACL 2025)
Xiao Wang1,2,‡, Qingyi Si2,‡, Jianlong Wu1*, Shiyu Zhu3, Li Cao2, Liqiang Nie1*
1 Harbin Institute of Technology, Shenzhen
2 Huawei Technologies Co., Ltd.
3 Shandong University
‡ Equal contribution
* Corresponding authors
Have a coding agent (Claude Code, Cursor, etc.) reproduce all paper results end-to-end with a single prompt:
```
Read AGENTS.md and reproduce the AdaReTaKe paper results end-to-end.
```
AGENTS.md contains everything the agent needs: environment setup, dataset preparation, eval commands, expected scores, and common failure modes.
AdaReTaKe is an advanced video compression framework designed for Multimodal Large Language Models (MLLMs). By adaptively reducing uneven visual redundancy across timestamps and model layers, it:
✅ Extends context capacity from 256 to 2048 frames
✅ Theoretically minimizes compression loss via adaptive ratio allocation
✅ Outperforms SOTA by +2.3% (7B) and +2.8% (72B) on four benchmarks
| Feature | Innovation |
|---|---|
| Adaptive Redundancy Reduction | Layer-wise + timestamp-wise compression for maximal context retention |
| Scalability | Validated on 7B to 72B MLLMs with consistent gains |
| Theoretical Guarantee | Compression ratio allocation minimizes the loss upper bound |
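To make the allocation idea concrete, here is a minimal sketch of budget allocation under a fixed token budget. This is an illustrative simplification, not the paper's actual algorithm: `allocate_budgets`, the per-chunk redundancy estimates, and the proportional rule are hypothetical stand-ins for the theoretically derived allocation that minimizes the loss upper bound.

```python
# Illustrative sketch only: the paper derives its allocation from a loss
# upper bound; here we simply retain more tokens where redundancy is low.

def allocate_budgets(redundancy, total_budget):
    """Split a fixed token budget across video chunks (or layers),
    giving larger shares to chunks with lower estimated redundancy.

    redundancy: per-chunk redundancy estimates in [0, 1)
    total_budget: total number of visual tokens to retain
    """
    importance = [1.0 - r for r in redundancy]
    scale = total_budget / sum(importance)
    return [w * scale for w in importance]

# A highly redundant chunk (0.9) keeps far fewer tokens than a dense one (0.2).
budgets = allocate_budgets([0.9, 0.5, 0.2], total_budget=1000)
```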
```shell
# For GPU users
conda create -n retake python=3.11
pip install -r requirements.txt

# For NPU users (e.g., Ascend)
conda env create -f environment_npu.yaml

# Additional dependencies
apt-get install ffmpeg  # Required for full video processing
pip install flash-attn==2.6.3 --no-build-isolation
```

Edit demo.py:

```python
hf_qwen2vl7b_path = "your/local/path/to/Qwen2-VL-7B-Instruct"
# NPU users: config_path = 'configs/demo_npu.yaml'
```

To use LLaVA-Video checkpoints, first convert the weights to the Hugging Face format:

```shell
python scripts/utils/convert_llava_video_weights_to_hf.py \
    --text_model_id /path_to/Qwen2-7B-Instruct \
    --vision_model_id /path_to/siglip-so400m-patch14-384 \
    --output_hub_path /path_to/llava-video-qwen2-7b-hf \
    --old_state_dict_id /path_to/LLaVAVideoQwen2_7B
```

Then run the demo:

```shell
python demo.py
```

```shell
# Main results (paper configuration: temporal + AdaKV, 2048 frames)
bash main_results.sh

# Ablation study (1024 frames, 4 configs × 4 datasets)
bash ablation.sh
```

Results are saved in ./results.
| Benchmark | Frames | FPS | Score |
|---|---|---|---|
| MLVU (M-AVG) | 2048 | 2 | 75.2 |
| LongVideoBench | 2048 | 2 | 61.6 |
| LVBench | 2048 | 2 | 50.4 |
| Video-MME | 2048 | 4 | 64.8 |
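For reference, the unweighted mean over the four benchmark scores above can be computed with a few lines (the averaging itself is just a convenience; the paper reports scores per benchmark):

```python
# Unweighted mean over the four benchmark scores reported above.
scores = {"MLVU": 75.2, "LongVideoBench": 61.6, "LVBench": 50.4, "Video-MME": 64.8}
average = sum(scores.values()) / len(scores)
print(round(average, 2))  # 63.0
```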
We conduct ablation experiments at 1024 frames (4× the 256-frame setting used in the paper) to study how each component behaves when scaling to more frames. Four configurations are compared:
| Config | Temporal | Layer Allocation | Description |
|---|---|---|---|
| no_both | ✗ | Even | Baseline |
| no_layer | ✓ | Even | Temporal adaptation only |
| no_temporal | ✗ | AdaKV | Layer allocation only |
| full | ✓ | AdaKV | Full method (paper) |
Results (overall accuracy):
| Config | LVBench | LongVideoBench | MLVU | VideoMME | Avg |
|---|---|---|---|---|---|
| Baseline (no_both) | 49.19 | 61.40 | 75.63 | 66.67 | 63.22 |
| Temporal only (no_layer) | 49.97 | 61.48 | 75.94 | 66.63 | 63.51 |
| AdaKV only (no_temporal) | 48.55 | 61.65 | 75.50 | 66.52 | 63.06 |
| Full (full) | 48.48 | 62.22 | 75.41 | 66.19 | 63.08 |
Key observations at 1024-frame scale:
- Temporal adaptation remains consistently beneficial: it improves performance on LVBench (+0.78) and MLVU (+0.31), with neutral impact on the other two benchmarks. This confirms the generalizability of the temporal adaptation mechanism.
- Layer allocation shows dataset-dependent behavior: AdaKV layer allocation benefits LongVideoBench (+0.82 when combined with temporal), where subtitle-rich prompts create distinct cross-modal attention patterns across layers. However, it has negative impact on LVBench (−0.64) and VideoMME (−0.48). This divergence at higher frame counts warrants further investigation — potentially through more fine-grained layer-wise budget strategies or dataset-adaptive allocation.
- LongVideoBench is unique: its questions include full subtitle transcripts (~3000 tokens avg), creating a fundamentally different attention landscape compared to purely visual benchmarks.
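The per-benchmark deltas quoted in the observations can be reproduced directly from the ablation table (a quick sketch using the numbers above; column order is LVBench, LongVideoBench, MLVU, VideoMME):

```python
# Overall accuracy from the ablation table (LVBench, LongVideoBench,
# MLVU, VideoMME).
acc = {
    "no_both":     [49.19, 61.40, 75.63, 66.67],
    "no_layer":    [49.97, 61.48, 75.94, 66.63],
    "no_temporal": [48.55, 61.65, 75.50, 66.52],
    "full":        [48.48, 62.22, 75.41, 66.19],
}

# Temporal adaptation alone vs. baseline: +0.78 on LVBench, +0.31 on MLVU.
temporal_gain = [round(a - b, 2) for a, b in zip(acc["no_layer"], acc["no_both"])]

# Full method vs. baseline: +0.82 on LongVideoBench, -0.48 on VideoMME.
full_gain = [round(a - b, 2) for a, b in zip(acc["full"], acc["no_both"])]
```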
Pending final release
