Reference implementation of "A.I.R.: Enabling Adaptive, Iterative, and Reasoning-based Frame Selection For Video Question Answering", accepted at ICLR 2026.
A.I.R. is a training-free framework for long-video question answering. It combines CLIP-based coarse filtering, GMM-driven adaptive event segmentation, and a VLM-guided iterative refinement loop, keeping only the frames that actually help the downstream VLM answer the question.
The code plugs into lmms-eval as a new model named `hvp`, and is compatible with Qwen2.5-VL, Qwen2-VL, InternVL3, LLaVA-OneVision, and VILA-1.5 backbones.
- 2026-01 — A.I.R. accepted to ICLR 2026.
- 2025-10 — Preprint released on arXiv: 2510.04428.
- Adaptive event segmentation. A query-conditioned Gaussian Mixture Model partitions the video into a small number of events based on CLIP similarity statistics: no fixed stride, no fixed scene count.
- Density-aware sampling. Each event contributes frames in proportion to its duration and query relevance.
- Iterative VLM refinement. A short VLM loop ranks the remaining candidates and only forwards high-potential frames for deeper reasoning, adaptively allocating compute.
- Plug-and-play. Works with any modern video-VLM backbone without fine-tuning.
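The segmentation and budget-allocation ideas above can be sketched in a few lines. This is an illustrative, numpy-only sketch under stated assumptions (a 2-component 1-D GMM over per-frame CLIP similarities, hypothetical function names), not the paper's implementation:

```python
# Illustrative sketch of GMM-driven event segmentation + density-aware
# sampling. NOT the paper's code: a minimal 2-component 1-D EM fit,
# thresholding at the higher-mean ("query-relevant") component.
import numpy as np

def fit_gmm_1d(x, n_iter=50):
    """Minimal 2-component 1-D GMM fit via EM (for illustration only)."""
    mu = np.array([x.min(), x.max()], dtype=float)
    var = np.full(2, x.var() + 1e-6)
    pi = np.array([0.5, 0.5])
    for _ in range(n_iter):
        d = (x[:, None] - mu) ** 2
        # E-step: responsibilities under each Gaussian
        p = pi * np.exp(-0.5 * d / var) / np.sqrt(2 * np.pi * var)
        r = p / p.sum(axis=1, keepdims=True)
        # M-step: update weights, means, variances
        nk = r.sum(axis=0)
        mu = (r * x[:, None]).sum(axis=0) / nk
        var = (r * d).sum(axis=0) / nk + 1e-6
        pi = nk / len(x)
    return mu, var, pi

def segment_events(sims):
    """Group frames assigned to the higher-mean component into events."""
    mu, var, pi = fit_gmm_1d(sims)
    hi = int(np.argmax(mu))
    d = (sims[:, None] - mu) ** 2
    p = pi * np.exp(-0.5 * d / var) / np.sqrt(2 * np.pi * var)
    keep = p.argmax(axis=1) == hi
    events, start = [], None
    for i, k in enumerate(keep):          # contiguous runs -> (start, end)
        if k and start is None:
            start = i
        elif not k and start is not None:
            events.append((start, i)); start = None
    if start is not None:
        events.append((start, len(keep)))
    return events

def allocate_frames(events, budget):
    """Density-aware sampling: split the budget by event length.
    (Rounding means the allocations may not sum exactly to budget.)"""
    lengths = np.array([e - s for s, e in events], dtype=float)
    return np.maximum(1, np.round(budget * lengths / lengths.sum())).astype(int)
```

In the real pipeline the kept events would then be passed to the VLM refinement loop; this sketch only shows the coarse-to-event stage.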
```
AIR/
├── air/                      # Core package (`import air`)
│   ├── __init__.py
│   ├── model.py              # H_VPModel: coarse filtering + iterative refinement
│   ├── hvp.py                # lmms-eval model wrapper (registers as "hvp")
│   ├── base_models/          # Backbone adapters
│   │   ├── clip_model.py     # CLIP / OpenCLIP / EVA-CLIP / LongCLIP / HF-CLIP
│   │   ├── Qwen2_5VL.py
│   │   ├── Qwen2_VL.py
│   │   ├── InternVL3.py
│   │   ├── Llava_OV.py
│   │   ├── VILA1_5.py
│   │   └── utils/qwen_vl_utils.py
│   └── Process_Utils/        # Prompting, sampling, video pre-processing
│       ├── process_messages.py
│       ├── process_video.py
│       └── sample_logic.py
├── baselines/                # Reference VLM adapters for lmms-eval
│   ├── qwen2_5_vl.py
│   ├── qwen2_vl.py
│   ├── internvl2.py
│   ├── llava_onevision.py
│   └── vila.py
├── benchmark_utils/          # Task-specific helpers (MME, MLVU, LVB, NExT-QA)
├── configs/                  # lmms-eval YAML configs
├── scripts/slurm/            # Sanitised example SLURM launchers
├── assets/                   # Figures used in this README
├── requirements.txt
├── LICENSE                   # MIT
└── README.md
```
- Install a CUDA-matched PyTorch build from https://pytorch.org/get-started/locally/ (tested with torch 2.3+).
- Install the remaining dependencies:

  ```bash
  pip install -r requirements.txt
  pip install git+https://github.com/openai/CLIP.git
  pip install git+https://github.com/EvolvingLMMs-Lab/lmms-eval.git
  ```

- Optional backbones that ship source-only:

  | Backbone | How to install |
  |---|---|
  | VILA-1.5 | `git clone https://github.com/NVlabs/VILA` and set `VILA_PATH` |
  | LLaVA-OneVision | `git clone https://github.com/LLaVA-VL/LLaVA-NeXT` and add it to `PYTHONPATH` |
  | Long-CLIP | `git clone https://github.com/beichenzbc/Long-CLIP` and set `LONGCLIP_PATH` (plus `LONGCLIP_CKPT_DIR`) |

- (Optional) Install FlashAttention-2 for the Qwen backbones:

  ```bash
  pip install flash-attn --no-build-isolation
  ```
A.I.R. does not hard-code any absolute paths. Configure it via env vars:
| Variable | Purpose | Default |
|---|---|---|
| `AIR_CLIP_CACHE_DIR` | Directory used to cache per-video CLIP features | `./clip_cache` |
| `HF_HOME` | Hugging Face cache root | `~/.cache/huggingface` |
| `HF_TOKEN` | Hugging Face token for gated datasets/models | unset |
| `VILA_PATH` | Local clone of NVlabs/VILA | unset |
| `VILA_SIGLIP_PATH` | SigLIP weights path (only for the VILA baseline) | `$VILA_PATH/siglip-so400m-patch14-384` |
| `LONGCLIP_PATH` | Local clone of Long-CLIP | unset |
| `LONGCLIP_CKPT_DIR` | Directory containing `longclip-B.pt` / `longclip-L.pt` | unset |
| `DECORD_EOF_RETRY_MAX` | Decord retry budget for corrupt videos | `40960` |
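For example, a typical shell setup before launching an evaluation might look like the following (all paths are placeholders; adjust for your machine or cluster):

```shell
# Placeholder paths -- adjust for your environment.
export AIR_CLIP_CACHE_DIR=./clip_cache     # per-video CLIP feature cache
export HF_HOME=./hf_cache                  # Hugging Face cache root
export DECORD_EOF_RETRY_MAX=40960          # decord retry budget
mkdir -p "$AIR_CLIP_CACHE_DIR" "$HF_HOME"  # caches must exist before launch
```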
The air/hvp.py module registers A.I.R. as a new lmms-eval model named
hvp. Typical invocation:
```bash
accelerate launch --num_processes=4 --main_process_port=32397 -m lmms_eval \
    --model=hvp \
    --model_args=pretrained=Qwen/Qwen2.5-VL-7B-Instruct,device_map=auto \
    --tasks=videomme \
    --batch_size=1 \
    --output_path=./logs_air/ \
    --log_samples
```

The backbone is auto-detected from the `pretrained` substring (`Qwen2.5`, `Qwen2`, `InternVL`, `llava-onevision`, or `VILA`); there is no separate backbone argument. See `air/hvp.py` for the full list of `--model_args` keys.
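For reference, the substring matching can be mimicked roughly like this. This is a hypothetical sketch (the authoritative logic lives in `air/hvp.py`); the returned names simply mirror the adapter file names in `air/base_models/`:

```python
# Hypothetical sketch of the backbone auto-detection described above.
# Order matters: "llava-onevision-qwen2-..." model IDs also contain
# "qwen2", so the LLaVA check must run before the Qwen checks.
def detect_backbone(pretrained: str) -> str:
    name = pretrained.lower()
    for key, backbone in [
        ("llava-onevision", "Llava_OV"),
        ("internvl", "InternVL3"),
        ("vila", "VILA1_5"),
        ("qwen2.5", "Qwen2_5VL"),
        ("qwen2", "Qwen2_VL"),
    ]:
        if key in name:
            return backbone
    raise ValueError(f"Unrecognised backbone in pretrained={pretrained!r}")
```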
Example SLURM launch scripts live in scripts/slurm/. They set up the
environment but intentionally do not hard-code a partition, account,
or HF_TOKEN; fill those in for your cluster.
Datasets are pulled automatically by lmms-eval from Hugging Face the
first time a task runs (you may need an HF_TOKEN for gated splits).
| Benchmark | YAML config | Helper |
|---|---|---|
| Video-MME | `configs/videomme.yaml`, `configs/videomme_w_subtitle.yaml` | `benchmark_utils/mme_utils.py` |
| MLVU | `configs/mlvu_dev.yaml`, `configs/mlvu_test.yaml` | `benchmark_utils/mlvu_utils.py` |
| LongVideoBench | `configs/longvideobench_val_v.yaml` | `benchmark_utils/lvb_utils.py` |
| NExT-QA | `configs/nextqa_mc_test.yaml`, `configs/nextqa_oe_val.yaml` | `benchmark_utils/nextqa_utils.py` |
| EgoSchema | built-in `egoschema` task in lmms-eval | — |
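To sweep several benchmarks with one backbone, a simple loop over task names works. This dry-run sketch only prints the commands (drop the `echo` to execute); the task names are illustrative and must match whatever your lmms-eval task registry exposes:

```shell
# Dry run: print one launch command per benchmark into launch_plan.txt.
# Remove "echo" to actually execute each command.
for task in videomme mlvu_dev longvideobench_val_v egoschema; do
  echo accelerate launch --num_processes=4 -m lmms_eval \
    --model=hvp \
    --model_args=pretrained=Qwen/Qwen2.5-VL-7B-Instruct,device_map=auto \
    --tasks="$task" \
    --batch_size=1 \
    --output_path="./logs_air/$task/" \
    --log_samples
done > launch_plan.txt
```

Writing per-task `--output_path` directories keeps logs from different benchmarks separate.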
The baselines/ directory reproduces the standard lmms-eval adapters used in our paper for head-to-head comparisons (Qwen2.5-VL, Qwen2-VL, InternVL2/3, LLaVA-OneVision, VILA-1.5). Each file can be dropped into lmms-eval's model search path or invoked through the scripts/slurm/baseline_*.slurm templates.
If you find A.I.R. useful, please cite our ICLR 2026 paper:
```bibtex
@inproceedings{zou2026air,
  title     = {A.I.R.: Enabling Adaptive, Iterative, and Reasoning-based
               Frame Selection For Video Question Answering},
  author    = {Yuanhao Zou and Shengji Jin and Andong Deng and
               Youpeng Zhao and Jun Wang and Chen Chen},
  booktitle = {The Fourteenth International Conference on Learning
               Representations (ICLR)},
  year      = {2026},
  url       = {https://openreview.net/forum?id=SZVpOKw0YD}
}
```

Released under the MIT license; see LICENSE for the full text.
A.I.R. is built on top of the lmms-eval evaluation harness and borrows reference code from Qwen-VL, InternVL, LLaVA-OneVision / LLaVA-NeXT, VILA, OpenAI CLIP, OpenCLIP, EVA-CLIP, and Long-CLIP. Please also cite the upstream projects when using the corresponding backbones.
