Reference implementation of "A.I.R.: Enabling Adaptive, Iterative, and Reasoning-based Frame Selection For Video Question Answering", accepted at ICLR 2026.
A.I.R. is a training-free framework for long-video question answering. It combines CLIP-based coarse filtering, GMM-driven adaptive event segmentation, and a VLM-guided iterative refinement loop, keeping only the frames that actually help the downstream VLM answer the question.
The code plugs into lmms-eval as a new model named `hvp`, and is compatible with Qwen2.5-VL, Qwen2-VL, InternVL3, LLaVA-OneVision, and VILA-1.5 backbones.
- 2026-01 — A.I.R. accepted to ICLR 2026.
- 2025-10 — Preprint released on arXiv: 2510.04428.
- Adaptive event segmentation. A query-conditioned Gaussian Mixture Model partitions the video into a small number of events based on CLIP similarity statistics: no fixed stride, no fixed scene count.
- Density-aware sampling. Each event contributes frames in proportion to its duration and query relevance.
- Iterative VLM refinement. A short VLM loop ranks the remaining candidates and only forwards high-potential frames for deeper reasoning, adaptively allocating compute.
- Plug-and-play. Works with any modern video-VLM backbone without fine-tuning.
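The segmentation and budget-allocation ideas above can be sketched in a few lines. This is an illustrative, numpy-only sketch under stated assumptions (a 2-component 1-D GMM over per-frame CLIP similarities, hypothetical function names), not the paper's implementation:

```python
# Illustrative sketch of GMM-driven event segmentation + density-aware
# sampling. NOT the paper's code: a minimal 2-component 1-D EM fit,
# thresholding at the higher-mean ("query-relevant") component.
import numpy as np

def fit_gmm_1d(x, n_iter=50):
    """Minimal 2-component 1-D GMM fit via EM (for illustration only)."""
    mu = np.array([x.min(), x.max()], dtype=float)
    var = np.full(2, x.var() + 1e-6)
    pi = np.array([0.5, 0.5])
    for _ in range(n_iter):
        d = (x[:, None] - mu) ** 2
        # E-step: responsibilities under each Gaussian
        p = pi * np.exp(-0.5 * d / var) / np.sqrt(2 * np.pi * var)
        r = p / p.sum(axis=1, keepdims=True)
        # M-step: update weights, means, variances
        nk = r.sum(axis=0)
        mu = (r * x[:, None]).sum(axis=0) / nk
        var = (r * d).sum(axis=0) / nk + 1e-6
        pi = nk / len(x)
    return mu, var, pi

def segment_events(sims):
    """Group frames assigned to the higher-mean component into events."""
    mu, var, pi = fit_gmm_1d(sims)
    hi = int(np.argmax(mu))
    d = (sims[:, None] - mu) ** 2
    p = pi * np.exp(-0.5 * d / var) / np.sqrt(2 * np.pi * var)
    keep = p.argmax(axis=1) == hi
    events, start = [], None
    for i, k in enumerate(keep):          # contiguous runs -> (start, end)
        if k and start is None:
            start = i
        elif not k and start is not None:
            events.append((start, i)); start = None
    if start is not None:
        events.append((start, len(keep)))
    return events

def allocate_frames(events, budget):
    """Density-aware sampling: split the budget by event length.
    (Rounding means the allocations may not sum exactly to budget.)"""
    lengths = np.array([e - s for s, e in events], dtype=float)
    return np.maximum(1, np.round(budget * lengths / lengths.sum())).astype(int)
```

In the real pipeline the kept events would then be passed to the VLM refinement loop; this sketch only shows the coarse-to-event stage.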
```
AIR/
├── air/                      # Core package (`import air`)
│   ├── __init__.py
│   ├── model.py              # H_VPModel: coarse filtering + iterative refinement
│   ├── hvp.py                # lmms-eval model wrapper (registers as "hvp")
│   ├── base_models/          # Backbone adapters
│   │   ├── clip_model.py     # CLIP / OpenCLIP / EVA-CLIP / LongCLIP / HF-CLIP
│   │   ├── Qwen2_5VL.py
│   │   ├── Qwen2_VL.py
│   │   ├── InternVL3.py
│   │   ├── Llava_OV.py
│   │   ├── VILA1_5.py
│   │   └── utils/qwen_vl_utils.py
│   └── Process_Utils/        # Prompting, sampling, video pre-processing
│       ├── process_messages.py
│       ├── process_video.py
│       └── sample_logic.py
├── baselines/                # Reference VLM adapters for lmms-eval
│   ├── qwen2_5_vl.py
│   ├── qwen2_vl.py
│   ├── internvl2.py
│   ├── llava_onevision.py
│   └── vila.py
├── benchmark_utils/          # Task-specific helpers (MME, MLVU, LVB, NExT-QA)
├── configs/                  # lmms-eval YAML configs
├── scripts/slurm/            # Sanitised example SLURM launchers
├── assets/                   # Figures used in this README
├── requirements.txt
├── LICENSE                   # MIT
└── README.md
```
- Install a CUDA-matched PyTorch build from https://pytorch.org/get-started/locally/ (tested with torch 2.3+).
- Install the remaining dependencies:

  ```bash
  pip install -r requirements.txt
  pip install git+https://github.com/openai/CLIP.git
  pip install git+https://github.com/EvolvingLMMs-Lab/lmms-eval.git
  ```

- Optional backbones that ship source-only:

  | Backbone | How to install |
  |---|---|
  | VILA-1.5 | `git clone https://github.com/NVlabs/VILA` and set `VILA_PATH` |
  | LLaVA-OneVision | `git clone https://github.com/LLaVA-VL/LLaVA-NeXT` and add it to `PYTHONPATH` |
  | Long-CLIP | `git clone https://github.com/beichenzbc/Long-CLIP` and set `LONGCLIP_PATH` (plus `LONGCLIP_CKPT_DIR`) |

- (Optional) Install FlashAttention-2 for the Qwen backbones:

  ```bash
  pip install flash-attn --no-build-isolation
  ```
A.I.R. does not hard-code any absolute paths. Configure it via env vars:
| Variable | Purpose | Default |
|---|---|---|
| `AIR_CLIP_CACHE_DIR` | Directory used to cache per-video CLIP features | `./clip_cache` |
| `HF_HOME` | Hugging Face cache root | `~/.cache/huggingface` |
| `HF_TOKEN` | Hugging Face token for gated datasets/models | unset |
| `VILA_PATH` | Local clone of NVlabs/VILA | unset |
| `VILA_SIGLIP_PATH` | SigLIP weights path (only for the VILA baseline) | `$VILA_PATH/siglip-so400m-patch14-384` |
| `LONGCLIP_PATH` | Local clone of Long-CLIP | unset |
| `LONGCLIP_CKPT_DIR` | Directory containing `longclip-B.pt` / `longclip-L.pt` | unset |
| `DECORD_EOF_RETRY_MAX` | Decord retry budget for corrupt videos | `40960` |
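For example, a typical shell setup before launching an evaluation might look like the following (all paths are placeholders; adjust for your machine or cluster):

```shell
# Placeholder paths -- adjust for your environment.
export AIR_CLIP_CACHE_DIR=./clip_cache     # per-video CLIP feature cache
export HF_HOME=./hf_cache                  # Hugging Face cache root
export DECORD_EOF_RETRY_MAX=40960          # decord retry budget
mkdir -p "$AIR_CLIP_CACHE_DIR" "$HF_HOME"  # caches must exist before launch
```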
The air/hvp.py module registers A.I.R. as a new lmms-eval model named
hvp. Typical invocation:
```bash
accelerate launch --num_processes=4 --main_process_port=32397 -m lmms_eval \
    --model=hvp \
    --model_args=pretrained=Qwen/Qwen2.5-VL-7B-Instruct,device_map=auto \
    --tasks=videomme \
    --batch_size=1 \
    --output_path=./logs_air/ \
    --log_samples
```

The backbone is auto-detected from the `pretrained` substring (`Qwen2.5`, `Qwen2`, `InternVL`, `llava-onevision`, or `VILA`); there is no separate backbone argument. See `air/hvp.py` for the full list of `--model_args` keys.
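For reference, the substring matching can be mimicked roughly like this. This is a hypothetical sketch (the authoritative logic lives in `air/hvp.py`); the returned names simply mirror the adapter file names in `air/base_models/`:

```python
# Hypothetical sketch of the backbone auto-detection described above.
# Order matters: "llava-onevision-qwen2-..." model IDs also contain
# "qwen2", so the LLaVA check must run before the Qwen checks.
def detect_backbone(pretrained: str) -> str:
    name = pretrained.lower()
    for key, backbone in [
        ("llava-onevision", "Llava_OV"),
        ("internvl", "InternVL3"),
        ("vila", "VILA1_5"),
        ("qwen2.5", "Qwen2_5VL"),
        ("qwen2", "Qwen2_VL"),
    ]:
        if key in name:
            return backbone
    raise ValueError(f"Unrecognised backbone in pretrained={pretrained!r}")
```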
Example SLURM launch scripts live in scripts/slurm/. They set up the
environment but intentionally do not hard-code a partition, account,
or HF_TOKEN; fill those in for your cluster.
Datasets are pulled automatically by lmms-eval from Hugging Face the
first time a task runs (you may need an HF_TOKEN for gated splits).
| Benchmark | YAML config | Helper |
|---|---|---|
| Video-MME | `configs/videomme.yaml`, `configs/videomme_w_subtitle.yaml` | `benchmark_utils/mme_utils.py` |
| MLVU | `configs/mlvu_dev.yaml`, `configs/mlvu_test.yaml` | `benchmark_utils/mlvu_utils.py` |
| LongVideoBench | `configs/longvideobench_val_v.yaml` | `benchmark_utils/lvb_utils.py` |
| NExT-QA | `configs/nextqa_mc_test.yaml`, `configs/nextqa_oe_val.yaml` | `benchmark_utils/nextqa_utils.py` |
| EgoSchema | built-in `egoschema` task in lmms-eval | — |
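To sweep several benchmarks with one backbone, a simple loop over task names works. This dry-run sketch only prints the commands (drop the `echo` to execute); the task names are illustrative and must match whatever your lmms-eval task registry exposes:

```shell
# Dry run: print one launch command per benchmark into launch_plan.txt.
# Remove "echo" to actually execute each command.
for task in videomme mlvu_dev longvideobench_val_v egoschema; do
  echo accelerate launch --num_processes=4 -m lmms_eval \
    --model=hvp \
    --model_args=pretrained=Qwen/Qwen2.5-VL-7B-Instruct,device_map=auto \
    --tasks="$task" \
    --batch_size=1 \
    --output_path="./logs_air/$task/" \
    --log_samples
done > launch_plan.txt
```

Writing per-task `--output_path` directories keeps logs from different benchmarks separate.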
The baselines/ directory reproduces the standard lmms-eval adapters used in our paper for head-to-head comparisons (Qwen2.5-VL, Qwen2-VL, InternVL2/3, LLaVA-OneVision, VILA-1.5). Each file can be dropped into lmms-eval's model search path or invoked through the scripts/slurm/baseline_*.slurm templates.
If you find A.I.R. useful, please cite our ICLR 2026 paper:
```bibtex
@inproceedings{zou2026air,
  title     = {A.I.R.: Enabling Adaptive, Iterative, and Reasoning-based
               Frame Selection For Video Question Answering},
  author    = {Yuanhao Zou and Shengji Jin and Andong Deng and
               Youpeng Zhao and Jun Wang and Chen Chen},
  booktitle = {The Fourteenth International Conference on Learning
               Representations (ICLR)},
  year      = {2026},
  url       = {https://openreview.net/forum?id=SZVpOKw0YD}
}
```

Released under the MIT license; see LICENSE for the full text.
A.I.R. is built on top of the lmms-eval evaluation harness and borrows reference code from Qwen-VL, InternVL, LLaVA-OneVision / LLaVA-NeXT, VILA, OpenAI CLIP, OpenCLIP, EVA-CLIP, and Long-CLIP. Please also cite the upstream projects when using the corresponding backbones.
