
A.I.R. — Adaptive, Iterative, and Reasoning-based Frame Selection for Video QA

License: MIT · arXiv · OpenReview

A.I.R. motivation: CLIP-only scoring misses relevance, while dense VLM analysis is prohibitively expensive.

Reference implementation of "A.I.R.: Enabling Adaptive, Iterative, and Reasoning-based Frame Selection For Video Question Answering", accepted at ICLR 2026.

A.I.R. is a training-free framework for long-video question answering. It combines CLIP-based coarse filtering, GMM-driven adaptive event segmentation, and a VLM-guided iterative refinement loop, keeping only the frames that actually help the downstream VLM answer the question.

The code plugs into lmms-eval as a new model named hvp, and is compatible with Qwen2.5-VL, Qwen2-VL, InternVL3, LLaVA-OneVision, and VILA-1.5 backbones.

News

  • 2026-01 — A.I.R. accepted to ICLR 2026.
  • 2025-10 — Preprint released on arXiv: 2510.04428.

Key ideas

  • Adaptive event segmentation. A query-conditioned Gaussian Mixture Model (GMM) partitions the video into a small number of events from CLIP similarity statistics — no fixed stride, no fixed scene count.
  • Density-aware sampling. Each event contributes frames in proportion to its duration and query relevance.
  • Iterative VLM refinement. A short VLM loop ranks the remaining candidates and only forwards high-potential frames for deeper reasoning, adaptively allocating compute.
  • Plug-and-play. Works with any modern video-VLM backbone without fine-tuning.
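
The first two ideas can be sketched as follows. This is an illustrative Python sketch, not the repository's implementation: it uses scikit-learn's GaussianMixture with BIC model selection, assumes per-frame CLIP similarities have already been computed, and omits temporal-contiguity handling; the paper's exact segmentation and allocation criteria may differ.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def segment_events(similarities: np.ndarray, max_events: int = 8) -> np.ndarray:
    """Partition per-frame CLIP similarity scores into events with a GMM,
    choosing the component count by BIC (illustrative criterion only)."""
    x = similarities.reshape(-1, 1)
    best, best_bic = None, np.inf
    for k in range(1, max_events + 1):
        gmm = GaussianMixture(n_components=k, random_state=0).fit(x)
        bic = gmm.bic(x)
        if bic < best_bic:
            best, best_bic = gmm, bic
    return best.predict(x)  # one event label per frame

def allocate_frames(labels: np.ndarray, similarities: np.ndarray,
                    budget: int) -> dict:
    """Density-aware sampling: give each event a share of the frame budget
    proportional to its duration weighted by its mean query relevance."""
    weights = {}
    for e in np.unique(labels):
        mask = labels == e
        weights[int(e)] = mask.sum() * similarities[mask].mean()
    total = sum(weights.values())
    return {e: max(1, round(budget * w / total)) for e, w in weights.items()}
```

A long, barely relevant event thus receives few frames, while a short but highly query-relevant event can still claim a meaningful share of the budget.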

Repository layout

AIR/
├── air/                       # Core package (`import air`)
│   ├── __init__.py
│   ├── model.py               # H_VPModel: coarse filtering + iterative refinement
│   ├── hvp.py                 # lmms-eval model wrapper (registers as "hvp")
│   ├── base_models/           # Backbone adapters
│   │   ├── clip_model.py      #   CLIP / OpenCLIP / EVA-CLIP / LongCLIP / HF-CLIP
│   │   ├── Qwen2_5VL.py
│   │   ├── Qwen2_VL.py
│   │   ├── InternVL3.py
│   │   ├── Llava_OV.py
│   │   ├── VILA1_5.py
│   │   └── utils/qwen_vl_utils.py
│   └── Process_Utils/         # Prompting, sampling, video pre-processing
│       ├── process_messages.py
│       ├── process_video.py
│       └── sample_logic.py
├── baselines/                 # Reference VLM adapters for lmms-eval
│   ├── qwen2_5_vl.py
│   ├── qwen2_vl.py
│   ├── internvl2.py
│   ├── llava_onevision.py
│   └── vila.py
├── benchmark_utils/           # Task-specific helpers (MME, MLVU, LVB, NExT-QA)
├── configs/                   # lmms-eval YAML configs
├── scripts/slurm/             # Sanitised example SLURM launchers
├── assets/                    # Figures used in this README
├── requirements.txt
├── LICENSE                    # MIT
└── README.md

Installation

  1. Install a CUDA-matched PyTorch build from https://pytorch.org/get-started/locally/ (tested with torch 2.3+).

  2. Install the remaining dependencies:

    pip install -r requirements.txt
    pip install git+https://github.com/openai/CLIP.git
    pip install git+https://github.com/EvolvingLMMs-Lab/lmms-eval.git
  3. Optional backbones that are distributed as source only:

    Backbone          How to install
    VILA-1.5          git clone https://github.com/NVlabs/VILA and set VILA_PATH
    LLaVA-OneVision   git clone https://github.com/LLaVA-VL/LLaVA-NeXT and add it to PYTHONPATH
    Long-CLIP         git clone https://github.com/beichenzbc/Long-CLIP and set LONGCLIP_PATH (plus LONGCLIP_CKPT_DIR)
  4. (Optional) Install FlashAttention-2 for the Qwen backbones:

    pip install flash-attn --no-build-isolation

Environment variables

A.I.R. does not hard-code any absolute paths. Configure it via env vars:

Variable               Purpose                                            Default
AIR_CLIP_CACHE_DIR     Directory used to cache per-video CLIP features    ./clip_cache
HF_HOME                Hugging Face cache root                            ~/.cache/huggingface
HF_TOKEN               Hugging Face token for gated datasets/models       unset
VILA_PATH              Local clone of NVlabs/VILA                         unset
VILA_SIGLIP_PATH       SigLIP weights path (VILA baseline only)           $VILA_PATH/siglip-so400m-patch14-384
LONGCLIP_PATH          Local clone of Long-CLIP                           unset
LONGCLIP_CKPT_DIR      Directory containing longclip-B.pt / longclip-L.pt unset
DECORD_EOF_RETRY_MAX   Decord retry budget for corrupt videos             40960
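
A typical shell setup might look like the following; every path below is a placeholder for your own machine, and only the variables you actually need have to be set:

```shell
# Example A.I.R. environment setup (all paths are placeholders).
export AIR_CLIP_CACHE_DIR=/scratch/$USER/clip_cache   # per-video CLIP feature cache
export HF_HOME=/scratch/$USER/huggingface             # Hugging Face cache root
export HF_TOKEN=hf_xxx                                # only needed for gated repos

# Source-only backbones (see the Installation section):
export VILA_PATH=$HOME/src/VILA
export LONGCLIP_PATH=$HOME/src/Long-CLIP
export LONGCLIP_CKPT_DIR=$HOME/checkpoints/longclip

export DECORD_EOF_RETRY_MAX=40960                     # decord retry budget for corrupt videos
```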

Running benchmarks with lmms-eval

The air/hvp.py module registers A.I.R. as a new lmms-eval model named hvp. Typical invocation:

accelerate launch --num_processes=4 --main_process_port=32397 -m lmms_eval \
    --model=hvp \
    --model_args=pretrained=Qwen/Qwen2.5-VL-7B-Instruct,device_map=auto \
    --tasks=videomme \
    --batch_size=1 \
    --output_path=./logs_air/ \
    --log_samples

The backbone is auto-detected from the pretrained substring (Qwen2.5, Qwen2, InternVL, llava-onevision, or VILA); there is no separate backbone argument. See air/hvp.py for the full list of --model_args keys.

Example SLURM launch scripts live in scripts/slurm/. They set up the environment but intentionally do not hard-code a partition, account, or HF_TOKEN; fill those in for your cluster.

Supported benchmarks

Datasets are pulled automatically by lmms-eval from Hugging Face the first time a task runs (you may need an HF_TOKEN for gated splits).

Benchmark        YAML config                                               Helper
Video-MME        configs/videomme.yaml, configs/videomme_w_subtitle.yaml   benchmark_utils/mme_utils.py
MLVU             configs/mlvu_dev.yaml, configs/mlvu_test.yaml             benchmark_utils/mlvu_utils.py
LongVideoBench   configs/longvideobench_val_v.yaml                         benchmark_utils/lvb_utils.py
NExT-QA          configs/nextqa_mc_test.yaml, configs/nextqa_oe_val.yaml   benchmark_utils/nextqa_utils.py
EgoSchema        (built-in egoschema task in lmms-eval)

Baselines

The baselines/ directory reproduces the standard lmms-eval adapters used in our paper for head-to-head comparisons (Qwen2.5-VL, Qwen2-VL, InternVL2/3, LLaVA-OneVision, VILA-1.5). Each file can be dropped into an lmms-eval tasks search path or invoked through the scripts/slurm/baseline_*.slurm templates.

Citation

If you find A.I.R. useful, please cite our ICLR 2026 paper:

@inproceedings{zou2026air,
  title     = {A.I.R.: Enabling Adaptive, Iterative, and Reasoning-based
               Frame Selection For Video Question Answering},
  author    = {Yuanhao Zou and Shengji Jin and Andong Deng and
               Youpeng Zhao and Jun Wang and Chen Chen},
  booktitle = {The Fourteenth International Conference on Learning
               Representations (ICLR)},
  year      = {2026},
  url       = {https://openreview.net/forum?id=SZVpOKw0YD}
}

License

Released under the MIT license — see LICENSE for the full text.

Acknowledgements

A.I.R. is built on top of the lmms-eval evaluation harness and borrows reference code from Qwen-VL, InternVL, LLaVA-OneVision / LLaVA-NeXT, VILA, OpenAI CLIP, OpenCLIP, EVA-CLIP, and Long-CLIP. Please also cite the upstream projects when using the corresponding backbones.
