VideoTemp-o3: Harmonizing Temporal Grounding and Video Understanding in Agentic Thinking-with-Videos
Illustration of the agentic pipeline in VideoTemp-o3. Given a video QA pair, the model performs on-demand temporal grounding to locate the most relevant segment, then refines it iteratively. Finally, it produces a reliable answer grounded in the pertinent visual evidence.
- Release the paper of VideoTemp-o3.
- Release training and evaluation code.
- Release the checkpoints of VideoTemp-o3.
- Release SFT and RL training data.
- Release VideoTemp-Bench.
```bash
conda create -n videotemp_o3 python=3.12 -y
conda activate videotemp_o3

# Our CUDA version is 12.9
# Install vLLM v0.11.0
pip install https://github.com/vllm-project/vllm/releases/download/v0.11.0/vllm-0.11.0+cu129-cp38-abi3-manylinux1_x86_64.whl

# Install ms-swift
pip install 'ms-swift[all]==3.10.0' -U

# Install flash-attn (CUDA 12 + PyTorch 2.8)
pip install https://github.com/Dao-AILab/flash-attention/releases/download/v2.8.1/flash_attn-2.8.1+cu12torch2.8cxx11abiFALSE-cp312-cp312-linux_x86_64.whl

# Install DeepSpeed
pip install deepspeed==0.16.9
```

Download the SFT data from Hugging Face and place it under `sft/data/`. The expected directory structure is:
```
sft/data/
├── wo_tool_call/               # cold-start data (no tool call)
│   ├── activitynet.jsonl
│   ├── charades.jsonl
│   ├── vidchapters.jsonl
│   ├── video_r1_image_mc.jsonl
│   └── video_r1_video.jsonl
└── wi_tool_call/               # tool-call data
    ├── activitynet.jsonl
    ├── longvila.jsonl
    └── qvhighlight.jsonl
```
Download the RL data from Hugging Face and place it under `rl/data/`. The expected directory structure is:

```
rl/data/
├── qa.jsonl          # video QA reward data
└── grounding.jsonl   # temporal grounding reward data
```
```bash
bash sft/sft.sh
```

RL training uses 6 GPUs for GRPO and 2 GPUs for the rollout engine.
Step 1 — Start the rollout engine (uses GPUs 6, 7):
```bash
bash rl/rollout.sh
```

Step 2 — Start GRPO training (uses GPUs 0–5), once the rollout engine is ready:
```bash
bash rl/grpo.sh
```

| Benchmark | Download |
|---|---|
| MLVU | https://huggingface.co/datasets/MLVU/MLVU_Test |
| Video-MMMU | https://huggingface.co/datasets/lmms-lab/VideoMMMU |
| Video-MME | https://huggingface.co/datasets/lmms-lab/Video-MME |
| LVBench | https://huggingface.co/datasets/zai-org/LVBench |
| VideoTemp-Bench | https://huggingface.co/datasets |
Place the downloaded data under the corresponding `eval/<benchmark>/data/` directory.
Step 1 — Deploy the vLLM inference engine:
```bash
bash eval/7b_deploy_256.sh
```

Step 2 — Run the evaluation script for each benchmark:
```bash
# VideoTemp-Bench (MCQ)
python eval/videotemp/videotemp.py

# VideoTemp-Bench (Grounding)
python eval/videotemp/videotemp-g.py

# Video-MME
python eval/videomme/videomme.py

# MLVU
python eval/mlvu/mlvu.py

# Video-MMMU
python eval/videommmu/videommmu.py

# LVBench
python eval/lvbench/lvbench.py
```

All benchmarks share a unified scoring script:
```bash
# VideoTemp MCQ (broken down by video duration)
python eval/score.py videotemp

# VideoTemp Grounding (mIoU + R@{0.3, 0.5, 0.7})
python eval/score.py videotemp-g

# Video-MME
python eval/score.py videomme --return_categories_accuracy --return_task_types_accuracy

# MLVU
python eval/score.py mlvu

# Video-MMMU
python eval/score.py videommmu

# LVBench
python eval/score.py lvbench
```

Each subcommand accepts `--input_file <path>` to override the default output path. Run `python eval/score.py <benchmark> --help` for details.
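For reference, the grounding metrics reported above (mIoU and R@{0.3, 0.5, 0.7}) follow the standard temporal-grounding definitions: IoU between a predicted time segment and the ground-truth segment, averaged over samples (mIoU), and the fraction of samples whose IoU clears each threshold (R@t). The helpers below are an illustrative sketch, not the released scoring script.

```python
def temporal_iou(pred, gt):
    """IoU of two time intervals given as (start, end) in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return inter / union if union > 0 else 0.0

def grounding_scores(preds, gts, thresholds=(0.3, 0.5, 0.7)):
    """Mean IoU and recall at each IoU threshold over paired segments."""
    ious = [temporal_iou(p, g) for p, g in zip(preds, gts)]
    miou = sum(ious) / len(ious)
    recalls = {t: sum(i >= t for i in ious) / len(ious) for t in thresholds}
    return miou, recalls
```

For example, a prediction of (0, 10) against a ground truth of (5, 15) overlaps for 5 s out of a 15 s union, giving an IoU of 1/3 — enough to count toward R@0.3 but not R@0.5 or R@0.7.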
If you find our work useful, please consider citing:
```bibtex
@article{liu2026videotemp,
  title={VideoTemp-o3: Harmonizing Temporal Grounding and Video Understanding in Agentic Thinking-with-Videos},
  author={Liu, Wenqi and Wang, Yunxiao and Ma, Shijie and Liu, Meng and Su, Qile and Zhang, Tianke and Fan, Haonan and Liu, Changyi and Jiang, Kaiyu and Chen, Jiankang and Tang, Kaiyu and Wen, Bin and Yang, Fan and Gao, Tingting and Li, Han and Wei, Yinwei and Song, Xuemeng},
  journal={arXiv preprint arXiv:2602.07801},
  year={2026}
}
```