VIR-Bench:
Evaluating Geospatial and Temporal Understanding of MLLMs via Travel Video Itinerary Reconstruction
Recent advances in multimodal large language models (MLLMs) have significantly enhanced video understanding capabilities, opening new possibilities for practical applications. Yet current video benchmarks focus largely on indoor scenes or short-range outdoor activities, leaving the challenges associated with long-distance travel largely unexplored. Mastering extended geospatial-temporal trajectories is critical for next-generation MLLMs, underpinning real-world tasks such as embodied-AI planning and navigation. To bridge this gap, we present VIR-Bench, a novel benchmark consisting of 200 travel videos that frames itinerary reconstruction as a challenging task designed to evaluate and push forward MLLMs' geospatial-temporal intelligence. Experimental results reveal that state-of-the-art MLLMs, including proprietary ones, struggle to achieve high scores, underscoring the difficulty of handling videos that span extended spatial and temporal scales. Moreover, we conduct an in-depth case study in which we develop a prototype travel-planning agent that leverages the insights gained from VIR-Bench. The agent's markedly improved itinerary recommendations verify that our evaluation protocol not only benchmarks models effectively but also translates into concrete performance gains in user-facing applications.
- 2025-11-14: Our paper was accepted by AAAI 2026!
- 2025-09-20: We released the benchmark together with its evaluation framework and agent implementations.
- VIR-Bench
- Experiments
- Download the Dataset
- Run Your Own Evaluation
- Travel-planning Agent
- Acknowledgement
- Citation
Overview: VIR-Bench is a benchmark to evaluate long-range geospatial-temporal understanding via itinerary reconstruction from travel vlog videos. The core output is a directed visiting order graph: nodes represent locations at three granularities (prefecture, city, and point of interest (POI)) and edges represent two relations, inclusion for spatial hierarchy and transition for temporal adjacency. The dataset comprises 200 travel vlogs filmed across Japan, a major inbound tourism destination, each accompanied by a manually annotated and double-reviewed visiting order graph.
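Concretely, a visiting order graph can be sketched as a typed edge list. The following is a minimal illustration with example locations, not items drawn from the dataset; the released graphs themselves are distributed in pickle and SVG formats.

```python
# A minimal sketch of a visiting order graph as a typed edge list.
# Location names below are illustrative only.
from dataclasses import dataclass

@dataclass(frozen=True)
class Edge:
    source: str
    target: str
    edge_type: str  # "inclusion" (spatial hierarchy) or "transition" (temporal adjacency)

# Nodes at three granularities: prefecture, city, POI.
nodes = {
    "Tokyo": "prefecture",
    "Taito City": "city",
    "Sumida City": "city",
    "Senso-ji": "poi",
    "Tokyo Skytree": "poi",
}

edges = [
    Edge("Tokyo", "Taito City", "inclusion"),          # prefecture contains city
    Edge("Tokyo", "Sumida City", "inclusion"),
    Edge("Taito City", "Senso-ji", "inclusion"),       # city contains POI
    Edge("Sumida City", "Tokyo Skytree", "inclusion"),
    Edge("Senso-ji", "Tokyo Skytree", "transition"),   # visited one after the other
]

def transitions(edges):
    """Return the temporal visiting order as (source, target) pairs."""
    return [(e.source, e.target) for e in edges if e.edge_type == "transition"]
```

Here the transition edges alone recover the chronological visiting order, while the inclusion edges encode the prefecture → city → POI hierarchy.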
We aim to generate visiting order graphs directly from videos with MLLMs. However, our preliminary experiments revealed that this end-to-end approach is too difficult for current models. To address this, we decompose the task into two sub-tasks: node prediction and edge prediction.
Node Prediction: This task evaluates models' geospatial understanding, akin to playing "GeoGuessr". Given a video, MLLMs are asked to return all visited locations in three JSON lists (prefectures, cities, and POIs). For each POI, the model must also predict its category.
Edge Prediction: Given a video and all visited locations (gold labels, shuffled), MLLMs are asked to predict all inclusion and transition edges that constitute the video's visiting order graph. The output should be a JSON list of tuples formatted as <source, target, edge_type>. Inclusion edge prediction evaluates models' geospatial knowledge, while transition edge prediction assesses their temporal understanding.
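A model's raw edge-prediction output can be parsed along the following lines. This is a sketch only; the exact JSON wrapping each model produces may differ, and the example tuples are illustrative.

```python
import json

# Sketch of parsing a model's edge-prediction output into
# <source, target, edge_type> tuples, dropping malformed entries.
VALID_TYPES = {"inclusion", "transition"}

def parse_edge_output(raw: str):
    """Parse a JSON list of 3-element tuples, skipping entries
    with the wrong arity or an unknown edge type."""
    edges = []
    for item in json.loads(raw):
        if not (isinstance(item, list) and len(item) == 3):
            continue
        source, target, edge_type = item
        if edge_type in VALID_TYPES:
            edges.append((source, target, edge_type))
    return edges

raw = ('[["Tokyo", "Taito City", "inclusion"], '
       '["Senso-ji", "Tokyo Skytree", "transition"], '
       '["a", "b", "bogus"]]')
parsed = parse_edge_output(raw)  # the "bogus" tuple is dropped
```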
We evaluate 13 mainstream MLLMs on VIR-Bench in a zero-shot setting. For each model, we use the maximum number of input frames allowed by its interface or pre-training configuration.
We evaluate models using macro-averaged precision, recall, and F1 across both node and edge prediction. For prefecture and city nodes, a prediction is considered correct only if it exactly matches the gold label's surface name. For POIs, we apply a lightweight sequence-matching algorithm: predictions with a similarity score above 0.7 (high similarity) are treated as correct; predictions with a score above 0.5 (moderate similarity) are also accepted if the predicted POI category matches the gold category; all others are treated as incorrect. For inclusion and transition edges, a prediction is counted as correct only when the tuple <source, target, edge_type> exactly matches the gold tuple.
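The two-threshold POI rule can be approximated with Python's standard difflib. This is a simplified re-implementation for illustration; the benchmark's exact name normalization (e.g. casing or whitespace handling) may differ.

```python
from difflib import SequenceMatcher

def poi_match(pred_name: str, gold_name: str,
              pred_cat: str, gold_cat: str) -> bool:
    """Two-threshold POI matching: similarity > 0.7 is correct outright;
    similarity > 0.5 is correct only if the category also matches."""
    sim = SequenceMatcher(None, pred_name, gold_name).ratio()
    if sim > 0.7:
        return True
    if sim > 0.5:
        return pred_cat == gold_cat
    return False
```

For example, "Senso-ji" vs. "Sensoji Temple" scores a ratio of roughly 0.64, so under this rule it counts as correct only when the predicted category also matches the gold category.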
Across all five task categories, open-weight models continue to underperform proprietary models. The strongest open model, Qwen2.5-VL-72B, comes close to proprietary performance on the easier categories (prefecture node prediction and inclusion edge prediction), but substantial gaps remain on the harder categories (POI node prediction and transition edge prediction). Other open models perform markedly worse: the LLaVA-Video series and InternVL3-8B achieve only single-digit F1 in city and POI node prediction, and five of the nine open-weight models also remain in single digits on transition edge prediction. Among the proprietary models, Gemini-2.5-Pro is the top performer, especially on edge prediction, yet its F1 scores for city/POI node and transition edge prediction remain around 60. Taken together, these findings indicate that VIR-Bench is highly challenging for current MLLMs and highlight persistent limitations in geospatial and temporal understanding.
We release the VIR-Bench dataset strictly for research purposes, in compliance with Article 30-4 (Use for Non-Enjoyment Purposes) and Article 47-5 (Minor Use in Information Analysis Services) of the Japanese Copyright Act. Commercial use of any kind is strictly prohibited. The dataset may not be redistributed on servers outside Japan or under alternative licenses.
Dataset link: https://soya.infini-cloud.net/share/1302266998c5d047
To run evaluations, download and unzip data.zip and videos.zip, and organize them into the following directory structure.
graphs.zip (containing visiting order graphs in both pickle and SVG formats) is optional and not required for evaluation.
VIR-Bench/
├── data/
│   ├── test-00000-of-00001.parquet
│   └── validation-00000-of-00001.parquet
└── videos/
    ├── 0oODCXC3oms.mp4
    └── ...
To evaluate the models featured in our paper, please use the following commands to set up your environment.
git clone https://github.com/nlp-waseda/VIR-Bench.git
cd VIR-Bench
cd lmms-eval && pip install -e . && cd ..
git clone https://github.com/LLaVA-VL/LLaVA-NeXT
cd LLaVA-NeXT && pip install -e . && cd ..
pip install "qwen-vl-utils[decord]"
pip install flash-attn --no-build-isolation
pip install google-genai

Also update both ./lmms-eval/lmms_eval/tasks/virbench/node_prediction.yaml and ./lmms-eval/lmms_eval/tasks/virbench/edge_prediction.yaml by replacing the dataset path:
dataset_path: YOUR_PATH_TO/VIR-Bench
dataset_kwargs:
cache_dir: YOUR_PATH_TO/VIR-Bench
video: True
local_files_only: True
...
Below is an example using Qwen2.5-VL-7B-Instruct to perform evaluation on node prediction and edge prediction.
# Node Prediction
python -m accelerate.commands.launch \
--num_processes=1 \
-m lmms_eval \
--model qwen2_5_vl \
--model_args "pretrained=Qwen/Qwen2.5-VL-7B-Instruct,max_num_frames=256,min_pixels=37632,max_pixels=75264,attn_implementation=flash_attention_2,device_map=auto" \
--tasks virbench_node_prediction \
--batch_size 1 \
--log_samples \
--output_path YOUR_OUTPUT_PATH
# Edge Prediction
python -m accelerate.commands.launch \
--num_processes=1 \
-m lmms_eval \
--model qwen2_5_vl \
--model_args "pretrained=Qwen/Qwen2.5-VL-7B-Instruct,max_num_frames=256,min_pixels=37632,max_pixels=75264,attn_implementation=flash_attention_2,device_map=auto" \
--tasks virbench_edge_prediction \
--batch_size 1 \
--log_samples \
--output_path YOUR_OUTPUT_PATH

If you encounter issues with decord when running Qwen models, try patching the code in YOUR_PATH_TO_VENV/lib/python3.x/site-packages/qwen_vl_utils/vision_process.py as follows.
- vr = decord.VideoReader(path)
+ vr = decord.VideoReader(path, ctx=decord.cpu(0), num_threads=1)

Below is an example using Gemini-2.5-Flash to perform evaluation on node prediction and edge prediction.
export GOOGLE_API_KEY="YOUR_API_KEY"
# Node Prediction
python -m accelerate.commands.launch \
--num_processes=1 \
-m lmms_eval \
--model gemini_api \
--model_args "model_version=gemini-2.5-flash,response_persistent_folder=YOUR_PATH_TO_RESPONSE_FOLDER" \
--tasks virbench_node_prediction \
--batch_size 1 \
--log_samples \
--output_path YOUR_OUTPUT_PATH
# Edge Prediction
python -m accelerate.commands.launch \
--num_processes=1 \
-m lmms_eval \
--model gemini_api \
--model_args "model_version=gemini-2.5-flash,response_persistent_folder=YOUR_PATH_TO_RESPONSE_FOLDER" \
--tasks virbench_edge_prediction \
--batch_size 1 \
--log_samples \
--output_path YOUR_OUTPUT_PATH

For example scripts covering all models, see the scripts directory.
We provide the full code for the travel-planning agent used in our paper. See the agent/README for setup and usage instructions.
The evaluation code is built upon lmms-eval. We thank their team for providing this excellent toolkit for evaluating multimodal large language models.
@misc{wang2025virbenchevaluatinggeospatialtemporal,
  title={VIR-Bench: Evaluating Geospatial and Temporal Understanding of MLLMs via Travel Video Itinerary Reconstruction},
  author={Hao Wang and Eiki Murata and Lingfang Zhang and Ayako Sato and So Fukuda and Ziqi Yin and Wentao Hu and Keisuke Nakao and Yusuke Nakamura and Sebastian Zwirner and Yi-Chia Chen and Hiroyuki Otomo and Hiroki Ouchi and Daisuke Kawahara},
  year={2025},
  eprint={2509.19002},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2509.19002},
}