VIR-Bench:
Evaluating Geospatial and Temporal Understanding of MLLMs via Travel Video Itinerary Reconstruction
Recent advances in multimodal large language models (MLLMs) have significantly enhanced video understanding capabilities, opening new possibilities for practical applications. Yet current video benchmarks focus largely on indoor scenes or short-range outdoor activities, leaving the challenges associated with long-distance travel largely unexplored. Mastering extended geospatial-temporal trajectories is critical for next-generation MLLMs, underpinning real-world tasks such as embodied-AI planning and navigation. To bridge this gap, we present VIR-Bench, a novel benchmark consisting of 200 travel videos that frames itinerary reconstruction as a challenging task designed to evaluate and push forward MLLMs' geospatial-temporal intelligence. Experimental results reveal that state-of-the-art MLLMs, including proprietary ones, struggle to achieve high scores, underscoring the difficulty of handling videos that span extended spatial and temporal scales. Moreover, we conduct an in-depth case study in which we develop a prototype travel-planning agent that leverages the insights gained from VIR-Bench. The agent's markedly improved itinerary recommendations verify that our evaluation protocol not only benchmarks models effectively but also translates into concrete performance gains in user-facing applications.
- 2025-11-14: Our paper was accepted by AAAI 2026!
- 2025-09-20: We released the benchmark together with its evaluation framework and agent implementations.
- VIR-Bench
- Experiments
- Download the Dataset
- Run Your Own Evaluation
- Travel-planning Agent
- Acknowledgement
- Citation
Overview: VIR-Bench is a benchmark to evaluate long-range geospatial-temporal understanding via itinerary reconstruction from travel vlog videos. The core output is a directed visiting order graph: nodes represent locations at three granularities (prefecture, city, and point of interest (POI)) and edges represent two relations, inclusion for spatial hierarchy and transition for temporal adjacency. The dataset comprises 200 travel vlogs filmed across Japan, a major inbound tourism destination, each accompanied by a manually annotated and double-reviewed visiting order graph.
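Concretely, a visiting order graph can be sketched as a typed edge list. The following is a minimal illustration with example locations, not items drawn from the dataset; the released graphs themselves are distributed in pickle and SVG formats.

```python
# A minimal sketch of a visiting order graph as a typed edge list.
# Location names below are illustrative only.
from dataclasses import dataclass

@dataclass(frozen=True)
class Edge:
    source: str
    target: str
    edge_type: str  # "inclusion" (spatial hierarchy) or "transition" (temporal adjacency)

# Nodes at three granularities: prefecture, city, POI.
nodes = {
    "Tokyo": "prefecture",
    "Taito City": "city",
    "Sumida City": "city",
    "Senso-ji": "poi",
    "Tokyo Skytree": "poi",
}

edges = [
    Edge("Tokyo", "Taito City", "inclusion"),          # prefecture contains city
    Edge("Tokyo", "Sumida City", "inclusion"),
    Edge("Taito City", "Senso-ji", "inclusion"),       # city contains POI
    Edge("Sumida City", "Tokyo Skytree", "inclusion"),
    Edge("Senso-ji", "Tokyo Skytree", "transition"),   # visited one after the other
]

def transitions(edges):
    """Return the temporal visiting order as (source, target) pairs."""
    return [(e.source, e.target) for e in edges if e.edge_type == "transition"]
```

Here the transition edges alone recover the chronological visiting order, while the inclusion edges encode the prefecture → city → POI hierarchy.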
We aim to generate visiting order graphs directly from videos with MLLMs. However, our preliminary experiments revealed that this end-to-end approach is too difficult for current models. To address this, we decompose the task into two sub-tasks: node prediction and edge prediction.
Node Prediction: This task evaluates models' geospatial understanding, akin to playing "GeoGuessr". Given a video, MLLMs are asked to return all visited locations in three JSON lists (prefectures, cities, and POIs). For each POI, the model must also predict its category.
Edge Prediction: Given a video and all visited locations (gold labels, shuffled), MLLMs are asked to predict all inclusion and transition edges that constitute the video's visiting order graph. The output should be a JSON list of tuples formatted as <source, target, edge_type>. Inclusion edge prediction evaluates models' geospatial knowledge, while transition edge prediction assesses their temporal understanding.
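A model's raw edge-prediction output can be parsed along the following lines. This is a sketch only; the exact JSON wrapping each model produces may differ, and the example tuples are illustrative.

```python
import json

# Sketch of parsing a model's edge-prediction output into
# <source, target, edge_type> tuples, dropping malformed entries.
VALID_TYPES = {"inclusion", "transition"}

def parse_edge_output(raw: str):
    """Parse a JSON list of 3-element tuples, skipping entries
    with the wrong arity or an unknown edge type."""
    edges = []
    for item in json.loads(raw):
        if not (isinstance(item, list) and len(item) == 3):
            continue
        source, target, edge_type = item
        if edge_type in VALID_TYPES:
            edges.append((source, target, edge_type))
    return edges

raw = ('[["Tokyo", "Taito City", "inclusion"], '
       '["Senso-ji", "Tokyo Skytree", "transition"], '
       '["a", "b", "bogus"]]')
parsed = parse_edge_output(raw)  # the "bogus" tuple is dropped
```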
We evaluate 13 mainstream MLLMs on VIR-Bench in a zero-shot setting. For each model, we use the maximum number of input frames allowed by its interface or pre-training configuration.
We evaluate models using macro-averaged precision, recall, and F1 across both node and edge prediction. For prefecture and city nodes, a prediction is considered correct only if it exactly matches the gold label's surface name. For POIs, we apply a lightweight sequence-matching algorithm: predictions with a similarity score above 0.7 (high similarity) are treated as correct; predictions with a score above 0.5 (moderate similarity) are also accepted if the predicted POI category matches the gold category; all others are treated as incorrect. For inclusion and transition edges, a prediction is counted as correct only when the tuple <source, target, edge_type> exactly matches the gold tuple.
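The two-threshold POI rule can be approximated with Python's standard difflib. This is a simplified re-implementation for illustration; the benchmark's exact name normalization (e.g. casing or whitespace handling) may differ.

```python
from difflib import SequenceMatcher

def poi_match(pred_name: str, gold_name: str,
              pred_cat: str, gold_cat: str) -> bool:
    """Two-threshold POI matching: similarity > 0.7 is correct outright;
    similarity > 0.5 is correct only if the category also matches."""
    sim = SequenceMatcher(None, pred_name, gold_name).ratio()
    if sim > 0.7:
        return True
    if sim > 0.5:
        return pred_cat == gold_cat
    return False
```

For example, "Senso-ji" vs. "Sensoji Temple" scores a ratio of roughly 0.64, so under this rule it counts as correct only when the predicted category also matches the gold category.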
Across all five task categories, open-weight models continue to underperform proprietary models. The strongest open model, Qwen2.5-VL-72B, comes close to proprietary performance on the easier categories (prefecture node prediction and inclusion edge prediction), but substantial gaps remain on the harder categories (POI node prediction and transition edge prediction). Other open models perform markedly worse: the LLaVA-Video series and InternVL3-8B achieve only single-digit F1 in city and POI node prediction, and five of the nine open-weight models also remain in single digits on transition edge prediction. Among the proprietary models, Gemini-2.5-Pro is the top performer, especially on edge prediction, yet its F1 scores for city/POI node and transition edge prediction remain around 60. Taken together, these findings indicate that VIR-Bench is highly challenging for current MLLMs and highlight persistent limitations in geospatial and temporal understanding.
We release the VIR-Bench dataset strictly for research purposes, in compliance with Article 30-4 (Use for Non-Enjoyment Purposes) and Article 47-5 (Minor Use in Information Analysis Services) of the Japanese Copyright Act. Commercial use of any kind is strictly prohibited. The dataset may not be redistributed on servers outside Japan or under alternative licenses.
Dataset link: https://soya.infini-cloud.net/share/1302266998c5d047
To run evaluations, download and unzip data.zip and videos.zip, and organize them into the following directory structure.
graphs.zip (containing visiting order graphs in both pickle and SVG formats) is optional and not required for evaluation.
VIR-Bench/
├── data/
│   ├── test-00000-of-00001.parquet
│   └── validation-00000-of-00001.parquet
└── videos/
    ├── 0oODCXC3oms.mp4
    └── ...
To evaluate the models featured in our paper, please use the following commands to set up your environment.
git clone https://github.com/nlp-waseda/VIR-Bench.git
cd VIR-Bench
cd lmms-eval && pip install -e . && cd ..
git clone https://github.com/LLaVA-VL/LLaVA-NeXT
cd LLaVA-NeXT && pip install -e . && cd ..
pip install "qwen-vl-utils[decord]"
pip install flash-attn --no-build-isolation
pip install google-genai

Also update both ./lmms-eval/lmms_eval/tasks/virbench/node_prediction.yaml and ./lmms-eval/lmms_eval/tasks/virbench/edge_prediction.yaml by replacing the dataset path:
dataset_path: YOUR_PATH_TO/VIR-Bench
dataset_kwargs:
cache_dir: YOUR_PATH_TO/VIR-Bench
video: True
local_files_only: True
...
Below is an example using Qwen2.5-VL-7B-Instruct to perform evaluation on node prediction and edge prediction.
# Node Prediction
python -m accelerate.commands.launch \
--num_processes=1 \
-m lmms_eval \
--model qwen2_5_vl \
--model_args "pretrained=Qwen/Qwen2.5-VL-7B-Instruct,max_num_frames=256,min_pixels=37632,max_pixels=75264,attn_implementation=flash_attention_2,device_map=auto" \
--tasks virbench_node_prediction \
--batch_size 1 \
--log_samples \
--output_path YOUR_OUTPUT_PATH
# Edge Prediction
python -m accelerate.commands.launch \
--num_processes=1 \
-m lmms_eval \
--model qwen2_5_vl \
--model_args "pretrained=Qwen/Qwen2.5-VL-7B-Instruct,max_num_frames=256,min_pixels=37632,max_pixels=75264,attn_implementation=flash_attention_2,device_map=auto" \
--tasks virbench_edge_prediction \
--batch_size 1 \
--log_samples \
--output_path YOUR_OUTPUT_PATH

If you encounter issues with decord when running Qwen models, try patching the code in YOUR_PATH_TO_VENV/lib/python3.x/site-packages/qwen_vl_utils/vision_process.py as follows.
- vr = decord.VideoReader(path)
+ vr = decord.VideoReader(path, ctx=decord.cpu(0), num_threads=1)

Below is an example using Gemini-2.5-Flash to perform evaluation on node prediction and edge prediction.
export GOOGLE_API_KEY="YOUR_API_KEY"
# Node Prediction
python -m accelerate.commands.launch \
--num_processes=1 \
-m lmms_eval \
--model gemini_api \
--model_args "model_version=gemini-2.5-flash,response_persistent_folder=YOUR_PATH_TO_RESPONSE_FOLDER" \
--tasks virbench_node_prediction \
--batch_size 1 \
--log_samples \
--output_path YOUR_OUTPUT_PATH
# Edge Prediction
python -m accelerate.commands.launch \
--num_processes=1 \
-m lmms_eval \
--model gemini_api \
--model_args "model_version=gemini-2.5-flash,response_persistent_folder=YOUR_PATH_TO_RESPONSE_FOLDER" \
--tasks virbench_edge_prediction \
--batch_size 1 \
--log_samples \
--output_path YOUR_OUTPUT_PATH

For example scripts covering all models, see the scripts directory.
We provide the full code for the travel-planning agent used in our paper. See the agent/README for setup and usage instructions.
The evaluation code is built upon lmms-eval. We thank their team for providing this excellent toolkit for evaluating multimodal large language models.
@misc{wang2025virbenchevaluatinggeospatialtemporal,
  title={VIR-Bench: Evaluating Geospatial and Temporal Understanding of MLLMs via Travel Video Itinerary Reconstruction},
  author={Hao Wang and Eiki Murata and Lingfang Zhang and Ayako Sato and So Fukuda and Ziqi Yin and Wentao Hu and Keisuke Nakao and Yusuke Nakamura and Sebastian Zwirner and Yi-Chia Chen and Hiroyuki Otomo and Hiroki Ouchi and Daisuke Kawahara},
  year={2025},
  eprint={2509.19002},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2509.19002},
}