
microsoft/MV-RoboBench

Seeing Across Views: Benchmarking Spatial Reasoning of Vision-Language Models in Robotic Scenes

🔥🔥🔥 ICLR 2026 Accepted 🔥🔥🔥

ZhiYuan Feng¹*, Zhaolu Kang²*, Qijie Wang¹*, Zhiying Du³*, Jiongrui Yan⁴, Shi Shubin⁴, Chengbo Yuan¹, Huizhi Liang¹, Yu Deng⁵, Qixiu Li¹, Rushuai Yang⁶, Ruichuan An², Leqi Zheng¹, Weijie Wang⁷, Shawn Chen⁷, Sicheng Xu⁵, Yaobo Liang⁵, Jiaolong Yang⁵†, Baining Guo⁵


¹Tsinghua University, ²Peking University, ³Fudan University, ⁴Jilin University, ⁵Microsoft Research Asia, ⁶Hong Kong University of Science and Technology, ⁷Zhejiang University

(*Equal Contribution, †Corresponding Author)


🎉 News

  • [2025.10] 📢📢 Paper and initial project release.
  • [2026.01] 📦 Benchmark dataset released on Hugging Face.
  • [2026.01] 🎉🎉 Paper accepted to ICLR 2026.
  • [2026.03] 🛠️ Evaluation code released (see the evaluation branch).

📝 To-Do List

MV-RoboBench

(Figure: Data Pipeline)

Benchmark Overview: We introduce MV-RoboBench, a benchmark designed to evaluate the multi-view spatial reasoning capabilities of VLMs in robotic scenes. It contains [Number] question-answer pairs across [Number] diverse robotic scenes. The benchmark comprises [Number] challenging tasks, such as [Task 1 Name], [Task 2 Name], and [Task 3 Name]. These tasks are designed to probe various aspects of 3D scene understanding, from establishing object correspondences to understanding relative spatial poses.

(Figure: Benchmark Examples)

📌 A Benchmark for Robotic Scenes: We introduce MV-RoboBench, a comprehensive benchmark designed to evaluate the spatial reasoning of Vision-Language Models in robotic scenes.

📊 Comprehensive Evaluation: We evaluate [Number] state-of-the-art VLMs, including GPT-4o and Claude 3, and find a substantial gap between current models and human-level spatial reasoning.

🔍 Revealing Core Challenges: Our analysis pinpoints key failure modes for current models in robotic scene understanding, particularly in cross-view correspondence, relative pose estimation, and action planning.
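The released evaluation code lives on the evaluation branch. Purely as an illustration of how per-task scoring for a multiple-choice benchmark like this can work, here is a minimal sketch; the record fields `task`, `answer`, and `prediction`, and the task names, are hypothetical placeholders, not the benchmark's actual schema:

```python
from collections import defaultdict

def per_task_accuracy(records):
    """Compute per-task multiple-choice accuracy.

    Each record is a dict with hypothetical fields:
      "task"       - task name,
      "answer"     - ground-truth option letter,
      "prediction" - the model's chosen option letter.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for r in records:
        total[r["task"]] += 1
        if r["prediction"] == r["answer"]:
            correct[r["task"]] += 1
    # Accuracy = fraction of correctly answered questions per task.
    return {task: correct[task] / total[task] for task in total}

# Toy usage with made-up records:
records = [
    {"task": "cross-view correspondence", "answer": "A", "prediction": "A"},
    {"task": "cross-view correspondence", "answer": "B", "prediction": "C"},
    {"task": "relative pose", "answer": "D", "prediction": "D"},
]
print(per_task_accuracy(records))
# → {'cross-view correspondence': 0.5, 'relative pose': 1.0}
```

Reporting accuracy per task (rather than a single pooled number) is what makes failure modes like cross-view correspondence visible in the first place.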

Contact

For questions or suggestions, please feel free to contact Zhiyuan Feng or any of the other authors.
