
microsoft/MV-RoboBench

Seeing Across Views: Benchmarking Spatial Reasoning of Vision-Language Models in Robotic Scenes

🔥🔥🔥 ICLR 2026 Accepted 🔥🔥🔥

ZhiYuan Feng¹*, Zhaolu Kang²*, Qijie Wang¹*, Zhiying Du³*, Jiongrui Yan⁴, Shi Shubin⁴, Chengbo Yuan¹, Huizhi Liang¹, Yu Deng⁵, Qixiu Li¹, Rushuai Yang⁶, Ruichuan An², Leqi Zheng¹, Weijie Wang⁷, Shawn Chen⁷, Sicheng Xu⁵, Yaobo Liang⁵, Jiaolong Yang⁵†, Baining Guo⁵


¹Tsinghua University, ²Peking University, ³Fudan University, ⁴Jilin University, ⁵Microsoft Research Asia, ⁶Hong Kong University of Science and Technology, ⁷Zhejiang University

(*Equal Contribution, †Corresponding Author)


🎉 News

  • [2025.10] 📢📢 Paper and initial project release.
  • [2026.01] 📦 Benchmark dataset released on Hugging Face.
  • [2026.01] 🎉🎉 Paper accepted to ICLR 2026.
  • [2026.03] 🛠️ Evaluation code released (see the evaluation branch).

📝 To-Do List

MV-RoboBench

(Figure: Data Pipeline)

Benchmark Overview: We introduce MV-RoboBench, a benchmark designed to evaluate the multi-view spatial reasoning capabilities of VLMs in robotic scenes. It contains [Number] question-answer pairs across [Number] diverse robotic scenes. The benchmark comprises [Number] challenging tasks, such as [Task 1 Name], [Task 2 Name], and [Task 3 Name]. These tasks are designed to probe various aspects of 3D scene understanding, from establishing object correspondences to understanding relative spatial poses.

(Figure: Benchmark Examples)

📌 A Benchmark for Robotic Scenes: We introduce MV-RoboBench, a comprehensive benchmark designed to evaluate the spatial reasoning of Vision-Language Models in robotic scenes.

📊 Comprehensive Evaluation: We evaluate [Number] state-of-the-art VLMs, including GPT-4o and Claude 3, and find a substantial gap between current models and human-level spatial reasoning.

🔍 Revealing Core Challenges: Our analysis pinpoints key failure modes for current models in robotic scene understanding, particularly in cross-view correspondence, relative pose estimation, and action planning.
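The released evaluation code lives on the evaluation branch. Purely as an illustration of how per-task scoring for a multiple-choice benchmark like this can work, here is a minimal sketch; the record fields `task`, `answer`, and `prediction`, and the task names, are hypothetical placeholders, not the benchmark's actual schema:

```python
from collections import defaultdict

def per_task_accuracy(records):
    """Compute per-task multiple-choice accuracy.

    Each record is a dict with hypothetical fields:
      "task"       - task name,
      "answer"     - ground-truth option letter,
      "prediction" - the model's chosen option letter.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for r in records:
        total[r["task"]] += 1
        if r["prediction"] == r["answer"]:
            correct[r["task"]] += 1
    # Accuracy = fraction of correctly answered questions per task.
    return {task: correct[task] / total[task] for task in total}

# Toy usage with made-up records:
records = [
    {"task": "cross-view correspondence", "answer": "A", "prediction": "A"},
    {"task": "cross-view correspondence", "answer": "B", "prediction": "C"},
    {"task": "relative pose", "answer": "D", "prediction": "D"},
]
print(per_task_accuracy(records))
# → {'cross-view correspondence': 0.5, 'relative pose': 1.0}
```

Reporting accuracy per task (rather than a single pooled number) is what makes failure modes like cross-view correspondence visible in the first place.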

Contact

For questions or suggestions, please feel free to contact Zhiyuan Feng or any of the other authors.
