H-EmbodVis/DOMINO

Towards Generalizable Robotic Manipulation in Dynamic Environments

Heng Fang1, Shangru Li1, Shuhan Wang1, Xuanyang Xi2, Dingkang Liang1, Xiang Bai1
1 Huazhong University of Science and Technology, 2 Huawei Technologies Co. Ltd

🔍 Overview

Dynamic manipulation requires robots to continuously adapt to moving objects and unpredictable environmental changes. Existing Vision-Language-Action (VLA) models rely on static single-frame observations, failing to capture essential spatiotemporal dynamics. We introduce DOMINO, a comprehensive benchmark for this underexplored frontier, and PUMA, a predictive architecture that couples historical motion cues with future state anticipation to achieve highly reactive embodied intelligence.

Abstract

Vision-Language-Action (VLA) models excel in static manipulation but struggle in dynamic environments with moving targets. This performance gap primarily stems from a scarcity of dynamic manipulation datasets and the reliance of mainstream VLAs on single-frame observations, restricting their spatiotemporal reasoning capabilities. To address this, we introduce DOMINO, a large-scale dataset and benchmark for generalizable dynamic manipulation, featuring 35 tasks with hierarchical complexities, over 110K expert trajectories, and a multi-dimensional evaluation suite. Through comprehensive experiments, we systematically evaluate existing VLAs on dynamic tasks, explore effective training strategies for dynamic awareness, and validate the generalizability of dynamic data. Furthermore, we propose PUMA, a dynamics-aware VLA architecture. By integrating scene-centric historical optical flow and specialized world queries to implicitly forecast object-centric future states, PUMA couples history-aware perception with short-horizon prediction. Results demonstrate that PUMA achieves state-of-the-art performance, yielding a 6.3% absolute improvement in success rate over baselines. Moreover, we show that training on dynamic data fosters robust spatiotemporal representations that transfer to static tasks.
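The abstract reports a 6.3% *absolute* improvement in success rate, i.e. a gain in percentage points rather than a relative increase. A minimal sketch with hypothetical numbers (the baseline value below is illustrative, not from the paper):

```python
# Hypothetical baseline success rate; "absolute improvement" adds
# percentage points directly rather than scaling the baseline.
baseline_sr = 0.520          # illustrative baseline success rate
absolute_gain = 0.063        # the reported +6.3 points
puma_sr = baseline_sr + absolute_gain

relative_gain = absolute_gain / baseline_sr  # what a *relative* 6.3% would NOT be
print(f"PUMA SR: {puma_sr:.3f}, relative gain: {relative_gain:.1%}")
```

This distinction matters when comparing benchmarks: a relative 6.3% on the same baseline would yield a much smaller jump.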

🎥 Visual Demos

More visual demos can be found on our project homepage.

✨ Key Idea

  • Current VLA models struggle with dynamic manipulation tasks due to a scarcity of dynamic datasets and a reliance on single-frame observations.
  • We introduce DOMINO, a large-scale benchmark for dynamic manipulation comprising 35 tasks and over 110K expert trajectories.
  • We propose PUMA, a dynamics-aware VLA architecture that integrates historical optical flow and world queries to forecast future object states.
  • Training on dynamic data fosters robust spatiotemporal representations, demonstrating enhanced generalization capabilities.
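The coupling of history-aware perception and short-horizon prediction described above can be sketched as a toy policy. Everything here is a hypothetical stand-in, not the released PUMA code: frame differencing substitutes for real optical flow, the "world queries" are a small set of learned vectors that attend over the motion history, and all dimensions are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

def flow_proxy(frames):
    # Crude scene-centric motion cue: frame differencing as a
    # stand-in for historical optical flow over the observation window.
    return np.stack([frames[t + 1] - frames[t] for t in range(len(frames) - 1)])

class TinyDynamicsAwarePolicy:
    """Hypothetical sketch: flow-history encoder + world queries -> action."""

    def __init__(self, feat_dim=16, n_queries=4, action_dim=7):
        self.w_enc = rng.standard_normal((64, feat_dim)) * 0.1          # flow encoder
        self.queries = rng.standard_normal((n_queries, feat_dim)) * 0.1  # world queries
        self.w_act = rng.standard_normal((n_queries * feat_dim, action_dim)) * 0.1

    def forward(self, frames):
        flow = flow_proxy(frames)                         # (T-1, 8, 8)
        feats = flow.reshape(len(flow), -1) @ self.w_enc  # per-step flow features
        # Cross-attention: world queries pool over the motion history,
        # implicitly anticipating short-horizon future object states.
        scores = self.queries @ feats.T                   # (n_queries, T-1)
        attn = np.exp(scores - scores.max(axis=1, keepdims=True))
        attn /= attn.sum(axis=1, keepdims=True)
        future = attn @ feats                             # future-state embedding
        return future.reshape(-1) @ self.w_act            # action vector

frames = rng.standard_normal((5, 8, 8))  # toy 5-frame observation history
policy = TinyDynamicsAwarePolicy()
action = policy.forward(frames)
print(action.shape)  # (7,)
```

The key design point mirrored here is that the action head never sees a single frame in isolation: it only receives features derived from motion across the history, which is what gives the real architecture its reactivity to moving targets.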

📅 TODO

  • Release the paper
  • Release DOMINO data generation pipeline
  • Release DOMINO dataset
  • Release PUMA training code
  • Release PUMA checkpoint and eval code
  • Support Huawei Ascend NPUs

🛠️ Getting Started

Coming soon...

👍 Acknowledgement

We build upon the following great works and open-source repositories.

📖 Citation

@article{fang2026towards,
  title={Towards Generalizable Robotic Manipulation in Dynamic Environments},
  author={Fang, Heng and Li, Shangru and Wang, Shuhan and Xi, Xuanyang and Liang, Dingkang and Bai, Xiang},
  journal={arXiv preprint arXiv:2603.15620},
  year={2026}
}
