Dynamic manipulation requires robots to continuously adapt to moving objects and unpredictable environmental changes. Existing Vision-Language-Action (VLA) models rely on static single-frame observations, failing to capture essential spatiotemporal dynamics. We introduce DOMINO, a comprehensive benchmark for this underexplored frontier, and PUMA, a predictive architecture that couples historical motion cues with future state anticipation to achieve highly reactive embodied intelligence.
Abstract
Vision-Language-Action (VLA) models excel in static manipulation but struggle in dynamic environments with moving targets. This performance gap primarily stems from a scarcity of dynamic manipulation datasets and the reliance of mainstream VLAs on single-frame observations, restricting their spatiotemporal reasoning capabilities. To address this, we introduce DOMINO, a large-scale dataset and benchmark for generalizable dynamic manipulation, featuring 35 tasks with hierarchical complexities, over 110K expert trajectories, and a multi-dimensional evaluation suite. Through comprehensive experiments, we systematically evaluate existing VLAs on dynamic tasks, explore effective training strategies for dynamic awareness, and validate the generalizability of dynamic data. Furthermore, we propose PUMA, a dynamics-aware VLA architecture. By integrating scene-centric historical optical flow and specialized world queries to implicitly forecast object-centric future states, PUMA couples history-aware perception with short-horizon prediction. Results demonstrate that PUMA achieves state-of-the-art performance, yielding a 6.3% absolute improvement in success rate over baselines. Moreover, we show that training on dynamic data fosters robust spatiotemporal representations that transfer to static tasks.
More visual demos can be found on our project homepage.
- Current VLA models struggle with dynamic manipulation tasks due to a scarcity of dynamic datasets and a reliance on single-frame observations.
- We introduce DOMINO, a large-scale benchmark for dynamic manipulation comprising 35 tasks and over 110K expert trajectories.
- We propose PUMA, a dynamics-aware VLA architecture that integrates historical optical flow and world queries to forecast future object states.
- Training on dynamic data fosters robust spatiotemporal representations that transfer to static tasks, improving generalization.
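The paper defines PUMA's exact architecture; as a rough conceptual illustration only, the coupling of historical optical-flow tokens with learned world queries in a shared attention sequence might look like the minimal numpy sketch below. All names, dimensions, and the single-head attention stub are hypothetical, not taken from the released code:

```python
import numpy as np

rng = np.random.default_rng(0)
D = 64          # hypothetical token dimension
T = 4           # history length in frames (assumed)
N_PATCH = 16    # visual tokens per frame (assumed)
N_QUERY = 8     # number of learned "world queries" (assumed)

# Scene-centric historical optical flow, encoded into tokens per frame pair.
flow_tokens = rng.normal(size=(T - 1, N_PATCH, D))
current_obs = rng.normal(size=(N_PATCH, D))   # current-frame visual tokens

# Learned queries whose outputs implicitly carry the forecast future object state.
world_queries = rng.normal(size=(N_QUERY, D))

# Concatenate history, current observation, and world queries into one sequence,
# as a transformer backbone would consume them.
sequence = np.concatenate(
    [flow_tokens.reshape(-1, D), current_obs, world_queries], axis=0
)

# Single-head self-attention stub: the world-query positions aggregate
# spatiotemporal context, standing in for the implicit short-horizon prediction.
scores = sequence @ sequence.T / np.sqrt(D)
attn = np.exp(scores - scores.max(axis=-1, keepdims=True))
attn /= attn.sum(axis=-1, keepdims=True)
out = attn @ sequence

future_state_tokens = out[-N_QUERY:]   # outputs at the world-query positions
print(future_state_tokens.shape)       # (8, 64)
```

In the actual model these query outputs would condition the action head, so the policy reacts to where objects are headed rather than where they were last observed.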
- Release the paper
- Release DOMINO data generation pipeline
- Release DOMINO dataset
- Release PUMA training code
- Release PUMA checkpoint and eval code
- Support Huawei Ascend NPUs
Coming soon...
We build upon the following great works and open-source repositories:
```bibtex
@article{fang2026towards,
  title={Towards Generalizable Robotic Manipulation in Dynamic Environments},
  author={Fang, Heng and Li, Shangru and Wang, Shuhan and Xi, Xuanyang and Liang, Dingkang and Bai, Xiang},
  journal={arXiv preprint arXiv:2603.15620},
  year={2026}
}
```