Simulating deformable objects is essential for a wide range of robotic manipulation applications. Yet, accurately predicting their dynamics remains challenging. Physics-based simulations are interpretable, but they require a known model class and precise identification of physical parameters, which are often difficult to obtain. Learning-based approaches can be more expressive in capturing dynamics, but they often generalize poorly outside the training distribution, require substantial training data, and lack physical consistency. We propose Physics-Guided Residual Dynamics (PGRD), a hybrid simulation framework that combines an optimizable spring–mass simulator as a backbone with a learned neural network that predicts residual corrections to compensate for discrepancies between physics-based predictions and reality. Our velocity-based formulation ensures stable simulation, while a sliding-window transformer captures temporal dependencies. We validate our approach on real-world deformable objects, demonstrating that PGRD outperforms both purely physics-based simulators and learning-based methods. We further demonstrate the practical utility of our framework in action-conditioned 3D video prediction using 3D Gaussian Splatting and in Model Predictive Control for manipulation planning on challenging tasks such as cable rerouting, where purely physics-based simulation fails.
We combine a tuned spring-mass simulator with a learned residual dynamics network. The simulator predicts the next state from current observations and actions, while the network predicts per-particle velocity corrections that are time-integrated to obtain final positions. This hybrid design leverages physics priors for improved generalization while capturing complex phenomena that are challenging to model analytically.
We use a spring-mass simulator as the physics backbone. Its parameters, such as stiffness, damping, and friction, are tuned to match real-world trajectories via black-box optimization.
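A minimal sketch of this idea, assuming a 1D spring-mass chain with unit masses and a random-search tuner; `spring_mass_step`, `fit_params`, and the parameter names are illustrative stand-ins, not the actual simulator or optimizer:

```python
import numpy as np

def spring_mass_step(x, v, params, dt=0.01):
    """One semi-implicit Euler step of a 1D spring-mass chain (unit masses).

    x, v: (N,) particle positions and velocities.
    params: dict with 'stiffness', 'damping', 'rest_len' (hypothetical names).
    """
    k, c, L = params["stiffness"], params["damping"], params["rest_len"]
    f = np.zeros_like(x)
    ext = np.diff(x) - L          # signed extension of each spring
    f[:-1] += k * ext             # stretched spring pulls particle i right
    f[1:] -= k * ext              # and pulls particle i+1 left
    f -= c * v                    # viscous damping
    v_next = v + dt * f
    x_next = x + dt * v_next      # semi-implicit (symplectic) Euler
    return x_next, v_next

def fit_params(trajectory, dt=0.01, n_samples=200, seed=0):
    """Black-box search for parameters that match an observed trajectory."""
    rng = np.random.default_rng(seed)
    best, best_err = None, np.inf
    for _ in range(n_samples):
        params = {"stiffness": rng.uniform(1.0, 100.0),
                  "damping": rng.uniform(0.0, 5.0),
                  "rest_len": 1.0}
        x, v = trajectory[0].copy(), np.zeros_like(trajectory[0])
        err = 0.0
        for target in trajectory[1:]:
            x, v = spring_mass_step(x, v, params, dt)
            err += np.mean((x - target) ** 2)
        if err < best_err:
            best, best_err = params, err
    return best
```

Random search stands in for whatever black-box optimizer is used in practice (e.g. CMA-ES); the objective is the same, a trajectory-matching error over rollouts.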
While the optimized physics backbone captures the broad dynamics, it cannot fully reproduce the behavior of real objects. A neural network bridges this gap by predicting per-particle residual velocities, which are added to the simulator's output velocities and integrated to obtain corrected positions. The network consists of a Point Transformer V3 encoder, a NeRF-style decoder that produces initial velocity estimates, and a sliding-window transformer with gating that refines the corrections using temporal history.
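The hybrid update itself reduces to a simple integration rule. The sketch below uses a hypothetical `residual_net` callable as a stand-in for the learned architecture:

```python
import numpy as np

def hybrid_step(x, v_sim, residual_net, dt=0.01):
    """Hybrid physics-plus-residual update (illustrative).

    x:      (N, 3) current particle positions.
    v_sim:  (N, 3) velocities predicted by the physics backbone.
    residual_net: callable (x, v_sim) -> (N, 3) residual velocities;
                  a stand-in for the learned network, not its real interface.
    """
    dv = residual_net(x, v_sim)   # learned per-particle correction
    v_corr = v_sim + dv           # corrected velocity
    x_next = x + dt * v_corr      # time integration to final positions
    return x_next, v_corr
```

Because the network outputs a velocity residual rather than positions directly, the physics prediction is recovered exactly when the residual is zero, which keeps the integration stable.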
PGRD achieves the lowest error across all tracking metrics and improves visual quality in action-conditioned video prediction. Qualitative comparisons show improved structural consistency and more realistic deformations across all objects.
PGRD supports Model Predictive Path Integral (MPPI) control for manipulation planning, achieving 8/10 successes on cable rerouting versus 2/10 for the physics-only baseline.
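A minimal MPPI sketch, assuming a generic `dynamics` rollout (the hybrid model in practice) and a scalar running `cost`; the function name and hyperparameters are illustrative:

```python
import numpy as np

def mppi_plan(x0, dynamics, cost, act_dim, horizon=10, n_samples=64,
              sigma=0.5, lam=1.0, seed=0):
    """Model Predictive Path Integral planning around a zero nominal action.

    dynamics: (state, action) -> next state  (the learned/hybrid rollout)
    cost:     state -> scalar running cost
    Returns the first action of the exponentially weighted control sequence.
    """
    rng = np.random.default_rng(seed)
    noise = rng.normal(0.0, sigma, size=(n_samples, horizon, act_dim))
    costs = np.zeros(n_samples)
    for i in range(n_samples):
        x = np.array(x0, dtype=float)
        for t in range(horizon):
            x = dynamics(x, noise[i, t])
            costs[i] += cost(x)
    # Softmin weighting of sampled action sequences by rollout cost.
    w = np.exp(-(costs - costs.min()) / lam)
    w /= w.sum()
    u_seq = np.tensordot(w, noise, axes=1)   # (horizon, act_dim)
    return u_seq[0]
```

In receding-horizon use, only the first action is executed and the plan is recomputed from the new observation at every step.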
We further integrate PGRD with language-conditioned goal generation, allowing planning directly from text commands without collecting a target point cloud in advance. The generated goal point cloud is visualized in each execution.
Lift right arm
Pass rope through closer slot
Pass rope through crossed slot
Pass rope through gray slot
Pass rope through red slot
Rotate sloth
PGRD enables action-conditioned 3D video prediction using 3D Gaussian Splatting for interactive simulation.