Today's off-the-shelf LMs have remarkable generalization, reasoning, and planning capabilities. The agentic harness in CaP-Agent0 unleashes that potential in the physical world.
Click on each task to see the agent in action.
CaP-Bench provides the first comprehensive benchmark for evaluating how well large language model agents can write code to control robots. Integrated with hundreds of manipulation tasks across multiple robot learning benchmarks (LIBERO-PRO, Robosuite, BEHAVIOR), CaP-Bench tests both LLMs and VLMs on their ability to generate executable robot control policies from natural language instructions.
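Concretely, each episode follows a write-then-execute loop: the agent receives a natural-language instruction, emits a Python policy, and the policy is rolled out in simulation for a binary success signal. The sketch below illustrates that loop with hypothetical `SimEnv` and `generate_policy` stand-ins, not CaP-Bench's actual API:

```python
# Minimal sketch of a code-as-policy evaluation loop.
# `SimEnv` and `generate_policy` are illustrative stand-ins,
# not the actual CaP-Bench interfaces.

def generate_policy(instruction: str) -> str:
    # In the benchmark this would be an LLM call; here we return
    # a canned program for illustration.
    return (
        "def policy(env):\n"
        "    env.pick('red_cube')\n"
        "    env.place('red_cube', 'plate')\n"
    )

class SimEnv:
    """Toy environment: success iff the cube ends up on the plate."""
    def __init__(self):
        self.location = {"red_cube": "table"}

    def pick(self, obj):
        self.location[obj] = "gripper"

    def place(self, obj, target):
        if self.location[obj] == "gripper":
            self.location[obj] = target

    def success(self):
        return self.location["red_cube"] == "plate"

def evaluate(instruction: str) -> bool:
    env = SimEnv()
    namespace = {}
    exec(generate_policy(instruction), namespace)  # compile generated code
    namespace["policy"](env)                       # roll the policy out
    return env.success()

print(evaluate("put the red cube on the plate"))  # True
```

Because success is checked by the environment rather than by string matching, any program that achieves the goal counts, regardless of how it is written.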
Without any task-specific training, today's best frontier models can directly generate executable robot control code and achieve over 30% average success — a sharp contrast to the prior belief that only specially trained vision-language-action models (VLAs) can perform manipulation. Yet a 56-point gap to human performance remains, marking this as one of AI's most important open challenges.
On LIBERO-PRO — 30 manipulation tasks with position and instruction perturbations — state-of-the-art Vision-Language-Action models (OpenVLA, π0) score 0% across the board. Even the best VLA (π0.5) reaches only 13% average success. CaP-Agent0, a training-free coding agent, achieves 18% without any task-specific training, demonstrating that code-generation agents generalize where end-to-end learned policies break down.
Using CaP-RL, we apply reinforcement learning with environment rewards directly on the coding agent. A 7B model (Qwen 2.5 Coder) jumps from 20% to 72% average success in simulation after just 50 training iterations. The learned policies transfer to a real Franka Emika robot with minimal sim-to-real gap — reaching 84% on cube lifting and 76% on cube stacking, approaching human expert performance.
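The core idea — use the environment's binary success signal as the only reward and nudge the model toward programs that succeed — can be sketched as a toy REINFORCE loop. Here the "agent" is a softmax over three candidate programs, a stand-in for a 7B model's output distribution; names and rewards are illustrative, not the actual CaP-RL training code:

```python
import math, random

# Toy REINFORCE sketch in the spirit of CaP-RL: environment success
# is the only reward, and probability mass shifts toward programs
# that succeed when executed.

random.seed(0)
programs = ["pick_then_place", "place_then_pick", "do_nothing"]
reward = {"pick_then_place": 1.0, "place_then_pick": 0.0, "do_nothing": 0.0}
logits = [0.0, 0.0, 0.0]
lr = 0.5

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]

for step in range(200):
    probs = softmax(logits)
    i = random.choices(range(3), weights=probs)[0]  # sample a program
    r = reward[programs[i]]                         # execute, observe reward
    baseline = sum(p * reward[programs[j]] for j, p in enumerate(probs))
    # REINFORCE update: grad of log pi(i) w.r.t. logits is one_hot(i) - probs
    for j in range(3):
        grad = ((1.0 if j == i else 0.0) - probs[j]) * (r - baseline)
        logits[j] += lr * grad

probs = softmax(logits)
# After training, nearly all probability mass sits on the program
# that the environment rewards.
```

The baseline (expected reward under the current policy) reduces gradient variance; the same mechanism, scaled up to token-level policy gradients on a real model, is what drives the 20% → 72% jump described above.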
As API abstraction increases from raw primitives (S4) to high-level pick-and-place (S1), all models improve substantially — but the gains are most pronounced for weaker and open-source models, whose compilation rates collapse at low abstraction levels. This suggests a promising path: pair a lightweight LM for high-level planning with a visual-motor policy (e.g., a VLA) that handles low-level control, letting even smaller models achieve strong task performance through the right division of labor.
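To make the abstraction axis concrete, the sketch below contrasts the two extremes: at the S4 end the model must correctly sequence raw primitives, while at the S1 end a single high-level call hides that sequencing. The `Robot` class and method names are illustrative, not the benchmark's API:

```python
# Sketch of two API abstraction levels in the spirit of the S1-S4
# axis. Names are illustrative, not the benchmark's actual API.

class Robot:
    def __init__(self):
        self.gripper_closed = False
        self.log = []

    # --- S4: raw primitives the model must sequence itself ---
    def move_to(self, pose):
        self.log.append(("move_to", pose))

    def close_gripper(self):
        self.gripper_closed = True
        self.log.append(("close_gripper",))

    def open_gripper(self):
        self.gripper_closed = False
        self.log.append(("open_gripper",))

    # --- S1: high-level skill built from those primitives ---
    def pick_and_place(self, obj_pose, target_pose):
        self.move_to(obj_pose)
        self.close_gripper()
        self.move_to(target_pose)
        self.open_gripper()

robot = Robot()
# At the S1 level, one semantic step is one call; the four
# underlying primitive actions are handled for the model.
robot.pick_and_place("cube_pose", "plate_pose")
```

A weak model only has to get the one-line S1 program right; at S4 it must emit all four primitive calls in the correct order, which is where compilation rates collapse — and why delegating the S4 layer to a visual-motor policy is attractive.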