Edge Rescue: The First Autonomous Benchmarking Platform for Off-the-Shelf VLA Policies
Inspiration
It started with a simple question: "How good is this model, actually?"
We'd been handed a DGX Spark, a robot arm, and 36 hours. The obvious move was to grab an off-the-shelf Vision-Language-Action (VLA) model, load it up, and watch the robot do something cool. So we did. We pulled Pi 0.5 from Physical Intelligence — a 4-billion parameter flow-matching policy that generates continuous robot trajectories from language and vision. We also grabbed SmolVLA from HuggingFace — a compact discretized model that tokenizes actions.
Both had impressive demo videos. Both claimed strong performance on manipulation tasks.
But when we actually tried to run them on the same robot, doing the same task, we realized something: there is no standardized way to know if a VLA model works. Every model ships with cherry-picked demo footage. Nobody publishes failure rates. Nobody compares models head-to-head on the same hardware. If you want to know whether Pi 0.5 or SmolVLA is better at stacking cubes, you have to load each one, physically watch every attempt, and manually decide whether it succeeded.
That's not engineering. That's vibes.
Then we thought about where this actually matters. In disaster response — collapsed buildings after earthquakes, infrastructure failures, hazmat scenarios — robots need to manipulate objects in unstructured environments with zero margin for error. Before you deploy a VLA policy to move rubble in a collapsed building, you need to know it works. Not from a demo video. From rigorous, autonomous, reproducible evaluation.
So we built the evaluation platform itself.
What It Does
Edge Rescue is an autonomous benchmarking platform that can evaluate any VLA policy against real-world manipulation tasks — with an independent VLM judge, on edge hardware, with no cloud dependency.
The core loop:
- A user sends a natural language goal — "stack the orange cubes"
- A Vision-Language Model (Cosmos-Reason1-7B) analyzes the scene via camera and decomposes the goal into an ordered plan of subtasks
- The VLA under test executes each subtask — driving the real robot arm
- The VLM independently verifies each step by comparing before/after camera images: "Did this action actually happen?"
- If verification fails, the system retries (up to $N$ times), then replans from the current state — considering what succeeded and what failed
- After all steps complete, a final VLM check verifies the overall goal is achieved. If not, it loops back and replans from scratch
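The loop above can be sketched in a few lines. This is an illustrative skeleton, not the project's actual planner code: the `vlm`, `vla`, and `camera` objects and their method names (`plan`, `verify`, `replan`, `goal_achieved`) are stand-ins for the real ROS2 nodes.

```python
def run_mission(goal, vlm, vla, camera, max_retries=3):
    plan = vlm.plan(goal, camera.capture())          # VLM decomposes the goal
    completed = []
    while True:
        replanned = False
        for subtask in plan:
            ok, reason = False, ""
            for _ in range(max_retries):             # the N-retry budget
                before = camera.capture()
                vla.execute(subtask)                 # VLA under test drives the arm
                after = camera.capture()
                ok, reason = vlm.verify(subtask, before, after)
                if ok:
                    break
            if not ok:                               # retries exhausted: replan
                plan = vlm.replan(goal, completed, subtask, reason,
                                  camera.capture())
                replanned = True
                break
            completed.append(subtask)
        if replanned:
            continue
        if vlm.goal_achieved(goal, camera.capture()):
            return True                              # final VLM goal check passed
        plan = vlm.plan(goal, camera.capture())      # goal check failed: from scratch
        completed = []
```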
The key insight: the VLM that evaluates is not the VLA being tested. It's an independent judge. This is what makes it a benchmarking platform rather than just a demo rig.
We tested two fundamentally different VLA architectures through the same pipeline:
| | Pi 0.5 | SmolVLA |
|---|---|---|
| Action space | Continuous (flow-matching) | Discrete (tokenized) |
| Output | Full trajectory chunks ($\sim$50 actions) | Single action per inference |
| Inference | ~8.5s via OpenPI websocket | Direct PyTorch forward pass |
| Compute | DGX Spark (128GB unified) | RTX 3070 (8GB VRAM) |
| Format | OpenPI (JAX/flax) | HuggingFace/LeRobot native |
Same robot. Same cameras. Same tasks. Same judge. Different everything else.
How We Built It
Architecture
The system runs across two machines with five concurrent services, orchestrated over ROS2 Jazzy:
DGX Spark (128GB unified memory, aarch64):
- Cosmos-Reason1-7B — VLM planner and judge, served via llama.cpp (GGUF Q8_0) on port 8080
- OpenPI Pi 0.5 Server — VLA inference via websocket on port 8001, serving the `pi05_so101` checkpoint
- VLM Planner Node — ROS2 node implementing the full plan-execute-verify-replan state machine
- Pi 0.5 Executor Node — ROS2 node bridging LeRobot robot control with the OpenPI client
- Web Frontend — Next.js/React dashboard with live camera feed, plan visualization, and planner output log
Bowman (RTX 3070, 8GB VRAM):
- SmolVLA Executor Node — same ROS2 interface, different model, different inference stack
- HTTP/SSE Bridge — REST + MJPEG streaming for alternative access
- Isaac Sim — physics simulation for structural validation
The ROS2 Contract
The secret to making model-swapping trivial is the topic contract:
```
/mission/goal                 → Planner receives high-level goal
/plan/full                    → Planner broadcasts the full plan (JSON)
/subtask/current              → Planner dispatches one subtask at a time
/subtask/status               → Executor reports success/fail
/planner/output               → Structured log events for the UI
/cam0/image_raw/compressed    → Camera feed (10 Hz JPEG)
```
To swap VLA models, you only change the executor node. The planner, the judge, the UI, and the camera pipeline are completely unchanged. The Pi 0.5 executor is 300 lines of Python. The SmolVLA executor is 270 lines. Same interface, radically different internals.
The VLM Judge
Cosmos-Reason1-7B performs four distinct evaluation roles:
- Scene Understanding — analyzes the workspace image to generate a feasible plan
- Step Verification — compares before/after images for each subtask: "Did the gripper actually make contact with the cube?"
- Failure Diagnosis — when verification fails, explains why: "The arm overshot the target position"
- Goal Completion — after all steps, checks if the overall goal is truly achieved
Every before/after image pair is automatically logged to ~/planner_logs/{timestamp}/ with descriptive filenames. One evaluation run of 13 steps produced 53 timestamped images — a full visual audit trail.
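For step verification, a before/after query can be assembled against llama.cpp's OpenAI-compatible chat endpoint (`/v1/chat/completions`, here on port 8080). The helper below is a sketch: the prompt wording and function name are ours, and inlining JPEGs as base64 data URLs follows the OpenAI vision message format.

```python
import base64

def build_verify_request(subtask, before_jpeg: bytes, after_jpeg: bytes):
    """Build the JSON body for a before/after verification query."""
    def img(b):  # inline JPEG as a data URL, per the OpenAI vision API
        return {"type": "image_url",
                "image_url": {"url": "data:image/jpeg;base64,"
                                     + base64.b64encode(b).decode()}}
    return {
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": f"First image: before. Second image: after. "
                         f"Did this subtask succeed: '{subtask}'? "
                         f"Answer as JSON: {{\"success\": bool, \"reason\": str}}"},
                img(before_jpeg),
                img(after_jpeg),
            ],
        }],
        "temperature": 0.0,  # deterministic judging
    }
```

The body would then be POSTed to `http://<dgx-spark>:8080/v1/chat/completions`.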
The Closed Loop
The planner operates as a state machine:
$$\text{IDLE} \rightarrow \text{PLANNING} \rightarrow \text{EXECUTING} \rightarrow \text{VERIFYING} \rightarrow \begin{cases} \text{next step} \\ \text{RETRY} \\ \text{REPLAN} \end{cases} \rightarrow \text{GOAL\_CHECK} \rightarrow \text{IDLE}$$
Crucially, replanning is context-aware: the VLM receives the list of completed steps, the failed subtask, the failure reason, and the current scene image. It doesn't start from scratch — it builds on what already worked.
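A context-aware replan prompt might be assembled like this. The exact wording and function name are illustrative, not the project's actual prompt:

```python
def build_replan_prompt(goal, completed, failed, reason):
    """Give the VLM what worked, what failed and why, so the new plan
    continues from the current state instead of starting over."""
    done = "\n".join(f"- [done] {s}" for s in completed) or "- (none)"
    return (
        f"Goal: {goal}\n"
        f"Completed steps:\n{done}\n"
        f"Failed step: {failed}\n"
        f"Failure reason: {reason}\n"
        "Given the attached current scene image, produce a revised plan "
        "as a JSON list of subtasks, building on the completed steps."
    )
```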
Challenges We Faced
The OpenPI Format Problem (3 AM Discovery)
The Pi 0.5 checkpoint (felixmayor/pi05_so101_orange_cube) is in OpenPI format — JAX/flax metadata, no config.json. LeRobot 0.4.3 has a PI05Policy class but it expects LeRobot-format checkpoints. It cannot load OpenPI-format weights.
We had to:
- Write custom SO-101 input/output transforms (`so101_policy.py`) to map our camera names and joint states to the model's expected format
- Patch the transformers library for OpenPI's PyTorch mode: `cp -r src/openpi/models_pytorch/transformers_replace/* .venv/.../transformers/`
- Discover at 3 AM that `openpi-client` requires `numpy<2.0.0`, which conflicted with half our stack
- Manage two separate Python environments: OpenPI uses `uv` with Python 3.11, while LeRobot runs on Python 3.12
The Serial Port Contention Problem
LeRobot owns the SO-ARM101's serial port for motor control. The cameras are also initialized through LeRobot. But the VLM planner needs camera frames for verification, and the web frontend needs them for live display.
You can't have multiple processes fighting over a USB serial bus. Our solution: a camera republisher thread inside the executor node that grabs frames from LeRobot's observation dictionary and publishes them to ROS2 at 10 Hz, protected by a threading.Lock mutex to prevent serial port contention between camera reads and motor commands.
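The republisher pattern looks roughly like the sketch below. The `robot` object and `publish` callback stand in for the real LeRobot interface and ROS2 publisher; the point is that a single lock serializes camera reads and motor writes so only one thread touches the device handles at a time.

```python
import threading
import time

class CameraRepublisher:
    def __init__(self, robot, publish, hz=10):
        self.robot, self.publish = robot, publish
        self.period = 1.0 / hz
        self.lock = threading.Lock()       # shared with send_action below
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._loop, daemon=True)

    def start(self):
        self._thread.start()

    def stop(self):
        self._stop.set()
        self._thread.join()

    def _loop(self):
        while not self._stop.is_set():
            with self.lock:                # exclude concurrent motor writes
                obs = self.robot.get_observation()
            self.publish(obs["cam0"])      # e.g. compressed JPEG out to ROS2
            time.sleep(self.period)

    def send_action(self, action):
        with self.lock:                    # same lock: no bus contention
            self.robot.send_action(action)
```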
The Action Space Bridge
Pi 0.5 returns action chunks — approximately 50 continuous joint positions per inference call. SmolVLA returns single discrete actions — one tokenized step per forward pass. The executor abstraction handles both: Pi 0.5's executor loops over the returned chunk at 50 Hz (matching training frequency), while SmolVLA's executor calls inference on every step.
Both publish the same success/fail string to /subtask/status. The planner doesn't know or care which model is running.
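The two execution styles can be modeled with a shared `run(subtask) -> bool` surface. This is a simplified sketch, not the real executor nodes: `infer` and `apply_action` are stand-ins for model inference and motor commands, and the step budget for the single-action case is an assumption.

```python
import time

class ChunkExecutor:
    """Pi 0.5-style: one inference call yields a trajectory chunk,
    replayed at the control rate the policy was trained at."""
    def __init__(self, infer, apply_action, hz=50):
        self.infer, self.apply_action = infer, apply_action
        self.period = 1.0 / hz

    def run(self, subtask):
        for action in self.infer(subtask):  # whole chunk (~50 actions)
            self.apply_action(action)
            time.sleep(self.period)         # pace playback at ~hz
        return True

class StepExecutor:
    """SmolVLA-style: one inference call per action, up to a step budget."""
    def __init__(self, infer, apply_action, max_steps=100):
        self.infer, self.apply_action = infer, apply_action
        self.max_steps = max_steps

    def run(self, subtask):
        for _ in range(self.max_steps):
            self.apply_action(self.infer(subtask))
        return True
```

Either class slots into the same ROS2 executor node; the planner only ever sees the success/fail string.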
VLM JSON Reliability
Cosmos-Reason1 sometimes wraps its JSON output in markdown code fences, or adds conversational preamble before the JSON object. We built a robust fallback parser:
```python
import json

def _parse_json(self, text):
    """Parse VLM output that may be wrapped in code fences or preamble."""
    try:
        return json.loads(text)  # Try direct parse first
    except json.JSONDecodeError:
        pass
    # Fall back: extract the outermost {...} substring
    start = text.find('{')
    end = text.rfind('}')
    if start != -1 and end > start:
        return json.loads(text[start:end + 1])
    raise ValueError(f"No JSON object found in VLM output: {text[:80]!r}")
```
This handles every prompt type (plan, verify, replan, goal_check) with zero parse failures across all our evaluation runs.
What We Learned
Off-the-shelf VLAs are much more fragile than their demos suggest. Pi 0.5 generates beautifully smooth trajectories but overshoots on precise placement. SmolVLA reaches for objects reliably but its discretized actions lose fine motor control at the quantization boundary. Neither model's demo video would tell you this.
An independent VLM judge changes everything. When the model evaluating success is separate from the model generating actions, you get honest signal. The VLM caught failures that a human observer might miss in real-time — subtle cases where the gripper closed 2mm too early, or where an object shifted but didn't actually reach the target.
The ROS2 abstraction layer was the highest-leverage decision we made. By defining a clean topic contract up front, swapping between Pi 0.5 and SmolVLA became a one-line change in the launch script. Every minute we spent on that interface saved hours of integration work later.
128GB of unified memory is a superpower. Running a 7B VLM and a 4B VLA simultaneously on the same device — with camera streams and a web server — would be impossible on consumer hardware. The DGX Spark's unified memory architecture meant we never had to choose between model quality and system complexity.
Evaluation infrastructure is a prerequisite to trust. Before you deploy a VLA to move rubble in a collapsed building, you benchmark it moving cubes on a table. Same platform. Same evaluation loop. Same independent judge. The only thing that changes is the stakes.
What's Next
The platform is model-agnostic by design. We tested two VLAs, but the architecture supports any policy that can receive a text prompt and camera images and return joint-space actions. The verification loop, the replanning logic, and the visual audit trail work regardless of what's generating the actions.
The immediate extensions:
- Quantitative comparison dashboards — success rates, retry counts, and replanning frequency across models and tasks
- Isaac Sim integration — physics-validated stress testing before real execution (structural stability, collision prediction)
- Multi-robot evaluation — run the same task on multiple arms simultaneously to measure consistency
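As a sketch of the first extension, the per-run records already captured in the planner logs could be rolled up into per-model, per-task metrics. The record schema here is hypothetical; the real fields would come from the audit trail:

```python
from collections import defaultdict

def summarize(runs):
    """runs: list of dicts like {'model': ..., 'task': ..., 'success': bool,
    'retries': int, 'replans': int} -> summary keyed by (model, task)."""
    acc = defaultdict(lambda: {'n': 0, 'succ': 0, 'retries': 0, 'replans': 0})
    for r in runs:
        s = acc[(r['model'], r['task'])]
        s['n'] += 1
        s['succ'] += int(r['success'])
        s['retries'] += r['retries']
        s['replans'] += r['replans']
    return {k: {'success_rate': v['succ'] / v['n'],
                'avg_retries': v['retries'] / v['n'],
                'avg_replans': v['replans'] / v['n']}
            for k, v in acc.items()}
```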
The long-term vision: a standardized benchmark suite for embodied AI, running entirely on edge hardware, that gives robotics teams the same confidence in their VLA policies that software teams get from CI/CD test suites.
Built With
- NVIDIA DGX Spark (128GB unified memory) — VLM + VLA inference
- NVIDIA Cosmos-Reason1-7B — vision-language model for planning and evaluation
- Pi 0.5 (Physical Intelligence, via OpenPI) — continuous VLA policy
- SmolVLA (HuggingFace) — discrete VLA policy
- ROS2 Jazzy — middleware and topic-based orchestration
- LeRobot — robot control and camera integration
- llama.cpp — efficient VLM serving (GGUF Q8_0)
- SO-ARM101 — 6-DOF robot arm
- React / Next.js / Tailwind — real-time monitoring frontend
- NVIDIA Isaac Sim — physics simulation
- Python, PyTorch, JAX, OpenCV, roslib.js