OpenSwarm, an Extensible Swarm Orchestration Framework
Inspiration
As physical AI systems scale, coordinating multiple agents in dynamic environments remains brittle and hardware-dependent. Most multi-robot systems rely on distributed intelligence, where each robot must reason independently — an approach that increases cost and complexity and limits scalability.
We wanted to explore a different model: what if intelligence lived in one centralized "queen" that observes the environment from above, reasons globally, and orchestrates simple, stateless worker robots?
OpenSwarm was built to be that orchestration layer.
What it does
OpenSwarm is a reusable, extensible framework for centralized multi-agent physical coordination.
A central intelligence system, the queen, maintains a global world model from an overhead perspective. It receives high-level user commands (natural language), decomposes them into structured subtasks using large language models, assigns roles optimally, and continuously replans when the environment changes.
Worker robots are intentionally simple and expendable. They do not reason. They execute movement and positioning commands from the queen via a standardized actions interface.
Extensible World Architecture
OpenSwarm uses a modular "world" system where each environment is self-contained with its own:
- `init.md` - defines the world structure, agents, and capabilities
- `actions.py` - implements available actions (`move_to`, `collect`, `extinguish`, etc.)
The queen automatically adapts to each world by reading its init document and available actions, building its own interpretation of the world, its state, and its goals. This makes it trivially easy to add new environments without modifying core orchestration logic.
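A minimal sketch of how this kind of world loading could work — the directory layout, file names, and `load_world` helper here are illustrative assumptions, not the actual OpenSwarm code:

```python
import importlib.util

def load_world(world_dir):
    """Read a world's init document and load its actions module, so the
    orchestrator can discover what the world supports without any changes
    to core logic. (Hypothetical sketch; paths and names are assumptions.)"""
    with open(f"{world_dir}/init.md") as f:
        init_doc = f.read()

    # Load actions.py directly from the world directory.
    spec = importlib.util.spec_from_file_location(
        "world_actions", f"{world_dir}/actions.py")
    actions = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(actions)

    # Public callables are the actions this world exposes.
    available = [n for n in dir(actions)
                 if callable(getattr(actions, n)) and not n.startswith("_")]
    return init_doc, actions, available
```

With this shape, dropping a new folder containing `init.md` and `actions.py` is enough for the queen to pick up a new environment.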
Because intelligence is centralized and hardware is decoupled via the actions interface, this framework extends far beyond ground robots. For example, in natural disaster scenarios, overhead drones can act as the queen's perception layer, mapping debris fields, fire spread, or structural damage in real time. The queen can then coordinate fleets of low-cost, expendable ground machines to clear paths, deliver supplies, or stabilize hazardous zones. If an obstruction arises, the queen detects it and reallocates instantly without compromising the mission.
How we built it
Architecture Layers
1. Perception Layer
An overhead view (camera or simulation) tracks robot and object coordinates and maintains world state. Each world implements _get_state() to provide current positions, obstacles, and task-relevant information.
2. Orchestration Layer (Queen)
The queen runs a continuous poll loop that:
- Reads the world initialization document and generates a structured world model using an LLM
- Monitors a task queue (populated via user input or autonomous triggers)
- For each task, sends the current world state + available actions to an LLM
- Receives structured function calls with optimal bot assignments
- Executes calls in parallel when possible (different bots) or sequentially (same bot)
- Continuously updates world state and triggers replanning when needed
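The loop's core shape can be sketched as follows — `plan_with_llm` and `execute` stand in for the LLM planning call and the actions interface, and are assumptions for illustration:

```python
import queue

def queen_loop(task_queue, get_state, plan_with_llm, execute, max_ticks=100):
    """Minimal poll loop: drain the task queue, plan each task against the
    current world state, and execute the resulting structured calls.
    (Sketch only; the real loop also handles replanning and parallelism.)"""
    for _ in range(max_ticks):
        try:
            task = task_queue.get_nowait()
        except queue.Empty:
            return                      # nothing to do this tick
        state = get_state()             # fresh world state for every task
        for call in plan_with_llm(task, state):
            execute(call)
```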
The queen supports multiple LLM backends (Claude, GPT, Gemini) via a modular interface, with vision support for screenshot-based reasoning.
3. Execution Layer
Worker robots receive structured commands via a standardized actions interface. Each action function:
- Takes explicit parameters (positions, bot IDs, etc.)
- Writes commands to an IPC file or directly to hardware
- Returns immediately (non-blocking)
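The fire-and-forget action pattern might look like this — `make_action` and the command schema are hypothetical, shown only to illustrate the non-blocking contract:

```python
import json
import time

def make_action(name, command_sink):
    """Build a non-blocking action function: serialize the call and hand it
    to a sink (a file writer in simulation, a socket sender on hardware),
    then return immediately. (Hypothetical sketch of the pattern.)"""
    def action(bot_id, **params):
        command = {"action": name, "bot_id": bot_id,
                   "params": params, "ts": time.time()}
        command_sink(json.dumps(command))   # write to IPC file / hardware
        return command                      # no waiting on completion
    return action
```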
For physical robots, commands flow through:
- WebSocket server (multi-device coordination)
- ESP32 microcontrollers with PWM motor control
- ArUco marker tracking for real-time localization
4. Neural Pathfinding with Modal
OpenSwarm uses a neural A* pathfinder with learned heuristics for efficient collision-free navigation:
- Training: Trained a neural network (HeuristicNetV2) on thousands of pathfinding scenarios to predict optimal distance-to-goal heuristics
- Inference: Model deployed to Modal for large-scale parallel inference
- Scaling: Supports up to 50 concurrent pathfinding requests via Modal's parallel execution
The Modal deployment enables:
- Zero cold starts (min_containers=1)
- Auto-scaling under load (max_containers=10)
- Batched pathfinding for large swarms (50 bots in mimic_world)
Collision Avoidance
For large swarms, paths are grouped into collision-free waves:
- All bot paths are computed in parallel via Modal
- Paths sharing grid cells are separated into sequential waves
- Each wave moves simultaneously
- Next wave starts after longest path in previous wave completes
This enables smooth, collision-free movement of 50 bots without complex multi-agent planning.
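The wave grouping described above amounts to a greedy conflict partition — a sketch under the assumption that two paths conflict whenever they share any grid cell:

```python
def group_into_waves(paths):
    """Greedily partition bot paths into collision-free waves: a path joins
    the first wave whose occupied cells it does not touch, otherwise it
    starts a new wave. `paths` maps bot_id -> list of (x, y) grid cells."""
    waves = []  # each wave: (set of bot ids, set of occupied cells)
    for bot_id, path in paths.items():
        cells = set(path)
        for bots, occupied in waves:
            if not (cells & occupied):   # no shared cell with this wave
                bots.add(bot_id)
                occupied |= cells
                break
        else:
            waves.append(({bot_id}, cells))
    return [sorted(bots) for bots, _ in waves]
```

Bots within a wave move simultaneously; the next wave launches once the longest path in the previous wave finishes.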
Challenges we ran into
1. Power Management
One of the key challenges we faced was balancing power efficiency with performance while keeping robots compact. We initially planned to use a single 18650 battery with a voltage booster. However, this setup could not provide enough current to reliably drive both motors simultaneously.
To overcome this, we simplified the power architecture by removing the booster and implementing PWM-based motor control. This allowed us to efficiently manage power delivery while maintaining reliable performance and meeting size constraints.
2. Large-Scale Pathfinding
Computing collision-free paths for 50 bots in real-time was initially too slow with traditional A*. We solved this by:
- Training a neural heuristic to accelerate A* (reduces search space by ~60%)
- Deploying to Modal for GPU acceleration
- Implementing parallel path computation with wave-based collision avoidance
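The neural-heuristic idea plugs cleanly into standard A*: the heuristic is just a callable, so a learned predictor can replace Manhattan distance without touching the search. A minimal grid A* in that shape (not the production pathfinder):

```python
import heapq

def manhattan(a, b):
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def a_star(start, goal, walls, heuristic, size=10):
    """A* on a 4-connected grid with unit step cost. `heuristic(node, goal)`
    can be Manhattan distance or a learned distance-to-goal predictor."""
    open_heap = [(heuristic(start, goal), start)]
    g = {start: 0}
    parent = {start: None}
    while open_heap:
        _, node = heapq.heappop(open_heap)
        if node == goal:                     # reconstruct path
            path = []
            while node is not None:
                path.append(node)
                node = parent[node]
            return path[::-1]
        x, y = node
        for nxt in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            if nxt in walls or not (0 <= nxt[0] < size and 0 <= nxt[1] < size):
                continue
            new_g = g[node] + 1
            if new_g < g.get(nxt, float("inf")):
                g[nxt] = new_g
                parent[nxt] = node
                heapq.heappush(open_heap, (new_g + heuristic(nxt, goal), nxt))
    return None
```

A tighter learned heuristic expands fewer nodes than Manhattan distance while the search code stays identical.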
3. Hardware Abstraction
Making the same actions interface work for both simulation and physical robots required careful design:
- Simulated bots use file-based IPC with instant state updates
- Physical bots use WebSocket communication with real-time camera tracking
- Both implement the same `move_to(target, bot_id)` interface
- Main orchestration code remains identical across worlds
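The shared-interface idea can be sketched as two interchangeable backends — the class names and the message format are illustrative assumptions:

```python
class SimulatedBackend:
    """File/IPC-style backend: state updates are instant in simulation."""
    def __init__(self):
        self.positions = {}

    def move_to(self, target, bot_id):
        self.positions[bot_id] = tuple(target)   # instant, no physics

class PhysicalBackend:
    """Would send the same command to a real bot (e.g. over WebSocket to an
    ESP32); stubbed here to record outbound messages for illustration."""
    def __init__(self, send):
        self.send = send

    def move_to(self, target, bot_id):
        self.send({"cmd": "move_to", "bot": bot_id, "target": list(target)})
```

Because both expose the same `move_to(target, bot_id)` call, the queen never needs to know which backend it is driving.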
Accomplishments that we're proud of
Extensible architecture - Adding a new world requires only 3 files (init.md, actions.py, simulation.py). The queen adapts automatically.
Large-scale coordination - Demonstrated real-time coordination of 50 bots with collision-free movement using Modal for GPU-accelerated pathfinding.
Hardware abstraction - Same orchestration code controls both simulated and physical robots via a standardized actions interface.
Multi-modal reasoning - Queen processes natural language commands, visual input (screenshots), and structured world state to make decisions.
Neural pathfinding - Trained and deployed a learned heuristic that accelerates A* by 3-5× over a traditional Manhattan-distance heuristic.
What we learned
Separating intelligence from execution dramatically simplifies scaling. When the queen maintains a unified world model, coordination becomes a systems problem rather than a robotics problem. Adding more bots doesn't increase complexity; it just increases parallelism.
LLMs are surprisingly good at spatial reasoning when given visual context (screenshots) and structured state information. The queen can reason about optimal bot assignments, formation control, and priority allocation without hand-coded heuristics.
Parallel/scalable pathfinding changes what's possible. Modal's deployment infrastructure let us scale from 2-bot demos to 50-bot swarms without rewriting pathfinding logic. The same neural model runs locally during development and on GPU in production.
What's next for OpenSwarm
In the long term, we see applications for OpenSwarm in warehouse logistics, disaster response, construction automation, satellite coordination, and more. We'd love to expand on OpenSwarm's feature set and continue to make its hardware more practical and accessible.


