OpenSwarm, an Extensible Swarm Orchestration Framework
Inspiration
As physical AI systems scale, coordinating multiple agents in dynamic environments remains brittle and hardware-dependent. Most multi-robot systems rely on distributed intelligence, where each robot must reason independently — an approach that increases cost and complexity and limits scalability.
We wanted to explore a different model: what if intelligence lived in one centralized "queen" that observes the environment from above, reasons globally, and orchestrates simple, stateless worker robots?
OpenSwarm was built to be that orchestration layer.
What it does
OpenSwarm is a reusable, extensible framework for centralized multi-agent physical coordination.
A central intelligence system, the queen, maintains a global world model from an overhead perspective. It receives high-level user commands (natural language), decomposes them into structured subtasks using large language models, assigns roles optimally, and continuously replans when the environment changes.
Worker robots are intentionally simple and expendable. They do not reason. They execute movement and positioning commands from the queen via a standardized actions interface.
Extensible World Architecture
OpenSwarm uses a modular "world" system where each environment is self-contained with its own:
- `init.md` - defines the world structure, agents, and capabilities
- `actions.py` - implements available actions (`move_to`, `collect`, `extinguish`, etc.)
The queen automatically adapts to each world by reading its init document and available actions, building its own interpretation of the world, its state, and its goals. This makes it trivially easy to add new environments without modifying core orchestration logic.
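A minimal sketch of how this kind of world loading could work — the directory layout, file names, and `load_world` helper here are illustrative assumptions, not the actual OpenSwarm code:

```python
import importlib.util

def load_world(world_dir):
    """Read a world's init document and load its actions module, so the
    orchestrator can discover what the world supports without any changes
    to core logic. (Hypothetical sketch; paths and names are assumptions.)"""
    with open(f"{world_dir}/init.md") as f:
        init_doc = f.read()

    # Load actions.py directly from the world directory.
    spec = importlib.util.spec_from_file_location(
        "world_actions", f"{world_dir}/actions.py")
    actions = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(actions)

    # Public callables are the actions this world exposes.
    available = [n for n in dir(actions)
                 if callable(getattr(actions, n)) and not n.startswith("_")]
    return init_doc, actions, available
```

With this shape, dropping a new folder containing `init.md` and `actions.py` is enough for the queen to pick up a new environment.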
Because intelligence is centralized and hardware is decoupled via the actions interface, this framework extends far beyond ground robots. For example, in natural disaster scenarios, overhead drones can act as the queen's perception layer, mapping debris fields, fire spread, or structural damage in real time. The queen can then coordinate fleets of low-cost, expendable ground machines to clear paths, deliver supplies, or stabilize hazardous zones. If an obstruction arises, the queen detects it and reallocates instantly without compromising the mission.
How we built it
Architecture Layers
1. Perception Layer
An overhead view (camera or simulation) tracks robot and object coordinates and maintains world state. Each world implements _get_state() to provide current positions, obstacles, and task-relevant information.
2. Orchestration Layer (Queen)
The queen runs a continuous poll loop that:
- Reads the world initialization document and generates a structured world model using an LLM
- Monitors a task queue (populated via user input or autonomous triggers)
- For each task, sends the current world state + available actions to an LLM
- Receives structured function calls with optimal bot assignments
- Executes calls in parallel when possible (different bots) or sequentially (same bot)
- Continuously updates world state and triggers replanning when needed
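The loop's core shape can be sketched as follows — `plan_with_llm` and `execute` stand in for the LLM planning call and the actions interface, and are assumptions for illustration:

```python
import queue

def queen_loop(task_queue, get_state, plan_with_llm, execute, max_ticks=100):
    """Minimal poll loop: drain the task queue, plan each task against the
    current world state, and execute the resulting structured calls.
    (Sketch only; the real loop also handles replanning and parallelism.)"""
    for _ in range(max_ticks):
        try:
            task = task_queue.get_nowait()
        except queue.Empty:
            return                      # nothing to do this tick
        state = get_state()             # fresh world state for every task
        for call in plan_with_llm(task, state):
            execute(call)
```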
The queen supports multiple LLM backends (Claude, GPT, Gemini) via a modular interface, with vision support for screenshot-based reasoning.
3. Execution Layer
Worker robots receive structured commands via a standardized actions interface. Each action function:
- Takes explicit parameters (positions, bot IDs, etc.)
- Writes commands to an IPC file or directly to hardware
- Returns immediately (non-blocking)
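The fire-and-forget action pattern might look like this — `make_action` and the command schema are hypothetical, shown only to illustrate the non-blocking contract:

```python
import json
import time

def make_action(name, command_sink):
    """Build a non-blocking action function: serialize the call and hand it
    to a sink (a file writer in simulation, a socket sender on hardware),
    then return immediately. (Hypothetical sketch of the pattern.)"""
    def action(bot_id, **params):
        command = {"action": name, "bot_id": bot_id,
                   "params": params, "ts": time.time()}
        command_sink(json.dumps(command))   # write to IPC file / hardware
        return command                      # no waiting on completion
    return action
```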
For physical robots, commands flow through:
- WebSocket server (multi-device coordination)
- ESP32 microcontrollers with PWM motor control
- ArUco marker tracking for real-time localization
4. Neural Pathfinding with Modal
OpenSwarm uses a neural A* pathfinder with learned heuristics for efficient collision-free navigation:
- Training: Trained a neural network (HeuristicNetV2) on thousands of pathfinding scenarios to predict optimal distance-to-goal heuristics
- Inference: Model deployed to Modal for large-scale parallel inference
- Scaling: Supports up to 50 concurrent pathfinding requests via Modal's parallel execution
The Modal deployment enables:
- Zero cold starts (min_containers=1)
- Auto-scaling under load (max_containers=10)
- Batched pathfinding for large swarms (50 bots in mimic_world)
Collision Avoidance
For large swarms, paths are grouped into collision-free waves:
- All bot paths are computed in parallel via Modal
- Paths sharing grid cells are separated into sequential waves
- Each wave moves simultaneously
- Next wave starts after longest path in previous wave completes
This enables smooth, collision-free movement of 50 bots without complex multi-agent planning.
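The wave grouping described above amounts to a greedy conflict partition — a sketch under the assumption that two paths conflict whenever they share any grid cell:

```python
def group_into_waves(paths):
    """Greedily partition bot paths into collision-free waves: a path joins
    the first wave whose occupied cells it does not touch, otherwise it
    starts a new wave. `paths` maps bot_id -> list of (x, y) grid cells."""
    waves = []  # each wave: (set of bot ids, set of occupied cells)
    for bot_id, path in paths.items():
        cells = set(path)
        for bots, occupied in waves:
            if not (cells & occupied):   # no shared cell with this wave
                bots.add(bot_id)
                occupied |= cells
                break
        else:
            waves.append(({bot_id}, cells))
    return [sorted(bots) for bots, _ in waves]
```

Bots within a wave move simultaneously; the next wave launches once the longest path in the previous wave finishes.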
Challenges we ran into
1. Power Management
One of the key challenges we faced was balancing power efficiency with performance while keeping robots compact. We initially planned to use a single 18650 battery with a voltage booster. However, this setup could not provide enough current to reliably drive both motors simultaneously.
To overcome this, we simplified the power architecture by removing the booster and implementing PWM-based motor control. This allowed us to efficiently manage power delivery while maintaining reliable performance and meeting size constraints.
2. Large-Scale Pathfinding
Computing collision-free paths for 50 bots in real-time was initially too slow with traditional A*. We solved this by:
- Training a neural heuristic to accelerate A* (reduces search space by ~60%)
- Deploying to Modal for GPU acceleration
- Implementing parallel path computation with wave-based collision avoidance
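The neural-heuristic idea plugs cleanly into standard A*: the heuristic is just a callable, so a learned predictor can replace Manhattan distance without touching the search. A minimal grid A* in that shape (not the production pathfinder):

```python
import heapq

def manhattan(a, b):
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def a_star(start, goal, walls, heuristic, size=10):
    """A* on a 4-connected grid with unit step cost. `heuristic(node, goal)`
    can be Manhattan distance or a learned distance-to-goal predictor."""
    open_heap = [(heuristic(start, goal), start)]
    g = {start: 0}
    parent = {start: None}
    while open_heap:
        _, node = heapq.heappop(open_heap)
        if node == goal:                     # reconstruct path
            path = []
            while node is not None:
                path.append(node)
                node = parent[node]
            return path[::-1]
        x, y = node
        for nxt in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            if nxt in walls or not (0 <= nxt[0] < size and 0 <= nxt[1] < size):
                continue
            new_g = g[node] + 1
            if new_g < g.get(nxt, float("inf")):
                g[nxt] = new_g
                parent[nxt] = node
                heapq.heappush(open_heap, (new_g + heuristic(nxt, goal), nxt))
    return None
```

A tighter learned heuristic expands fewer nodes than Manhattan distance while the search code stays identical.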
3. Hardware Abstraction
Making the same actions interface work for both simulation and physical robots required careful design:
- Simulated bots use file-based IPC with instant state updates
- Physical bots use WebSocket communication with real-time camera tracking
- Both implement the same `move_to(target, bot_id)` interface
- Main orchestration code remains identical across worlds
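The shared-interface idea can be sketched as two interchangeable backends — the class names and the message format are illustrative assumptions:

```python
class SimulatedBackend:
    """File/IPC-style backend: state updates are instant in simulation."""
    def __init__(self):
        self.positions = {}

    def move_to(self, target, bot_id):
        self.positions[bot_id] = tuple(target)   # instant, no physics

class PhysicalBackend:
    """Would send the same command to a real bot (e.g. over WebSocket to an
    ESP32); stubbed here to record outbound messages for illustration."""
    def __init__(self, send):
        self.send = send

    def move_to(self, target, bot_id):
        self.send({"cmd": "move_to", "bot": bot_id, "target": list(target)})
```

Because both expose the same `move_to(target, bot_id)` call, the queen never needs to know which backend it is driving.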
Accomplishments that we're proud of
Extensible architecture - Adding a new world requires only 3 files (init.md, actions.py, simulation.py). The queen adapts automatically.
Large-scale coordination - Demonstrated real-time coordination of 50 bots with collision-free movement using Modal for GPU-accelerated pathfinding.
Hardware abstraction - Same orchestration code controls both simulated and physical robots via a standardized actions interface.
Multi-modal reasoning - Queen processes natural language commands, visual input (screenshots), and structured world state to make decisions.
Neural pathfinding - Trained and deployed a learned heuristic that accelerates A* by 3-5× over a traditional Manhattan-distance heuristic.
What we learned
Separating intelligence from execution dramatically simplifies scaling. When the queen maintains a unified world model, coordination becomes a systems problem rather than a robotics problem. Adding more bots doesn't increase complexity; it just increases parallelism.
LLMs are surprisingly good at spatial reasoning when given visual context (screenshots) and structured state information. The queen can reason about optimal bot assignments, formation control, and priority allocation without hand-coded heuristics.
Parallel/scalable pathfinding changes what's possible. Modal's deployment infrastructure let us scale from 2-bot demos to 50-bot swarms without rewriting pathfinding logic. The same neural model runs locally during development and on GPU in production.
What's next for OpenSwarm
In the long term, we see applications for OpenSwarm in warehouse logistics, disaster response, construction automation, satellite coordination, and more. We'd love to expand on OpenSwarm's feature set and continue to make its hardware more practical and accessible.


