OpenDrone
Indoor autonomous drone navigation powered by voice, 3D vision, and spatial AI.
OpenDrone is the OpenClaw of drones: we built an indoor autonomous navigation system for drones, powered by voice, 3D vision, and spatial AI.
When we arrived at McGill from Toronto, we couldn't navigate the university buildings (and got lost three times). We noticed a common problem with indoor navigation: GPS dies the moment you walk through a door. OpenDrone fixes that by tracking positions indoors and semantically understanding its surroundings.
What it does
OpenDrone is the result of combining ElevenLabs' full voice AI stack with MongoDB Atlas' spatial data platform to create a drone that can see, remember, and talk. You walk into a building, and the drone captures video of the space and runs it through a six-stage pipeline: frame extraction, COLMAP structure-from-motion, 3D Gaussian splatting for depth, RANSAC floor alignment, and GPT-4.1 vision to label every landmark with real-world coordinates. The result is a navigable 3D map, and all of that spatial data is stored in MongoDB Atlas. From that point on, you just talk. Say "take me to the bathroom" in English, Japanese, Arabic, or any of 27 supported languages, and the drone strips out its own rotor noise, transcribes your speech, looks up the landmark in its spatial memory, computes a step-by-step flight path, and speaks turn-by-turn directions back to you, all in the language you spoke.
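The mapping side of the pipeline above can be sketched as a sequence of stages that each enrich a shared context. The stage names come from our description; the function bodies here are placeholders (the real steps shell out to COLMAP, a splatting trainer, and GPT-4.1 vision), so treat this as a shape sketch, not the actual implementation:

```python
# Sketch of the mapping pipeline as ordered stages over a shared context dict.
# Bodies are placeholders standing in for the real COLMAP / splatting / vision calls.

def extract_frames(ctx):
    ctx["frames"] = [f"frame_{i:04d}.jpg" for i in range(3)]  # placeholder frames
    return ctx

def colmap_sfm(ctx):
    # Real step: COLMAP structure-from-motion over the extracted frames.
    ctx["poses"] = [{"frame": f, "pose": None} for f in ctx["frames"]]
    return ctx

def gaussian_splatting(ctx):
    ctx["splats"] = "splats.ply"  # placeholder artifact path
    return ctx

def ransac_floor_alignment(ctx):
    ctx["floor_normal"] = (0.0, 1.0, 0.0)  # assume y-up after alignment
    return ctx

def label_landmarks(ctx):
    # Real step: GPT-4.1 vision labels landmarks with world coordinates.
    ctx["landmarks"] = [{"label": "exit sign", "xyz": (1.2, 0.0, 0.4)}]
    return ctx

PIPELINE = [extract_frames, colmap_sfm, gaussian_splatting,
            ransac_floor_alignment, label_landmarks]

def run_pipeline(video_path):
    ctx = {"video": video_path}
    for stage in PIPELINE:
        ctx = stage(ctx)
    return ctx
```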
How we built it
ElevenLabs
ElevenLabs Speech-to-Text (Scribe v2)
The cleaned audio is transcribed using ElevenLabs' Scribe v2 model, which handles multilingual input natively. The user can speak in any of 27+ supported languages: English, Japanese, Arabic, Hindi, Korean, Chinese, and more, and Scribe returns a transcript without needing to know the language in advance. A separate language detection step then identifies what was spoken so the response can be generated in the same language.
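A minimal sketch of this step, assuming the official `elevenlabs` Python SDK's `speech_to_text.convert` method; the `"scribe_v2"` model id and the attribute names on the result are assumptions, and the network call is not exercised here:

```python
# Hedged sketch of the STT step. The model id and result attributes are
# assumptions about the ElevenLabs SDK, not confirmed details.

def transcribe(client, audio_path):
    """Send cleaned audio to Scribe; return transcript plus detected language."""
    with open(audio_path, "rb") as f:
        result = client.speech_to_text.convert(file=f, model_id="scribe_v2")
    return {"text": result.text,
            "language": getattr(result, "language_code", None)}

def pick_response_language(detected, default="en"):
    """Separate detection step: respond in the language that was spoken,
    falling back to English if detection returned nothing."""
    return detected if detected else default
```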
ElevenLabs Text-to-Speech (Multilingual v2)
Navigation responses are spoken back using the eleven_multilingual_v2 model, specifically chosen over v1 for its quality on Asian languages like Japanese, Chinese, and Korean. Each supported language maps to a curated voice character (Rachel for English, Fin for Japanese, Clyde for Chinese, Antoni for Spanish), stored in a language-to-voice dictionary. For step-by-step navigation, the system makes multiple TTS calls, one per direction step ("turn left 45 degrees", "move forward 0.8 metres"), so the user hears instructions incrementally rather than as a single block.
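The language-to-voice dictionary and per-step fan-out look roughly like this. The voice names are from our mapping; the `text_to_speech.convert` call assumes the ElevenLabs SDK and is not exercised here:

```python
# Language-to-voice dictionary plus one-TTS-call-per-step fan-out.
# The client call is a hedged assumption about the ElevenLabs SDK.

LANGUAGE_VOICES = {  # language code -> curated voice name
    "en": "Rachel",
    "ja": "Fin",
    "zh": "Clyde",
    "es": "Antoni",
}

def voice_for(language, default="Rachel"):
    """Resolve a curated voice name, falling back to the English default."""
    return LANGUAGE_VOICES.get(language, default)

def speak_steps(client, steps, language, voice_ids):
    """One TTS call per navigation step so audio plays incrementally."""
    voice_id = voice_ids[voice_for(language)]
    clips = []
    for step in steps:  # e.g. "turn left 45 degrees", "move forward 0.8 metres"
        audio = client.text_to_speech.convert(
            voice_id=voice_id, text=step, model_id="eleven_multilingual_v2")
        clips.append(audio)
    return clips
```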
ElevenLabs TTS with Word-Level Timestamps
Every TTS response is generated twice: once as plain audio and once through convert_with_timestamps(), which returns word-level timing boundaries. This enables the frontend to synchronize subtitles with the spoken audio in real-time. The timestamp data is also logged for performance analysis and displayed on the live dashboard.
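The subtitle sync is driven by a small aggregation step over those timing boundaries. This sketch assumes a post-processed word-level shape (`{"word", "start", "end"}` in seconds); the raw `convert_with_timestamps()` output is character-level alignment that the backend groups into words first:

```python
# Turn word-level timestamps into (start, end, text) subtitle cues,
# grouping roughly every 4 words. The input shape is an assumption about
# what the backend produces after aggregating the SDK's alignment data.

def subtitle_cues(words, group_size=4):
    cues, group = [], []
    for w in words:
        group.append(w)
        if len(group) == group_size:
            cues.append((group[0]["start"], group[-1]["end"],
                         " ".join(g["word"] for g in group)))
            group = []
    if group:  # flush the trailing partial group
        cues.append((group[0]["start"], group[-1]["end"],
                     " ".join(g["word"] for g in group)))
    return cues
```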
ElevenLabs Voice Lookup & Caching
At runtime, voice names like "Rachel" or "Fin" need to be resolved to ElevenLabs voice IDs. The system calls the voices API once and caches all results, so subsequent TTS calls skip the lookup entirely. If a language-specific voice can't be found, it falls back to a default voice from the environment config. A single voice command triggers 4–6 ElevenLabs API calls across isolation, transcription, synthesis, and timestamping.
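The lookup-and-cache behaviour can be sketched as below; `list_voices` stands in for the ElevenLabs voices API call, and the `DEFAULT_VOICE_ID` environment variable name is hypothetical:

```python
# Voice-name -> voice-id resolution with a one-shot cache.
# `list_voices` is an injected stand-in for the ElevenLabs voices API.
import os

_voice_cache = None

def resolve_voice_id(name, list_voices):
    global _voice_cache
    if _voice_cache is None:  # first call hits the API...
        _voice_cache = {v["name"]: v["voice_id"] for v in list_voices()}
    if name in _voice_cache:  # ...subsequent calls skip the lookup entirely
        return _voice_cache[name]
    # Fallback voice from environment config (hypothetical variable name).
    return os.environ.get("DEFAULT_VOICE_ID", "default")
```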
MongoDB
MongoDB Atlas Spatial Data Ingestion
Every time the drone completes a mapping run, the entire 3D reconstruction — point clouds, camera poses, landmarks, and Gaussian splats — is ingested into MongoDB Atlas as a tagged scenario. Each scenario gets a unique timestamped ID, so multiple mapping runs coexist in the same database without overwriting each other. The ingestion pipeline transforms COLMAP camera poses into aligned world-space coordinates, generates base64 thumbnails for each frame, and upserts everything with scenario-scoped compound indexes.
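The scenario tagging and upserts work roughly as follows; collection and field names here are illustrative assumptions, and the real pipeline also stores point clouds, splats, and thumbnails:

```python
# Scenario-tagged ingestion sketch with a pymongo-style collection.
# Field and id formats are assumptions, not the exact production schema.
from datetime import datetime, timezone

def make_scenario_id():
    """Unique timestamped id so mapping runs never overwrite each other."""
    return "scan_" + datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")

def ingest_landmarks(collection, scenario_id, landmarks):
    """Upsert landmarks keyed by (scenario_id, label), matching the
    scenario-scoped compound index."""
    for lm in landmarks:
        collection.update_one(
            {"scenario_id": scenario_id, "label": lm["label"]},
            {"$set": {**lm, "scenario_id": scenario_id}},
            upsert=True,
        )
```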
MongoDB Spatial Memory (LLM Context Injection)
The voice pipeline loads landmarks from MongoDB and injects them directly into the LLM system prompt as structured spatial context. Each landmark includes its label, description, confidence score, physical coordinates, and computed relative direction from the drone's current position: for example, "exit sign is 1.23m away, forward and to the right". When a user asks "where's the bathroom?", the LLM reads this MongoDB-backed spatial memory and answers with actual distances and directions, not hallucinated guesses.
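The relative-direction rendering can be sketched with simple planar trigonometry. This is a simplified version (one direction word instead of combinations like "forward and to the right"), and the landmark field names are illustrative:

```python
# Render MongoDB landmark docs as spatial context for the LLM system prompt.
# Simplified: picks a single direction word; field names are assumptions.
import math

def describe_landmark(lm, drone_pos, drone_heading_deg=0.0):
    """One context line: distance + relative direction from the drone."""
    dx = lm["x"] - drone_pos[0]
    dz = lm["z"] - drone_pos[1]  # treat (x, z) as the floor plane
    dist = math.hypot(dx, dz)
    # Bearing of the landmark relative to the drone's heading, in (-180, 180].
    angle = math.degrees(math.atan2(dx, dz)) - drone_heading_deg
    angle = (angle + 180) % 360 - 180
    side = ("forward" if abs(angle) < 45
            else "to the right" if angle > 0 else "to the left")
    return f'{lm["label"]} is {dist:.2f}m away, {side}'

def spatial_context(landmarks, drone_pos):
    """Structured spatial context injected into the LLM system prompt."""
    lines = [describe_landmark(lm, drone_pos) for lm in landmarks]
    return "Known landmarks:\n" + "\n".join(lines)
```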
MongoDB Atlas Search (Full-Text, Fuzzy, and Vector)
Landmarks are indexed with a compound text index on label and description, enabling full-text search with fuzzy matching directly through Atlas Search. During ingestion, every landmark's label and description are also embedded into 1536-dimensional vectors using Azure OpenAI's embedding model and stored in the landmark document. At query time, the user's voice transcript is embedded and compared against all landmark embeddings via MongoDB Atlas Vector Search. This is how "take me to the snack machine" resolves to the "vending machine" landmark: it matches on meaning, not exact keywords. The search endpoint supports text, vector, and hybrid modes.
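The vector mode boils down to a `$vectorSearch` aggregation stage. In this sketch the index name, embedding field path, and `numCandidates` value are assumptions, and the 1536-dim query vector would come from the Azure OpenAI embedding of the transcript:

```python
# Atlas Vector Search aggregation pipeline sketch.
# Index name, field path, and numCandidates are illustrative assumptions.

def vector_search_pipeline(query_vector, scenario_id, limit=5):
    return [
        {"$vectorSearch": {
            "index": "landmark_vectors",   # assumed search index name
            "path": "embedding",           # assumed embedding field
            "queryVector": query_vector,   # 1536-dim transcript embedding
            "numCandidates": 100,
            "limit": limit,
            "filter": {"scenario_id": scenario_id},  # scope to active scan
        }},
        {"$project": {"label": 1, "description": 1,
                      "score": {"$meta": "vectorSearchScore"}}},
    ]
```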
MongoDB GeoJSON & 2dsphere Indexing
Every landmark is stored with a GeoJSON Point geometry using its physical coordinates, and a 2dsphere geospatial index enables proximity queries. The system can answer "what's near me?" by querying MongoDB's geospatial engine rather than computing distances in application code.
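A "what's near me?" lookup is then a single `$near` filter against the 2dsphere index. The field name `geometry` is an assumption; per our storage scheme the coordinates are the landmark's physical map coordinates stored as a GeoJSON Point, and `$maxDistance` is in metres for 2dsphere queries:

```python
# Proximity query sketch against the 2dsphere-indexed GeoJSON field.
# Field name "geometry" is an assumption about the document schema.

def near_query(x, y, max_distance_m=2.0):
    """MongoDB filter for landmarks within max_distance_m of (x, y)."""
    return {
        "geometry": {
            "$near": {
                "$geometry": {"type": "Point", "coordinates": [x, y]},
                "$maxDistance": max_distance_m,  # metres for 2dsphere
            }
        }
    }
```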
MongoDB Flight Session Logging
Every drone flight is tracked as a session in MongoDB, with individual events logged as timestamped documents. Voice commands log the full transcript, LLM reply, drone position, per-stage timing breakdown, detected language, voice ID, matched landmark, and API call count. Navigation events log start and target positions. Waypoints log each movement step with pitch and roll commands. This creates a complete audit trail of every interaction during a flight.
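The event logging reduces to one timestamped document per event; the field names here mirror the write-up, but the exact schema is an assumption:

```python
# Per-event flight session logging sketch (pymongo-style collection).
# Payload keys mirror the write-up; exact schema is an assumption.
from datetime import datetime, timezone

def log_event(collection, session_id, event_type, **payload):
    """Insert one timestamped event document into the session audit trail."""
    doc = {
        "session_id": session_id,
        "type": event_type,  # e.g. "voice_command", "navigation", "waypoint"
        "ts": datetime.now(timezone.utc),
        **payload,           # transcript, timings, matched landmark, etc.
    }
    collection.insert_one(doc)
    return doc
```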
MongoDB Atlas Charts (Embedded Visualization)
The frontend embeds live MongoDB Atlas Charts: top-down point cloud scatter, camera position distribution, and landmark analytics, rendered directly from a spatial overlay collection that merges points, cameras, and landmarks with a type field so a single chart can color-code all three categories. Charts auto-refresh and respect the active scenario filter.
3D Mapping Pipeline (COLMAP + Gaussian Splatting + GPT-4.1 Vision)
Challenges we ran into
Getting the virtual-to-physical coordinate mapping accurate enough for actual drone navigation was the hardest part. The 3D reconstruction lives in arbitrary COLMAP units, but the drone flies in a real 1×1×1 metre box; aligning those two coordinate systems so that "go to the exit sign" translates to the correct physical pitch and roll commands required multiple calibration stages (floor alignment, scale transform, centroid shifting) and a lot of testing with the actual hardware. In the end, the physical calibration setup never fully worked, but the software pipeline itself remained solid and intact.
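The arithmetic of the scale and centroid stages is simple once the floor alignment has already put the cloud in a y-up frame; this sketch shows only those two steps, not the full calibration we actually ran:

```python
# Centroid shift + uniform scale from arbitrary COLMAP units into the
# 1x1x1 m flight box. Assumes floor alignment already happened; this is
# only the scale/centroid arithmetic, not the full calibration chain.

def calibrate(points, box_size=1.0):
    n = len(points)
    # Centroid shift: centre the reconstruction at the origin.
    cx = sum(p[0] for p in points) / n
    cy = sum(p[1] for p in points) / n
    cz = sum(p[2] for p in points) / n
    centred = [(p[0] - cx, p[1] - cy, p[2] - cz) for p in points]
    # Uniform scale: widest axis span maps onto the box edge length.
    extent = max(max(abs(c) for c in p) for p in centred) * 2
    s = box_size / extent
    return [(p[0] * s, p[1] * s, p[2] * s) for p in centred]
```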
Accomplishments that we're proud of
We built a system where you can walk into any building, scan it with a drone, and within minutes have a fully navigable 3D model with semantic landmarks that you can talk to in 27 languages. The entire pipeline, from raw video to spoken multilingual navigation, runs end to end without manual annotation or pre-mapped floor plans.
The spatial memory system is something we're particularly proud of. By storing the 3D map in MongoDB and injecting it as context into every LLM call, the drone actually understands where things are relative to where it is and can describe directions in natural language.
Built With
- apple-depth-pro
- azure-openai
- colmap
- elevenlabs
- fastapi
- javascript
- mongodb
- opencv
- python
- ransac
- three.js
- uvicorn