Inspiration
Traditional smart doorbells show you video and let you talk, but you still have to stop what you're doing, pull out your phone, and figure out who's at the door. We wanted a doorbell that could handle the entire interaction autonomously — greet visitors, verify deliveries against your actual orders, check if someone has an appointment on your calendar, and only bother you when it truly needs your input. The Gemini Live API's ability to process audio and video simultaneously in real-time made this possible for the first time.
What it does
DoorGemini acts as a building concierge at your front door:
- Greets visitors with natural, warm conversation through real-time voice (powered by Gemini's native audio model)
- Sees visitors through the phone camera and understands visual context (what they're wearing, carrying, etc.)
- Verifies deliveries by searching your Gmail for matching order confirmations
- Checks appointments against your Google Calendar when someone claims to have a meeting
- Recognizes known people from a registered faces database
- Alerts the homeowner via Telegram with a photo, summary, and urgency level
- Relays owner commands — tap "Let In", "Wait", or "Decline" on Telegram, and the AI naturally conveys your response to the visitor
- Handles suspicious visitors by capturing screenshots, escalating urgency, and politely declining without revealing any personal information
How we built it
Frontend: A single HTML page using vanilla JavaScript with WebRTC for camera/mic access and AudioWorklet nodes for low-latency audio processing. The phone acts as the doorbell hardware — camera captures JPEG frames at 5fps, microphone streams PCM audio at 16kHz.
Backend: FastAPI on Google Cloud Run, communicating with the frontend via a custom WebSocket binary framing protocol (0x01=audio, 0x02=video, 0x03=control). This keeps latency minimal by avoiding JSON encoding for media streams.
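The framing itself can be sketched in a few lines. The type bytes (0x01/0x02/0x03) are from the protocol above; the exact wire layout beyond the leading tag byte is an assumption (WebSocket messages already carry their own length, so a bare tag + payload is sufficient):

```python
# Type tags from the binary framing protocol described above.
TYPE_AUDIO, TYPE_VIDEO, TYPE_CONTROL = 0x01, 0x02, 0x03

def pack_frame(frame_type: int, payload: bytes) -> bytes:
    """Prepend a 1-byte type tag; the WebSocket message boundary carries the length."""
    return bytes([frame_type]) + payload

def unpack_frame(frame: bytes) -> tuple[int, bytes]:
    """Split an incoming binary message into (type, payload)."""
    return frame[0], frame[1:]
```

Skipping JSON (and the base64 encoding it would force on binary media) keeps per-frame overhead to a single byte.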
AI Core: The Gemini Live API (gemini-2.5-flash-native-audio-preview) via the google-genai SDK. We use client.aio.live.connect() for a persistent bidirectional WebSocket session with real-time audio/video input and audio output. The model handles voice activity detection, barge-in (visitors can interrupt the AI mid-sentence), and function calling — all natively.
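The session loop looks roughly like the sketch below. The callback names, config dict, and MIME types are illustrative assumptions based on the google-genai SDK's Live API surface, not the project's exact code:

```python
async def run_doorbell_session(api_key: str, get_audio_chunk, get_jpeg_frame, play_audio):
    """Sketch of a persistent Live API session: stream mic audio and camera
    frames upstream, play the model's audio replies as they arrive."""
    from google import genai  # google-genai SDK, imported lazily for this sketch
    from google.genai import types

    client = genai.Client(api_key=api_key)
    config = {"response_modalities": ["AUDIO"]}  # assumed minimal config

    async with client.aio.live.connect(
        model="gemini-2.5-flash-native-audio-preview", config=config
    ) as session:
        # Upstream: 16 kHz PCM mic audio and JPEG camera frames.
        await session.send_realtime_input(
            audio=types.Blob(data=get_audio_chunk(), mime_type="audio/pcm;rate=16000")
        )
        await session.send_realtime_input(
            video=types.Blob(data=get_jpeg_frame(), mime_type="image/jpeg")
        )
        # Downstream: play the model's 24 kHz audio output as it streams in.
        async for message in session.receive():
            if message.data:
                play_audio(message.data)
```

In the real app the upstream sends run continuously in background tasks rather than once per session, but the connect/send/receive shape is the same.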
Tool Calling: Five tools that Gemini invokes autonomously based on conversation context:
- `check_gmail_orders` — Gmail API (OAuth2, run_in_executor for non-blocking)
- `check_calendar` — Google Calendar API (OAuth2, KST timezone)
- `check_known_faces` — JSON database lookup
- `send_telegram_alert` — Telegram Bot API with photo attachment and inline action buttons
- `capture_screenshot` — captures the latest video frame from memory
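One of these declarations plus the dispatch glue might look like the following; the schema fields and handler signatures are assumptions for illustration, not the project's actual code:

```python
# Hypothetical function declaration for one of the five tools.
CHECK_GMAIL_ORDERS = {
    "name": "check_gmail_orders",
    "description": "Search the owner's Gmail for an order confirmation "
                   "matching what the visitor claims to be delivering.",
    "parameters": {
        "type": "object",
        "properties": {"sender_company": {"type": "string"}},
        "required": ["sender_company"],
    },
}

def dispatch_tool(name: str, args: dict, handlers: dict) -> dict:
    """Route a tool call from the Live session to its handler and
    return the result dict that gets sent back as the tool response."""
    handler = handlers.get(name)
    if handler is None:
        return {"error": f"unknown tool: {name}"}
    return handler(**args)
```

Gemini decides when to invoke each tool from conversation context; the backend only supplies the declarations and executes the calls it receives.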
Infrastructure: Fully automated with Terraform (Cloud Run, Artifact Registry, Secret Manager, Cloud Storage, IAM) and shell scripts. One command (./scripts/setup.sh) provisions everything from scratch.
Challenges we ran into
- Gemini 1008 "policy violation" errors: The preview model doesn't support `context_window_compression`, and passing an empty string as the `session_resumption` handle also triggers the error. We had to strip unsupported config and only set resumption when a valid handle exists.
- Slow/deep AI voice: We initially created one AudioContext at 16kHz for both capture and playback. But Gemini outputs audio at 24kHz — playing 24kHz audio through a 16kHz context produces audio at 2/3 speed. Fixed by using separate AudioContext instances (16kHz capture, 24kHz playback).
- Google API calls blocking the event loop: The synchronous `googleapiclient` library blocked the async event loop during Gmail/Calendar checks, which prevented WebSocket keepalive pings and caused connection drops. Wrapping calls in `asyncio.run_in_executor()` was essential.
- Telegram webhook filtering: An old webhook had `allowed_updates: ["callback_query"]` left over, silently blocking message updates. Debugging this required checking webhook info via the Telegram API.
- Audio/video key naming: The Gemini SDK requires `audio=` and `video=` as separate keyword arguments to `send_realtime_input()`. Using the deprecated `media=` key silently fails. This cost hours of debugging.
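The event-loop fix is a small wrapper: hand the synchronous call to the default thread pool so the loop stays free to service WebSocket pings. The function name is ours; the pattern is standard asyncio:

```python
import asyncio

async def fetch_without_blocking(blocking_call, *args):
    """Run a synchronous googleapiclient-style call in the default
    thread pool instead of on the event loop itself."""
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(None, blocking_call, *args)
```

Usage is a one-line change at each call site, e.g. `await fetch_without_blocking(service.users().messages().list(userId="me").execute)` instead of calling `.execute()` directly.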
Accomplishments that we're proud of
Fully autonomous visitor handling: The AI conducts entire conversations, verifies identities through multiple sources, and only involves the homeowner when needed — all in real-time voice.
Real Google API integration: Not mock data — it actually searches your Gmail inbox for matching orders and checks your real Google Calendar for appointments.
Sub-second latency: Binary WebSocket framing, AudioWorklet processing, and Cloud Run session affinity keep the conversation feeling natural, not like talking to a chatbot.
One-command deployment: `./scripts/setup.sh` takes you from zero to a fully deployed, publicly accessible doorbell on GCP with Terraform-managed infrastructure.
Owner command relay: The Telegram → webhook → Gemini pipeline means the homeowner taps a button and the AI naturally says "Great news, Kyuhee says come on in!" — the visitor never knows they're talking to an AI intermediary.
What we learned
- The Gemini Live API is remarkably capable for real-time multimodal interaction, but the preview model has undocumented constraints that require careful discovery (no context compression, strict parameter validation).
- Binary WebSocket protocols massively outperform JSON-based approaches for real-time audio/video streaming.
- Running synchronous Google API clients inside async frameworks requires deliberate thread pool isolation — a single blocking call can cascade into connection failures.
- AudioWorklet is the right primitive for real-time audio in browsers, but sample rate management across capture and playback contexts requires explicit attention.
- Terraform + shell scripts provide a practical middle ground for hackathon infrastructure — reproducible enough for demos, flexible enough for rapid iteration.
What's next for DoorGemini
- Persistent face recognition: Use camera frames to build visual embeddings, so the system recognizes returning visitors without them needing to state their name.
- Multi-language support: Gemini's native audio model supports multiple languages — detect visitor language automatically and switch conversation language in real-time.
- Smart home integration: Connect to Google Home / Matter devices to actually unlock doors, turn on lights, or activate intercoms based on owner commands.
- Conversation history dashboard: A web UI showing past visitor interactions, transcripts, screenshots, and owner decisions — useful for reviewing who came by while you were away.
- Hardware doorbell form factor: Replace the phone with a dedicated device (Raspberry Pi + camera + speaker + mic) for permanent installation.