Inspiration

Traditional smart doorbells show you video and let you talk, but you still have to stop what you're doing, pull out your phone, and figure out who's at the door. We wanted a doorbell that could handle the entire interaction autonomously — greet visitors, verify deliveries against your actual orders, check if someone has an appointment on your calendar, and only bother you when it truly needs your input. The Gemini Live API's ability to process audio and video simultaneously in real time made this possible for the first time.

What it does

DoorGemini acts as a building concierge at your front door:

  • Greets visitors with natural, warm conversation through real-time voice (powered by Gemini's native audio model)
  • Sees visitors through the phone camera and understands visual context (what they're wearing, carrying, etc.)
  • Verifies deliveries by searching your Gmail for matching order confirmations
  • Checks appointments against your Google Calendar when someone claims to have a meeting
  • Recognizes known people from a registered faces database
  • Alerts the homeowner via Telegram with a photo, summary, and urgency level
  • Relays owner commands — tap "Let In", "Wait", or "Decline" on Telegram, and the AI naturally conveys your response to the visitor
  • Handles suspicious visitors by capturing screenshots, escalating urgency, and politely declining without revealing any personal information

How we built it

Frontend: A single HTML page using vanilla JavaScript with WebRTC for camera/mic access and AudioWorklet nodes for low-latency audio processing. The phone acts as the doorbell hardware — camera captures JPEG frames at 5 fps, microphone streams PCM audio at 16 kHz.

Backend: FastAPI on Google Cloud Run, communicating with the frontend via a custom WebSocket binary framing protocol (0x01=audio, 0x02=video, 0x03=control). This keeps latency minimal by avoiding JSON encoding for media streams.
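The framing layer itself is tiny — one type byte in front of the raw payload. A minimal Python sketch (helper and constant names are ours, not from the repo):

```python
# One-byte type tags for the binary WebSocket protocol described above.
AUDIO, VIDEO, CONTROL = 0x01, 0x02, 0x03

def pack_frame(frame_type: int, payload: bytes) -> bytes:
    """Prefix the raw payload with a single type byte -- no JSON, no base64."""
    return bytes([frame_type]) + payload

def unpack_frame(frame: bytes) -> tuple[int, bytes]:
    """Split a received binary WebSocket message into (type, payload)."""
    return frame[0], frame[1:]
```

Because media bytes are forwarded untouched, the server never pays for base64 expansion or JSON parsing on the hot path.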

AI Core: The Gemini Live API (gemini-2.5-flash-native-audio-preview) via the google-genai SDK. We use client.aio.live.connect() for a persistent bidirectional WebSocket session with real-time audio/video input and audio output. The model handles voice activity detection, barge-in (visitors can interrupt the AI mid-sentence), and function calling — all natively.

Tool Calling: Five tools that Gemini invokes autonomously based on conversation context:

  • check_gmail_orders — Gmail API (OAuth2, run_in_executor for non-blocking)
  • check_calendar — Google Calendar API (OAuth2, KST timezone)
  • check_known_faces — JSON database lookup
  • send_telegram_alert — Telegram Bot API with photo attachment and inline action buttons
  • capture_screenshot — Captures the latest video frame from memory
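As one illustration, two of the five declarations might be registered roughly like this — the names match the tools above, but the parameter schemas here are our guesses, not the project's actual definitions:

```python
# Hypothetical subset of the tool schema passed to the Live session config.
DOORBELL_TOOLS = [{
    "function_declarations": [
        {
            "name": "check_gmail_orders",
            "description": "Search the owner's Gmail for a matching order confirmation.",
            "parameters": {
                "type": "OBJECT",
                "properties": {"merchant": {"type": "STRING"}},
                "required": ["merchant"],
            },
        },
        {
            "name": "send_telegram_alert",
            "description": "Notify the owner with a photo, summary, and urgency level.",
            "parameters": {
                "type": "OBJECT",
                "properties": {
                    "summary": {"type": "STRING"},
                    "urgency": {"type": "STRING", "enum": ["low", "medium", "high"]},
                },
                "required": ["summary", "urgency"],
            },
        },
    ],
}]
```

Gemini decides on its own when to call these — a visitor saying "I have a package from Amazon" is enough to trigger check_gmail_orders without any hand-written routing logic.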

Infrastructure: Fully automated with Terraform (Cloud Run, Artifact Registry, Secret Manager, Cloud Storage, IAM) and shell scripts. One command (./scripts/setup.sh) provisions everything from scratch.

Challenges we ran into

  • Gemini 1008 "policy violation" errors: The preview model doesn't support context_window_compression, and passing an empty string for session_resumption handle also triggers the error. We had to strip unsupported config and only set resumption when a valid handle exists.

  • Slow/deep AI voice: We initially created one AudioContext at 16kHz for both capture and playback. But Gemini outputs audio at 24kHz — playing 24kHz through a 16kHz context produces audio at 2/3 speed. Fixed by using separate AudioContext instances (16kHz capture, 24kHz playback).
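The 2/3 figure falls straight out of the sample rates:

```python
CAPTURE_RATE = 16_000     # Hz: AudioContext used for mic capture
GEMINI_OUT_RATE = 24_000  # Hz: the model's native audio output

# A 16 kHz context plays one sample every 1/16000 s, so each second of
# 24 kHz audio (24,000 samples) takes 1.5 s to drain -- 2/3 speed,
# which also drops the pitch and makes the voice sound deep.
playback_speed = CAPTURE_RATE / GEMINI_OUT_RATE
```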

  • Google API calls blocking the event loop: The synchronous googleapiclient library blocked the async event loop during Gmail/Calendar checks, which prevented WebSocket keepalive pings and caused connection drops. Wrapping calls in asyncio.run_in_executor() was essential.
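The pattern is standard asyncio; a self-contained sketch with a stand-in for the blocking Gmail call (function names are illustrative):

```python
import asyncio
import time

def fetch_orders_blocking(query: str) -> list[str]:
    # Stand-in for the synchronous googleapiclient Gmail search.
    time.sleep(0.1)
    return [f"order matching {query}"]

async def check_gmail_orders(query: str) -> list[str]:
    loop = asyncio.get_running_loop()
    # Run the blocking client in the default thread pool so the event
    # loop keeps servicing WebSocket keepalive pings in the meantime.
    return await loop.run_in_executor(None, fetch_orders_blocking, query)

results = asyncio.run(check_gmail_orders("AirPods"))
```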

  • Telegram webhook filtering: An old webhook had allowed_updates: ["callback_query"] left over, silently blocking message updates. Debugging this required checking webhook info via the Telegram API.
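The symptom is easy to test for once you know where to look; a small helper (names ours) that interprets a getWebhookInfo response:

```python
def webhook_receives_messages(info: dict) -> bool:
    """True if a Telegram getWebhookInfo response still admits 'message' updates.

    Per the Bot API, an absent or empty allowed_updates list means the
    default update set (which includes 'message'); an explicit list
    silently drops every type not named in it.
    """
    allowed = info.get("result", {}).get("allowed_updates")
    return not allowed or "message" in allowed

# The leftover webhook that bit us looked like this:
stale = {"result": {"allowed_updates": ["callback_query"]}}
```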

  • Audio/video key naming: The Gemini SDK requires audio= and video= as separate keyword arguments to send_realtime_input(). Using the deprecated media= key silently fails. This cost hours of debugging.

Accomplishments that we're proud of

  • Fully autonomous visitor handling: The AI conducts entire conversations, verifies identities through multiple sources, and only involves the homeowner when needed — all in real-time voice.

  • Real Google API integration: Not mock data — it actually searches your Gmail inbox for matching orders and checks your real Google Calendar for appointments.

  • Sub-second latency: Binary WebSocket framing, AudioWorklet processing, and Cloud Run session affinity keep the conversation feeling natural, not like talking to a chatbot.

  • One-command deployment: ./scripts/setup.sh takes you from zero to a fully deployed, publicly accessible doorbell on GCP with Terraform-managed infrastructure.

  • Owner command relay: The Telegram → webhook → Gemini pipeline means the homeowner taps a button and the AI naturally says "Great news, Kyuhee says come on in!" — the visitor never knows they're talking to an AI intermediary.

What we learned

  • The Gemini Live API is remarkably capable for real-time multimodal interaction, but the preview model has undocumented constraints that require careful discovery (no context compression, strict parameter validation).
  • Binary WebSocket protocols massively outperform JSON-based approaches for real-time audio/video streaming.
  • Running synchronous Google API clients inside async frameworks requires deliberate thread pool isolation — a single blocking call can cascade into connection failures.
  • AudioWorklet is the right primitive for real-time audio in browsers, but sample rate management across capture and playback contexts requires explicit attention.
  • Terraform + shell scripts provide a practical middle ground for hackathon infrastructure — reproducible enough for demos, flexible enough for rapid iteration.

What's next for DoorGemini

  • Persistent face recognition: Use camera frames to build visual embeddings, so the system recognizes returning visitors without them needing to state their name.
  • Multi-language support: Gemini's native audio model supports multiple languages — detect visitor language automatically and switch conversation language in real time.
  • Smart home integration: Connect to Google Home / Matter devices to actually unlock doors, turn on lights, or activate intercoms based on owner commands.
  • Conversation history dashboard: A web UI showing past visitor interactions, transcripts, screenshots, and owner decisions — useful for reviewing who came by while you were away.
  • Hardware doorbell form factor: Replace the phone with a dedicated device (Raspberry Pi + camera + speaker + mic) for permanent installation.

Built With

  • cloudrun
  • fastapi
  • gcp
  • gcs
  • geminiliveapi
  • gmail-&-calendar-apis
  • google-genai-sdk
  • python
  • telegram-bot-api
  • terraform
  • webrtc