Your agents can read text and static images. But the real world is live, continuous, and always changing. To operate with real context, your agent needs real-time access to video calls, camera feeds, screen recordings, and live internet streams. VideoDB is the perception layer that lets agents see, hear, remember, and act on continuous media.

Most AI development focuses on text and static images, but video remains a significant hurdle because of its density and lack of structure. VideoDB turns raw pixel data into structured context that agents can query, reason about, and act upon in real time.

For agents to move beyond text boxes and interact with the physical or digital world via screens and cameras, they need a way to parse continuous visual and auditory data. VideoDB provides this through a specialized database that indexes video at the scene level - making it possible for an agent to “recall” specific events or “see” real-time occurrences without excessive compute costs.

Quickstart

Give your agent perception in 5 minutes

Core Concepts

Understand the platform architecture

How It Works

The platform operates through three stages: See, Understand, and Act.
Stage        What Happens
See          Capture SDK or live stream integration takes in media from files, desktops, or cameras
Understand   Build specialized indexes for transcripts, visual scenes, or custom prompts
Act          Query, search, edit, and export - agents can generate summaries or clips based on findings
Rather than merely storing video files, the platform indexes frames and audio to support semantic retrieval. This allows an agent to ask for a specific moment in a continuous stream without downloading or processing the entire file.

The architecture sits above transport protocols and below the reasoning engine. This separation means you can use VideoDB with any Large Language Model or Large Video Model. By consolidating transcription, frame extraction, vector indexing, and video playback into a single platform, VideoDB addresses the high total cost of ownership typically associated with video AI.
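The scene-level retrieval idea can be illustrated with a toy in-memory index. This is a conceptual sketch in plain Python, not the VideoDB API: the embeddings, timestamps, and `search` helper are all made up for illustration.

```python
import math

# Toy scene index: one embedding per time range, so a query can match a
# moment directly instead of scanning the whole file.
scene_index = {
    "00:00-00:30": [0.9, 0.1, 0.0],  # e.g. "presenter shares a slide"
    "00:30-01:00": [0.1, 0.8, 0.1],  # e.g. "dog walks through the frame"
    "01:00-01:30": [0.0, 0.2, 0.9],  # e.g. "whiteboard diagram close-up"
}

def cosine(a, b):
    # Cosine similarity between two embedding vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def search(query_embedding):
    # Return the timestamp range of the best-matching scene.
    return max(scene_index, key=lambda ts: cosine(scene_index[ts], query_embedding))

print(search([0.05, 0.9, 0.05]))  # → 00:30-01:00
```

In the real platform this lookup happens server-side over indexed frames and audio, so the agent retrieves only the matching moment rather than the entire stream.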

Skills: Native Agent Experiences

Since VideoDB handles server-side video processing, indexing, and retrieval, developers can use skills to create agent workflows that feel native to their environment. Skills give agents like Claude Code and Codex structured perception primitives - capture, search, edit, stream - without writing infrastructure code.
npx skills add video-db/skills

What You Can Build

Desktop Agents

Stream screen, mic, and camera. Get real-time context about what the user is doing and saying. Call.md →

Video Search

Search across hours of meetings, lectures, or archives. Get timestamped moments with playable evidence. Multimodal Search →

Real-time Monitoring

Connect RTSP cameras and drones. Detect events as they happen. Trigger alerts and automations. Intrusion Detection →

Media Automation

Compose videos with code. Generate voice, music, and images. Export to any format. Faceless Video Creator →

Agent Skills

Add real-time perception to coding assistants and autonomous agents. Screen capture, audio indexing, and searchable context. Agent Skills →

Browse All Examples

Explore examples across AI Copilots, Video Search, Live Intelligence, Content Factory, and more

Example: Real-time Alerting

import videodb

conn = videodb.connect()

# See: Get an active stream (from desktop capture or RTSP)
rtstream = conn.get_rtstream("rts-abc123")

# Understand: Create indexes on the live stream
visual_index = rtstream.index_visuals(prompt="Describe what the user is doing")
audio_index = rtstream.index_audio(prompt="Extract key decisions and action items")

# Act: Create an event and attach an alert
event = conn.create_event(
    event_prompt="Detect when someone mentions a deadline or due date"
)
alert = audio_index.create_alert(
    webhook_url="https://your-backend.com/webhooks/deadline-mentioned"
)

# Real-time events arrive via WebSocket or webhook:
# {
#   "channel": "alert",
#   "timestamp": "2026-02-11T12:18:00.968810+00:00",
#   "rtstream_id": "rts-xxx",
#   "rtstream_name": "Meeting",
#   "data": {
#     "event_id": "event-77aae6b981970542",
#     "label": "objection",
#     "triggered": true,
#     "confidence": 0.9,
#     "start": 1770812246.3445818,
#     "end": 1770812277.3488276
#   }
# }
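On the receiving side, your backend only needs to parse this payload and decide whether to act. A minimal sketch, assuming the field names in the example payload above (`handle_alert` and the confidence threshold are illustrative, not part of the SDK):

```python
import json

# Example payload, mirroring the shape delivered to the webhook above.
payload = json.loads("""{
  "channel": "alert",
  "rtstream_id": "rts-xxx",
  "rtstream_name": "Meeting",
  "data": {"event_id": "event-77aae6b981970542", "label": "objection",
           "triggered": true, "confidence": 0.9,
           "start": 1770812246.34, "end": 1770812277.35}
}""")

def handle_alert(msg):
    # Act only on high-confidence triggered alerts; return a human-readable note.
    data = msg["data"]
    if msg["channel"] == "alert" and data["triggered"] and data["confidence"] >= 0.8:
        duration = data["end"] - data["start"]
        return f"{data['label']} detected for {duration:.0f}s on {msg['rtstream_name']}"
    return None

print(handle_alert(payload))  # → objection detected for 31s on Meeting
```

From here the same handler can fan out to alerts, tickets, or agent tool calls; the `start`/`end` timestamps let the agent fetch or clip the exact moment that triggered the event.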

Install the SDK

pip install videodb

Python SDK

GitHub, PyPI, and setup guide

Node.js SDK

npm, TypeScript, and setup guide

Philosophy

Why perception is the next frontier for AI agents.

Why AI Agents Are Blind Today

The gap between human perception and agent perception

Perception Is the Missing Layer

The stack that gives agents eyes and ears

MP4 Is the Wrong Primitive

Why video files don’t work for AI

What Episodic Memory Means for Agents

Remember experiences, not just facts