Agent 00Vision

The name's Bond. James Bond...

Inspiration

Physical security and compliance monitoring is a $130B+ market running on human eyeballs. Security guards watch camera feeds for 8+ hours, with attention dropping drastically within a few hours. Compliance audits are manual, infrequent, and retrospective. Every new rule requires retraining staff and hoping humans remember. We saw an opportunity to put AI-powered compliance monitoring directly in the user's hands, with the option to keep all data local.

With increasingly centralized compute, especially for AI applications, consumers risk exposing sensitive personal information to data breaches and marketing. Companies and other organizations face even stricter security and data-storage requirements, often legally enforced. As edge AI hardware and open-source models become more capable, we saw the potential to tailor local AI solutions to specific needs in areas like security and privacy.

What it does

Agent 00Vision is an AI-powered video compliance monitoring platform that lets users define any compliance policy in plain English, point it at any camera (live webcam or uploaded video), and get structured, audit-ready compliance reports automatically.

  • A Vision Language Model (VLM) performs inference-time, user-directed identification of people, objects, and actions in video frames
  • A Large Language Model (LLM) takes agentic action based on user-defined compliance rulesets, evaluating observations against policies and generating structured verdicts
  • A dual-mode compliance system (Incidents vs. Checklists) prevents alert fatigue while maintaining safety standards
  • An optional on-premise deployment mode using NVIDIA DGX Spark ensures video data never leaves the building
  • Email notifications alert stakeholders remotely and log violations

No model training. No computer vision expertise. Write the rule, point the camera, get the report.
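The "structured verdicts" the LLM produces can be sketched as a small schema. Below is a minimal stdlib version of the validate-or-retry idea (the project itself validates with Pydantic; the field names here are our illustration, not the actual schema):

```python
import json
from dataclasses import dataclass

ALLOWED_STATUSES = {"compliant", "violation", "uncertain"}

@dataclass
class Verdict:
    rule_id: str      # which user-defined rule was evaluated
    status: str       # one of ALLOWED_STATUSES
    evidence: str     # what the VLM observed in the frame
    timestamp_s: float

def parse_verdict(raw: str) -> Verdict:
    """Validate the LLM's JSON output; raise ValueError so the caller can retry with a stricter prompt."""
    v = Verdict(**json.loads(raw))
    if v.status not in ALLOWED_STATUSES:
        raise ValueError(f"unexpected status: {v.status!r}")
    return v

raw = '{"rule_id": "ppe-hard-hat", "status": "violation", "evidence": "worker without helmet near crane", "timestamp_s": 12.4}'
print(parse_verdict(raw).status)  # violation
```

Rejecting malformed output with an exception (rather than silently accepting it) is what makes the retry-with-stricter-prompt loop possible.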

How we built it

  • Backend: Python 3.11 with FastAPI, serving REST and WebSocket endpoints for real-time monitoring
  • Frontend: React 19 + TypeScript + Vite + Tailwind CSS for a responsive policy-builder and live monitoring dashboard
  • Cloud AI pipeline: OpenAI GPT-4o Vision for scene understanding, GPT-4o-mini for policy evaluation, and Whisper for audio transcription
  • Local AI pipeline on NVIDIA DGX Spark:
    • Deployed Cosmos-Reason2 8B with vLLM as the Vision Language Model
    • Deployed Nemotron-3-Nano 30B with Ollama as the compliance evaluation LLM
    • Split VRAM between the two models for concurrent inference
  • Smart frame sampling: Built a dual-metric change detection engine (histogram correlation + structural similarity) with a threaded pipeline, cutting the number of frames sent to cloud APIs
  • Async processing: Celery + Redis for background video analysis with real-time progress updates via WebSocket
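The dual-metric change detector can be sketched in plain NumPy. The real pipeline uses OpenCV histogram comparison and windowed SSIM; the single-window SSIM and the thresholds below are simplifying assumptions for illustration:

```python
import numpy as np

def hist_correlation(a: np.ndarray, b: np.ndarray, bins: int = 32) -> float:
    """Pearson correlation between grayscale intensity histograms (analogue of cv2.HISTCMP_CORREL)."""
    ha, _ = np.histogram(a, bins=bins, range=(0, 255))
    hb, _ = np.histogram(b, bins=bins, range=(0, 255))
    return float(np.corrcoef(ha, hb)[0, 1])

def global_ssim(a: np.ndarray, b: np.ndarray) -> float:
    """Single-window SSIM over the whole frame; a crude stand-in for skimage's sliding-window SSIM."""
    a, b = a.astype(np.float64), b.astype(np.float64)
    c1, c2 = (0.01 * 255) ** 2, (0.03 * 255) ** 2
    mu_a, mu_b = a.mean(), b.mean()
    cov = ((a - mu_a) * (b - mu_b)).mean()
    return ((2 * mu_a * mu_b + c1) * (2 * cov + c2)) / (
        (mu_a**2 + mu_b**2 + c1) * (a.var() + b.var() + c2))

def frame_changed(prev, curr, hist_thresh=0.95, ssim_thresh=0.90) -> bool:
    """Forward the frame to the VLM only when either metric reports a significant change."""
    return hist_correlation(prev, curr) < hist_thresh or global_ssim(prev, curr) < ssim_thresh
```

Using two complementary metrics matters: histogram correlation is cheap but blind to spatial rearrangement, while SSIM catches structural changes that leave the intensity distribution intact.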

Challenges we ran into

  • Hallucinations in both the VLM and LLM led to inaccurate compliance verdicts. We mitigated this with structured JSON output schemas, Pydantic validation, and retry logic with stricter prompts on parse failure
  • Time constraints limited our ability to pretrain or fine-tune models on compliance-specific data, so we relied on prompt engineering and few-shot examples
  • Splitting VRAM between Cosmos and Nemotron on the DGX Spark required careful configuration to keep both models loaded simultaneously
  • Rate limiting from cloud API providers required exponential backoff with jitter and usage tracking to stay within quotas during demo-heavy periods
  • Video seeking performance in compressed formats (H.264/H.265) was initially very slow. We switched from cap.set(CAP_PROP_POS_FRAMES) to sequential cap.read() with frame counting for a 5-10x speedup
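The backoff scheme from the rate-limiting bullet can be sketched as a small retry helper. The function name, the stand-in exception type, and the default parameters are our illustration, not the project's API:

```python
import random
import time

def call_with_backoff(fn, max_retries=5, base_delay=1.0, cap=30.0, sleep=time.sleep):
    """Retry fn() on rate-limit-style failures with exponential backoff plus full jitter."""
    for attempt in range(max_retries):
        try:
            return fn()
        except RuntimeError:  # stand-in for the provider's rate-limit exception
            if attempt == max_retries - 1:
                raise
            # full jitter: sleep a random amount in [0, min(cap, base * 2^attempt)],
            # so simultaneous clients don't retry in lockstep
            sleep(random.uniform(0, min(cap, base_delay * 2 ** attempt)))
```

The jitter is the important part: without it, every client that got rate-limited at the same moment retries at the same moment, recreating the spike.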

Accomplishments that we're proud of

  • Building a complete end-to-end pipeline, from raw video to structured compliance reports, in a single hackathon
  • Achieving a significant reduction in API calls through our intelligent frame sampling, making the product economically viable at scale
  • Implementing dual-mode compliance (Incidents vs. Checklists) with temporal memory so the system remembers what it has already verified
  • Successfully running inference on the NVIDIA DGX Spark with two models sharing VRAM, proving that local deployment is feasible today

What we learned

  • Local inference is the future for light consumer workloads. Privacy-sensitive applications don't need to send data to the cloud when edge hardware can handle it
  • NVIDIA has a great variety of open-source models suitable for deployment. Cosmos and Nemotron worked well out of the box with minimal prompt tuning
  • Transformer-based models give great flexibility by not limiting you to an ultra-specific use case. Traditional CV-based CNNs are hyper-specialized at the cost of generalization. VLMs can handle "any rule you can describe in English" without retraining
  • Dual-mode compliance prevents alert fatigue. Continuously re-alerting on the same compliant hard hat every 6 seconds creates noise. Checklist mode with validity periods solves this
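The validity-period idea behind checklist mode can be sketched as a small temporal cache (the window length, key format, and class name are illustrative assumptions):

```python
class ChecklistMemory:
    """Remember recently verified checklist items so the system doesn't re-alert every frame."""

    def __init__(self, validity_s: float = 300.0):
        self.validity_s = validity_s
        self._verified: dict[str, float] = {}  # rule_id -> last-verified timestamp (seconds)

    def should_check(self, rule_id: str, now_s: float) -> bool:
        """True if the item has never been verified or its validity window has expired."""
        last = self._verified.get(rule_id)
        return last is None or now_s - last > self.validity_s

    def mark_verified(self, rule_id: str, now_s: float) -> None:
        self._verified[rule_id] = now_s

mem = ChecklistMemory(validity_s=300.0)
mem.mark_verified("hard-hat", now_s=0.0)
print(mem.should_check("hard-hat", now_s=6.0))    # still within the window -> False
print(mem.should_check("hard-hat", now_s=400.0))  # window expired -> True
```

Incident mode can bypass this cache entirely, so safety-critical violations still alert immediately while routine checklist confirmations are deduplicated.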

What's next for Agent 00Vision

  • TensorRT optimization for DGX models to reduce inference latency
  • Modal.com deployment for elastic cloud scaling across multiple cameras
  • Notification channels: SMS via Twilio, Slack, and Microsoft Teams alerts for critical violations
  • Multi-camera orchestration: Dashboard for managing dozens of feeds with independent policies
  • As local inference gets cheaper, Agent 00Vision becomes a viable option for privacy-focused customers to explore wide-ranging use cases, from off-the-grid home security to endangered animal identification in the wild
