Inspiration

Analyzing video advertisements at scale is tedious and expensive. Marketers manually tag brand mentions, product categories, and emotional tone, a process that doesn't scale. We built an AI pipeline that does this automatically, turning any video ad into structured, queryable metadata in seconds.

What it does

Upload a video advertisement and the pipeline extracts:

  • Brand information: name, logo visibility, text contrast
  • Product details: product name, industry classification
  • Topic classification: 38 categories (restaurants, electronics, financial services, etc.)
  • Sentiment analysis: 30 emotional dimensions (cheerful, confident, nostalgic, etc.)

The system combines computer vision, audio transcription, and Gemini 3 to deliver structured JSON output ready for analytics dashboards or ad platform integration.

How we built it

Video Ingestion: FFmpeg extracts audio tracks; OpenCV samples frames at configurable intervals with resolution capping at 720p for efficiency.
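The sampling math can be sketched in a few lines. This is an illustrative sketch, not our exact implementation; the names (`sample_interval_s`, `cap_height`) and the default interval are assumptions:

```python
def plan_sampling(fps: float, duration_s: float, sample_interval_s: float = 1.0):
    """Return the frame indices to seek to when sampling at a fixed interval."""
    step = max(1, round(fps * sample_interval_s))
    total_frames = int(fps * duration_s)
    return list(range(0, total_frames, step))

def cap_resolution(width: int, height: int, cap_height: int = 720):
    """Scale dimensions down so height <= cap_height, preserving aspect ratio."""
    if height <= cap_height:
        return width, height
    scale = cap_height / height
    return round(width * scale), cap_height
```

For a 30 fps, 10-second clip sampled every second, this yields 10 frame indices, and a 1920×1080 source is downscaled to 1280×720 before any hashing or CLIP encoding runs.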

Scene Detection: Content-aware histogram analysis detects scene boundaries. Adaptive fallback handles edge cases: the detection threshold drops from 27 to 15, and the pipeline falls back to artificial 10-second chunks if no boundaries are found.
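The fallback chain above can be sketched as follows; `detector` stands in for the content-aware histogram pass (in practice something like PySceneDetect), and the tuple-of-thresholds shape is an assumption:

```python
import math

def detect_scenes(duration_s, detector, thresholds=(27, 15), chunk_s=10.0):
    """Try content-aware detection at progressively lower thresholds;
    if no cuts are found at any threshold, fall back to fixed chunks."""
    for t in thresholds:
        cuts = detector(t)  # list of (start_s, end_s) scene spans
        if cuts:
            return cuts
    # last resort: artificial 10-second chunks
    n = max(1, math.ceil(duration_s / chunk_s))
    return [(i * chunk_s, min((i + 1) * chunk_s, duration_s)) for i in range(n)]
```

A slow-fade ad where the detector returns nothing at both thresholds still comes out of this step with usable segment boundaries.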

Hierarchical Deduplication: Three-stage filtering eliminates redundant frames:

  1. Hash voting (pHash + dHash + wHash; a pair is flagged when at least 2 of the 3 Hamming distances fall under 8)
  2. Perceptual similarity (SSIM ≥ 0.92 or LPIPS ≤ 0.15)
  3. Semantic deduplication (CLIP ViT-B/32, cosine similarity ≥ 0.90)
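For one frame pair, the three-stage decision reduces to a short cascade. A minimal sketch, assuming the comparison metrics (imagehash Hamming distances, SSIM, LPIPS, CLIP cosine similarity) are precomputed and passed in a dict with illustrative keys:

```python
def is_near_duplicate(metrics: dict) -> bool:
    """Cascade over precomputed comparison metrics for a frame pair."""
    if sum(d <= 8 for d in metrics["hash_dists"]) >= 2:      # stage 1: hash voting
        return True
    if metrics["ssim"] >= 0.92 or metrics["lpips"] <= 0.15:  # stage 2: perceptual
        return True
    return metrics["clip_sim"] >= 0.90                       # stage 3: semantic
```

Ordering matters for cost: hashes are cheapest, so most duplicates are caught before SSIM/LPIPS or a CLIP forward pass ever runs.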

Smart Frame Selection: Non-maximum suppression (NMS) combined with K-means clustering selects representative frames. Scoring factors include video position (opening/closing), scene boundaries, and audio events. Key phrases from transcription boost frame importance by 1.5x.
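The scoring side of this step can be sketched as an additive score per frame. The specific weights and the 10% opening/closing window are illustrative assumptions, not our tuned values:

```python
def frame_score(t: float, duration: float,
                is_scene_boundary: bool, near_audio_event: bool) -> float:
    """Base importance score for a frame at time t (seconds)."""
    score = 1.0
    edge = min(t, duration - t) / duration
    if edge < 0.1:            # opening/closing frames often carry branding
        score += 0.5
    if is_scene_boundary:     # first frame of a new scene
        score += 0.5
    if near_audio_event:      # e.g. music hit or speech onset nearby
        score += 0.3
    return score
```

NMS then suppresses lower-scoring frames that are temporally close to a higher-scoring one, and K-means over CLIP embeddings ensures the survivors cover distinct visual clusters.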

Audio Analysis: Whisper ASR (base model) transcribes speech. Voice activity detection (VAD) isolates speech segments. Custom keyword detection identifies 38 promotional phrases ("sale", "limited time", "free shipping"). Rule-based classification determines audio mood and tempo.
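The keyword pass is essentially phrase matching over timestamped transcript segments. A minimal sketch, assuming segments arrive as `(start_seconds, text)` pairs and showing only 3 of the 38 phrases:

```python
import re

PROMO_PHRASES = ["sale", "limited time", "free shipping"]  # 3 of the 38

def find_key_phrases(transcript_segments):
    """Return (phrase, start_time) hits found in Whisper segments."""
    hits = []
    for start, text in transcript_segments:
        for phrase in PROMO_PHRASES:
            # whole-word, case-insensitive match
            if re.search(r"\b" + re.escape(phrase) + r"\b", text, re.IGNORECASE):
                hits.append((phrase, start))
    return hits
```

The start times attached to each hit are what later lets key phrases boost nearby frames during selection.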

Gemini 3 Extraction: Selected frames + full audio context (transcription, key phrases, mood) are sent to Gemini 3 Flash Preview. The model outputs structured JSON matching our schema for brand, product, topic, and sentiment.
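The payload assembly can be sketched as below. The part layout mirrors the Gemini API's inline-data request shape, but the prompt wording and field names here are illustrative, not our production prompt:

```python
def build_gemini_request(frames_b64, transcript, key_phrases, mood):
    """Assemble one multimodal request: prompt text + base64 JPEG frames."""
    prompt = (
        "You are analyzing a video advertisement.\n"
        f"Transcript: {transcript}\n"
        f"Key phrases: {', '.join(p for p, _ in key_phrases)}\n"
        f"Audio mood: {mood}\n"
        "Return JSON with keys: brand, product, topic_id (1-38), sentiment_id (1-30)."
    )
    parts = [{"text": prompt}]
    parts += [{"inline_data": {"mime_type": "image/jpeg", "data": b}}
              for b in frames_b64]
    return {"contents": [{"role": "user", "parts": parts}]}
```

One request carries everything: the deduplicated frames plus the full audio context, so the model sees both modalities in a single call.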

Frontend: React + TypeScript + Tailwind CSS interface with drag-and-drop upload, real-time processing status, and results visualization.

Challenges we ran into

Frame redundancy vs. API costs: Naive approaches sent hundreds of similar frames to Gemini, wasting tokens and degrading output quality. Our 3-stage deduplication achieves 90-95% reduction while preserving visual diversity across the ad.

Scene detection edge cases: Ads with slow fades or continuous motion defeated standard threshold-based detection. We implemented adaptive fallback with progressively lower thresholds and artificial chunking as a last resort.

Audio-visual synchronization: Aligning transcription timestamps with frame selection required careful coordination. Key phrase timestamps now directly boost nearby frames' importance scores.
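The alignment itself is a small operation once both sides carry timestamps. A sketch of the boost step, where the 1.0-second window is an assumption (the 1.5x multiplier is the one named above):

```python
def boost_frames(frame_times, scores, phrase_times, window_s=1.0, boost=1.5):
    """Multiply the importance of frames within window_s seconds
    of any detected key-phrase timestamp."""
    return [
        s * boost if any(abs(t - p) <= window_s for p in phrase_times) else s
        for t, s in zip(frame_times, scores)
    ]
```

So a frame sampled at 5.0s gets boosted when "limited time" is spoken at 4.8s, while frames far from any phrase keep their base score.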

Accomplishments that we're proud of

  • 90-95% frame reduction without losing visual coverage, dramatically cutting API costs
  • End-to-end processing in 10-60 seconds depending on video length
  • Audio-visual fusion that meaningfully improves extraction accuracy
  • Production-ready architecture with FastAPI backend, batch processing (1-16 workers), and configurable YAML pipelines

What we learned

Audio context is underrated. Adding transcription and key phrase detection to the Gemini prompt significantly improved brand disambiguation and sentiment classification, especially for voiceover-heavy ads where visual cues alone were ambiguous.

Hierarchical deduplication outperforms any single method. Hash-based, perceptual, and semantic approaches each catch different types of redundancy, and combining them yields the best results.

What's next

  • Redis caching for pipeline results and repeated analyses
  • Real-time processing with streaming support
  • Multi-language transcription beyond English
  • Ad platform integrations (Google Ads, Meta Ads) for campaign-level analytics
  • Model fine-tuning on ad-specific datasets for improved accuracy

Gemini Integration

Gemini 3 Flash Preview is the core intelligence layer of our pipeline. Without it, the system would only produce frames and transcripts, not structured insights.

How we use Gemini 3:

We send Gemini 3 a carefully curated payload containing deduplicated key frames (typically 5-15 images per video after 90-95% reduction) alongside rich audio context: full transcription, detected key phrases with timestamps, and mood classification. The model processes this multimodal input and returns structured JSON matching our extraction schema.

Why Gemini 3 is central:

  1. Multimodal understanding: Gemini 3 interprets visual brand elements (logos, text overlays, product shots) and correlates them with spoken content to accurately identify brands even when logos are subtle or text is brief.

  2. Structured output generation: The model reliably produces JSON conforming to our schema: brand details, product classification, topic ID (1-38), and sentiment ID (1-30). This consistency enables direct database ingestion and analytics.
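Before database ingestion we can gate on the schema with a check like the following; the exact key names are assumptions based on the fields described above:

```python
def validate_extraction(payload: dict) -> bool:
    """Reject model output that doesn't match the extraction schema."""
    try:
        return (
            isinstance(payload["brand"], dict)
            and isinstance(payload["product"], dict)
            and 1 <= int(payload["topic_id"]) <= 38
            and 1 <= int(payload["sentiment_id"]) <= 30
        )
    except (KeyError, TypeError, ValueError):
        return False
```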

  3. Contextual reasoning: By combining visual frames with transcription and detected key phrases ("50% off", "limited time"), Gemini 3 infers ad intent and emotional tone that neither modality reveals alone.

The entire extraction step (brand, product, topic, and sentiment classification) runs through a single Gemini 3 API call, making it both cost-efficient and architecturally simple.
