An automated pipeline that ingests an entire YouTube channel's video library, transcribes content via Gemini, generates embeddings, stores them in a vector database, and exposes a conversational AI agent for natural-language Q&A over the knowledge base.
This system turns any YouTube channel into a searchable, conversational knowledge base. Instead of scrubbing through hours of video to find specific information, you ask a question and get an AI-generated answer grounded in the actual video content — with retrieval-augmented generation (RAG) ensuring accuracy.
Two-phase system:
Phase 1 — Ingestion Pipeline:
- Fetches all uploaded videos from a target YouTube channel via the YouTube Data API
- Attempts to download existing captions (SBV format) for each video
- For videos without captions, transcribes the full audio using Gemini 2.5 Flash's multi-modal capabilities (feeds the YouTube URL directly)
- Cleans transcription text (removes timecodes, normalizes line breaks)
- Generates OpenAI embeddings for the transcribed content
- Stores embeddings in a Supabase vector store (pgvector) for semantic search
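The caption-cleaning step above can be sketched in a few lines. This is a minimal standalone example, not the actual n8n node logic; it assumes SBV's timecode format (`H:MM:SS.mmm,H:MM:SS.mmm` on its own line, followed by caption text):

```python
import re

def clean_sbv(raw: str) -> str:
    """Strip SBV timecode lines and collapse caption line breaks into prose."""
    kept = []
    for line in raw.splitlines():
        line = line.strip()
        # SBV timecodes look like "0:00:01.000,0:00:04.000" -- drop them
        if re.fullmatch(r"\d+:\d{2}:\d{2}\.\d{3},\d+:\d{2}:\d{2}\.\d{3}", line):
            continue
        if line:  # skip the blank separator lines too
            kept.append(line)
    # Normalize line breaks: join caption fragments into one text block
    return " ".join(kept)
```

The joined text then goes straight to the embedding step.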
Phase 2 — Conversational Agent:
- Accepts natural-language questions via a chat interface
- Queries the Supabase vector store to retrieve semantically relevant transcript segments
- Augments the AI agent's context with retrieved content (RAG)
- Generates accurate, source-grounded answers using o4-mini
- Falls back to Perplexity web search for questions outside the knowledge base
- Maintains conversation history via Postgres chat memory
```mermaid
graph TB
    subgraph "Phase 1: Ingestion"
        A[YouTube Channel ID] --> B[Fetch All Uploaded Videos]
        B --> C[For Each Video]
        C --> D{Captions Available?}
        D -->|Yes| E[Download Captions SBV]
        D -->|No| F[Gemini 2.5 Flash Transcribe]
        E --> G[Clean Text: Remove Timecodes]
        G --> H[Normalize Line Breaks]
        F --> I[Extract Transcript Text]
        I --> J[OpenAI Embeddings]
        H --> J
        J --> K[Store in Supabase pgvector]
    end
    subgraph "Phase 2: Conversational Agent"
        L[User Question] --> M[AI Agent - o4-mini]
        M --> N[Supabase Vector Store Retrieval]
        M --> O[Perplexity Web Search]
        M --> P[Postgres Chat Memory]
        N -->|Relevant Segments| M
        O -->|Web Results| M
        P -->|Conversation History| M
        M --> Q[Grounded Answer]
    end
    K -.->|Semantic Search| N
```
| Component | Technology | Purpose |
|---|---|---|
| Video Discovery | YouTube Data API v3 | Enumerate all channel uploads |
| Caption Download | YouTube Captions API | Retrieve existing SBV captions |
| Audio Transcription | Gemini 2.5 Flash (multi-modal) | Transcribe videos without captions |
| Text Processing | Regex-based cleaning pipeline | Remove timecodes, normalize formatting |
| Embeddings | OpenAI Embeddings API | Convert text chunks to vector representations |
| Vector Store | Supabase with pgvector extension | Semantic similarity search over transcripts |
| Chat Agent | o4-mini via n8n AI Agent | Conversational Q&A with tool calling |
| Web Augmentation | Perplexity (sonar-pro) | Answer questions outside the knowledge base |
| Chat Memory | Postgres (Supabase) | Persistent conversation history across sessions |
| Rate Limiting | n8n Wait nodes + batch processing | Respect API rate limits during bulk ingestion |
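The rate-limiting pattern from the table (n8n SplitInBatches + Wait) translates roughly to the sketch below. The batch size and delay are illustrative values, not the workflow's actual settings:

```python
import time

def process_in_batches(items, handler, batch_size=10, delay_seconds=1.0):
    """Process items in fixed-size batches, pausing between batches to stay
    under API rate limits (mirrors n8n's SplitInBatches + Wait nodes)."""
    batches = [items[i:i + batch_size] for i in range(0, len(items), batch_size)]
    for i, batch in enumerate(batches):
        for item in batch:
            handler(item)
        if i < len(batches) - 1:  # no need to wait after the last batch
            time.sleep(delay_seconds)
    return len(batches)
```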
Why Gemini 2.5 Flash for transcription? Gemini's multi-modal capability can process YouTube video URLs directly — no need to download audio, convert formats, or manage file storage. It handles the full pipeline from video URL to transcript text in a single API call. Using the Flash tier keeps costs low for bulk channel ingestion.
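For reference, the single-call transcription request looks roughly like this payload for Gemini's `generateContent` endpoint, with the YouTube URL passed as `file_data`. Treat the exact field names as an assumption to verify against the current API docs, and the prompt text as illustrative:

```python
def build_transcription_request(video_url: str) -> dict:
    """Build a generateContent payload asking Gemini to transcribe a YouTube
    video passed by URL (field shape assumed from the REST API; verify)."""
    return {
        "contents": [{
            "parts": [
                {"file_data": {"file_uri": video_url}},
                {"text": "Transcribe the spoken audio of this video verbatim. "
                         "Output plain text only, no timestamps."},
            ]
        }]
    }
```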
Why a dual caption strategy (download first, then transcribe)? YouTube's own captions, when available, are often higher quality than AI transcription — especially for channels that upload manually corrected captions. The pipeline tries the fast, free path first (downloading existing captions) and falls back to Gemini transcription only when necessary.
Why Supabase/pgvector instead of a dedicated vector DB? Supabase gives me a unified backend — the same Postgres instance handles the vector store, chat memory, and any future metadata. No additional infrastructure to manage. pgvector's performance is more than sufficient for this scale.
Why o4-mini for the chat agent? The retrieval step provides the factual grounding, so the chat model's job is primarily synthesis and presentation, which o4-mini handles well at lower cost than larger models. Its built-in reasoning also helps with multi-step questions that span several retrieved segments.
Why include Perplexity as a fallback tool? Not every question will be answerable from the video content alone. Perplexity gives the agent access to current web information, so it can supplement the knowledge base with real-time data when needed — while clearly distinguishing between "from the videos" and "from the web."
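In the real workflow the agent's LLM decides when to call the Perplexity tool, but the fallback logic amounts to a simple routing decision: answer from the vector store when retrieval finds a sufficiently similar chunk, otherwise go to the web. A minimal sketch, with an illustrative similarity threshold:

```python
def route_question(best_similarity: float, threshold: float = 0.75) -> str:
    """Pick the agent's tool: use the knowledge base when retrieval found a
    close transcript match, otherwise fall back to web search.
    (Threshold is illustrative; the real agent lets the LLM choose tools.)"""
    return "vector_store" if best_similarity >= threshold else "web_search"
```

Tagging the answer with the chosen route is what lets the agent distinguish "from the videos" from "from the web."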
- Multi-modal ingestion: Combines caption download, AI transcription, and text processing in a unified pipeline
- Hybrid retrieval: Vector similarity search (Supabase/pgvector) + live web search (Perplexity) as agent tools
- Persistent memory: Postgres-backed chat history enables multi-turn conversations with context
- Batch processing with rate limiting: Wait nodes prevent API throttling during bulk channel ingestion
- Fully self-hosted: Entire stack runs on personal infrastructure — n8n, Supabase, all under my control
- Hosting: Self-hosted on personal infrastructure (Coolify PaaS)
- Database: Supabase (self-hosted Postgres + pgvector)
- Orchestration: n8n workflow engine (27 nodes)
- Models: Gemini 2.5 Flash (transcription), OpenAI (embeddings), o4-mini (chat agent), Perplexity sonar-pro (web search)