podwise

Agentic RAG pipeline for YouTube transcripts: ingest videos, then ask questions and get answers with clickable timestamp links. Built with LangChain, ChromaDB, Ollama (embeddings), and Claude (Anthropic).

What it does

Ingest: Give a YouTube URL; the pipeline downloads the transcript, merges short segments into paragraphs, runs semantic chunking, embeds the chunks with Ollama (nomic-embed-text), and stores them in ChromaDB.
Ask: Ask any question; a Claude agent uses retrieval tools (search transcripts, list episodes, get context) and answers with citations in the form [Episode Title | MM:SS | link].
English and Mandarin transcripts are supported with no extra configuration; both come directly from YouTube.
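
The merge step of Ingest can be pictured roughly as follows. This is a minimal sketch, not the project's actual cleaner: the `Segment` fields mirror the `text` / `start` / `duration` values that caption data usually carries, and `merge_segments`, `_join`, and the 30-second window are names and values assumed here for illustration.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    text: str
    start: float     # seconds from the start of the video
    duration: float  # seconds


def _join(buffer: list[Segment]) -> Segment:
    # The paragraph keeps the start time of its first segment, so a citation
    # can still point at the right moment in the video.
    end = buffer[-1].start + buffer[-1].duration
    return Segment(
        text=" ".join(s.text.strip() for s in buffer),
        start=buffer[0].start,
        duration=end - buffer[0].start,
    )


def merge_segments(segments: list[Segment], window: float = 30.0) -> list[Segment]:
    """Merge consecutive caption segments into paragraphs of at most `window` seconds."""
    merged: list[Segment] = []
    buffer: list[Segment] = []
    for seg in segments:
        # Close the current paragraph if adding this segment would exceed the window.
        if buffer and (seg.start + seg.duration) - buffer[0].start > window:
            merged.append(_join(buffer))
            buffer = []
        buffer.append(seg)
    if buffer:
        merged.append(_join(buffer))
    return merged
```

Keeping the first segment's start time on each merged paragraph is what makes the MM:SS part of a citation possible later on.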

Prerequisites

Python 3.11+
Ollama (local, for embeddings): ollama.ai. After install, run:
```
ollama pull nomic-embed-text
```
Anthropic API key (for the Q&A agent): set ANTHROPIC_API_KEY in .env.

Setup

git clone <repo>
cd podwise
uv sync   # or: pip install -e ".[dev]"

cp .env.example .env
# Edit .env: set ANTHROPIC_API_KEY (required for ask). Others have defaults.

Usage

Ingest a video (transcript → chunks → ChromaDB):

uv run python -m src.index "https://www.youtube.com/watch?v=VIDEO_ID"
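
The command takes a full URL, while `youtube-transcript-api` looks transcripts up by video ID. A minimal sketch of that conversion using only the standard library is shown here; `extract_video_id` is a hypothetical helper, and the project may handle URLs differently.

```python
from urllib.parse import parse_qs, urlparse

# Hosts treated as YouTube watch pages in this sketch (an assumption, not an exhaustive list).
WATCH_HOSTS = {"youtube.com", "www.youtube.com", "m.youtube.com"}


def extract_video_id(url: str) -> str:
    """Return the video ID from a watch URL or a youtu.be short link."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    if host == "youtu.be":
        # Short links carry the ID as the path: https://youtu.be/<id>
        video_id = parsed.path.lstrip("/")
    elif host in WATCH_HOSTS:
        # Watch URLs carry the ID in the `v` query parameter.
        video_id = parse_qs(parsed.query).get("v", [""])[0]
    else:
        video_id = ""
    if not video_id:
        raise ValueError(f"Could not find a video ID in {url!r}")
    return video_id
```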

Ask a question over all ingested content:

uv run python main.py ask "What does Saining Xie say about world models?"
uv run python main.py ask "Compare what was said about sleep across episodes"

Answers are printed in the terminal with Markdown formatting and citation links.
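
A citation in the [Episode Title | MM:SS | link] form could be assembled along these lines. This is a sketch under assumptions: `format_citation` is a hypothetical helper, and the link shape (a watch URL with a `t` offset in seconds) is one common way to deep-link into a video, not necessarily the one the agent emits.

```python
def format_citation(title: str, video_id: str, start_seconds: float) -> str:
    """Build a citation of the form [Episode Title | MM:SS | link]."""
    total = int(start_seconds)  # drop fractional seconds
    minutes, seconds = divmod(total, 60)
    # The `t` query parameter asks YouTube to start playback at that offset.
    link = f"https://www.youtube.com/watch?v={video_id}&t={total}s"
    return f"[{title} | {minutes:02d}:{seconds:02d} | {link}]"
```

Minutes are not wrapped into hours here, so a chunk starting at 3725 seconds is rendered as 62:05.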

List indexed episodes (optional):

uv run python -m src.retrieval.tools list

Environment

| Variable | Default | Description |
| --- | --- | --- |
| `CHROMA_PATH` | `./data/chroma` | ChromaDB persistence directory |
| `COLLECTION_NAME` | `podwise` | Chroma collection name |
| `TRANSCRIPTS_PATH` | `./data/transcripts` | Raw transcript cache (optional) |
| `OLLAMA_BASE_URL` | `http://localhost:11434` | Ollama server (embeddings) |
| `OLLAMA_EMBED_MODEL` | `nomic-embed-text` | Embedding model |
| `ANTHROPIC_API_KEY` | (none) | Required for `ask` |
| `LLM_MODEL` | `claude-sonnet-4-20250514` | Claude model for the agent |

Project layout

podwise/
├── main.py                 # CLI: ask
├── src/
│   ├── config.py           # Env settings
│   ├── index.py            # Ingest: YouTube → clean → chunk → embed → Chroma
│   ├── ingestion/          # youtube, cleaner, chunker (semantic)
│   ├── embedding/          # Ollama embedder
│   ├── storage/            # ChromaDB wrapper
│   ├── retrieval/          # search_transcripts, get_episode_list, get_episode_context (+ LangChain tools)
│   └── agent/              # Claude + tools (ReAct-style)
├── data/chroma/            # Vector store (gitignored)
└── .env.example

Tech stack

| Layer | Choice |
| --- | --- |
| Transcripts | `youtube-transcript-api` |
| Chunking | Time-based merge + LangChain `SemanticChunker` (Ollama) |
| Embeddings | Ollama `nomic-embed-text` |
| Vector DB | ChromaDB (`langchain-chroma`) |
| LLM / Agent | Anthropic Claude via `langchain-anthropic`; tools for search/list/context |
| CLI | Typer + Rich |
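
To give a feel for what semantic chunking does, here is a deliberately simplified stand-in in plain Python. It is not LangChain's `SemanticChunker`, which has its own breakpoint logic; this sketch starts a new chunk whenever two adjacent sentences fall below a fixed cosine-similarity threshold. The `embed` callable, the function names, and the 0.5 threshold are assumptions; in the real pipeline the embeddings would come from Ollama.

```python
import math
from collections.abc import Callable, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def semantic_chunks(
    sentences: list[str],
    embed: Callable[[str], Sequence[float]],
    threshold: float = 0.5,
) -> list[str]:
    """Start a new chunk wherever adjacent sentences are less similar than `threshold`."""
    if not sentences:
        return []
    vectors = [embed(s) for s in sentences]
    chunks: list[list[str]] = [[sentences[0]]]
    # Compare each sentence with the one right before it.
    for prev, curr, sentence in zip(vectors, vectors[1:], sentences[1:]):
        if cosine_similarity(prev, curr) < threshold:
            chunks.append([])  # likely topic shift: open a new chunk
        chunks[-1].append(sentence)
    return [" ".join(chunk) for chunk in chunks]
```

Because `embed` is passed in, the function can be exercised with a small lookup table of toy vectors before wiring in a real embedding model.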
