WhatYouSaid is a state-of-the-art vectorized data hub designed to explore any knowledge domain. It transforms unstructured audio, video, files, and web content into structured, searchable intelligence using advanced AI techniques, including Speaker Diarization, Voice Recognition, and RAG (Retrieval-Augmented Generation).
- Speaker Segmentation: Automatically split audio/video files by speaker using WhisperX/Whisper for unmatched accuracy.
- Voice Recognition: Identify and label speakers across your entire knowledge base using trained voice profiles.
- Diarization Pipeline: Interactive dashboard to review, edit, and finalize transcripts and speaker assignments before indexing.
- YouTube Ecosystem: Full support for individual videos, entire playlists, or entire channels.
- Document Extractors: High-fidelity extraction from PDF, DOCX, and TXT files.
- Web Intelligence: Powerful scraping via Crawl4AI and Docling for websites and remote URLs.
- Robust Pipeline: Step-by-step progress tracking with real-time SSE notifications and full rollback support on failure.
- Hybrid Search: Combines Vector (FAISS/Weaviate/Chroma) and Keyword (BM25) search for maximum precision.
- Re-Ranking: Specialized Cross-Encoders ensure the most relevant context is always at the top.
- Pluggable Architecture: Seamlessly switch between SQL databases (SQLite/Postgres/MySQL) and Vector stores.
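To make the hybrid-search idea above concrete, here is a minimal, self-contained sketch: a toy BM25 keyword ranker fused with a (stand-in) vector ranking via reciprocal rank fusion. The documents, query, and function names are illustrative only; the actual project uses FAISS/Weaviate/Chroma for vectors and a cross-encoder for the final re-ranking step, which this sketch omits.

```python
import math
from collections import Counter

def bm25_scores(query: str, docs: list[str], k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score each doc against the query with a minimal BM25 (naive whitespace tokens)."""
    tokenized = [d.lower().split() for d in docs]
    avgdl = sum(len(t) for t in tokenized) / len(tokenized)
    df = Counter(term for tokens in tokenized for term in set(tokens))
    n = len(docs)
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = k1 * (1 - b + b * len(tokens) / avgdl)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + norm)
        scores.append(score)
    return scores

def reciprocal_rank_fusion(rankings: list[list[int]], k: int = 60) -> list[int]:
    """Merge several ranked lists of doc indices into one fused ranking."""
    fused: Counter[int] = Counter()
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            fused[doc_id] += 1.0 / (k + rank + 1)
    return [doc_id for doc_id, _ in fused.most_common()]

docs = [
    "the meeting transcript covers quarterly revenue",
    "speaker two discussed revenue growth in the meeting",
    "unrelated notes about vacation plans",
]
query = "revenue meeting"
kw_scores = bm25_scores(query, docs)
keyword_ranking = sorted(range(len(docs)), key=lambda i: -kw_scores[i])
vector_ranking = [1, 0, 2]  # stand-in for an embedding-similarity ranking
fused = reciprocal_rank_fusion([keyword_ranking, vector_ranking])
print(fused)
```

Rank fusion is one common way to merge keyword and vector hits before re-ranking; it only needs each retriever's ordering, not comparable scores.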
WhatYouSaid is powered by Python 3.12 and uses uv for high-speed dependency management.
```bash
# Clone the repository
git clone https://github.com/ericksonlopes/WhatYouSaid.git
cd WhatYouSaid

# Install dependencies (including dev groups)
uv sync --group dev
```

```bash
# Lite mode: SQLite + FAISS + Redis
docker-compose up -d

# Scalable mode: PostgreSQL + Weaviate + Redis
docker-compose --profile base up -d
```

```bash
# Start Backend (FastAPI)
python main.py

# Start Frontend (React)
cd frontend
npm install
npm run dev
```

We use Docker Profiles to keep your environment lean. Only the services you need are started.
| Component | Lite Profile (Default) | Scalable Profile (base) |
|---|---|---|
| Relational DB | SQLite (File-based) | PostgreSQL / MySQL / MariaDB |
| Vector Store | FAISS (Local) | Weaviate / ChromaDB / Qdrant |
| Task Queue | Redis | Redis (Production-ready) |
> **Tip:** Use the Scalable profile if you require high-concurrency access or plan to manage multi-gigabyte vector indexes.
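As an illustration of how profiles gate optional services, a compose file along these lines would start only Redis by default and add PostgreSQL/Weaviate under `--profile base`. This fragment is a hypothetical sketch, not the project's actual `docker-compose.yml`; service names and image tags are assumptions.

```yaml
services:
  redis:
    image: redis:7               # no profile: runs in both Lite and Scalable modes
  postgres:
    image: postgres:16
    profiles: ["base"]           # started only with: docker-compose --profile base up -d
  weaviate:
    image: semitechnologies/weaviate:1.25.0
    profiles: ["base"]
```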
The system follows a modular approach ensuring maximum testability and maintainability:
- Application Layer: Orchestrates logic via use cases and resolves background worker dependencies through a `ServiceRegistry`.
- Infrastructure Layer:
  - `extractors/`: Fetch raw content from specialized sources (Docling, YouTube, Crawl4AI).
  - `repositories/`: Persistence via SQL (SQLAlchemy) and specialized Vector clients.
  - `services/`: Core providers for embeddings, text splitting, and re-ranking.
- Presentation Layer: FastAPI-based REST API with real-time event broadcasting and a modern React dashboard.
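The registry pattern used by the Application Layer can be sketched in a few lines. This is a generic illustration of how a `ServiceRegistry` resolves worker dependencies lazily; the method names and the `vector_store` example are assumptions, not the project's actual API.

```python
from typing import Any, Callable

class ServiceRegistry:
    """Maps service names to factories so workers resolve dependencies at runtime."""

    def __init__(self) -> None:
        self._factories: dict[str, Callable[[], Any]] = {}
        self._instances: dict[str, Any] = {}

    def register(self, name: str, factory: Callable[[], Any]) -> None:
        self._factories[name] = factory

    def resolve(self, name: str) -> Any:
        # Instantiate on first use, then reuse (simple singleton scope).
        if name not in self._instances:
            self._instances[name] = self._factories[name]()
        return self._instances[name]

# Hypothetical usage: a background worker pulls its vector store when it runs,
# so swapping FAISS for Weaviate only changes the registered factory.
registry = ServiceRegistry()
registry.register("vector_store", lambda: {"backend": "faiss"})
store = registry.resolve("vector_store")
print(store["backend"])
```

Deferring construction to `resolve()` is what makes the architecture pluggable: workers depend on a name, not on a concrete SQL or vector-store client.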
Contributions are what make the open-source community such an amazing place! Please:
- Open an Issue to discuss proposed changes.
- Ensure `uv run ruff check . --fix` and `uv run mypy .` pass.
- Run all tests: `uv run pytest`.
This project is licensed under the MIT License - see the LICENSE file for details.