WhatYouSaid is a state-of-the-art vectorized data hub designed to explore any knowledge domain. It transforms unstructured audio, video, files, and web content into structured, searchable intelligence using advanced AI techniques, including Speaker Diarization, Voice Recognition, and RAG (Retrieval-Augmented Generation).
- Speaker Segmentation: Automatically split audio/video files by speaker using WhisperX/Whisper for unmatched accuracy.
- Voice Recognition: Identify and label speakers across your entire knowledge base using trained voice profiles.
- Diarization Pipeline: Interactive dashboard to review, edit, and finalize transcripts and speaker assignments before indexing.
- YouTube Ecosystem: Full support for individual videos, entire playlists, or entire channels.
- Document Extractors: High-fidelity extraction from PDF, DOCX, and TXT files.
- Web Intelligence: Powerful scraping via Crawl4AI and Docling for websites and remote URLs.
- Robust Pipeline: Step-by-step progress tracking with real-time SSE notifications and full rollback support on failure.
- Hybrid Search: Combines Vector (FAISS/Weaviate/Chroma) and Keyword (BM25) search for maximum precision.
- Re-Ranking: Specialized Cross-Encoders ensure the most relevant context is always at the top.
- Pluggable Architecture: Seamlessly switch between SQL databases (SQLite/Postgres/MySQL) and Vector stores.
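To make the hybrid-search idea above concrete, here is a minimal, self-contained sketch: a toy BM25 keyword ranker fused with a (stand-in) vector ranking via reciprocal rank fusion. The documents, query, and function names are illustrative only; the actual project uses FAISS/Weaviate/Chroma for vectors and a cross-encoder for the final re-ranking step, which this sketch omits.

```python
import math
from collections import Counter

def bm25_scores(query: str, docs: list[str], k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score each doc against the query with a minimal BM25 (naive whitespace tokens)."""
    tokenized = [d.lower().split() for d in docs]
    avgdl = sum(len(t) for t in tokenized) / len(tokenized)
    df = Counter(term for tokens in tokenized for term in set(tokens))
    n = len(docs)
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = k1 * (1 - b + b * len(tokens) / avgdl)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + norm)
        scores.append(score)
    return scores

def reciprocal_rank_fusion(rankings: list[list[int]], k: int = 60) -> list[int]:
    """Merge several ranked lists of doc indices into one fused ranking."""
    fused: Counter[int] = Counter()
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            fused[doc_id] += 1.0 / (k + rank + 1)
    return [doc_id for doc_id, _ in fused.most_common()]

docs = [
    "the meeting transcript covers quarterly revenue",
    "speaker two discussed revenue growth in the meeting",
    "unrelated notes about vacation plans",
]
query = "revenue meeting"
kw_scores = bm25_scores(query, docs)
keyword_ranking = sorted(range(len(docs)), key=lambda i: -kw_scores[i])
vector_ranking = [1, 0, 2]  # stand-in for an embedding-similarity ranking
fused = reciprocal_rank_fusion([keyword_ranking, vector_ranking])
print(fused)
```

Rank fusion is one common way to merge keyword and vector hits before re-ranking; it only needs each retriever's ordering, not comparable scores.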
WhatYouSaid is powered by Python 3.12 and uses uv for high-speed dependency management.
```bash
# Clone the repository
git clone https://github.com/ericksonlopes/WhatYouSaid.git
cd WhatYouSaid

# Install dependencies (including dev groups)
uv sync --group dev
```

```bash
# Lite mode: SQLite + FAISS + Redis
docker-compose up -d

# Scalable mode: PostgreSQL + Weaviate + Redis
docker-compose --profile base up -d
```

```bash
# Start Backend (FastAPI)
python main.py

# Start Frontend (React)
cd frontend
npm install
npm run dev
```

We use Docker Profiles to keep your environment lean. Only the services you need are started.
| Component | Lite Profile (Default) | Scalable Profile (base) |
|---|---|---|
| Relational DB | SQLite (File-based) | PostgreSQL / MySQL / MariaDB |
| Vector Store | FAISS (Local) | Weaviate / ChromaDB / Qdrant |
| Task Queue | Redis | Redis (Production-ready) |
> **Tip:** Use the Scalable profile if you require high-concurrency access or plan to manage multi-gigabyte vector indexes.
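As an illustration of how profiles gate optional services, a compose file along these lines would start only Redis by default and add PostgreSQL/Weaviate under `--profile base`. This fragment is a hypothetical sketch, not the project's actual `docker-compose.yml`; service names and image tags are assumptions.

```yaml
services:
  redis:
    image: redis:7               # no profile: runs in both Lite and Scalable modes
  postgres:
    image: postgres:16
    profiles: ["base"]           # started only with: docker-compose --profile base up -d
  weaviate:
    image: semitechnologies/weaviate:1.25.0
    profiles: ["base"]
```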
The system follows a modular approach ensuring maximum testability and maintainability:
- Application Layer: Orchestrates logic via use cases and resolves background worker dependencies through a `ServiceRegistry`.
- Infrastructure Layer:
  - `extractors/`: Fetch raw content from specialized sources (Docling, YouTube, Crawl4AI).
  - `repositories/`: Persistence via SQL (SQLAlchemy) and specialized Vector clients.
  - `services/`: Core providers for embeddings, text splitting, and re-ranking.
- Presentation Layer: FastAPI-based REST API with real-time event broadcasting and a modern React dashboard.
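The registry pattern used by the Application Layer can be sketched in a few lines. This is a generic illustration of how a `ServiceRegistry` resolves worker dependencies lazily; the method names and the `vector_store` example are assumptions, not the project's actual API.

```python
from typing import Any, Callable

class ServiceRegistry:
    """Maps service names to factories so workers resolve dependencies at runtime."""

    def __init__(self) -> None:
        self._factories: dict[str, Callable[[], Any]] = {}
        self._instances: dict[str, Any] = {}

    def register(self, name: str, factory: Callable[[], Any]) -> None:
        self._factories[name] = factory

    def resolve(self, name: str) -> Any:
        # Instantiate on first use, then reuse (simple singleton scope).
        if name not in self._instances:
            self._instances[name] = self._factories[name]()
        return self._instances[name]

# Hypothetical usage: a background worker pulls its vector store when it runs,
# so swapping FAISS for Weaviate only changes the registered factory.
registry = ServiceRegistry()
registry.register("vector_store", lambda: {"backend": "faiss"})
store = registry.resolve("vector_store")
print(store["backend"])
```

Deferring construction to `resolve()` is what makes the architecture pluggable: workers depend on a name, not on a concrete SQL or vector-store client.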
Contributions are what make the open-source community such an amazing place! Please:
- Open an Issue to discuss proposed changes.
- Ensure `uv run ruff check . --fix` and `uv run mypy .` pass.
- Run all tests: `uv run pytest`.
This project is licensed under the MIT License - see the LICENSE file for details.