ericksonlopes/WhatYouSaid

WhatYouSaid

The Vectorized Intelligence & Diarization Hub

WhatYouSaid is a vectorized data hub designed to explore any knowledge domain. It transforms unstructured audio, video, files, and web content into structured, searchable intelligence using Speaker Diarization, Voice Recognition, and RAG (Retrieval-Augmented Generation).


✨ Features

🎧 Diarization & Voice Intelligence

  • Speaker Segmentation: Automatically split audio/video files by speaker using WhisperX/Whisper.
  • Voice Recognition: Identify and label speakers across your entire knowledge base using trained voice profiles.
  • Diarization Pipeline: Interactive dashboard to review, edit, and finalize transcripts and speaker assignments before indexing.
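To make the speaker segmentation step more concrete, here is a minimal sketch of one common way to attach speaker labels to transcript segments: each segment gets the speaker whose diarization turns overlap it the most. The `Turn` and `Segment` classes and the `assign_speakers` function are hypothetical names chosen for illustration. They are not taken from the WhatYouSaid codebase and may differ from what the project actually does.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Turn:
    """A diarization turn: one speaker talking from start to end (seconds)."""
    speaker: str
    start: float
    end: float


@dataclass(frozen=True)
class Segment:
    """A transcript segment with its timestamps (seconds)."""
    text: str
    start: float
    end: float


def overlap(a_start: float, a_end: float, b_start: float, b_end: float) -> float:
    """Length of the intersection of two time intervals (0.0 if disjoint)."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def assign_speakers(segments, turns, unknown="UNKNOWN"):
    """Label each segment with the speaker that overlaps it the longest.

    Hypothetical helper: segments with no overlapping turn get `unknown`.
    """
    labeled = []
    for seg in segments:
        totals = {}
        for turn in turns:
            shared = overlap(seg.start, seg.end, turn.start, turn.end)
            if shared > 0:
                totals[turn.speaker] = totals.get(turn.speaker, 0.0) + shared
        speaker = max(totals, key=totals.get) if totals else unknown
        labeled.append((speaker, seg.text))
    return labeled


turns = [Turn("SPEAKER_00", 0.0, 4.0), Turn("SPEAKER_01", 4.0, 9.0)]
segments = [
    Segment("Hello there.", 0.5, 3.5),
    Segment("Hi, thanks for having me.", 3.8, 7.0),
    Segment("(music)", 20.0, 22.0),
]
labeled = assign_speakers(segments, turns)
```

A segment that straddles two turns goes to the speaker with the larger share, which is why a review step before indexing is still useful for borderline cases.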

📥 Multi-Source Ingestion

  • YouTube Ecosystem: Full support for individual videos, entire playlists, or entire channels.
  • Document Extractors: High-fidelity extraction from PDF, DOCX, and TXT files.
  • Web Intelligence: Powerful scraping via Crawl4AI and Docling for websites and remote URLs.
  • Robust Pipeline: Step-by-step progress tracking with real-time SSE notifications and full rollback support on failure.
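The pipeline behaviour described above (ordered steps, progress notifications over SSE, rollback on failure) can be sketched roughly as follows. This is an illustration under stated assumptions, not the project's implementation: `Step`, `sse_frame`, `run_pipeline`, and the event names (`progress`, `failed`, `rolled_back`, `completed`) are all hypothetical. Only the wire format of the frames (`event:` and `data:` lines ending in a blank line) follows the standard Server-Sent Events format.

```python
import json
from dataclasses import dataclass
from typing import Callable


@dataclass
class Step:
    """One pipeline step with a compensating action (hypothetical shape)."""
    name: str
    run: Callable[[], None]
    undo: Callable[[], None]


def sse_frame(event: str, payload: dict) -> str:
    """Format one Server-Sent Events frame: event line, data line, blank line."""
    return f"event: {event}\ndata: {json.dumps(payload)}\n\n"


def run_pipeline(steps, emit) -> bool:
    """Run steps in order; on failure, undo completed steps in reverse order.

    `emit` receives each SSE frame as a string (for example, a function that
    writes to an open HTTP response). Returns True if every step succeeded.
    """
    done = []
    for index, step in enumerate(steps, start=1):
        try:
            step.run()
        except Exception as exc:
            emit(sse_frame("failed", {"step": step.name, "error": str(exc)}))
            for finished in reversed(done):
                finished.undo()
                emit(sse_frame("rolled_back", {"step": finished.name}))
            return False
        done.append(step)
        emit(sse_frame("progress", {"step": step.name, "done": index, "total": len(steps)}))
    emit(sse_frame("completed", {"total": len(steps)}))
    return True
```

Only steps that finished are undone. The failing step is expected to clean up after itself, which keeps each `undo` simple.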

🔍 Advanced Semantic Search

  • Hybrid Search: Combining Vector (FAISS/Weaviate/Chroma) and Keyword (BM25) search for maximum precision.
  • Re-Ranking: Specialized Cross-Encoders re-score the retrieved candidates so that the most relevant context ranks first.
  • Pluggable Architecture: Seamlessly switch between SQL databases (SQLite/Postgres/MySQL) and Vector stores.
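As an illustration of how vector and keyword results can be combined, the sketch below uses Reciprocal Rank Fusion (RRF), a widely used fusion technique. This README does not state which fusion method WhatYouSaid uses, so treat RRF, the `reciprocal_rank_fusion` function, and the constant `k=60` as assumptions made for the example.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with ranks starting at 1. Ties are broken by document ID so the output
    is deterministic.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical results: one list from a vector store, one from BM25.
vector_hits = ["a", "b", "c"]
keyword_hits = ["b", "d", "a"]
fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
print(fused)  # ['b', 'a', 'd', 'c']
```

Documents found by both retrievers rise to the top, and the fused list can then be handed to a re-ranker for the final ordering.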

🚀 Quick Start

WhatYouSaid is powered by Python 3.12 and uses uv for high-speed dependency management.

1. Prerequisites

2. Environment Setup

# Clone the repository
git clone https://github.com/ericksonlopes/WhatYouSaid.git
cd WhatYouSaid

# Install dependencies (including dev groups)
uv sync --group dev

3. Spin Up Infrastructure

# Lite mode: SQLite + FAISS + Redis
docker-compose up -d

# Scalable mode: PostgreSQL + Weaviate + Redis
docker-compose --profile base up -d

4. Run Application

# Start Backend (FastAPI)
python main.py

# Start Frontend (React)
cd frontend
npm install
npm run dev

🐳 Deployment Profiles

We use Docker Profiles to keep your environment lean. Only the services you need are started.

| Component | Lite Profile (Default) | Scalable Profile (base) |
| --- | --- | --- |
| Relational DB | SQLite (File-based) | PostgreSQL / MySQL / MariaDB |
| Vector Store | FAISS (Local) | Weaviate / ChromaDB / Qdrant |
| Task Queue | Redis | Redis (Production-ready) |

Tip: Use the Scalable profile if you require high-concurrency access or plan to manage multi-gigabyte vector indexes.


πŸ—οΈ Clean Architecture

The system follows a modular approach ensuring maximum testability and maintainability:

  • Application Layer: Orchestrates logic via use cases and resolves background worker dependencies through a ServiceRegistry.
  • Infrastructure Layer:
    • extractors/: Fetch raw content from specialized sources (Docling, YouTube, Crawl4AI).
    • repositories/: Persistence via SQL (SQLAlchemy) and specialized Vector clients.
    • services/: Core providers for embeddings, text splitting, and re-ranking.
  • Presentation Layer: FastAPI-based REST API with real-time event broadcasting and a modern React dashboard.
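The role of a service registry in the Application Layer can be sketched as below. The class name `ServiceRegistry` comes from this README, but its real interface is not documented here. The `register` and `resolve` methods, the lazy caching behaviour, `FakeEmbedder`, and `index_text_use_case` are hypothetical and exist only to show the idea of resolving dependencies by name.

```python
from typing import Any, Callable


class ServiceRegistry:
    """Maps service names to factories and builds each service on first use."""

    def __init__(self) -> None:
        self._factories: dict[str, Callable[[], Any]] = {}
        self._instances: dict[str, Any] = {}

    def register(self, name: str, factory: Callable[[], Any]) -> None:
        """Register (or replace) the factory for a service name."""
        self._factories[name] = factory
        self._instances.pop(name, None)  # drop any stale cached instance

    def resolve(self, name: str) -> Any:
        """Return the cached instance, creating it on the first call."""
        if name not in self._instances:
            if name not in self._factories:
                raise KeyError(f"no service registered under {name!r}")
            self._instances[name] = self._factories[name]()
        return self._instances[name]


class FakeEmbedder:
    """Stand-in for an embedding provider from the infrastructure layer."""

    def embed(self, text: str) -> list[float]:
        return [float(len(text))]


def index_text_use_case(registry: ServiceRegistry, text: str) -> list[float]:
    """A use case that depends on a service name, not a concrete class."""
    embedder = registry.resolve("embedder")
    return embedder.embed(text)


registry = ServiceRegistry()
registry.register("embedder", FakeEmbedder)
vector = index_text_use_case(registry, "hello")
```

Because the use case only knows the name `"embedder"`, a background worker or a test can register a different implementation without touching the use case itself.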

🀝 Contributing & Quality

Contributions are what make the open-source community such an amazing place! Please:

  1. Open an Issue to discuss proposed changes.
  2. Ensure uv run ruff check . --fix and uv run mypy . pass.
  3. Run all tests: uv run pytest.
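If you prefer to run the checks above in one go, a small helper along these lines may be convenient. It is a sketch rather than part of the repository: the `run_checks` function and the `CHECKS` list are hypothetical, and only the three commands themselves come from the steps listed above.

```python
import subprocess

# The commands from the contributing steps, in the order they are listed.
CHECKS = [
    ["uv", "run", "ruff", "check", ".", "--fix"],
    ["uv", "run", "mypy", "."],
    ["uv", "run", "pytest"],
]


def run_checks(checks=CHECKS, runner=subprocess.run) -> int:
    """Run each check in order and stop at the first failure.

    Returns 0 if every command succeeded, otherwise the exit code of the
    first failing command. `runner` is injectable so the function can be
    exercised without actually invoking uv.
    """
    for command in checks:
        print("$", " ".join(command))
        result = runner(command)
        if result.returncode != 0:
            return result.returncode
    return 0
```

Calling `run_checks()` from the repository root runs linting, type checking, and the test suite, and returns a non-zero code as soon as one of them fails.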

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

Hand-crafted with ❤️ by Erickson Lopes
