Private, local-only desktop app for audio transcription with speaker diarization and AI-powered summarization.
Diaricat is a Windows desktop application that transcribes audio/video files, identifies who said what (speaker diarization), and generates AI-powered summaries — all running locally on your machine. No data ever leaves your computer.
Diaricat is part of a broader vision for local-first AI systems focused on privacy, autonomy, and offline intelligence.
- Accurate transcription powered by Faster Whisper (large-v3 model with CUDA acceleration)
- Speaker diarization using SpeechBrain ECAPA-TDNN embeddings with custom agglomerative clustering
- AI correction & summarization with local LLM inference via llama.cpp, running GGUF models such as Qwen 2.5 7B Instruct (Q4_K_M)
- 100% offline & private — no API keys, no cloud services, no data upload
- Bilingual UI — Spanish and English with one-click toggle
- Multiple export formats — TXT, SRT, DOCX, PDF, JSON
- Modern dark UI — glassmorphism design built with React + Tailwind CSS
Diaricat follows a design language I call Purple Space Glass.
It blends glassmorphism, deep-space aesthetics and soft neon reflections to create interfaces that feel both modern and fluid — almost like interacting with an intelligent system rather than a static tool.
The goal is not just visual appeal, but to make AI systems feel:
- responsive
- ambient
- alive, without being intrusive
This design direction is part of a broader vision where local AI systems are not only powerful and private, but also intuitive and pleasant to use.
```
+----------------------------------------------------+
|                   Desktop Shell                    |
|              (pywebview + .NET/Edge)               |
+----------------------------------------------------+
|                   Frontend (UI)                    |
|        React · TypeScript · Vite · Tailwind        |
|            Radix UI · Lucide · shadcn/ui           |
+----------------------------------------------------+
|                   REST API Layer                   |
|            FastAPI · Uvicorn · Pydantic            |
+--------------+-------------+-----------+-----------+
| Transcription| Diarization | LLM Post- |  Export   |
|   Service    |   Service   |  process  |  Service  |
|  (Whisper)   |(SpeechBrain)|(llama.cpp)|(DOCX/PDF) |
+--------------+-------------+-----------+-----------+
|               Pipeline Orchestrator                |
|       Job queue · Progress · Cancellation          |
+----------------------------------------------------+
```
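The orchestrator layer in the diagram above is, at its core, a staged job runner with progress reporting and cooperative cancellation. A minimal sketch of that pattern (class and method names are hypothetical, not Diaricat's actual API):

```python
import threading
from typing import Callable

class CancelledError(Exception):
    """Raised when a job is cancelled between stages."""

class PipelineJob:
    """Run a fixed sequence of named stages over a payload,
    reporting fractional progress and honoring cancellation."""

    def __init__(self, stages: list[tuple[str, Callable]],
                 on_progress: Callable[[str, float], None]):
        self.stages = stages
        self.on_progress = on_progress
        self._cancel = threading.Event()

    def cancel(self) -> None:
        self._cancel.set()

    def run(self, data):
        total = len(self.stages)
        for i, (name, fn) in enumerate(self.stages):
            if self._cancel.is_set():
                raise CancelledError(name)
            self.on_progress(name, i / total)
            data = fn(data)   # each stage transforms the payload
        self.on_progress("done", 1.0)
        return data
```

Cancellation is cooperative: it takes effect at stage boundaries, which is why long stages (transcription, diarization) would internally check the same flag in a real implementation.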
| Component | Technology |
|---|---|
| Desktop shell | pywebview 5.x (Edge WebView2) |
| Frontend | React 18 + TypeScript + Vite + Tailwind CSS |
| Backend API | FastAPI + Uvicorn |
| ASR (Speech-to-Text) | Faster Whisper (CTranslate2 backend) |
| Speaker Diarization | SpeechBrain ECAPA-TDNN + custom clustering |
| LLM Inference | llama-cpp-python (GGUF models) |
| Packaging | PyInstaller (onedir mode) |
|  | Minimum | Recommended |
|---|---|---|
| OS | Windows 10 64-bit | Windows 11 |
| RAM | 8 GB | 16 GB+ |
| GPU | — | NVIDIA GPU with 6+ GB VRAM |
| Disk | 5 GB (app + models) | 10 GB |
| Runtime | Edge WebView2 | Edge WebView2 |
- Python 3.11+ (tested with 3.14)
- Node.js 18+ (for frontend)
- NVIDIA CUDA Toolkit 12.x (for GPU acceleration)
- Visual Studio Build Tools 2022 (for building llama-cpp-python)
- Download the latest release from Releases
- Extract the `Diarcat/` folder
- (Optional) Place a GGUF model in `workspace/models/` for AI summaries
- Run `Diarcat.exe`
```bash
# Clone the repository
git clone https://github.com/nia-huck/Diaricat.git
cd Diaricat

# Create Python virtual environment
python -m venv .venv
.venv\Scripts\activate

# Install Python dependencies
pip install -e ".[dev]"

# Install torch with CUDA (optional, for GPU acceleration)
pip install torch torchaudio --index-url https://download.pytorch.org/whl/cu128

# Install faster-whisper and speechbrain
pip install faster-whisper speechbrain

# Install llama-cpp-python (requires Visual Studio Build Tools)
pip install llama-cpp-python

# Install frontend dependencies
cd frontend
npm install

# Start frontend dev server
npm run dev

# In another terminal, start the backend
cd ..
python -m diaricat api --host 127.0.0.1 --port 8765
```

```bash
# Build the frontend
cd frontend
npm run build
cd ..

# Run PyInstaller
python -m PyInstaller packaging/diaricat.spec --distpath dist --workpath build --noconfirm

# Output: dist/Diarcat/Diarcat.exe
```

The build uses onedir mode for fast startup (~1 second vs minutes for onefile).
Diaricat uses llama.cpp as the local inference engine for transcript correction and summarization, running GGUF-compatible models. Without a configured model, it falls back to rule-based processing.
- Qwen 2.5 7B Instruct (Q4_K_M) — configured by default for the local post-processing pipeline
- Default model path: `models/qwen2.5-7b-instruct-q4_k_m.gguf`
- Context size: 4096
- Inference engine: llama.cpp
| Model | Size | Min RAM | Quality |
|---|---|---|---|
| Qwen 2.5 1.5B (Q4_K_M) | ~1 GB | 4 GB | Basic |
| Qwen 2.5 3B (Q4_K_M) | ~2 GB | 6 GB | Good |
| Qwen 2.5 7B (Q4_K_M) | ~4.7 GB | 10 GB | Best |
Place the `.gguf` file in `workspace/models/`; the model path and runtime parameters can then be configured from the application Settings screen.
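The model-or-fallback behavior described above might look roughly like this (the function name, prompt, and cleanup rules are illustrative; `Llama` and `create_chat_completion` are llama-cpp-python's API):

```python
import os
import re

def postprocess(text: str,
                model_path: str = "workspace/models/qwen2.5-7b-instruct-q4_k_m.gguf",
                n_ctx: int = 4096) -> str:
    """Correct a transcript with the local LLM if a GGUF model is
    present; otherwise fall back to simple rule-based cleanup."""
    if os.path.exists(model_path):
        # llama-cpp-python loads and runs the GGUF model entirely locally.
        from llama_cpp import Llama
        llm = Llama(model_path=model_path, n_ctx=n_ctx, verbose=False)
        out = llm.create_chat_completion(messages=[
            {"role": "system",
             "content": "Fix transcription errors. Return only the corrected text."},
            {"role": "user", "content": text},
        ])
        return out["choices"][0]["message"]["content"]
    # Rule-based fallback: collapse whitespace and immediate word repetitions.
    text = re.sub(r"\s+", " ", text).strip()
    text = re.sub(r"\b(\w+)( \1\b)+", r"\1", text, flags=re.IGNORECASE)
    return text
```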
1. Validation — Verify source file exists and is a supported format
2. Audio normalization — Extract audio, convert to 16 kHz mono WAV via FFmpeg
3. Transcription — Speech-to-text with Faster Whisper (chunked for long audio)
4. Speaker diarization — Identify speakers using ECAPA-TDNN embeddings + agglomerative clustering
5. Segment merge — Align ASR segments with speaker turns
6. Correction (optional) — Fix ASR errors using local LLM
7. Summarization (optional) — Generate structured summary with key points, decisions, and topics
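The segment-merge step above can be sketched as a maximum-overlap assignment between ASR segments and diarized speaker turns (the function name and tuple layout are illustrative, not Diaricat's actual alignment code):

```python
def assign_speakers(asr_segments, speaker_turns):
    """Label each ASR segment with the speaker whose turn overlaps it most.

    asr_segments:  [(start, end, text), ...] in seconds
    speaker_turns: [(start, end, speaker), ...] in seconds
    """
    merged = []
    for a_start, a_end, text in asr_segments:
        best, best_overlap = "UNKNOWN", 0.0
        for t_start, t_end, speaker in speaker_turns:
            # Positive overlap means the intervals intersect.
            overlap = min(a_end, t_end) - max(a_start, t_start)
            if overlap > best_overlap:
                best, best_overlap = speaker, overlap
        merged.append((a_start, a_end, best, text))
    return merged
```

Real pipelines also have to handle segments that straddle a speaker change; a common refinement is to split the ASR segment at the turn boundary rather than picking a single winner.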
```
Diaricat/
├── src/diaricat/        # Python backend
│   ├── api/             # FastAPI routes and middleware
│   ├── core/            # Orchestrator, job queue, alignment
│   ├── services/        # Transcription, diarization, postprocess, export
│   ├── models/          # Pydantic domain and API models
│   ├── utils/           # Device detection, logging, compatibility
│   ├── desktop.py       # pywebview desktop shell
│   ├── main.py          # CLI entry point
│   └── settings.py      # Configuration management
├── frontend/            # React/TypeScript UI
│   └── src/
│       ├── components/  # UI components (screens, ui primitives)
│       ├── context/     # React context (AppContext, I18nContext)
│       ├── lib/         # API client, i18n translations
│       └── types/       # TypeScript type definitions
├── config/              # Default configuration (YAML)
├── packaging/           # PyInstaller spec and runtime hooks
├── scripts/             # Build scripts
├── tests/               # Unit tests
├── vendor/              # Bundled FFmpeg binaries
└── pyproject.toml       # Project metadata and dependencies
```
Settings are stored in config/default.yaml and can be modified through the Settings screen in the app:
| Setting | Default | Description |
|---|---|---|
| `whisper_model` | `large-v3` | Whisper model size |
| `whisper_compute_type` | `float16` | Compute precision (float16/int8) |
| `diarization_profile` | `quality` | Diarization quality (fast/balanced/quality) |
| `llama_model_path` | `models/qwen2.5-7b-instruct-q4_k_m.gguf` | Path to GGUF model |
| `llama_n_ctx` | `4096` | LLM context window size |
| `device_mode` | `auto` | Device selection (auto/cpu/cuda) |
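The `auto` behavior of `device_mode` can be sketched as follows (a hypothetical helper, not Diaricat's actual device-detection code): explicit values pass through, while `auto` prefers CUDA only when a working torch install reports an available GPU.

```python
def resolve_device(device_mode: str = "auto") -> str:
    """Map the device_mode setting to a concrete compute device."""
    if device_mode in ("cpu", "cuda"):
        return device_mode  # explicit choice wins
    try:
        import torch
        if torch.cuda.is_available():
            return "cuda"
    except ImportError:
        pass  # torch absent: CPU-only install
    return "cpu"
```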
Backend: Python 3.14 · FastAPI · Uvicorn · Pydantic · PyYAML · SpeechBrain · Faster Whisper · CTranslate2 · llama-cpp-python · PyInstaller
Frontend: React 18 · TypeScript · Vite · Tailwind CSS · Radix UI · shadcn/ui · Lucide Icons
AI Models: Whisper large-v3 (ASR) · pyannote/speaker-diarization-3.1 + ECAPA-TDNN embeddings (diarization) · Qwen 2.5 7B Instruct Q4_K_M (correction/summary via llama.cpp)
MIT License — see LICENSE file for details.
Built with privacy in mind. Your audio never leaves your machine.
