
Susurrus: Audio Transcription Suite

Susurrus is a professional, modular audio transcription application that leverages various AI models and backends to convert speech to text. Built with a clean architecture, it supports multiple Whisper implementations, speaker diarization, and extensive customization options.

Part of the Crisp ecosystem

| Project | Role |
| --- | --- |
| Susurrus | This repo: Python ASR GUI + CLI with 12 backends |
| CrispASR | C++ ASR engine: 11 backends, ggml inference. Available as a Susurrus backend (auto-downloads if not found). |
| CrisperWeaver | Flutter transcription app powered by CrispASR: desktop + mobile, fully offline |
| CrispEmbed | Text embedding engine (ggml): XLM-R, Qwen3-Embed, Gemma3, dense + sparse + ColBERT |

✨ Features

Core Transcription

  • Multiple Backend Support: mlx-whisper, OpenAI Whisper, faster-whisper, transformers, whisper.cpp, ctranslate2, whisper-jax, insanely-fast-whisper, Voxtral, and CrispASR (11 ggml backends: parakeet, canary, qwen3, granite, voxtral, wav2vec2, etc.; auto-downloads if not installed)
  • Flexible Input: Local files and URLs, including video sources
  • Audio Format Support: MP3, WAV, FLAC, M4A, AAC, OGG, OPUS, WebM, MP4, WMA
  • Language Detection: Automatic or manual language selection
  • Time-based Trimming: Transcribe specific portions of audio
  • Word-level Timestamps: Precise timing information (backend-dependent)
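
Time-based trimming is typically delegated to FFmpeg with a seek offset and a duration; a minimal command-construction sketch (the helper name and the 16 kHz mono output are illustrative assumptions, not Susurrus's actual implementation):

```python
def build_trim_command(audio_path, start_s, end_s=None, out_path="trimmed.wav"):
    """Build an ffmpeg argument list that extracts [start_s, end_s) of audio_path.

    Illustrative only; Susurrus's real trimming code may differ.
    """
    args = ["ffmpeg", "-y"]
    if start_s > 0:
        args += ["-ss", str(start_s)]          # seek before the input: fast seek
    args += ["-i", audio_path]
    if end_s is not None:
        args += ["-t", str(end_s - start_s)]   # -t takes a duration, not an end time
    args += ["-ar", "16000", "-ac", "1", out_path]  # 16 kHz mono, the common ASR input
    return args
```

Running it with, e.g., `subprocess.run(build_trim_command("talk.mp3", 60, 120), check=True)` would extract the second minute of the recording.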

Speaker Diarization

  • Multi-speaker Identification: Automatically detect and label different speakers
  • Language-specific Models: Optimized models for English, German, Chinese, Spanish, Japanese
  • Configurable Parameters: Set min/max speaker counts
  • Multiple Output Formats: TXT, SRT, VTT, JSON with speaker labels
  • PyAnnote.audio Integration: State-of-the-art diarization engine

Voxtral Support (New!)

  • Voxtral Local: On-device inference with Mistral's speech model
  • Voxtral API: Cloud-based inference via Mistral AI API
  • 8 Language Support: EN, FR, ES, DE, IT, PT, PL, NL
  • Long Audio Processing: Automatic chunking for files over 25 minutes
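
Automatic chunking of long files reduces to planning a series of overlapping windows; a sketch of the boundary arithmetic (the 25-minute window matches the limit above, but the overlap value and function name are illustrative):

```python
def plan_chunks(duration_s, chunk_s=25 * 60, overlap_s=10.0):
    """Return (start, end) windows covering duration_s seconds of audio.

    Consecutive windows overlap by overlap_s so words cut at a boundary
    appear in both chunks and can be merged afterwards.
    """
    if duration_s <= chunk_s:
        return [(0.0, duration_s)]
    chunks, start = [], 0.0
    while start < duration_s:
        end = min(start + chunk_s, duration_s)
        chunks.append((start, end))
        if end >= duration_s:
            break
        start = end - overlap_s  # step back to create the overlap
    return chunks
```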

Advanced Features

  • Proxy Support: HTTP/SOCKS5 proxy for network requests
  • Device Selection: Auto-detect or manually choose CPU/GPU/MPS
  • Model Conversion: Automatic CTranslate2 model conversion
  • Progress Tracking: Real-time progress with ETA estimation
  • Settings Persistence: Save your preferences between sessions
  • Dependency Management: Built-in installer for missing components
  • CUDA Diagnostics: Detailed GPU/CUDA troubleshooting tools
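
Device auto-detection boils down to a priority order: honor an explicit choice, else prefer CUDA, then MPS, then CPU. A decision-logic sketch (the function and its inputs are illustrative, not the actual `utils/device_detection.py` API):

```python
def pick_device(requested="auto", cuda_ok=False, mps_ok=False):
    """Resolve a device string from a user request and hardware availability."""
    if requested != "auto":
        return requested  # an explicit user choice always wins
    if cuda_ok:
        return "cuda"
    if mps_ok:
        return "mps"      # Apple Silicon Metal backend
    return "cpu"
```

In practice the `cuda_ok` / `mps_ok` flags would come from `torch.cuda.is_available()` and `torch.backends.mps.is_available()`.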

📦 Installation

Quick Start

# Clone the repository
git clone https://github.com/CrispStrobe/Susurrus.git
cd Susurrus

# Create virtual environment
python -m venv venv

# Activate virtual environment
# Windows:
venv\Scripts\activate
# macOS/Linux:
source venv/bin/activate

# Install dependencies
pip install -r requirements.txt

# Run the application
python main.py
# Or as a module:
python -m susurrus

Prerequisites

  • Python 3.8+
  • FFmpeg (for audio format conversion)
  • Git
  • C++ compiler (for whisper.cpp, optional)
  • CUDA Toolkit (for GPU acceleration, optional)

Platform-Specific Setup

Windows

# Install Chocolatey (if not installed)
Set-ExecutionPolicy Bypass -Scope Process -Force
iex ((New-Object System.Net.WebClient).DownloadString('https://chocolatey.org/install.ps1'))

# Install dependencies
choco install cmake ffmpeg git python

# For GPU support
choco install cuda

macOS

# Install Homebrew (if not installed)
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"

# Install dependencies
brew install ffmpeg cmake python git

# For Apple Silicon optimization
pip install mlx mlx-whisper

Linux (Ubuntu/Debian)

# Install dependencies
sudo apt update
sudo apt install ffmpeg cmake build-essential python3 python3-pip git

# For GPU support
# Follow CUDA installation guide for your distribution

Optional Backend Installation

# MLX (Apple Silicon only)
pip install mlx-whisper

# Faster Whisper (recommended)
pip install faster-whisper

# Transformers
pip install transformers torch torchaudio

# Whisper.cpp (manual build required)
git clone https://github.com/ggerganov/whisper.cpp
cd whisper.cpp && mkdir build && cd build
cmake .. && make

# CTranslate2
pip install ctranslate2

# Whisper-JAX
pip install whisper-jax

# Insanely Fast Whisper
pip install insanely-fast-whisper

# Voxtral (requires dev transformers)
pip uninstall transformers -y
pip install git+https://github.com/huggingface/transformers.git
pip install mistral-common[audio] soundfile

Speaker Diarization Setup

# Install pyannote.audio
pip install pyannote.audio

# Get Hugging Face token
# 1. Sign up at https://huggingface.co
# 2. Create token at https://huggingface.co/settings/tokens
# 3. Accept license at https://huggingface.co/pyannote/speaker-diarization

# Set token (choose one method):
# Method 1: Environment variable
export HF_TOKEN="your_token_here"  # Linux/macOS
setx HF_TOKEN "your_token_here"    # Windows

# Method 2: Config file
mkdir -p ~/.huggingface
echo "your_token_here" > ~/.huggingface/token

# Method 3: Enter in GUI
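
The three methods above imply a lookup order: environment variable first, then the config file, then whatever was typed into the GUI. A resolution sketch (the function name and GUI parameter are hypothetical):

```python
import os
from pathlib import Path

def resolve_hf_token(gui_value=None):
    """Return the Hugging Face token from env, config file, or GUI field, in that order."""
    token = os.environ.get("HF_TOKEN")
    if token:
        return token.strip()
    token_file = Path.home() / ".huggingface" / "token"  # Method 2's location
    if token_file.is_file():
        return token_file.read_text().strip()
    return gui_value  # may be None if the user never entered one
```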

Voxtral API Setup

# Get Mistral API key from https://console.mistral.ai/

# Set API key (choose one method):
# Method 1: Environment variable
export MISTRAL_API_KEY="your_key_here"  # Linux/macOS
setx MISTRAL_API_KEY "your_key_here"    # Windows

# Method 2: Config file
mkdir -p ~/.mistral
echo "your_key_here" > ~/.mistral/api_key

# Method 3: Enter in GUI

🚀 Usage

GUI Application

# Start the application
python main.py

# Or as a module
python -m susurrus

Basic Workflow:

  1. Select Audio Source: Choose file or enter URL
  2. Choose Backend: Select transcription engine
  3. Configure Options: Set language, model, device
  4. Enable Diarization (optional): Identify speakers
  5. Start Transcription: Click "Transcribe"
  6. Save Results: Export to TXT, SRT, or VTT

Command Line Workers

Transcription Worker

python workers/transcribe_worker.py \
  --audio-input audio.mp3 \
  --backend faster-batched \
  --model-id large-v3 \
  --language en \
  --device auto

Diarization Worker

python workers/diarize_worker.py \
  --audio-input audio.mp3 \
  --hf-token YOUR_TOKEN \
  --transcribe \
  --model-id base \
  --backend faster-batched \
  --output-formats txt,srt,vtt

Python API

# Transcription backend example
from workers.transcription.backends import get_backend

backend = get_backend(
    'faster-batched',
    model_id='large-v3',
    device='auto',
    language='en'
)

for start, end, text in backend.transcribe('audio.mp3'):
    print(f"[{start:.2f}s -> {end:.2f}s] {text}")
# Diarization example
from backends.diarization import DiarizationManager

manager = DiarizationManager(hf_token="YOUR_TOKEN")
segments, files = manager.diarize_and_split('audio.mp3')

for segment in segments:
    print(f"{segment['speaker']}: {segment['text']}")
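
Exporting diarized segments to SRT mostly comes down to the `HH:MM:SS,mmm` timestamp format; a self-contained formatting sketch (independent of Susurrus's own writer, segment keys assumed as in the example above):

```python
def srt_time(seconds):
    """Format seconds as the SRT timestamp HH:MM:SS,mmm."""
    ms = int(round(seconds * 1000))
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def to_srt(segments):
    """Render [{'start', 'end', 'speaker', 'text'}] dicts as an SRT string."""
    blocks = []
    for i, seg in enumerate(segments, start=1):
        blocks.append(
            f"{i}\n{srt_time(seg['start'])} --> {srt_time(seg['end'])}\n"
            f"{seg['speaker']}: {seg['text']}\n"
        )
    return "\n".join(blocks)
```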

🧪 Development

Architecture Overview

susurrus/
β”œβ”€β”€ main.py                    # Application entry point
β”œβ”€β”€ config.py                  # Central configuration
β”œβ”€β”€ backends/                  # Transcription & diarization backends
β”‚   β”œβ”€β”€ diarization/          # Speaker diarization module
β”‚   β”‚   β”œβ”€β”€ manager.py        # Diarization orchestration
β”‚   β”‚   └── progress.py       # Enhanced progress tracking
β”‚   └── transcription/        # Transcription backends
β”‚       β”œβ”€β”€ voxtral_local.py  # Voxtral local inference
β”‚       └── voxtral_api.py    # Voxtral API integration
β”œβ”€β”€ gui/                       # User interface components
β”‚   β”œβ”€β”€ main_window.py        # Main application window
β”‚   β”œβ”€β”€ widgets/              # Custom widgets
β”‚   β”‚   β”œβ”€β”€ collapsible_box.py
β”‚   β”‚   β”œβ”€β”€ diarization_settings.py
β”‚   β”‚   β”œβ”€β”€ voxtral_settings.py
β”‚   β”‚   └── advanced_options.py
β”‚   └── dialogs/              # Dialog windows
β”‚       β”œβ”€β”€ dependencies_dialog.py
β”‚       β”œβ”€β”€ installer_dialog.py
β”‚       └── cuda_diagnostics_dialog.py
β”œβ”€β”€ workers/                   # Background processing
β”‚   β”œβ”€β”€ transcription_thread.py    # GUI thread wrapper
β”‚   β”œβ”€β”€ transcribe_worker.py       # Standalone transcription worker
β”‚   β”œβ”€β”€ diarize_worker.py          # Standalone diarization worker
β”‚   └── transcription/             # Transcription backend implementations
β”‚       β”œβ”€β”€ backends/
β”‚       β”‚   β”œβ”€β”€ base.py           # Base backend interface
β”‚       β”‚   β”œβ”€β”€ mlx_backend.py
β”‚       β”‚   β”œβ”€β”€ faster_whisper_backend.py
β”‚       β”‚   β”œβ”€β”€ transformers_backend.py
β”‚       β”‚   β”œβ”€β”€ whisper_cpp_backend.py
β”‚       β”‚   β”œβ”€β”€ ctranslate2_backend.py
β”‚       β”‚   β”œβ”€β”€ whisper_jax_backend.py
β”‚       β”‚   β”œβ”€β”€ insanely_fast_backend.py
β”‚       β”‚   β”œβ”€β”€ openai_whisper_backend.py
β”‚       β”‚   └── voxtral_backend.py
β”‚       └── utils.py
β”œβ”€β”€ utils/                     # Utility modules
β”‚   β”œβ”€β”€ device_detection.py   # CUDA/MPS/CPU detection
β”‚   β”œβ”€β”€ audio_utils.py        # Audio processing utilities
β”‚   β”œβ”€β”€ download_utils.py     # URL downloading
β”‚   β”œβ”€β”€ dependency_check.py   # Dependency verification
β”‚   └── format_utils.py       # Time formatting utilities
β”œβ”€β”€ models/                    # Model configuration
β”‚   └── model_config.py       # Model mappings & utilities
└── scripts/                   # Standalone utility scripts
    β”œβ”€β”€ test_voxtral.py       # Voxtral testing
    └── pyannote_torch26.py   # PyTorch 2.6+ compatibility

Running Tests

# Run all tests
pytest tests/

# Run specific test file
pytest tests/test_backends.py

# Run with coverage
pytest --cov=. --cov-report=html

Code Quality

# Format code
black .

# Lint
flake8 .
pylint susurrus/

# Type checking
mypy .

Adding a New Backend

  1. Create a new file in workers/transcription/backends/
  2. Inherit from TranscriptionBackend
  3. Implement required methods:
    class MyBackend(TranscriptionBackend):
        def transcribe(self, audio_path):
            # Yield (start, end, text) tuples
            pass
        
        def preprocess_audio(self, audio_path):
            # Optional preprocessing
            return audio_path
        
        def cleanup(self):
            # Optional cleanup
            pass
  4. Register in workers/transcription/backends/__init__.py
  5. Add to config.py BACKEND_MODEL_MAP
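
Steps 1–5 amount to the classic registry pattern; a self-contained sketch of its shape (the registry dict, decorator, and `get_backend` signature are illustrative, not the actual `base.py` / `__init__.py` API):

```python
class TranscriptionBackend:
    """Minimal stand-in for the base interface described above."""
    def transcribe(self, audio_path):
        raise NotImplementedError
    def preprocess_audio(self, audio_path):
        return audio_path  # optional hook, identity by default
    def cleanup(self):
        pass

BACKENDS = {}  # name -> backend class; stands in for backends/__init__.py

def register_backend(name):
    def deco(cls):
        BACKENDS[name] = cls
        return cls
    return deco

@register_backend("my-backend")
class MyBackend(TranscriptionBackend):
    def transcribe(self, audio_path):
        # A real backend would run inference here and yield (start, end, text).
        yield (0.0, 1.0, "hello")

def get_backend(name, **kwargs):
    return BACKENDS[name](**kwargs)
```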

🔧 Configuration

Settings Location

  • Windows: %APPDATA%\Susurrus\AudioTranscription.ini
  • macOS: ~/Library/Preferences/com.Susurrus.AudioTranscription.plist
  • Linux: ~/.config/Susurrus/AudioTranscription.conf

Environment Variables

  • HF_TOKEN: Hugging Face API token (diarization)
  • MISTRAL_API_KEY: Mistral AI API key (Voxtral)
  • PYTORCH_MPS_HIGH_WATERMARK_RATIO: MPS memory optimization
  • CUDA_VISIBLE_DEVICES: GPU selection

📊 Performance Tips

GPU Acceleration

# Install PyTorch with CUDA support
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

# Verify CUDA is available
python -c "import torch; print(torch.cuda.is_available())"

Apple Silicon Optimization

# Use MLX backend for best performance
pip install mlx-whisper

# Or use MPS device with other backends
# Will auto-detect in GUI

Memory Management

  • Use smaller models for limited RAM
  • Enable chunking for long audio files
  • Use faster-batched backend with appropriate batch size
  • Close other applications during processing
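
As a rough rule of thumb, the "smaller models for limited RAM" advice can be keyed to available memory; a purely illustrative heuristic (the thresholds are ballpark figures, not measured requirements):

```python
def suggest_model(free_ram_gb):
    """Suggest a Whisper model size for the available RAM (illustrative thresholds only)."""
    for threshold_gb, model in [(10, "large-v3"), (5, "medium"), (2, "small"), (1, "base")]:
        if free_ram_gb >= threshold_gb:
            return model
    return "tiny"
```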

πŸ› Troubleshooting

Common Issues

"No module named 'X'"

pip install X

FFmpeg not found

# Verify installation
ffmpeg -version

# Add to PATH if needed (Windows)
setx PATH "%PATH%;C:\path\to\ffmpeg\bin"

CUDA errors

# Check CUDA availability
python -c "import torch; print(torch.cuda.is_available())"

# Use Tools > CUDA Diagnostics in GUI for detailed info

Diarization authentication fails

# Verify token
python -c "from huggingface_hub import HfApi; HfApi().whoami(token='YOUR_TOKEN')"

# Accept license
# Visit: https://huggingface.co/pyannote/speaker-diarization

PyTorch 2.6+ compatibility issues

# Run the compatibility script
python scripts/pyannote_torch26.py

Development Setup

# Fork and clone
git clone https://github.com/YOUR_USERNAME/Susurrus.git
cd Susurrus

# Create feature branch
git checkout -b feature-name

# Install dev dependencies
pip install -r requirements-dev.txt

# Make changes and test
pytest tests/

# Submit PR

πŸ™ Acknowledgements

CLI Usage

Susurrus also works headless without the GUI:

# List available backends
python cli.py --list-backends

# Transcribe with CrispASR (auto-downloads binary if not found)
python cli.py -b crispasr -m parakeet-tdt-0.6b-v3.gguf -f audio.wav

# Transcribe with faster-whisper
python cli.py -b faster-sequenced -m large-v3 -f audio.wav

# CrispASR with specific sub-backend + VAD
python cli.py -b crispasr -m model.gguf -f audio.wav --vad --split-on-punct

License

MIT (see LICENSE).

Model licenses vary. Most ASR models (Whisper, Parakeet, Canary, Voxtral, Qwen3-ASR) are permissively licensed (MIT/Apache/CC-BY). Pyannote speaker-diarization-3.1 is MIT. Check the individual model card on Hugging Face for the exact terms before commercial deployment.
