
Vox

Local-first voice AI framework. Speech-to-text, text-to-speech, speaker diarization, and voice chat — all running on your machine.


No API keys. No cloud. No data leaving your machine.

Mic --> VAD (Silero) --> STT (Whisper/Sherpa) --> Speaker ID --> Your Code --> TTS (Kokoro/Piper/Qwen3) --> Speaker

Highlights

  • Speech-to-Text — Whisper, Distil-Whisper (6x faster), Sherpa-ONNX (streaming + multilingual)
  • Text-to-Speech — Kokoro (57 voices, 9 languages), Piper (multilingual), Qwen3 (state-of-the-art), Pocket (pure Rust), Chatterbox (voice cloning)
  • Speaker Diarization (experimental) — Know who is speaking. Real-time speaker identification with voice embeddings, auto-enrollment, and a persistent speaker database
  • Live Talk (experimental) — Barge-in voice chat. Talk to an LLM and interrupt it mid-sentence, just like a real conversation
  • Web Interface — 5-tab browser UI: Listen, Speak, Chat, Live Talk, Dashboard
  • HTTP + WebSocket API — REST endpoints and 4 WebSocket channels for real-time streaming
  • Fully Local — Everything runs on-device. Works offline after model download
  • Pluggable Backends — Swap any component via Rust traits

Quick Start

# Install with CLI + Kokoro TTS
cargo install --git https://github.com/mrtozner/vox --features cli,kokoro

# Transcribe speech from your microphone
vox listen

# Text-to-speech with voice selection
vox speak "Hello from Vox!" --voice af_heart

# Voice chat with Ollama
vox chat --llm llama3.2

# Start the web server
cargo install --git https://github.com/mrtozner/vox --features cli,server,kokoro,piper
vox serve --port 3000
# Open http://localhost:3000

Models auto-download on first run. Pass -y to skip prompts.


Web Interface

Start the server and open your browser:

vox serve --port 3000
  • Listen — Real-time mic transcription with optional speaker diarization (experimental). Toggle "Identify speakers" to see who is talking, with color-coded transcript entries and a speaker sidebar for renaming/removing speakers.
  • Speak — Text-to-speech synthesis. Pick a voice from the dropdown (57 Kokoro voices, Piper voices, etc.), type text, and hear it.
  • Chat — Voice chat with any Ollama LLM. Speak a question, get a spoken answer.
  • Live Talk — Experimental barge-in voice chat. Full-duplex — interrupt the LLM mid-sentence by speaking. Select voice and LLM model from dropdowns.
  • Dashboard — Server stats, loaded backends, model cache info, and environment capabilities.

No separate frontend build. The UI is a single embedded HTML file served by the Rust binary.


CLI

vox listen                                 # transcribe from microphone (Whisper)
vox listen --model base.en                 # use a larger Whisper model
vox listen --stt-backend sherpa            # Sherpa SenseVoice (multilingual)
vox listen --stt-backend sherpa-streaming  # real-time streaming transcription
vox listen --stt-backend distil-whisper    # Distil-Whisper (6x faster)
vox speak "Hello from Vox!"               # text-to-speech
vox speak "Hello" --voice af_heart         # Kokoro with voice selection
vox speak "Hello" --backend qwen3          # Qwen3 state-of-the-art TTS
vox speak "Hallo" --backend piper          # Piper multilingual TTS
vox speak "Hi" --backend pocket            # pure Rust TTS (edge-ready)
vox speak "Hi" --backend chatterbox --voice ref.wav  # voice cloning
vox chat --llm llama3.2                    # voice chat with Ollama
vox serve --port 3000                      # start web + API server
vox test                                   # audio I/O diagnostics
vox benchmark                              # benchmark STT/TTS performance
vox config                                 # interactive setup wizard
vox models list                            # show downloaded models
vox models download kokoro                 # download a specific model
vox models path                            # show models directory

API

REST Endpoints

# Transcribe audio
curl -X POST http://localhost:3000/v1/transcribe \
  -H "Content-Type: audio/wav" \
  --data-binary @audio.wav

# Synthesize speech
curl -X POST http://localhost:3000/v1/synthesize \
  -H "Content-Type: application/json" \
  -d '{"text": "Hello from Vox!", "voice": "af_heart"}'

# List available voices
curl http://localhost:3000/v1/voices

# List Ollama models
curl http://localhost:3000/v1/ollama-models

# Environment capabilities
curl http://localhost:3000/v1/capabilities

# Server stats
curl http://localhost:3000/v1/stats

# Health check
curl http://localhost:3000/health
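The same endpoints can be called from Python with only the standard library. A minimal sketch — the helper names are ours, and we assume the server from `vox serve --port 3000` is running and that `/v1/synthesize` returns raw audio bytes:

```python
import json
import urllib.request

BASE = "http://localhost:3000"

def synthesize_request(text, voice="af_heart"):
    """Build (but do not send) a POST request for /v1/synthesize."""
    payload = json.dumps({"text": text, "voice": voice}).encode("utf-8")
    return urllib.request.Request(
        f"{BASE}/v1/synthesize",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def transcribe_request(wav_bytes):
    """Build a POST request for /v1/transcribe with raw WAV bytes."""
    return urllib.request.Request(
        f"{BASE}/v1/transcribe",
        data=wav_bytes,
        headers={"Content-Type": "audio/wav"},
        method="POST",
    )

if __name__ == "__main__":
    # Actually sending the request requires a running server:
    req = synthesize_request("Hello from Vox!")
    audio = urllib.request.urlopen(req).read()
    with open("out.audio", "wb") as f:
        f.write(audio)
```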

WebSocket Endpoints

/v1/listen — Real-time speech-to-text

Send PCM f32 LE frames at 16kHz mono. Receive JSON events:

{"type": "speech_start"}
{"type": "partial", "text": "hello", "is_final": false}
{"type": "transcript", "text": "hello world", "duration_ms": 1200, "speaker": "speaker_1"}
{"type": "speech_end"}

When diarization is enabled (server feature), transcripts include a speaker field identifying the speaker.
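A client only needs to decode these JSON events and dispatch on the type field. A minimal handler sketch (the function name is ours — pair it with any WebSocket client that sends 16 kHz mono f32 LE frames):

```python
import json

def handle_event(raw, on_partial=print, on_transcript=print):
    """Decode one /v1/listen JSON event and dispatch on its "type" field."""
    event = json.loads(raw)
    kind = event["type"]
    if kind == "partial":
        on_partial(event["text"])
    elif kind == "transcript":
        # "speaker" is only present when diarization is enabled.
        speaker = event.get("speaker", "unknown")
        on_transcript(f'[{speaker}] {event["text"]}')
    # "speech_start" / "speech_end" can drive UI state; ignored here.
    return kind
```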

/v1/speak — Streaming text-to-speech

Send JSON, receive chunked audio for gapless playback:

{"text": "Hello world", "voice": "af_heart"}

/v1/converse — Continuous voice chat

Combines VAD + STT + LLM + TTS in a single WebSocket. Speak, get a spoken LLM response, repeat.

/v1/live-talk — Barge-in voice chat (experimental)

Full-duplex conversation. You can interrupt the LLM mid-response by speaking. Requires an Ollama instance.


Rust Library

use vox::{Vox, SileroVad, WhisperBackend};

#[tokio::main]
async fn main() -> anyhow::Result<()> {
    let vox = Vox::builder()
        .vad(SileroVad::new("silero_vad.onnx")?)
        .stt(WhisperBackend::from_model("ggml-tiny.en.bin")?)
        .on_utterance(|result, _ctx| {
            println!("{}", result.text);
        })
        .build()?;

    vox.listen().await?;
    Ok(())
}

Python Bindings

cd python && pip install maturin
maturin develop --features whisper,silero,kokoro
from vox_voice import Vox, SileroVad, WhisperStt

vox = Vox(vad=SileroVad(), stt=WhisperStt("tiny.en"))
for result in vox.listen():
    print(result.text)

Architecture

+--------+     +-----+     +-----+     +------------+     +-----+     +---------+
|  Mic   | --> | VAD | --> | STT | --> | Speaker ID | --> |  CB | --> |   TTS   |
| (cpal) |     |     |     |     |     | (optional) |     |     |     |         |
+--------+     +-----+     +-----+     +------------+     +-----+     +---------+
                  |                         |                |
            Silero ONNX              ECAPA-TDNN         Your callback
            v5 model                voice embeddings     gets text +
                                    + SQLite DB          VoxContext

Audio is captured via cpal, resampled to 16kHz mono, and fed frame-by-frame to VAD. On speech end, the utterance goes to STT. If diarization is enabled, the audio is also processed by a speaker encoder that extracts a 512-dimensional voice embedding, compares it against known speakers, and assigns an identity. Your callback receives the text, speaker label, and a VoxContext for optional TTS reply.
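The speaker-identification step amounts to nearest-neighbor search over embeddings. A toy sketch of that match-or-enroll logic using cosine similarity — the threshold, helper names, and 3-dim vectors are illustrative only, not Vox's actual values (Vox uses 512-dim ECAPA-TDNN embeddings):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def identify(embedding, known, threshold=0.6):
    """Return the best-matching speaker label, or auto-enroll a new one.

    `known` maps label -> reference embedding, standing in for the
    persistent speaker database."""
    best_label, best_score = None, threshold
    for label, ref in known.items():
        score = cosine(embedding, ref)
        if score > best_score:
            best_label, best_score = label, score
    if best_label is None:
        best_label = f"speaker_{len(known) + 1}"
        known[best_label] = embedding  # auto-enrollment of unseen voices
    return best_label
```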


Models

Component    Model                       Size         Notes
VAD          Silero VAD v5               2 MB         Speech activity detection
STT          Whisper tiny.en             75 MB        Fast, English
             Whisper base.en             142 MB       Better accuracy
             Whisper small.en            466 MB       High accuracy
             Whisper medium.en           1.5 GB       Highest accuracy
             Distil-Whisper tiny.en      75 MB        6x faster than Whisper
             Sherpa SenseVoice           230 MB       Multilingual (zh/en/ja/ko/yue)
             Sherpa Streaming Zipformer  27 MB        Real-time partial results
TTS          Kokoro                      310 MB       57 voices, 9 languages, 24 kHz
             Kokoro INT8                 77 MB        Quantized, 4x smaller
             Qwen3 0.6B                  1.2 GB       20 voices, 10 languages, streaming
             Qwen3 1.7B                  3.4 GB       Higher quality, streaming
             Piper                       63 MB/voice  Multilingual, fast CPU synthesis
             Pocket                      82 MB        Pure Rust, edge/embedded
             Chatterbox                  350 MB       Voice cloning from reference audio
Diarization  ECAPA-TDNN                  ~30 MB       512-dim speaker embeddings

vox models download silero-vad           # 2 MB
vox models download whisper-tiny.en      # 75 MB
vox models download kokoro               # 310 MB
vox models download kokoro-voices        # 27 MB
vox models download piper-en-us          # 63 MB
vox models download qwen3-0.6b           # 1.2 GB

Model Storage

Models are stored in platform-specific directories:

  • macOS: ~/Library/Application Support/vox/models
  • Linux: ~/.local/share/vox/models
  • Windows: {FOLDERPATH}/vox/models

Run vox models path to find yours. Use vox models download <name> --force to re-download corrupted files.
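For scripting, the directory list above can be mirrored with a small helper — this function is our own, not part of the Vox API, and the Windows branch is omitted because the README leaves its base folder unspecified:

```python
import sys
from pathlib import Path

def vox_models_dir(platform=sys.platform):
    """Default Vox models directory for a `sys.platform` string."""
    home = Path.home()
    if platform == "darwin":
        return home / "Library" / "Application Support" / "vox" / "models"
    if platform.startswith("linux"):
        return home / ".local" / "share" / "vox" / "models"
    # Windows base folder not specified in the README; see `vox models path`.
    raise ValueError(f"unsupported platform: {platform}")
```

When in doubt, `vox models path` remains the authoritative answer.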


Feature Flags

Flag               Default  Description
cli                no       CLI binary (vox listen, vox speak, vox chat, vox serve)
server             no       HTTP/WebSocket server with web UI (includes diarization)
whisper            yes      Whisper STT via whisper-rs
distil-whisper     no       Distil-Whisper STT (6x faster)
silero             yes      Silero VAD via ONNX Runtime
sherpa             no       Sherpa-ONNX STT (SenseVoice, Zipformer, streaming)
kokoro             no       Kokoro TTS (57 voices, 9 languages)
qwen3              no       Qwen3 TTS (state-of-the-art, streaming)
qwen3-metal        no       Qwen3 with Apple Metal GPU
qwen3-cuda         no       Qwen3 with NVIDIA CUDA GPU
piper              no       Piper TTS (multilingual, fast CPU)
pocket             no       Pocket TTS (pure Rust, edge-ready)
pocket-metal       no       Pocket with Apple Metal GPU
chatterbox         no       Chatterbox TTS (voice cloning)
chatterbox-coreml  no       Chatterbox with CoreML (macOS)
diarization        no       Speaker identification (auto-enabled by server)
tts                no       Audio playback for TTS output

Common Feature Combinations

# Minimal: just transcription
cargo install --git https://github.com/mrtozner/vox --features cli

# Full server with Kokoro + Piper voices
cargo install --git https://github.com/mrtozner/vox --features cli,server,kokoro,piper

# Server with Qwen3 on Apple Silicon
cargo install --git https://github.com/mrtozner/vox --features cli,server,qwen3-metal

# Server with Sherpa streaming STT + Kokoro TTS
cargo install --git https://github.com/mrtozner/vox --features cli,server,sherpa,kokoro

# Edge deployment (smallest footprint)
cargo install --git https://github.com/mrtozner/vox --features cli,distil-whisper,pocket

Performance

Measured on an Apple M1 MacBook Pro unless noted otherwise:

Metric                              Value
VAD frame latency                   ~1 ms per 32 ms frame
Whisper STT (3 s utterance)         ~200 ms
Streaming STT (per chunk)           <1 ms (0.03x real-time)
End-to-end (speech end to text)     ~250 ms
Kokoro TTS ("Hello world")          ~300 ms
Piper TTS ("Hello world")           ~200 ms
Qwen3 0.6B TTS (streaming, M4 Pro)  0.96 RTF
Qwen3 1.7B TTS (streaming, M4 Pro)  1.14 RTF
Memory (idle pipeline)              ~150 MB
Memory (Kokoro loaded)              ~400 MB
Memory (Qwen3 0.6B loaded)          ~2.6 GB
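RTF here is the real-time factor: wall-clock synthesis time divided by the duration of the audio produced, so values under 1.0 mean audio is generated faster than it plays back (0.96 RTF is just under real time). A one-line helper for computing it — ours, not part of Vox:

```python
def rtf(synthesis_seconds, audio_seconds):
    """Real-time factor: synthesis wall-clock time / audio duration.

    RTF < 1.0 means the model outruns playback."""
    return synthesis_seconds / audio_seconds
```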

Raspberry Pi

Vox runs on Raspberry Pi 4+. Recommended setup:

cargo install --git https://github.com/mrtozner/vox --features cli,distil-whisper,pocket
vox models download silero-vad
vox models download distil-whisper-tiny.en-int8
vox listen --stt-backend distil-whisper --model tiny.en-int8

Config                       RTF        Memory
Distil-Whisper tiny.en-int8  0.3x       120 MB
Distil-Whisper base.en-int8  0.8x       180 MB
Pocket TTS INT8              real-time  ~50 MB

Total footprint: ~350 MB on a 1 GB Pi.


Examples

cargo run --example simple_listen --features whisper,silero
cargo run --example voice_assistant --features whisper,silero,kokoro
cargo run --example diarization_demo --features whisper,silero,diarization
cargo run --example live_diarization --features whisper,silero,diarization
cargo run --example tts_speak --features kokoro
cargo run --example piper_speak --features piper
cargo run --example pocket_speak --features pocket
cargo run --example chatterbox_speak --features chatterbox
cargo run --example test_streaming --features qwen3

Platform Support

Platform               Status
macOS (Apple Silicon)  Tested
macOS (Intel)          Tested
Linux (x86_64)         CI tested
Windows (x86_64)       CI tested

Contributing

  1. Fork the repository
  2. Create a feature branch (git checkout -b feat/my-feature)
  3. Make your changes and add tests
  4. Run cargo test and cargo clippy
  5. Submit a pull request

License

MIT OR Apache-2.0
