NekoSpeak is a high-performance, on-device Text-to-Speech (TTS) engine for Android, capable of running 100% offline with low latency. It bridges the gap between modern AI voice synthesis and the standard Android TTS API.
This project was born from two key inspirations:
- Accessibility and Custom Voices - While watching Ahren Belisle, a non-verbal stand-up comedian with cerebral palsy who uses a text-to-speech app on his phone to deliver his comedy routines, I was struck by how powerful TTS technology can be as a voice for those who need it. It made me wonder how high-quality, expressive custom voices could be a life-changing quality-of-life improvement for many people.
- Natural Audiobook Reading - I wanted to read books from my MoonReader app and have them sound somewhat natural without relying on cloud-based services.
The first and foremost consideration was fully offline functionality - no internet required, no data sent anywhere, complete privacy.
Future Plans: Newer models like Qwen 3-TTS are very promising, and I plan to experiment with quantized ONNX versions if they become available. The goal is always to maintain offline-first functionality.
- Multi-Engine Support:
  - Pocket-TTS ✅ RECOMMENDED: Zero-shot voice cloning with celebrity voices and custom voice enrollment. Best quality and most natural sounding!
  - Piper: Fast, efficient, and multilingual. Supports hundreds of community voices (English, Tamil, Spanish, etc.).
- Voice Cloning: Record your own voice or upload audio to create custom TTS voices (Pocket-TTS).
- Celebrity Voices: Download and use celebrity voice profiles (Oprah Winfrey, Greta Thunberg, and more).
- Privacy First: All processing happens 100% on-device. No data is ever sent to the cloud.
- System-Wide Integration: Works with any Android app that supports TTS (MoonReader, @Voice, etc.).
- Advanced Voice Management:
  - Cloud Voice Store: Browse and download hundreds of Piper voices directly within the app.
  - Quality Filters: Filter voices by quality (x_low to high).
  - Persistence: Remembers your preferred voice and speed settings.
What's included? Only Piper with Amy Low voice (~63MB) is bundled for instant offline use. Other engines (Kokoro, Kitten TTS, Pocket-TTS) are downloaded on-demand when you select them. We recommend trying Pocket-TTS for the best quality experience!
| Engine | Model | Size | Quality | Availability |
|---|---|---|---|---|
| Piper | Amy Low | ~63MB | Good | ✅ Bundled (instant) |
| Pocket-TTS | Full Model | ~176MB | Excellent | 🔥 On-demand. Recommended! |
| Kokoro | v1.0 | ~115MB | Excellent | 🔥 On-demand |
| Kitten TTS | Nano | ~23MB | Fair | 🔥 On-demand |
| Piper | Community Voices | Varies | Varies | 🔥 100+ downloadable voices |
Screenshots: Onboarding, Voice Selection, Settings, Voice Downloader, and System Selection.
v1.4.2 is now available! (voice cloning stability + in-app diagnostics)
Why is the APK size large? NekoSpeak comes pre-packaged with the Piper engine and Amy Low voice to ensure 100% offline functionality right out of the box. The Pocket-TTS model is downloaded separately on first use.
- Universal (135 MB): Works on all devices.
- arm64-v8a (88 MB): Optimized for modern devices (Pixel, Samsung S-series).
- armeabi-v7a (82 MB): Optimized for older/low-end devices.
```
.
├── app
│   ├── src
│   │   └── main
│   │       ├── java/com/nekospeak/tts (Kotlin Source)
│   │       ├── cpp/ (Native C++ / JNI)
│   │       ├── assets/ (Bundled Models)
│   │       └── res/ (UI Resources)
│   └── build.gradle.kts (App Build Config)
├── gradle (Gradle Wrapper)
├── build.gradle.kts (Root Build Config)
└── README.md (Documentation)
```
For a detailed architectural breakdown, component analysis, system integration diagrams, and ONNX implementation details, please refer to the Technical Deep Dive.
For architecture decision records (ADRs) documenting major technical decisions, see the docs/adr directory.
GPU/NPU Acceleration: We investigated NNAPI and Qualcomm QNN for hardware acceleration. Due to model architecture constraints (dynamic shapes, unsupported ops), these were not viable. See ADR-001 and ADR-002 for details. Current performance on CPU with multi-threading is optimized and works well across devices.
NekoSpeak features a novel adaptive streaming architecture that automatically optimizes audio generation for any device. This is particularly important for the Pocket-TTS engine, which uses large neural network models that may run slower than real-time on some devices.
On-device neural TTS faces a fundamental problem: generation speed varies by device. A flagship phone might generate audio faster than real-time, while a budget device might run at 50% speed. Traditional approaches either:
- Buffer everything first (long wait times)
- Stream immediately (choppy audio on slower devices)
```mermaid
flowchart TB
    subgraph Generator["Generator Coroutine"]
        G1["Flow LM Main ~100ms"]
        G2["Flow Matching ~3ms"]
        G1 --> G2
    end
    subgraph Tracker["Performance Tracker"]
        T1["Measure Frame Time"]
        T2["Calculate Ratio"]
        T3["Auto-Tune Buffers"]
        T1 --> T2 --> T3
    end
    subgraph Decoder["Decoder Coroutine"]
        D1["Wait for Buffer"]
        D2["Decode Chunks"]
        D3["Stream Audio"]
        D1 --> D2 --> D3
    end
    G2 -->|"Channel"| D1
    G2 --> T1
    T3 -.->|"Adaptive"| D1
```
- Parallel Coroutines: Generator and decoder run concurrently using Kotlin coroutines with a channel buffer
- Real-Time Performance Measurement: Tracks generation time for each frame and calculates the ratio vs playback speed
- Automatic Buffer Tuning: Adjusts buffer sizes based on measured device performance
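The generator/decoder split can be sketched with Kotlin coroutines and a bounded `Channel`. This is an illustrative sketch, not the actual NekoSpeak source: the function name `runPipeline`, the frame size, and the buffer capacity are all assumptions; in the real app the generator would run neural inference and the decoder would write PCM to `AudioTrack`.

```kotlin
import kotlinx.coroutines.*
import kotlinx.coroutines.channels.Channel

// Illustrative sketch of a generator/decoder pipeline (not the app's
// actual code). A bounded channel acts as the adaptive buffer between
// the two coroutines: send() suspends when the buffer is full, and the
// decoder's for-loop suspends when it is empty.
suspend fun runPipeline(frameCount: Int, bufferFrames: Int): Int = coroutineScope {
    val frames = Channel<FloatArray>(capacity = bufferFrames)

    // Generator coroutine: one synthesized frame per iteration
    // (the real engine would run Flow-LM + flow-matching inference here).
    launch(Dispatchers.Default) {
        repeat(frameCount) {
            frames.send(FloatArray(1280)) // ~80ms of audio at an assumed rate
        }
        frames.close()
    }

    // Decoder loop: drains frames as they arrive; the real app would
    // decode each chunk and stream it to the audio sink.
    var decoded = 0
    for (frame in frames) decoded++
    decoded
}

fun main() = runBlocking {
    println(runPipeline(frameCount = 20, bufferFrames = 8))
}
```

Because the channel is bounded, a fast generator naturally throttles itself instead of filling memory, while a slow generator simply leaves the decoder waiting, which is what the buffer tuning below is designed to hide.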
```mermaid
flowchart LR
    subgraph Measurement["Device Performance"]
        M1["Frame 0-10: Measure avg time"]
        M2["Calculate: ratio = avg / 80ms"]
    end
    subgraph Tuning["Auto-Tuning"]
        T1{"Ratio?"}
        T2["≤1.0: Fast Device<br/>buffer=8, threshold=3"]
        T3["≤1.5: Medium<br/>buffer=15, threshold=8"]
        T4[">1.5: Slow Device<br/>buffer=20, threshold=10"]
    end
    M1 --> M2 --> T1
    T1 -->|"Fast"| T2
    T1 -->|"Medium"| T3
    T1 -->|"Slow"| T4
```
| Device Speed | Ratio | Initial Buffer | Decode Threshold | Reserve |
|---|---|---|---|---|
| Faster than real-time | ≤1.0 | 8 frames (~640ms) | 3 | 2 |
| Slightly slower | ≤1.2 | 10 frames | 5 | 2 |
| Moderately slower | ≤1.5 | 15 frames (~1.2s) | 8 | 4 |
| Quite slow | ≤2.0 | 20 frames (~1.6s) | 10 | 6 |
| Very slow | >2.0 | 30 frames (~2.4s) | 10 | 6 |
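In Kotlin, the tuning table above reduces to a simple `when` expression. The names `StreamParams` and `tuneForRatio` are hypothetical, but the thresholds and values mirror the table exactly:

```kotlin
// Hypothetical sketch of the buffer-tuning step; names are illustrative,
// values follow the tuning table above.
data class StreamParams(
    val initialBuffer: Int,   // frames generated before playback starts
    val decodeThreshold: Int, // frames queued before decoding resumes
    val reserve: Int          // frames held back against underruns
)

fun tuneForRatio(ratio: Double): StreamParams = when {
    ratio <= 1.0 -> StreamParams(8, 3, 2)    // faster than real-time
    ratio <= 1.2 -> StreamParams(10, 5, 2)   // slightly slower
    ratio <= 1.5 -> StreamParams(15, 8, 4)   // moderately slower
    ratio <= 2.0 -> StreamParams(20, 10, 6)  // quite slow
    else         -> StreamParams(30, 10, 6)  // very slow
}
```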
- Zero Configuration: Works optimally on any device without user tuning
- Smooth Playback: Larger buffers on slower devices prevent audio gaps
- Lower Latency: Faster devices get smaller buffers for quicker startup
- Continuous Adaptation: Re-measures every 10 frames to handle thermal throttling
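The measurement behind that adaptation is straightforward arithmetic: average the generation time of the most recent frames and divide by the ~80ms of audio each frame represents. A sketch (the `measureRatio` helper is hypothetical, not the app's actual code):

```kotlin
// Hypothetical sketch: each frame represents ~80ms of audio, so
// ratio = (avg generation time) / 80ms. Ratio > 1.0 means the device
// generates slower than real-time and needs larger buffers.
const val FRAME_DURATION_MS = 80.0

fun measureRatio(frameTimesMs: List<Double>): Double {
    val recent = frameTimesMs.takeLast(10) // window matches "every 10 frames"
    return recent.average() / FRAME_DURATION_MS
}
```

For example, frames averaging 120ms to generate yield a ratio of 1.5, which the tuning table maps to a ~1.2s initial buffer.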
Note: Even with adaptive streaming, very long sentences may still experience some choppiness on slower devices. This is because the entire pipeline runs on CPU-only ONNX inference - we deliberately avoid GPU/NPU acceleration (NNAPI, QNN) as these accelerators don't support all operations in the transformer models.
For the smoothest experience on long audiobook chapters, consider using Batch mode in settings, which generates all audio before playback.
Future: We're exploring Qualcomm QNN and custom ONNX Runtime builds for potential GPU acceleration on compatible devices.
- Clone the repository.
- Open in Android Studio (Ladybug+).
- Build and Run (`Shift + F10`).
- Note: Ensure NDK is installed for C++ builds.
I gratefully acknowledge the incredible work of the open-source AI community:
- Pocket-TTS (Original Model)
  - Thanks to Kyutai Labs for the original Pocket-TTS model and default voices.
- Pocket-TTS ONNX Export & Models
  - Thanks to KevinAHM for the ONNX export tool and reference implementation that made this Android port possible.
- Celebrity Voice Dataset
  - Thanks to sdialog for the celebrity voice samples used for voice cloning.
- Kokoro-ONNX
  - Thanks to thewh1teagle for the inspiration and ONNX export work.
- KittenTTS
  - Thanks to the KittenML team for their work on efficient TTS architectures.
- Piper & Piper Voices
  - Thanks to the Rhasspy team for the amazing Piper architecture and the massive collection of high-quality voices.
- Piper Tamil Voice (Valluvar)
  - Special thanks to Jeyaram-K for training and providing the high-quality Tamil "Valluvar" model.
- Misaki
  - G2P logic ported from this excellent library.
- Espeak-NG
  - The backbone of multilingual phonemization.
NekoSpeak is licensed under the MIT License.
Note: While the NekoSpeak application code is MIT, it bundles dependencies with their own licenses:
- Espeak-NG: GPL v3.0
- ONNX Models: Apache 2.0 / CC-BY-4.0 (Check specific model licenses)
Developed by Sivasubramanian Ramanathan




