Silly AI

This is my playground for local LLM experiments. It is especially useful when you want AI capabilities while offline. Even with an internet connection, local LLMs avoid transmitting data to a third party.

silly - "Hey Silly" is a CLI-based AI voice chat assistant that combines real-time speech transcription, local LLMs, and TTS. It answers questions from an LLM while completely offline, making it a good replacement for cloud-based personal assistants such as Siri, Alexa, Google, or ChatGPT if you are concerned about sending data over the internet.
silly transcribe - Converts ogg/wav files into text
silly orb-demo - Interactive demo of graphical orb animations (cycles through all states and styles)
silly summarize - Summarizes transcripts
ai department - A department/roundtable of "experts"

Features

Real-time speech-to-text using transcribe-rs with NVIDIA Parakeet
Voice Activity Detection (VAD) with Silero for utterance segmentation
Live preview transcription (gray text) while speaking
Local LLM inference via llama.cpp with Metal GPU acceleration (or Ollama)
Auto-download models from HuggingFace on first run
Text-to-speech with Kokoro TTS or Supertonic
Streaming TTS: speech starts as soon as the first sentence is generated
Real-time audio visualization: Animated bars showing microphone input and TTS output volume levels
Multi-threaded architecture: separate threads for audio capture, VAD, preview transcription, and final transcription
Hardware acceleration: Metal on Apple Silicon for LLM, CoreML for VAD, transcription, and TTS
Crosstalk mode: Continue listening while TTS plays, with barge-in support
Multiple modes: Chat, Transcribe, and Note-taking

Demo

[Image: demo]

Architecture

┌─────────┐    ┌─────┐    ┌─────────────────┐    ┌─────────┐    ┌───────────┐    ┌─────┐
│  Audio  │───▶│ VAD │───▶│ Final Transcr.  │───▶│         │───▶│ llama.cpp │───▶│ TTS │
│ Capture │    │     │    └─────────────────┘    │ Display │    │  (Metal)  │    │     │
└─────────┘    │     │    ┌─────────────────┐    │         │    └───────────┘    └─────┘
               │     │───▶│Preview Transcr. │───▶│         │
               └─────┘    └─────────────────┘    └─────────┘
                          (lossy channel)
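
The README does not describe how the lossy channel in front of preview transcription is built. One plausible reading is a small bounded queue where new audio is dropped whenever the preview worker is still busy, so previews can never hold up final transcription. A minimal standard-library sketch of that idea follows; `AudioChunk`, `preview_channel`, `offer_preview`, and the queue capacity are hypothetical and not taken from the project source.

```rust
use std::sync::mpsc::{sync_channel, Receiver, SyncSender, TrySendError};

/// Hypothetical message type: a chunk of audio samples awaiting preview transcription.
type AudioChunk = Vec<f32>;

/// Creates a bounded channel for preview work. The capacity is an assumption.
fn preview_channel(capacity: usize) -> (SyncSender<AudioChunk>, Receiver<AudioChunk>) {
    sync_channel(capacity)
}

/// Offers a chunk to the preview worker without ever blocking the caller.
/// Returns true if the chunk was queued and false if it was dropped,
/// either because the queue is full or because the worker has exited.
fn offer_preview(tx: &SyncSender<AudioChunk>, chunk: AudioChunk) -> bool {
    match tx.try_send(chunk) {
        Ok(()) => true,
        Err(TrySendError::Full(_)) | Err(TrySendError::Disconnected(_)) => false,
    }
}
```

Dropping a preview chunk is harmless in this design because the final transcription path receives the full utterance separately.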

Crosstalk Flow

When crosstalk is enabled, the system continues processing audio while TTS is playing:

┌─────────────────────────────────────────────────────────────────────────────┐
│                           Crosstalk System                                   │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│   ┌─────────┐    ┌─────┐    ┌──────────────────────┐                        │
│   │  Mic    │───▶│ VAD │───▶│ Crosstalk Enabled?   │                        │
│   │ Input   │    │     │    └──────────┬───────────┘                        │
│   └─────────┘    └─────┘               │                                    │
│                                        │                                    │
│                    ┌───────────────────┴───────────────────┐                │
│                    │                                       │                │
│                    ▼                                       ▼                │
│            ┌──────────────┐                       ┌──────────────┐          │
│            │     Yes      │                       │      No      │          │
│            │ Continue     │                       │ Mute during  │          │
│            │ Processing   │                       │    TTS       │          │
│            └──────┬───────┘                       └──────────────┘          │
│                   │                                                         │
│                   ▼                                                         │
│         ┌─────────────────┐                                                 │
│         │ Speech Detected │                                                 │
│         │  during TTS?    │                                                 │
│         └────────┬────────┘                                                 │
│                  │                                                          │
│         ┌────────┴────────┐                                                 │
│         │                 │                                                 │
│         ▼                 ▼                                                 │
│   ┌───────────┐    ┌───────────┐                                           │
│   │ Duck TTS  │    │ Barge-in: │                                           │
│   │ to 20%    │    │ Stop TTS  │                                           │
│   │ volume    │    │ + Process │                                           │
│   └───────────┘    │ new input │                                           │
│                    └───────────┘                                           │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

Crosstalk behaviors:

Volume ducking: When you speak while TTS is playing, volume reduces to 20%
Barge-in: Your speech stops TTS and processes your new input
Stop command: Say "stop" to halt TTS without triggering a new response
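
The three behaviors above can be read as one decision made for each microphone event that arrives during TTS playback. The sketch below assumes that ducking is triggered when speech starts and that barge-in or stop is decided once the utterance has been transcribed; the README does not spell out that ordering, and `MicEvent`, `TtsAction`, and `on_mic_event_during_tts` are hypothetical names.

```rust
/// What the audio pipeline observed while TTS was playing (hypothetical type).
enum MicEvent<'a> {
    /// VAD reports that the user started speaking.
    SpeechStarted,
    /// A finished utterance together with its transcription.
    Utterance(&'a str),
}

/// What to do with TTS playback in response (hypothetical type).
#[derive(Debug, PartialEq)]
enum TtsAction {
    /// Crosstalk is off: the microphone is muted during TTS.
    Ignore,
    /// Lower TTS volume to the given level (0.0-1.0).
    Duck(f32),
    /// Stop TTS without sending anything to the LLM.
    StopOnly,
    /// Stop TTS and process the new input.
    BargeIn(String),
}

fn on_mic_event_during_tts(
    event: MicEvent<'_>,
    crosstalk: bool,
    duck_volume: f32,
    stop_phrases: &[&str],
) -> TtsAction {
    if !crosstalk {
        return TtsAction::Ignore;
    }
    match event {
        MicEvent::SpeechStarted => TtsAction::Duck(duck_volume),
        MicEvent::Utterance(text) => {
            // Normalize so that "Stop." matches the stop phrase "stop".
            let normalized = text
                .trim()
                .trim_end_matches(|c: char| c.is_ascii_punctuation())
                .to_lowercase();
            if stop_phrases.iter().any(|p| *p == normalized) {
                TtsAction::StopOnly
            } else {
                TtsAction::BargeIn(text.trim().to_string())
            }
        }
    }
}
```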

Installation

Quick install (recommended)

curl -fsSL https://raw.githubusercontent.com/zz85/silly-ai/main/install.sh | bash

This detects your platform (macOS/Linux, x86_64/ARM) and installs the silly binary to ~/.local/bin/. On first run, required AI models (~500MB) are downloaded automatically to ~/.local/share/silly/models/.

Homebrew (macOS)

brew install zz85/silly-ai/silly

Build from source

git clone https://github.com/zz85/silly-ai.git
cd silly-ai
cargo build --release
./target/release/silly

Setup

Models are downloaded automatically on first run. No manual setup is required for VAD, speech-to-text, or TTS models.

If you prefer to manage models manually, or need to customize paths, see the sections below.

Configure your LLM backend

Create or edit config.toml:

Option A: Use LM Studio (OpenAI-compatible API)

[llm]
backend = "openai-compat"
preset = "lm_studio"
model = "your-model-name"

Start LM Studio with a model loaded.

Option B: Use llama.cpp (fully offline)

[llm]
backend = "llama-cpp"
hf_repo = "TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF"
hf_file = "tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf"
prompt_format = "chatml"

Option C: Use Ollama

[llm]
backend = "ollama"
model = "mistral:7b-instruct"

Start Ollama server: ollama serve

Build variants

# Default (OpenAI-compatible API + Supertonic TTS with CoreML)
cargo build --release

# With Kokoro TTS instead
cargo build --release --no-default-features --features openai-compat,kokoro

# With llama.cpp instead (Metal GPU, auto-downloads models)
cargo build --release --no-default-features --features supertonic,llama-cpp

# With Ollama instead
cargo build --release --no-default-features --features supertonic,ollama

# With acoustic echo cancellation (AEC)
cargo build --release --features aec

# With voice-to-keyboard typing
cargo build --release --features typing

Note: On Apple Silicon (M1/M2/M3), hardware acceleration is automatically enabled:

LLM: Metal GPU via llama.cpp or CoreML
VAD: CoreML (Silero)
TTS: CoreML (Supertonic)
Transcription: CoreML (Parakeet)

CLI Commands

# Full voice assistant mode (default)
silly

# Transcription-only mode (no LLM/TTS)
silly transcribe

# Quick LLM test
silly probe "What is the capital of France?"

# Test UI rendering without audio
silly test-ui [scene]  # scenes: idle, preview, thinking, speaking, response, all

# Listen mode - continuous audio capture and transcription (requires --features listen)
silly listen                      # Interactive source picker
silly listen -s mic               # Microphone input
silly listen -s system            # System audio (all apps)
silly listen -s "Safari"          # Specific app audio
silly listen --list               # List available apps
silly listen -s mic -o notes.txt  # Custom output file

# Summarize a transcription file
silly summarize -i transcript.txt

# Voice-to-keyboard typing mode (requires --features typing)
silly typing                      # Type speech into active application
silly typing --input-method direct  # Use direct typing instead of clipboard

Building with Listen feature

cargo build --release --features listen

Note: The listen feature uses ScreenCaptureKit for system audio capture, which requires Swift runtime libraries. The .cargo/config.toml includes the necessary linker flags.

Building with Typing feature

cargo build --release --features typing

The typing feature allows you to dictate text directly into any application. Speech is transcribed and typed into the currently focused window.

Usage

Say the wake word ("Hey Silly" by default) to activate, then speak your question. The CLI will:

Show preview text in gray while you're speaking
Print final transcription with > prefix when you pause
Send to LLM and stream the response in cyan
Speak the response using TTS (streaming sentence-by-sentence)
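
Sentence-by-sentence streaming implies that the LLM output is buffered until a sentence boundary appears and then handed to TTS. The project's actual splitter is not shown in this README; the following is a rough sketch with a hypothetical `SentenceBuffer` that treats `.`, `!`, or `?` followed by whitespace as a boundary.

```rust
/// Accumulates streamed LLM text and yields complete sentences (hypothetical type).
#[derive(Default)]
struct SentenceBuffer {
    pending: String,
}

/// Returns the byte index just past the first sentence terminator that is
/// followed by whitespace, if there is one.
fn sentence_end(text: &str) -> Option<usize> {
    let mut prev: Option<char> = None;
    for (idx, c) in text.char_indices() {
        if c.is_whitespace() && matches!(prev, Some('.' | '!' | '?')) {
            return Some(idx);
        }
        prev = Some(c);
    }
    None
}

impl SentenceBuffer {
    /// Appends a streamed fragment and returns any sentences it completed.
    fn push(&mut self, fragment: &str) -> Vec<String> {
        self.pending.push_str(fragment);
        let mut sentences = Vec::new();
        while let Some(end) = sentence_end(&self.pending) {
            let rest = self.pending.split_off(end);
            let sentence = std::mem::replace(&mut self.pending, rest);
            let sentence = sentence.trim();
            if !sentence.is_empty() {
                sentences.push(sentence.to_string());
            }
        }
        sentences
    }

    /// Flushes whatever is left when the LLM stream ends.
    fn finish(&mut self) -> Option<String> {
        let rest = std::mem::take(&mut self.pending);
        let rest = rest.trim();
        (!rest.is_empty()).then(|| rest.to_string())
    }
}
```

This naive rule also splits after abbreviations such as "Dr.", which a real splitter would likely need to handle.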

After responding, the assistant listens for follow-up questions for 30 seconds (configurable via wake_timeout_secs) before requiring the wake word again.
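
The follow-up window can be modeled as a timestamp check. This is only an illustration of the behavior described above, not the project's code; `WakeGate` and its methods are hypothetical, and the current time is passed in explicitly so the logic is easy to test.

```rust
use std::time::{Duration, Instant};

/// Tracks whether the wake word is currently required (hypothetical type).
struct WakeGate {
    timeout: Duration,
    last_response: Option<Instant>,
}

impl WakeGate {
    fn new(wake_timeout_secs: u64) -> Self {
        Self {
            timeout: Duration::from_secs(wake_timeout_secs),
            last_response: None,
        }
    }

    /// Call when the assistant finishes responding.
    fn mark_response(&mut self, now: Instant) {
        self.last_response = Some(now);
    }

    /// True before the first response and once the follow-up window has elapsed.
    fn needs_wake_word(&self, now: Instant) -> bool {
        match self.last_response {
            Some(t) => now.saturating_duration_since(t) >= self.timeout,
            None => true,
        }
    }
}
```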

Keyboard Commands

Command	Aliases	Description
`/mute`	`/mic`	Toggle microphone mute
`/speak`	`/tts`	Toggle TTS output
`/wake`		Toggle wake word requirement
`/crosstalk`		Toggle crosstalk mode (listen during TTS)
`/aec`	`/echo`	Toggle acoustic echo cancellation
`/mode <mode>`		Switch mode: `chat`, `transcribe`, `note`
`/stats`		Show inference performance stats
`/help`	`/h`, `/?`	Show available commands

Type text and press Enter to submit directly (bypasses transcription).
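
As an illustration of how a typed line might be routed, the sketch below maps the commands and aliases from the table to an enum and treats any line without a leading slash as direct text input. The names `Command`, `Input`, and `parse_input` are hypothetical, and the handling of unknown commands is an assumption.

```rust
#[derive(Debug, PartialEq)]
enum Command {
    Mute,
    Speak,
    Wake,
    Crosstalk,
    Aec,
    Mode(String),
    Stats,
    Help,
}

#[derive(Debug, PartialEq)]
enum Input {
    /// A recognized slash command.
    Command(Command),
    /// Plain text to submit directly, bypassing transcription.
    Text(String),
    /// A slash command that is not in the table, or has a bad argument.
    Unknown(String),
}

fn parse_input(line: &str) -> Input {
    let line = line.trim();
    if !line.starts_with('/') {
        return Input::Text(line.to_string());
    }
    let mut parts = line.split_whitespace();
    let command = match parts.next().unwrap_or("") {
        "/mute" | "/mic" => Command::Mute,
        "/speak" | "/tts" => Command::Speak,
        "/wake" => Command::Wake,
        "/crosstalk" => Command::Crosstalk,
        "/aec" | "/echo" => Command::Aec,
        "/stats" => Command::Stats,
        "/help" | "/h" | "/?" => Command::Help,
        "/mode" => match parts.next() {
            Some(mode @ ("chat" | "transcribe" | "note")) => Command::Mode(mode.to_string()),
            _ => return Input::Unknown(line.to_string()),
        },
        _ => return Input::Unknown(line.to_string()),
    };
    Input::Command(command)
}
```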

Voice Commands

These commands are recognized from speech and processed before the LLM:

Command	Phrases	Action
Stop	"stop", "quiet", "shut up", "enough"	Stop TTS playback
Mute	"mute", "be quiet"	Disable TTS output
Unmute	"unmute", "speak"	Enable TTS output
Start Chat	"start chat", "let's chat"	Enter chat mode
Start Transcription	"start transcription"	Enter transcribe mode
Take Note	"take a note"	Enter note-taking mode
Typing Mode	"typing mode", "start typing"	Enter voice-to-keyboard mode
Stand Down	"stand down"	Graceful shutdown
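
A simple way to picture this pre-LLM step is a phrase table checked against the normalized transcript. The sketch assumes the whole utterance must equal a phrase, which the README does not state; `VoiceCommand`, `PHRASES`, and `match_voice_command` are hypothetical names.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum VoiceCommand {
    Stop,
    Mute,
    Unmute,
    StartChat,
    StartTranscription,
    TakeNote,
    TypingMode,
    StandDown,
}

/// Phrases copied from the table above.
const PHRASES: &[(&str, VoiceCommand)] = &[
    ("stop", VoiceCommand::Stop),
    ("quiet", VoiceCommand::Stop),
    ("shut up", VoiceCommand::Stop),
    ("enough", VoiceCommand::Stop),
    ("mute", VoiceCommand::Mute),
    ("be quiet", VoiceCommand::Mute),
    ("unmute", VoiceCommand::Unmute),
    ("speak", VoiceCommand::Unmute),
    ("start chat", VoiceCommand::StartChat),
    ("let's chat", VoiceCommand::StartChat),
    ("start transcription", VoiceCommand::StartTranscription),
    ("take a note", VoiceCommand::TakeNote),
    ("typing mode", VoiceCommand::TypingMode),
    ("start typing", VoiceCommand::TypingMode),
    ("stand down", VoiceCommand::StandDown),
];

/// Returns a command only when the whole utterance equals a known phrase,
/// ignoring case, punctuation, and extra whitespace.
fn match_voice_command(transcript: &str) -> Option<VoiceCommand> {
    let cleaned: String = transcript
        .chars()
        .filter(|c| c.is_alphanumeric() || c.is_whitespace() || *c == '\'')
        .collect::<String>()
        .to_lowercase();
    let normalized = cleaned.split_whitespace().collect::<Vec<_>>().join(" ");
    PHRASES
        .iter()
        .find(|(phrase, _)| *phrase == normalized)
        .map(|(_, command)| *command)
}
```

Anything that returns `None` here would continue on to the LLM as ordinary input.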

Application Modes

Mode	Description
Idle	Default mode. Requires wake word to activate.
Chat	Conversational mode. No wake word needed, continuous conversation.
Transcribe	Speech-to-text only. No LLM processing, just transcription.
Note	Note-taking mode. Transcriptions are appended to a notes file with timestamps.
Typing	Voice-to-keyboard. Speech is typed into the active application. (requires `--features typing`)

Typing Mode Commands

When in typing mode, special voice commands control punctuation, navigation, and editing:

Category	Commands	Action
Punctuation	"period", "comma", "question mark", "exclamation point"	Insert punctuation
Whitespace	"enter", "new line", "tab"	Send key press
Editing	"undo", "redo", "delete", "backspace", "delete word"	Edit operations
Navigation	"go to end of line", "go to start of line", "select all"	Cursor movement
Control	"stop typing", "stop", "pause", "resume"	Control typing mode

Smart command detection: Commands are distinguished from text based on:

Pause duration before speaking (longer pauses suggest commands)
Phrase length (short phrases like "enter" are likely commands)
Pattern matching (recognized command phrases)

Inline punctuation: Say "hello comma world" and it will type "hello, world"
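
A naive version of this substitution is sketched below. It covers only the four punctuation phrases from the table and always treats them as punctuation, whereas the real feature presumably relies on the smart command detection described above; `apply_inline_punctuation` is a hypothetical name.

```rust
/// Replaces spoken punctuation words with symbols attached to the previous word.
fn apply_inline_punctuation(spoken: &str) -> String {
    let words: Vec<&str> = spoken.split_whitespace().collect();
    let mut out = String::new();
    let mut i = 0;
    while i < words.len() {
        let word = words[i].to_lowercase();
        let next = words.get(i + 1).map(|w| w.to_lowercase());
        // Decide whether this word (or word pair) is a punctuation phrase
        // and how many words it consumes.
        let (mark, consumed) = match (word.as_str(), next.as_deref()) {
            ("question", Some("mark")) => (Some('?'), 2),
            ("exclamation", Some("point")) => (Some('!'), 2),
            ("period", _) => (Some('.'), 1),
            ("comma", _) => (Some(','), 1),
            _ => (None, 1),
        };
        match mark {
            // Punctuation attaches directly to the preceding word.
            Some(symbol) => out.push(symbol),
            None => {
                if !out.is_empty() {
                    out.push(' ');
                }
                out.push_str(words[i]);
            }
        }
        i += consumed;
    }
    out
}
```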

The current mode is displayed in the status bar with color coding.

Auto-Submit

Voice input auto-submits after 2 seconds of silence, showing a progress bar. Any typing or new speech cancels the timer.

Press Ctrl+C to stop.

The assistant greets you on startup when in full mode.

Configuration

Create config.toml to customize (see config.example.toml):

name = "Silly"
wake_word = "Hey Silly"
wake_timeout_secs = 30  # Seconds to wait for follow-up before requiring wake word again

[llm]
backend = "llama-cpp"
hf_repo = "TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF"
hf_file = "tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf"
prompt_format = "chatml"  # chatml, mistral, or llama3

# Or use a local model:
# model_path = "models/my-model.gguf"

# Or use OpenAI-compatible API (LM Studio, OpenAI, etc.):
# [llm]
# backend = "openai-compat"
# preset = "lm_studio"  # or "openai" or "ollama"
# model = "model-name"
# api_key = "${OPENAI_API_KEY}"  # optional, for OpenAI

[tts]
engine = "supertonic"
onnx_dir = "models/supertonic/onnx"
voice_style = "models/supertonic/voice_styles/M1.json"
speed = 1.1

[interaction]
# Enable processing input while TTS is playing
crosstalk = false

# Enable acoustic echo cancellation (requires --features aec)
aec = false

# Volume level when user speaks during TTS (0.0-1.0)
duck_volume = 0.2

# Phrases that stop TTS but don't go to LLM
stop_phrases = ["stop", "quiet", "shut up", "enough"]

Configuration Reference

Setting	Default	Description
`name`	"Silly"	Assistant name
`wake_word`	"Hey Silly"	Phrase to activate the assistant
`wake_timeout_secs`	30	After responding, how long to wait for follow-up questions before requiring the wake word again
`interaction.crosstalk`	false	When true, continue listening while TTS plays (enables barge-in)
`interaction.aec`	false	When true, apply acoustic echo cancellation to remove TTS from mic input
`interaction.duck_volume`	0.2	TTS volume (0.0-1.0) when user speaks during playback
`interaction.stop_phrases`	["stop", ...]	Phrases that stop TTS without triggering LLM

LLM Backends

Silly supports multiple LLM backends:

OpenAI-Compatible API (Recommended)

Works with LM Studio, OpenAI, Together.ai, Groq, and any other OpenAI-compatible endpoint.

[llm]
backend = "openai-compat"
preset = "lm_studio"  # or "openai" or "ollama"
model = "model-name"

Presets:

lm_studio - Local LM Studio server (port 1234)
openai - OpenAI API (requires api_key)
ollama - Ollama API mode (port 11434)

Custom endpoints:

[llm]
backend = "openai-compat"
base_url = "https://api.together.xyz/v1"
model = "mistralai/Mixtral-8x7B-Instruct-v0.1"
api_key = "${TOGETHER_API_KEY}"  # Supports environment variables
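
The `${VAR}` syntax suggests that config values are expanded from the environment at load time. How the project does this is not shown here; the sketch below is one straightforward approach, with hypothetical `expand_with` and `expand_env` helpers. Treating an unset variable as an error is an assumption.

```rust
use std::env;

/// Expands every `${NAME}` in `value` using `lookup`.
/// Fails on an unterminated `${` or when a variable has no value.
fn expand_with<F>(value: &str, lookup: F) -> Result<String, String>
where
    F: Fn(&str) -> Option<String>,
{
    let mut out = String::new();
    let mut rest = value;
    while let Some(start) = rest.find("${") {
        out.push_str(&rest[..start]);
        let after = &rest[start + 2..];
        let end = after
            .find('}')
            .ok_or_else(|| format!("unterminated ${{ in {value:?}"))?;
        let name = &after[..end];
        let resolved = lookup(name)
            .ok_or_else(|| format!("environment variable {name} is not set"))?;
        out.push_str(&resolved);
        rest = &after[end + 1..];
    }
    out.push_str(rest);
    Ok(out)
}

/// Convenience wrapper that reads from the process environment.
fn expand_env(value: &str) -> Result<String, String> {
    expand_with(value, |name| env::var(name).ok())
}
```

Keeping the lookup as a parameter means the expansion logic can be tested without touching the real environment.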

Supported providers:

LM Studio (local)
OpenAI (gpt-4, gpt-4o, etc.)
Together.ai
Groq
LocalAI
Text Generation WebUI
Any OpenAI-compatible endpoint

Ollama

Uses the native Ollama Rust SDK for specialized features.

[llm]
backend = "ollama"
model = "mistral:7b-instruct"

llama.cpp

Local inference with GGUF models (auto-downloads from HuggingFace).

[llm]
backend = "llama-cpp"
hf_repo = "TheBloke/Mistral-7B-Instruct-v0.2-GGUF"
hf_file = "mistral-7b-instruct-v0.2.Q4_K_M.gguf"
prompt_format = "mistral"

Kalosm

Pure Rust inference library.

[llm]
backend = "kalosm"
model = "phi3"

LLM Models

Model	Size	prompt_format	hf_repo	hf_file
TinyLlama	~670MB	chatml	`TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF`	`tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf`
Mistral 7B	~4GB	mistral	`TheBloke/Mistral-7B-Instruct-v0.2-GGUF`	`mistral-7b-instruct-v0.2.Q4_K_M.gguf`
Llama 3 8B	~4.5GB	llama3	`QuantFactory/Meta-Llama-3-8B-Instruct-GGUF`	`Meta-Llama-3-8B-Instruct.Q4_K_M.gguf`

For Kokoro TTS:

[tts]
engine = "kokoro"
model = "models/kokoro-v1.0.onnx"
voices = "models/voices-v1.0.bin"
speed = 1.1

Other settings:

VAD thresholds: Edit constants in src/audio.rs and src/vad.rs
Preview interval: PREVIEW_INTERVAL in src/audio.rs (default 500ms)

Runtime State

The application maintains a centralized, thread-safe runtime state that can be queried and modified at runtime. Key state variables:

State	Description
`mic_muted`	Whether microphone input is muted
`mic_level`	Current microphone RMS level (0.0-1.0)
`tts_enabled`	Whether TTS output is enabled
`tts_playing`	Whether TTS is currently playing
`tts_volume`	Current TTS volume (0.0-1.0)
`tts_level`	Current TTS output RMS level (0.0-1.0) for real-time visualization
`crosstalk_enabled`	Whether to process audio during TTS
`aec_enabled`	Whether acoustic echo cancellation is active
`wake_enabled`	Whether wake word is required
`in_conversation`	Whether within wake timeout window
`mode`	Current application mode (Idle/Chat/Transcribe/Note)

Toggleable state can be switched via keyboard commands (e.g., /mute, /crosstalk, /aec).
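
One common way to build shared state like this in Rust is a struct of atomics behind an `Arc`, so audio threads and the UI can read and write without a lock. The actual layout in the project is not shown in this README; the sketch covers only two of the fields above, and `RuntimeState`'s methods and the bit-cast storage of `mic_level` are assumptions.

```rust
use std::sync::atomic::{AtomicBool, AtomicU32, Ordering};
use std::sync::Arc;

/// A small subset of the runtime state, stored in lock-free atomics.
#[derive(Default)]
struct RuntimeState {
    mic_muted: AtomicBool,
    /// The f32 level is stored as its raw bits because std has no AtomicF32.
    mic_level_bits: AtomicU32,
}

impl RuntimeState {
    /// Creates state that can be cloned into every thread.
    fn new_shared() -> Arc<Self> {
        Arc::new(Self::default())
    }

    /// Flips the mute flag and returns the new value.
    fn toggle_mic_muted(&self) -> bool {
        // fetch_xor returns the previous value, so the new one is its negation.
        !self.mic_muted.fetch_xor(true, Ordering::Relaxed)
    }

    /// Stores the microphone level, clamped to 0.0-1.0.
    fn set_mic_level(&self, level: f32) {
        self.mic_level_bits
            .store(level.clamp(0.0, 1.0).to_bits(), Ordering::Relaxed);
    }

    fn mic_level(&self) -> f32 {
        f32::from_bits(self.mic_level_bits.load(Ordering::Relaxed))
    }
}
```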

Audio Visualization

The status bar displays real-time audio levels using animated Unicode bars:

Microphone input: Green bars (▁▂▄▆█) show your voice volume while speaking
TTS output: Magenta bars show the assistant's voice volume during speech playback

Both visualizations use RMS (Root Mean Square) calculation updated every 50ms for smooth animation.
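
For reference, RMS over a window of samples is the square root of the mean of the squared samples. The sketch below computes it and maps the result onto the five bar glyphs shown above; the linear mapping and the function names `rms` and `level_to_bar` are assumptions, not taken from the project.

```rust
/// Root mean square of a window of samples in the range -1.0..=1.0.
fn rms(samples: &[f32]) -> f32 {
    if samples.is_empty() {
        return 0.0;
    }
    let sum_of_squares: f32 = samples.iter().map(|s| s * s).sum();
    (sum_of_squares / samples.len() as f32).sqrt()
}

/// Bar glyphs from quietest to loudest.
const BARS: [char; 5] = ['▁', '▂', '▄', '▆', '█'];

/// Maps a level in 0.0-1.0 linearly onto one of the bar glyphs.
fn level_to_bar(level: f32) -> char {
    let clamped = level.clamp(0.0, 1.0);
    let index = (clamped * (BARS.len() - 1) as f32).round() as usize;
    BARS[index]
}
```

Speech rarely reaches full scale, so a real meter would probably apply some gain or a logarithmic curve before choosing a glyph.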

Profiling with hotpath

cargo install hotpath --features="tui"
cargo run --release --features hotpath
# In another terminal:
hotpath console
