Squish
The Local AI
Agent Runtime.
Run any AI model, fully local, on Apple Silicon. Squish compresses 70B models to fit in 18 GB and starts them in under 2 seconds—no GPU, no cloud, no API keys.
squish
The local AI agent runtime
Install once
brew install squish-ai/squish/squish
✓ squish 9.0.0 installed
One command does everything
squish run qwen3:8b
↓ Pulling qwen3:8b 4.4 GB
✓ Model ready 0.43s
✓ Chat open at http://localhost:11435 🌐
Up and running in two steps
Install once. Then squish run handles pull, compress, serve, and opens your chat UI automatically.
One Homebrew command. No Docker, no CUDA, no virtual environment setup.
brew install squish-ai/squish/squish
Downloads the pre-optimised model if needed, loads in milliseconds, opens your chat UI in the browser.
squish run qwen3:8b
squish serve is an alias for squish run — use whichever feels right.
Your data never
leaves your Mac
Every inference runs on your hardware, in your memory. No telemetry on conversations, no API quotas, no usage bills. Fast, private AI you own outright.
No network round-trip means TTFT under 500ms — beating most hosted endpoints on raw latency.
INT4 compression turns a 16 GB BF16 8B model into 4.4 GB. Run two models where you used to fit one.
Calibrated quantisation preserves benchmark accuracy to within statistical noise. Not a lossy compromise.
v9 ships 228 new optimisation modules. Each release improves TTFT and decode throughput, applied automatically on upgrade.
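The 16 GB → 4.4 GB figure is mostly bit-width arithmetic. A back-of-envelope check (the small remainder above 4.0 GB is assumed to be per-group quantisation scales and metadata):

```python
# Back-of-envelope weight sizes for an 8B-parameter model.
params = 8e9

bf16_bytes = params * 2      # 16 bits per weight
int4_bytes = params * 0.5    # 4 bits per weight, before scales/zero-points

print(f"BF16: {bf16_bytes / 1e9:.1f} GB")  # 16.0 GB
print(f"INT4: {int4_bytes / 1e9:.1f} GB")  # 4.0 GB, plus scales ≈ 4.4 GB on disk
```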
Built for speed at every layer
From storage format to HTTP serving, every decision is optimised for Apple Silicon unified memory.
Memory-mapped INT4 tensors load directly into Metal unified memory with zero dtype conversion. Cold start 0.4s, first token 443ms for an 8B model on M3.
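Why mapping beats reading: the OS lends you the file's pages directly, so "load" is just establishing the mapping. A minimal sketch with Python's stdlib `mmap` (the `.squish` file layout here is a made-up placeholder, not Squish's real format):

```python
import mmap, os, tempfile

# Hypothetical layout: INT4 weights packed two per byte in a .squish file.
# mmap maps the file straight into the address space: no read(), no copy,
# no dtype conversion; pages fault in lazily as compute kernels touch them.
def map_weights(path: str) -> mmap.mmap:
    fd = os.open(path, os.O_RDONLY)
    try:
        return mmap.mmap(fd, 0, access=mmap.ACCESS_READ)
    finally:
        os.close(fd)  # safe: the mapping outlives the descriptor

# Demo with a tiny fake weight file (16 packed INT4 values = 8 bytes).
with tempfile.NamedTemporaryFile(delete=False, suffix=".squish") as f:
    f.write(bytes(range(8)))
    path = f.name

weights = map_weights(path)
first_byte = weights[0]  # holds two 4-bit weights: low nibble, high nibble
print(len(weights), first_byte)
```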
Zero code changes. LangChain, LlamaIndex, OpenAI SDK, Cursor, and any tool that speaks /v1/chat/completions works out of the box.
Agents resend the same 10,000-token system prompt every turn. Squish's RadixTree Cache computes it exactly once—giving you instantaneous thought loops at any context depth.
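The idea behind prefix caching, in miniature (a toy sketch, not Squish's RadixTree implementation — a real radix tree shares prefix storage instead of keying a dict on whole tuples):

```python
# KV state is stored per token prefix, so a repeated system prompt is
# prefilled once and every later turn only pays for its new suffix.
class PrefixCache:
    def __init__(self):
        self.cache = {}           # tuple(prefix tokens) -> simulated KV state
        self.tokens_computed = 0  # counts real prefill work

    def prefill(self, tokens):
        # Find the longest already-cached prefix.
        hit = 0
        for i in range(len(tokens), 0, -1):
            if tuple(tokens[:i]) in self.cache:
                hit = i
                break
        # Only the uncached suffix costs compute.
        self.tokens_computed += len(tokens) - hit
        for i in range(hit + 1, len(tokens) + 1):
            self.cache[tuple(tokens[:i])] = f"kv:{i}"

system_prompt = list(range(1_000))        # stand-in for a long system prompt
cache = PrefixCache()
cache.prefill(system_prompt + [1, 2, 3])  # turn 1: full prefill (1003 tokens)
cache.prefill(system_prompt + [4, 5, 6])  # turn 2: only 3 new tokens computed
print(cache.tokens_computed)              # 1006, not 2006
```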
Small models hallucinate syntax. Squish uses engine-level Finite State Machine (FSM) masking to constrain every token to valid JSON matching your schema. Agents never crash a parser again.
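What token masking looks like in principle (a toy character-level FSM for one tiny schema, nothing like Squish's real engine): the grammar state dictates which tokens are legal, and anything else the model proposes is masked out.

```python
import string

# state -> set of legal next characters; a hand-rolled FSM for {"n": <digit>}
FSM = {
    0: {'{'}, 1: {'"'}, 2: {'n'}, 3: {'"'}, 4: {':'},
    5: set(string.digits), 6: {'}'},
}

def constrained_decode(propose):
    out, state = [], 0
    while state in FSM:
        ch = propose(state)
        if ch not in FSM[state]:           # mask: the proposal is illegal
            ch = sorted(FSM[state])[0]     # fall back to a legal token
        out.append(ch)
        state += 1
    return "".join(out)

# A "model" that keeps hallucinating the wrong syntax:
bad_model = lambda state: "n" if state == 5 else "'"
print(constrained_decode(bad_model))       # {"n":0} — always parseable
```

Even with a model that proposes garbage at every step, the output satisfies the grammar, which is why schema-constrained decoding can promise parser-safe JSON.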
A 32k context window normally pushes a 16 GB Mac into swap. Squish's Asymmetric INT4 KV Cache shrinks the KV footprint by 75%, keeping all context hot in unified memory.
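The 75% saving is just the 16-bit → 4-bit ratio. Rough numbers for a 32k context (the layer/head dimensions below are generic assumptions for an 8B-class model, not Qwen3's exact config):

```python
# Approximate KV-cache footprint at a 32k context window.
layers, kv_heads, head_dim, ctx = 32, 8, 128, 32_768

def kv_bytes(bits):
    # 2 tensors (K and V) per layer, one vector per token per KV head
    return 2 * layers * kv_heads * head_dim * ctx * bits / 8

fp16 = kv_bytes(16)
int4 = kv_bytes(4)
print(f"FP16 KV: {fp16 / 2**30:.1f} GiB")  # 4.0 GiB
print(f"INT4 KV: {int4 / 2**30:.1f} GiB")  # 1.0 GiB — 75% smaller
```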
Process multiple prompts in a single request. Essential for evals, data pipelines, and bulk generation—a capability Ollama and LM Studio don't offer.
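What a batched request can look like, assuming Squish follows the OpenAI completions convention of accepting a list of prompts in one body (the endpoint shape is an assumption; only the payload is shown here):

```python
import json

# One request body carrying an entire eval batch.
body = {
    "model": "qwen3:8b",
    "prompt": [                          # one entry per eval item
        "Translate to French: hello",
        "Translate to French: goodbye",
        "Translate to French: thanks",
    ],
    "max_tokens": 32,
}

# POST this to http://localhost:11435/v1/completions; in the OpenAI schema
# the response carries one choice per prompt, matched up by its "index".
payload = json.dumps(body)
print(len(body["prompt"]), "prompts in one round-trip")
```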
Why Squish beats the rest
Real measurements, same hardware. qwen3:8b on M3 MacBook Pro, cold start, no caching.
| Metric | Ollama | LM Studio | Squish ✶ |
|---|---|---|---|
| TTFT — qwen3:8b cold | 20–30 s | ~18–28 s | 443 ms ✶ |
| Cold-start load time | ~28 s | ~20 s | 0.4 s ✶ |
| RAM during load (8B) | ~2.5 GB | ~2.5 GB | 160 MB ✶ |
| Disk size — 8B model | 4.7 GB (GGUF Q4) | 4.7 GB (GGUF Q4) | 4.4 GB INT4 ✶ |
| OpenAI API | ✓ | ✓ | ✓ |
| Batch requests | ✗ | ✗ | ✓ |
| Pre-optimised weights (HuggingFace) | ✗ | ✗ | ✓ 29 models |
| Auto-open chat UI | ✗ | ✓ | ✓ |
| Zero-copy mmap Metal load | ✗ | ✗ | ✓ |
| 10k-token loop TTFT (agent re-prompt) | 5–8 s | ~4 s | 150 ms (RadixTree) ✶ |
| Guaranteed JSON syntax (FSM) | ✗ | ✗ | ✓ 100% reliable |
| Context window compression | FP16 only (high VRAM) | FP16 only | INT4 (75% less VRAM) |
✶ Measured on M3 16 GB · qwen3:8b · cold start · Squish v9.0. Ollama TTFT includes cold model load. RAM = peak during weight-loading phase.
Everything you need, right here
macOS via Homebrew (recommended)
brew install squish-ai/squish/squish
✓ squish 9.0.0 installed
Or via pip (Python 3.10+)
pip install squish
Verify installation
squish --version
squish 9.0.0
One command: pull, optimise, serve, open browser
squish run qwen3:8b
↓ Pulling qwen3:8b 4.4 GB ██████████ 100%
✓ Model ready 0.43s
✓ Server http://localhost:11435
✓ Chat UI opening in browser... 🌐
No model? Interactive picker appears
squish run
? Choose a model:
> qwen3:8b 4.4 GB · INT4 (recommended)
qwen3:4b 2.3 GB · INT4
llama3.2:3b 1.5 GB · INT4
Browser UI opens automatically after squish run
┌─────────────────────────────────────┐
│ 🟣 squish localhost:11435 │
├─────────────────────────────────────┤
│ Model: qwen3:8b ▾ │
├─────────────────────────────────────┤
│ │
│ 🟢 Hi! Running on your Mac. │
│ No cloud. No cost. Fully private. │
│ │
│ You: [ ] → │
└─────────────────────────────────────┘
Terminal chat (no browser)
squish chat qwen3:8b
Loading... 0.4s
You: Hello!
AI: Hi! How can I help? <streams instantly>
squish run already starts a server; or start it manually
squish serve qwen3:8b
→ http://localhost:11435 (OpenAI-compatible)
Zero code changes from OpenAI SDK
python3
import openai

client = openai.OpenAI(
    base_url="http://localhost:11435/v1",
    api_key="local",
)
r = client.chat.completions.create(
    model="qwen3:8b",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(r.choices[0].message.content)
Ollama-compatible too
export OLLAMA_HOST=http://localhost:11435
Compress any HuggingFace model to INT4 locally
squish pull meta-llama/Llama-3.3-70B-Instruct --int4
Downloading weights...
Quantising INT4 ████████████ 100%
✓ 18.2 GB → 4.9 GB (73% smaller)
INT8 for near-lossless quality (~50% smaller)
squish pull meta-llama/Llama-3.3-70B-Instruct --int8
✓ 18.2 GB → 9.2 GB (statistically identical outputs)
Rust quantizer: 4-6x faster compression (optional)
cargo build --release -p squish_quant_rs
Join the Squish community
Chat, contribute, and share pre-squished models with others running local AI on Apple Silicon.
Reclaim your VRAM.
Unleash your Agents.
Turn your MacBook into a production-grade AI agent server in under 60 seconds. No cloud, no API bills, no out-of-memory crashes.