v47 · Apple Silicon native

Squish
The Local AI
Agent Runtime.

Run any AI model, fully local, on Apple Silicon. Squish compresses 70B models to fit in 18 GB and starts them in under 2 seconds—no GPU, no cloud, no API keys.

Free for personal use macOS M1–M5 No GPU required


terminal — squish

Install once

brew install squish-ai/squish

  ✓ squish 47.0.0 installed

One command does everything

squish run qwen3:8b

  ↓ Pulling qwen3:8b   4.4 GB

  ✓ Model ready      0.43s

  ✓ Chat open at http://localhost:11435 🌐

  

150ms TTFT (10k Context Loop)
100% Perfect JSON (FSM Masking)
73% Smaller Model Disk Size
4x More Context via INT4 Cache
Drop-in OpenAI & Ollama API
Getting started

Up and running in two steps

Install once. Then squish run handles pull, compress, serve, and opens your chat UI automatically.

1
Install Squish

One Homebrew command. No Docker, no CUDA, no virtual environment setup.

brew install squish-ai/squish
2
Run a Model

Downloads the pre-optimised model if needed, loads in milliseconds, opens your chat UI in the browser.

squish run qwen3:8b

squish serve is an alias for squish run — use whichever feels right.

🔒

Your data never
leaves your Mac

Every inference runs on your hardware, in your memory. No telemetry on conversations, no API quotas, no usage bills. Fast, private AI you own outright.

🏠 Runs 100% locally 📴 Works fully offline 🚫 Zero conversation logging 🆓 No API keys needed
⚡
Faster than any cloud API

No network round-trip means TTFT under 500ms — beating most hosted endpoints on raw latency.
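TTFT is easy to verify yourself. Here is a minimal, dependency-free sketch of how you might time it against any streaming token iterator; the 50 ms fake stream below is a stand-in for a real model, not a Squish measurement:

```python
import time

def measure_ttft(token_stream):
    """Return (seconds until first token, list of all tokens)."""
    start = time.perf_counter()
    it = iter(token_stream)
    first = next(it)                    # blocks until the first token arrives
    ttft = time.perf_counter() - start
    return ttft, [first, *it]

# Demo: a generator standing in for a model's streaming response.
def fake_stream():
    time.sleep(0.05)                    # pretend 50 ms to first token
    yield "Hello"
    yield "!"

ttft, tokens = measure_ttft(fake_stream())
print(f"TTFT: {ttft * 1000:.0f} ms, tokens: {tokens}")
```

Point the same helper at a real streaming client and you get an honest wall-clock TTFT that includes any network round-trip.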

💾
73% smaller on disk

INT4 compression turns a 16 GB BF16 8B model into 4.4 GB. Run two models where you used to fit one.
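The arithmetic checks out on the back of an envelope: BF16 stores 2 bytes per weight, INT4 half a byte, plus some overhead for per-group scales (the ~10% overhead figure below is an assumption, not a published Squish number):

```python
PARAMS = 8e9                            # 8B-parameter model
bf16_gb = PARAMS * 2 / 1e9              # 2 bytes per BF16 weight
# INT4: 0.5 bytes per weight, plus an assumed ~10% for group scales
int4_gb = PARAMS * 0.5 * 1.10 / 1e9
saving = 1 - int4_gb / bf16_gb
print(f"{bf16_gb:.1f} GB -> {int4_gb:.1f} GB ({saving:.1%} smaller)")
```

That lands at 16 GB → 4.4 GB, i.e. the quoted ~73%.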

🧠
Statistically identical quality

Calibrated quantisation preserves benchmark accuracy to within statistical noise. Not a lossy compromise.

📈
Gets better every release

v47 ships 228 new optimisation modules. Each release improves TTFT and decode throughput, applied automatically.

Features

Built for speed at every layer

From storage format to HTTP serving, every decision is optimised for Apple Silicon unified memory.

2-second cold start. First token in 443ms.

Memory-mapped INT4 tensors load directly into Metal unified memory with zero dtype conversion. Cold start 0.4s, first token 443ms for an 8B model on M3.

443ms TTFT · qwen3:8b M3
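Squish's actual loader targets Metal, but the underlying mmap trick can be sketched with the Python standard library alone: map the weights file read-only and take a zero-copy view, so the OS pages data in on demand instead of copying it through read():

```python
import mmap, os, tempfile

# Write a stand-in weights file (a real file would hold INT4 tensor blobs).
path = os.path.join(tempfile.mkdtemp(), "weights.bin")
with open(path, "wb") as f:
    f.write(bytes(range(256)) * 16)     # 4 KiB of fake weights

# Map the file read-only: no read() call, no buffer copy, no dtype conversion.
with open(path, "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    weights = memoryview(mm)            # zero-copy view over the mapping
    print(len(weights), weights[0], weights[255])
```

Because nothing is copied at load time, "loading" is just establishing the mapping; pages are faulted in lazily as tensors are first touched.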
🔌
Drop-in for OpenAI. Any OpenAI SDK.

Zero code changes. LangChain, LlamaIndex, OpenAI SDK, Cursor, and any tool that speaks /v1/chat/completions works out of the box.

/v1/chat/completions
🗜
10x faster on repeat prompts.

Agents resend the same 10,000-token system prompt every turn. Squish's RadixTree Cache computes it exactly once—giving you instantaneous thought loops at any context depth.

150ms TTFT on repetitive prompts
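The idea can be sketched with a toy single-slot prefix cache (not Squish's RadixTree implementation, which shares storage across many prefixes): only the tokens past the longest cached prefix cost any prefill work.

```python
class PrefixCache:
    """Toy single-slot prefix cache illustrating prompt reuse."""
    def __init__(self):
        self._cached = []          # tokens whose fake "KV state" we hold
        self.tokens_computed = 0   # total prefill work performed

    def prefill(self, tokens):
        hit = 0                    # length of the longest shared prefix
        for a, b in zip(self._cached, tokens):
            if a != b:
                break
            hit += 1
        self.tokens_computed += len(tokens) - hit  # only the suffix costs compute
        self._cached = list(tokens)
        return hit

system = list(range(10_000))           # a 10k-token system prompt
cache = PrefixCache()
cache.prefill(system + [1, 2, 3])      # turn 1: full 10,003-token prefill
cache.prefill(system + [7, 8, 9])      # turn 2: only 3 tokens of new work
print(cache.tokens_computed)
```

Turn 2 costs 3 tokens of prefill instead of 10,003, which is exactly why agent loops with fat system prompts see order-of-magnitude TTFT drops.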
🧠
Zero broken JSON.

Small models hallucinate syntax. Squish uses engine-level Finite State Machine (FSM) masking to constrain every token to valid JSON matching your schema. Agents never crash a parser again.

Zero JSONDecodeErrors
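Here is a character-level sketch of the masking idea, assuming a toy schema of the form {"answer": <integer>} rather than Squish's actual engine-level FSM: every proposed character outside the legal set is replaced by a legal one, so the output always parses.

```python
import json, string

# Toy schema: the output must match {"answer": <integer>}.
PREFIX = '{"answer": '
SUFFIX = "}"

def allowed_chars(text):
    """The set of characters the FSM permits next (empty = accepting state)."""
    if len(text) < len(PREFIX):
        return {PREFIX[len(text)]}          # still inside the fixed prefix
    body = text[len(PREFIX):]
    if body.endswith(SUFFIX):
        return set()                        # object closed: generation done
    if body == "":
        return set(string.digits)           # the integer needs a first digit
    return set(string.digits) | {SUFFIX}    # more digits, or close the object

def constrained_decode(proposed):
    """Mask each proposed character; substitute a legal one when needed."""
    out = ""
    for ch in proposed:
        legal = allowed_chars(out)
        if not legal:
            break
        out += ch if ch in legal else min(legal)
    while (legal := allowed_chars(out)):    # finish the object deterministically
        out += SUFFIX if SUFFIX in legal else min(legal)
    return out

result = constrained_decode('answer is 42')  # free-form text in...
print(result, json.loads(result))            # ...valid JSON out
```

A real engine applies the same mask to token logits before sampling, so the model can never emit a token that leaves the schema's language.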
4x more context. Same RAM.

A 32k context window normally pushes a 16 GB Mac into swap. Squish's Asymmetric INT4 KV Cache shrinks the KV footprint by 75%, keeping all context hot in unified memory.

4x Context Capacity
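A back-of-envelope check, using assumed hyperparameters for an 8B-class model (36 layers, 8 KV heads, head dim 128; real figures vary by model):

```python
def kv_cache_gb(seq_len, n_layers=36, n_kv_heads=8, head_dim=128, bytes_per=2.0):
    # K and V per layer, each seq_len x n_kv_heads x head_dim elements
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per / 1e9

fp16 = kv_cache_gb(32_000)                    # FP16: 2 bytes per element
int4 = kv_cache_gb(32_000, bytes_per=0.5)     # INT4: half a byte per element
print(f"32k ctx KV cache: FP16 {fp16:.1f} GB vs INT4 {int4:.1f} GB")
```

At 32k tokens that is roughly 4.7 GB of KV cache in FP16 versus about 1.2 GB in INT4, the 75% reduction that keeps a 16 GB Mac out of swap.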
📦
100 concurrent requests. One model.

Process multiple prompts in a single request. Essential for evals, data pipelines, and bulk generation—a capability Ollama and LM Studio don't offer.

"batch": [req1, req2, …]
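The wire format is only hinted at above, so this sketch builds a hypothetical payload matching that "batch" shape with the standard library; the field names other than "batch" are assumptions, and no request is actually sent:

```python
import json

prompts = ["Summarise doc 1", "Summarise doc 2", "Summarise doc 3"]
payload = {
    "model": "qwen3:8b",
    "batch": [
        {"messages": [{"role": "user", "content": p}]} for p in prompts
    ],
}
body = json.dumps(payload)
# POST `body` to http://localhost:11435 with any HTTP client.
print(len(payload["batch"]), "requests in one call")
```

One round-trip instead of N means the server can schedule all prompts together, which is where the batch-throughput win comes from.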
Comparison

Why Squish beats the rest

Real measurements, same hardware. qwen3:8b on M3 MacBook Pro, cold start, no caching.

Metric                               | Ollama                | LM Studio         | Squish ✶
TTFT — qwen3:8b cold                 | 20–30 s               | ~18–28 s          | 443 ms ✶
Cold-start load time                 | ~28 s                 | ~20 s             | 0.4 s ✶
RAM during load (8B)                 | ~2.5 GB               | ~2.5 GB           | 160 MB ✶
Disk size — 8B model                 | 4.7 GB (GGUF Q4)      | 4.7 GB (GGUF Q4)  | 4.4 GB INT4 ✶
OpenAI API                           | ✓                     | ✓                 | ✓
Batch requests                       | ✗                     | ✗                 | ✓
Pre-optimised weights (HuggingFace)  |                       |                   | ✓ 29 models
Auto-open chat UI                    |                       |                   | ✓
Zero-copy mmap Metal load            |                       |                   | ✓
10k-Token Loop TTFT (Agent Reprompt) | 5–8 s                 | ~4 s              | 150 ms (RadixTree) ✶
Guaranteed JSON Syntax (FSM)         |                       |                   | ✓ 100% Reliable
Context Window Compression           | FP16 Only (High VRAM) | FP16 Only         | INT4 (75% Less VRAM)

✶ Measured on M3 16 GB · qwen3:8b · cold start · Squish v47.0. Ollama TTFT includes cold model load. RAM = peak during weight-loading phase.

Quick Start

Everything you need, right here

📦 Install
🚀 Run a model
💬 Chat UI
🔌 API server
🗜 Compress

macOS via Homebrew (recommended)

brew install squish-ai/squish

  ✓ squish 47.0.0 installed

Or via pip (Python 3.10+)

pip install squish

Verify installation

squish --version

  squish 47.0.0

One command: pull, optimise, serve, open browser

squish run qwen3:8b

  ↓ Pulling qwen3:8b   4.4 GB  ██████████ 100%

  ✓ Model ready     0.43s

  ✓ Server          http://localhost:11435

  ✓ Chat UI         opening in browser... 🌐

No model? Interactive picker appears

squish run

  ? Choose a model:

  > qwen3:8b      4.4 GB · INT4  (recommended)

    qwen3:4b      2.3 GB · INT4

    llama3.2:3b   1.5 GB · INT4

Browser UI opens automatically after squish run

  ┌──────────────────────────────────────┐
  │ 🟣 squish            localhost:11435  │
  ├──────────────────────────────────────┤
  │ Model: qwen3:8b ▾                    │
  ├──────────────────────────────────────┤
  │                                      │
  │ 🟢 Hi! Running on your Mac.          │
  │    No cloud. No cost. Fully private. │
  │                                      │
  │ You: [                          ] →  │
  └──────────────────────────────────────┘

Terminal chat (no browser)

squish chat qwen3:8b

  Loading... 0.4s

  You: Hello!

  AI: Hi! How can I help? <streams instantly>

squish run already starts a server; or start it manually

squish serve qwen3:8b

  → http://localhost:11435  (OpenAI-compatible)

Zero code changes from OpenAI SDK

python3

import openai

client = openai.OpenAI(
    base_url="http://localhost:11435/v1",
    api_key="local"
)

r = client.chat.completions.create(
    model="qwen3:8b",
    messages=[{"role": "user", "content": "Hello!"}]
)

print(r.choices[0].message.content)

Ollama-compatible too

  export OLLAMA_HOST=http://localhost:11435

Compress any HuggingFace model to INT4 locally

squish pull meta-llama/Llama-3.3-70B-Instruct --int4

  Downloading weights...

  Quantising INT4  ████████████ 100%

  ✓ 18.2 GB → 4.9 GB  (73% smaller)

INT8 for near-lossless quality (~50% smaller)

squish pull meta-llama/Llama-3.3-70B-Instruct --int8

  ✓ 18.2 GB → 9.2 GB  (statistically identical outputs)

Rust quantizer: 4-6x faster compression (optional)

  cargo build --release -p squish_quant_rs
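The general recipe behind group-wise symmetric INT4 quantisation fits in a few lines (this is the textbook scheme, not Squish's calibrated quantiser): each group of weights shares one floating-point scale, and each weight becomes a 4-bit code in [-8, 7].

```python
def quant_int4(weights, group=4):
    """Group-wise symmetric INT4: one FP scale per group, 4-bit codes."""
    out = []
    for i in range(0, len(weights), group):
        g = weights[i:i + group]
        scale = max(abs(w) for w in g) / 7 or 1.0   # 7 = largest positive code
        codes = [max(-8, min(7, round(w / scale))) for w in g]
        out.append((scale, codes))
    return out

def dequant_int4(packed):
    return [c * s for s, cs in packed for c in cs]

w = [0.12, -0.30, 0.07, 0.55, -0.21, 0.02, 0.40, -0.09]
restored = dequant_int4(quant_int4(w))
err = max(abs(a - b) for a, b in zip(w, restored))
print(f"max dequant error: {err:.3f}")
```

The per-group error is bounded by half a scale step; calibration (as in Squish's pipeline) picks scales and groupings that keep that error below the noise floor of benchmark accuracy.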

Squish

Reclaim your VRAM.
Unleash your Agents.

Turn your MacBook into a production-grade AI agent server in under 60 seconds. No cloud, no API bills, no out-of-memory crashes.