Run large language models locally on a Mac Studio M3 Ultra (96GB) and connect coding agents over LAN.
Mistral Small 4 note: On Apple Silicon, MLX support is still incomplete. Prefer `GGUF` on `llama.cpp` / `LM Studio` / `Ollama` for local use, or `vLLM` for Mistral's official full-feature self-deployment path.
```
MacBook / Linux / WSL ──── LAN ────> Mac Studio M3 Ultra (96GB)

  Claude Code              vllm-mlx (primary)       :8000
  OpenCode                 mlx-openai-server        :8000
  OpenClaw                 oMLX (multi-model)       :8000
  Pi                       vmlx (JANGTQ)            :8000

           OpenAI + Anthropic API (+ Ollama for vmlx)
```
This repository is primarily an operations notebook + config bundle for the Mac Studio inference stack, not a single application codebase.
| Path | Purpose | Start Here |
|---|---|---|
| `docs/server/` | Server runbooks, setup, maintenance, and JANG patches | Pick one `summary.md` per server |
| `docs/models/` | Model catalog, compatibility notes, conversion guides, benchmarks | `docs/models/model-summary.md` |
| `docs/clients/` | Client-side setup for Claude Code, OpenCode, OpenClaw, and Pi | The client you are configuring |
| `configs/` | Ready-to-copy client config templates grouped by server type | `configs/README.md` |
| `scripts/` | Small local patch helpers for upstream server packages | Read before re-running after upgrades |
| `plans/` | Research notes and future work, not the primary source of truth for live setup | Use only for backlog/context |
- Read this `README.md` for the stack overview and server selection.
- Read one server runbook in `docs/server/<server>/summary.md`.
- Read `configs/README.md` for the matching client config templates.
- Read `docs/models/model-summary.md` when choosing or adding models.
- Read the relevant maintenance or patch docs only when upgrading or debugging.

`summary.md` files are the main operational entry points. `maintenance.md` and `jang-patch.md` files are task-specific follow-ups. `plans/` captures ideas, experiments, and pending investigations; it is not the live runbook layer.
Pick one; all serve on port 8000. Stop the others first if switching.
```shell
# vllm-mlx: fastest, single model
nohup ~/vllm-mlx-env/bin/python ~/run_vllm_jang.py serve \
  ~/.omlx/models/JANGQ-AI--Qwen3.5-122B-A10B-JANG_2S \
  --served-model-name JANGQ-AI/Qwen3.5-122B-A10B-JANG_2S \
  --port 8000 --host 0.0.0.0 > /tmp/vllm-mlx.log 2>&1 &

# mlx-openai-server: multi-model, low overhead
JANG_PATCH_ENABLED=1 nohup ~/mlx-openai-server-env/bin/mlx-openai-server launch \
  --config ~/mlx-openai-server-multimodel.yaml --no-log-file \
  > /tmp/mlx-openai-server.log 2>&1 &

# oMLX: 9 models, hot-swap
/opt/homebrew/bin/brew services start omlx

# vmlx: JANGTQ CRACK (MLX Studio bundled Python, headless)
# Tool use + Qwen3 thinking require all three parser flags AND a one-time
# source patch (scripts/patch_vmlx_jangtq_mllm_tools.py). See
# docs/server/vmlx/maintenance.md#tool-use-and-reasoning-mllm-models.
BP=/Applications/vMLX.app/Contents/Resources/bundled-python/python
SNAP=~/.cache/huggingface/hub/models--dealignai--MiniMax-M2.7-JANGTQ-CRACK/snapshots/033d5537f48f2f836ce3dfbe392304a2b30f8536
nohup $BP/bin/python3 -m vmlx_engine.cli serve "$SNAP" \
  --host 0.0.0.0 --port 8000 \
  --enable-auto-tool-choice --tool-call-parser qwen3 --reasoning-parser qwen3 \
  > /tmp/vmlx.log 2>&1 &
```
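Whichever server you start, it loads weights for a while before port 8000 answers. A readiness poll like the following can gate client startup. This is a sketch, not part of the stack: it assumes only `curl` and the OpenAI-style `/v1/models` route that all four servers expose.

```shell
# Poll /v1/models until the server answers or we run out of attempts.
# $1 = base URL, $2 = max attempts (default 30, one second apart).
wait_for_server() {
  local url=$1 tries=${2:-30} i=0
  while [ "$i" -lt "$tries" ]; do
    if curl -sf --max-time 2 "$url/v1/models" > /dev/null; then
      echo "server at $url is up"
      return 0
    fi
    i=$((i + 1))
    sleep 1
  done
  echo "server at $url did not come up" >&2
  return 1
}

# Example: wait_for_server http://<MAC_STUDIO_IP>:8000
```

Call it right after the `nohup … &` line; a non-zero exit means the server never came up and the log file in `/tmp` is the place to look.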
```shell
pkill -f vllm-mlx                          # stop vllm-mlx
pkill -f mlx-openai-server                 # stop mlx-openai-server
/opt/homebrew/bin/brew services stop omlx  # stop oMLX
pkill -f vmlx_engine                       # stop vmlx
```

```shell
curl -s http://<MAC_STUDIO_IP>:8000/v1/models | python3 -m json.tool  # vllm-mlx / mlx-openai-server
curl -s http://<MAC_STUDIO_IP>:8000/v1/models \
  -H "Authorization: Bearer <YOUR_API_KEY>" | python3 -m json.tool    # oMLX (auth required)
open http://<MAC_STUDIO_IP>:8000/admin     # oMLX dashboard
```
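To grab just the model ids rather than the full JSON, `python3` alone is enough (a sketch, no `jq` required; the response shape is the OpenAI list-models schema that all four servers emit, shown here on a canned response):

```shell
# Print one model id per line from a /v1/models response on stdin.
list_model_ids() {
  python3 -c 'import json,sys; [print(m["id"]) for m in json.load(sys.stdin)["data"]]'
}

# Canned response for illustration; the model id is taken from the
# vllm-mlx start command above.
echo '{"object":"list","data":[{"id":"JANGQ-AI/Qwen3.5-122B-A10B-JANG_2S","object":"model"}]}' \
  | list_model_ids
# prints: JANGQ-AI/Qwen3.5-122B-A10B-JANG_2S
```

Against a live server, pipe `curl -s http://<MAC_STUDIO_IP>:8000/v1/models` through `list_model_ids` the same way (add the auth header for oMLX).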
```shell
tail -f /tmp/vllm-mlx.log           # vllm-mlx logs
tail -f /tmp/mlx-openai-server.log  # mlx-openai-server logs
tail -f ~/.omlx/logs/server.log     # oMLX logs
tail -f /tmp/vmlx.log               # vmlx logs
```

Works on all servers; swap in a `<MODEL_NAME>` from `/v1/models`. Add the auth header for oMLX.
```shell
curl -s http://<MAC_STUDIO_IP>:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"<MODEL_NAME>","messages":[{"role":"user","content":"Say hello"}],"max_tokens":50}' \
  | python3 -m json.tool
```

| Server | Speed | Models | API | Best For |
|---|---|---|---|---|
| vllm-mlx | ⚡ Fastest | Single | OpenAI + Anthropic | Daily use: lowest overhead on Apple Silicon |
| mlx-openai-server | 🟢 Fast | Multi (YAML) | OpenAI | Prompt caching, speculative decoding |
| mlx-lm | 🟡 Good | Single | OpenAI | Lightweight dev/testing |
| oMLX | 🔴 Slower | 9 hot-swap | OpenAI + Anthropic | Model variety with SSD caching |
| vmlx (MLX Studio bundled) | 🟢 Fast | JANGTQ only | OpenAI + Anthropic + Ollama | TurboQuant CRACK models: 43.7 tok/s on MiniMax-M2.7 |
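For scripting, the quick-test response above reduces to just the assistant text with `python3` alone (a sketch on a canned response; the field layout is the standard OpenAI chat-completions schema):

```shell
# Extract choices[0].message.content from a /v1/chat/completions response.
extract_reply() {
  python3 -c 'import json,sys; print(json.load(sys.stdin)["choices"][0]["message"]["content"])'
}

# Canned response for illustration; pipe the live curl output instead.
echo '{"choices":[{"message":{"role":"assistant","content":"Hello!"}}]}' | extract_reply
# prints: Hello!
```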
All servers support JANG mixed-precision models via patches: vllm-mlx Β· oMLX Β· mlx-openai-server Β· mlx-lm
Server maintenance: vllm-mlx Β· oMLX Β· mlx-openai-server Β· vmlx
Current mlx-openai-server roster: mlx-community/Qwen3.6-35B-A3B-6bit (single-model, Qwen3.6-only mode; switched 2026-04-18 for through-server benchmarking).
All models fit in 96GB unified memory.
| Model | Type | Size (GB) | Context | Best For |
|---|---|---|---|---|
| Gemma 4 26B-A4B (4-bit) | MoE 26B/4B | 15 | 256K | Vision + video + reasoning + tool use |
| Qwen3.5-122B-A10B JANG 2S | MoE 122B/10B | 35 | 200K+ | Compact 122B, instant load |
| Qwen3-Coder-Next 6-bit | Dense 80B | 60 | 170K | Coding specialist |
| Qwen3-Coder-30B-A3B Instruct 4-bit | MoE 30.5B/3.3B | 17.2 | 262K | Compact coding model |
| Qwen3.5-122B-A10B 4-bit | MoE 122B/10B | 65 | 128K | Full-precision alternative |
| Qwen3.5-27B Opus Distilled | Dense 27B | 19 | 128K | Reasoning / chain-of-thought |
| OmniCoder-9B 8-bit | Dense 9B | 9.5 | 262K | Lightweight coding agent |
| Qwen3.5-35B-A3B JANG 4K | MoE 35B/3B | 19 | 262K | Fast small MoE |
| Qwen3.6-35B-A3B 6-bit | Hybrid MoE 35B/3B + VL | 27 | 262K (1M YaRN) | Vision + hybrid linear attention |
| Nemotron 3 Super 120B | MoE 120B/12B | 66.5 | 200K | Mamba-2 hybrid |
| Nemotron 3 Nano 30B | MoE 32B/3B | 34 | 262K | NVIDIA MoE |
| Nemotron Cascade 2 30B | Hybrid 30B/3B | 17 | 262K | Mamba-2 + MoE |
| MiniMax-M2.7 JANGTQ-CRACK | MoE 230B/10B | 57 | 128K | Uncensored, TurboQuant (vmlx only); see uncen-model |
| Qwen3.6-35B-A3B JANGTQ4-CRACK | MoE 35B/3B + VL | 19.7 | 262K | Uncensored efficiency-frontier winner: ties MiniMax at 10/10 useful compliance at 1/3 the weights, ~64 tok/s sustained, ~4.66 s avg latency. VLM. Default thinking ON. (vmlx only) |
| Qwen3.6-35B-A3B JANGTQ2-CRACK | MoE 35B/3B + VL | 11.6 | 262K | Uncensored VLM, 2-bit TurboQuant (vmlx only). Fastest CRACK (~66 tok/s) but quality-impaired (4/10 useful compliance). Requires `enable_thinking=false` |
Full specs and per-model details: Model Summary
Quantization key: JANG = adaptive mixed-precision (jangq.ai), MoE = Mixture of Experts (total/active params), nvfp4 = NVIDIA 4-bit float.
Qwen3-Coder-Next 6-bit (dense 60GB):
| Server | 512 | 8K | 32K | 64K |
|---|---|---|---|---|
| vllm-mlx | 68.8 🥇 | 63.8 🥇 | 56.4 🥇 | 51.7 🥇 |
| mlx-lm | 68.4 🥈 | 62.7 🥈 | 54.0 🥈 | 47.7 🥈 |
| oMLX | 66.5 | 56.9 | 40.4 | 34.8 |
Gemma 4 26B-A4B 4-bit (MoE, multimodal; mlx-openai-server 1.7.1, Apr 2026):

Tokens include both reasoning (`reasoning_content`) and output (`content`) phases. The 512 row shows warm values (run 1 cold-start: 59.4 tok/s / 28 tok/s prefill / 18.7 s TTFT).
| Context | Gen (tok/s) | Prefill (tok/s) | TTFT (s) |
|---|---|---|---|
| 512 | 62.5 | 1,710 | 0.30 |
| 4K | 54.6 | 3,117 | 1.32 |
| 8K | 60.6 | 3,154 | 2.60 |
| 32K | 50.6 | 2,892 | 11.34 |
| 64K | 42.0 | 2,542 | 25.78 |
| 128K | 27.1 | 1,995 | 65.70 |
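A row of this table converts into an end-to-end latency estimate as TTFT plus output tokens divided by generation rate. A quick sketch using the 128K row above and an assumed 500-token reply (the reply length is illustrative, not measured):

```shell
# Estimated wall-clock time for one 500-token reply at 128K context:
# TTFT (65.7 s) plus generation time (500 tokens at 27.1 tok/s).
awk 'BEGIN { ttft = 65.7; gen = 27.1; tokens = 500
             printf "%.1f s end-to-end\n", ttft + tokens / gen }'
# prints: 84.2 s end-to-end
```

At long context, TTFT dominates: here prefill accounts for roughly three quarters of the wall-clock time.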
Qwen3.6-35B-A3B 6-bit (Hybrid MoE, multimodal; mlx-openai-server 1.7.1, Apr 2026):

Tokens include both `reasoning_content` (always-on `<think>`) and `content`. Server-validated; the standalone-only Apr 17 numbers in the standalone benchmarks carry a VLM double-prefill artefact and are not directly comparable.
| Context | Gen (tok/s) | Prefill (tok/s) | TTFT (s) |
|---|---|---|---|
| 512 | 52.5 | 1,401 | 0.34 |
| 4K | 53.0 | 2,237 | 1.64 |
| 8K | 51.3 | 2,197 | 3.32 |
| 32K | 46.3 | 1,798 | 16.22 |
| 64K | 40.3 | 1,408 | 41.40 |
| 128K | 35.6 | 927 | 125.73 |
Hybrid Gated DeltaNet pays off at long context: Qwen3.6's 35.6 tok/s at 128K is 31% faster than Gemma 4 at the same context, despite Qwen3.6 carrying a vision encoder.
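The 31% figure follows directly from the two 128K generation rates (35.6 vs 27.1 tok/s); a one-line check:

```shell
# Relative speedup of Qwen3.6 over Gemma 4 at 128K context.
awk 'BEGIN { printf "%.0f%% faster\n", (35.6 / 27.1 - 1) * 100 }'
# prints: 31% faster
```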
Qwen3.5-35B-A3B JANG (MoE, primary architecture):
| Server | 32K | 64K |
|---|---|---|
| vllm-mlx | 83.8 🥇 | 71.6 🥇 |
| mlx-openai-server | 81.3 🥈 | 62.8 |
| mlx-lm | 77.6 | 65.1 🥈 |
| oMLX | 59.9 | 49.0 |
128K cross-model (long-context comparison; all measured through-server, Apr 2026):
| Model | Server | Gen tok/s | Prefill tok/s | TTFT (s) |
|---|---|---|---|---|
| Qwen3-Coder-Next 6-bit (Dense 80B) | vllm-mlx | 44.2 🥇 | 736 🥇 | 158 🥇 |
| Qwen3.6-35B-A3B 6-bit (Hybrid MoE + VL) | mlx-openai-server | 35.6 🥈 | 927 | 126 🥈 |
| Qwen3.5-122B JANG 2S | vllm-mlx (JANG) | 34.5 | 405 | 324 |
| Qwen3.5-35B-A3B JANG 4K | oMLX (Mar leaderboard) | 33.8 | 295 | n/a |
| Gemma 4 26B-A4B 4-bit (MoE + VL) | mlx-openai-server | 27.1 | 1,995 | 66 |
Full results: Standalone Β· API Server Β· TurboQuant KV Cache
| Agent | Description | Setup |
|---|---|---|
| Claude Code | Anthropic's official CLI | Guide |
| OpenCode | Autonomous coding agent | Guide |
| OpenClaw | Multi-agent framework | Guide |
| Pi | Coding assistant | Guide |
Server constraints:
- oMLX: no GGUF, no MXFP8, starlette 1.0 dashboard bug (#361). JANG+Nemotron-H matmul mismatch (details). Maintenance
- mlx-openai-server: no Anthropic API, single-request queue, 15% overhead at 64K context, tool-arg string bug (patch). Maintenance
- vllm-mlx: single model only, no dashboard, manual start, v0.2.6 return bug needs patch. Maintenance
- vmlx: JANGTQ only (MLX Studio DMG bundled Python); no GUI, and the install is overwritten on every DMG upgrade. The MLLM path drops `tools[]`, ignores `tools=` in the chat template, and crashes on multi-turn tool replay; fix with `scripts/patch_vmlx_jangtq_mllm_tools.py` (detail). Requires `--enable-auto-tool-choice --tool-call-parser qwen3 --reasoning-parser qwen3`. Maintenance
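For reference, a multi-turn tool replay of the kind that crashes the unpatched vmlx MLLM path looks like this (a hypothetical request body; roles and fields follow the OpenAI tool-calling schema, and the `get_weather` tool and its arguments are purely illustrative):

```json
{
  "model": "<MODEL_NAME>",
  "messages": [
    {"role": "user", "content": "What is the weather in Paris?"},
    {"role": "assistant", "tool_calls": [
      {"id": "call_1", "type": "function",
       "function": {"name": "get_weather", "arguments": "{\"city\": \"Paris\"}"}}
    ]},
    {"role": "tool", "tool_call_id": "call_1", "content": "18C, clear"}
  ],
  "tools": [
    {"type": "function", "function": {
      "name": "get_weather",
      "parameters": {"type": "object",
                     "properties": {"city": {"type": "string"}},
                     "required": ["city"]}}}
  ]
}
```

POST it to `/v1/chat/completions` as in the quick test: the replayed `assistant` turn carrying `tool_calls` plus the `tool` result turn is the shape the patch script makes survivable.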
Model compatibility:
- Nemotron family β Only works on vllm-mlx (chat template not packaged in MLX weights). Details
- Mistral Small 4: broken on the current MLX servers here (missing native `mistral4` MLA support in upstream `mlx-lm`). For Apple Silicon, the practical local path is `GGUF` on `llama.cpp` / `LM Studio` / `Ollama`; Mistral's official full-feature deployment guidance still points to `vLLM`. Details
- Qwen3.5-122B + OpenClaw: HTTP 500 with large system prompts (#42)
Maintenance: