Benchmark to choose. Dashboard to monitor. History to spot problems. asiai's REST API lets your AI agents monitor, diagnose, and optimize local LLM infrastructure autonomously.
```shell
asiai bench
asiai web
```

```json
{
  "chip": "Apple M4 Pro",
  "ram_gb": 64.0,
  "memory_pressure": "normal",
  "gpu_utilization_percent": 45.2,
  "engines": {
    "ollama": { "running": true, "models_loaded": 2 },
    "lmstudio": { "running": true, "models_loaded": 1 }
  }
}
```
```json
{
  "system": {
    "chip": "Apple M4 Pro",
    "gpu_cores": 20,
    "gpu_utilization_percent": 45.2,
    "gpu_renderer_percent": 38.1,
    "thermal_state": "nominal"
  },
  "engines": [{
    "name": "ollama",
    "models": [{ "name": "qwen3.5:latest", "size_params": "35B" }]
  }]
}
```
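A payload like the one above can be consumed with nothing but the Python standard library, in keeping with asiai's zero-dependency philosophy. A minimal sketch (field names come from the example response; the helper name is ours, not part of asiai):

```python
import json

def gpu_utilization(snapshot_json: str) -> float:
    """Extract the GPU utilization percentage from an asiai snapshot payload."""
    snapshot = json.loads(snapshot_json)
    return snapshot["system"]["gpu_utilization_percent"]

example = """
{
  "system": {"chip": "Apple M4 Pro", "gpu_cores": 20,
             "gpu_utilization_percent": 45.2,
             "gpu_renderer_percent": 38.1,
             "thermal_state": "nominal"},
  "engines": [{"name": "ollama",
               "models": [{"name": "qwen3.5:latest", "size_params": "35B"}]}]
}
"""
print(gpu_utilization(example))  # 45.2
```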
Sound familiar?
Ollama, LM Studio, mlx-lm — each with its own CLI, formats, and metrics. No common ground.
No real-time VRAM monitoring, no power tracking, no thermal alerts. You're flying blind.
Benchmarking means curl scripts, copy-pasting numbers, and comparing in spreadsheets.
Everything you need to benchmark, monitor, and optimize local inference.
Same model on Ollama vs LM Studio vs mlx-lm. One command, real numbers. No vibes.
Measure GPU power during inference. Know your tok/s per watt — nobody else does this.
Ollama, LM Studio, mlx-lm, llama.cpp, oMLX, vllm-mlx, Exo. Auto-detected, auto-configured.
stdlib Python only. No requests, no psutil, no rich. Installs in seconds.
Real-time GPU utilization, renderer, tiler, and memory — via passive IOReport. Live gauges, sparklines, historical charts. See your Apple Silicon GPU like never before.
Auto-detects performance drops after OS or engine updates. SQLite history with 90-day retention.
Full JSON API for automation. /api/snapshot, /api/status, /api/metrics — integrate with any stack.
Built-in /metrics endpoint. Plug into Grafana, Datadog, or any Prometheus-compatible tool. Zero config.
POST to Slack, Discord, or any URL on memory pressure, thermal throttling, or engine down. Transition-based — no spam.
Share benchmarks anonymously. Compare your Mac against the community. See what others achieve on the same chip.
"On your M4 Pro 64 GB, for code: Qwen3.5-35B on mlx-lm at 71 tok/s." Data-driven answers to the #1 question on r/LocalLLaMA.
Benchmark Exo clusters. 2 Mac Minis = Llama 3.3 70B. asiai measures the swarm like a single machine.
One command, one shareable image. Run asiai bench --card and get a 1200x630 dark-themed card with your model, chip, engine comparison, and winner. Post it on Reddit, X, or Discord. The Speedtest for local LLMs.
Built for humans. Ready for AI agents. REST API with JSON endpoints, Prometheus metrics, diagnostic decision trees, and inference activity signals. Give your AI agent a URL and let it self-monitor.
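The transition-based webhook behavior described above (alert once when a condition flips, not on every poll) is easy to reason about in code. A minimal sketch in stdlib Python; the class and method names are illustrative, not asiai's actual internals:

```python
class TransitionAlerter:
    """Fire a callback only when a watched condition changes state,
    so a sustained problem produces one alert, not one per poll."""

    def __init__(self, notify):
        self.notify = notify      # e.g. a function that POSTs to Slack/Discord
        self.last_state = {}      # condition name -> last observed bool

    def observe(self, condition: str, active: bool):
        previous = self.last_state.get(condition, False)
        if active and not previous:
            self.notify(f"{condition}: triggered")
        elif previous and not active:
            self.notify(f"{condition}: resolved")
        self.last_state[condition] = active

fired = []
alerter = TransitionAlerter(fired.append)
for pressure in (False, True, True, True, False):  # memory pressure over 5 polls
    alerter.observe("memory_pressure", pressure)
print(fired)  # ['memory_pressure: triggered', 'memory_pressure: resolved']
```

Three consecutive polls under memory pressure produce a single notification, plus one more when the pressure clears.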
Real questions from r/LocalLLaMA, answered in one command.
Head-to-head comparison — the #1 question on r/LocalLLaMA.
LLMs running 24/7 for AI agents — track VRAM, thermal, and performance.
tok/s per watt between engines. Critical for 24/7 Mac Mini homelabs.
Did the Ollama or macOS update break your performance? Auto-detection via SQLite.
--context-size 64k benchmarks. Does your model survive 256k context?
Drift detection across benchmark runs. Unique to asiai.
MLPerf/SPEC methodology. Warmup, median, greedy decoding. Share with confidence.
asiai doctor diagnoses system, engines, and database with fix suggestions.
Dark/light web dashboard with live charts, SSE progress, benchmark controls.
Same engine, different models. Which quantization wins?
Expose /metrics, scrape with Prometheus, visualize in Grafana. Production-grade observability.
GPU activity, TCP connections, KV cache — know when your agents are thinking, idle, or overloaded. API-ready for swarm orchestrators.
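Signals like these can feed a simple activity classifier for a swarm orchestrator. A sketch under assumed thresholds (the function name and cutoffs are our invention, not asiai's):

```python
def classify_activity(gpu_percent: float, active_connections: int) -> str:
    """Rough inference-activity heuristic from GPU load and open TCP
    connections. Thresholds are illustrative; tune them for your hardware."""
    if active_connections == 0 and gpu_percent < 10:
        return "idle"
    if gpu_percent > 90 and active_connections > 4:
        return "overloaded"
    return "thinking"

print(classify_activity(2.0, 0))    # idle
print(classify_activity(95.0, 8))   # overloaded
print(classify_activity(45.2, 1))   # thinking
```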
Three commands. That's it.
```shell
$ brew install asiai

$ asiai detect
✔ ollama (11434)
✔ lmstudio (1234)
✔ mlx-lm (8080)
→ 3 engines found

$ asiai bench -m qwen3.5
Engine     tok/s   TTFT
lmstudio   71.2    42ms
ollama     54.8    61ms
mlx-lm     30.1    38ms
```
Numbers from actual benchmarks on Apple Silicon.
MLX is 2.3x faster for MoE architectures (Qwen3.5-35B-A3B) on Apple Silicon.
VRAM stays constant from 64k to 256k context with DeltaNet — not documented anywhere else.
Same model, same Mac: 30 tok/s on one engine, 71 tok/s on another. The engine matters more.
Auto-detected, zero configuration needed.
| Engine | Default Port | API | Format | VRAM |
|---|---|---|---|---|
| Ollama | 11434 | Native | GGUF | ✔ |
| LM Studio | 1234 | OpenAI-compatible | GGUF + MLX | ✔ |
| mlx-lm | 8080 | OpenAI-compatible | MLX | — |
| llama.cpp | 8080 | OpenAI-compatible | GGUF | — |
| oMLX | 8000 | OpenAI-compatible | MLX | — |
| vllm-mlx | 8000 | OpenAI-compatible | MLX | — |
| Exo | 52415 | OpenAI-compatible | MLX | — |
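Detection can start with a TCP probe of each engine's default port from the table above. A minimal stdlib sketch of the idea (an illustration, not asiai's actual detection code):

```python
import socket

# Default ports per engine, from the compatibility table.
DEFAULT_PORTS = {
    "ollama": 11434, "lmstudio": 1234, "mlx-lm": 8080,
    "llama.cpp": 8080, "oMLX": 8000, "vllm-mlx": 8000, "Exo": 52415,
}

def port_open(port: int, host: str = "127.0.0.1", timeout: float = 0.25) -> bool:
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def detect_engines() -> list[str]:
    """Engines whose default port is currently accepting connections."""
    return sorted({name for name, port in DEFAULT_PORTS.items() if port_open(port)})
```

Note that mlx-lm and llama.cpp share port 8080 (and oMLX shares 8000 with vllm-mlx), so a real detector must also inspect the server's API responses to tell them apart, which is presumably why asiai goes beyond a bare port check.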
8 metrics, consistent methodology, every run.
- Generation speed (tokens/sec)
- Time to first token (TTFT)
- GPU power draw (watts)
- Energy efficiency (tok/s per watt)
- Run-to-run variance
- GPU memory footprint
- Throttling state
- Long-context performance scaling
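The warmup/median methodology is easy to reproduce when post-processing your own runs. A sketch under the assumptions that warmup runs are discarded and the median is reported, as described above; the function name and output format are ours:

```python
import statistics

def summarize_runs(toks_per_sec: list[float], watts: list[float], warmup: int = 1):
    """Discard warmup runs, then report median speed, run-to-run variance
    (as % of the median), and energy efficiency in tok/s per watt."""
    speeds = toks_per_sec[warmup:]
    median_tps = statistics.median(speeds)
    variance = statistics.pstdev(speeds) / median_tps * 100
    efficiency = median_tps / statistics.median(watts[warmup:])
    return {"tok_s": median_tps, "variance_pct": round(variance, 1),
            "tok_s_per_watt": round(efficiency, 2)}

# First run is a cold-cache warmup and is dropped from the stats.
print(summarize_runs([58.0, 71.0, 71.4, 70.6], [30.0, 28.0, 28.4, 27.6]))
# {'tok_s': 71.0, 'variance_pct': 0.5, 'tok_s_per_watt': 2.54}
```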
Install in seconds. Zero dependencies.
```shell
brew tap druide67/tap
brew install asiai
# or
pip install asiai
```