Benchmark to choose. Dashboard to monitor. History to spot problems. asiai's REST API lets your AI agents monitor, diagnose, and optimize local LLM infrastructure autonomously.
```shell
asiai bench
asiai web
```

```json
{
  "chip": "Apple M4 Pro",
  "ram_gb": 64.0,
  "memory_pressure": "normal",
  "gpu_utilization_percent": 45.2,
  "engines": {
    "ollama": { "running": true, "models_loaded": 2 },
    "lmstudio": { "running": true, "models_loaded": 1 }
  }
}
```
```json
{
  "system": {
    "chip": "Apple M4 Pro",
    "gpu_cores": 20,
    "gpu_utilization_percent": 45.2,
    "gpu_renderer_percent": 38.1,
    "thermal_state": "nominal"
  },
  "engines": [{
    "name": "ollama",
    "models": [{ "name": "qwen3.5:latest", "size_params": "35B" }]
  }]
}
```
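A payload like the one above can be consumed with nothing but the Python standard library, in keeping with asiai's zero-dependency philosophy. A minimal sketch (field names come from the example response; the helper name is ours, not part of asiai):

```python
import json

def gpu_utilization(snapshot_json: str) -> float:
    """Extract the GPU utilization percentage from an asiai snapshot payload."""
    snapshot = json.loads(snapshot_json)
    return snapshot["system"]["gpu_utilization_percent"]

example = """
{
  "system": {"chip": "Apple M4 Pro", "gpu_cores": 20,
             "gpu_utilization_percent": 45.2,
             "gpu_renderer_percent": 38.1,
             "thermal_state": "nominal"},
  "engines": [{"name": "ollama",
               "models": [{"name": "qwen3.5:latest", "size_params": "35B"}]}]
}
"""
print(gpu_utilization(example))  # 45.2
```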
Sound familiar?
Ollama, LM Studio, mlx-lm — each with its own CLI, formats, and metrics. No common ground.
No real-time VRAM monitoring, no power tracking, no thermal alerts. You're flying blind.
Benchmarking means curl scripts, copy-pasting numbers, and comparing in spreadsheets.
Everything you need to benchmark, monitor, and optimize local inference.
Same model on Ollama vs LM Studio vs mlx-lm. One command, real numbers. No vibes.
Measure GPU power during inference. Know your tok/s per watt — nobody else does this.
Ollama, LM Studio, mlx-lm, llama.cpp, oMLX, vllm-mlx, Exo. Auto-detected, auto-configured.
stdlib Python only. No requests, no psutil, no rich. Installs in seconds.
Real-time GPU utilization, renderer, tiler, and memory — via passive IOReport. Live gauges, sparklines, historical charts. See your Apple Silicon GPU like never before.
Auto-detects performance drops after OS or engine updates. SQLite history with 90-day retention.
Full JSON API for automation. /api/snapshot, /api/status, /api/metrics — integrate with any stack.
Built-in /metrics endpoint. Plug into Grafana, Datadog, or any Prometheus-compatible tool. Zero config.
POST to Slack, Discord, or any URL on memory pressure, thermal throttling, or engine down. Transition-based — no spam.
Share benchmarks anonymously. Compare your Mac against the community. See what others achieve on the same chip.
"On your M4 Pro 64 GB, for code: Qwen3.5-35B on mlx-lm at 71 tok/s." Data-driven answers to the #1 question on r/LocalLLaMA.
Benchmark Exo clusters. 2 Mac Minis = Llama 3.3 70B. asiai measures the swarm like a single machine.
One command, one shareable image. Run asiai bench --card and get a 1200x630 dark-themed card with your model, chip, engine comparison, and winner. Post it on Reddit, X, or Discord. The Speedtest for local LLMs.
Built for humans. Ready for AI agents. REST API with JSON endpoints, Prometheus metrics, diagnostic decision trees, and inference activity signals. Give your AI agent a URL and let it self-monitor.
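The transition-based webhook behavior described above (alert once when a condition flips, not on every poll) is easy to reason about in code. A minimal sketch in stdlib Python; the class and method names are illustrative, not asiai's actual internals:

```python
class TransitionAlerter:
    """Fire a callback only when a watched condition changes state,
    so a sustained problem produces one alert, not one per poll."""

    def __init__(self, notify):
        self.notify = notify      # e.g. a function that POSTs to Slack/Discord
        self.last_state = {}      # condition name -> last observed bool

    def observe(self, condition: str, active: bool):
        previous = self.last_state.get(condition, False)
        if active and not previous:
            self.notify(f"{condition}: triggered")
        elif previous and not active:
            self.notify(f"{condition}: resolved")
        self.last_state[condition] = active

fired = []
alerter = TransitionAlerter(fired.append)
for pressure in (False, True, True, True, False):  # memory pressure over 5 polls
    alerter.observe("memory_pressure", pressure)
print(fired)  # ['memory_pressure: triggered', 'memory_pressure: resolved']
```

Three consecutive polls under memory pressure produce a single notification, plus one more when the pressure clears.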
Real questions from r/LocalLLaMA, answered in one command.
Head-to-head comparison — the #1 question on r/LocalLLaMA.
LLMs running 24/7 for AI agents — track VRAM, thermal, and performance.
tok/s per watt between engines. Critical for 24/7 Mac Mini homelabs.
Did the Ollama or macOS update break your performance? Auto-detection via SQLite.
--context-size 64k benchmarks. Does your model survive 256k context?
Drift detection across benchmark runs. Unique to asiai.
MLPerf/SPEC methodology. Warmup, median, greedy decoding. Share with confidence.
asiai doctor diagnoses system, engines, and database with fix suggestions.
Dark/light web dashboard with live charts, SSE progress, benchmark controls.
Same engine, different models. Which quantization wins?
Expose /metrics, scrape with Prometheus, visualize in Grafana. Production-grade observability.
GPU activity, TCP connections, KV cache — know when your agents are thinking, idle, or overloaded. API-ready for swarm orchestrators.
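Signals like these can feed a simple activity classifier for a swarm orchestrator. A sketch under assumed thresholds (the function name and cutoffs are our invention, not asiai's):

```python
def classify_activity(gpu_percent: float, active_connections: int) -> str:
    """Rough inference-activity heuristic from GPU load and open TCP
    connections. Thresholds are illustrative; tune them for your hardware."""
    if active_connections == 0 and gpu_percent < 10:
        return "idle"
    if gpu_percent > 90 and active_connections > 4:
        return "overloaded"
    return "thinking"

print(classify_activity(2.0, 0))    # idle
print(classify_activity(95.0, 8))   # overloaded
print(classify_activity(45.2, 1))   # thinking
```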
Three commands. That's it.
```shell
$ brew install asiai

$ asiai detect
✔ ollama (11434)
✔ lmstudio (1234)
✔ mlx-lm (8080)
→ 3 engines found

$ asiai bench -m qwen3.5
Engine     tok/s   TTFT
lmstudio   71.2    42ms
ollama     54.8    61ms
mlx-lm     30.1    38ms
```
Numbers from actual benchmarks on Apple Silicon.
MLX is 2.3x faster for MoE architectures (Qwen3.5-35B-A3B) on Apple Silicon.
VRAM stays constant from 64k to 256k context with DeltaNet — not documented anywhere else.
Same model, same Mac: 30 tok/s on one engine, 71 tok/s on another. The engine matters more.
Auto-detected, zero configuration needed.
| Engine | Default Port | API | Format | VRAM |
|---|---|---|---|---|
| Ollama | 11434 | Native | GGUF | ✔ |
| LM Studio | 1234 | OpenAI-compatible | GGUF + MLX | ✔ |
| mlx-lm | 8080 | OpenAI-compatible | MLX | — |
| llama.cpp | 8080 | OpenAI-compatible | GGUF | — |
| oMLX | 8000 | OpenAI-compatible | MLX | — |
| vllm-mlx | 8000 | OpenAI-compatible | MLX | — |
| Exo | 52415 | OpenAI-compatible | MLX | — |
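Detection can start with a TCP probe of each engine's default port from the table above. A minimal stdlib sketch of the idea (an illustration, not asiai's actual detection code):

```python
import socket

# Default ports per engine, from the compatibility table.
DEFAULT_PORTS = {
    "ollama": 11434, "lmstudio": 1234, "mlx-lm": 8080,
    "llama.cpp": 8080, "oMLX": 8000, "vllm-mlx": 8000, "Exo": 52415,
}

def port_open(port: int, host: str = "127.0.0.1", timeout: float = 0.25) -> bool:
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def detect_engines() -> list[str]:
    """Engines whose default port is currently accepting connections."""
    return sorted({name for name, port in DEFAULT_PORTS.items() if port_open(port)})
```

Note that mlx-lm and llama.cpp share port 8080 (and oMLX shares 8000 with vllm-mlx), so a real detector must also inspect the server's API responses to tell them apart, which is presumably why asiai goes beyond a bare port check.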
8 metrics, consistent methodology, every run.
- Generation speed (tokens/sec)
- Time to first token (TTFT)
- GPU power draw (watts)
- Energy efficiency (tok/s per watt)
- Run-to-run variance
- GPU memory footprint
- Throttling state
- Long-context performance scaling
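The warmup/median methodology is easy to reproduce when post-processing your own runs. A sketch under the assumptions that warmup runs are discarded and the median is reported, as described above; the function name and output format are ours:

```python
import statistics

def summarize_runs(toks_per_sec: list[float], watts: list[float], warmup: int = 1):
    """Discard warmup runs, then report median speed, run-to-run variance
    (as % of the median), and energy efficiency in tok/s per watt."""
    speeds = toks_per_sec[warmup:]
    median_tps = statistics.median(speeds)
    variance = statistics.pstdev(speeds) / median_tps * 100
    efficiency = median_tps / statistics.median(watts[warmup:])
    return {"tok_s": median_tps, "variance_pct": round(variance, 1),
            "tok_s_per_watt": round(efficiency, 2)}

# First run is a cold-cache warmup and is dropped from the stats.
print(summarize_runs([58.0, 71.0, 71.4, 70.6], [30.0, 28.0, 28.4, 27.6]))
# {'tok_s': 71.0, 'variance_pct': 0.5, 'tok_s_per_watt': 2.54}
```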
Install in seconds. Zero dependencies.
```shell
brew tap druide67/tap
brew install asiai
# or
pip install asiai
```