NVIDIA Triton Inference Server, ported to Apple Silicon via MLX.
Triton-MLX is a production-grade inference server that runs large language models natively on Apple Silicon using the MLX framework. It exposes a fully OpenAI-compatible API, so any OpenAI SDK client can talk to models served locally on your Mac.
- OpenAI-compatible API — Drop-in replacement for OpenAI endpoints
  - `/v1/chat/completions` (streaming + non-streaming), `/v1/completions`, `/v1/embeddings`, `/v1/models`
- Native Apple Silicon — MLX Metal GPU acceleration, unified memory (zero-copy)
- Any HuggingFace model — Load models from mlx-community or convert your own
- Streaming — Server-Sent Events (SSE) with real-time token delivery
- Prometheus metrics — Request latency, token throughput, memory usage
- Health checks — `/v1/health/live` and `/v1/health/ready` for orchestration
- Hot-reload — Poll-based model repository watching for new models
- Concurrency control — Semaphore-based request limiting to prevent OOM
- API key auth — Optional Bearer token authentication (`--api-key` or env var)
- Structured logging — Human-readable or JSON format (`--json-logs`)
- Request middleware — Request IDs, logging, OpenAI-compatible error handling
- Model warm-up — Automatic GPU warm-up on model load for fast first token
- Prompt caching — LRU cache for tokenized prompts
```bash
# Clone the repo
git clone https://github.com/RobotFlow-Labs/triton-mlx.git
cd triton-mlx

# Create environment (requires Python 3.11+ and Apple Silicon Mac)
uv venv .venv --python 3.12
uv pip install -e ".[dev]"

# Download a small 4-bit quantized model (~680MB)
python -c "
from huggingface_hub import snapshot_download
snapshot_download(
    'mlx-community/Llama-3.2-1B-Instruct-4bit',
    local_dir='model_repository/Llama-3.2-1B-Instruct-4bit'
)
"

# Start the server
triton-mlx --model-repository ./model_repository
```

The server starts on http://localhost:9000 by default.
```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:9000/v1",
    api_key="not-needed",  # No API key required
)

# Chat completion
response = client.chat.completions.create(
    model="Llama-3.2-1B-Instruct-4bit",
    messages=[{"role": "user", "content": "What is the meaning of life?"}],
    max_tokens=100,
)
print(response.choices[0].message.content)

# Streaming
stream = client.chat.completions.create(
    model="Llama-3.2-1B-Instruct-4bit",
    messages=[{"role": "user", "content": "Tell me a joke"}],
    stream=True,
)
for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
```

Or use curl:
```bash
# Chat completion
curl http://localhost:9000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Llama-3.2-1B-Instruct-4bit",
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 50
  }'

# List models
curl http://localhost:9000/v1/models

# Health check
curl http://localhost:9000/v1/health/ready
```

Place HuggingFace model directories in your model repository folder. Each model needs a `config.json` file:
```
model_repository/
├── Llama-3.2-1B-Instruct-4bit/
│   ├── config.json
│   ├── tokenizer.json
│   ├── tokenizer_config.json
│   └── model.safetensors
├── Phi-3-mini-4k-instruct-4bit/
│   ├── config.json
│   ├── tokenizer.json
│   └── model.safetensors
```
Any model supported by mlx-lm works, including:
- Llama (3.2, 3.1, 3, 2)
- Mistral / Mixtral
- Phi (3, 3.5)
- Qwen (2, 2.5)
- Gemma (2)
- Cohere Command R
- StarCoder / CodeLlama
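Before serving a model, you can sanity-check it outside the server by loading the directory directly with mlx-lm's Python API (a minimal sketch; the path assumes the model downloaded in the quick start above):

```python
from mlx_lm import load, generate

# Load a model directory from the repository and run a short test generation.
model, tokenizer = load("model_repository/Llama-3.2-1B-Instruct-4bit")
print(generate(model, tokenizer, prompt="Hello, world!", max_tokens=32))
```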
```bash
# Convert any HuggingFace model to MLX format
python -m mlx_lm.convert \
  --hf-path meta-llama/Llama-3.2-1B-Instruct \
  --mlx-path model_repository/Llama-3.2-1B-Instruct-4bit \
  -q  # 4-bit quantization
```

```bash
triton-mlx --help
```

```
Options:
  --model-repository PATH        Model repository directory (required)
  --host HOST                    Bind address (default: 0.0.0.0)
  --openai-port PORT             OpenAI API port (default: 9000)
  --http-port PORT               HTTP API port (default: 8000)
  --log-level LEVEL              DEBUG, INFO, WARNING, ERROR (default: INFO)
  --max-memory-gb GB             Max Metal memory to use
  --default-max-tokens N         Default max tokens (default: 256)
  --chat-template PATH           Custom Jinja2 chat template
  --model-control-mode MODE      none, poll, explicit (default: poll)
  --repository-poll-secs N       Poll interval for new models (default: 15)
  --max-concurrent-requests N    Max concurrent inference requests (default: 4)
  --api-key KEY                  API key for authentication (or TRITON_MLX_API_KEY env)
  --json-logs                    Enable JSON structured logging
```
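As an example of the auth options, when the server is started with `--api-key` (or the `TRITON_MLX_API_KEY` environment variable), clients must present that key as a Bearer token. With the OpenAI SDK this just means passing the key as `api_key`; a minimal sketch with a placeholder key:

```python
from openai import OpenAI

# The OpenAI SDK sends api_key as an "Authorization: Bearer <key>" header,
# which is what the server's optional API key auth checks.
client = OpenAI(
    base_url="http://localhost:9000/v1",
    api_key="my-secret-key",  # placeholder; must match the server's --api-key
)

print([m.id for m in client.models.list().data])
```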
Add a `triton_mlx.json` alongside `config.json` for per-model settings:
```json
{
  "display_name": "My Custom Model",
  "max_batch_size": 4,
  "max_tokens": 4096,
  "default_temperature": 0.7
}
```

POST /v1/chat/completions
| Parameter | Type | Default | Description |
|---|---|---|---|
| model | string | required | Model name |
| messages | array | required | Chat messages |
| temperature | float | 1.0 | Sampling temperature (0-2) |
| top_p | float | 1.0 | Nucleus sampling (0-1) |
| max_tokens | int | 256 | Max tokens to generate |
| stop | string/array | null | Stop sequences |
| stream | bool | false | Enable SSE streaming |
| frequency_penalty | float | 0.0 | Frequency penalty (-2 to 2) |
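As a quick illustration of these parameters together (a sketch reusing the `client` from the Python example above):

```python
# Low-temperature, bounded generation with a stop sequence and a mild
# frequency penalty, using the parameters from the table above.
response = client.chat.completions.create(
    model="Llama-3.2-1B-Instruct-4bit",
    messages=[{"role": "user", "content": "List three colors, one per line."}],
    temperature=0.2,
    top_p=0.9,
    max_tokens=64,
    stop=["\n\n"],
    frequency_penalty=0.5,
)
print(response.choices[0].message.content)
```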
POST /v1/completions
Same parameters as chat, but uses `prompt` (string) instead of `messages`.
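For instance (again assuming the `client` created earlier):

```python
# Plain text completion: a raw prompt string instead of a chat message list.
response = client.completions.create(
    model="Llama-3.2-1B-Instruct-4bit",
    prompt="The capital of France is",
    max_tokens=16,
)
print(response.choices[0].text)
```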
POST /v1/embeddings
| Parameter | Type | Description |
|---|---|---|
| model | string | Model name |
| input | string/array | Text to embed |
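For example (a sketch; the model name is just the one from the quick start, and a dedicated embedding model may give more useful vectors):

```python
# Embed a batch of strings and inspect the returned dimensionality.
result = client.embeddings.create(
    model="Llama-3.2-1B-Instruct-4bit",
    input=["MLX runs natively on Apple Silicon.", "Triton serves models."],
)
print(len(result.data), "embeddings of dim", len(result.data[0].embedding))
```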
```
GET /v1/models                 # List all models
GET /v1/models/{model_name}    # Get model details
GET /v1/health/live            # Always 200 if server running
GET /v1/health/ready           # 200 if models loaded, 503 otherwise
GET /v1/metrics                # Prometheus format
```
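The readiness endpoint is what orchestration (or a simple launch script) should gate on before sending traffic. A small polling sketch using only the standard library (`wait_until_ready` is a hypothetical helper, not part of the server):

```python
import time
import urllib.error
import urllib.request

def wait_until_ready(url="http://localhost:9000/v1/health/ready", timeout_s=60):
    """Poll the readiness endpoint until it returns 200 or the timeout expires."""
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        try:
            with urllib.request.urlopen(url) as resp:
                if resp.status == 200:  # models are loaded
                    return True
        except (urllib.error.URLError, OSError):
            pass  # server not up yet, or /ready returned 503
        time.sleep(1)
    return False
```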
```
┌─────────────────────────────────────────┐
│            OpenAI SDK / curl            │
└──────────────┬──────────────────────────┘
               │ HTTP
┌──────────────▼──────────────────────────┐
│           FastAPI Application           │
│   /v1/chat/completions                  │
│   /v1/completions                       │
│   /v1/embeddings                        │
│   /v1/models, /v1/health, /v1/metrics   │
└──────────────┬──────────────────────────┘
               │
┌──────────────▼──────────────────────────┐
│           MLX Engine Adapter            │
│   Chat template • Parameter mapping     │
│   Request routing • Response formatting │
└──────────────┬──────────────────────────┘
               │
┌──────────────▼──────────────────────────┐
│          MLX Inference Runtime          │
│   mlx_lm.stream_generate()              │
│   Sampler (temp, top_p, penalties)      │
│   Stop sequence detection               │
└──────────────┬──────────────────────────┘
               │
┌──────────────▼──────────────────────────┐
│         Apple Silicon Metal GPU         │
│   Unified memory • Zero-copy inference  │
└─────────────────────────────────────────┘
```
```bash
# Run tests (83 tests, <4s)
.venv/bin/pytest tests/ -q

# Unit + integration only (no model needed)
.venv/bin/pytest tests/unit tests/integration -q

# Functional tests (requires model in examples/model_repository/)
.venv/bin/pytest tests/functional -q

# Lint
.venv/bin/ruff check src/ tests/

# Start in debug mode
triton-mlx --model-repository ./model_repository --log-level DEBUG
```

This project is a port of NVIDIA Triton Inference Server (v2.67.0dev). The upstream C++ server is tracked at `upstream/main`.
Key differences from upstream:
- Python-first — Pure Python server (no C++ compilation required)
- MLX backend — Replaces CUDA/TensorRT with Apple MLX Metal
- Unified memory — No CPU↔GPU data copies (Apple Silicon advantage)
- mlx-lm — Leverages the MLX LLM ecosystem for model loading
Apache 2.0 (same as upstream Triton)