NVIDIA Triton Inference Server, ported to Apple Silicon via MLX.
Triton-MLX is a production-grade inference server that runs large language models natively on Apple Silicon using the MLX framework. It exposes a fully OpenAI-compatible API, so any OpenAI SDK client can talk to models served locally on your Mac.
- OpenAI-compatible API — Drop-in replacement for OpenAI endpoints
  - `/v1/chat/completions` (streaming + non-streaming), `/v1/completions`, `/v1/embeddings`, `/v1/models`
- Native Apple Silicon — MLX Metal GPU acceleration, unified memory (zero-copy)
- Any HuggingFace model — Load models from mlx-community or convert your own
- Streaming — Server-Sent Events (SSE) with real-time token delivery
- Prometheus metrics — Request latency, token throughput, memory usage
- Health checks — `/v1/health/live` and `/v1/health/ready` for orchestration
- Hot-reload — Poll-based model repository watching for new models
- Concurrency control — Semaphore-based request limiting to prevent OOM
- API key auth — Optional Bearer token authentication (`--api-key` or env var)
- Structured logging — Human-readable or JSON format (`--json-logs`)
- Request middleware — Request IDs, logging, OpenAI-compatible error handling
- Model warm-up — Automatic GPU warm-up on model load for fast first token
- Prompt caching — LRU cache for tokenized prompts
```bash
# Clone the repo
git clone https://github.com/RobotFlow-Labs/triton-mlx.git
cd triton-mlx

# Create environment (requires Python 3.11+ and Apple Silicon Mac)
uv venv .venv --python 3.12
uv pip install -e ".[dev]"

# Download a small 4-bit quantized model (~680MB)
python -c "
from huggingface_hub import snapshot_download
snapshot_download(
    'mlx-community/Llama-3.2-1B-Instruct-4bit',
    local_dir='model_repository/Llama-3.2-1B-Instruct-4bit'
)
"

# Start the server
triton-mlx --model-repository ./model_repository
```

The server starts on http://localhost:9000 by default.
```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:9000/v1",
    api_key="not-needed",  # No API key required
)

# Chat completion
response = client.chat.completions.create(
    model="Llama-3.2-1B-Instruct-4bit",
    messages=[{"role": "user", "content": "What is the meaning of life?"}],
    max_tokens=100,
)
print(response.choices[0].message.content)

# Streaming
stream = client.chat.completions.create(
    model="Llama-3.2-1B-Instruct-4bit",
    messages=[{"role": "user", "content": "Tell me a joke"}],
    stream=True,
)
for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
```

Or use curl:
```bash
# Chat completion
curl http://localhost:9000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Llama-3.2-1B-Instruct-4bit",
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 50
  }'

# List models
curl http://localhost:9000/v1/models

# Health check
curl http://localhost:9000/v1/health/ready
```

Place HuggingFace model directories in your model repository folder. Each model needs a `config.json` file:
```
model_repository/
├── Llama-3.2-1B-Instruct-4bit/
│   ├── config.json
│   ├── tokenizer.json
│   ├── tokenizer_config.json
│   └── model.safetensors
├── Phi-3-mini-4k-instruct-4bit/
│   ├── config.json
│   ├── tokenizer.json
│   └── model.safetensors
```
Any model supported by mlx-lm works, including:
- Llama (3.2, 3.1, 3, 2)
- Mistral / Mixtral
- Phi (3, 3.5)
- Qwen (2, 2.5)
- Gemma (2)
- Cohere Command R
- StarCoder / CodeLlama
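Before serving a model, you can sanity-check it outside the server by loading the directory directly with mlx-lm's Python API (a minimal sketch; the path assumes the model downloaded in the quick start above):

```python
from mlx_lm import load, generate

# Load a model directory from the repository and run a short test generation.
model, tokenizer = load("model_repository/Llama-3.2-1B-Instruct-4bit")
print(generate(model, tokenizer, prompt="Hello, world!", max_tokens=32))
```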
```bash
# Convert any HuggingFace model to MLX format
python -m mlx_lm.convert \
  --hf-path meta-llama/Llama-3.2-1B-Instruct \
  --mlx-path model_repository/Llama-3.2-1B-Instruct-4bit \
  -q  # 4-bit quantization
```

```bash
triton-mlx --help
```

```
Options:
  --model-repository PATH        Model repository directory (required)
  --host HOST                    Bind address (default: 0.0.0.0)
  --openai-port PORT             OpenAI API port (default: 9000)
  --http-port PORT               HTTP API port (default: 8000)
  --log-level LEVEL              DEBUG, INFO, WARNING, ERROR (default: INFO)
  --max-memory-gb GB             Max Metal memory to use
  --default-max-tokens N         Default max tokens (default: 256)
  --chat-template PATH           Custom Jinja2 chat template
  --model-control-mode MODE      none, poll, explicit (default: poll)
  --repository-poll-secs N       Poll interval for new models (default: 15)
  --max-concurrent-requests N    Max concurrent inference requests (default: 4)
  --api-key KEY                  API key for authentication (or TRITON_MLX_API_KEY env)
  --json-logs                    Enable JSON structured logging
```
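As an example of the auth options, when the server is started with `--api-key` (or the `TRITON_MLX_API_KEY` environment variable), clients must present that key as a Bearer token. With the OpenAI SDK this just means passing the key as `api_key`; a minimal sketch with a placeholder key:

```python
from openai import OpenAI

# The OpenAI SDK sends api_key as an "Authorization: Bearer <key>" header,
# which is what the server's optional API key auth checks.
client = OpenAI(
    base_url="http://localhost:9000/v1",
    api_key="my-secret-key",  # placeholder; must match the server's --api-key
)

print([m.id for m in client.models.list().data])
```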
Add a `triton_mlx.json` alongside `config.json` for per-model settings:
```json
{
  "display_name": "My Custom Model",
  "max_batch_size": 4,
  "max_tokens": 4096,
  "default_temperature": 0.7
}
```

POST /v1/chat/completions
| Parameter | Type | Default | Description |
|---|---|---|---|
| model | string | required | Model name |
| messages | array | required | Chat messages |
| temperature | float | 1.0 | Sampling temperature (0-2) |
| top_p | float | 1.0 | Nucleus sampling (0-1) |
| max_tokens | int | 256 | Max tokens to generate |
| stop | string/array | null | Stop sequences |
| stream | bool | false | Enable SSE streaming |
| frequency_penalty | float | 0.0 | Frequency penalty (-2 to 2) |
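As a quick illustration of these parameters together (a sketch reusing the `client` from the Python example above):

```python
# Low-temperature, bounded generation with a stop sequence and a mild
# frequency penalty, using the parameters from the table above.
response = client.chat.completions.create(
    model="Llama-3.2-1B-Instruct-4bit",
    messages=[{"role": "user", "content": "List three colors, one per line."}],
    temperature=0.2,
    top_p=0.9,
    max_tokens=64,
    stop=["\n\n"],
    frequency_penalty=0.5,
)
print(response.choices[0].message.content)
```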
POST /v1/completions
Same parameters as chat, but uses `prompt` (string) instead of `messages`.
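For instance (again assuming the `client` created earlier):

```python
# Plain text completion: a raw prompt string instead of a chat message list.
response = client.completions.create(
    model="Llama-3.2-1B-Instruct-4bit",
    prompt="The capital of France is",
    max_tokens=16,
)
print(response.choices[0].text)
```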
POST /v1/embeddings
| Parameter | Type | Description |
|---|---|---|
| model | string | Model name |
| input | string/array | Text to embed |
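For example (a sketch; the model name is just the one from the quick start, and a dedicated embedding model may give more useful vectors):

```python
# Embed a batch of strings and inspect the returned dimensionality.
result = client.embeddings.create(
    model="Llama-3.2-1B-Instruct-4bit",
    input=["MLX runs natively on Apple Silicon.", "Triton serves models."],
)
print(len(result.data), "embeddings of dim", len(result.data[0].embedding))
```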
```
GET /v1/models                 # List all models
GET /v1/models/{model_name}    # Get model details
GET /v1/health/live            # Always 200 if server running
GET /v1/health/ready           # 200 if models loaded, 503 otherwise
GET /v1/metrics                # Prometheus format
```
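The readiness endpoint is what orchestration (or a simple launch script) should gate on before sending traffic. A small polling sketch using only the standard library (`wait_until_ready` is a hypothetical helper, not part of the server):

```python
import time
import urllib.error
import urllib.request

def wait_until_ready(url="http://localhost:9000/v1/health/ready", timeout_s=60):
    """Poll the readiness endpoint until it returns 200 or the timeout expires."""
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        try:
            with urllib.request.urlopen(url) as resp:
                if resp.status == 200:  # models are loaded
                    return True
        except (urllib.error.URLError, OSError):
            pass  # server not up yet, or /ready returned 503
        time.sleep(1)
    return False
```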
```
┌─────────────────────────────────────────┐
│            OpenAI SDK / curl            │
└──────────────┬──────────────────────────┘
               │ HTTP
┌──────────────▼──────────────────────────┐
│           FastAPI Application           │
│   /v1/chat/completions                  │
│   /v1/completions                       │
│   /v1/embeddings                        │
│   /v1/models, /v1/health, /v1/metrics   │
└──────────────┬──────────────────────────┘
               │
┌──────────────▼──────────────────────────┐
│           MLX Engine Adapter            │
│   Chat template • Parameter mapping     │
│   Request routing • Response formatting │
└──────────────┬──────────────────────────┘
               │
┌──────────────▼──────────────────────────┐
│          MLX Inference Runtime          │
│   mlx_lm.stream_generate()              │
│   Sampler (temp, top_p, penalties)      │
│   Stop sequence detection               │
└──────────────┬──────────────────────────┘
               │
┌──────────────▼──────────────────────────┐
│         Apple Silicon Metal GPU         │
│   Unified memory • Zero-copy inference  │
└─────────────────────────────────────────┘
```
```bash
# Run tests (83 tests, <4s)
.venv/bin/pytest tests/ -q

# Unit + integration only (no model needed)
.venv/bin/pytest tests/unit tests/integration -q

# Functional tests (requires model in examples/model_repository/)
.venv/bin/pytest tests/functional -q

# Lint
.venv/bin/ruff check src/ tests/

# Start in debug mode
triton-mlx --model-repository ./model_repository --log-level DEBUG
```

This project is a port of NVIDIA Triton Inference Server (v2.67.0dev). The upstream C++ server is tracked at `upstream/main`.
Key differences from upstream:
- Python-first — Pure Python server (no C++ compilation required)
- MLX backend — Replaces CUDA/TensorRT with Apple MLX Metal
- Unified memory — No CPU↔GPU data copies (Apple Silicon advantage)
- mlx-lm — Leverages the MLX LLM ecosystem for model loading
Apache 2.0 (same as upstream Triton)