Triton-MLX

NVIDIA Triton Inference Server, ported to Apple Silicon via MLX.

Triton-MLX is a production-grade inference server that runs large language models natively on Apple Silicon using the MLX framework. It exposes a fully OpenAI-compatible API, so any OpenAI SDK client can talk to models served locally on your Mac.

Features

  • OpenAI-compatible API — Drop-in replacement for OpenAI endpoints
    • /v1/chat/completions (streaming + non-streaming)
    • /v1/completions
    • /v1/embeddings
    • /v1/models
  • Native Apple Silicon — MLX Metal GPU acceleration, unified memory (zero-copy)
  • Any HuggingFace model — Load models from mlx-community or convert your own
  • Streaming — Server-Sent Events (SSE) with real-time token delivery
  • Prometheus metrics — Request latency, token throughput, memory usage
  • Health checks — /v1/health/live and /v1/health/ready for orchestration
  • Hot-reload — Poll-based model repository watching for new models
  • Concurrency control — Semaphore-based request limiting to prevent OOM
  • API key auth — Optional Bearer token authentication (--api-key or env var)
  • Structured logging — Human-readable or JSON format (--json-logs)
  • Request middleware — Request IDs, logging, OpenAI-compatible error handling
  • Model warm-up — Automatic GPU warm-up on model load for fast first token
  • Prompt caching — LRU cache for tokenized prompts

Quickstart

Install

# Clone the repo
git clone https://github.com/RobotFlow-Labs/triton-mlx.git
cd triton-mlx

# Create environment (requires Python 3.11+ and Apple Silicon Mac)
uv venv .venv --python 3.12
uv pip install -e ".[dev]"

Download a model

# Download a small 4-bit quantized model (~680MB)
python -c "
from huggingface_hub import snapshot_download
snapshot_download(
    'mlx-community/Llama-3.2-1B-Instruct-4bit',
    local_dir='model_repository/Llama-3.2-1B-Instruct-4bit'
)
"

Start the server

triton-mlx --model-repository ./model_repository

The server starts on http://localhost:9000 by default.

Use it

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:9000/v1",
    api_key="not-needed",  # No API key required
)

# Chat completion
response = client.chat.completions.create(
    model="Llama-3.2-1B-Instruct-4bit",
    messages=[{"role": "user", "content": "What is the meaning of life?"}],
    max_tokens=100,
)
print(response.choices[0].message.content)

# Streaming
stream = client.chat.completions.create(
    model="Llama-3.2-1B-Instruct-4bit",
    messages=[{"role": "user", "content": "Tell me a joke"}],
    stream=True,
)
for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)

Or use curl:

# Chat completion
curl http://localhost:9000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Llama-3.2-1B-Instruct-4bit",
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 50
  }'

# List models
curl http://localhost:9000/v1/models

# Health check
curl http://localhost:9000/v1/health/ready

Model Repository

Place HuggingFace model directories in your model repository folder. Each model needs a config.json file:

model_repository/
├── Llama-3.2-1B-Instruct-4bit/
│   ├── config.json
│   ├── tokenizer.json
│   ├── tokenizer_config.json
│   └── model.safetensors
├── Phi-3-mini-4k-instruct-4bit/
│   ├── config.json
│   ├── tokenizer.json
│   └── model.safetensors
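
Before pointing the server at a repository, you can sanity-check the layout with a few lines of Python. This is a minimal sketch, not the server's actual validation logic; the required-file set below is an assumption based on the tree above.

from pathlib import Path

def find_models(repo: Path) -> list[str]:
    """Return subdirectories that contain the assumed minimum files for an MLX model."""
    required = {"config.json", "model.safetensors"}  # assumed minimum set
    return sorted(
        entry.name
        for entry in repo.iterdir()
        if entry.is_dir() and required.issubset(f.name for f in entry.iterdir())
    )

print(find_models(Path("model_repository")))  # e.g. ['Llama-3.2-1B-Instruct-4bit']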

Supported models

Any model supported by mlx-lm works, including:

  • Llama (3.2, 3.1, 3, 2)
  • Mistral / Mixtral
  • Phi (3, 3.5)
  • Qwen (2, 2.5)
  • Gemma (2)
  • Cohere Command R
  • StarCoder / CodeLlama

Converting models

# Convert any HuggingFace model to MLX format
python -m mlx_lm.convert \
  --hf-path meta-llama/Llama-3.2-1B-Instruct \
  --mlx-path model_repository/Llama-3.2-1B-Instruct-4bit \
  -q  # 4-bit quantization

Configuration

CLI Options

triton-mlx --help

Options:
  --model-repository PATH     Model repository directory (required)
  --host HOST                 Bind address (default: 0.0.0.0)
  --openai-port PORT          OpenAI API port (default: 9000)
  --http-port PORT            HTTP API port (default: 8000)
  --log-level LEVEL           DEBUG, INFO, WARNING, ERROR (default: INFO)
  --max-memory-gb GB          Max Metal memory to use
  --default-max-tokens N      Default max tokens (default: 256)
  --chat-template PATH        Custom Jinja2 chat template
  --model-control-mode MODE   none, poll, explicit (default: poll)
  --repository-poll-secs N    Poll interval for new models (default: 15)
  --max-concurrent-requests N Max concurrent inference requests (default: 4)
  --api-key KEY               API key for authentication (or TRITON_MLX_API_KEY env)
  --json-logs                 Enable JSON structured logging

Optional model config

Add a triton_mlx.json alongside config.json for per-model settings:

{
  "display_name": "My Custom Model",
  "max_batch_size": 4,
  "max_tokens": 4096,
  "default_temperature": 0.7
}
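
Since the file is plain JSON, it is easy to generate from a script. A small sketch using only the keys documented above (the model path is illustrative):

import json
from pathlib import Path

config = {
    "display_name": "My Custom Model",
    "max_batch_size": 4,
    "max_tokens": 4096,
    "default_temperature": 0.7,
}
model_dir = Path("model_repository/Llama-3.2-1B-Instruct-4bit")
(model_dir / "triton_mlx.json").write_text(json.dumps(config, indent=2))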

API Reference

Chat Completions

POST /v1/chat/completions
Parameter          Type          Default   Description
model              string        required  Model name
messages           array         required  Chat messages
temperature        float         1.0       Sampling temperature (0-2)
top_p              float         1.0       Nucleus sampling (0-1)
max_tokens         int           256       Max tokens to generate
stop               string/array  null      Stop sequences
stream             bool          false     Enable SSE streaming
frequency_penalty  float         0.0       Frequency penalty (-2 to 2)
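
These parameters map one-to-one onto the OpenAI SDK. A sketch with explicit sampling controls (same local server and model as in the Quickstart):

from openai import OpenAI

client = OpenAI(base_url="http://localhost:9000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="Llama-3.2-1B-Instruct-4bit",
    messages=[{"role": "user", "content": "Name three prime numbers."}],
    temperature=0.2,  # low temperature for more deterministic output
    top_p=0.9,        # nucleus sampling cutoff
    max_tokens=64,
    stop=["\n\n"],    # cut generation at the first blank line
)
print(response.choices[0].message.content)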

Text Completions

POST /v1/completions

Same parameters as chat, but uses prompt (string) instead of messages.
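
For example, via the OpenAI SDK (note prompt in place of messages):

from openai import OpenAI

client = OpenAI(base_url="http://localhost:9000/v1", api_key="not-needed")

response = client.completions.create(
    model="Llama-3.2-1B-Instruct-4bit",
    prompt="The capital of France is",
    max_tokens=16,
)
print(response.choices[0].text)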

Embeddings

POST /v1/embeddings
Parameter  Type          Description
model      string        Model name
input      string/array  Text to embed
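
A sketch via the OpenAI SDK (the model name here is the Quickstart placeholder; substitute a model that actually produces embeddings):

from openai import OpenAI

client = OpenAI(base_url="http://localhost:9000/v1", api_key="not-needed")

response = client.embeddings.create(
    model="Llama-3.2-1B-Instruct-4bit",
    input=["hello world", "goodbye world"],
)
print(len(response.data), len(response.data[0].embedding))  # batch size, vector dim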

Models

GET /v1/models              # List all models
GET /v1/models/{model_name} # Get model details

Health

GET /v1/health/live    # Always 200 if server running
GET /v1/health/ready   # 200 if models loaded, 503 otherwise
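
The readiness endpoint is handy as a startup gate in scripts and CI. A minimal stdlib-only sketch, assuming the default port:

import time
import urllib.error
import urllib.request

def wait_until_ready(url="http://localhost:9000/v1/health/ready", timeout=60.0):
    """Poll the readiness endpoint until it returns 200 or the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.HTTPError, urllib.error.URLError):
            pass  # connection refused, or 503 while models are still loading
        time.sleep(1.0)
    return False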

Metrics

GET /v1/metrics        # Prometheus format
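
The exposition format is plain text, so you can inspect it without a Prometheus server. A sketch that prints the exported metric names (the specific metrics are not enumerated here):

import urllib.request

with urllib.request.urlopen("http://localhost:9000/v1/metrics") as resp:
    for line in resp.read().decode().splitlines():
        if line.startswith("# HELP"):  # one HELP line per exported metric
            print(line)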

Architecture

┌─────────────────────────────────────────┐
│           OpenAI SDK / curl              │
└──────────────┬──────────────────────────┘
               │ HTTP
┌──────────────▼──────────────────────────┐
│         FastAPI Application              │
│  /v1/chat/completions                    │
│  /v1/completions                         │
│  /v1/embeddings                          │
│  /v1/models, /v1/health, /v1/metrics     │
└──────────────┬──────────────────────────┘
               │
┌──────────────▼──────────────────────────┐
│          MLX Engine Adapter              │
│  Chat template • Parameter mapping       │
│  Request routing • Response formatting   │
└──────────────┬──────────────────────────┘
               │
┌──────────────▼──────────────────────────┐
│        MLX Inference Runtime             │
│  mlx_lm.stream_generate()                │
│  Sampler (temp, top_p, penalties)        │
│  Stop sequence detection                 │
└──────────────┬──────────────────────────┘
               │
┌──────────────▼──────────────────────────┐
│       Apple Silicon Metal GPU            │
│  Unified memory • Zero-copy inference    │
└─────────────────────────────────────────┘
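
The bottom two layers are what mlx-lm exposes directly. For a feel of what the runtime does, here is a rough standalone equivalent; the exact stream_generate signature has shifted across mlx-lm releases, so treat this as a sketch:

from mlx_lm import load, stream_generate

# Load weights + tokenizer straight from the repository, bypassing the server
model, tokenizer = load("model_repository/Llama-3.2-1B-Instruct-4bit")

messages = [{"role": "user", "content": "Hello!"}]
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True)

for response in stream_generate(model, tokenizer, prompt, max_tokens=50):
    print(response.text, end="", flush=True)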

Development

# Run tests (83 tests, <4s)
.venv/bin/pytest tests/ -q

# Unit + integration only (no model needed)
.venv/bin/pytest tests/unit tests/integration -q

# Functional tests (requires model in examples/model_repository/)
.venv/bin/pytest tests/functional -q

# Lint
.venv/bin/ruff check src/ tests/

# Start in debug mode
triton-mlx --model-repository ./model_repository --log-level DEBUG

Upstream

This project is a port of NVIDIA Triton Inference Server (v2.67.0dev). The upstream C++ server is tracked at upstream/main.

Key differences from upstream:

  • Python-first — Pure Python server (no C++ compilation required)
  • MLX backend — Replaces CUDA/TensorRT with Apple MLX Metal
  • Unified memory — No CPU↔GPU data copies (Apple Silicon advantage)
  • mlx-lm — Leverages the MLX LLM ecosystem for model loading

License

Apache 2.0 (same as upstream Triton)
