A pure Python implementation of Mini-SGLang using Cute-DSL.
```shell
prek install
uv venv
source .venv/bin/activate
uv pip install modal==1.3.5
modal setup
```

For the shell and server clients, you can specify the following environment variables:

- `NNODES`: number of nodes (1..4)
- `N_GPU`: number of GPUs per node (1..8)
- `GPU_TYPE`: GPU type
- `RDMA`: whether to use RDMA (0 or 1)
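As a loose illustration of how these variables combine, here is a minimal sketch of reading and validating them in Python. The `read_config` helper and its defaults are assumptions for illustration, not the project's actual parsing code:

```python
import os

# Hypothetical sketch of env-var handling; the variable names match the
# README, but the helper and its defaults are assumptions.
def read_config(env=os.environ):
    nnodes = int(env.get("NNODES", "1"))    # assumed default: 1
    n_gpu = int(env.get("N_GPU", "1"))      # assumed default: 1
    gpu_type = env.get("GPU_TYPE", "h200")  # assumed default: h200
    rdma = env.get("RDMA", "0") == "1"
    assert 1 <= nnodes <= 4, "NNODES must be in 1..4"
    assert 1 <= n_gpu <= 8, "N_GPU must be in 1..8"
    return nnodes, n_gpu, gpu_type, rdma

print(read_config({"NNODES": "2", "N_GPU": "8", "GPU_TYPE": "h100", "RDMA": "1"}))
# → (2, 8, 'h100', True)
```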
For multi-node deployment on Hopper and Blackwell chips:
- Your Modal workspace must have RDMA support.
- You must pass `--rdma` to the commands below.
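For example, a two-node RDMA launch of the shell client might look like the following sketch; the node and GPU counts are arbitrary examples, and the exact flag placement is an assumption:

```shell
# Hypothetical example: 2 nodes x 8 GPUs with RDMA enabled.
NNODES=2 N_GPU=8 GPU_TYPE=h100 RDMA=1 modal run -i -m llmeng.shell --rdma
```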
Run an interactive shell client:

```shell
modal run -i -m llmeng.shell
```

Serve an OpenAI-compatible API server:

```shell
modal serve llmeng/server.py
```
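Once the server is up, any OpenAI-compatible client can talk to it. Below is a minimal standard-library sketch; the base URL is a placeholder (use the URL that `modal serve` prints) and the model name is an assumption:

```python
import json
import urllib.request

# Placeholder: substitute the URL printed by `modal serve`.
BASE_URL = "https://your-workspace--example.modal.run"

def build_chat_request(model: str, prompt: str) -> dict:
    # Standard OpenAI-style /v1/chat/completions payload.
    return {"model": model, "messages": [{"role": "user", "content": prompt}]}

def chat(model: str, prompt: str) -> str:
    req = urllib.request.Request(
        BASE_URL + "/v1/chat/completions",
        data=json.dumps(build_chat_request(model, prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]

# Show the payload shape only (no network call); the model name is a stand-in.
print(build_chat_request("qwen", "Hello!"))
```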
Run offline benchmarks:

```shell
modal run benchmark/offline/bench.py
modal run benchmark/offline/bench_wildchat.py
```

Run online benchmarks:
- Deploy the server:

  ```shell
  N_GPU=4 GPU_TYPE=h200 modal deploy llmeng/app.py
  ```

- Run the benchmarks:

  ```shell
  modal run benchmark/online/bench_qwen.py
  modal run benchmark/online/bench_simple.py
  ```

Roadmap:

- port mini-sglang to Modal
- replace nccl with penny
- rewrite C++/CUDA/Triton in Cute-DSL
- add speculative speculative decoding (SSD)
References:

- mini-sglang
- Cute-DSL, Simon Veitner's blog posts: 1
- Penny, worklogs 1, 2, 3
- SSD, paper
- Tristan Hume's blog post on profiling