llm-engine

A pure Python implementation of Mini-SGLang using Cute-DSL.

Development

Installation

prek install

Setup

uv venv
source .venv/bin/activate
uv pip install modal==1.3.5
modal setup

Commands

For the shell and server clients, you can specify the following environment variables:

  • NNODES: number of nodes (1..4)
  • N_GPU: number of GPUs per node (1..8)
  • GPU_TYPE: GPU type
  • RDMA: whether to use RDMA (0 or 1)

For multi-node deployment on Hopper and Blackwell chips:

  1. Your Modal workspace must have RDMA support.
  2. You must pass --rdma to the commands below.
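The accepted ranges above can be expressed as a small validation helper. This is a hypothetical sketch for illustration, not code from the repository; the `h200` default for `GPU_TYPE` is an assumption.

```python
import os

# Accepted ranges taken from the variable list above; the helper itself
# is a hypothetical sketch, not part of this repository.
LIMITS = {"NNODES": (1, 4), "N_GPU": (1, 8), "RDMA": (0, 1)}

def read_cluster_env() -> dict:
    """Read and range-check the deployment environment variables."""
    cfg = {"GPU_TYPE": os.environ.get("GPU_TYPE", "h200")}  # default is an assumption
    for name, (low, high) in LIMITS.items():
        value = int(os.environ.get(name, str(low)))
        if not low <= value <= high:
            raise ValueError(f"{name}={value} is outside [{low}, {high}]")
        cfg[name] = value
    return cfg
```

For example, `NNODES=2 N_GPU=8 GPU_TYPE=h100 modal run -i -m llmeng.shell` would pass these checks.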

Run an interactive shell client:

modal run -i -m llmeng.shell

Serve an OpenAI-compatible API server:

modal serve llmeng/server.py
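Because the server speaks the OpenAI API, any OpenAI-compatible client can talk to it. Below is a minimal stdlib-only sketch of a request to the standard `/v1/chat/completions` endpoint; the base URL and model id are placeholders (substitute the URL that `modal serve` prints), not values taken from this repository.

```python
import json
import urllib.request

# Placeholder URL and model id; replace with the URL printed by `modal serve`
# and the model actually being served.
BASE_URL = "https://example--llmeng-server.modal.run"
payload = {
    "model": "placeholder-model",
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 64,
}
request = urllib.request.Request(
    f"{BASE_URL}/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
# urllib.request.urlopen(request) would send it; omitted here since no
# server is running in this sketch.
```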

Run offline benchmarks:

modal run benchmark/offline/bench.py
modal run benchmark/offline/bench_wildchat.py

Run online benchmarks:

  1. Deploy the server:
     N_GPU=4 GPU_TYPE=h200 modal deploy llmeng/app.py
  2. Run the benchmarks:
     modal run benchmark/online/bench_qwen.py
     modal run benchmark/online/bench_simple.py

Roadmap

  • port mini-sglang to Modal
  • replace nccl with penny
  • rewrite C++/CUDA/Triton in Cute-DSL
  • add speculative decoding (SSD)

Credit
