This directory contains production-ready LLM inference deployments optimized for NVIDIA DGX Spark and multi-GPU VMs.
The centerpiece is a multi-model inference gateway that runs multiple vLLM model servers behind an NGINX reverse proxy with HTTPS support.
Features:
- Multiple models served concurrently (GPT-OSS-20B, GPT-OSS-120B, Qwen-30B)
- Unified HTTPS endpoint with path-based routing
- OpenAI-compatible API
- Health monitoring and load balancing
- Support for both DGX Spark (UMA) and multi-GPU VMs
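Path-based routing means the NGINX gateway maps a URL prefix to each model's backend server. A minimal sketch of what such a config looks like, assuming illustrative upstream ports and route prefixes (the actual config shipped with the gateway is authoritative):

```nginx
# Hypothetical backend ports -- adjust to match your deployment.
upstream gpt_oss_20b  { server 127.0.0.1:8001; }
upstream gpt_oss_120b { server 127.0.0.1:8002; }
upstream qwen_30b     { server 127.0.0.1:8003; }

server {
    listen 443 ssl;
    # ssl_certificate / ssl_certificate_key directives omitted for brevity

    # Each prefix forwards to one vLLM server's OpenAI-compatible API.
    location /gpt-oss-20b/  { proxy_pass http://gpt_oss_20b/;  }
    location /gpt-oss-120b/ { proxy_pass http://gpt_oss_120b/; }
    location /qwen-30b/     { proxy_pass http://qwen_30b/;     }
}
```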
Architecture:

┌─────────────────────────────────────────────────────────┐
│                  NGINX Gateway (HTTPS)                  │
│     Port 443 → Path-based routing to model servers      │
└─────────────────────────────────────────────────────────┘
                            │
        ┌───────────────────┼───────────────────┐
        │                   │                   │
  ┌─────▼─────┐  ┌─────────▼────────┐  ┌──────▼──────┐
  │  GPT-OSS  │  │     GPT-OSS      │  │  Qwen-30B   │
  │    20B    │  │       120B       │  │    Coder    │
  │  (1 GPU)  │  │     (3 GPUs)     │  │  (2 GPUs)   │
  └───────────┘  └──────────────────┘  └─────────────┘
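The GPU split in the diagram can be expressed in Docker Compose with per-service device reservations. A sketch for the 3-GPU service, assuming the `vllm/vllm-openai` image and illustrative device IDs; the other two services follow the same pattern, and the compose file in this directory is authoritative:

```yaml
services:
  gpt-oss-120b:
    image: vllm/vllm-openai:latest
    # Shard the model across the reserved GPUs.
    command: ["--model", "openai/gpt-oss-120b", "--tensor-parallel-size", "3"]
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ["1", "2", "3"]   # 3 GPUs, as in the diagram
              capabilities: [gpu]
```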
Prerequisites:

- Hardware: DGX Spark or a multi-GPU VM (3 or more GPUs recommended)
- Software:
- Docker & Docker Compose v2.0+
- NVIDIA Container Toolkit
- CUDA 13.0+
- Access: HuggingFace account with token for gated models
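Before deploying, it helps to confirm the required tools are installed. A quick sanity check (this only verifies the tools are on `PATH`; versions such as Docker Compose v2.0+ and CUDA 13.0+ still need to be confirmed manually, e.g. with `docker compose version` and `nvidia-smi`):

```shell
# Check that the core prerequisites are installed.
for tool in docker nvidia-smi; do
  if command -v "$tool" >/dev/null 2>&1; then
    echo "$tool: found"
  else
    echo "$tool: MISSING"
  fi
done
```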
Getting Started:

1. Navigate to the vLLM directory: `cd llms/vllm/`
2. Follow the Quick Start guide:
   - Standard Deployment - for DGX Spark or single-model setups
   - VM GPU Deployment - for multi-model VMs
3. Access the gateway: `curl -k https://localhost/v1/models` (the `-k` flag accepts the gateway's self-signed certificate)
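Beyond listing models, the OpenAI-compatible API accepts standard chat-completion requests. A sketch of one such request; the route prefix and model name below are assumptions and should be matched to your gateway's routing configuration:

```shell
# Build an OpenAI-compatible chat-completion payload.
BODY='{"model": "openai/gpt-oss-20b", "messages": [{"role": "user", "content": "Hello"}], "max_tokens": 64}'
echo "$BODY"
# With the stack running, send it to the unified endpoint:
# curl -sk https://localhost/gpt-oss-20b/v1/chat/completions \
#      -H "Content-Type: application/json" -d "$BODY"
```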
Documentation:

- vLLM Multi-Model Gateway Documentation - complete setup, configuration, and usage guide
- vLLM Official Docs - vLLM framework documentation
- OpenAI API Reference - API compatibility reference
For issues or questions:
- Check the vLLM Troubleshooting Guide
- Review vLLM GitHub Issues
- For DGX Spark-specific issues, contact NVIDIA Enterprise Support