LLM Inference Services

This directory contains production-ready LLM inference deployments optimized for NVIDIA DGX Spark and multi-GPU VMs.

Contents

  • vllm/: A multi-model inference gateway that runs multiple vLLM model servers behind an NGINX reverse proxy with HTTPS support.

Features:

  • Multiple models served concurrently (GPT-OSS-20B, GPT-OSS-120B, Qwen-30B)
  • Unified HTTPS endpoint with path-based routing (see the NGINX sketch after this list)
  • OpenAI-compatible API
  • Health monitoring and load balancing
  • Support for both DGX Spark (UMA) and multi-GPU VMs
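
As a concrete illustration of the path-based routing, a minimal NGINX sketch is shown below. The port numbers, route prefixes, and certificate paths are assumptions for the sketch, not the repository's actual configuration.

    # Hypothetical nginx.conf excerpt; ports and paths are illustrative.
    upstream gpt_oss_20b  { server 127.0.0.1:8001; }
    upstream gpt_oss_120b { server 127.0.0.1:8002; }
    upstream qwen_30b     { server 127.0.0.1:8003; }

    server {
        listen 443 ssl;
        ssl_certificate     /etc/nginx/certs/server.crt;
        ssl_certificate_key /etc/nginx/certs/server.key;

        # Each prefix maps to one vLLM server; the trailing slash on
        # proxy_pass strips the matched prefix before forwarding.
        location /gpt-oss-20b/  { proxy_pass http://gpt_oss_20b/; }
        location /gpt-oss-120b/ { proxy_pass http://gpt_oss_120b/; }
        location /qwen-30b/     { proxy_pass http://qwen_30b/; }
    }

Because each proxy_pass carries a URI part, every vLLM backend still sees standard /v1/... paths regardless of the public prefix it is served under.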

Architecture

┌─────────────────────────────────────────────────────────┐
│                  NGINX Gateway (HTTPS)                   │
│  Port 443 → Path-based routing to model servers         │
└─────────────────────────────────────────────────────────┘
                          │
      ┌───────────────────┼───────────────────┐
      │                   │                   │
┌─────▼─────┐   ┌─────────▼────────┐   ┌──────▼──────┐
│ GPT-OSS   │   │  GPT-OSS         │   │  Qwen-30B   │
│ 20B       │   │  120B            │   │  Coder      │
│ (1 GPU)   │   │  (3 GPUs)        │   │  (2 GPUs)   │
└───────────┘   └──────────────────┘   └─────────────┘
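
The same layout can be expressed in Docker Compose. The sketch below mirrors the diagram's GPU split; the service names, image tag, ports, model identifiers, and GPU indices are assumptions, not the repository's actual compose file.

    # Hypothetical docker-compose.yaml excerpt; all names and values are
    # illustrative.
    services:
      gpt-oss-20b:
        image: vllm/vllm-openai:latest
        command: ["--model", "openai/gpt-oss-20b", "--port", "8001"]
        environment:
          - HF_TOKEN=${HF_TOKEN}
        deploy:
          resources:
            reservations:
              devices:
                # Pin this server to a single GPU, as in the diagram.
                - driver: nvidia
                  device_ids: ["0"]
                  capabilities: [gpu]

      gpt-oss-120b:
        image: vllm/vllm-openai:latest
        # Shard the large model across three GPUs with tensor parallelism.
        command: ["--model", "openai/gpt-oss-120b",
                  "--tensor-parallel-size", "3", "--port", "8002"]
        environment:
          - HF_TOKEN=${HF_TOKEN}
        deploy:
          resources:
            reservations:
              devices:
                - driver: nvidia
                  device_ids: ["1", "2", "3"]
                  capabilities: [gpu]

      gateway:
        image: nginx:latest
        ports:
          - "443:443"
        volumes:
          - ./nginx.conf:/etc/nginx/nginx.conf:ro
          - ./certs:/etc/nginx/certs:ro

A Qwen-30B service would follow the same pattern on the remaining two GPUs.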

Prerequisites

  • Hardware: DGX Spark or multi-GPU VM (minimum 3 GPUs recommended)
  • Software:
    • Docker & Docker Compose v2.0+
    • NVIDIA Container Toolkit
    • CUDA 13.0+
  • Access: HuggingFace account with a token for gated models (a verification snippet follows this list)
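
Before deploying, it can help to sanity-check the toolchain. A quick verification sketch follows; the HF_TOKEN variable name and the CUDA image tag are assumptions for the sketch.

    # Confirm the driver sees the GPUs and Compose v2 is installed.
    nvidia-smi
    docker compose version

    # Confirm Docker can reach the GPUs via the NVIDIA Container Toolkit.
    docker run --rm --gpus all nvidia/cuda:13.0.0-base-ubuntu24.04 nvidia-smi

    # Export a HuggingFace token so gated model weights can be pulled.
    export HF_TOKEN=hf_xxxxxxxxxxxxxxxx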

Getting Started

  1. Navigate to the vLLM directory:

    cd llms/vllm/

  2. Follow the Quick Start guide in that directory.

  3. Access the gateway:

    curl -k https://localhost/v1/models
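
Because the endpoint is OpenAI-compatible, any OpenAI client library or plain curl can drive it once the gateway is up. A hypothetical chat request is shown below; the model identifier and route depend on the gateway's actual routing configuration.

    # Hypothetical request; adjust the model name and path to your deployment.
    curl -k https://localhost/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{"model": "openai/gpt-oss-20b",
           "messages": [{"role": "user", "content": "Say hello."}]}'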

Documentation

Setup and configuration details live alongside the deployment files in the llms/vllm/ directory.

Support

For issues or questions, please open an issue on this repository.
