Examples

This directory contains example snapshots, recipes, and component bundles for testing and documentation purposes.

Directory Structure

examples/
├── components/       # Generated deployment bundles (from e2e tests)
│   ├── recipe.yaml   # Recipe used to generate bundles
│   └── bundles/      # Bundle output directories by test scenario
├── recipes/          # Optimized configuration recipes
│   ├── eks-gb200-training.yaml
│   └── eks-h100-training.yaml
└── snapshots/        # System configuration snapshots
    ├── gb200-h100-comp.md
    ├── gb200.yaml
    └── h100.yaml

Snapshots

Example system configuration snapshots captured from GPU clusters:

GB200 System (gb200.yaml)

Snapshot captured from a GB200 NVL72 system. Contents:

  • Operating system: Ubuntu 24.04
  • GPU hardware: GB200 with NVLink interconnect
  • Kubernetes distribution: Amazon EKS 1.33
  • systemd services: containerd and kubelet states
  • Container images: versions installed in the cluster

Usage: Generate recipe for GB200 training workloads

eidos recipe --snapshot examples/snapshots/gb200.yaml --intent training

H100 System (h100.yaml)

Snapshot captured from an H100 GPU cluster. Contents:

  • OS configuration (Ubuntu 22.04)
  • H100 GPU specifications
  • Kubernetes configuration (GKE 1.32)
  • GPU Operator ClusterPolicy settings

Use case: Generate recipes optimized for H100 inference workloads

eidos recipe --snapshot examples/snapshots/h100.yaml --intent inference

Recipes

Optimized configuration recipes generated from query parameters:

EKS GB200 Training (eks-gb200-training.yaml)

Recipe for GB200 training workloads on Amazon EKS:

  • Optimized GPU Operator settings for GB200
  • NVLink-aware configurations
  • Training-specific driver parameters

EKS H100 Training (eks-h100-training.yaml)

Recipe for H100 training workloads on Amazon EKS:

  • H100-optimized configurations
  • PCIe topology settings
  • Training workload tuning

Generate a recipe from query parameters:

eidos recipe \
  --service eks \
  --accelerator gb200 \
  --os ubuntu \
  --intent training \
  --output my-recipe.yaml

Generate bundle from recipe:

eidos bundle --recipe examples/recipes/eks-gb200-training.yaml --output ./my-bundles

Component Bundles

The components/ directory contains deployment bundles generated by the e2e integration tests (tools/e2e). These demonstrate bundle generation with various CLI flag combinations.

Bundle Test Scenarios

| Directory | Description | CLI Flags |
|---|---|---|
| basic/ | Default bundle generation | (none) |
| system-selector/ | System node selectors | --system-node-selector |
| accel-selector/ | Accelerated node selectors | --accelerated-node-selector |
| system-toleration/ | System node tolerations | --system-node-toleration |
| accel-toleration/ | Accelerated node tolerations | --accelerated-node-toleration |
| value-override/ | Custom value overrides | --set |
| combined/ | All flags combined | All of the above |

Generated Components

Each bundle scenario generates these components:

  • cert-manager – Certificate management
  • gpu-operator – NVIDIA GPU Operator
  • nvsentinel – NVSentinel monitoring
  • skyhook-operator – Node optimization
  • nvidia-dra-driver-gpu – NVIDIA DRA (Dynamic Resource Allocation) Driver (GB200 only)

Bundle Contents

Each component bundle contains:

  • values.yaml – Helm chart configuration
  • checksums.txt – SHA256 checksums for verification
  • README.md – Deployment instructions
  • scripts/install.sh – Automated installation script
  • scripts/uninstall.sh – Cleanup script
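
The examples in this README verify checksums.txt with sha256sum -c, which suggests the file uses the standard sha256sum format: one digest and relative path per line. The sketch below uses stand-in files with hypothetical contents, not a real bundle, to show how a file of that shape can be produced and how verification reacts to a modified file. It illustrates the format only and is not the bundler's actual implementation.

```shell
#!/usr/bin/env bash
# Sketch: build and verify a checksums.txt in the sha256sum format.
workdir="$(mktemp -d)"
cd "$workdir" || exit 1

# Stand-in bundle files (hypothetical contents, for illustration only).
mkdir -p scripts
printf 'driver:\n  enabled: true\n' > values.yaml
printf '#!/bin/sh\necho install\n' > scripts/install.sh

# Record one "<sha256>  <path>" line per file.
sha256sum values.yaml scripts/install.sh > checksums.txt

# Verification succeeds while the files are unchanged.
if sha256sum -c checksums.txt; then
  verify_clean=ok
else
  verify_clean=failed
fi

# Any edit changes the digest, so verification now fails.
echo 'tampered: true' >> values.yaml
if sha256sum -c checksums.txt >/dev/null 2>&1; then
  verify_tampered=ok
else
  verify_tampered=failed
fi

echo "clean=$verify_clean tampered=$verify_tampered"
```

Because paths in the file are relative, sha256sum -c has to run from the directory that contains checksums.txt, which is why the examples here cd into the component directory first.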

Example: Deploy GPU Operator with system node selectors

cd examples/components/bundles/system-selector/gpu-operator
sha256sum -c checksums.txt
chmod +x scripts/install.sh
./scripts/install.sh
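
To run the same check across every scenario before picking one to deploy, a loop along these lines may help. It is a minimal sketch that assumes the bundles/<scenario>/<component>/checksums.txt layout implied by the path above; verify_bundles is a hypothetical helper name and not part of the eidos tooling.

```shell
#!/usr/bin/env bash
# Sketch: verify checksums.txt in every <scenario>/<component> directory.
# verify_bundles is a hypothetical helper, not an eidos command.
verify_bundles() {
  local root="$1"
  local failed=0
  local sums dir
  for sums in "$root"/*/*/checksums.txt; do
    [ -f "$sums" ] || continue   # the glob matched nothing
    dir="$(dirname "$sums")"
    # Run from inside the component directory because paths are relative.
    if (cd "$dir" && sha256sum -c checksums.txt >/dev/null 2>&1); then
      echo "OK    ${dir#"$root"/}"
    else
      echo "FAIL  ${dir#"$root"/}"
      failed=1
    fi
  done
  return "$failed"
}

# Run against the example bundles when invoked from the repository root.
if [ -d examples/components/bundles ]; then
  verify_bundles examples/components/bundles
fi
```

The function returns a non-zero status if any directory fails verification, so it can gate a deployment script.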

Comparisons

GB200 vs H100 Comparison (gb200-h100-comp.md)

Detailed comparison document showing configuration differences between GB200 and H100 systems:

  • Hardware specifications
  • Driver and CUDA versions
  • Network configuration (NVLink vs PCIe)
  • Memory topology
  • Recommended settings per GPU type

Use case: Understand platform-specific optimizations

Complete Workflow Example

End-to-end example using the provided files:

# 1. Review example snapshot
cat examples/snapshots/gb200.yaml

# 2. Generate optimized recipe for training
eidos recipe \
  --snapshot examples/snapshots/gb200.yaml \
  --intent training \
  --output my-recipe.yaml

# 3. Compare with provided recipe
diff my-recipe.yaml examples/recipes/eks-gb200-training.yaml

# 4. Generate deployment bundle
eidos bundle \
  --recipe my-recipe.yaml \
  --output ./my-deployment

# 5. Review generated bundle
tree my-deployment/gpu-operator/
cat my-deployment/gpu-operator/README.md

# 6. Verify checksums
cd my-deployment/gpu-operator
sha256sum -c checksums.txt

# 7. Deploy to cluster
chmod +x scripts/install.sh
./scripts/install.sh
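
Step 3 relies on diff, whose exit status is 0 when the files match, 1 when they differ, and greater than 1 on an error such as a missing file. A generated recipe may differ from the provided one, so a script that wraps the workflow should treat status 1 as information, not a failure. The sketch below shows one way to do that. compare_recipes is a hypothetical helper, and the demo files are stand-ins with made-up contents, not real recipes.

```shell
#!/usr/bin/env bash
# Sketch: interpret diff's exit status when comparing two recipe files.
# compare_recipes is a hypothetical helper, not an eidos command.
compare_recipes() {
  local rc=0
  diff -u "$1" "$2" || rc=$?
  case "$rc" in
    0) echo "recipes are identical" ;;
    1) echo "recipes differ" ;;
    *) echo "diff failed" >&2; return 2 ;;
  esac
}

# Demo with stand-in files (hypothetical contents).
tmp="$(mktemp -d)"
printf 'intent: training\n' > "$tmp/a.yaml"
printf 'intent: training\n' > "$tmp/b.yaml"
printf 'intent: inference\n' > "$tmp/c.yaml"

compare_recipes "$tmp/a.yaml" "$tmp/b.yaml"
compare_recipes "$tmp/a.yaml" "$tmp/c.yaml"
```

In the workflow above, the call would be compare_recipes my-recipe.yaml examples/recipes/eks-gb200-training.yaml.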

Generate Your Own Examples

Capture Snapshot

From your GPU cluster:

# Capture snapshot to file
eidos snapshot --output my-snapshot.yaml

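# Or deploy the agent to Kubernetes
kubectl apply -f https://raw.githubusercontent.com/NVIDIA/eidos/main/deployments/eidos-agent/1-deps.yaml
kubectl apply -f https://raw.githubusercontent.com/NVIDIA/eidos/main/deployments/eidos-agent/2-job.yaml
# Wait for the job to finish before reading its logs (the timeout is an example value)
kubectl wait -n gpu-operator --for=condition=complete job/eidos --timeout=300s
kubectl logs -n gpu-operator job/eidos > my-snapshot.yaml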

Generate Recipe

From snapshot or query:

# From snapshot
eidos recipe --snapshot my-snapshot.yaml --intent training --output my-recipe.yaml

# From query parameters
eidos recipe \
  --service eks \
  --accelerator gb200 \
  --os ubuntu \
  --osv 24.04 \
  --k8s 1.33 \
  --intent training \
  --output my-recipe.yaml

Create Bundle

From recipe:

# Run all bundlers
eidos bundle --recipe my-recipe.yaml --output ./bundles

# Run a single bundler with scheduling overrides
eidos bundle \
  --recipe my-recipe.yaml \
  --bundlers gpu-operator \
  --system-node-selector nodeGroup=system-pool \
  --accelerated-node-toleration nvidia.com/gpu=present:NoSchedule \
  --output ./bundles
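
The toleration argument in the example above appears to follow the key=value:effect shape used for Kubernetes taints, and the node selector appears to be a plain key=value pair. That reading is inferred from the example values and is not a documented eidos format. Under that assumption, the sketch below splits a toleration string into its parts, which can help when checking arguments before passing them to the CLI. parse_toleration is a hypothetical helper.

```shell
#!/usr/bin/env bash
# Sketch: split a toleration argument of the assumed form key=value:effect.
# parse_toleration is a hypothetical helper, not an eidos command.
parse_toleration() {
  local arg="$1"
  local key="${arg%%=*}"      # text before the first '='
  local rest="${arg#*=}"      # text after the first '='
  local value="${rest%%:*}"   # text before the first ':'
  local effect="${rest#*:}"   # text after the first ':'
  printf 'key=%s value=%s effect=%s\n' "$key" "$value" "$effect"
}

parse_toleration 'nvidia.com/gpu=present:NoSchedule'
# prints: key=nvidia.com/gpu value=present effect=NoSchedule
```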

Running E2E Tests

The component bundles are regenerated by the e2e integration tests:

# Run e2e tests (regenerates components/bundles/)
./tools/e2e

# Run with custom output directory
./tools/e2e --output ./my-test-output