This directory contains example snapshots, recipes, and component bundles for testing and documentation purposes.
examples/
├── components/              # Generated deployment bundles (from e2e tests)
│   ├── recipe.yaml          # Recipe used to generate bundles
│   └── bundles/             # Bundle output directories by test scenario
├── recipes/                 # Optimized configuration recipes
│   ├── eks-gb200-training.yaml
│   └── eks-h100-training.yaml
└── snapshots/               # System configuration snapshots
    ├── gb200-h100-comp.md
    ├── gb200.yaml
    └── h100.yaml
Example system configuration snapshots captured from GPU clusters:
GB200 System (gb200.yaml)
Snapshot captured from a GB200 NVL72 system. Contents:
- Operating system: Ubuntu 24.04
- GPU hardware: GB200 with NVLink interconnect
- Kubernetes distribution: Amazon EKS 1.33
- systemd services: containerd and kubelet states
- Container images: Installed versions in cluster
Usage: Generate recipe for GB200 training workloads
eidos recipe --snapshot examples/snapshots/gb200.yaml --intent training
H100 System (h100.yaml)
Snapshot from an H100 GPU cluster with:
- OS configuration (Ubuntu 22.04)
- H100 GPU specifications
- Kubernetes configuration (GKE 1.32)
- GPU Operator ClusterPolicy settings
Use case: Generate recipes optimized for H100 inference workloads
eidos recipe --snapshot examples/snapshots/h100.yaml --intent inference
Optimized configuration recipes generated from query parameters:
EKS GB200 Training (eks-gb200-training.yaml)
Recipe for GB200 training workloads on Amazon EKS:
- Optimized GPU Operator settings for GB200
- NVLink-aware configurations
- Training-specific driver parameters
EKS H100 Training (eks-h100-training.yaml)
Recipe for H100 training workloads on Amazon EKS:
- H100-optimized configurations
- PCIe topology settings
- Training workload tuning
Generate recipe from query:
eidos recipe \
--service eks \
--accelerator gb200 \
--os ubuntu \
--intent training \
--output my-recipe.yaml
Generate bundle from recipe:
eidos bundle --recipe examples/recipes/eks-gb200-training.yaml --output ./my-bundles
The components/ directory contains deployment bundles generated by the e2e integration tests (tools/e2e). These demonstrate bundle generation with various CLI flag combinations.
| Directory | Description | CLI Flags |
|---|---|---|
| basic/ | Default bundle generation | (none) |
| system-selector/ | System node selectors | --system-node-selector |
| accel-selector/ | Accelerated node selectors | --accelerated-node-selector |
| system-toleration/ | System node tolerations | --system-node-toleration |
| accel-toleration/ | Accelerated node tolerations | --accelerated-node-toleration |
| value-override/ | Custom value overrides | --set |
| combined/ | All flags combined | All of the above |
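
To see what a given flag actually changes, one option is to diff a scenario's Helm values against the basic/ baseline. The sketch below assumes the layout bundles/<scenario>/<component>/values.yaml, inferred from the example paths in this README; `scenario_diff` is a hypothetical helper name and is not part of eidos.

```shell
# Hypothetical helper (not part of eidos): show how one scenario's Helm
# values differ from the basic/ baseline for a given component.
# Assumed layout: <root>/<scenario>/<component>/values.yaml
# Usage: scenario_diff examples/components/bundles system-selector gpu-operator
scenario_diff() {
  root=$1 scenario=$2 component=$3
  base="$root/basic/$component/values.yaml"
  other="$root/$scenario/$component/values.yaml"
  for f in "$base" "$other"; do
    [ -f "$f" ] || { echo "not found: $f" >&2; return 2; }
  done
  # diff exits 1 when the files differ; treat that as success here.
  diff -u "$base" "$other" || [ $? -eq 1 ]
}
```

Lines prefixed with `+` in the output are settings that the scenario's flags added on top of the default bundle.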
Each bundle scenario generates these components:
- cert-manager – Certificate management
- gpu-operator – NVIDIA GPU Operator
- nvsentinel – NVSentinel monitoring
- skyhook-operator – Node optimization
- nvidia-dra-driver-gpu – NVIDIA DRA (Dynamic Resource Allocation) Driver (GB200 only)
Each component bundle contains:
- values.yaml – Helm chart configuration
- checksums.txt – SHA256 checksums for verification
- README.md – Deployment instructions
- scripts/install.sh – Automated installation script
- scripts/uninstall.sh – Cleanup script
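
Before deploying, it can help to confirm that a component directory really contains the files listed above. The following is a minimal sketch; `check_bundle_layout` is a hypothetical helper name, not something eidos provides, and it checks only for the presence of the files, not their contents.

```shell
# Hypothetical helper (not part of eidos): check that a component bundle
# directory contains the files listed above. Prints each missing path to
# stderr and returns non-zero if any file is absent.
# Usage: check_bundle_layout examples/components/bundles/basic/gpu-operator
check_bundle_layout() {
  dir=$1
  missing=0
  for f in values.yaml checksums.txt README.md \
           scripts/install.sh scripts/uninstall.sh; do
    if [ ! -f "$dir/$f" ]; then
      echo "missing: $dir/$f" >&2
      missing=1
    fi
  done
  return "$missing"
}
```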
Example: Deploy GPU Operator with system node selectors
cd examples/components/bundles/system-selector/gpu-operator
sha256sum -c checksums.txt
chmod +x scripts/install.sh
./scripts/install.sh
GB200 vs H100 Comparison (gb200-h100-comp.md)
Detailed comparison document showing configuration differences between GB200 and H100 systems:
- Hardware specifications
- Driver and CUDA versions
- Network configuration (NVLink vs PCIe)
- Memory topology
- Recommended settings per GPU type
Use case: Understand platform-specific optimizations
End-to-end example using the provided files:
# 1. Review example snapshot
cat examples/snapshots/gb200.yaml
# 2. Generate optimized recipe for training
eidos recipe \
--snapshot examples/snapshots/gb200.yaml \
--intent training \
--output my-recipe.yaml
# 3. Compare with provided recipe
diff my-recipe.yaml examples/recipes/eks-gb200-training.yaml
# 4. Generate deployment bundle
eidos bundle \
--recipe my-recipe.yaml \
--output ./my-deployment
# 5. Review generated bundle
tree my-deployment/gpu-operator/
cat my-deployment/gpu-operator/README.md
# 6. Verify checksums
cd my-deployment/gpu-operator
sha256sum -c checksums.txt
# 7. Deploy to cluster
./scripts/install.sh
From your GPU cluster:
# Capture snapshot to file
eidos snapshot --output my-snapshot.yaml
# Or deploy agent to Kubernetes
kubectl apply -f https://raw.githubusercontent.com/NVIDIA/eidos/main/deployments/eidos-agent/1-deps.yaml
kubectl apply -f https://raw.githubusercontent.com/NVIDIA/eidos/main/deployments/eidos-agent/2-job.yaml
kubectl logs -n gpu-operator job/eidos > my-snapshot.yaml
From snapshot or query:
# From snapshot
eidos recipe --snapshot my-snapshot.yaml --intent training --output my-recipe.yaml
# From query parameters
eidos recipe \
--service eks \
--accelerator gb200 \
--os ubuntu \
--osv 24.04 \
--k8s 1.33 \
--intent training \
--output my-recipe.yaml
From recipe:
# Generate bundles with all bundlers
eidos bundle --recipe my-recipe.yaml --output ./bundles
# Generate a bundle with one bundler and node placement overrides
eidos bundle \
--recipe my-recipe.yaml \
--bundlers gpu-operator \
--system-node-selector nodeGroup=system-pool \
--accelerated-node-toleration nvidia.com/gpu=present:NoSchedule \
--output ./bundles
The component bundles are regenerated by the e2e integration tests:
# Run e2e tests (regenerates components/bundles/)
./tools/e2e
# Run with custom output directory
./tools/e2e --output ./my-test-output
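
After regenerating, it may be worth confirming that every bundle's checksums.txt still verifies. The sketch below extends the single-component `sha256sum -c checksums.txt` check shown earlier to a whole bundles directory. It assumes the layout <root>/<scenario>/<component>/checksums.txt, inferred from the example paths in this README; `verify_all_bundles` is a hypothetical helper name and is not part of eidos or tools/e2e.

```shell
# Hypothetical helper (not part of eidos): verify checksums.txt in every
# component bundle under a bundles root.
# Assumed layout: <root>/<scenario>/<component>/checksums.txt
# Usage: verify_all_bundles examples/components/bundles
verify_all_bundles() {
  root=$1
  failed=0
  for sums in "$root"/*/*/checksums.txt; do
    [ -f "$sums" ] || continue   # glob matched nothing
    comp_dir=$(dirname "$sums")
    # sha256sum -c resolves file names relative to the current directory,
    # so run it from inside the component directory.
    if (cd "$comp_dir" && sha256sum -c checksums.txt >/dev/null); then
      echo "ok: $comp_dir"
    else
      echo "FAILED: $comp_dir" >&2
      failed=1
    fi
  done
  return "$failed"
}
```

The function returns non-zero if any component fails verification, so it can gate a later step in a script.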