Examples

This directory contains example snapshots, recipes, and component bundles for testing and documentation purposes.

Directory Structure

examples/
├── components/       # Generated deployment bundles (from e2e tests)
│   ├── recipe.yaml   # Recipe used to generate bundles
│   └── bundles/      # Bundle output directories by test scenario
├── recipes/          # Optimized configuration recipes  
│   ├── eks-gb200-training.yaml
│   └── eks-h100-training.yaml
└── snapshots/        # System configuration snapshots
    ├── gb200-h100-comp.md
    ├── gb200.yaml
    └── h100.yaml

Snapshots

Example system configuration snapshots captured from GPU clusters:

GB200 System (gb200.yaml)

Snapshot captured from a GB200 NVL72 system. Contents:

Operating system: Ubuntu 24.04
GPU hardware: GB200 with NVLink interconnect
Kubernetes distribution: Amazon EKS 1.33
systemd services: state of containerd and kubelet
Container images: versions installed in the cluster

Usage: generate a recipe for GB200 training workloads

eidos recipe --snapshot examples/snapshots/gb200.yaml --intent training

H100 System (h100.yaml)

Snapshot from an H100 GPU cluster with:

OS configuration (Ubuntu 22.04)
H100 GPU specifications
Kubernetes configuration (GKE 1.32)
GPU Operator ClusterPolicy settings

Use case: Generate recipes optimized for H100 inference workloads

eidos recipe --snapshot examples/snapshots/h100.yaml --intent inference

Recipes

Optimized configuration recipes generated from query parameters:

EKS GB200 Training (eks-gb200-training.yaml)

Recipe for GB200 training workloads on Amazon EKS:

Optimized GPU Operator settings for GB200
NVLink-aware configurations
Training-specific driver parameters

EKS H100 Training (eks-h100-training.yaml)

Recipe for H100 training workloads on Amazon EKS:

H100-optimized configurations
PCIe topology settings
Training workload tuning

Generate recipe from query:

eidos recipe \
  --service eks \
  --accelerator gb200 \
  --os ubuntu \
  --intent training \
  --output my-recipe.yaml

Generate bundle from recipe:

eidos bundle --recipe examples/recipes/eks-gb200-training.yaml --output ./my-bundles

Component Bundles

The components/ directory contains deployment bundles generated by the e2e integration tests (tools/e2e). These demonstrate bundle generation with various CLI flag combinations.

Bundle Test Scenarios

| Directory | Description | CLI Flags |
|-----------|-------------|-----------|
| `basic/` | Default bundle generation | (none) |
| `system-selector/` | System node selectors | `--system-node-selector` |
| `accel-selector/` | Accelerated node selectors | `--accelerated-node-selector` |
| `system-toleration/` | System node tolerations | `--system-node-toleration` |
| `accel-toleration/` | Accelerated node tolerations | `--accelerated-node-toleration` |
| `value-override/` | Custom value overrides | `--set` |
| `combined/` | All flags combined | All of the above |
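
To see what a given flag changes, one option is to diff a scenario's output against the `basic/` baseline. The sketch below assumes each scenario directory holds one subdirectory per component with a `values.yaml` inside, as the `system-selector/gpu-operator` path used elsewhere in this README suggests. The function name `diff_scenario` is made up for illustration and is not part of eidos.

```shell
# Hypothetical helper: show how one scenario's values.yaml differs from basic/.
# Assumes the layout <root>/<scenario>/<component>/values.yaml.
diff_scenario() {
  local root="$1" scenario="$2" component="$3"
  local base="$root/basic/$component/values.yaml"
  local other="$root/$scenario/$component/values.yaml"

  if [ ! -f "$base" ] || [ ! -f "$other" ]; then
    echo "missing values.yaml for $component in basic/ or $scenario/" >&2
    return 2
  fi

  # diff exits 1 when the files differ; that is the expected case here,
  # so only a status above 1 (a real diff error) is treated as failure.
  local status=0
  diff -u "$base" "$other" || status=$?
  [ "$status" -le 1 ]
}

# Example (run from the repository root):
# diff_scenario examples/components/bundles system-selector gpu-operator
```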

Generated Components

Each bundle scenario generates these components:

cert-manager – Certificate management
gpu-operator – NVIDIA GPU Operator
nvsentinel – NVSentinel monitoring
skyhook-operator – Node optimization
nvidia-dra-driver-gpu – NVIDIA DRA (Dynamic Resource Allocation) Driver (GB200 only)

Bundle Contents

Each component bundle contains:

values.yaml – Helm chart configuration
checksums.txt – SHA256 checksums for verification
README.md – Deployment instructions
scripts/install.sh – Automated installation script
scripts/uninstall.sh – Cleanup script
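
A quick way to confirm that a generated bundle is complete is to check for each of the files listed above. This is a minimal sketch; `check_bundle_layout` is a hypothetical helper name, and the file list is copied from this README rather than from the tool itself.

```shell
# Hypothetical helper: confirm a component bundle contains the expected files.
check_bundle_layout() {
  local dir="$1" missing=0 f
  for f in values.yaml checksums.txt README.md scripts/install.sh scripts/uninstall.sh; do
    if [ ! -f "$dir/$f" ]; then
      echo "missing: $dir/$f" >&2
      missing=1
    fi
  done
  # 0 when every file is present, 1 otherwise.
  return "$missing"
}

# Example:
# check_bundle_layout examples/components/bundles/basic/gpu-operator
```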

Example: Deploy GPU Operator with system node selectors

cd examples/components/bundles/system-selector/gpu-operator
sha256sum -c checksums.txt
chmod +x scripts/install.sh
./scripts/install.sh
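
When the commands above are pasted one by one, a failed checksum does not stop the install from running. The sketch below chains the two steps so that the install script runs only after verification succeeds. `verify_then_install` is a hypothetical wrapper, not something shipped in the bundle, and it assumes GNU `sha256sum`.

```shell
# Hypothetical wrapper: run the install script only if every checksum matches.
verify_then_install() {
  local dir="$1"
  (
    cd "$dir" || exit 1
    # --quiet prints only failures; a non-zero status means a mismatch
    # or a file listed in checksums.txt that could not be read.
    if ! sha256sum --quiet -c checksums.txt; then
      echo "checksum verification failed in $dir; not installing" >&2
      exit 1
    fi
    chmod +x scripts/install.sh && ./scripts/install.sh
  )
}

# Example:
# verify_then_install examples/components/bundles/system-selector/gpu-operator
```

The subshell keeps the `cd` from leaking into the calling shell.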

Comparisons

GB200 vs H100 Comparison (gb200-h100-comp.md)

Detailed comparison document showing configuration differences between GB200 and H100 systems:

Hardware specifications
Driver and CUDA versions
Network configuration (NVLink vs PCIe)
Memory topology
Recommended settings per GPU type

Use case: Understand platform-specific optimizations

Complete Workflow Example

End-to-end example using the provided files:

# 1. Review example snapshot
cat examples/snapshots/gb200.yaml

# 2. Generate optimized recipe for training
eidos recipe \
  --snapshot examples/snapshots/gb200.yaml \
  --intent training \
  --output my-recipe.yaml

# 3. Compare with provided recipe
diff my-recipe.yaml examples/recipes/eks-gb200-training.yaml

# 4. Generate deployment bundle
eidos bundle \
  --recipe my-recipe.yaml \
  --output ./my-deployment

# 5. Review generated bundle
tree my-deployment/gpu-operator/
cat my-deployment/gpu-operator/README.md

# 6. Verify checksums
cd my-deployment/gpu-operator
sha256sum -c checksums.txt

# 7. Deploy to cluster
./scripts/install.sh
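
The workflow above relies on several command-line tools, and `tree` in particular is not installed everywhere. A preflight check along the following lines can catch a missing tool before step 1. `missing_tools` is a hypothetical helper written for this example.

```shell
# Hypothetical helper: report any tool that is not available on PATH.
missing_tools() {
  local missing=0 tool
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "not found on PATH: $tool" >&2
      missing=1
    fi
  done
  # 0 when every tool was found, 1 otherwise.
  return "$missing"
}

# Example: the tools used by the workflow
# missing_tools eidos diff tree sha256sum || exit 1
```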

Generate Your Own Examples

Capture Snapshot

From your GPU cluster:

# Capture snapshot to file
eidos snapshot --output my-snapshot.yaml

# Or deploy agent to Kubernetes
kubectl apply -f https://raw.githubusercontent.com/NVIDIA/eidos/main/deployments/eidos-agent/1-deps.yaml
kubectl apply -f https://raw.githubusercontent.com/NVIDIA/eidos/main/deployments/eidos-agent/2-job.yaml
kubectl logs -n gpu-operator job/eidos > my-snapshot.yaml
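
With the agent-based capture, reading the logs before the Job has finished may yield an empty or partial file. One possible guard, sketched below, waits for the Job to complete and then checks that the captured file is not empty. The job name and namespace are taken from the `kubectl logs` command above; the helper names and the 300-second default timeout are assumptions made for this example.

```shell
# Hypothetical helpers around the agent-based capture.

# Block until the Job reports completion. The default timeout is an
# arbitrary choice; pass another value as the first argument.
wait_for_snapshot_job() {
  kubectl wait --for=condition=complete job/eidos -n gpu-operator --timeout="${1:-300s}"
}

# Succeed only if the captured file exists and is not empty.
snapshot_is_nonempty() {
  [ -s "$1" ]
}

# Example:
# wait_for_snapshot_job 600s \
#   && kubectl logs -n gpu-operator job/eidos > my-snapshot.yaml \
#   && snapshot_is_nonempty my-snapshot.yaml
```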

Generate Recipe

From snapshot or query:

# From snapshot
eidos recipe --snapshot my-snapshot.yaml --intent training --output my-recipe.yaml

# From query parameters
eidos recipe \
  --service eks \
  --accelerator gb200 \
  --os ubuntu \
  --osv 24.04 \
  --k8s 1.33 \
  --intent training \
  --output my-recipe.yaml

Create Bundle

From recipe:

# Generate all bundlers
eidos bundle --recipe my-recipe.yaml --output ./bundles

# Generate specific bundler with overrides
eidos bundle \
  --recipe my-recipe.yaml \
  --bundlers gpu-operator \
  --system-node-selector nodeGroup=system-pool \
  --accelerated-node-toleration nvidia.com/gpu=present:NoSchedule \
  --output ./bundles
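
The selector and toleration values in the example follow the shapes `key=value` and `key=value:effect`. The sketch below checks a value against those shapes before it is passed to the CLI. The shapes are inferred from the example only, so eidos may accept other forms; the effect names are the standard Kubernetes taint effects, and both function names are hypothetical.

```shell
# Hypothetical validators; the accepted shapes are inferred from the
# example flags, not from the eidos source.

# key=value with a non-empty key
is_node_selector() {
  case "$1" in
    =*) return 1 ;;
    *=*) return 0 ;;
    *) return 1 ;;
  esac
}

# key=value:effect, where effect is a Kubernetes taint effect
is_toleration() {
  case "$1" in
    =*) return 1 ;;
    *=*:NoSchedule | *=*:PreferNoSchedule | *=*:NoExecute) return 0 ;;
    *) return 1 ;;
  esac
}

# Example:
# is_toleration 'nvidia.com/gpu=present:NoSchedule' && echo "looks valid"
```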

Running E2E Tests

The component bundles are regenerated by the e2e integration tests:

# Run e2e tests (regenerates components/bundles/)
./tools/e2e

# Run with custom output directory
./tools/e2e --output ./my-test-output
