Code Quality (Non-Negotiable):
- Tests must pass with race detector (`make test`)
- Never disable tests to make CI green (including "temporary" skips)
- Use structured errors from `pkg/errors` with error codes (never `fmt.Errorf`)
- Context with timeouts for all I/O operations (prevent resource leaks)
- Check `ctx.Done()` in loops and long-running operations
- Stop after 3 failed attempts at the same fix → reassess approach
Development Process:
- Plan non-trivial work in stages (use `IMPLEMENTATION_PLAN.md` for complex tasks)
- Follow red → green → refactor cycle
- Commit incrementally with "why" explanations
- Learn existing patterns before inventing new ones
- Verify assumptions with code/tests, never assume
Go Code Requirements:
- Handle context cancellation explicitly
- Define timeouts at API boundaries (collectors: 10s, handlers: 30s)
- Wrap errors with `pkg/errors`: `errors.Wrap(errors.ErrCodeInternal, "operation failed", err)`
- Use table-driven tests for multiple scenarios
- Run with `-race` flag enabled
Decision Framework: Choose solutions based on testability, readability, consistency, simplicity, and reversibility.
NVIDIA Eidos (Eidos) provides validated GPU-accelerated Kubernetes configurations through a four-stage workflow:
- Snapshot → Capture system state (OS, kernel, K8s, GPU)
- Recipe → Generate optimized config from captured data or query parameters
- Validate → Check recipe constraints against actual cluster measurements
- Bundle → Create deployment artifacts (Helm values, manifests, scripts)
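The four stages compose as a pipeline; this sketch uses hypothetical types and function names for illustration only (not the actual Eidos API):

```go
package main

import "fmt"

// Hypothetical stand-ins for the artifacts each stage produces.
type Snapshot struct{ OS, K8s, GPU string }
type Recipe struct {
	Intent string
	From   Snapshot
}
type Bundle struct{ Files []string }

// snapshot captures system state (stubbed with fixed values here).
func snapshot() Snapshot { return Snapshot{OS: "ubuntu", K8s: "1.33", GPU: "gb200"} }

// recipe derives an optimized configuration from a snapshot and an intent.
func recipe(s Snapshot, intent string) Recipe { return Recipe{Intent: intent, From: s} }

// validate checks recipe constraints before bundling.
func validate(r Recipe) error {
	if r.Intent == "" {
		return fmt.Errorf("recipe is missing an intent")
	}
	return nil
}

// bundle produces deployment artifacts from a validated recipe.
func bundle(r Recipe) Bundle {
	return Bundle{Files: []string{"values.yaml", "checksums.txt"}}
}

func main() {
	s := snapshot()
	r := recipe(s, "training")
	if err := validate(r); err != nil {
		panic(err)
	}
	b := bundle(r)
	fmt.Println(b.Files) // [values.yaml checksums.txt]
}
```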
Core Components:
- CLI (`eidos`): All four stages (snapshot/recipe/validate/bundle)
- API Server: Recipe generation and bundle creation via REST API
- Agent: Kubernetes Job for automated cluster snapshots → ConfigMaps
- Bundlers: Plugin-based artifact generators (GPU Operator, Network Operator, Cert-Manager, NVSentinel, Skyhook, DRA Driver)
- Deployers: GitOps integration providers (helm, argocd) with deployment ordering
Tech Stack: Go 1.25, Kubernetes 1.33+, golangci-lint v2.6, Container images via Ko
Package Architecture (Critical Principle):
- User Interaction Packages (`pkg/cli`, `pkg/api`): Focus solely on capturing user intent, validating input, and formatting output. No business logic.
- Functional Packages (`pkg/oci`, `pkg/bundler`, `pkg/recipe`, `pkg/collector`): Self-contained, reusable business logic. Should be usable independently without CLI/API.
- Example: OCI packaging logic lives in `pkg/oci` (not `pkg/cli`), so both the CLI and the API can use it. Deployers return structured `DeploymentInfo` so the CLI just formats output.
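The separation can be sketched like this (hypothetical function names and return values; the real `pkg/oci` API may differ):

```go
package main

import "fmt"

// --- Functional package (e.g. pkg/oci): pure business logic ---

// PackageImage is a stand-in for reusable logic; it knows nothing
// about flags, HTTP, or output formatting.
func PackageImage(ref string) (string, error) {
	if ref == "" {
		return "", fmt.Errorf("empty image reference")
	}
	return "sha256:deadbeef", nil // digest of the packaged artifact (illustrative)
}

// --- User-interaction layer (e.g. pkg/cli): capture intent, call logic, format output ---

func runCLI(args []string) string {
	digest, err := PackageImage(args[0])
	if err != nil {
		return "error: " + err.Error()
	}
	return "packaged: " + digest // formatting only; no business logic here
}

func main() {
	fmt.Println(runCLI([]string{"ghcr.io/nvidia/eidos:latest"})) // packaged: sha256:deadbeef
}
```

Because `PackageImage` has no CLI or HTTP dependencies, an API handler can call it just as easily as the CLI wrapper does.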
Quick Start:
```sh
make qualify  # Run tests + lint + scan (full check)
make build    # Build binaries
make server   # Start API server locally
```
1. Add collector in `pkg/collector/gpu/`:
   - Implement the `Collector` interface
   - Add factory method in `factory.go`
   - Write table-driven tests with mocks
2. Update recipe data in `pkg/recipe/data/`:
   - Add base measurements in `overlays/base.yaml`
   - Create overlay files in `overlays/` with criteria matching
3. Test the workflow:

   ```sh
   eidos snapshot --output snapshot.yaml
   eidos recipe --snapshot snapshot.yaml --intent training
   ```
→ See Extended Reference: Adding a New Collector for full example
For Helm Components:
1. Add to component registry (`pkg/recipe/data/registry.yaml`):

   ```yaml
   - name: my-operator
     displayName: My Operator
     valueOverrideKeys: [myoperator]
     helm:
       defaultRepository: https://charts.example.com
       defaultChart: example/my-operator
   ```

2. Create values file (`pkg/recipe/data/components/my-operator/values.yaml`):
   - Define Helm chart values
   - Keep configuration minimal and reusable

3. Reference in recipe (`pkg/recipe/data/overlays/*.yaml`):

   ```yaml
   componentRefs:
     - name: my-operator
       type: Helm
       version: v1.0.0
       valuesFile: components/my-operator/values.yaml
   ```
For Kustomize Components:
1. Add to component registry (`pkg/recipe/data/registry.yaml`):

   ```yaml
   - name: my-kustomize-app
     displayName: My Kustomize App
     valueOverrideKeys: [mykustomize]
     kustomize:
       defaultSource: https://github.com/example/my-app
       defaultPath: deploy/production
       defaultTag: v1.0.0
   ```

2. Reference in recipe (`pkg/recipe/data/overlays/*.yaml`):

   ```yaml
   componentRefs:
     - name: my-kustomize-app
       type: Kustomize
       tag: v1.0.0
   ```
Note: A component must have either `helm` OR `kustomize` configuration, not both.
- Run tests:

  ```sh
  go test -v ./pkg/recipe/... -run TestComponentRegistry
  make test
  ```
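A registry loader might enforce the helm/kustomize exclusivity constraint like this (a sketch with hypothetical struct fields, not the actual registry schema):

```go
package main

import "fmt"

// Hypothetical mirror of a registry entry; field names are illustrative.
type HelmConfig struct{ Repository, Chart string }
type KustomizeConfig struct{ Source, Path, Tag string }

type Component struct {
	Name      string
	Helm      *HelmConfig
	Kustomize *KustomizeConfig
}

// Validate enforces the documented constraint: exactly one of helm or
// kustomize must be configured.
func (c Component) Validate() error {
	switch {
	case c.Helm != nil && c.Kustomize != nil:
		return fmt.Errorf("component %q: helm and kustomize are mutually exclusive", c.Name)
	case c.Helm == nil && c.Kustomize == nil:
		return fmt.Errorf("component %q: either helm or kustomize is required", c.Name)
	}
	return nil
}

func main() {
	ok := Component{Name: "my-operator", Helm: &HelmConfig{Repository: "https://charts.example.com", Chart: "example/my-operator"}}
	bad := Component{Name: "broken", Helm: &HelmConfig{}, Kustomize: &KustomizeConfig{}}
	fmt.Println(ok.Validate(), bad.Validate() != nil) // <nil> true
}
```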
→ See Bundler Development Guide for full details
1. Create handler in `pkg/api/`:

   ```go
   func (h *Handler) ServeHTTP(w http.ResponseWriter, r *http.Request) {
       ctx, cancel := context.WithTimeout(r.Context(), 30*time.Second)
       defer cancel()
       // Handler logic
   }
   ```

2. Register route in `pkg/api/server.go`
3. Add middleware (metrics → version → requestID → panic → rateLimit → logging)
4. Update API spec in `api/eidos/v1/server.yaml`
5. Write integration tests
→ See Extended Reference: Adding a New API Endpoint for detailed steps
- Check error messages → Use proper assertions:

  ```go
  if err != nil {
      t.Fatalf("unexpected error: %v", err) // Stop test immediately
  }
  ```

- Race conditions → Run `go test -race ./...`
- Linting issues → Run `make lint`, then fix reported problems
- Context → Ensure collectors/handlers respect `ctx.Done()`
→ See Troubleshooting & Support for common issues and debugging tips
1. Functional Options (Configuration)

   ```go
   builder := recipe.NewBuilder(
       recipe.WithVersion(version),
   )
   server := server.New(
       server.WithName("eidosd"),
       server.WithVersion(version),
   )
   ```

2. Factory Pattern (Collectors)

   ```go
   factory := collector.NewDefaultFactory(
       collector.WithSystemDServices([]string{"containerd.service"}),
   )
   gpuCollector := factory.CreateGPUCollector()
   ```

3. Builder Pattern (Measurements)

   ```go
   measurement.NewMeasurement(measurement.TypeK8s).
       WithSubtype(subtype).
       Build()
   ```

4. Singleton Pattern (K8s Client)

   ```go
   import "github.com/NVIDIA/eidos/pkg/k8s/client"

   clientset, config, err := client.GetKubeClient() // Uses sync.Once
   ```

Always use structured errors from `pkg/errors`:
```go
import "github.com/NVIDIA/eidos/pkg/errors"

// Simple error
return errors.New(errors.ErrCodeNotFound, "GPU not found")

// Wrap existing error
return errors.Wrap(errors.ErrCodeInternal, "collection failed", err)

// With context for debugging
return errors.WrapWithContext(
    errors.ErrCodeTimeout,
    "operation timed out",
    ctx.Err(),
    map[string]interface{}{
        "component": "gpu-collector",
        "timeout":   "10s",
    },
)
```

Error Codes: `ErrCodeNotFound`, `ErrCodeUnauthorized`, `ErrCodeTimeout`, `ErrCodeInternal`, `ErrCodeInvalidRequest`, `ErrCodeUnavailable`
Always use context with timeouts:
```go
// In collectors
func (c *Collector) Collect(ctx context.Context) (*measurement.Measurement, error) {
    ctx, cancel := context.WithTimeout(ctx, 10*time.Second)
    defer cancel()
    // Collection logic
}

// In HTTP handlers
func (h *Handler) ServeHTTP(w http.ResponseWriter, r *http.Request) {
    ctx, cancel := context.WithTimeout(r.Context(), 30*time.Second)
    defer cancel()
    // Handler logic
}
```

Use errgroup for parallel operations:
```go
g, ctx := errgroup.WithContext(ctx)
g.Go(func() error {
    _, err := collector1.Collect(ctx) // Collect returns (*measurement.Measurement, error)
    return err
})
g.Go(func() error {
    _, err := collector2.Collect(ctx)
    return err
})
if err := g.Wait(); err != nil {
    return fmt.Errorf("collection failed: %w", err)
}
```

Use structured logging with appropriate levels:
```go
slog.Debug("request started",
    "requestID", requestID,
    "method", r.Method,
    "path", r.URL.Path,
)
slog.Error("operation failed",
    "error", err,
    "component", "gpu-collector",
    "node", nodeName,
)
```

Log Levels:
- `Debug` – Detailed diagnostics (enabled with `--debug`)
- `Info` – General informational messages (default)
- `Warn` – Warning conditions
- `Error` – Error conditions requiring attention
```sh
# Full qualification (run before PR)
make qualify  # test + lint + scan

# Individual checks
make test     # Unit tests with race detector
make lint     # golangci-lint + yamllint
make scan     # Trivy vulnerability scan

# Build and run
make build    # Build for current platform
make server   # Start API server (debug mode)
make tidy     # Format code + update deps
```

- Coverage: Aim for >70% meaningful coverage (current: ~60%)
- Race Detector: Always enabled (`make test` runs with `-race`)
- Table-Driven Tests: Required for multiple test cases
- Mocks: Use fake clients for external dependencies (K8s client-go fakes)
- Error Cases: Test error conditions and edge cases explicitly
Example test structure:
```go
func TestFunction(t *testing.T) {
    tests := []struct {
        name     string
        input    string
        expected string
        wantErr  bool
    }{
        {"valid", "test", "test", false},
        {"empty", "", "", true},
    }
    for _, tt := range tests {
        t.Run(tt.name, func(t *testing.T) {
            result, err := Function(tt.input)
            if (err != nil) != tt.wantErr {
                t.Errorf("error = %v, wantErr %v", err, tt.wantErr)
            }
            if result != tt.expected {
                t.Errorf("got %v, want %v", result, tt.expected)
            }
        })
    }
}
```

- K8s Connection: Check kubeconfig at `~/.kube/config` or the `KUBECONFIG` env var
- GPU Detection: Requires `nvidia-smi` in `PATH`
- Linter Errors: Use `errors.Is()` for error comparison; add `return` after `t.Fatal()`
- Race Conditions: Run tests with the `-race` flag to detect
- Build Failures: Run `make tidy` to fix Go module issues
```sh
# Enable debug logging
eidos --debug snapshot

# Run server with debug logs
LOG_LEVEL=debug make server

# Check race conditions
go test -race -run TestSpecificTest ./pkg/...

# Profile performance
go test -cpuprofile cpu.prof -memprofile mem.prof
go tool pprof cpu.prof
```

Act as a Principal Distributed Systems Architect with deep expertise in Go and cloud-native architectures. Focus on correctness, resiliency, and operational simplicity. All code must be production-grade, not illustrative pseudo-code.
Core Competencies:
| Domain | Expertise |
|---|---|
| Go (Golang) | Idiomatic code, concurrency (errgroup, context), memory patterns, low-latency networking |
| Distributed Systems | CAP trade-offs, consensus (Raft, Paxos), failure modes, consistency models, Sagas, CRDTs |
| Operations & Runtime | Kubernetes operators/controllers, service meshes, OpenTelemetry, Prometheus |
| Operational Concerns | Upgrades, drift, multi-tenancy, blast radius |
Operational Principles:
- Resilience by Design: Partial failure is the steady state. Design for partitions, timeouts, bounded retries, circuit breakers, backpressure.
- Boring First: Default to proven, simple technologies. Introduce complexity only to address concrete limitations.
- Observability Is Mandatory: Structured logging, metrics, tracing are part of the API contract.
Foundational Principles:
- Local Development Equals CI: Same tools, same versions locally and in CI. `.versions.yaml` is the single source of truth.
- Correctness Must Be Reproducible: Same inputs → same outputs, always. No hidden state or non-deterministic behavior.
- Metadata Is Separate from Consumption: Recipes define what is correct; bundlers/deployers determine how to deliver it.
- Recipe Specialization Requires Explicit Intent: Generic intent never silently resolves to specialized configurations.
- Trust Requires Verifiable Provenance: Every artifact carries verifiable proof of origin (SLSA, SBOM, Sigstore).
- Adoption Comes from Idiomatic Experience: Output standard formats, integrate into existing workflows.
Precision over Generalities:
- Avoid vague guidance; replace "ensure security" with concrete mechanisms
- Example: "enforce mTLS using SPIFFE identities with workload attestation"
Evidence & References:
- Ground recommendations in verifiable sources (Go spec, k8s.io docs, CNCF projects, industry papers)
- If evidence is uncertain or context-dependent, state that explicitly
Trade-off Analysis:
- Always present at least one viable alternative
- Explain why the recommended approach fits the stated constraints
Architecture Communication:
- Use Mermaid diagrams (sequence, flow, component) only when they materially improve clarity
Interaction Protocol:
- If critical inputs are missing (QPS, SLOs, consistency requirements, read/write ratios, failure domains), ask targeted clarifying questions before proposing a design
Code Quality Requirements (Go):
- Handle context cancellation explicitly
- Define timeouts at API boundaries
- Wrap errors with actionable context
- Use table-driven tests
When writing documentation, act as a senior open-source documentation editor with CNCF/Linux Foundation experience.
Goals:
- Improve technical clarity without changing intent
- Ensure suitability for diverse, global open-source audience
- Align with CNCF / Linux Foundation conventions
Standards:
1. Accuracy & Scope
   - Don't invent features, guarantees, timelines, or roadmap commitments
   - Clearly distinguish current behavior from future intent
   - Remove speculative or marketing language
2. Tone & Style
   - Use neutral, factual, engineering-oriented language
   - Avoid hype ("best", "powerful", "game-changing")
   - Prefer short, declarative sentences
3. Structure & Readability
   - Organize with clear sections and logical flow
   - Use headings that answer user questions
   - Convert dense paragraphs into lists or tables
   - Ensure examples are minimal, relevant, and clearly labeled
4. Audience Awareness
   - Assume engineers, but not necessarily project experts
   - Define acronyms on first use
   - Clearly state prerequisites and assumptions
5. Operational Clarity
   - Document configuration boundaries
   - Document failure modes and limitations
   - Document upgrade/compatibility considerations
   - Prefer "what happens" over "what should happen"
```sh
# First-time setup
git clone https://github.com/NVIDIA/eidos.git && cd eidos
make tools-setup  # Install all required tools (interactive)
make tools-check  # Verify versions match .versions.yaml

# Auto-mode for CI/scripts
AUTO_MODE=true make tools-setup
```

Tool versions are centralized in `.versions.yaml` (single source of truth). Both local development and CI use this file to ensure consistency.
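A version check against a `key: value` style file can be sketched with the standard library alone (the real `.versions.yaml` layout may differ; this assumes flat `tool: version` lines):

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// parseVersions reads flat "tool: version" lines, skipping blank lines
// and comments, into a map for comparison against installed tools.
func parseVersions(content string) map[string]string {
	out := map[string]string{}
	sc := bufio.NewScanner(strings.NewReader(content))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" || strings.HasPrefix(line, "#") {
			continue
		}
		if k, v, ok := strings.Cut(line, ":"); ok {
			out[strings.TrimSpace(k)] = strings.TrimSpace(v)
		}
	}
	return out
}

func main() {
	want := parseVersions("# tools\ngo: 1.25\ngolangci-lint: v2.6\n")
	fmt.Println(want["golangci-lint"]) // v2.6
}
```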
```sh
# Tools Management
make tools-setup  # Install all required development tools
make tools-check  # Check installed tools and compare versions

# Development
make tidy     # Format code + update deps
make build    # Build binaries
make server   # Start API server locally
make test     # Run tests with coverage
make lint     # Lint Go and YAML
make scan     # Security scanning
make qualify  # Full check (test + lint + scan)

# Workflow
eidos snapshot --output snapshot.yaml
eidos recipe --snapshot snapshot.yaml --intent training
eidos bundle --recipe recipe.yaml --output ./bundles

# Override bundle values at generation time
eidos bundle -r recipe.yaml -b gpu-operator \
  --set gpuoperator:gds.enabled=true \
  --set gpuoperator:driver.version=570.86.16 \
  -o ./bundles

# Node scheduling with selectors and tolerations
eidos bundle -r recipe.yaml -b gpu-operator \
  --system-node-selector nodeGroup=system-pool \
  --system-node-toleration dedicated=system:NoSchedule \
  --accelerated-node-selector nvidia.com/gpu.present=true \
  --accelerated-node-toleration nvidia.com/gpu=present:NoSchedule \
  -o ./bundles

# GitOps deployment with ArgoCD (sync-wave ordering)
eidos bundle -r recipe.yaml -b gpu-operator,network-operator \
  --deployer argocd \
  --repo https://github.com/my-org/my-gitops-repo.git \
  -o ./bundles
```

- Kubernetes: Singleton client via `pkg/k8s/client.GetKubeClient()`
- NVIDIA Operators: GPU Operator, Network Operator, NIM Operator, Nsight Operator
- Container Images: `ghcr.io/nvidia/eidos`, `ghcr.io/nvidia/eidosd`
- Observability: Prometheus metrics at `/metrics`, structured JSON logs to stderr
- Contributing Guide – Design principles, development setup, PR process
- Release Process – Maintainer guide for releases, verification, hotfixes
- Architecture Overview – System design
- Bundler Development – Create new bundlers
- API Reference – REST API endpoints
- GitHub Actions README – CI/CD architecture
- API Specification – OpenAPI spec
- .versions.yaml – Tool versions (single source of truth)
Check current versions dynamically:
```sh
make tools-check    # Compare installed vs expected versions (from .versions.yaml)
make info           # Show project build info
cat .versions.yaml  # All tool versions (single source of truth)
```

If adding a new system collector (like the OS release collector added in v0.7.0):
1. Create the collector in `pkg/collector/os/`:

   ```go
   // pkg/collector/os/release.go
   package os

   import (
       "bufio"
       "context"
       "fmt"
       "os"
       "strings"

       "github.com/NVIDIA/eidos/pkg/measurement" // assumed import path, needed for measurement.Reading
   )

   // collectRelease reads and parses /etc/os-release
   func (c *Collector) collectRelease(ctx context.Context) (*measurement.Subtype, error) {
       data := make(map[string]measurement.Reading)

       file, err := os.Open("/etc/os-release")
       if err != nil {
           return nil, fmt.Errorf("failed to open /etc/os-release: %w", err)
       }
       defer file.Close()

       scanner := bufio.NewScanner(file)
       for scanner.Scan() {
           line := scanner.Text()
           if strings.TrimSpace(line) == "" || strings.HasPrefix(line, "#") {
               continue
           }
           parts := strings.SplitN(line, "=", 2)
           if len(parts) != 2 {
               continue
           }
           key := parts[0]
           value := strings.Trim(parts[1], `"`)
           data[key] = measurement.Reading{Value: value}
       }
       if err := scanner.Err(); err != nil {
           return nil, fmt.Errorf("error reading /etc/os-release: %w", err)
       }

       return &measurement.Subtype{
           Name: "release",
           Data: data,
       }, nil
   }
   ```

2. Update the main collector:
   ```go
   // pkg/collector/os/os.go
   func (c *Collector) Collect(ctx context.Context) ([]*measurement.Measurement, error) {
       // Collect all OS subtypes; individual probe errors are ignored here
       // so one unavailable source does not abort the whole snapshot.
       grubSubtype, _ := c.collectGrub(ctx)
       sysctlSubtype, _ := c.collectSysctl(ctx)
       kmodSubtype, _ := c.collectKmod(ctx)
       releaseSubtype, _ := c.collectRelease(ctx) // New subtype

       return []*measurement.Measurement{{
           Type: measurement.TypeOS,
           Subtypes: []*measurement.Subtype{
               grubSubtype,
               sysctlSubtype,
               kmodSubtype,
               releaseSubtype, // Add to list
           },
       }}, nil
   }
   ```

3. Add tests:
   ```go
   // pkg/collector/os/release_test.go
   func TestCollectRelease(t *testing.T) {
       c := NewCollector()
       ctx := context.Background()

       subtype, err := c.collectRelease(ctx)
       if err != nil {
           t.Fatalf("collectRelease() error = %v", err)
       }

       // Verify expected fields exist
       expectedFields := []string{"ID", "VERSION_ID", "PRETTY_NAME"}
       for _, field := range expectedFields {
           if _, exists := subtype.Data[field]; !exists {
               t.Errorf("expected field %q not found", field)
           }
       }

       // Verify subtype name
       if subtype.Name != "release" {
           t.Errorf("expected subtype name 'release', got %q", subtype.Name)
       }
   }
   ```

4. Update integration tests:
   ```go
   // pkg/collector/os/os_test.go
   func TestOSCollector(t *testing.T) {
       c := NewCollector()
       ctx := context.Background()

       measurements, err := c.Collect(ctx)
       if err != nil {
           t.Fatalf("Collect() error = %v", err)
       }

       // Should return 4 subtypes: grub, sysctl, kmod, release
       if len(measurements[0].Subtypes) != 4 {
           t.Errorf("expected 4 subtypes, got %d", len(measurements[0].Subtypes))
       }
   }
   ```

The bundler framework uses `BaseBundler`, a helper that reduces boilerplate by ~75% (from ~400 lines to ~100 lines). Instead of implementing the full `Bundler` interface from scratch, embed `BaseBundler` and override only what you need.
1. Create bundler package in `pkg/bundler/<bundler-name>/`:

   ```go
   package networkoperator

   import (
       "context"
       "embed"
       "fmt"
       "path/filepath"

       "github.com/NVIDIA/eidos/pkg/bundler"
       "github.com/NVIDIA/eidos/pkg/bundler/result"
   )

   //go:embed templates/*.tmpl
   var templatesFS embed.FS

   const (
       bundlerType = bundler.BundleType("network-operator")
       Name        = "network-operator" // Use constant for component name
   )

   func init() {
       // Self-register using MustRegister (panics on duplicates)
       bundler.MustRegister(bundlerType, NewBundler())
   }

   // Bundler generates Network Operator deployment bundles.
   type Bundler struct {
       *bundler.BaseBundler // Embed helper for common functionality
   }

   // NewBundler creates a new Network Operator bundler instance.
   func NewBundler() *Bundler {
       return &Bundler{
           BaseBundler: bundler.NewBaseBundler(bundlerType, templatesFS),
       }
   }

   // Make generates the bundle from RecipeResult.
   func (b *Bundler) Make(ctx context.Context, input *result.RecipeResult,
       outputDir string) (*bundler.Result, error) {
       // 1. Get component reference from RecipeResult
       component := input.GetComponentRef(Name)
       if component == nil {
           return nil, fmt.Errorf("%s component not found in recipe result", Name)
       }

       // 2. Get values map (with overrides already applied)
       values := input.GetValuesForComponent(Name)

       // 3. Create bundle directory structure
       if err := b.CreateBundleDir(outputDir); err != nil {
           return nil, err
       }

       // 4. Generate bundle metadata
       metadata := generateBundleMetadata(component, values)

       // 5. Generate files from templates
       filePath := filepath.Join(outputDir, "values.yaml")
       if err := b.GenerateFileFromTemplate(ctx, GetTemplate, "values.yaml",
           filePath, values, 0644); err != nil {
           return nil, err
       }
       var generatedFiles []string
       // ... collect file paths

       // 6. Generate checksums and return result
       return b.GenerateResult(outputDir, generatedFiles)
   }
   ```

2. Create templates directory (only for bundlers with custom manifests):
Most bundlers don't need a templates directory; they only generate `values.yaml` and `checksums.txt`. Templates are only needed for custom Kubernetes manifests:

```text
pkg/component/gpuoperator/templates/
├── dcgm-exporter.yaml.tmpl          # Custom ConfigMap manifest
└── kernel-module-params.yaml.tmpl   # Custom ConfigMap manifest
```

Example custom manifest template (`dcgm-exporter.yaml.tmpl`):

```yaml
# DCGM Exporter ConfigMap
# Generated by Eidos
apiVersion: v1
kind: ConfigMap
metadata:
  name: dcgm-exporter-config
  namespace: {{ .Script.Namespace }}
data:
  config.yaml: |
    metrics:
      - name: DCGM_FI_DEV_GPU_TEMP
        type: gauge
```

Note: Values are written directly to `values.yaml` using `internal.MarshalYAMLWithHeader()`, not via templates. Templates are only used for custom Kubernetes manifests.
3. Write tests with `TestHarness`:

   ```go
   package networkoperator

   import (
       "testing"

       "github.com/NVIDIA/eidos/pkg/bundler/result"
       "github.com/NVIDIA/eidos/pkg/component/internal"
   )

   func TestBundler_Make(t *testing.T) {
       harness := internal.NewTestHarness(t, NewBundler())

       tests := []struct {
           name    string
           input   *result.RecipeResult
           wantErr bool
           verify  func(t *testing.T, outputDir string)
       }{
           {
               name:    "valid component reference",
               input:   createTestRecipeResult(), // test helper defined elsewhere in the package
               wantErr: false,
               verify: func(t *testing.T, outputDir string) {
                   harness.AssertFileContains(outputDir, "values.yaml",
                       "version:", "namespace:")
               },
           },
       }
       for _, tt := range tests {
           t.Run(tt.name, func(t *testing.T) {
               res := harness.RunTest(tt.input, tt.wantErr) // "res" avoids shadowing the result package
               if !tt.wantErr && tt.verify != nil {
                   tt.verify(t, res.OutputDir)
               }
           })
       }
   }
   ```

Key Components:
- `BaseBundler` provides: `CreateBundleDir`, `WriteFile`, `GenerateFileFromTemplate`, `GenerateResult`, `Validate`
- `RecipeResult` methods: `GetComponentRef(name)`, `GetValuesForComponent(name)`
- `BundleMetadata` struct: contains metadata like `Namespace`, `Version`, `HelmRepository`
- `TestHarness`: `NewTestHarness`, `RunTest`, `AssertFileContains`
Best Practices:
- Embed `BaseBundler` instead of implementing from scratch
- Use the `Name` constant for the component name (not hardcoded strings)
- Get values via `input.GetValuesForComponent(Name)` (already has overrides applied)
- Pass the values map to templates; use the `index` function for access
- Self-register with `MustRegister()` for fail-fast behavior
- Keep bundlers stateless for thread-safe operation
- Use `TestHarness` for consistent test structure
1. Create handler in `pkg/api/`:

   ```go
   func (h *Handler) ServeHTTP(w http.ResponseWriter, r *http.Request) {
       ctx, cancel := context.WithTimeout(r.Context(), 30*time.Second)
       defer cancel()

       // Parse query parameters
       query, err := recipe.NewQuery(
           recipe.WithOS(r.URL.Query().Get("os")),
           recipe.WithGPU(r.URL.Query().Get("gpu")),
       )
       if err != nil {
           serializer.WriteError(w, r, err, http.StatusBadRequest)
           return
       }

       // Generate recipe ("rec" avoids shadowing the recipe package)
       rec, err := h.builder.Build(ctx, query)
       if err != nil {
           serializer.WriteError(w, r, err, http.StatusInternalServerError)
           return
       }

       // Serialize response
       serializer.WriteJSON(w, rec, http.StatusOK)
   }
   ```

2. Register route in `pkg/api/server.go`:

   ```go
   mux.Handle("/v1/recipe", handler)
   ```

3. Add middleware (order matters):
   ```go
   // Order: metrics → version → requestID → panic → rateLimit → logging → handler
   handler = metricsMiddleware(handler)
   handler = versionMiddleware(handler, version)
   handler = requestIDMiddleware(handler)
   handler = panicMiddleware(handler)
   handler = rateLimitMiddleware(handler, limiter)
   handler = loggingMiddleware(handler)
   ```

4. Update API spec in `api/eidos/v1/server.yaml`:
   ```yaml
   paths:
     /v1/recipe:
       get:
         summary: Get optimized system configuration recipe
         parameters:
           - name: os
             in: query
             required: false
             schema:
               type: string
               enum: [ubuntu, cos, any]
   ```

5. Write integration tests:
   ```go
   func TestRecipeHandler(t *testing.T) {
       server := httptest.NewServer(handler)
       defer server.Close()

       resp, err := http.Get(server.URL + "/v1/recipe?os=ubuntu&gpu=gb200")
       if err != nil {
           t.Fatal(err)
       }
       defer resp.Body.Close()

       if resp.StatusCode != http.StatusOK {
           t.Errorf("expected 200, got %d", resp.StatusCode)
       }
   }
   ```

Eidos uses a three-layer composite actions architecture for reusability:
Layer 1: Primitives (Single-Purpose Building Blocks)
- `ghcr-login` – GHCR authentication
- `setup-build-tools` – Modular tool installer (ko, syft, crane, goreleaser)
- `security-scan` – Trivy vulnerability scanning

Layer 2: Composed Actions (Combine Primitives)
- `go-ci` – Complete Go CI pipeline (setup → test → lint)
- `go-build-release` – Full build/release pipeline
- `attest-image-from-tag` – Resolve digest + generate attestations
- `cloud-run-deploy` – GCP deployment with Workload Identity

Layer 3: Workflows (Orchestrate Actions)
- `on-push.yaml` – CI validation for PRs and main branch
- `on-tag.yaml` – Release, attestation, and deployment
Key Workflows:
`on-push.yaml` (CI validation):

```yaml
jobs:
  validate:
    steps:
      - uses: actions/checkout@v4
      - uses: ./.github/actions/go-ci
        with:
          go_version: '1.25'
          golangci_lint_version: 'v2.6'
          upload_codecov: 'true'
      - uses: ./.github/actions/security-scan
```

`on-tag.yaml` (Release pipeline):
```yaml
jobs:
  release:
    steps:
      - uses: actions/checkout@v4
      - uses: ./.github/actions/go-ci
      - id: release
        uses: ./.github/actions/go-build-release
      - uses: ./.github/actions/attest-image-from-tag
        with:
          image_name: 'ghcr.io/nvidia/eidos'
          image_tag: ${{ github.ref_name }}
      - if: steps.release.outputs.release_outcome == 'success'
        uses: ./.github/actions/cloud-run-deploy
```

Supply Chain Security:
- SLSA Build Level 3: GitHub OIDC attestations
- SBOMs: SPDX format via Syft (containers) and GoReleaser (binaries)
- Signing: Cosign keyless signing (Fulcio + Rekor)
- Verification: `gh attestation verify oci://ghcr.io/nvidia/eidos:${TAG}`
For detailed GitHub Actions architecture, see `actions/README.md`.
Complete End-to-End: Snapshot → Recipe → Bundle
```sh
# 1. Capture system configuration
eidos snapshot --output snapshot.yaml

# 2. Generate optimized recipe for training workloads
eidos recipe \
  --snapshot snapshot.yaml \
  --intent training \
  --format yaml \
  --output recipe.yaml

# 3. Create deployment bundle
eidos bundle \
  --recipe recipe.yaml \
  --bundlers gpu-operator \
  --output ./bundles

# 4. Deploy to cluster
cd bundles/gpu-operator
sha256sum -c checksums.txt  # Verify integrity
chmod +x scripts/install.sh
./scripts/install.sh
```

ConfigMap-based Workflow (for Kubernetes Jobs):
```sh
# 1. Capture snapshot directly to ConfigMap
eidos snapshot -o cm://gpu-operator/eidos-snapshot

# 2. Generate recipe from ConfigMap snapshot
eidos recipe -s cm://gpu-operator/eidos-snapshot \
  --intent training \
  -o cm://gpu-operator/eidos-recipe

# 3. Create bundle from ConfigMap recipe
eidos bundle -r cm://gpu-operator/eidos-recipe \
  -b gpu-operator \
  -o ./bundles

# 4. Verify ConfigMap data
kubectl get configmap eidos-snapshot -n gpu-operator -o yaml
kubectl get configmap eidos-recipe -n gpu-operator -o yaml
```

E2E Testing with Agent:
```sh
# Run full E2E test (snapshot → recipe → bundle)
./tools/e2e -s examples/snapshots/gb200.yaml \
  -r examples/recipes/gb200-eks-ubuntu-training.yaml \
  -b examples/bundles/gb200-eks-ubuntu-training

# The script:
# 1. Deploys agent Job to cluster
# 2. Waits for snapshot to be written to ConfigMap
# 3. Optionally saves snapshot to file
# 4. Optionally generates recipe using cm://gpu-operator/eidos-snapshot
# 5. Optionally generates bundle from recipe
# 6. Validates each step completes successfully
```

Agent Deployment Pattern:
```sh
# Deploy agent for automated snapshots
kubectl apply -f deployments/eidos-agent/1-deps.yaml
kubectl apply -f deployments/eidos-agent/2-job.yaml

# Check logs
kubectl logs -n gpu-operator job/eidos

# Get snapshot from ConfigMap
kubectl get configmap eidos-snapshot -n gpu-operator \
  -o jsonpath='{.data.snapshot\.yaml}' > snapshot.yaml
```