SKILL.md

HPC Benchmark Suite — Agent Guide

This doc tells an AI agent how to work with the HPC Bench Suite: how to run it, what context to assume, how to make changes, and how to debug. For version history and changelog, see CHANGELOG.md.

What this suite is

Purpose: Benchmark and inventory HPC servers (GPU, CPU, network, storage, topology). Produces one JSON file per module and a Markdown report.
Target: Bare-metal HPC. VMs are supported for convenience, but many modules skip or degrade there (DCGM, nvbandwidth, HPL-MxP, InfiniBand, BMC). See README "Target environment" and CHANGELOG for VM-related behavior.
Language: Pure Bash. Designed for minimal dependencies and portability across HPC environments.

Codebase structure

bench/
├── conf/
│   ├── defaults.sh          # Central tunables (timeouts, thresholds, container images)
│   └── local.sh             # Optional user overrides (gitignored)
├── docs/
│   └── DEVELOPMENT.md       # Detailed dev conventions, JSON contract, config guide
├── lib/
│   ├── common.sh            # Core shared library (logging, JSON, GPU, cleanup, etc.)
│   └── report-common.sh     # Report scoring logic, module status helpers
├── scripts/
│   ├── bootstrap.sh          # Phase 0: hardware detection, tool install
│   ├── runtime-sanity.sh     # Phase 1: pre-flight checks
│   ├── inventory.sh          # Phase 2: CPU/RAM/storage/OS discovery
│   ├── gpu-inventory.sh      # Phase 2: NVIDIA GPU inventory
│   ├── topology.sh           # Phase 2: NUMA/PCIe/NVLink topology
│   ├── network-inventory.sh  # Phase 2: NICs, InfiniBand, link info
│   ├── bmc-inventory.sh      # Phase 2: IPMI/BMC sensors/firmware
│   ├── software-audit.sh     # Phase 2: CUDA, cuDNN, NCCL, MPI, etc.
│   ├── dcgm-diag.sh          # Phase 3: DCGM diagnostics
│   ├── gpu-burn.sh           # Phase 3: GPU stress test
│   ├── nccl-tests.sh         # Phase 3: NCCL collective benchmarks
│   ├── nvbandwidth.sh        # Phase 3: GPU memory/PCIe/NVLink bandwidth
│   ├── stream-bench.sh       # Phase 3: STREAM memory bandwidth
│   ├── storage-bench.sh      # Phase 3: fio storage benchmarks
│   ├── hpl-cpu.sh            # Phase 3: HPL CPU Linpack
│   ├── hpl-mxp.sh            # Phase 3: HPL-MxP GPU mixed-precision
│   ├── ib-tests.sh           # Phase 3: InfiniBand loopback perftest
│   ├── network-diag.sh       # Phase 4: firewall, ports, routing, DNS
│   ├── filesystem-diag.sh    # Phase 4: mounts, FS types, RAID, NFS/Lustre
│   ├── thermal-power.sh      # Phase 4: thermals, fans, PSU, throttling
│   ├── security-scan.sh      # Phase 4: SSH audit, services, SUID, kernel
│   ├── report.sh             # Phase 5: generates report.md from results
│   ├── run-all.sh            # Orchestrator (runs all phases)
│   ├── ci-static-checks.sh   # CI-only: shellcheck + syntax checks
│   └── check-updates.sh      # Dependency update checker (not a module)
├── specs/
│   ├── modules.json          # Module manifest (single source of truth)
│   ├── dependencies.json     # Tracked external dependency versions
│   └── update-history.json   # Dependency update audit log
├── src/                       # Bundled sources (gpu-burn, nccl-tests, STREAM)
├── tests/
│   ├── helpers.bash           # Shared BATS test helpers
│   ├── common_helpers.bats    # Unit tests for lib/common.sh
│   ├── report_helpers.bats    # Unit tests for lib/report-common.sh
│   ├── module_integration.bats # Integration tests (syntax, manifest, source-gate)
│   └── check_updates.bats    # Tests for dependency checker + manifest
├── .editorconfig              # Formatting rules
├── .pre-commit-config.yaml    # Pre-commit hooks (shfmt, shellcheck)
├── Makefile                   # Quality gates: make lint, make test, make smoke
├── CONTRIBUTING.md            # Contributor guidelines
├── SKILL.md                   # This file (agent guide)
└── README.md                  # User-facing documentation

How to use it (like a human would)

Full run (root): sudo bash scripts/bootstrap.sh then sudo bash scripts/run-all.sh.
Quick validation: sudo bash scripts/run-all.sh --quick (short benchmarks).
Smoke (no benchmarks): sudo bash scripts/run-all.sh --smoke (bootstrap + discovery + report).
NVIDIA driver + CUDA on Ubuntu: sudo bash scripts/bootstrap.sh --install-nvidia (internet required; reboot after driver install, then run bootstrap again).
Check deps only: bash scripts/bootstrap.sh --check-only (no root; lists present/missing).
Single module: e.g. bash scripts/gpu-inventory.sh, bash scripts/gpu-burn.sh — run from repo root; most need prior bootstrap and root for full behavior.
Results: Default $HPC_RESULTS_DIR is /var/log/hpc-bench/results (root) or $HOME/.local/var/hpc-bench/results (non-root). Report: report.md; logs: logs/; per-module JSON: bootstrap.json, gpu-burn.json, etc.
Quality checks: make check (shellcheck + bats + static), make lint (shellcheck only), make test (bats only).
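
The default results location described in the Results item can be sketched as a small function. This is a minimal sketch, not the suite's actual code: default_results_dir is a hypothetical name, and the sketch assumes that an already-set HPC_RESULTS_DIR takes precedence over the documented defaults.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the suite): resolve the results directory.
# Assumption: an already-set HPC_RESULTS_DIR wins over the documented defaults.
default_results_dir() {
    local uid="${1:-$(id -u)}"
    if [ -n "${HPC_RESULTS_DIR:-}" ]; then
        printf '%s\n' "$HPC_RESULTS_DIR"
    elif [ "$uid" -eq 0 ]; then
        printf '%s\n' "/var/log/hpc-bench/results"
    else
        printf '%s\n' "$HOME/.local/var/hpc-bench/results"
    fi
}

# Show where the report, logs, and one per-module JSON file would live.
results_dir="$(default_results_dir)"
echo "report:  $results_dir/report.md"
echo "logs:    $results_dir/logs/"
echo "example: $results_dir/gpu-burn.json"
```

Passing a UID as the first argument (for example default_results_dir 0) shows the root path without needing sudo.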

Context to assume: User may be on a VM, air-gapped, or without NVIDIA GPU. Scripts skip modules when prerequisites are missing (e.g. no IB hardware, no nvidia-smi) and emit JSON with status: "skipped" and .note/.skip_reason. Exit codes: 0 = success or conditional pass; 1 = at least one non-skipped module failed; 2 = lockfile held (another run in progress).
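
When an agent wraps a run, it helps to translate the exit code into the meaning listed above. The following is a minimal sketch; describe_exit_code is a hypothetical helper, not something the suite provides.

```shell
#!/usr/bin/env bash
# Hypothetical helper: map a run-all.sh exit code to the meaning given in this guide.
describe_exit_code() {
    local rc="$1"
    case "$rc" in
        0) echo "success or conditional pass" ;;
        1) echo "at least one non-skipped module failed" ;;
        2) echo "lockfile held (another run in progress)" ;;
        *) echo "unexpected exit code: $rc" ;;
    esac
}

# Intended use (commented out so the sketch runs without the suite or root):
#   sudo bash scripts/run-all.sh --quick
#   describe_exit_code "$?"
describe_exit_code 2
```

Codes outside 0 to 2 are reported as unexpected rather than guessed at, since this guide does not define them.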

Making changes

Adding a new module

Create scripts/<name>.sh with the standard header:

#!/usr/bin/env bash
# <name>.sh -- <one-line description>
# Phase: <N> (<phase-name>)
# Requires: <cmd1>, <cmd2>
# Emits: <name>.json
SCRIPT_NAME="<name>"
source "$(dirname "$0")/../lib/common.sh"

Register in specs/modules.json with name, script, phase, order, and required_cmds.
Use skip_module or skip_module_with_data when prerequisites are missing (exit 0).
Use finish_module or emit_json to write results (the EXIT trap writes a crash record if you forget).
Use status: "error" + exit 1 for failures; never status: "error" + exit 0.
Add scoring logic in lib/report-common.sh score_module() if the module has custom pass/fail criteria.
Add report rendering in scripts/report.sh emit_benchmark_results() or the appropriate emit function.
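
The prerequisite check that precedes a skip can be sketched as follows. This is an illustration only: missing_cmds is a hypothetical helper, and because this guide does not show the signature of skip_module, the sketch only prints what a real module would do at that point.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print each required command that is not on PATH.
missing_cmds() {
    local cmd
    for cmd in "$@"; do
        command -v "$cmd" >/dev/null 2>&1 || printf '%s\n' "$cmd"
    done
}

# Example for a module whose header says "Requires: nvidia-smi, jq".
missing="$(missing_cmds nvidia-smi jq | tr '\n' ' ')"
if [ -n "$missing" ]; then
    # A real module would call skip_module here and exit 0.
    echo "would skip; missing: $missing"
else
    echo "prerequisites present"
fi
```

Keeping the command list identical to required_cmds in specs/modules.json avoids the header and the manifest drifting apart.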

Modifying an existing module

All tunables (timeouts, thresholds, durations, container images) live in conf/defaults.sh. Override in conf/local.sh for site-specific values.
Shared utilities (CUDA detection, container runtime, HPL.dat generation, JSON helpers) are in lib/common.sh. Use them instead of duplicating logic.
Follow the naming convention: lower_snake_case for local variables, UPPER_SNAKE_CASE for exported/global variables.
Wrap lines at ~120 characters (see .editorconfig).
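
The defaults-plus-overrides layering can be illustrated with throwaway files. This is a sketch under assumptions: load_config is a hypothetical loader, EXAMPLE_TIMEOUT_SEC is an invented tunable name, and the suite's real loading order and variable names may differ.

```shell
#!/usr/bin/env bash
# Hypothetical loader: source defaults.sh, then local.sh if it exists.
# Assumption: later assignments in local.sh simply override the defaults.
load_config() {
    local conf_dir="$1"
    # shellcheck source=/dev/null
    . "$conf_dir/defaults.sh"
    if [ -f "$conf_dir/local.sh" ]; then
        # shellcheck source=/dev/null
        . "$conf_dir/local.sh"
    fi
}

# Demo with temporary files; EXAMPLE_TIMEOUT_SEC is an invented tunable name.
demo_dir="$(mktemp -d)"
printf 'EXAMPLE_TIMEOUT_SEC=300\n' > "$demo_dir/defaults.sh"
printf 'EXAMPLE_TIMEOUT_SEC=900\n' > "$demo_dir/local.sh"
load_config "$demo_dir"
echo "EXAMPLE_TIMEOUT_SEC=$EXAMPLE_TIMEOUT_SEC"   # prints EXAMPLE_TIMEOUT_SEC=900
rm -rf "$demo_dir"
```

Site-specific values stay in the gitignored local.sh, so defaults.sh can be updated without merge conflicts.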

Running tests

make test — runs all BATS tests.
make lint — runs shellcheck on all .sh files.
make check — runs lint + test + static checks.
make smoke — runs run-all.sh --smoke end-to-end.
make quick — runs run-all.sh --quick end-to-end.
make check-updates — checks 14 tracked dependencies for upstream updates.

JSON contract

Every module MUST emit a JSON file to $HPC_RESULTS_DIR/<module>.json with at least:

{
  "module": "<module-name>",
  "status": "ok|warn|error|skipped",
  "timestamp": "ISO8601"
}

status values: ok, warn, error, skipped — these are the ONLY valid values.
Skipped modules must include skip_reason (human-readable string).
Error modules must include error (human-readable string).
Additional fields are module-specific.
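
The minimal contract can be checked with a few lines of shell. The sketch below is illustrative and deliberately avoids the suite's own emit_json, whose signature is not shown in this guide; is_valid_status and minimal_record are hypothetical names. It prints only the three required fields, so a real skipped or error record would still need skip_reason or error added.

```shell
#!/usr/bin/env bash
# Hypothetical helpers illustrating the minimal contract (not the suite's emit_json).
is_valid_status() {
    case "$1" in
        ok | warn | error | skipped) return 0 ;;
        *) return 1 ;;
    esac
}

# Print a minimal record. Assumes the module name needs no JSON escaping.
minimal_record() {
    local module="$1" status="$2" ts
    is_valid_status "$status" || {
        echo "invalid status: $status" >&2
        return 1
    }
    ts="$(date -u +%Y-%m-%dT%H:%M:%SZ)"   # ISO 8601, UTC
    printf '{"module":"%s","status":"%s","timestamp":"%s"}\n' "$module" "$status" "$ts"
}

minimal_record "gpu-burn" "ok"
minimal_record "gpu-burn" "pass" || echo "rejected: pass is not a valid status"
```

Rejecting unknown status values before anything is written catches the pass/fail mistake early.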

Common pitfalls

Things an AI agent should never do:

Use pass or fail as status values. The only valid values are ok, warn, error, skipped.
Forget to register a new module in specs/modules.json. The orchestrator (run-all.sh) reads this manifest; unregistered scripts won't run.
Emit status: "error" and then exit 0. Error status must pair with exit 1.
Use eval for dynamic variable access. Use the existing JSON helpers (json_compact_or, sanitize_json_str, etc.) instead.
Omit local on variables inside functions. All lowercase variables inside functions must be declared local.
Break the JSON output structure. Always validate JSON before writing (use emit_json_safe for untrusted data, or json_compact_or for validation).
Hardcode CUDA paths. Use detect_cuda_home() from lib/common.sh.
Hardcode container runtime. Use detect_container_runtime() from lib/common.sh.
Use function name() syntax. Always use name() { style.
Skip writing JSON on error. The EXIT trap writes a crash record if you forget, but it's better to explicitly call emit_json with status: "error".
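
Several of these rules (valid status values, status and exit code pairing, local variables, name() { style) can be seen together in one short function. This is a sketch with a hypothetical helper name, exit_code_for_status. The guide states that error pairs with exit 1 and that skips exit 0; mapping ok and warn to 0 is an assumption.

```shell
#!/usr/bin/env bash
# Hypothetical helper: pick the exit code that pairs with a module status.
# From the guide: error -> 1, skipped -> 0. Treating ok and warn as 0 is an assumption.
exit_code_for_status() {
    local status="$1"   # lowercase function variables are declared local
    case "$status" in
        error) echo 1 ;;
        ok | warn | skipped) echo 0 ;;
        *)
            echo "invalid status: $status" >&2
            return 1
            ;;
    esac
}

for s in ok warn skipped error; do
    echo "$s -> exit $(exit_code_for_status "$s")"
done
```

Values such as pass or fail fall through to the invalid branch instead of being silently mapped to an exit code.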

Debugging

Logs: Each module logs to $HPC_LOG_DIR/<SCRIPT_NAME>.log (e.g. logs/bootstrap.log). Stdout from benchmarks is often in logs/<module>-stdout.log or a similarly named file.
Lock: A run takes $HPC_RESULTS_DIR/.hpc-bench.lock to prevent concurrent runs; a second run exits 2 while the lock is held. Remove the lock only if no other run is active.
Crash records: If a module exits without writing its JSON (e.g. crash, kill), the EXIT trap in common.sh writes a crash record for that module so the report still has an entry.
JSON validity: Many scripts pipe data into jq. Control characters or invalid JSON in variables (e.g. from systemd-detect-virt, DMI, or nvidia-smi) can cause "parse error: Invalid string: control characters…". Fix by sanitizing before building JSON (e.g. strip U+0000–U+001F or use sanitize_json_str / temp files + --slurpfile).
Bootstrap "no internet": The connectivity check tries several URLs; on failure it logs a "Connectivity check hint" (the first line of the curl error). Behind a proxy, the user may need sudo -E or to set HTTP_PROXY/HTTPS_PROXY for root.
Bootstrap "no NVIDIA GPU" on a machine that has GPUs: Detection uses lspci, then falls back to /sys/bus/pci/devices/*/vendor (0x10de). Ensure pciutils is installed, or rely on the sysfs fallback.
Module-specific: See CHANGELOG for known bugs and fixes (e.g. gpu-inventory VM-safe behavior, nccl-tests parser, storage-bench fio output, report field names). Specs and GPU lookup: specs/hardware-specs.json and lookup_gpu_spec() in lib/common.sh (fuzzy match by GPU name).
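
For the JSON validity item above, stripping the U+0000 to U+001F range can be sketched with tr. This is a stand-in for illustration, not the suite's sanitize_json_str, and strip_control_chars is a hypothetical name. It removes tabs and newlines as well, and it does not escape quotes or backslashes, which a value still needs before being embedded in JSON.

```shell
#!/usr/bin/env bash
# Hypothetical filter (a stand-in, not the suite's sanitize_json_str):
# delete bytes 0x00-0x1F from stdin. This also removes tabs and newlines.
strip_control_chars() {
    tr -d '\000-\037'
}

# Simulate tool output that contains a stray control byte and a tab.
raw="$(printf 'kvm\001\tguest')"
clean="$(printf '%s' "$raw" | strip_control_chars)"
printf 'clean value: %s\n' "$clean"   # prints: clean value: kvmguest
```

Sanitizing at the point where external output is captured keeps every later jq call from failing on the same value.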

Key files for agents

Need	File / location
User-facing usage	README.md
Version number	VERSION
Changelog / version history	CHANGELOG.md
Shared helpers, logging, emit_json, GPU/CUDA/container detection	lib/common.sh
Report scoring logic, module status helpers	lib/report-common.sh
Module manifest (single source of truth for modules)	specs/modules.json
Central defaults (timeouts, thresholds, container images)	conf/defaults.sh
Entry points	scripts/bootstrap.sh, scripts/run-all.sh
GPU spec lookup	lib/common.sh `lookup_gpu_spec`, specs/hardware-specs.json
Dev conventions, JSON contract, config guide	docs/DEVELOPMENT.md
Portable bundle (tarball + single .run for USB, air-gapped, agent reference)	docs/PORTABLE-BUNDLE.md
Contributor guidelines	CONTRIBUTING.md
Quality gates	Makefile
Dependency version manifest	specs/dependencies.json
Dependency update checker	scripts/check-updates.sh
Dependency update history	specs/update-history.json
