This doc tells an AI agent how to work with the HPC Bench Suite: how to run it, what context to assume, how to make changes, and how to debug. For version history and changelog, see CHANGELOG.md.
- Purpose: Benchmark and inventory HPC servers (GPU, CPU, network, storage, topology). Produces JSON per module and a markdown report.
- Target: Bare-metal HPC. VMs are supported for convenience; many modules skip or degrade (DCGM, nvbandwidth, HPL-MxP, InfiniBand, BMC). See README "Target environment" and CHANGELOG for VM-related behavior.
- Language: Pure Bash. Designed for minimal dependencies and portability across HPC environments.
bench/
├── conf/
│ ├── defaults.sh # Central tunables (timeouts, thresholds, container images)
│ └── local.sh # Optional user overrides (gitignored)
├── docs/
│ └── DEVELOPMENT.md # Detailed dev conventions, JSON contract, config guide
├── lib/
│ ├── common.sh # Core shared library (logging, JSON, GPU, cleanup, etc.)
│ └── report-common.sh # Report scoring logic, module status helpers
├── scripts/
│ ├── bootstrap.sh # Phase 0: hardware detection, tool install
│ ├── runtime-sanity.sh # Phase 1: pre-flight checks
│ ├── inventory.sh # Phase 2: CPU/RAM/storage/OS discovery
│ ├── gpu-inventory.sh # Phase 2: NVIDIA GPU inventory
│ ├── topology.sh # Phase 2: NUMA/PCIe/NVLink topology
│ ├── network-inventory.sh # Phase 2: NICs, InfiniBand, link info
│ ├── bmc-inventory.sh # Phase 2: IPMI/BMC sensors/firmware
│ ├── software-audit.sh # Phase 2: CUDA, cuDNN, NCCL, MPI, etc.
│ ├── dcgm-diag.sh # Phase 3: DCGM diagnostics
│ ├── gpu-burn.sh # Phase 3: GPU stress test
│ ├── nccl-tests.sh # Phase 3: NCCL collective benchmarks
│ ├── nvbandwidth.sh # Phase 3: GPU memory/PCIe/NVLink bandwidth
│ ├── stream-bench.sh # Phase 3: STREAM memory bandwidth
│ ├── storage-bench.sh # Phase 3: fio storage benchmarks
│ ├── hpl-cpu.sh # Phase 3: HPL CPU Linpack
│ ├── hpl-mxp.sh # Phase 3: HPL-MxP GPU mixed-precision
│ ├── ib-tests.sh # Phase 3: InfiniBand loopback perftest
│ ├── network-diag.sh # Phase 4: firewall, ports, routing, DNS
│ ├── filesystem-diag.sh # Phase 4: mounts, FS types, RAID, NFS/Lustre
│ ├── thermal-power.sh # Phase 4: thermals, fans, PSU, throttling
│ ├── security-scan.sh # Phase 4: SSH audit, services, SUID, kernel
│ ├── report.sh # Phase 5: generates report.md from results
│ ├── run-all.sh # Orchestrator (runs all phases)
│ ├── ci-static-checks.sh # CI-only: shellcheck + syntax checks
│ └── check-updates.sh # Dependency update checker (not a module)
├── specs/
│ ├── modules.json # Module manifest (single source of truth)
│ ├── dependencies.json # Tracked external dependency versions
│ └── update-history.json # Dependency update audit log
├── src/ # Bundled sources (gpu-burn, nccl-tests, STREAM)
├── tests/
│ ├── helpers.bash # Shared BATS test helpers
│ ├── common_helpers.bats # Unit tests for lib/common.sh
│ ├── report_helpers.bats # Unit tests for lib/report-common.sh
│ ├── module_integration.bats # Integration tests (syntax, manifest, source-gate)
│ └── check_updates.bats # Tests for dependency checker + manifest
├── .editorconfig # Formatting rules
├── .pre-commit-config.yaml # Pre-commit hooks (shfmt, shellcheck)
├── Makefile # Quality gates: make lint, make test, make smoke
├── CONTRIBUTING.md # Contributor guidelines
├── SKILL.md # This file (agent guide)
└── README.md # User-facing documentation
- Full run (root):
sudo bash scripts/bootstrap.shthensudo bash scripts/run-all.sh. - Quick validation:
sudo bash scripts/run-all.sh --quick(short benchmarks). - Smoke (no benchmarks):
sudo bash scripts/run-all.sh --smoke(bootstrap + discovery + report). - NVIDIA driver + CUDA on Ubuntu:
sudo bash scripts/bootstrap.sh --install-nvidia(internet required; reboot after driver install, then run bootstrap again). - Check deps only:
bash scripts/bootstrap.sh --check-only(no root; lists present/missing). - Single module: e.g.
bash scripts/gpu-inventory.sh,bash scripts/gpu-burn.sh— run from repo root; most need prior bootstrap and root for full behavior. - Results: Default
$HPC_RESULTS_DIRis/var/log/hpc-bench/results(root) or$HOME/.local/var/hpc-bench/results(non-root). Report:report.md; logs:logs/; per-module JSON:bootstrap.json,gpu-burn.json, etc. - Quality checks:
make check(shellcheck + bats + static),make lint(shellcheck only),make test(bats only).
Context to assume: User may be on a VM, air-gapped, or without NVIDIA GPU. Scripts skip modules when prerequisites are missing (e.g. no IB hardware, no nvidia-smi) and emit JSON with status: "skipped" and .note/.skip_reason. Exit codes: 0 = success or conditional pass; 1 = at least one non-skipped module failed; 2 = lockfile held (another run in progress).
- Create
scripts/<name>.shwith the standard header:#!/usr/bin/env bash # <name>.sh -- <one-line description> # Phase: <N> (<phase-name>) # Requires: <cmd1>, <cmd2> # Emits: <name>.json SCRIPT_NAME="<name>" source "$(dirname "$0")/../lib/common.sh"
- Register in
specs/modules.jsonwithname,script,phase,order, andrequired_cmds. - Use
skip_moduleorskip_module_with_datawhen prerequisites are missing (exit 0). - Use
finish_moduleoremit_jsonto write results (the EXIT trap writes a crash record if you forget). - Use
status: "error"+exit 1for failures; neverstatus: "error"+exit 0. - Add scoring logic in
lib/report-common.shscore_module()if the module has custom pass/fail criteria. - Add report rendering in
scripts/report.shemit_benchmark_results()or the appropriate emit function.
- All tunables (timeouts, thresholds, durations, container images) live in
conf/defaults.sh. Override inconf/local.shfor site-specific values. - Shared utilities (CUDA detection, container runtime, HPL.dat generation, JSON helpers) are in
lib/common.sh. Use them instead of duplicating logic. - Follow the naming convention:
lower_snake_casefor local variables,UPPER_SNAKE_CASEfor exported/global variables. - Wrap lines at ~120 characters (see
.editorconfig).
make test— runs all BATS tests.make lint— runs shellcheck on all.shfiles.make check— runs lint + test + static checks.make smoke— runsrun-all.sh --smokeend-to-end.make quick— runsrun-all.sh --quickend-to-end.make check-updates— checks 14 tracked dependencies for upstream updates.
Every module MUST emit a JSON file to $HPC_RESULTS_DIR/<module>.json with at least:
{
"module": "<module-name>",
"status": "ok|warn|error|skipped",
"timestamp": "ISO8601"
}statusvalues:ok,warn,error,skipped— these are the ONLY valid values.- Skipped modules must include
skip_reason(human-readable string). - Error modules must include
error(human-readable string). - Additional fields are module-specific.
Things an AI agent should never do:
- Use
passorfailas status values. The only valid values areok,warn,error,skipped. - Forget to register a new module in
specs/modules.json. The orchestrator (run-all.sh) reads this manifest; unregistered scripts won't run. - Emit
status: "error"and thenexit 0. Error status must pair withexit 1. - Use
evalfor dynamic variable access. Use the existing JSON helpers (json_compact_or,sanitize_json_str, etc.) instead. - Omit
localin function variables. All lowercase variables inside functions must be declaredlocal. - Break the JSON output structure. Always validate JSON before writing (use
emit_json_safefor untrusted data, orjson_compact_orfor validation). - Hardcode CUDA paths. Use
detect_cuda_home()fromlib/common.sh. - Hardcode container runtime. Use
detect_container_runtime()fromlib/common.sh. - Use
function name()syntax. Always usename() {style. - Skip writing JSON on error. The EXIT trap writes a crash record if you forget, but it's better to explicitly call
emit_jsonwithstatus: "error".
- Logs: Each module logs to
$HPC_LOG_DIR/<SCRIPT_NAME>.log(e.g.logs/bootstrap.log). Stdout from benchmarks often inlogs/<module>-stdout.logor similar. - Lock: Concurrent runs use
$HPC_RESULTS_DIR/.hpc-bench.lock. Exit 2 if locked; remove lock only if no other run is active. - Crash records: If a module exits without writing its JSON (e.g. crash, kill), the EXIT trap in common.sh writes a crash record for that module so the report still has an entry.
- JSON validity: Many scripts pipe data into
jq. Control characters or invalid JSON in variables (e.g. fromsystemd-detect-virt, DMI, or nvidia-smi) can cause "parse error: Invalid string: control characters…". Fix by sanitizing before building JSON (e.g. strip U+0000–U+001F or usesanitize_json_str/ temp files +--slurpfile). - Bootstrap "no internet": Connectivity check tries several URLs; on failure it logs a "Connectivity check hint" (first line of curl error). If the user has a proxy, they may need
sudo -Eor to set HTTP_PROXY/HTTPS_PROXY for root. If "no NVIDIA GPU" but the machine has GPUs, detection useslspcithen fallback/sys/bus/pci/devices/*/vendor(0x10de); ensure pciutils is installed or rely on the sys fallback. - Module-specific: See CHANGELOG for known bugs and fixes (e.g. gpu-inventory VM-safe behavior, nccl-tests parser, storage-bench fio output, report field names). Specs and GPU lookup:
specs/hardware-specs.jsonandlookup_gpu_spec()inlib/common.sh(fuzzy match by GPU name).
| Need | File / location |
|---|---|
| User-facing usage | README.md |
| Version number | VERSION |
| Changelog / version history | CHANGELOG.md |
| Shared helpers, logging, emit_json, GPU/CUDA/container detection | lib/common.sh |
| Report scoring logic, module status helpers | lib/report-common.sh |
| Module manifest (single source of truth for modules) | specs/modules.json |
| Central defaults (timeouts, thresholds, container images) | conf/defaults.sh |
| Entry points | scripts/bootstrap.sh, scripts/run-all.sh |
| GPU spec lookup | lib/common.sh lookup_gpu_spec, specs/hardware-specs.json |
| Dev conventions, JSON contract, config guide | docs/DEVELOPMENT.md |
| Portable bundle (tarball + single .run for USB, air-gapped, agent reference) | docs/PORTABLE-BUNDLE.md |
| Contributor guidelines | CONTRIBUTING.md |
| Quality gates | Makefile |
| Dependency version manifest | specs/dependencies.json |
| Dependency update checker | scripts/check-updates.sh |
| Dependency update history | specs/update-history.json |