Independent hostile audit; this auditor did not build or run cycles A-G. Target: 3rd consecutive clean audit → PASS Phase 2.
- Rust toolchain: stable-x86_64-pc-windows-msvc (cargo via rustup).
- Clippy/rustfmt not installed — build/test only.
- Release binary rebuilt (`cargo build --release`, 6.05s clean).
`cargo test --release`: 180 passed, 0 failed
- 166 unit tests (in `src/`, inline `#[cfg(test)] mod tests`)
- 3 integration_compare, 3 integration_criterion, 2 integration_csv, 3 integration_gobench, 3 integration_hyperfine
- 0 ignored, 0 filtered.
Four golden cases. benchdiff computes t = (mean_b - mean_a)/se; scipy computes (mean_a - mean_b)/se, so the sign of t flips while the magnitude of t, df, and p must match to ~8 decimals.
| Case | scipy \|t\|, df, p | benchdiff \|t\|, df, p | Δ |
|---|---|---|---|
| [1..5] vs [1.5..5.5] | 0.5, 8, 6.30536076e-1 | 0.5, 8, 6.30536076e-1 | 0 |
| [10.0..10.8] vs [12.0..12.1] | 7.44090644, 6.72904250, 1.75293187e-4 | 7.44090644, 6.72904250, 1.75293187e-4 | 0 |
| 6×100 vs 6×200 | 66.42111642, 6.67436490, 1.14830910e-10 | 66.42111642, 6.67436490, 1.14830910e-10 | 0 |
| 7×100 vs 7×200 | 285.29870108, 12.0, 2.31443649e-24 | 285.29870108, 12.0, 2.31443649e-24 | 0 |
Exact match to 8 decimals on the magnitude of t, df, and p across all four cases. At these scales, the Lentz continued fraction + Lanczos ln_gamma implementation agrees with scipy on every digit compared.
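The non-integer df values in the table suggest Welch's unequal-variance t-test, although the report does not name it. As a minimal sketch under that assumption, the following reproduces t and df for the first golden case, using the report's sign convention. `mean_var` and `welch_t_df` are hypothetical names, not benchdiff's API, and reading `[1..5]` as 1, 2, 3, 4, 5 is also an assumption. The p-value step (incomplete beta function) is left out.

```rust
/// Sample mean and unbiased (n - 1) sample variance.
fn mean_var(xs: &[f64]) -> (f64, f64) {
    let n = xs.len() as f64;
    let mean = xs.iter().sum::<f64>() / n;
    let var = xs.iter().map(|x| (x - mean).powi(2)).sum::<f64>() / (n - 1.0);
    (mean, var)
}

/// Welch's t statistic and Welch-Satterthwaite degrees of freedom.
/// Sign convention follows the report: t = (mean_b - mean_a) / se.
/// Hypothetical helper: assumes at least 2 values per sample and se > 0.
fn welch_t_df(a: &[f64], b: &[f64]) -> (f64, f64) {
    let (mean_a, var_a) = mean_var(a);
    let (mean_b, var_b) = mean_var(b);
    let (na, nb) = (a.len() as f64, b.len() as f64);
    // Squared standard error contributed by each sample.
    let (va, vb) = (var_a / na, var_b / nb);
    let se = (va + vb).sqrt();
    let t = (mean_b - mean_a) / se;
    let df = (va + vb).powi(2) / (va * va / (na - 1.0) + vb * vb / (nb - 1.0));
    (t, df)
}

fn main() {
    let a = [1.0, 2.0, 3.0, 4.0, 5.0];
    let b = [1.5, 2.5, 3.5, 4.5, 5.5];
    let (t, df) = welch_t_df(&a, &b);
    // Prints: t = 0.50000000, df = 8.00000000
    println!("t = {t:.8}, df = {df:.8}");
    // Swapping the arguments gives the scipy-style sign: t = -0.5.
    let (t_swapped, _) = welch_t_df(&b, &a);
    println!("swapped t = {t_swapped:.8}");
}
```

Swapping the two samples flips only the sign of t, which is why the comparison above is made on magnitudes.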
pvals = [0.001, 0.04, 0.03, 0.001, 0.2, 0.05, 0.01, 0.0001, 0.5, 0.02], q=0.05.
- Python reference (hand): `[T, F, T, T, F, F, T, T, F, T]`
- benchdiff: `[T, F, T, T, F, F, T, T, F, T]`
Exact match. max_k=5 (zero-based; the sixth-smallest p, 0.03, is the largest that meets its threshold 6/10 × 0.05 = 0.03). BH step-up logic correct.
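For reference, the step-up rule can be sketched in a few lines of Rust. This is an illustrative implementation of the standard Benjamini-Hochberg procedure, not benchdiff's code; `benjamini_hochberg` is a hypothetical name and the sketch assumes every p-value is finite.

```rust
/// Benjamini-Hochberg step-up procedure at false discovery rate `q`.
/// Returns one reject flag per input p-value, in input order.
/// Hypothetical helper: assumes every p-value is finite.
fn benjamini_hochberg(pvals: &[f64], q: f64) -> Vec<bool> {
    let m = pvals.len();
    // Indices of the p-values, sorted ascending by p.
    let mut order: Vec<usize> = (0..m).collect();
    order.sort_by(|&i, &j| pvals[i].total_cmp(&pvals[j]));
    // Largest zero-based rank k whose p-value is <= (k + 1) / m * q, if any.
    let max_k = (0..m)
        .rev()
        .find(|&k| pvals[order[k]] <= (k + 1) as f64 / m as f64 * q);
    // Reject every hypothesis ranked at or below max_k.
    let mut reject = vec![false; m];
    if let Some(k) = max_k {
        for &i in &order[..=k] {
            reject[i] = true;
        }
    }
    reject
}

fn main() {
    let pvals = [0.001, 0.04, 0.03, 0.001, 0.2, 0.05, 0.01, 0.0001, 0.5, 0.02];
    let flags = benjamini_hochberg(&pvals, 0.05);
    // Prints: [true, false, true, true, false, false, true, true, false, true]
    println!("{flags:?}");
}
```

On the ten p-values above, the search from the largest rank downward stops at zero-based rank 5 (p = 0.03), so the six smallest p-values are rejected and the rest are kept.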
- `save base.csv --label v1`: OK, emits "saved baseline v1 with 2 benchmark(s)"
- `compare curr.csv --against v1 --output markdown`: OK, 1 regression, 1 unchanged, exit=1
- `compare self --against v1`: 0 regressions, exit=0
- `compare curr --correction bh`: BH active, parse_large still regression, exit=1
- `list`: prints "v1"
- `init`: prints TOML template
- `inspect curr.csv`: 2 benchmarks, unit=ns
- `save --label "../bad"`: exit=2 (InvalidLabel → Config/InvalidLabel bucket)
- 5 release runs, `--output json`: all 5 md5 identical → deterministic
- 3 release runs, `--output markdown`: all 3 md5 identical → deterministic
main.rs::classify_exit_code maps every Error variant:
- Parse, UnknownFormat, Json, Toml, InvalidNumber → 3
- Io, BaselineNotFound, InputTooLarge, LineTooLong → 4
- InsufficientSamples, EmptyInput, NonFinite → 5
- InvalidLabel, Config → 2
- Unwrapped io::Error → 4
- Fallback → 2
All 14 Error variants classified; no unreachable branch; consistent with README "Exit codes" table.
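The mapping above can be pictured as a single exhaustive `match`. The sketch below is a simplified stand-in, not the code in main.rs: the variant names come from the list above, but the payload-free enum, the `i32` return type, and the omission of the unwrapped `io::Error` and fallback paths are assumptions made to keep it self-contained.

```rust
/// Simplified stand-in for the crate's error type: the 14 variant names
/// listed in the audit, with payloads omitted (an assumption for brevity).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Error {
    Parse,
    UnknownFormat,
    Json,
    Toml,
    InvalidNumber,
    Io,
    BaselineNotFound,
    InputTooLarge,
    LineTooLong,
    InsufficientSamples,
    EmptyInput,
    NonFinite,
    InvalidLabel,
    Config,
}

/// Maps each variant to its exit code. There is no wildcard arm, so adding
/// a variant without classifying it becomes a compile error.
fn classify_exit_code(err: Error) -> i32 {
    match err {
        Error::Parse
        | Error::UnknownFormat
        | Error::Json
        | Error::Toml
        | Error::InvalidNumber => 3,
        Error::Io
        | Error::BaselineNotFound
        | Error::InputTooLarge
        | Error::LineTooLong => 4,
        Error::InsufficientSamples | Error::EmptyInput | Error::NonFinite => 5,
        Error::InvalidLabel | Error::Config => 2,
    }
}

fn main() {
    for err in [Error::Json, Error::Io, Error::NonFinite, Error::InvalidLabel] {
        println!("{:?} -> {}", err, classify_exit_code(err));
    }
}
```

Leaving out a catch-all arm is a design choice in this sketch: the compiler then enforces the "every variant classified" property that the audit checked by hand.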
- Baseline has `schema: u32` + `label`, `created_at`, `run`.
- No `#[serde(deny_unknown_fields)]` on Baseline → future v1.1 fields are silently ignored by v1.0 (forward-compatible).
- Schema version hard-checked: `if b.schema != 1 → Error::parse("unsupported schema version")` → breaking changes are fail-loud.
- Config does use `deny_unknown_fields`: a typo in TOML fails loud (correct behaviour for user-facing config).
- Correct layering.
`grep -n 'unwrap\|expect\|panic!\|unreachable!' src/`:
- 100% of matches are inside `#[cfg(test)]` modules.
- Zero production-path panics found.
- `#![forbid(unsafe_code)]` enforced in lib.rs and main.rs.
- `#![deny(warnings)]` active in main.rs.
| file | lines |
|---|---|
| compare.rs | 576 (incl. ~280 test lines) |
| stats.rs | 554 (~200 test) |
| config.rs | 413 (~135 test) |
| parsers/criterion.rs | 366 (~125 test) |
| report/markdown.rs | 299 (~60 test) |
Largest production module after stripping tests ≈ 300 LoC. All readable, single-responsibility.
- Install: `cargo install --git` and `cargo install --path .` documented; both work from source. README does NOT claim `cargo install benchdiff-rs` from crates.io (correct, package not published).
- Usage: every flag in README matches clap-derive (`--label`, `--against`, `--format`, `--alpha`, `--min-change`, `--correction`, `--output`, `--out`, `--allow-regressions`, `--config`, `--baseline-dir`, `--quiet`, `--top`).
- Exit codes table matches main.rs::classify_exit_code precisely.
- Edge-cases table (insufficient, unchanged, zero-var, missing, new, ignored) all verified in compare.rs.
- MSRV set to `"1.75"`. Clap 4.5, serde 1, thiserror 1, anyhow 1, comfy-table 7.1, toml 0.8: all compatible with 1.75. Release build succeeds on current stable (≥ 1.75). Not actually verified against 1.75 exactly (no MSRV toolchain available), but no ≥1.76 features used (no `let-else-in-fn-pat`, no `generic const exprs`, no `async fn in traits`).
NONE.
CLEAN — CYCLE 3 OF 3 → PASS PHASE 2.
Cycles A through G found and fixed 10 bugs in total. F, G, and now H were each independently clean. Three consecutive clean audits from different agents satisfy the 3-fresh-agent rule. Release binary, all 180 tests, CLI smoke, statistical golden-value verification, determinism checks, exit-code coverage, forward-compat audit, panic audit, module sizing, README-vs-code cross-check: all green.
Ready to ship.