Eval Cycle H — Round 79 (benchdiff-rs)

Independent hostile audit by an auditor who did not build or run cycles A-G. Target: 3rd consecutive clean audit → PASS Phase 2.

Environment

  • Rust toolchain: stable-x86_64-pc-windows-msvc (cargo via rustup).
  • Clippy/rustfmt not installed — build/test only.
  • Release binary rebuilt (cargo build --release, 6.05s clean).

Tests

  • cargo test --release: 180 passed, 0 failed
    • 166 unit tests (in src/ inline #[cfg(test)] mod tests)
    • 3 integration_compare, 3 integration_criterion, 2 integration_csv, 3 integration_gobench, 3 integration_hyperfine
  • 0 ignored, 0 filtered.

Fresh angle probes

1. Welch's t-test vs scipy.stats.ttest_ind(equal_var=False)

Four golden cases. benchdiff computes t = (mean_b - mean_a)/se, whereas scipy computes (mean_a - mean_b)/se, so the sign of t flips between the two; the magnitudes, df, and p-values must match to ~8 decimals.

| Case | scipy \|t\|, df, p | benchdiff \|t\|, df, p | Δ |
|---|---|---|---|
| [1..5] vs [1.5..5.5] | 0.5, 8, 6.30536076e-1 | 0.5, 8, 6.30536076e-1 | 0 |
| [10.0..10.8] vs [12.0..12.1] | 7.44090644, 6.72904250, 1.75293187e-4 | 7.44090644, 6.72904250, 1.75293187e-4 | 0 |
| 6×100 vs 6×200 | 66.42111642, 6.67436490, 1.14830910e-10 | 66.42111642, 6.67436490, 1.14830910e-10 | 0 |
| 7×100 vs 7×200 | 285.29870108, 12.0, 2.31443649e-24 | 285.29870108, 12.0, 2.31443649e-24 | 0 |

Exact match to 8 decimals on |t|, df, and p across all four cases. At these scales, the Lentz continued fraction + Lanczos ln_gamma implementation agrees with scipy on every printed digit.
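For readers who want to reproduce the first golden case by hand, here is a minimal sketch of the statistic and the Welch-Satterthwaite degrees of freedom. It is not benchdiff's code: the names `mean_var` and `welch_t_df` are hypothetical, and the p-value step (regularized incomplete beta) is left out. It follows the benchdiff sign convention described above, t = (mean_b - mean_a)/se.

```rust
/// Sample mean and unbiased (n - 1) sample variance.
/// Hypothetical helper, not taken from benchdiff.
fn mean_var(xs: &[f64]) -> (f64, f64) {
    let n = xs.len() as f64;
    let mean = xs.iter().sum::<f64>() / n;
    let var = xs.iter().map(|x| (x - mean).powi(2)).sum::<f64>() / (n - 1.0);
    (mean, var)
}

/// Welch's t statistic and Welch-Satterthwaite degrees of freedom.
/// Returns None when either sample has fewer than two points or when the
/// standard error is zero or not finite (t is undefined in those cases).
fn welch_t_df(a: &[f64], b: &[f64]) -> Option<(f64, f64)> {
    if a.len() < 2 || b.len() < 2 {
        return None;
    }
    let (na, nb) = (a.len() as f64, b.len() as f64);
    let (mean_a, var_a) = mean_var(a);
    let (mean_b, var_b) = mean_var(b);

    // Per-sample contribution to the squared standard error.
    let qa = var_a / na;
    let qb = var_b / nb;
    let se2 = qa + qb;
    if !se2.is_finite() || se2 <= 0.0 {
        return None;
    }

    // Sign convention: (mean_b - mean_a), the opposite of scipy's.
    let t = (mean_b - mean_a) / se2.sqrt();
    let df = se2 * se2 / (qa * qa / (na - 1.0) + qb * qb / (nb - 1.0));
    Some((t, df))
}
```

With a = [1, 2, 3, 4, 5] and b = [1.5, 2.5, 3.5, 4.5, 5.5], both variances are 2.5, so se = 1, t = 0.5 and df = 8, which agrees with the first row of the table.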

2. Benjamini-Hochberg vs hand-computed multipletests(fdr_bh)

pvals = [0.001, 0.04, 0.03, 0.001, 0.2, 0.05, 0.01, 0.0001, 0.5, 0.02], q=0.05.

  • Python reference (hand): [T, F, T, T, F, F, T, T, F, T]
  • benchdiff: [T, F, T, T, F, F, T, T, F, T]

Exact match. max_k=5 (zero-based: the sixth-smallest p, 0.03, is the largest one that meets its threshold, 6 × 0.05 / 10 = 0.03). BH step-up logic correct.
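The step-up rule being checked can be sketched as follows. This is an illustrative reimplementation, not benchdiff's code, and the function name `benjamini_hochberg` is hypothetical: sort the p-values, find the largest rank k with p_(k) ≤ k·q/m, and reject every hypothesis at or below that rank.

```rust
/// Benjamini-Hochberg step-up procedure (illustrative sketch).
/// Returns, in input order, whether each hypothesis is rejected at
/// false-discovery rate `q`.
fn benjamini_hochberg(pvals: &[f64], q: f64) -> Vec<bool> {
    let m = pvals.len();

    // Indices of the p-values in ascending order.
    let mut order: Vec<usize> = (0..m).collect();
    order.sort_by(|&i, &j| pvals[i].total_cmp(&pvals[j]));

    // Largest 1-based rank k such that p_(k) <= k * q / m.
    // The scan must cover all ranks: a later rank can pass even if an
    // earlier one failed, which is what makes the procedure "step-up".
    let mut cutoff = 0usize;
    for (idx, &i) in order.iter().enumerate() {
        let rank = (idx + 1) as f64;
        let threshold = rank * q / (m as f64);
        if pvals[i] <= threshold {
            cutoff = idx + 1;
        }
    }

    // Reject everything ranked at or below the cutoff.
    let mut reject = vec![false; m];
    for &i in order.iter().take(cutoff) {
        reject[i] = true;
    }
    reject
}
```

On the ten p-values above with q = 0.05, the cutoff lands on rank 6 (p = 0.03), so the six smallest p-values are rejected and 0.04, 0.05, 0.2 and 0.5 are not.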

3. CLI end-to-end smoke (.probe/ tempdir, since cleaned)

  • save base.csv --label v1: OK, emits "saved baseline v1 with 2 benchmark(s)"
  • compare curr.csv --against v1 --output markdown: OK, 1 regression, 1 unchanged, exit=1
  • compare self --against v1: 0 regressions, exit=0
  • compare curr --correction bh: BH active, parse_large still regression, exit=1
  • list: prints "v1"
  • init: prints TOML template
  • inspect curr.csv: 2 benchmarks, unit=ns
  • save --label "../bad": exit=2 (InvalidLabel → Config/InvalidLabel bucket)

4. Determinism

  • 5 release runs, --output json: all 5 md5 identical → deterministic
  • 3 release runs, --output markdown: all 3 md5 identical → deterministic

5. Exit-code coverage

main.rs::classify_exit_code maps every Error variant:

  • Parse, UnknownFormat, Json, Toml, InvalidNumber → 3
  • Io, BaselineNotFound, InputTooLarge, LineTooLong → 4
  • InsufficientSamples, EmptyInput, NonFinite → 5
  • InvalidLabel, Config → 2
  • Unwrapped io::Error → 4
  • Fallback → 2

All 14 Error variants classified; no unreachable branch; consistent with README "Exit codes" table.
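A minimal sketch of how such a mapping can be kept exhaustive is shown below. The enum is a payload-free stand-in for the crate's real `Error` type (an assumption; the actual variants presumably carry data), and the function only mirrors the bucket assignments listed above. Because the `match` has no wildcard arm, adding a fifteenth variant would fail to compile until it is assigned a code. The unwrapped io::Error and fallback paths from the list are outside this sketch.

```rust
/// Hypothetical, payload-free mirror of the 14 error variants listed above.
#[allow(dead_code)]
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Error {
    Parse,
    UnknownFormat,
    Json,
    Toml,
    InvalidNumber,
    Io,
    BaselineNotFound,
    InputTooLarge,
    LineTooLong,
    InsufficientSamples,
    EmptyInput,
    NonFinite,
    InvalidLabel,
    Config,
}

/// Maps each variant to the exit-code bucket described in the audit.
/// No `_` arm on purpose: the compiler then enforces full coverage.
fn classify_exit_code(err: Error) -> i32 {
    match err {
        // Usage or configuration problems.
        Error::InvalidLabel | Error::Config => 2,
        // Input could not be parsed.
        Error::Parse
        | Error::UnknownFormat
        | Error::Json
        | Error::Toml
        | Error::InvalidNumber => 3,
        // I/O and resource limits.
        Error::Io
        | Error::BaselineNotFound
        | Error::InputTooLarge
        | Error::LineTooLong => 4,
        // Data is readable but statistically unusable.
        Error::InsufficientSamples | Error::EmptyInput | Error::NonFinite => 5,
    }
}
```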

6. Baseline JSON forward/backward compatibility

  • Baseline has schema: u32 + label, created_at, run.
  • No #[serde(deny_unknown_fields)] on Baseline → future v1.1 fields are silently ignored by v1.0 (forward-compatible).
  • Schema version hard-checked: if b.schema != 1 → Error::parse("unsupported schema version"), so breaking changes fail loudly.
  • Config does use deny_unknown_fields — typo in TOML fails loud (correct behaviour for user-facing config).
  • Correct layering.

7. Panic audit

grep -n 'unwrap\|expect\|panic!\|unreachable!' src/:

  • 100% of matches are inside #[cfg(test)] modules.
  • Zero production-path panics found.
  • #![forbid(unsafe_code)] enforced in lib.rs and main.rs.
  • #![deny(warnings)] active in main.rs.

8. Module sizes (code health)

| file | lines |
|---|---|
| compare.rs | 576 (incl. ~280 test lines) |
| stats.rs | 554 (~200 test) |
| config.rs | 413 (~135 test) |
| parsers/criterion.rs | 366 (~125 test) |
| report/markdown.rs | 299 (~60 test) |

Largest production module after stripping tests ≈ 300 LoC. All readable, single-responsibility.

9. README vs reality

  • Install: cargo install --git and cargo install --path . documented — both work from source. README does NOT claim cargo install benchdiff-rs from crates.io (correct, package not published).
  • Usage: every flag in README matches clap-derive (--label, --against, --format, --alpha, --min-change, --correction, --output, --out, --allow-regressions, --config, --baseline-dir, --quiet, --top).
  • Exit codes table matches main.rs::classify_exit_code precisely.
  • Edge-cases table (insufficient, unchanged, zero-var, missing, new, ignored) all verified in compare.rs.

10. Cargo.toml rust-version

  • Set to "1.75". Clap 4.5, serde 1, thiserror 1, anyhow 1, comfy-table 7.1, toml 0.8: all compatible with 1.75. Release build succeeds on current stable (≥ 1.75). Not actually verified against 1.75 exactly (no MSRV toolchain available), but no features newer than 1.75 appear to be used (no let-else-in-fn-pat, no generic const exprs; also no async fn in traits, although that one was stabilized in 1.75 itself and would not break the MSRV).

Bugs found

NONE.

Verdict

CLEAN — CYCLE 3 OF 3 → PASS PHASE 2.

Cycles A through G found and fixed 10 bugs. F, G, and now H all independently clean. Three consecutive clean audits from different agents satisfy the 3-fresh-agent rule. Release binary, all 180 tests, CLI smoke, statistical golden-value verification, determinism checks, exit-code coverage, forward-compat audit, panic audit, module sizing, README-vs-code cross-check — all green.

Ready to ship.