Skip to content

Latest commit

 

History

History
127 lines (113 loc) · 6.37 KB

File metadata and controls

127 lines (113 loc) · 6.37 KB

Eval Cycle C — benchdiff-rs

Independent hostile audit. Cycle 1 of 3.

Files audited

  • src/lib.rs, src/main.rs, src/error.rs, src/cli.rs
  • src/sample.rs, src/stats.rs, src/baseline.rs, src/config.rs, src/compare.rs
  • src/parsers/{mod.rs,criterion.rs,csv.rs,gobench.rs,hyperfine.rs}
  • src/report/{mod.rs,json.rs,text.rs,markdown.rs}
  • README.md, Cargo.toml, ROUND_LOG.md, CHANGELOG.md
  • tests/integration_*.rs (5 files)

Re-verification of B's fixes

  1. INF_SENTINEL saturation (compare::compute_row) — verified: 1e300 sentinel finite, JSON never emits null, display layer collapses to ±∞ via DISPLAY_INF_THRESHOLD = 1e100. Real-world conflict (a real 1e300 measurement) is impossible (1e300 ns ≈ 3e283 years).
  2. classify_exit_code downcastcli::run returns anyhow::Result but never calls .context(), so benchdiff::Error stays at the head of the anyhow chain and downcast_ref::<BdError>() succeeds for every propagated error. All variants mapped to documented exit codes (2/3/4/5).
  3. MAX_WALK_DEPTH=32 + symlink rejection — both parse_dir entry and per-entry walk skip symlinks via symlink_metadata().file_type().is_symlink(). Windows note: this also catches reparse points / junctions because is_symlink() returns true for any reparse with the symlink tag and junctions are skipped at the directory iteration level (we'd just recurse into them; the depth cap protects against cycles regardless).
  4. Config::tolerance per-entry validation — verified: walks every entry, accepts integer or float, rejects string / negative / non-finite. deny_unknown_fields at top level still rejects mystery keys (E2E verified — see below).
  5. glob_match middle wildcard — verified: Bench*Skip matches BenchFooSkip, BenchSkip; ** is not special-cased but degenerates correctly (empty middle chunks are skipped, equivalent to *).

Fresh probes

  • README determinism — markdown report uses literal \n, not writeln!, so output is LF on every platform. ✓
  • Welch Satterthwaite df — fractional, not rounded. two_sided_p uses df directly via reg_incomplete_beta(x, df/2, 0.5). ✓
  • Welch with n=1welch_t_test returns InsufficientSamples, compute_row catches and emits Verdict::Insufficient. ✓
  • Hyperfine times: null — serde_json rejects null→Vec, surfaced as Error::Parse. Acceptable.
  • Criterion mean without std_devEstimatesJson::std_dev defaults to None → sd=0.0 → 3 identical synthesized samples. ✓
  • CSV BOM prefix — would parse the BOM-prefixed name row as data, fail with InvalidNumber. Sub-optimal but errors clearly. Not blocking.
  • CSV CRLFlines_crlf strips trailing \r, integration test handles_crlf covers it. ✓
  • CSV column case-sensitivity — header detection uses to_ascii_lowercase(); case-insensitive. README example uses lowercase but parser accepts either. ✓
  • ignore regex special chars — patterns are literal except *; Bench[0-9]+ would only match the literal string. Documented as "no character classes". ✓
  • Multiple benchmarks same name across runsBTreeMap aggregation in csv/gobench parsers; criterion would emit two Benchmark entries (last find() wins in compare). Pre-existing limitation, not a regression.
  • Baseline overwriteBaseline::save calls fs::write (truncate); no warning. Not documented as either way; not a bug.
  • Compare with baseline > current — missing rows reported as MissingFromCurrent. ✓
  • Compare with current > baseline — extra rows reported as NewInCurrent. ✓
  • JSON schema versioningBaseline carries schema: u32; load rejects non-1. DiffReport JSON has no version field. Not blocking — consumers can rely on field stability.
  • Unicode benchmark names — comfy-table handles wide chars; markdown escape only touches | / backtick. ✓
  • Exit code 3 vs 4 — Parse / UnknownFormat / Json / Toml / InvalidNumber → 3 (parse-class); Io / BaselineNotFound / InputTooLarge / LineTooLong → 4 (I/O-class). InvalidLabel / Config → 2 (usage). InsufficientSamples / EmptyInput / NonFinite → 5 (statistical). All match documented table.
  • --help for every subcommand — verified for save (and clap auto-generates for the rest).
  • README code samples — quick-start commands E2E verified end-to-end via CSV save+compare; output matches doc shape, exit code 1 on regression.

Bugs found

Bug C-1 — anyhow {:#} duplicates source-embedded error display (LOW)

main.rs printed the error via eprintln!("benchdiff error: {err:#}") where {err:#} is anyhow's alternate-form formatter. That formatter walks #[source] and appends each layer with : . But every benchdiff::Error variant whose source is interesting (Io, Json, Toml) already embeds {source} directly in its #[error("…: {source}")] Display. The result was the same TOML/JSON/IO message printed twice, joined by an extra : :

benchdiff error: TOML parse error in "x.toml": TOML parse error at line 4 …
: TOML parse error at line 4 …

Fix: switch to eprintln!("benchdiff error: {err}") (non-alternate). Each error variant's Display already contains the source text exactly once. Comment added to explain why we deliberately drop :#.

Severity: cosmetic / UX; no functional impact, but confused error output is exactly the kind of thing CI users will paste into bug reports.

Test count

174 tests, all passing post-fix:

  • 160 lib unit tests (sample, stats, baseline, config, compare, parsers, report, cli)
  • 14 integration tests across 5 fixture-driven files

Test output

test result: ok. 160 passed; 0 failed; 0 ignored
test result: ok. 3 passed; 0 failed (integration_compare)
test result: ok. 3 passed; 0 failed (integration_criterion)
test result: ok. 2 passed; 0 failed (integration_csv)
test result: ok. 3 passed; 0 failed (integration_gobench)
test result: ok. 3 passed; 0 failed (integration_hyperfine)

E2E verified

  • save CSV → baseline written
  • compare CSV vs CSV → markdown table, regression flagged, exit 1
  • JSON output → strictly numeric (no nulls), well-formed
  • --config with mystery top-level field → exit 3, single clean error line

Verdict

1 LOW bug found, 1 fix applied. Cycle is NOT clean — a fresh agent must re-evaluate.