.eval-notes-C.md

Eval Cycle C — benchdiff-rs

Independent hostile audit. Cycle 1 of 3.

Files audited

src/lib.rs, src/main.rs, src/error.rs, src/cli.rs
src/sample.rs, src/stats.rs, src/baseline.rs, src/config.rs, src/compare.rs
src/parsers/{mod.rs,criterion.rs,csv.rs,gobench.rs,hyperfine.rs}
src/report/{mod.rs,json.rs,text.rs,markdown.rs}
README.md, Cargo.toml, ROUND_LOG.md, CHANGELOG.md
tests/integration_*.rs (5 files)

Re-verification of B's fixes

INF_SENTINEL saturation (compare::compute_row) — verified: 1e300 sentinel finite, JSON never emits null, display layer collapses to ±∞ via DISPLAY_INF_THRESHOLD = 1e100. Real-world conflict (a real 1e300 measurement) is impossible (1e300 ns ≈ 3e283 years).
classify_exit_code downcast — cli::run returns anyhow::Result but never calls .context(), so benchdiff::Error stays at the head of the anyhow chain and downcast_ref::<BdError>() succeeds for every propagated error. All variants mapped to documented exit codes (2/3/4/5).
MAX_WALK_DEPTH=32 + symlink rejection — both parse_dir entry and per-entry walk skip symlinks via symlink_metadata().file_type().is_symlink(). Windows note: this also catches reparse points / junctions because is_symlink() returns true for any reparse with the symlink tag and junctions are skipped at the directory iteration level (we'd just recurse into them; the depth cap protects against cycles regardless).
Config::tolerance per-entry validation — verified: walks every entry, accepts integer or float, rejects string / negative / non-finite. deny_unknown_fields at top level still rejects mystery keys (E2E verified — see below).
glob_match middle wildcard — verified: Bench*Skip matches BenchFooSkip, BenchSkip; ** is not special-cased but degenerates correctly (empty middle chunks are skipped, equivalent to *).

Fresh probes

README determinism — markdown report uses literal \n, not writeln!, so output is LF on every platform. ✓
Welch Satterthwaite df — fractional, not rounded. two_sided_p uses df directly via reg_incomplete_beta(x, df/2, 0.5). ✓
Welch with n=1 — welch_t_test returns InsufficientSamples, compute_row catches and emits Verdict::Insufficient. ✓
Hyperfine times: null — serde_json rejects null→Vec, surfaced as Error::Parse. Acceptable.
Criterion mean without std_dev — EstimatesJson::std_dev defaults to None → sd=0.0 → 3 identical synthesized samples. ✓
CSV BOM prefix — would parse the BOM-prefixed name row as data, fail with InvalidNumber. Sub-optimal but errors clearly. Not blocking.
CSV CRLF — lines_crlf strips trailing \r, integration test handles_crlf covers it. ✓
CSV column case-sensitivity — header detection uses to_ascii_lowercase(); case-insensitive. README example uses lowercase but parser accepts either. ✓
ignore regex special chars — patterns are literal except *; Bench[0-9]+ would only match the literal string. Documented as "no character classes". ✓
Multiple benchmarks same name across runs — BTreeMap aggregation in csv/gobench parsers; criterion would emit two Benchmark entries (last find() wins in compare). Pre-existing limitation, not a regression.
Baseline overwrite — Baseline::save calls fs::write (truncate); no warning. Not documented as either way; not a bug.
Compare with baseline > current — missing rows reported as MissingFromCurrent. ✓
Compare with current > baseline — extra rows reported as NewInCurrent. ✓
JSON schema versioning — Baseline carries schema: u32; load rejects non-1. DiffReport JSON has no version field. Not blocking — consumers can rely on field stability.
Unicode benchmark names — comfy-table handles wide chars; markdown escape only touches | / backtick. ✓
Exit code 3 vs 4 — Parse / UnknownFormat / Json / Toml / InvalidNumber → 3 (parse-class); Io / BaselineNotFound / InputTooLarge / LineTooLong → 4 (I/O-class). InvalidLabel / Config → 2 (usage). InsufficientSamples / EmptyInput / NonFinite → 5 (statistical). All match documented table.
--help for every subcommand — verified for save (and clap auto-generates for the rest).
README code samples — quick-start commands E2E verified end-to-end via CSV save+compare; output matches doc shape, exit code 1 on regression.

Bugs found

Bug C-1 — anyhow `{:#}` duplicates source-embedded error display (LOW)

main.rs printed the error via eprintln!("benchdiff error: {err:#}") where {err:#} is anyhow's alternate-form formatter. That formatter walks #[source] and appends each layer with : . But every benchdiff::Error variant whose source is interesting (Io, Json, Toml) already embeds {source} directly in its #[error("…: {source}")] Display. The result was the same TOML/JSON/IO message printed twice, joined by an extra : :

benchdiff error: TOML parse error in "x.toml": TOML parse error at line 4 …
: TOML parse error at line 4 …

Fix: switch to eprintln!("benchdiff error: {err}") (non-alternate). Each error variant's Display already contains the source text exactly once. Comment added to explain why we deliberately drop :#.

Severity: cosmetic / UX; no functional impact, but confused error output is exactly the kind of thing CI users will paste into bug reports.

Test count

174 tests, all passing post-fix:

160 lib unit tests (sample, stats, baseline, config, compare, parsers, report, cli)
14 integration tests across 5 fixture-driven files

Test output

test result: ok. 160 passed; 0 failed; 0 ignored
test result: ok. 3 passed; 0 failed (integration_compare)
test result: ok. 3 passed; 0 failed (integration_criterion)
test result: ok. 2 passed; 0 failed (integration_csv)
test result: ok. 3 passed; 0 failed (integration_gobench)
test result: ok. 3 passed; 0 failed (integration_hyperfine)

E2E verified

save CSV → baseline written
compare CSV vs CSV → markdown table, regression flagged, exit 1
JSON output → strictly numeric (no nulls), well-formed
--config with mystery top-level field → exit 3, single clean error line

Verdict

1 LOW bug found, 1 fix applied. Cycle is NOT clean — a fresh agent must re-evaluate.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Eval Cycle C — benchdiff-rs

Files audited

Re-verification of B's fixes

Fresh probes

Bugs found

Bug C-1 — anyhow `{:#}` duplicates source-embedded error display (LOW)

Test count

Test output

E2E verified

Verdict

FilesExpand file tree

.eval-notes-C.md

Latest commit

History

.eval-notes-C.md

File metadata and controls

Eval Cycle C — benchdiff-rs

Files audited

Re-verification of B's fixes

Fresh probes

Bugs found

Bug C-1 — anyhow {:#} duplicates source-embedded error display (LOW)

Test count

Test output

E2E verified

Verdict

Bug C-1 — anyhow `{:#}` duplicates source-embedded error display (LOW)