
Accuracy Benchmark

bsevern edited this page Apr 9, 2026 · 1 revision

infermap ships with a cross-language accuracy benchmark that scores every PR in both Python and TypeScript on the same corpus, with the same metrics, and gates regressions in CI.

Headline numbers (v0.2.0, synthetic slice): F1 ≈ 0.95 in both runners, with cross-language parity within 0.3 percentage points.

Why

Schema-mapping is easy to overfit. A single hand-tuned alias can flatter a small fixture suite without improving real behavior. The benchmark exists so that:

  • Every PR is scored against the same corpus, in both languages, with the same metrics.
  • Cross-language drift is caught immediately. (A v0.1.x bug — Python silently re-inferring dtypes through polars while TypeScript did not — was caught by the benchmark and fixed by the new MapEngine.map_schemas() API.)
  • Regressions are blocked by CI: PRs that drop F1 by more than the threshold require an explicit regression-ack label.

Metrics

| Metric | Question it answers |
| --- | --- |
| F1 | Of the mappings the engine emitted, how many are correct (precision), and how many correct mappings did it miss (recall)? |
| Top-1 accuracy | When the engine picks the highest-scoring target for a source field, is it right? |
| MRR | If the correct target isn't ranked first, how far down the candidate list is it? |
| ECE | Are the engine's confidence scores actually calibrated, i.e. does 0.9 confidence mean roughly 90% correct? |

Each metric has a hand-computed anchor file under benchmark/tests/parity/ so refactors cannot silently change the math.
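To make the rank-based metrics concrete, here is an illustrative sketch of how MRR and a simple binned ECE can be computed, in the spirit of those anchor files. This is not the benchmark's own code; the data shapes (a dict of ranked candidate lists, parallel confidence/correctness lists) are assumptions for the example.

```python
def mrr(ranked_lists, gold):
    """Mean reciprocal rank: 1/rank of the correct target, averaged over fields."""
    total = 0.0
    for field, ranking in ranked_lists.items():
        rank = ranking.index(gold[field]) + 1  # 1-based position of the answer
        total += 1.0 / rank
    return total / len(ranked_lists)

def ece(confidences, correct, n_bins=10):
    """Expected calibration error: |accuracy - mean confidence| per confidence
    bin, weighted by bin size."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # clamp conf == 1.0 into top bin
        bins[idx].append((conf, ok))
    total = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        acc = sum(1 for _, ok in b if ok) / len(b)
        total += (len(b) / len(confidences)) * abs(acc - avg_conf)
    return total

# Correct target ranked 1st for "id" (RR = 1.0), 2nd for "name" (RR = 0.5)
print(mrr({"id": ["id", "uid"], "name": ["uid", "name"]},
          {"id": "id", "name": "name"}))  # 0.75
```

A perfectly ranked corpus gives MRR = 1.0; a perfectly calibrated engine gives ECE = 0.0, which is why a hand-computed anchor for each is enough to pin the math down.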

Running it locally

# Python
pip install -e ".[dev]"
pip install -e "benchmark/runners/python[dev]"
python -m infermap_bench run --output report-py.json
python -m infermap_bench report report-py.json

# TypeScript
cd benchmark/runners/ts
npm install --install-links
npm run build
node dist/cli.js run --output ../../../report-ts.json
node dist/cli.js report ../../../report-ts.json

# Compare two reports
python -m infermap_bench compare --baseline report-py.json --current report-ts.json

Filtering: pass --only category:names, --only difficulty:hard, --only tag:abbrev, or just a case-id prefix to restrict which cases run.
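Conceptually, the compare step is the regression gate described above: read two reports and check the F1 delta against a threshold. A minimal sketch, assuming a flat report JSON with a top-level "f1" key and a hypothetical 0.5pp threshold (the real infermap_bench report schema and threshold may differ):

```python
import json

THRESHOLD_PP = 0.5  # hypothetical: max tolerated F1 drop, in percentage points

def f1_regressed(baseline_path, current_path, threshold_pp=THRESHOLD_PP):
    """Return True if the current report's F1 dropped past the threshold."""
    with open(baseline_path) as f:
        base = json.load(f)
    with open(current_path) as f:
        curr = json.load(f)
    drop_pp = (base["f1"] - curr["f1"]) * 100  # convert fraction to pp
    return drop_pp > threshold_pp
```

In CI, a True result is what forces the explicit regression-ack label before the PR can merge.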

New API in v0.2

MapEngine.map_schemas() / mapSchemas()

The entry point for pre-extracted schemas: use it when you already hold a SchemaInfo.

from infermap import MapEngine
engine = MapEngine()
result = engine.map_schemas(src_schema, tgt_schema)

return_score_matrix=True / returnScoreMatrix: true

Exposes the full M×N candidate score matrix. Enables MRR computation, runner-up inspection, and override UIs.

engine = MapEngine(return_score_matrix=True)
result = engine.map_schemas(src, tgt)
for src_field, candidates in result.score_matrix.items():
    # inspect the three highest-scoring targets for each source field
    top3 = sorted(candidates.items(), key=lambda kv: -kv[1])[:3]
    print(src_field, top3)
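Building on that snippet, the score matrix is exactly what runner-up inspection and override UIs need. A sketch, assuming the shape implied by the loop above (source field → {candidate target: score}); the gold mapping here is a hypothetical hand label, not part of the API:

```python
def top1_and_margins(score_matrix, gold):
    """Top-1 accuracy against a gold mapping, plus the best-vs-runner-up
    score margin per field (small margins are the cases worth surfacing
    in an override UI)."""
    hits = 0
    margins = {}
    for src_field, candidates in score_matrix.items():
        ranked = sorted(candidates.items(), key=lambda kv: -kv[1])
        best = ranked[0]
        runner_up = ranked[1] if len(ranked) > 1 else (None, 0.0)
        hits += best[0] == gold[src_field]
        margins[src_field] = best[1] - runner_up[1]
    return hits / len(score_matrix), margins

matrix = {"cust_id": {"customer_id": 0.9, "order_id": 0.4},
          "dob": {"signup_date": 0.7, "birth_date": 0.6}}
top1, margins = top1_and_margins(matrix, {"cust_id": "customer_id",
                                          "dob": "birth_date"})
# top1 == 0.5: "dob" is a near-miss with only a ~0.1 margin
```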

Self-test

benchmark/self-test/ is a tiny corpus with a frozen expected scorecard. Any change to the engine that moves the self-test scores by more than 1e-4 fails CI, even if the full benchmark looks fine. Treat it as a tripwire.
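The tripwire check itself is simple. A minimal sketch, assuming the frozen scorecard is a flat JSON dict of metric name → value (the real layout under benchmark/self-test/ may differ):

```python
import json
import math

TOLERANCE = 1e-4  # per the rule above: any larger drift fails CI

def scores_drifted(frozen_path, current_scores):
    """True if any metric moved by more than TOLERANCE from the frozen card."""
    with open(frozen_path) as f:
        frozen = json.load(f)
    return any(
        not math.isclose(frozen[name], current_scores[name], abs_tol=TOLERANCE)
        for name in frozen
    )
```

Using an absolute tolerance (rather than a relative one) matches the stated 1e-4 rule and keeps the check strict even for metrics near zero.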
