# Accuracy Benchmark
infermap ships with a cross-language accuracy benchmark that scores every PR in both Python and TypeScript on the same corpus, with the same metrics, and gates regressions in CI.
Headline numbers (v0.2.0, synthetic slice): F1 ≈ 0.95 in both runners, parity within 0.3pp.
Schema-mapping is easy to overfit. A single hand-tuned alias can flatter a small fixture suite without improving real behavior. The benchmark exists so that:
- Every PR is scored against the same corpus, in both languages, with the same metrics.
- Cross-language drift is caught immediately. (A v0.1.x bug, where Python silently re-inferred dtypes through polars while TypeScript did not, was caught by the benchmark and fixed by the new `MapEngine.map_schemas()` API.)
- Regressions are blocked by CI: PRs that drop F1 by more than the threshold require an explicit `regression-ack` label.
| Metric | Question it answers |
|---|---|
| F1 | Of the mappings the engine emitted, how many are correct, and how many correct ones did it miss? |
| Top-1 accuracy | When the engine picks the highest-scoring target for a source field, is it right? |
| MRR | If the correct target isn't #1, how far down the candidate list is it? |
| ECE | Are the engine's confidence scores actually calibrated? |
Each metric has a hand-computed anchor file under `benchmark/tests/parity/` so refactors cannot silently change the math.
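To make the table concrete, here is a minimal, illustrative sketch of how F1 and MRR could be computed for a single case. This is not the benchmark's actual code; the field names and data shapes are invented for the example.

```python
# Illustrative only: toy F1 and MRR for one schema-mapping case.
# Assumes each source field has a ranked candidate list plus a gold target.

def f1(predicted: set, gold: set) -> float:
    """Harmonic mean of precision and recall over emitted (src, tgt) pairs."""
    if not predicted or not gold:
        return 0.0
    tp = len(predicted & gold)
    precision = tp / len(predicted)
    recall = tp / len(gold)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def mrr(ranked: dict, gold: dict) -> float:
    """Mean reciprocal rank of the gold target in each candidate list."""
    total = 0.0
    for src, candidates in ranked.items():
        if gold[src] in candidates:
            total += 1.0 / (candidates.index(gold[src]) + 1)
        # gold target absent from candidates: contributes 0
    return total / len(ranked)

# Toy case: the engine gets one top-1 right, one gold target at rank 2.
ranked = {"cust_nm": ["customer_name", "customer_id"],
          "dob":     ["signup_date", "date_of_birth"]}
gold   = {"cust_nm": "customer_name", "dob": "date_of_birth"}

predicted_pairs = {(s, c[0]) for s, c in ranked.items()}
gold_pairs = set(gold.items())

print(f1(predicted_pairs, gold_pairs))  # 0.5: one of two emitted pairs correct
print(mrr(ranked, gold))                # 0.75: (1/1 + 1/2) / 2
```

Top-1 accuracy falls out of the same data (it is 0.5 here: one of two top candidates matches gold); ECE additionally needs the engine's confidence scores, binned against empirical accuracy.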
```shell
# Python
pip install -e ".[dev]"
pip install -e "benchmark/runners/python[dev]"
python -m infermap_bench run --output report-py.json
python -m infermap_bench report report-py.json

# TypeScript
cd benchmark/runners/ts
npm install --install-links
npm run build
node dist/cli.js run --output ../../../report-ts.json
node dist/cli.js report ../../../report-ts.json

# Compare two reports
python -m infermap_bench compare --baseline report-py.json --current report-ts.json
```

Filtering: `--only category:names`, `--only difficulty:hard`, `--only tag:abbrev`, or just a case-id prefix.
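The regression gate the CI applies over compared reports can be sketched as follows. This is a hypothetical illustration: the report JSON shape (a top-level `"metrics"` dict) and the 0.3pp threshold value are assumptions, not the benchmark's real schema or configured threshold.

```python
# Hypothetical regression gate over two benchmark reports.
# ASSUMPTION: reports look like {"metrics": {"f1": ...}}; real schema may differ.
# ASSUMPTION: 0.3pp threshold is illustrative, not the project's actual value.

F1_DROP_THRESHOLD_PP = 0.3  # max allowed F1 drop, in percentage points

def gate(baseline: dict, current: dict) -> bool:
    """Return True if the current report passes the regression gate."""
    drop_pp = (baseline["metrics"]["f1"] - current["metrics"]["f1"]) * 100
    return drop_pp <= F1_DROP_THRESHOLD_PP

baseline = {"metrics": {"f1": 0.950}}
passing  = {"metrics": {"f1": 0.949}}  # ~0.1pp drop: within threshold
failing  = {"metrics": {"f1": 0.940}}  # ~1.0pp drop: needs regression-ack label

print(gate(baseline, passing))  # True
print(gate(baseline, failing))  # False
```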
Pre-extracted-schema entry point. Use this when you already hold a `SchemaInfo`:

```python
from infermap import MapEngine

engine = MapEngine()
result = engine.map_schemas(src_schema, tgt_schema)
```

The `return_score_matrix` option exposes the full M×N candidate score matrix, enabling MRR computation, runner-up inspection, and override UIs:
```python
engine = MapEngine(return_score_matrix=True)
result = engine.map_schemas(src, tgt)
for src_field, candidates in result.score_matrix.items():
    top3 = sorted(candidates.items(), key=lambda kv: -kv[1])[:3]
```

`benchmark/self-test/` is a tiny corpus with a frozen expected scorecard. Any change to the engine that moves the self-test scores by more than 1e-4 fails CI, even if the full benchmark looks fine. Treat it as a tripwire.
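The self-test tripwire amounts to a tolerance comparison against the frozen scorecard. A minimal sketch, assuming the scorecard is a flat metric-name-to-value mapping; the metric names, values, and file layout here are illustrative, not the project's actual fixtures:

```python
# Illustrative self-test tripwire: flag any metric that drifts beyond
# tolerance from the frozen scorecard. Names and values are invented.

TOLERANCE = 1e-4

def tripwire(frozen: dict, current: dict) -> list:
    """Return metrics that drifted beyond tolerance (empty list = pass)."""
    return [name for name, expected in frozen.items()
            if abs(current.get(name, float("inf")) - expected) > TOLERANCE]

frozen  = {"f1": 0.9512, "top1": 0.9600, "mrr": 0.9731, "ece": 0.0412}
current = {"f1": 0.9512, "top1": 0.9603, "mrr": 0.9731, "ece": 0.0412}

print(tripwire(frozen, current))  # ['top1']: a ~3e-4 drift would fail CI
```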
- Python API: `MapEngine.map_schemas()` and `return_score_matrix=True`
- TypeScript API: `MapEngine.mapSchemas()` and `returnScoreMatrix: true`
- Docs site benchmark page
- Example: `08_benchmark_introspection.py`