
Accuracy Benchmark

bsevern edited this page Apr 9, 2026 · 1 revision

infermap ships with a cross-language accuracy benchmark that scores every PR in both Python and TypeScript on the same corpus, with the same metrics, and gates regressions in CI.

Headline numbers (v0.2.0, synthetic slice): F1 ≈ 0.95 in both runners, with cross-language parity within 0.3 percentage points.

Why

Schema-mapping is easy to overfit. A single hand-tuned alias can flatter a small fixture suite without improving real behavior. The benchmark exists so that:

  • Every PR is scored against the same corpus, in both languages, with the same metrics.
  • Cross-language drift is caught immediately. (A v0.1.x bug — Python silently re-inferring dtypes through polars while TypeScript did not — was caught by the benchmark and fixed by the new MapEngine.map_schemas() API.)
  • Regressions are blocked by CI: PRs that drop F1 by more than the threshold require an explicit regression-ack label.

Metrics

| Metric | Question it answers |
| --- | --- |
| F1 | Of the mappings the engine emitted, how many are correct (precision), and how many correct mappings did it miss (recall)? |
| Top-1 accuracy | When the engine picks the highest-scoring target for a source field, is it right? |
| MRR | If the correct target isn't ranked first, how far down the candidate list is it? |
| ECE | Are the engine's confidence scores actually calibrated, i.e. does 0.9 confidence mean roughly 90% correct? |

Each metric has a hand-computed anchor file under benchmark/tests/parity/ so refactors cannot silently change the math.
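To make the rank-based metrics concrete, here is an illustrative sketch of how MRR and a simple binned ECE can be computed, in the spirit of those anchor files. This is not the benchmark's own code; the data shapes (a dict of ranked candidate lists, parallel confidence/correctness lists) are assumptions for the example.

```python
def mrr(ranked_lists, gold):
    """Mean reciprocal rank: 1/rank of the correct target, averaged over fields."""
    total = 0.0
    for field, ranking in ranked_lists.items():
        rank = ranking.index(gold[field]) + 1  # 1-based position of the answer
        total += 1.0 / rank
    return total / len(ranked_lists)

def ece(confidences, correct, n_bins=10):
    """Expected calibration error: |accuracy - mean confidence| per confidence
    bin, weighted by bin size."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # clamp conf == 1.0 into top bin
        bins[idx].append((conf, ok))
    total = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        acc = sum(1 for _, ok in b if ok) / len(b)
        total += (len(b) / len(confidences)) * abs(acc - avg_conf)
    return total

# Correct target ranked 1st for "id" (RR = 1.0), 2nd for "name" (RR = 0.5)
print(mrr({"id": ["id", "uid"], "name": ["uid", "name"]},
          {"id": "id", "name": "name"}))  # 0.75
```

A perfectly ranked corpus gives MRR = 1.0; a perfectly calibrated engine gives ECE = 0.0, which is why a hand-computed anchor for each is enough to pin the math down.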

Running it locally

# Python
pip install -e ".[dev]"
pip install -e "benchmark/runners/python[dev]"
python -m infermap_bench run --output report-py.json
python -m infermap_bench report report-py.json

# TypeScript
cd benchmark/runners/ts
npm install --install-links
npm run build
node dist/cli.js run --output ../../../report-ts.json
node dist/cli.js report ../../../report-ts.json

# Compare two reports
python -m infermap_bench compare --baseline report-py.json --current report-ts.json

Filtering: pass --only category:names, --only difficulty:hard, --only tag:abbrev, or just a case-id prefix to restrict which cases run.
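Conceptually, the compare step is the regression gate described above: read two reports and check the F1 delta against a threshold. A minimal sketch, assuming a flat report JSON with a top-level "f1" key and a hypothetical 0.5pp threshold (the real infermap_bench report schema and threshold may differ):

```python
import json

THRESHOLD_PP = 0.5  # hypothetical: max tolerated F1 drop, in percentage points

def f1_regressed(baseline_path, current_path, threshold_pp=THRESHOLD_PP):
    """Return True if the current report's F1 dropped past the threshold."""
    with open(baseline_path) as f:
        base = json.load(f)
    with open(current_path) as f:
        curr = json.load(f)
    drop_pp = (base["f1"] - curr["f1"]) * 100  # convert fraction to pp
    return drop_pp > threshold_pp
```

In CI, a True result is what forces the explicit regression-ack label before the PR can merge.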

New API in v0.2

MapEngine.map_schemas() / mapSchemas()

The entry point for pre-extracted schemas: use it when you already hold a SchemaInfo.

from infermap import MapEngine
engine = MapEngine()
result = engine.map_schemas(src_schema, tgt_schema)

return_score_matrix=True / returnScoreMatrix: true

Exposes the full M×N candidate score matrix. Enables MRR computation, runner-up inspection, and override UIs.

engine = MapEngine(return_score_matrix=True)
result = engine.map_schemas(src, tgt)
for src_field, candidates in result.score_matrix.items():
    # inspect the three highest-scoring targets for each source field
    top3 = sorted(candidates.items(), key=lambda kv: -kv[1])[:3]
    print(src_field, top3)
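Building on that snippet, the score matrix is exactly what runner-up inspection and override UIs need. A sketch, assuming the shape implied by the loop above (source field → {candidate target: score}); the gold mapping here is a hypothetical hand label, not part of the API:

```python
def top1_and_margins(score_matrix, gold):
    """Top-1 accuracy against a gold mapping, plus the best-vs-runner-up
    score margin per field (small margins are the cases worth surfacing
    in an override UI)."""
    hits = 0
    margins = {}
    for src_field, candidates in score_matrix.items():
        ranked = sorted(candidates.items(), key=lambda kv: -kv[1])
        best = ranked[0]
        runner_up = ranked[1] if len(ranked) > 1 else (None, 0.0)
        hits += best[0] == gold[src_field]
        margins[src_field] = best[1] - runner_up[1]
    return hits / len(score_matrix), margins

matrix = {"cust_id": {"customer_id": 0.9, "order_id": 0.4},
          "dob": {"signup_date": 0.7, "birth_date": 0.6}}
top1, margins = top1_and_margins(matrix, {"cust_id": "customer_id",
                                          "dob": "birth_date"})
# top1 == 0.5: "dob" is a near-miss with only a ~0.1 margin
```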

Self-test

benchmark/self-test/ is a tiny corpus with a frozen expected scorecard. Any change to the engine that moves the self-test scores by more than 1e-4 fails CI, even if the full benchmark looks fine. Treat it as a tripwire.
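The tripwire check itself is simple. A minimal sketch, assuming the frozen scorecard is a flat JSON dict of metric name → value (the real layout under benchmark/self-test/ may differ):

```python
import json
import math

TOLERANCE = 1e-4  # per the rule above: any larger drift fails CI

def scores_drifted(frozen_path, current_scores):
    """True if any metric moved by more than TOLERANCE from the frozen card."""
    with open(frozen_path) as f:
        frozen = json.load(f)
    return any(
        not math.isclose(frozen[name], current_scores[name], abs_tol=TOLERANCE)
        for name in frozen
    )
```

Using an absolute tolerance (rather than a relative one) matches the stated 1e-4 rule and keeps the check strict even for metrics near zero.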
