This document describes how to reproduce ContextRAG evaluations and interpret artifacts.
The evaluation runner expects:
```
dataset/
├── documents/      # one text file per document
└── queries.jsonl   # one JSON object per line
```
Each `queries.jsonl` line contains:
{"query": "...", "relevant_ids": ["doc_id_1", "doc_id_2"]}
Included datasets:
- `data/demo` -- small RFC-based dataset for offline demo runs
- `data/eval-mixed` -- mixed corpus used in the main evaluation (larger)
- `data/eval-expanded` -- expanded mixed corpus with multi-relevance and hard negatives
- `data/eval-external` -- external RFC holdout split for transfer checks
- `data/eval-scifact-mini` -- public BEIR SciFact slice for non-RFC transfer checks
To rebuild the SciFact slice:
```
python3 scripts/build_eval_scifact_mini.py
```

This run uses local embeddings and produces reproducible artifacts:

```
uv run contextrag eval \
  --dataset data/demo \
  --baseline uniform \
  --k 5 \
  --embed-provider local \
  --output runs/demo_eval.json \
  --run-dir runs/demo_eval
```

The first run downloads the local embedding model
(`sentence-transformers/all-MiniLM-L6-v2`).
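If you want to pre-fetch and cache that model before running the eval (for example on a machine that will later be offline), a minimal sketch using the sentence-transformers package directly is shown below; this is an illustration, not a ContextRAG command.

```python
# Pre-download the local embedding model used by --embed-provider local.
# Requires the sentence-transformers package; the model is cached locally.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

# Quick smoke test: embed a sentence and check the vector size (384 for this model).
vec = model.encode("retrieval sanity check")
print(len(vec))  # -> 384
```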
For larger evals with saved artifacts:
```
uv run contextrag eval --config experiments/eval_expanded_uniform_local.yaml --run-dir runs/eval_expanded
```

Embeddings support the `openrouter` and `local` providers. OpenAI model families can be selected through OpenRouter model IDs (e.g., `openai/text-embedding-3-small`).
Run a matrix comparison with local embeddings:
```
uv run contextrag matrix \
  --dataset data/eval-expanded \
  --baselines uniform,router \
  --k-values 3,5,10 \
  --embed-provider local \
  --run-root runs/matrix_eval_expanded_local \
  --persist-root runs/chroma-matrix-eval-expanded-local
```

This writes:

- one run directory per `(baseline, k)` pair
- `matrix_summary.json` and `matrix_summary.md` (see the loading sketch below)
- per-`k` uniform-vs-router comparisons under `comparisons/`
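A minimal sketch for inspecting `matrix_summary.json` programmatically is below. The key names used here (`runs`, `baseline`, `k`, `metrics`) are assumptions for illustration; print the raw dictionary first to confirm the actual schema.

```python
import json
from pathlib import Path

summary = json.loads(
    Path("runs/matrix_eval_expanded_local/matrix_summary.json").read_text()
)

# Illustrative only: the key names below are assumptions, not a documented schema.
for entry in summary.get("runs", []):
    print(entry.get("baseline"), entry.get("k"), entry.get("metrics"))
```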
Or use the Makefile shortcut:
```
make reproduce
```

Optional dashboard rendering:
```
python3 scripts/render_matrix_report.py \
  --input runs/matrix_eval_expanded_local/matrix_summary.json \
  --output docs/matrix_eval_expanded_local.md
```

Each run directory contains:
```
runs/{run_name}/
├── summary.json     # aggregate metrics and timing
├── per_query.jsonl  # per-query metrics and hits
├── metadata.json    # dataset/config details
└── manifest.json    # config hash, dataset fingerprint, versions, system info
```
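A small sketch for loading these artifacts after a run; the run directory path is the demo run from above, and the contents of individual records are whatever the runner wrote, so treat the printout as exploratory.

```python
import json
from pathlib import Path

run_dir = Path("runs/demo_eval")

summary = json.loads((run_dir / "summary.json").read_text())
print(summary)  # aggregate metrics and timing

# per_query.jsonl holds one JSON record per evaluated query.
with open(run_dir / "per_query.jsonl", encoding="utf-8") as fh:
    per_query = [json.loads(line) for line in fh]
print(f"{len(per_query)} queries evaluated")
```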
`manifest.json` includes:

- `manifest_schema_version`
- `git_commit` (when available)
- dataset fingerprint (sha256, file count, byte count)
- package/system versions
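For intuition, below is a sketch of what such a fingerprint can look like (a sha256 digest over file contents plus counts). It illustrates the idea only and is not necessarily the exact algorithm ContextRAG uses, so compare against the values in `manifest.json` rather than relying on this sketch.

```python
import hashlib
from pathlib import Path

def dataset_fingerprint(root: str) -> dict:
    """Illustrative fingerprint: sha256 over relative paths and file bytes."""
    base = Path(root)
    digest = hashlib.sha256()
    files = sorted(p for p in base.rglob("*") if p.is_file())
    total_bytes = 0
    for path in files:
        data = path.read_bytes()
        digest.update(str(path.relative_to(base)).encode())
        digest.update(data)
        total_bytes += len(data)
    return {
        "sha256": digest.hexdigest(),
        "file_count": len(files),
        "byte_count": total_bytes,
    }

print(dataset_fingerprint("data/demo"))
```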
The top-level output JSON (e.g., `runs/demo_eval.json`) matches the
`summary.json` content and includes per-query records inline.
- Local embeddings are deterministic for a fixed model version.
- Switching embedding providers requires rebuilding the index.
- API-based providers may introduce nondeterminism depending on model and service configuration.
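A quick way to convince yourself of the first point is to embed the same text twice with the pinned local model and compare the vectors; this is an illustrative check run outside the eval runner.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
a = model.encode(["tcp congestion control"])
b = model.encode(["tcp congestion control"])

# With a pinned model version and the same hardware/settings,
# repeated encodings of the same text should match.
print(np.allclose(a, b))  # expected: True
```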