
Reproducibility

This document describes how to reproduce ContextRAG evaluations and interpret artifacts.

Datasets

The evaluation runner expects:

dataset/
├── documents/      # one text file per document
└── queries.jsonl   # one JSON object per line

Each queries.jsonl line contains:

{"query": "...", "relevant_ids": ["doc_id_1", "doc_id_2"]}

Included datasets:

  • data/demo -- small RFC-based dataset for offline demo runs
  • data/eval-mixed -- larger mixed corpus used in the main evaluation
  • data/eval-expanded -- expanded mixed corpus with multi-relevance queries and hard negatives
  • data/eval-external -- external RFC holdout split for transfer checks
  • data/eval-scifact-mini -- public BEIR SciFact slice for non-RFC transfer checks

To rebuild the SciFact slice:

python3 scripts/build_eval_scifact_mini.py

Offline Demo (Deterministic)

This run uses local embeddings and produces reproducible artifacts:

uv run contextrag eval \
  --dataset data/demo \
  --baseline uniform \
  --k 5 \
  --embed-provider local \
  --output runs/demo_eval.json \
  --run-dir runs/demo_eval

The first run downloads the local embedding model (sentence-transformers/all-MiniLM-L6-v2).
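To confirm the demo is reproducible, run the command above twice with different --run-dir values and compare the per-query artifacts. Note that summary.json includes timing, so it will not be byte-identical across runs; per_query.jsonl should be, assuming its records carry no run-specific fields. The directory names below are hypothetical:

import hashlib
import pathlib

def digest(path: str) -> str:
    return hashlib.sha256(pathlib.Path(path).read_bytes()).hexdigest()

# Hypothetical run directories from two separate invocations.
a = digest("runs/demo_eval_a/per_query.jsonl")
b = digest("runs/demo_eval_b/per_query.jsonl")
print("identical" if a == b else "diverged: inspect per-query records")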

Config-Driven Runs

For larger evals with saved artifacts:

uv run contextrag eval --config experiments/eval_expanded_uniform_local.yaml --run-dir runs/eval_expanded

Two embedding providers are supported: openrouter and local. OpenAI model families can be selected through OpenRouter model IDs (e.g., openai/text-embedding-3-small).
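The config file schema is not reproduced in this document; as a rough sketch, assuming the YAML keys mirror the CLI flags shown above (check experiments/eval_expanded_uniform_local.yaml for the actual shape):

# Hypothetical config; key names assume a 1:1 mapping to CLI flags.
dataset: data/eval-expanded
baseline: uniform
k: 5
embed_provider: local
output: runs/eval_expanded.json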

Matrix Runs (Recommended)

Run a matrix comparison with local embeddings:

uv run contextrag matrix \
  --dataset data/eval-expanded \
  --baselines uniform,router \
  --k-values 3,5,10 \
  --embed-provider local \
  --run-root runs/matrix_eval_expanded_local \
  --persist-root runs/chroma-matrix-eval-expanded-local

This writes:

  • one run directory per (baseline, k) pair
  • matrix_summary.json and matrix_summary.md
  • per-k uniform-vs-router comparisons under comparisons/

Or use the Makefile shortcut:

make reproduce

Optional dashboard rendering:

python3 scripts/render_matrix_report.py \
  --input runs/matrix_eval_expanded_local/matrix_summary.json \
  --output docs/matrix_eval_expanded_local.md
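To script your own analysis against matrix_summary.json, inspect its structure first; the schema is not documented here, so a minimal sketch that only pretty-prints it:

import json
import pathlib

summary = json.loads(pathlib.Path("runs/matrix_eval_expanded_local/matrix_summary.json").read_text())
# Discover the actual key names before writing comparison logic
# against assumed fields.
print(json.dumps(summary, indent=2)[:2000])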

Artifacts

Each run directory contains:

runs/{run_name}/
├── summary.json     # aggregate metrics and timing
├── per_query.jsonl  # per-query metrics and hits
├── metadata.json    # dataset/config details
└── manifest.json    # config hash, dataset fingerprint, versions, system info

manifest.json includes:

  • manifest_schema_version
  • git_commit (when available)
  • dataset fingerprint (sha256, file count, byte count)
  • package/system versions

The top-level output JSON (e.g., runs/demo_eval.json) matches the summary.json content and includes per-query records inline.
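To confirm that two runs used the same code and data, compare their manifests. The field list above gives manifest_schema_version and git_commit verbatim; the key name for the dataset fingerprint below is a guess, so check a real manifest first:

import json
import pathlib

def manifest(run_dir: str) -> dict:
    return json.loads(pathlib.Path(run_dir, "manifest.json").read_text())

a, b = manifest("runs/demo_eval"), manifest("runs/eval_expanded")
print("same commit: ", a.get("git_commit") == b.get("git_commit"))
# "dataset_fingerprint" is an assumed key name; confirm against an actual file.
print("same dataset:", a.get("dataset_fingerprint") == b.get("dataset_fingerprint"))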

Determinism Notes

  • Local embeddings are deterministic for a fixed model version (a quick check follows these notes).
  • Switching embedding providers requires rebuilding the index.
  • API-based providers may introduce nondeterminism depending on model and service configuration.
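A quick sanity check of the first note, using the model named earlier; this assumes sentence-transformers is available in the environment:

import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
a = model.encode(["Routers forward packets between networks."])
b = model.encode(["Routers forward packets between networks."])
# A fixed model version should embed identical text identically.
print("deterministic:", np.allclose(a, b))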