This document defines the evaluation strategy for ArchReviewAgent.
It has two goals:
- give release confidence before code ships
- explain how to measure quality and drift in production after release
The plan is grounded in the current production architecture described in architecture.md:
- vendor intake and subject resolution
- evidence retrieval
- structured decisioning
- report presentation
- Postgres-backed cache and background refresh
It is also designed to evolve into the future-state architecture in docs/extensible-guardrail-architecture.md.
The evaluation program should follow these principles:
- Evaluate the system by stage, not just by final answer.
- Prefer production-like cases over synthetic prompt trivia.
- Use deterministic checks where possible.
- Use model graders for nuanced judgment calls.
- Keep a small human calibration loop so grader quality does not drift.
- Compare candidate branches against a stable baseline instead of looking only at absolute scores.
- Store enough trace and evidence data to explain why a run changed.
For OpenAI-specific guidance that informed this plan, see:
Before a release, the system should answer:
- Did subject resolution regress?
- Did evidence quality regress?
- Did guardrail verdicts regress?
- Did recommendation quality regress?
- Did security or operational controls regress?
- Did latency or timeout behavior regress materially?
After release, the system should answer:
- Are live results drifting?
- Are `unknown` outcomes increasing?
- Are cache refreshes improving results or degrading them?
- Are product-level requests collapsing to parent-company summaries?
- Are failures concentrated in intake, retrieval, decisioning, or presentation?
The initial evaluation scope is the current product:
- fixed guardrails:
  - EU data residency
  - enterprise deployment
- current contract in shared/contracts.ts
- current staged backend in:
The plan should be extensible so the same framework can support additional guardrails later.
Use three evaluation layers together.
The first layer is deterministic checks: hard assertions that should not be subjective.
Examples:
- input validation rejects malformed or prompt-injection-like inputs
- CORS rejects untrusted origins
- production security headers are present
- public health output stays minimal
- report JSON matches the shared contract
- cache promotion rejects weaker candidates
- official-domain filtering is enforced
These are good release gates because they are stable and cheap.
The second layer is model graders. Use them where the answer is semantic rather than a matter of exact string matching.
Examples:
- Is the product overview accurate?
- Is the EU residency verdict supported by the cited evidence?
- Is the enterprise deployment verdict supported by the cited evidence?
- Is the overall recommendation more optimistic than the evidence justifies?
- Is the answer anchored to the requested product rather than the parent company?
Model graders should score structured criteria, not vague "is this good?" prompts.
The third layer is a small recurring human review loop for:
- newly failing release cases
- all recommendation changes
- all `unknown` outputs
- low-confidence or low-grader-score production samples
Human review should be used to calibrate the graders, not replace them.
The current system should be evaluated by stage.
Stage: vendor intake and subject resolution
Files:
Questions:
- Was the requested subject understood correctly?
- Was product specificity preserved?
- Was the canonical vendor reasonable?
- Were official domains correct?
- Did spelling variants converge to the same cached identity?
Primary metrics:
- subject exact match rate
- canonical vendor match rate
- official-domain precision
- typo and alias convergence rate
- ambiguity handling accuracy
Stage: evidence retrieval
Files:
Questions:
- Did retrieval find usable evidence?
- Was evidence primarily from trusted domains?
- Was evidence sufficient for both guardrails?
- Did retrieval preserve product specificity?
- Did retrieval time out or fail to produce a memo?
Primary metrics:
- memo generation success rate
- memo length distribution
- evidence item count per guardrail
- first-party evidence ratio
- retrieval timeout rate
- retrieval failure rate
Stage: structured decisioning
Files:
Questions:
- Are guardrail statuses correct?
- Is confidence reasonable?
- Is the recommendation justified by the evidence?
- Is the answer more optimistic than the evidence allows?
- Does refresh promotion prevent weaker evidence from replacing stronger evidence?
Primary metrics:
- guardrail status accuracy
- recommendation accuracy
- confidence calibration quality
- recommendation optimism violation rate
- cache promotion acceptance rate
- weak-candidate rejection rate
Stage: report presentation
Files:
Questions:
- Does the final report conform to contract?
- Is the "What this product does" section accurate and non-generic?
- Does markdown render safely and correctly?
- Are citations, risks, questions, and next steps coherent?
Primary metrics:
- schema validity rate
- markdown rendering regression rate
- product overview accuracy
- report completeness score
Stage: operational and security surface
Files:
Questions:
- Are public endpoints limited correctly?
- Are browser hardening headers present?
- Is CORS restricted correctly?
- Are latency and error budgets acceptable?
Primary metrics:
- 5xx rate
- timeout rate
- public endpoint exposure regressions
- security smoke-test pass rate
Create a versioned eval dataset in the repo.
Recommended structure:
```
evals/
  cases/
    release-core.jsonl
    release-edge.jsonl
    security.jsonl
    production-shadow.jsonl
  graders/
    release/
    production/
  reports/
    <timestamp>/
```
The initial case schema and local validator should live at:
Each case should be one JSON object in JSONL format.
Suggested schema:
```json
{
  "id": "fabric-product-specificity",
  "category": "product-vs-parent",
  "input": "Microsoft Fabric",
  "expected_subject": "Microsoft Fabric",
  "expected_vendor": "Microsoft",
  "expected_official_domains": [
    "fabric.microsoft.com",
    "learn.microsoft.com",
    "microsoft.com"
  ],
  "expected_guardrails": {
    "euDataResidency": {
      "status": "supported",
      "allow_equivalents": ["partial"]
    },
    "enterpriseDeployment": {
      "status": "supported"
    }
  },
  "expected_recommendation": "green",
  "allowed_unknowns": [],
  "notes": "Should stay anchored to product, not collapse to generic Microsoft overview."
}
```
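The local validator mentioned above can start as a small standalone script. The sketch below assumes the suggested field names and a hypothetical `validate-cases.ts` entry point; adapt both to the real schema and repo layout.

```ts
// Hypothetical validator for evals/cases/*.jsonl; field names follow the
// suggested case schema above and are not a finalized contract.
import { readFileSync } from "node:fs";

interface EvalCase {
  id: string;
  category: string;
  input: string;
  expected_recommendation: "green" | "yellow" | "red";
  expected_official_domains: string[];
  allowed_unknowns: string[];
}

function validateCase(raw: unknown, line: number): string[] {
  const c = raw as Partial<EvalCase>;
  const errors: string[] = [];
  if (!c.id) errors.push(`line ${line}: missing id`);
  if (!c.input) errors.push(`line ${line}: missing input`);
  if (!Array.isArray(c.expected_official_domains)) {
    errors.push(`line ${line}: expected_official_domains must be an array`);
  }
  if (!["green", "yellow", "red"].includes(c.expected_recommendation ?? "")) {
    errors.push(`line ${line}: expected_recommendation must be green, yellow, or red`);
  }
  return errors;
}

// Usage (illustrative): npx tsx validate-cases.ts evals/cases/release-core.jsonl
const path = process.argv[2];
if (!path) {
  console.error("usage: validate-cases <file.jsonl>");
  process.exit(1);
}
const lines = readFileSync(path, "utf8")
  .split("\n")
  .filter((l) => l.trim().length > 0);
const errors = lines.flatMap((l, i) => {
  try {
    return validateCase(JSON.parse(l), i + 1);
  } catch {
    return [`line ${i + 1}: invalid JSON`];
  }
});
if (errors.length > 0) {
  console.error(errors.join("\n"));
  process.exit(1);
}
console.log(`${lines.length} cases OK`);
```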
Seed the first dataset with at least these categories:
- normal well-documented vendors
- product under large parent company
- typo and alias resolution
- ambiguous names
- prompt-injection attempts
- URL-like inputs
- malformed short inputs
- insufficient-evidence vendors
- non-English documentation
- EU residency supported but plan-scoped
- transfer-law-only cases that should not count as residency support
- enterprise deployment with strong admin features
- enterprise deployment with only marketing language
- stale-doc source rotation
- vendor domain changes or multiple official domains
- known weak cache baseline followed by stronger evidence
- cached report should be reused on repeat lookup
- background refresh should not regress an accepted baseline
- markdown-heavy output rendering
- security surface checks for public endpoints and CORS
These should run first and fail fast; a smoke-test sketch for the security surface items follows the list.
- response parses against shared/contracts.ts
- `researchedAt` is ISO-like and non-empty
- each guardrail has: `status`, `confidence`, `summary`, `risks`, `evidence`
- recommendation is one of `green | yellow | red`
- malformed inputs rejected
- prompt-like or tool-instruction inputs rejected
- inputs shorter than minimum length rejected
- evidence URLs present
- evidence URLs use allowed schemes
- evidence stays within trusted domain rules where policy requires it
- weaker candidates do not replace stronger accepted baselines
- spelling variants converge to the same accepted subject baseline
- refresh can improve a baseline without duplicating cache buckets
- untrusted CORS origins do not get browser approval
- production root has required security headers
- public `/api/health` stays minimal
- `/api/internal/health` is not public without internal authorization
- `/api/chat/test` is not publicly reachable in production
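A post-deploy smoke script for the security surface items might look like the sketch below. The base URL, the header checked, and the expected status codes are assumptions to align with the actual deployment configuration.

```ts
// Hypothetical post-deploy smoke test; SMOKE_BASE_URL, the header names, and the
// expected status codes are placeholders to adapt to the real deployment.
const BASE_URL = process.env.SMOKE_BASE_URL ?? "http://localhost:3000";

let failed = false;
function check(name: string, ok: boolean): void {
  console.log(`${ok ? "ok  " : "FAIL"} ${name}`);
  if (!ok) failed = true;
}

async function main(): Promise<void> {
  const root = await fetch(BASE_URL);
  check("security headers present", root.headers.has("content-security-policy"));

  const health = await fetch(`${BASE_URL}/api/health`);
  const body = (await health.json()) as Record<string, unknown>;
  // Public health output should stay minimal: a status field and little else.
  check("public health stays minimal", Object.keys(body).length <= 2);

  const internal = await fetch(`${BASE_URL}/api/internal/health`);
  check("internal health not public", internal.status === 401 || internal.status === 404);

  const cors = await fetch(`${BASE_URL}/api/health`, {
    headers: { Origin: "https://untrusted.example" },
  });
  check(
    "untrusted origin not allowed",
    cors.headers.get("access-control-allow-origin") !== "https://untrusted.example"
  );

  if (failed) process.exit(1);
}

main();
```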
Model graders should evaluate structured outputs, not free-form vibes.
Grader: product and subject fidelity
Questions:
- Is the report about the requested product, not just the parent company?
- Is the "What this product does" section factual and specific?
- Is the official-domain set reasonable for this subject?
Grader: evidence support
Questions:
- Does the EU residency verdict follow from the cited evidence?
- Does the enterprise deployment verdict follow from the cited evidence?
- Are the cited findings relevant to the claimed guardrail?
- Is the report using generic company text where product-specific evidence exists?
Grader: recommendation calibration
Questions:
- Is the final recommendation too optimistic?
- Is the recommendation too pessimistic?
- Are open questions and next steps appropriate given the evidence quality?
Grader: report coherence
Questions:
- Is the report coherent for an analyst?
- Are risks, unanswered questions, and next steps non-redundant?
- Is the overview concise and accurate?
Use a structured grader response such as:
```json
{
  "pass": true,
  "score": 0.92,
  "reason": "The verdict is well supported by primary vendor evidence.",
  "flags": []
}
```

Recommended flags:
- `product_drift`
- `unsupported_verdict`
- `optimistic_recommendation`
- `generic_overview`
- `thin_evidence`
- `irrelevant_citations`
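For reference, the grader response and flag list above map to a small TypeScript shape that the runner can validate before trusting a model's self-reported score; the names simply mirror the example.

```ts
// Sketch of a structured grader result; field names mirror the JSON example above.
type GraderFlag =
  | "product_drift"
  | "unsupported_verdict"
  | "optimistic_recommendation"
  | "generic_overview"
  | "thin_evidence"
  | "irrelevant_citations";

interface GraderResult {
  pass: boolean;
  score: number; // 0..1
  reason: string;
  flags: GraderFlag[];
}

// Defensively parse a model response before recording it as a grade.
function parseGraderResult(raw: string): GraderResult {
  const parsed = JSON.parse(raw) as GraderResult;
  if (typeof parsed.pass !== "boolean" || typeof parsed.score !== "number") {
    throw new Error("grader response missing required fields");
  }
  return parsed;
}
```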
Review a small sample every week.
Minimum review queue:
- all newly failing release cases
- all cases where recommendation changed from the last baseline
- all cases with any `unknown`
- all production samples with grader score below threshold
- all product-vs-parent failures
Reviewer checklist:
- Was the subject resolved correctly?
- Were trusted domains appropriate?
- Did the guardrail status match the evidence?
- Was the recommendation justified?
- Did the overview stay specific to the product?
Use the review to:
- correct dataset expectations
- improve grader prompts
- identify new case categories
Run release evals against:
- `main` baseline
- candidate branch or release commit
Use the same dataset for both runs and compare results.
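One way to express the comparison is to diff two per-case result summaries and flag any case whose recommendation or guardrail status changed. The result shape below is an illustration, not the repo's existing contract.

```ts
// Hypothetical per-case summary produced by one eval run.
interface CaseResult {
  id: string;
  recommendation: "green" | "yellow" | "red";
  guardrailStatuses: Record<string, string>;
  pass: boolean;
}

interface CaseDiff {
  id: string;
  changes: string[];
}

// Compare a candidate run against the main baseline, case by case.
function compareRuns(baseline: CaseResult[], candidate: CaseResult[]): CaseDiff[] {
  const byId = new Map(baseline.map((r) => [r.id, r]));
  const diffs: CaseDiff[] = [];
  for (const cand of candidate) {
    const base = byId.get(cand.id);
    if (!base) continue; // new case: report separately if needed
    const changes: string[] = [];
    if (base.recommendation !== cand.recommendation) {
      changes.push(`recommendation ${base.recommendation} -> ${cand.recommendation}`);
    }
    for (const [guardrail, status] of Object.entries(cand.guardrailStatuses)) {
      if (base.guardrailStatuses[guardrail] !== status) {
        changes.push(`${guardrail} ${base.guardrailStatuses[guardrail]} -> ${status}`);
      }
    }
    if (base.pass && !cand.pass) changes.push("newly failing");
    if (changes.length > 0) diffs.push({ id: cand.id, changes });
  }
  return diffs;
}
```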
Prefer a production-parallel setup:
- same Postgres-backed cache path enabled
- same OpenAI model snapshot
- same env defaults except secrets and origin settings
For release evals, pin a dated model snapshot instead of a moving alias when practical.
Reason:
- code changes and model changes should be separable
- release comparisons are easier to trust when the model is fixed
Production may use the alias for freshness, but snapshot-to-snapshot comparisons should happen before changing production model settings.
Recommended release thresholds for the current product:
- deterministic checks: 100%
- prompt-injection rejection: 100%
- public security smoke tests: 100%
- subject resolution exact or approved-equivalent: >= 95%
- guardrail status exact or approved-equivalent: >= 90%
- recommendation exact or approved-equivalent: >= 90%
- product overview accuracy: >= 90%
- no material increase in timeout rate
- no material increase in `unknown` rate
These thresholds are intentionally pragmatic. They should tighten as the dataset matures.
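A minimal sketch of how the release runner could encode these gates, assuming a summary object keyed by metric name; the metric names are illustrative and the numbers simply restate the thresholds above.

```ts
// Release gate sketch; metric names should map to the runner's actual summary output.
const RELEASE_THRESHOLDS = {
  deterministicPassRate: 1.0,
  promptInjectionRejectionRate: 1.0,
  securitySmokePassRate: 1.0,
  subjectResolutionRate: 0.95,
  guardrailStatusRate: 0.9,
  recommendationRate: 0.9,
  productOverviewAccuracy: 0.9,
} as const;

type MetricName = keyof typeof RELEASE_THRESHOLDS;

// Returns metrics below their gate; an empty array means the release passes.
function failingGates(summary: Record<MetricName, number>): MetricName[] {
  return (Object.keys(RELEASE_THRESHOLDS) as MetricName[]).filter(
    (metric) => summary[metric] < RELEASE_THRESHOLDS[metric]
  );
}
```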
Require review of:
- every new failing case
- every changed recommendation
- every case that moved to `unknown`
- every regression in product specificity
Release evals are not enough. Production needs ongoing scoring.
The current logging in server/research/logging.ts and server/researchAgent.ts already captures useful stage boundaries.
Track these fields per run (a sketch of the record shape follows the list):
- run id
- requested input
- canonical subject
- canonical vendor
- official domains
- cache hit / miss / background refresh
- memo length
- guardrail statuses
- confidence values
- evidence counts
- recommendation
- phase timings
- error class and phase
- bundle id and promotion result
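These fields can be captured as one typed record per run. The field names below are illustrative and should be aligned with what server/research/logging.ts already emits.

```ts
// Illustrative shape for one traced production run; align field names with the
// existing logging output rather than treating this as a contract.
interface RunTrace {
  runId: string;
  requestedInput: string;
  canonicalSubject: string;
  canonicalVendor: string;
  officialDomains: string[];
  cacheOutcome: "hit" | "miss" | "background_refresh";
  memoLength: number;
  guardrails: Record<string, { status: string; confidence: number; evidenceCount: number }>;
  recommendation: "green" | "yellow" | "red";
  phaseTimingsMs: Record<string, number>;
  error?: { class: string; phase: string };
  bundleId?: string;
  promotionResult?: "promoted" | "rejected" | "not_applicable";
}
```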
Build a dashboard for:
- total requests
- cache hit rate
- background refresh rate
- refresh promotion rate
- 5xx rate
- 502 rate
- timeout rate
- `unknown` rate by guardrail
- product-drift rate
- average and p95 latency by stage
Sample a subset of live runs every day.
For each sampled run:
- keep the stored report
- keep evidence and metadata
- grade the result offline
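A minimal sampling sketch, assuming the day's run ids are already available and using a placeholder sample size:

```ts
// Hypothetical daily sampler for shadow grading; 25 runs/day is a placeholder.
const DAILY_SAMPLE_SIZE = 25;

function sampleRunIds(runIds: string[], size = DAILY_SAMPLE_SIZE): string[] {
  // Fisher-Yates shuffle, then take the first `size` ids.
  const shuffled = [...runIds];
  for (let i = shuffled.length - 1; i > 0; i--) {
    const j = Math.floor(Math.random() * (i + 1));
    [shuffled[i], shuffled[j]] = [shuffled[j], shuffled[i]];
  }
  return shuffled.slice(0, size);
}
```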
Recommended production shadow grading questions:
- Was the overview accurate?
- Were the guardrail statuses supported by evidence?
- Was the recommendation justified?
- Did the output stay product-specific?
Every week, review:
- low-scoring shadow-graded runs
- all `unknown` outputs from sampled production runs
- all refreshed runs that changed recommendation
- all high-latency failures
The cache is part of product quality now, so it needs its own checks.
Verify:
- repeat requests hit accepted baselines
- aliases and spelling variants converge
- weak candidates do not replace stronger accepted reports
- stronger candidates can replace older baselines
- background refresh respects cooldown
For cache-aware cases, compare:
- baseline bundle evidence counts
- candidate evidence counts
- status changes
- recommendation changes
At minimum, reject automatic promotion when:
- a guardrail falls to `unknown`
- a guardrail loses all evidence
- evidence count regresses materially without compensating strength
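A promotion gate implementing these minimum rules might look like the following sketch; the bundle shape and the 50% evidence-regression cutoff are assumptions, not the current promotion logic.

```ts
// Hypothetical promotion gate for background refresh candidates.
interface GuardrailSnapshot {
  status: string; // e.g. "supported" | "partial" | "unknown"
  evidenceCount: number;
}

type BundleSnapshot = Record<string, GuardrailSnapshot>;

function shouldPromote(baseline: BundleSnapshot, candidate: BundleSnapshot): boolean {
  for (const [guardrail, base] of Object.entries(baseline)) {
    const cand = candidate[guardrail];
    if (!cand) return false;
    // Reject if a previously decided guardrail falls back to unknown.
    if (base.status !== "unknown" && cand.status === "unknown") return false;
    // Reject if a guardrail loses all of its evidence.
    if (base.evidenceCount > 0 && cand.evidenceCount === 0) return false;
    // Reject a material evidence regression (placeholder: more than half lost).
    if (cand.evidenceCount < base.evidenceCount * 0.5) return false;
  }
  return true;
}
```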
Keep a permanent smoke suite for:
- CORS allowlist behavior
- security headers
- minimal public health
- internal endpoint protection
- test endpoint exposure rules
- markdown rendering safety
- report schema validity
These checks should run in CI and against production after deploy.
Recommended incremental additions:
```
docs/
  evaluation-plan.md
evals/
  cases/
  graders/
  reports/
scripts/
  run-release-evals.ts
  run-production-shadow-grading.ts
```
Suggested outputs:
- machine-readable JSON summary
- markdown comparison report for PRs and releases
- production weekly digest for sampled runs
Implement the evaluation program in four phases.
Phase 1:
- create the first 50 to 100 curated cases
- add deterministic schema and security checks
- add branch-vs-main comparison output
Phase 2:
- add model graders for resolution, guardrails, recommendation, and overview quality
- add reviewer workflow for low-score cases
Phase 3:
- store enough production trace data for offline grading
- sample production runs daily
- add dashboards for live metrics and drift
Phase 4:
- generalize dataset and graders so additional guardrails can plug in without redesign
- align eval shapes with the future extensible guardrail architecture
The evaluation system is good when:
- a release can be blocked by real regressions before shipping
- a production drift can be seen before users complain
- the team can explain why a recommendation changed
- the team can distinguish retrieval failures from decision failures
- the team can extend the product without rebuilding the eval system
The next concrete steps for this repo should be:
- add `evals/cases/release-core.jsonl` with the first 20 to 30 curated cases
- add deterministic checks for:
  - schema
  - input validation
  - cache promotion
  - public security surface
- add a release runner that compares candidate branch results against `main`
- add a production shadow-grading job for sampled live runs
- review thresholds after the first two release cycles