This document defines the evaluation strategy for ArchReviewAgent.
It has two goals:
- give release confidence before code ships
- explain how to measure quality and drift in production after release
The plan is grounded in the current production architecture described in architecture.md:
- vendor intake and subject resolution
- evidence retrieval
- structured decisioning
- report presentation
- Postgres-backed cache and background refresh
It is also designed to evolve into the future-state architecture in docs/extensible-guardrail-architecture.md.
The evaluation program should follow these principles:
- Evaluate the system by stage, not just by final answer.
- Prefer production-like cases over synthetic prompt trivia.
- Use deterministic checks where possible.
- Use model graders for nuanced judgment calls.
- Keep a small human calibration loop so grader quality does not drift.
- Compare candidate branches against a stable baseline instead of looking only at absolute scores.
- Store enough trace and evidence data to explain why a run changed.
For OpenAI-specific guidance that informed this plan, see:
Before a release, the system should answer:
- Did subject resolution regress?
- Did evidence quality regress?
- Did guardrail verdicts regress?
- Did recommendation quality regress?
- Did security or operational controls regress?
- Did latency or timeout behavior regress materially?
After release, the system should answer:
- Are live results drifting?
- Are `unknown` outcomes increasing?
- Are cache refreshes improving results or degrading them?
- Are product-level requests collapsing to parent-company summaries?
- Are failures concentrated in intake, retrieval, decisioning, or presentation?
The initial evaluation scope is the current product:
- fixed guardrails:
  - EU data residency
  - enterprise deployment
- current contract in shared/contracts.ts
- current staged backend in:
The plan should be extensible so the same framework can support additional guardrails later.
Use three evaluation layers together.
The first layer is deterministic checks: hard assertions that should not be subjective.
Examples:
- input validation rejects malformed or prompt-injection-like inputs
- CORS rejects untrusted origins
- production security headers are present
- public health output stays minimal
- report JSON matches the shared contract
- cache promotion rejects weaker candidates
- official-domain filtering is enforced
These are good release gates because they are stable and cheap.
The second layer is model graders. Use them where the answer is semantic rather than a matter of exact string matching.
Examples:
- Is the product overview accurate?
- Is the EU residency verdict supported by the cited evidence?
- Is the enterprise deployment verdict supported by the cited evidence?
- Is the overall recommendation more optimistic than the evidence justifies?
- Is the answer anchored to the requested product rather than the parent company?
Model graders should score structured criteria, not vague "is this good?" prompts.
The third layer is a small recurring human review loop for:
- newly failing release cases
- all recommendation changes
- all `unknown` outputs
- low-confidence or low-grader-score production samples
Human review should be used to calibrate the graders, not replace them.
The current system should be evaluated by stage.
Stage: vendor intake and subject resolution
Files:
Questions:
- Was the requested subject understood correctly?
- Was product specificity preserved?
- Was the canonical vendor reasonable?
- Were official domains correct?
- Did spelling variants converge to the same cached identity?
Primary metrics:
- subject exact match rate
- canonical vendor match rate
- official-domain precision
- typo and alias convergence rate
- ambiguity handling accuracy
Stage: evidence retrieval
Files:
Questions:
- Did retrieval find usable evidence?
- Was evidence primarily from trusted domains?
- Was evidence sufficient for both guardrails?
- Did retrieval preserve product specificity?
- Did retrieval time out or fail to produce a memo?
Primary metrics:
- memo generation success rate
- memo length distribution
- evidence item count per guardrail
- first-party evidence ratio
- retrieval timeout rate
- retrieval failure rate
Stage: structured decisioning
Files:
Questions:
- Are guardrail statuses correct?
- Is confidence reasonable?
- Is the recommendation justified by the evidence?
- Is the answer more optimistic than the evidence allows?
- Does refresh promotion prevent weaker evidence from replacing stronger evidence?
Primary metrics:
- guardrail status accuracy
- recommendation accuracy
- confidence calibration quality
- recommendation optimism violation rate
- cache promotion acceptance rate
- weak-candidate rejection rate
Stage: report presentation
Files:
Questions:
- Does the final report conform to contract?
- Is the "What this product does" section accurate and non-generic?
- Does markdown render safely and correctly?
- Are citations, risks, questions, and next steps coherent?
Primary metrics:
- schema validity rate
- markdown rendering regression rate
- product overview accuracy
- report completeness score
Stage: operational and security surface
Files:
Questions:
- Are public endpoints limited correctly?
- Are browser hardening headers present?
- Is CORS restricted correctly?
- Are latency and error budgets acceptable?
Primary metrics:
- 5xx rate
- timeout rate
- public endpoint exposure regressions
- security smoke-test pass rate
Create a versioned eval dataset in the repo.
Recommended structure:
```
evals/
  cases/
    release-core.jsonl
    release-edge.jsonl
    security.jsonl
    production-shadow.jsonl
  graders/
    release/
    production/
  reports/
    <timestamp>/
```
The initial case schema and local validator should live at:
Each case should be one JSON object in JSONL format.
Suggested schema:
```json
{
  "id": "fabric-product-specificity",
  "category": "product-vs-parent",
  "input": "Microsoft Fabric",
  "expected_subject": "Microsoft Fabric",
  "expected_vendor": "Microsoft",
  "expected_official_domains": [
    "fabric.microsoft.com",
    "learn.microsoft.com",
    "microsoft.com"
  ],
  "expected_guardrails": {
    "euDataResidency": {
      "status": "supported",
      "allow_equivalents": ["partial"]
    },
    "enterpriseDeployment": {
      "status": "supported"
    }
  },
  "expected_recommendation": "green",
  "allowed_unknowns": [],
  "notes": "Should stay anchored to product, not collapse to generic Microsoft overview."
}
```
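The local validator mentioned above can start as a small standalone script. The sketch below assumes the suggested field names and a hypothetical `validate-cases.ts` entry point; adapt both to the real schema and repo layout.

```ts
// Hypothetical validator for evals/cases/*.jsonl; field names follow the
// suggested case schema above and are not a finalized contract.
import { readFileSync } from "node:fs";

interface EvalCase {
  id: string;
  category: string;
  input: string;
  expected_recommendation: "green" | "yellow" | "red";
  expected_official_domains: string[];
  allowed_unknowns: string[];
}

function validateCase(raw: unknown, line: number): string[] {
  const c = raw as Partial<EvalCase>;
  const errors: string[] = [];
  if (!c.id) errors.push(`line ${line}: missing id`);
  if (!c.input) errors.push(`line ${line}: missing input`);
  if (!Array.isArray(c.expected_official_domains)) {
    errors.push(`line ${line}: expected_official_domains must be an array`);
  }
  if (!["green", "yellow", "red"].includes(c.expected_recommendation ?? "")) {
    errors.push(`line ${line}: expected_recommendation must be green, yellow, or red`);
  }
  return errors;
}

// Usage (illustrative): npx tsx validate-cases.ts evals/cases/release-core.jsonl
const path = process.argv[2];
if (!path) {
  console.error("usage: validate-cases <file.jsonl>");
  process.exit(1);
}
const lines = readFileSync(path, "utf8")
  .split("\n")
  .filter((l) => l.trim().length > 0);
const errors = lines.flatMap((l, i) => {
  try {
    return validateCase(JSON.parse(l), i + 1);
  } catch {
    return [`line ${i + 1}: invalid JSON`];
  }
});
if (errors.length > 0) {
  console.error(errors.join("\n"));
  process.exit(1);
}
console.log(`${lines.length} cases OK`);
```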
Seed the first dataset with at least these categories:
- normal well-documented vendors
- product under large parent company
- typo and alias resolution
- ambiguous names
- prompt-injection attempts
- URL-like inputs
- malformed short inputs
- insufficient-evidence vendors
- non-English documentation
- EU residency supported but plan-scoped
- transfer-law-only cases that should not count as residency support
- enterprise deployment with strong admin features
- enterprise deployment with only marketing language
- stale-doc source rotation
- vendor domain changes or multiple official domains
- known weak cache baseline followed by stronger evidence
- cached report should be reused on repeat lookup
- background refresh should not regress an accepted baseline
- markdown-heavy output rendering
- security surface checks for public endpoints and CORS
These should run first and fail fast; a smoke-test sketch for the security surface items follows the list.
- response parses against shared/contracts.ts
- `researchedAt` is ISO-like and non-empty
- each guardrail has: `status`, `confidence`, `summary`, `risks`, `evidence`
- recommendation is one of `green | yellow | red`
- malformed inputs rejected
- prompt-like or tool-instruction inputs rejected
- inputs shorter than minimum length rejected
- evidence URLs present
- evidence URLs use allowed schemes
- evidence stays within trusted domain rules where policy requires it
- weaker candidates do not replace stronger accepted baselines
- spelling variants converge to the same accepted subject baseline
- refresh can improve a baseline without duplicating cache buckets
- untrusted CORS origins do not get browser approval
- production root has required security headers
- public `/api/health` stays minimal
- `/api/internal/health` is not public without internal authorization
- `/api/chat/test` is not publicly reachable in production
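A post-deploy smoke script for the security surface items might look like the sketch below. The base URL, the header checked, and the expected status codes are assumptions to align with the actual deployment configuration.

```ts
// Hypothetical post-deploy smoke test; SMOKE_BASE_URL, the header names, and the
// expected status codes are placeholders to adapt to the real deployment.
const BASE_URL = process.env.SMOKE_BASE_URL ?? "http://localhost:3000";

let failed = false;
function check(name: string, ok: boolean): void {
  console.log(`${ok ? "ok  " : "FAIL"} ${name}`);
  if (!ok) failed = true;
}

async function main(): Promise<void> {
  const root = await fetch(BASE_URL);
  check("security headers present", root.headers.has("content-security-policy"));

  const health = await fetch(`${BASE_URL}/api/health`);
  const body = (await health.json()) as Record<string, unknown>;
  // Public health output should stay minimal: a status field and little else.
  check("public health stays minimal", Object.keys(body).length <= 2);

  const internal = await fetch(`${BASE_URL}/api/internal/health`);
  check("internal health not public", internal.status === 401 || internal.status === 404);

  const cors = await fetch(`${BASE_URL}/api/health`, {
    headers: { Origin: "https://untrusted.example" },
  });
  check(
    "untrusted origin not allowed",
    cors.headers.get("access-control-allow-origin") !== "https://untrusted.example"
  );

  if (failed) process.exit(1);
}

main();
```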
Model graders should evaluate structured outputs, not free-form vibes.
Grader: product and subject fidelity
Questions:
- Is the report about the requested product, not just the parent company?
- Is the "What this product does" section factual and specific?
- Is the official-domain set reasonable for this subject?
Grader: evidence support
Questions:
- Does the EU residency verdict follow from the cited evidence?
- Does the enterprise deployment verdict follow from the cited evidence?
- Are the cited findings relevant to the claimed guardrail?
- Is the report using generic company text where product-specific evidence exists?
Grader: recommendation calibration
Questions:
- Is the final recommendation too optimistic?
- Is the recommendation too pessimistic?
- Are open questions and next steps appropriate given the evidence quality?
Grader: report coherence
Questions:
- Is the report coherent for an analyst?
- Are risks, unanswered questions, and next steps non-redundant?
- Is the overview concise and accurate?
Use a structured grader response such as:
```json
{
  "pass": true,
  "score": 0.92,
  "reason": "The verdict is well supported by primary vendor evidence.",
  "flags": []
}
```

Recommended flags:
- `product_drift`
- `unsupported_verdict`
- `optimistic_recommendation`
- `generic_overview`
- `thin_evidence`
- `irrelevant_citations`
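For reference, the grader response and flag list above map to a small TypeScript shape that the runner can validate before trusting a model's self-reported score; the names simply mirror the example.

```ts
// Sketch of a structured grader result; field names mirror the JSON example above.
type GraderFlag =
  | "product_drift"
  | "unsupported_verdict"
  | "optimistic_recommendation"
  | "generic_overview"
  | "thin_evidence"
  | "irrelevant_citations";

interface GraderResult {
  pass: boolean;
  score: number; // 0..1
  reason: string;
  flags: GraderFlag[];
}

// Defensively parse a model response before recording it as a grade.
function parseGraderResult(raw: string): GraderResult {
  const parsed = JSON.parse(raw) as GraderResult;
  if (typeof parsed.pass !== "boolean" || typeof parsed.score !== "number") {
    throw new Error("grader response missing required fields");
  }
  return parsed;
}
```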
Review a small sample every week.
Minimum review queue:
- all newly failing release cases
- all cases where recommendation changed from the last baseline
- all cases with any `unknown`
- all production samples with grader score below threshold
- all product-vs-parent failures
Reviewer checklist:
- Was the subject resolved correctly?
- Were trusted domains appropriate?
- Did the guardrail status match the evidence?
- Was the recommendation justified?
- Did the overview stay specific to the product?
Use the review to:
- correct dataset expectations
- improve grader prompts
- identify new case categories
Run release evals against:
- `main` baseline
- candidate branch or release commit
Use the same dataset for both runs and compare results.
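One way to express the comparison is to diff two per-case result summaries and flag any case whose recommendation or guardrail status changed. The result shape below is an illustration, not the repo's existing contract.

```ts
// Hypothetical per-case summary produced by one eval run.
interface CaseResult {
  id: string;
  recommendation: "green" | "yellow" | "red";
  guardrailStatuses: Record<string, string>;
  pass: boolean;
}

interface CaseDiff {
  id: string;
  changes: string[];
}

// Compare a candidate run against the main baseline, case by case.
function compareRuns(baseline: CaseResult[], candidate: CaseResult[]): CaseDiff[] {
  const byId = new Map(baseline.map((r) => [r.id, r]));
  const diffs: CaseDiff[] = [];
  for (const cand of candidate) {
    const base = byId.get(cand.id);
    if (!base) continue; // new case: report separately if needed
    const changes: string[] = [];
    if (base.recommendation !== cand.recommendation) {
      changes.push(`recommendation ${base.recommendation} -> ${cand.recommendation}`);
    }
    for (const [guardrail, status] of Object.entries(cand.guardrailStatuses)) {
      if (base.guardrailStatuses[guardrail] !== status) {
        changes.push(`${guardrail} ${base.guardrailStatuses[guardrail]} -> ${status}`);
      }
    }
    if (base.pass && !cand.pass) changes.push("newly failing");
    if (changes.length > 0) diffs.push({ id: cand.id, changes });
  }
  return diffs;
}
```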
Prefer a production-parallel setup:
- same Postgres-backed cache path enabled
- same OpenAI model snapshot
- same env defaults except secrets and origin settings
For release evals, pin a dated model snapshot instead of a moving alias when practical.
Reason:
- code changes and model changes should be separable
- release comparisons are easier to trust when the model is fixed
Production may use the alias for freshness, but snapshot-to-snapshot comparisons should happen before changing production model settings.
Recommended release thresholds for the current product:
- deterministic checks: 100%
- prompt-injection rejection: 100%
- public security smoke tests: 100%
- subject resolution exact or approved-equivalent: >= 95%
- guardrail status exact or approved-equivalent: >= 90%
- recommendation exact or approved-equivalent: >= 90%
- product overview accuracy: >= 90%
- no material increase in timeout rate
- no material increase in `unknown` rate
These thresholds are intentionally pragmatic. They should tighten as the dataset matures.
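A minimal sketch of how the release runner could encode these gates, assuming a summary object keyed by metric name; the metric names are illustrative and the numbers simply restate the thresholds above.

```ts
// Release gate sketch; metric names should map to the runner's actual summary output.
const RELEASE_THRESHOLDS = {
  deterministicPassRate: 1.0,
  promptInjectionRejectionRate: 1.0,
  securitySmokePassRate: 1.0,
  subjectResolutionRate: 0.95,
  guardrailStatusRate: 0.9,
  recommendationRate: 0.9,
  productOverviewAccuracy: 0.9,
} as const;

type MetricName = keyof typeof RELEASE_THRESHOLDS;

// Returns metrics below their gate; an empty array means the release passes.
function failingGates(summary: Record<MetricName, number>): MetricName[] {
  return (Object.keys(RELEASE_THRESHOLDS) as MetricName[]).filter(
    (metric) => summary[metric] < RELEASE_THRESHOLDS[metric]
  );
}
```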
Require review of:
- every new failing case
- every changed recommendation
- every case that moved to `unknown`
- every regression in product specificity
Release evals are not enough. Production needs ongoing scoring.
The current logging in server/research/logging.ts and server/researchAgent.ts already captures useful stage boundaries.
Track these fields per run (a sketch of the record shape follows the list):
- run id
- requested input
- canonical subject
- canonical vendor
- official domains
- cache hit / miss / background refresh
- memo length
- guardrail statuses
- confidence values
- evidence counts
- recommendation
- phase timings
- error class and phase
- bundle id and promotion result
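These fields can be captured as one typed record per run. The field names below are illustrative and should be aligned with what server/research/logging.ts already emits.

```ts
// Illustrative shape for one traced production run; align field names with the
// existing logging output rather than treating this as a contract.
interface RunTrace {
  runId: string;
  requestedInput: string;
  canonicalSubject: string;
  canonicalVendor: string;
  officialDomains: string[];
  cacheOutcome: "hit" | "miss" | "background_refresh";
  memoLength: number;
  guardrails: Record<string, { status: string; confidence: number; evidenceCount: number }>;
  recommendation: "green" | "yellow" | "red";
  phaseTimingsMs: Record<string, number>;
  error?: { class: string; phase: string };
  bundleId?: string;
  promotionResult?: "promoted" | "rejected" | "not_applicable";
}
```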
Build a dashboard for:
- total requests
- cache hit rate
- background refresh rate
- refresh promotion rate
- 5xx rate
- 502 rate
- timeout rate
- `unknown` rate by guardrail
- product-drift rate
- average and p95 latency by stage
Sample a subset of live runs every day.
For each sampled run:
- keep the stored report
- keep evidence and metadata
- grade the result offline
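A minimal sampling sketch, assuming the day's run ids are already available and using a placeholder sample size:

```ts
// Hypothetical daily sampler for shadow grading; 25 runs/day is a placeholder.
const DAILY_SAMPLE_SIZE = 25;

function sampleRunIds(runIds: string[], size = DAILY_SAMPLE_SIZE): string[] {
  // Fisher-Yates shuffle, then take the first `size` ids.
  const shuffled = [...runIds];
  for (let i = shuffled.length - 1; i > 0; i--) {
    const j = Math.floor(Math.random() * (i + 1));
    [shuffled[i], shuffled[j]] = [shuffled[j], shuffled[i]];
  }
  return shuffled.slice(0, size);
}
```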
Recommended production shadow grading questions:
- Was the overview accurate?
- Were the guardrail statuses supported by evidence?
- Was the recommendation justified?
- Did the output stay product-specific?
Every week, review:
- low-scoring shadow-graded runs
- all `unknown` outputs from sampled production runs
- all refreshed runs that changed recommendation
- all high-latency failures
The cache is part of product quality now, so it needs its own checks.
Verify:
- repeat requests hit accepted baselines
- aliases and spelling variants converge
- weak candidates do not replace stronger accepted reports
- stronger candidates can replace older baselines
- background refresh respects cooldown
For cache-aware cases, compare:
- baseline bundle evidence counts
- candidate evidence counts
- status changes
- recommendation changes
At minimum, reject automatic promotion when:
- a guardrail falls to `unknown`
- a guardrail loses all evidence
- evidence count regresses materially without compensating strength
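A promotion gate implementing these minimum rules might look like the following sketch; the bundle shape and the 50% evidence-regression cutoff are assumptions, not the current promotion logic.

```ts
// Hypothetical promotion gate for background refresh candidates.
interface GuardrailSnapshot {
  status: string; // e.g. "supported" | "partial" | "unknown"
  evidenceCount: number;
}

type BundleSnapshot = Record<string, GuardrailSnapshot>;

function shouldPromote(baseline: BundleSnapshot, candidate: BundleSnapshot): boolean {
  for (const [guardrail, base] of Object.entries(baseline)) {
    const cand = candidate[guardrail];
    if (!cand) return false;
    // Reject if a previously decided guardrail falls back to unknown.
    if (base.status !== "unknown" && cand.status === "unknown") return false;
    // Reject if a guardrail loses all of its evidence.
    if (base.evidenceCount > 0 && cand.evidenceCount === 0) return false;
    // Reject a material evidence regression (placeholder: more than half lost).
    if (cand.evidenceCount < base.evidenceCount * 0.5) return false;
  }
  return true;
}
```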
Keep a permanent smoke suite for:
- CORS allowlist behavior
- security headers
- minimal public health
- internal endpoint protection
- test endpoint exposure rules
- markdown rendering safety
- report schema validity
These checks should run in CI and against production after deploy.
Recommended incremental additions:
```
docs/
  evaluation-plan.md
evals/
  cases/
  graders/
  reports/
scripts/
  run-release-evals.ts
  run-production-shadow-grading.ts
```
Suggested outputs:
- machine-readable JSON summary
- markdown comparison report for PRs and releases
- production weekly digest for sampled runs
Implement the evaluation program in four phases.
Phase 1:
- create the first 50 to 100 curated cases
- add deterministic schema and security checks
- add branch-vs-main comparison output
Phase 2:
- add model graders for resolution, guardrails, recommendation, and overview quality
- add reviewer workflow for low-score cases
Phase 3:
- store enough production trace data for offline grading
- sample production runs daily
- add dashboards for live metrics and drift
Phase 4:
- generalize dataset and graders so additional guardrails can plug in without redesign
- align eval shapes with the future extensible guardrail architecture
The evaluation system is good when:
- a release can be blocked by real regressions before shipping
- a production drift can be seen before users complain
- the team can explain why a recommendation changed
- the team can distinguish retrieval failures from decision failures
- the team can extend the product without rebuilding the eval system
The next concrete steps for this repo should be:
- add `evals/cases/release-core.jsonl` with the first 20 to 30 curated cases
- add deterministic checks for:
  - schema
  - input validation
  - cache promotion
  - public security surface
- add a release runner that compares candidate branch results against `main`
- add a production shadow-grading job for sampled live runs
- review thresholds after the first two release cycles