Unbiased financial product recommendations — credit cards, EWA, BNPL, savings, loans. Live web search, no affiliate links, interactive.
This repo is a tutorial on building AI products properly: response design → evals → improve → monitor.
```bash
git clone https://github.com/youruser/clarifi.git && npm install && npm run dev
```

The system prompt is the product. Before writing code, decide what "good" looks like: write 5-6 ideal responses by hand. This is where Taste comes from.
What we decided: under 200 words, specific dollar amounts, must surface tradeoffs ("who should skip this"), always search live data, cover all product types not just credit cards.
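Those decisions translate directly into code graders. A minimal sketch (function name, regexes, and thresholds are illustrative, not the repo's actual graders):

```python
import re

# Hypothetical code grader mirroring the response rules above.
def check_response(text: str) -> dict:
    return {
        # Rule: under 200 words
        "under_200_words": len(text.split()) < 200,
        # Rule: specific dollar amounts, e.g. "$95 annual fee"
        "has_dollar_amounts": bool(re.search(r"\$\d[\d,]*(\.\d+)?", text)),
        # Rule: surfaces tradeoffs ("who should skip this")
        "surfaces_tradeoffs": bool(re.search(
            r"skip this|not a good fit|downside|tradeoff", text, re.I)),
    }

sample = ("The Chase Freedom Flex has a $0 annual fee and 5% rotating categories. "
          "Who should skip this: anyone who won't track quarterly activations.")
print(check_response(sample))
```

Checks like these are cheap and deterministic, so they can run on every eval before the (slower, noisier) LLM judge.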
~30 realistic scenarios. Quality > quantity — one sloppy eval sends you chasing the wrong problem.
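One scenario might look like this — the schema (field names, `must_mention` hooks) is an assumption for illustration, not the repo's actual format:

```python
# Illustrative eval scenarios; field names are assumptions, not the repo's schema.
SCENARIOS = [
    {
        "id": "cc_grocery_rewards",
        "category": "credit_cards",
        "prompt": "I spend $800/month on groceries. Which card maximizes rewards?",
        "must_mention": ["annual fee", "$"],   # hooks for code graders
    },
    {
        "id": "bnpl_vs_card",
        "category": "bnpl",
        "prompt": "Is BNPL or a 0% intro APR card better for a $1,200 laptop?",
        "must_mention": ["late fee", "credit"],
    },
]

def by_category(cat: str) -> list[dict]:
    return [s for s in SCENARIOS if s["category"] == cat]

print(by_category("credit_cards")[0]["id"])
```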
Write an LLM-as-Judge rubric (ours weights: specificity 25%, data freshness 20%, tradeoff transparency 20%, brevity 15%, coverage 10%, tone 10%). Then judge the judge — go through the grades manually, tweak until they match your Taste. Sometimes removing a criterion is the fix.
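Aggregating the judge's per-criterion scores with those weights can be sketched as follows (the 0-10 scale and function name are assumptions):

```python
# Weights stated in the rubric above; must sum to 1.0.
WEIGHTS = {
    "specificity": 0.25, "data_freshness": 0.20, "tradeoff_transparency": 0.20,
    "brevity": 0.15, "coverage": 0.10, "tone": 0.10,
}

def weighted_score(scores: dict) -> float:
    # Fail loudly if the judge skipped or invented a criterion.
    assert set(scores) == set(WEIGHTS), "judge must grade every criterion"
    return sum(WEIGHTS[k] * scores[k] for k in WEIGHTS)

scores = {"specificity": 9, "data_freshness": 7, "tradeoff_transparency": 8,
          "brevity": 10, "coverage": 6, "tone": 9}
print(round(weighted_score(scores), 2))
```

Keeping the aggregation outside the judge prompt means rebalancing weights (or deleting a criterion) never requires re-running the judge.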
Impact is asymptotic: each additional lever buys less than the last. Our order: system prompt (0→80% of quality), web search as live RAG (fixed stale data), inline widgets (better UX for context gathering). We skipped multi-agent and fine-tuning: 10x the complexity for marginal gains at this stage.
CI runs evals on every prompt change + weekly (catches silent model drift). Weekly human QA: 10 sampled interactions, rated by the person who wrote the rubric — not outsourced to cheap raters.
Latest from Anthropic's skill-creator 2.0: parallel isolated evals (no context bleed), blind A/B comparison (judge doesn't know which prompt version is which), trigger optimization (60/40 train/test split on skill descriptions).
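The blind-comparison idea can be sketched as: shuffle which prompt version appears as "A" before the judge sees the pair, then unblind after the verdict (`judge_fn` stands in for an LLM call; all names here are illustrative):

```python
import random

# Blind A/B: the judge only ever sees anonymous labels "A" and "B".
def blind_ab(response_v1: str, response_v2: str, judge_fn, rng=random) -> str:
    pair = [("v1", response_v1), ("v2", response_v2)]
    rng.shuffle(pair)  # judge never learns which prompt version is which
    verdict = judge_fn(pair[0][1], pair[1][1])  # judge_fn returns "A" or "B"
    return pair[0][0] if verdict == "A" else pair[1][0]

# Toy judge that prefers the shorter response (stand-in for an LLM judge).
shorter = lambda a, b: "A" if len(a) <= len(b) else "B"
print(blind_ab("terse answer", "a much longer, rambling answer", shorter))
```

Shuffling removes position bias and version-name bias in one move; the winner is still reported in terms of the real version labels.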
User → Claude API + web_search tool → Response + inline widgets
No database. No product JSON. The AI searches current data on every query. Freshness without maintenance.
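That flow can be sketched as a single Messages API call with Anthropic's hosted web search tool. The model name and the `web_search_20250305` tool version string are assumptions — check the current Anthropic docs before copying:

```python
import os

# Sketch of the single-call architecture: one request, hosted web search,
# no local product database. Model/tool strings are assumptions.
request = {
    "model": "claude-sonnet-4-20250514",
    "max_tokens": 1024,
    "tools": [{"type": "web_search_20250305", "name": "web_search", "max_uses": 3}],
    "messages": [{"role": "user",
                  "content": "Best no-annual-fee cash back card right now?"}],
}

if os.environ.get("ANTHROPIC_API_KEY"):
    import anthropic  # pip install anthropic
    reply = anthropic.Anthropic().messages.create(**request)
    print(reply.content)
```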
```bash
pip install pyyaml anthropic
export ANTHROPIC_API_KEY=sk-...
python evals/run_evals.py                          # all 25 scenarios
python evals/run_evals.py --category credit_cards
python evals/run_evals.py --no-llm-judge           # code graders only, faster
```

Results land in `evals/results/<timestamp>.json`. Exit code 1 if the pass rate drops below 80%.
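A minimal sketch of that pass-rate gate — function and field names are illustrative, not the repo's actual runner:

```python
# The runner returns a non-zero exit code below the threshold, so CI fails the build.
def gate(results: list[dict], threshold: float = 0.80) -> int:
    passed = sum(1 for r in results if r["passed"])
    rate = passed / len(results)
    print(f"pass rate: {rate:.0%} (threshold {threshold:.0%})")
    return 0 if rate >= threshold else 1

results = [{"passed": True}] * 19 + [{"passed": False}] * 6  # 19/25 = 76%
exit_code = gate(results)
print(exit_code)
# sys.exit(exit_code) would go here in the real runner
```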
Two CI triggers (`.github/workflows/evals.yml`):

| Trigger | What happens |
|---|---|
| PR touching `prompts/**` | Runs full eval suite; posts pass/fail delta vs `main` as a PR comment |
| Weekly cron (Mon 9am UTC) | Runs full eval suite; opens a GitHub issue if failures exceed 25% — catches silent model drift |
Add ANTHROPIC_API_KEY as a repository secret to enable CI.
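A workflow wiring up both triggers might look like the sketch below — step names, action versions, and the cron string are illustrative, not the repo's actual file:

```yaml
# Sketch of .github/workflows/evals.yml (assumed layout).
name: evals
on:
  pull_request:
    paths: ["prompts/**"]        # PRs touching prompts
  schedule:
    - cron: "0 9 * * 1"          # Mondays 9am UTC
jobs:
  evals:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with: { python-version: "3.12" }
      - run: pip install pyyaml anthropic
      - run: python evals/run_evals.py
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
```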
MIT License.