sanaships/clarifi


ClariFi

Unbiased financial product recommendations — credit cards, EWA, BNPL, savings, loans. Live web search, no affiliate links, interactive.

This repo is a tutorial on building AI products properly: response design → evals → improve → monitor.

git clone https://github.com/youruser/clarifi.git && npm install && npm run dev

The process

1. Design the response, not the UI

The system prompt is the product. Before writing code, decide what "good" looks like — write 5-6 ideal responses by hand. This is where Taste comes from.

What we decided: under 200 words, specific dollar amounts, must surface tradeoffs ("who should skip this"), always search live data, cover all product types not just credit cards.

2. Set up evals before you build

~30 realistic scenarios. Quality > quantity — one sloppy eval sends you chasing the wrong problem.

Write an LLM-as-Judge rubric (our weighting: specificity 25%, data freshness 20%, tradeoff transparency 20%, brevity 15%, coverage 10%, tone 10%). Then judge the judge: go through the grades manually and tweak until they match your Taste. Sometimes removing a criterion is the fix.
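The weighted rubric score is a straight dot product of weights and per-criterion grades. A sketch using the weights from the text; the 1-5 grades are illustrative, since the judge model would assign them:

```python
# Rubric weights as stated above; they sum to 1.0 so a perfect
# response scores 5.0 on a 1-5 scale.
WEIGHTS = {
    "specificity": 0.25,
    "data_freshness": 0.20,
    "tradeoff_transparency": 0.20,
    "brevity": 0.15,
    "coverage": 0.10,
    "tone": 0.10,
}

def weighted_score(grades: dict) -> float:
    """Combine per-criterion judge grades into one score."""
    assert set(grades) == set(WEIGHTS)
    return sum(WEIGHTS[k] * grades[k] for k in WEIGHTS)

grades = {"specificity": 5, "data_freshness": 4, "tradeoff_transparency": 5,
          "brevity": 3, "coverage": 4, "tone": 5}
print(round(weighted_score(grades), 2))  # prints 4.4
```

Keeping the weights in one dict makes "removing a criterion" a one-line change you can re-run the whole suite against.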

3. Improve with the cheapest lever first

Impact is asymptotic. Our order: system prompt (0→80%), web search as live RAG (fixed stale data), inline widgets (better UX for context gathering). We skipped multi-agent and fine-tuning — 10x complexity for marginal gains at this stage.

4. Monitor because the world changes

CI runs evals on every prompt change + weekly (catches silent model drift). Weekly human QA: 10 sampled interactions, rated by the person who wrote the rubric — not outsourced to cheap raters.

Latest from Anthropic's skill-creator 2.0: parallel isolated evals (no context bleed), blind A/B comparison (judge doesn't know which prompt version is which), trigger optimization (60/40 train/test split on skill descriptions).
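Blind A/B comparison is cheap to implement: randomize which prompt version's output appears as "A" so the judge can't systematically favor one side. A hypothetical sketch:

```python
import random

def blind_pair(resp_old: str, resp_new: str, rng: random.Random):
    """Return (judge_payload, key) where key records which label
    holds the new prompt's response. Hypothetical helper."""
    if rng.random() < 0.5:
        return {"A": resp_old, "B": resp_new}, "B"
    return {"A": resp_new, "B": resp_old}, "A"

rng = random.Random(0)
payload, new_side = blind_pair("old answer", "new answer", rng)
# the judge sees only payload; you unblind with new_side afterwards
assert payload[new_side] == "new answer"
```

The key stays out of the judge prompt entirely; you only consult it when tallying wins per version.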


Architecture

User → Claude API + web_search tool → Response + inline widgets

No database. No product JSON. The AI searches current data on every query. Freshness without maintenance.
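The one-hop flow above amounts to a single Messages API call per query. A minimal sketch of the payload; the model name and system prompt are placeholders, and the web_search tool parameters should be checked against Anthropic's current docs:

```python
SYSTEM_PROMPT = "You are ClariFi, an unbiased financial product advisor."  # stand-in

def build_request(query: str) -> dict:
    """Assemble the API payload; note there is no database or product JSON."""
    return {
        "model": "claude-sonnet-4-20250514",   # assumption: any tool-capable model
        "max_tokens": 1024,
        "system": SYSTEM_PROMPT,
        "tools": [{
            "type": "web_search_20250305",     # Anthropic's hosted web search tool
            "name": "web_search",
            "max_uses": 5,
        }],
        "messages": [{"role": "user", "content": query}],
    }

# With the anthropic SDK this would be sent as:
#   client.messages.create(**build_request("best no-fee cash-back card?"))
```

Because the tool runs server-side, "freshness without maintenance" is literal: there is no scraper, cache, or product table to keep current.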


Automated evals

pip install pyyaml anthropic
export ANTHROPIC_API_KEY=sk-...
python evals/run_evals.py                        # all scenarios
python evals/run_evals.py --category credit_cards
python evals/run_evals.py --no-llm-judge         # code graders only, faster

Results land in evals/results/<timestamp>.json. Exit code 1 if pass rate drops below 80%.
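The 80% gate is a small aggregation over per-scenario results. This sketch is illustrative; evals/run_evals.py may structure its results differently:

```python
def gate(results: list, threshold: float = 0.80) -> int:
    """Return the process exit code: 0 if pass rate meets the
    threshold, 1 otherwise. (Illustrative sketch.)"""
    passed = sum(1 for r in results if r["passed"])
    rate = passed / len(results)
    print(f"pass rate: {rate:.0%} ({passed}/{len(results)})")
    return 0 if rate >= threshold else 1

sample = [{"id": i, "passed": i % 5 != 0} for i in range(25)]  # 20 of 25 pass
assert gate(sample) == 0   # exactly 80% still passes
```

A nonzero exit code is what lets CI fail the PR without any extra wiring.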

Two CI triggers (.github/workflows/evals.yml):

- PR touching prompts/**: runs the full eval suite; posts the pass/fail delta vs main as a PR comment.
- Weekly cron (Mon 9am UTC): runs the full eval suite; opens a GitHub issue if failures exceed 25%, catching silent model drift.

Add ANTHROPIC_API_KEY as a repository secret to enable CI.


MIT License.
