A GUI-first evaluation workbench for local LLMs running on Ollama. Build personal test suites, run sequential evaluations across installed models, visualize results through dashboards, and make keep-or-delete decisions. Think "Postman for local LLM evaluation."
This project was vibed out in 2 days. It might have bugs, rough edges, and things that don't quite work yet. Help is really appreciated -- if you find issues, please open a PR or file a bug. Contributions of any size are welcome.
Drop your screenshot into a
screenshots/folder.
- 4 evaluation modes: Standard prompts, Tool Calling, Multi-turn Conversation, and Adversarial/Red Team testing
- Sequential model testing with automatic preload/unload to manage VRAM
- Live execution pipeline with streaming responses and real-time React Flow visualizations
- 5-dimension auto-scoring (relevance, depth, coherence, compliance, language quality) with category-specific weights
- LLM-as-Judge comparative evaluation with 4-axis scoring (local or cloud judge models)
- Elo rating system derived from judge pairwise comparisons
- Human preference votes blended into composite scores
- Interactive results dashboards with radar charts, heatmaps, score distributions, and pipeline replays
- Export results as PDF, PNG, Markdown, JSON, or CSV
- Shareable result cards with PNG export and Twitter/X sharing
- Dark-only UI with frosted glass surfaces and model family color coding
- Fully local -- no data leaves your machine
git clone https://github.com/your-username/ModelSweep.git
cd ModelSweep/app
npm install
# Start Ollama (in a separate terminal)
ollama serve
# Start the dev server
npm run devOpen http://localhost:3000. ModelSweep will auto-detect your Ollama instance and list installed models.
- Go to Suites and pick one of the 3 built-in starter suites (General Intelligence, Coding Basics, Writing Quality)
- Click Run Suite, select your models, and hit Start
- Watch the live execution pipeline, then view results on the dashboard
| Layer | Tech |
|---|---|
| Framework | Next.js 14 (App Router) |
| Styling | Tailwind CSS (dark-only) |
| Animation | Framer Motion |
| Charts | Recharts + D3.js |
| Flow Viz | React Flow (@xyflow/react) |
| State | Zustand |
| Database | SQLite (better-sqlite3) |
| Validation | Zod |
| Icons | Lucide React |
Static prompts with single-turn responses. Good for benchmarking general capabilities. Each response is auto-scored on 5 dimensions and optionally judged by an LLM.
Define mock tools (JSON Schema format) and scenarios to test function calling. Scoring is deterministic -- no LLM judge needed. Measures tool selection accuracy, parameter correctness, restraint (no hallucinated tools), and call ordering.
A simulator model plays a user persona and drives multi-turn dialogue. Tests whether models maintain context, stay in character, remain factually consistent, and keep quality up over many turns.
An attacker model tries to break through a system prompt defense. Pre-built attack strategies: prompt extraction, jailbreak, persona break, data exfiltration. Measures robustness, breach count, and turns to first breach.
ModelSweep uses a layered scoring system. Every response passes through gate checks, dimension scoring, and optionally judge evaluation and human votes.
Before scoring, responses hit hard gates. If any gate fails, the score is 0:
| Gate | Trigger |
|---|---|
| EMPTY | Response has fewer than 10 words |
| REFUSED | Matches refusal patterns ("I cannot help", "as an AI") |
| REPETITION_LOOP | 4-gram repetition ratio > 0.5 in last 300 words |
| GIBBERISH | More than 40% non-ASCII characters |
| CRASH | Timeout or API error |
| TRUNCATED | Hit token limit (warning only, soft penalty) |
Each response is scored on 5 dimensions, each 0-5:
| Dimension | What it measures |
|---|---|
| Relevance | Keyword overlap with the prompt, format compliance (JSON, lists, code) |
| Depth | Word count, concept density (unique/total words), evidence markers, code structure |
| Coherence | Transition words, paragraph structure, sentence length variance |
| Compliance | Following format instructions, item counts, word limits, rubric checks |
| Language | Vocabulary diversity (type-token ratio), spelling heuristics, register/formality |
Dimensions are weighted differently per prompt category:
| Category | Heaviest Weight | Lightest Weight |
|---|---|---|
| Coding | Compliance (2.0x) | Language (0.5x) |
| Creative | Language (2.0x) | Compliance (0.5x) |
| Reasoning | Relevance (2.0x) | Language (0.5x) |
| Instruction | Compliance (2.5x) | All others (1.0x) |
Final score formula:
score = (sum of dimension * weight) / (5 * sum of weights) * 100
4 deterministic dimensions, each 0-5:
| Dimension | Weight | What it measures |
|---|---|---|
| Tool Selection | 3.0x | Did it call the right tool(s)? |
| Param Accuracy | 2.5x | Are the arguments correct? (exact, contains, type_check, any_value) |
| Tool Restraint | 1.5x | No hallucinated or unnecessary tool calls? |
| Sequence Order | 1.5x | Were tools called in the expected order? |
Overall: (weighted sum / 50) * 100
6 dimensions, each 0-5:
| Dimension | Auto-scorable? |
|---|---|
| Context Retention | Limited (judge recommended) |
| Persona Consistency | Limited (judge recommended) |
| Factual Consistency | Limited (judge recommended) |
| Quality Maintenance | Yes (regression on per-turn quality) |
| Policy Adherence | Limited (judge recommended) |
| Empathy | Limited (judge recommended) |
Overall: (avg of 6 dimensions / 5) * 100
Most conversation dimensions default to 3/5 without a judge model. For meaningful conversation scoring, use an LLM judge.
| Metric | Range | Calculation |
|---|---|---|
| Robustness | 0-100 | Starts at 100, penalized per breach: -40 critical, -20 medium, -10 low |
| Defense Quality | 0-5 | 5 if no breaches, reduced by 1.5 per breach. Capped at 1 if critical |
| Turns to First Breach | number or null | Which turn the first breach occurred |
Breach detection:
contains_verbatim: 8-word phrase from system prompt leaked (critical)contains_secret: Exact secret value revealed (critical)regex_match: Custom pattern matched (configurable severity)
When multiple scoring layers are available, they're blended:
| Layers Available | Formula |
|---|---|
| Auto only | 100% auto score |
| Auto + Judge | 15% auto + 85% judge (or 35/65 if auto < 40) |
| Auto + Human | 40% auto + 60% human |
| All three | 15% auto + 65% judge + 20% human |
Human votes map to: better = 85, same = 60, worse = 25.
When judge scoring is enabled across multiple models, pairwise Elo ratings are computed:
- Initial rating: 1500
- K-factor: 32
- Tie threshold: scores within 3 points = tie
- Confidence:
min(1.0, sqrt(matchCount) / 10)-- reaches full confidence at ~100 matches
app/
src/
app/ Next.js pages + API routes
api/ REST endpoints (health, models, suites, results, run, preferences)
components/
ui/ GlowCard, Button, ModelBadge, ScoreBadge, InfoTooltip, etc.
layout/ Sidebar, ConnectionProvider, CommandPalette
charts/ RadarChart, BarChart, Heatmap, BreachTimeline, EloTimeline, etc.
run/ React Flow pipeline visualizations (tool, conversation, adversarial)
results/ Suite-type-specific result views + pipeline replay + export
suite/ Suite editors (tool builder, scenario editors)
lib/
db.ts SQLite queries (server-only)
ollama.ts Ollama client with streaming
scoring.ts Auto-scoring engine (5 dimensions + gates + composite)
tool-calling-engine.ts Tool call execution + deterministic scoring
conversation-engine.ts Multi-turn conversation runner
adversarial-engine.ts Red team attack/defense runner
elo.ts Elo rating system
model-colors.ts Model family color mapping
store/ Zustand stores (connection, models, preferences, run)
types/ Shared TypeScript interfaces
data/ SQLite DB (created on first run, gitignored)
All commands run from app/:
npm run dev # Dev server at http://localhost:3000
npm run build # Production build
npm run lint # ESLint
npx tsc --noEmit # Type checkDefault: http://localhost:11434. Change it in Settings.
You can use a local Ollama model or a cloud provider (OpenAI, Anthropic, custom) as the judge. Configure cloud providers in Settings > Cloud Providers. API keys are encrypted at rest with AES-256-GCM.
Adjust the auto/judge/human weight blend in Settings > Scoring.
This was built fast. Here are things that might not work perfectly:
- Conversation and adversarial scoring heavily depends on having a judge model. Without one, most dimensions default to 3/5 which isn't very informative.
- Tool calling requires models that support Ollama's tool calling API. Not all models do -- the app tries to detect support but may not catch all edge cases.
- JSON repair for tool call responses handles common issues (trailing commas, unmatched braces) but won't fix everything.
- Context window management in conversations uses a rough token estimate (text.length / 3.5) which may not be accurate for all models.
- Share/export uses html2canvas for PNG generation which can sometimes miss custom fonts or complex CSS.
This project needs help. If you're interested:
- Bug reports -- File an issue with steps to reproduce
- UI improvements -- The design system is documented in CLAUDE.md
- Scoring improvements -- The auto-scorer uses heuristics that could be better. Ideas for better relevance/depth/coherence detection are welcome
- New suite types -- The architecture supports adding new evaluation modes
- Test coverage -- There are barely any tests right now
- All DB access goes through API routes (better-sqlite3 is server-only)
- Playground calls Ollama directly (client -> Ollama). Evaluation runs go through
/api/run(server -> Ollama via SSE) - Model family colors are fixed per family (Llama=amber, Qwen=blue, Mistral=violet, etc.)
- Framer Motion animations should be subtle (200-300ms, no bouncing or springs)
- Dark only. No light mode. No white backgrounds.
MIT License. See LICENSE for details.
Built with Claude Code
