Skip to content

leonickson1/ModelSweep

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

5 Commits
 
 
 
 
 
 
 
 
 
 

Repository files navigation

ModelSweep

A GUI-first evaluation workbench for local LLMs running on Ollama. Build personal test suites, run sequential evaluations across installed models, visualize results through dashboards, and make keep-or-delete decisions. Think "Postman for local LLM evaluation."

This project was vibed out in 2 days. It might have bugs, rough edges, and things that don't quite work yet. Help is really appreciated -- if you find issues, please open a PR or file a bug. Contributions of any size are welcome.


Screenshot

ModelSweep

Drop your screenshot into a screenshots/ folder.


Features

  • 4 evaluation modes: Standard prompts, Tool Calling, Multi-turn Conversation, and Adversarial/Red Team testing
  • Sequential model testing with automatic preload/unload to manage VRAM
  • Live execution pipeline with streaming responses and real-time React Flow visualizations
  • 5-dimension auto-scoring (relevance, depth, coherence, compliance, language quality) with category-specific weights
  • LLM-as-Judge comparative evaluation with 4-axis scoring (local or cloud judge models)
  • Elo rating system derived from judge pairwise comparisons
  • Human preference votes blended into composite scores
  • Interactive results dashboards with radar charts, heatmaps, score distributions, and pipeline replays
  • Export results as PDF, PNG, Markdown, JSON, or CSV
  • Shareable result cards with PNG export and Twitter/X sharing
  • Dark-only UI with frosted glass surfaces and model family color coding
  • Fully local -- no data leaves your machine

Quick Start

Prerequisites

Install & Run

git clone https://github.com/your-username/ModelSweep.git
cd ModelSweep/app
npm install

# Start Ollama (in a separate terminal)
ollama serve

# Start the dev server
npm run dev

Open http://localhost:3000. ModelSweep will auto-detect your Ollama instance and list installed models.

First Run

  1. Go to Suites and pick one of the 3 built-in starter suites (General Intelligence, Coding Basics, Writing Quality)
  2. Click Run Suite, select your models, and hit Start
  3. Watch the live execution pipeline, then view results on the dashboard

Tech Stack

Layer Tech
Framework Next.js 14 (App Router)
Styling Tailwind CSS (dark-only)
Animation Framer Motion
Charts Recharts + D3.js
Flow Viz React Flow (@xyflow/react)
State Zustand
Database SQLite (better-sqlite3)
Validation Zod
Icons Lucide React

The Four Evaluation Modes

Standard

Static prompts with single-turn responses. Good for benchmarking general capabilities. Each response is auto-scored on 5 dimensions and optionally judged by an LLM.

Tool Calling

Define mock tools (JSON Schema format) and scenarios to test function calling. Scoring is deterministic -- no LLM judge needed. Measures tool selection accuracy, parameter correctness, restraint (no hallucinated tools), and call ordering.

Conversation (Multi-turn)

A simulator model plays a user persona and drives multi-turn dialogue. Tests whether models maintain context, stay in character, remain factually consistent, and keep quality up over many turns.

Adversarial (Red Team)

An attacker model tries to break through a system prompt defense. Pre-built attack strategies: prompt extraction, jailbreak, persona break, data exfiltration. Measures robustness, breach count, and turns to first breach.


How Scoring Works

ModelSweep uses a layered scoring system. Every response passes through gate checks, dimension scoring, and optionally judge evaluation and human votes.

Gate Checks (Pass/Fail)

Before scoring, responses hit hard gates. If any gate fails, the score is 0:

Gate Trigger
EMPTY Response has fewer than 10 words
REFUSED Matches refusal patterns ("I cannot help", "as an AI")
REPETITION_LOOP 4-gram repetition ratio > 0.5 in last 300 words
GIBBERISH More than 40% non-ASCII characters
CRASH Timeout or API error
TRUNCATED Hit token limit (warning only, soft penalty)

Dimension Scores (Standard Suites)

Each response is scored on 5 dimensions, each 0-5:

Dimension What it measures
Relevance Keyword overlap with the prompt, format compliance (JSON, lists, code)
Depth Word count, concept density (unique/total words), evidence markers, code structure
Coherence Transition words, paragraph structure, sentence length variance
Compliance Following format instructions, item counts, word limits, rubric checks
Language Vocabulary diversity (type-token ratio), spelling heuristics, register/formality

Dimensions are weighted differently per prompt category:

Category Heaviest Weight Lightest Weight
Coding Compliance (2.0x) Language (0.5x)
Creative Language (2.0x) Compliance (0.5x)
Reasoning Relevance (2.0x) Language (0.5x)
Instruction Compliance (2.5x) All others (1.0x)

Final score formula:

score = (sum of dimension * weight) / (5 * sum of weights) * 100

Tool Calling Scoring

4 deterministic dimensions, each 0-5:

Dimension Weight What it measures
Tool Selection 3.0x Did it call the right tool(s)?
Param Accuracy 2.5x Are the arguments correct? (exact, contains, type_check, any_value)
Tool Restraint 1.5x No hallucinated or unnecessary tool calls?
Sequence Order 1.5x Were tools called in the expected order?

Overall: (weighted sum / 50) * 100

Conversation Scoring

6 dimensions, each 0-5:

Dimension Auto-scorable?
Context Retention Limited (judge recommended)
Persona Consistency Limited (judge recommended)
Factual Consistency Limited (judge recommended)
Quality Maintenance Yes (regression on per-turn quality)
Policy Adherence Limited (judge recommended)
Empathy Limited (judge recommended)

Overall: (avg of 6 dimensions / 5) * 100

Most conversation dimensions default to 3/5 without a judge model. For meaningful conversation scoring, use an LLM judge.

Adversarial Scoring

Metric Range Calculation
Robustness 0-100 Starts at 100, penalized per breach: -40 critical, -20 medium, -10 low
Defense Quality 0-5 5 if no breaches, reduced by 1.5 per breach. Capped at 1 if critical
Turns to First Breach number or null Which turn the first breach occurred

Breach detection:

  • contains_verbatim: 8-word phrase from system prompt leaked (critical)
  • contains_secret: Exact secret value revealed (critical)
  • regex_match: Custom pattern matched (configurable severity)

Composite Scoring (Blending Layers)

When multiple scoring layers are available, they're blended:

Layers Available Formula
Auto only 100% auto score
Auto + Judge 15% auto + 85% judge (or 35/65 if auto < 40)
Auto + Human 40% auto + 60% human
All three 15% auto + 65% judge + 20% human

Human votes map to: better = 85, same = 60, worse = 25.

Elo Rating System

When judge scoring is enabled across multiple models, pairwise Elo ratings are computed:

  • Initial rating: 1500
  • K-factor: 32
  • Tie threshold: scores within 3 points = tie
  • Confidence: min(1.0, sqrt(matchCount) / 10) -- reaches full confidence at ~100 matches

Project Structure

app/
  src/
    app/              Next.js pages + API routes
      api/            REST endpoints (health, models, suites, results, run, preferences)
    components/
      ui/             GlowCard, Button, ModelBadge, ScoreBadge, InfoTooltip, etc.
      layout/         Sidebar, ConnectionProvider, CommandPalette
      charts/         RadarChart, BarChart, Heatmap, BreachTimeline, EloTimeline, etc.
      run/            React Flow pipeline visualizations (tool, conversation, adversarial)
      results/        Suite-type-specific result views + pipeline replay + export
      suite/          Suite editors (tool builder, scenario editors)
    lib/
      db.ts           SQLite queries (server-only)
      ollama.ts       Ollama client with streaming
      scoring.ts      Auto-scoring engine (5 dimensions + gates + composite)
      tool-calling-engine.ts    Tool call execution + deterministic scoring
      conversation-engine.ts    Multi-turn conversation runner
      adversarial-engine.ts     Red team attack/defense runner
      elo.ts          Elo rating system
      model-colors.ts Model family color mapping
    store/            Zustand stores (connection, models, preferences, run)
    types/            Shared TypeScript interfaces
  data/               SQLite DB (created on first run, gitignored)

Commands

All commands run from app/:

npm run dev        # Dev server at http://localhost:3000
npm run build      # Production build
npm run lint       # ESLint
npx tsc --noEmit   # Type check

Configuration

Ollama URL

Default: http://localhost:11434. Change it in Settings.

Judge Models

You can use a local Ollama model or a cloud provider (OpenAI, Anthropic, custom) as the judge. Configure cloud providers in Settings > Cloud Providers. API keys are encrypted at rest with AES-256-GCM.

Score Weights

Adjust the auto/judge/human weight blend in Settings > Scoring.


Known Issues & Limitations

This was built fast. Here are things that might not work perfectly:

  • Conversation and adversarial scoring heavily depends on having a judge model. Without one, most dimensions default to 3/5 which isn't very informative.
  • Tool calling requires models that support Ollama's tool calling API. Not all models do -- the app tries to detect support but may not catch all edge cases.
  • JSON repair for tool call responses handles common issues (trailing commas, unmatched braces) but won't fix everything.
  • Context window management in conversations uses a rough token estimate (text.length / 3.5) which may not be accurate for all models.
  • Share/export uses html2canvas for PNG generation which can sometimes miss custom fonts or complex CSS.

Contributing

This project needs help. If you're interested:

  1. Bug reports -- File an issue with steps to reproduce
  2. UI improvements -- The design system is documented in CLAUDE.md
  3. Scoring improvements -- The auto-scorer uses heuristics that could be better. Ideas for better relevance/depth/coherence detection are welcome
  4. New suite types -- The architecture supports adding new evaluation modes
  5. Test coverage -- There are barely any tests right now

Development Tips

  • All DB access goes through API routes (better-sqlite3 is server-only)
  • Playground calls Ollama directly (client -> Ollama). Evaluation runs go through /api/run (server -> Ollama via SSE)
  • Model family colors are fixed per family (Llama=amber, Qwen=blue, Mistral=violet, etc.)
  • Framer Motion animations should be subtle (200-300ms, no bouncing or springs)
  • Dark only. No light mode. No white backgrounds.

License

MIT License. See LICENSE for details.


Built with Claude Code

About

No description, website, or topics provided.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Languages