A unified benchmarking platform for evaluating memory providers, RAG systems, and context management solutions. Inspired by MTEB and SWE-bench, Superbench enables fair, reproducible comparisons across different memory architectures.
Superbench is designed to answer the question: "How well does this system use context to provide correct answers?"
Unlike traditional retrieval benchmarks that measure recall/precision, Superbench focuses on end-to-end correctness and memory-enabled success - metrics that matter for production memory systems.
- **Memory-focused metrics**: Accuracy, Success@K, F1 (not just recall/precision)
- **Pluggable providers**: Easy to add new memory/RAG providers
- **Multiple benchmarks**: RAG template, LongMemEval, LoCoMo support
- **LLM-as-a-Judge**: Automated evaluation using language models
- **Performance tracking**: Latency, token usage, cost metrics
- **Resumable runs**: Checkpointing for long evaluations
- **Clean CLI**: Table-formatted results, export to JSON/CSV
```bash
# Install dependencies
bun install

# Link the CLI globally
bun link
```

- Copy the environment template:

  ```bash
  cp .env.example .env
  ```

- Add your API keys to `.env`:
  - `OPENROUTER_API_KEY` - Required for LLM evaluation (OpenRouter)
  - `ANTHROPIC_API_KEY` - Optional (for direct Anthropic access)
  - `VOYAGE_API_KEY` - Required for AQRAG embeddings
  - `DATABASE_URL` - PostgreSQL connection string (for PostgreSQL-based providers)
  - `GOOGLE_GENERATIVE_AI_API_KEY` - For Google embeddings (ContextualRetrieval)
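For reference, a filled-in `.env` might look like this (every value below is a placeholder, not a real credential or connection string):

```dotenv
OPENROUTER_API_KEY=your-openrouter-key
ANTHROPIC_API_KEY=your-anthropic-key
VOYAGE_API_KEY=your-voyage-key
DATABASE_URL=postgres://user:password@localhost:5432/superbench
GOOGLE_GENERATIVE_AI_API_KEY=your-google-key
```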
Run the RAG template benchmark against AQRAG:

```bash
superbench eval --benchmarks rag-template --providers aqrag --metrics accuracy f1 success_at_5 --limit 10
```

This will:
- Load 10 questions from the RAG template benchmark
- Ingest document contexts into the AQRAG provider
- Search for relevant information for each question
- Generate answers using the LLM
- Evaluate correctness using an LLM judge
- Compute metrics (accuracy, F1, Success@K)
- Display results in a clean table
- Save results to SQLite database
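The pipeline above can be sketched as a simple loop. This is an illustrative outline only (the function names and signatures are hypothetical; the real runner in `core/runner.ts` also handles ingestion, checkpointing, and latency/cost tracking):

```typescript
// Illustrative evaluation loop; names and signatures are hypothetical.
type Item = { id: string; question: string; expectedAnswer: string };

async function runBenchmark(
  items: Item[],
  search: (query: string, k: number) => Promise<string[]>,
  answer: (question: string, context: string[]) => Promise<string>,
  judge: (expected: string, generated: string) => Promise<boolean>,
) {
  const results: { itemId: string; correct: boolean }[] = [];
  for (const item of items) {
    const hits = await search(item.question, 5);          // retrieve top-K context
    const generated = await answer(item.question, hits);  // LLM generates an answer
    const correct = await judge(item.expectedAnswer, generated); // LLM-as-a-judge
    results.push({ itemId: item.id, correct });
  }
  const accuracy = results.filter((r) => r.correct).length / results.length;
  return { results, accuracy };
}
```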
```bash
# List recent runs
superbench results

# View specific run
superbench results <runId>

# Export to JSON
superbench export <runId> --format json -o results.json
```

| Provider | Type | Description |
|---|---|---|
| code-chunk-ast | Local | AST-based semantic chunking for code |
| code-chunk-fixed | Local | Fixed-size chunking for code |
| chonkie-code | Local | Python-based code chunking (requires Python 3.10+) |
| chonkie-recursive | Local | Recursive text chunking via Python |
| langchain-code | Local | LangChain-based code splitter |
| llamaindex-code | Local | LlamaIndex-based code splitter |
| full-context-session | Local | Full context per session (no chunking) |
| full-context-turn | Local | Full context per conversation turn |
These memory providers are planned but not yet fully implemented:
| Provider | Type | Status |
|---|---|---|
| AQRAG | Local | Needs PostgreSQL + adapter implementation |
| ContextualRetrieval | Local | Needs adapter implementation |
| Mem0 | Hosted | API integration planned |
| Supermemory | Hosted | API integration planned |
| Zep | Hosted | API integration planned |
| Benchmark | Description | Metrics |
|---|---|---|
| rag-template | General-purpose RAG evaluation (10 questions) | accuracy, f1, success_at_5 |
| longmemeval | Multi-session long-term memory evaluation | accuracy_by_question_type, recall_at_5 |
| locomo | Long-form conversation memory | accuracy_by_category, bleu_1, rouge_l |
- Accuracy: Binary correctness (did the LLM judge mark it as correct?)
- Success@K: End-to-end success = correct answer AND relevant context in top-K
- F1: Token-level overlap between generated and expected answers (0-1)
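As an illustration of the last two metrics, a minimal token-level F1 and Success@K could be computed like this (a sketch under simple whitespace tokenization; Superbench's actual normalization and judging may differ):

```typescript
// Token-level F1: harmonic mean of precision and recall over shared tokens.
function tokenF1(generated: string, expected: string): number {
  const tokenize = (s: string) => s.toLowerCase().split(/\s+/).filter(Boolean);
  const gen = tokenize(generated);
  const exp = tokenize(expected);
  if (gen.length === 0 || exp.length === 0) return 0;

  // Multiset intersection: count overlapping tokens with multiplicity.
  const counts = new Map<string, number>();
  for (const t of exp) counts.set(t, (counts.get(t) ?? 0) + 1);
  let overlap = 0;
  for (const t of gen) {
    const c = counts.get(t) ?? 0;
    if (c > 0) {
      overlap++;
      counts.set(t, c - 1);
    }
  }
  if (overlap === 0) return 0;

  const precision = overlap / gen.length;
  const recall = overlap / exp.length;
  return (2 * precision * recall) / (precision + recall);
}

// Success@K: the answer was judged correct AND at least one relevant
// context item appeared in the top-K retrieved results.
function successAtK(
  judgedCorrect: boolean,
  retrievedIds: string[],
  relevantIds: Set<string>,
  k: number,
): boolean {
  return judgedCorrect && retrievedIds.slice(0, k).some((id) => relevantIds.has(id));
}
```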
Memory benchmarks answer: "Did access to context change behavior correctly?" - not "Was the gold passage retrieved?"
- ✅ Accuracy = End-to-end correctness (what users care about)
- ✅ Success@K = Verifies the retrieval-to-answer pipeline worked
- ✅ F1 = Captures partial recall and degradation
- ❌ Recall@K/MRR = Not primary metrics (they assume gold passages and lexical matching)
See METRICS_AND_ORCHESTRATION.md for detailed metric definitions.
```bash
superbench list
```

```bash
# Single provider, single benchmark
superbench eval --benchmarks rag-template --providers aqrag --limit 10

# Multiple providers (comparison)
superbench eval --benchmarks rag-template --providers aqrag contextual-retrieval openrouter-rag

# Custom metrics
superbench eval --benchmarks rag-template --providers aqrag --metrics accuracy f1 success_at_5 bleu_1

# With filtering
superbench eval --benchmarks rag-template --providers aqrag --limit 5 --start 0
```

```bash
# List recent runs
superbench results

# View specific run with metrics
superbench results <runId> --metrics accuracy f1

# Export to JSON
superbench export <runId> --format json -o results.json

# Export to CSV
superbench export <runId> --format csv -o results.csv
```

```
superbench/
├── core/            # Core evaluation engine
│   ├── metrics/     # Pluggable metric registry
│   ├── runner.ts    # Benchmark runner with checkpointing
│   └── results.ts   # SQLite results storage
├── providers/       # Provider implementations
│   ├── adapters/    # Provider adapters
│   ├── configs/     # Provider YAML configs
│   └── */           # Provider-specific code
├── benchmarks/      # Benchmark definitions
│   ├── configs/     # Benchmark YAML configs
│   ├── loaders/     # Data loaders
│   └── evaluators/  # Evaluation methods (LLM judge, exact-match)
└── cli/             # Command-line interface
```
See ARCHITECTURE.md for detailed architecture documentation.
- Create provider directory: `providers/YourProvider/`
- Implement the provider interface (see `providers/base/types.ts`)
- Create adapter: `providers/adapters/your-provider.ts`
- Add config: `providers/configs/your-provider.yaml`
- Register in factory: `providers/factory.ts`
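As a rough sketch of what a provider implementation might look like, here is a trivial in-memory provider. The interface below is illustrative only; the real contract lives in `providers/base/types.ts` and will differ:

```typescript
// Illustrative provider shape; see providers/base/types.ts for the real interface.
interface MemoryProvider {
  name: string;
  ingest(docs: { id: string; text: string }[]): Promise<void>;
  search(query: string, topK: number): Promise<{ id: string; text: string; score: number }[]>;
}

// Toy provider: scores each document by how many query words it contains.
class InMemoryProvider implements MemoryProvider {
  name = "in-memory";
  private docs: { id: string; text: string }[] = [];

  async ingest(docs: { id: string; text: string }[]): Promise<void> {
    this.docs.push(...docs);
  }

  async search(query: string, topK: number) {
    const qWords = new Set(query.toLowerCase().split(/\s+/));
    return this.docs
      .map((d) => ({
        ...d,
        score: d.text.toLowerCase().split(/\s+/).filter((w) => qWords.has(w)).length,
      }))
      .sort((a, b) => b.score - a.score)
      .slice(0, topK);
  }
}
```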
See PROVIDER_CUSTOMIZATION_GUIDE.md for details.
- Create benchmark data file (JSON/CSV)
- Create config: `benchmarks/configs/your-benchmark.yaml`
- Define schema mapping (itemId, question, answer, context)
- Specify evaluation method (llm-judge, exact-match, etc.)
- Set default metrics
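Putting those steps together, a benchmark config might look roughly like this (the field names are illustrative guesses; check an existing file under `benchmarks/configs/` for the exact schema):

```yaml
# benchmarks/configs/your-benchmark.yaml (illustrative field names)
name: your-benchmark
data: benchmarks/data/your-benchmark.json
schema:
  itemId: id
  question: question
  answer: expected_answer
  context: documents
evaluation:
  method: llm-judge
metrics:
  - accuracy
  - f1
  - success_at_5
```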
```bash
# Run tests (if available)
bun test

# Type check
bun run typecheck

# Lint
bun run lint
```

[Your License Here]
Contributions welcome! Please see CONTRIBUTING.md for guidelines.
- LongMemEval: Multi-session long-term memory evaluation
- LoCoMo: Long-form Conversation Memory benchmark
- MTEB: Massive Text Embedding Benchmark (inspiration)
- SWE-bench: Software Engineering Benchmark (inspiration)