A unified benchmarking platform for evaluating memory providers, RAG systems, and context management solutions. Inspired by MTEB and SWE-bench, Superbench enables fair, reproducible comparisons across different memory architectures.
Superbench is designed to answer the question: "How well does this system use context to provide correct answers?"
Unlike traditional retrieval benchmarks that measure recall/precision, Superbench focuses on end-to-end correctness and memory-enabled success - metrics that matter for production memory systems.
- **Memory-focused metrics**: Accuracy, Success@K, F1 (not just recall/precision)
- **Pluggable providers**: Easy to add new memory/RAG providers
- **Multiple benchmarks**: RAG template, LongMemEval, LoCoMo support
- **LLM-as-a-Judge**: Automated evaluation using language models
- **Performance tracking**: Latency, token usage, cost metrics
- **Resumable runs**: Checkpointing for long evaluations
- **Clean CLI**: Table-formatted results, export to JSON/CSV
```bash
# Install dependencies
bun install

# Link the CLI globally
bun link
```

- Copy the environment template:

  ```bash
  cp .env.example .env
  ```

- Add your API keys to `.env`:
  - `OPENROUTER_API_KEY` - Required for LLM evaluation (OpenRouter)
  - `ANTHROPIC_API_KEY` - Optional (for direct Anthropic access)
  - `VOYAGE_API_KEY` - Required for AQRAG embeddings
  - `DATABASE_URL` - PostgreSQL connection string (for PostgreSQL-based providers)
  - `GOOGLE_GENERATIVE_AI_API_KEY` - For Google embeddings (ContextualRetrieval)
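For reference, a filled-in `.env` might look like this (every value below is a placeholder, not a real credential or connection string):

```dotenv
OPENROUTER_API_KEY=your-openrouter-key
ANTHROPIC_API_KEY=your-anthropic-key
VOYAGE_API_KEY=your-voyage-key
DATABASE_URL=postgres://user:password@localhost:5432/superbench
GOOGLE_GENERATIVE_AI_API_KEY=your-google-key
```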
Run the RAG template benchmark against AQRAG:

```bash
superbench eval --benchmarks rag-template --providers aqrag --metrics accuracy f1 success_at_5 --limit 10
```

This will:
- Load 10 questions from the RAG template benchmark
- Ingest document contexts into the AQRAG provider
- Search for relevant information for each question
- Generate answers using the LLM
- Evaluate correctness using an LLM judge
- Compute metrics (accuracy, F1, Success@K)
- Display results in a clean table
- Save results to SQLite database
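The pipeline above can be sketched as a simple loop. This is an illustrative outline only (the function names and signatures are hypothetical; the real runner in `core/runner.ts` also handles ingestion, checkpointing, and latency/cost tracking):

```typescript
// Illustrative evaluation loop; names and signatures are hypothetical.
type Item = { id: string; question: string; expectedAnswer: string };

async function runBenchmark(
  items: Item[],
  search: (query: string, k: number) => Promise<string[]>,
  answer: (question: string, context: string[]) => Promise<string>,
  judge: (expected: string, generated: string) => Promise<boolean>,
) {
  const results: { itemId: string; correct: boolean }[] = [];
  for (const item of items) {
    const hits = await search(item.question, 5);          // retrieve top-K context
    const generated = await answer(item.question, hits);  // LLM generates an answer
    const correct = await judge(item.expectedAnswer, generated); // LLM-as-a-judge
    results.push({ itemId: item.id, correct });
  }
  const accuracy = results.filter((r) => r.correct).length / results.length;
  return { results, accuracy };
}
```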
```bash
# List recent runs
superbench results

# View specific run
superbench results <runId>

# Export to JSON
superbench export <runId> --format json -o results.json
```

| Provider | Type | Description |
|---|---|---|
| code-chunk-ast | Local | AST-based semantic chunking for code |
| code-chunk-fixed | Local | Fixed-size chunking for code |
| chonkie-code | Local | Python-based code chunking (requires Python 3.10+) |
| chonkie-recursive | Local | Recursive text chunking via Python |
| langchain-code | Local | LangChain-based code splitter |
| llamaindex-code | Local | LlamaIndex-based code splitter |
| full-context-session | Local | Full context per session (no chunking) |
| full-context-turn | Local | Full context per conversation turn |
These memory providers are planned but not yet fully implemented:
| Provider | Type | Status |
|---|---|---|
| AQRAG | Local | Needs PostgreSQL + adapter implementation |
| ContextualRetrieval | Local | Needs adapter implementation |
| Mem0 | Hosted | API integration planned |
| Supermemory | Hosted | API integration planned |
| Zep | Hosted | API integration planned |
| Benchmark | Description | Metrics |
|---|---|---|
| rag-template | General-purpose RAG evaluation (10 questions) | accuracy, f1, success_at_5 |
| longmemeval | Multi-session long-term memory evaluation | accuracy_by_question_type, recall_at_5 |
| locomo | Long-form conversation memory | accuracy_by_category, bleu_1, rouge_l |
- Accuracy: Binary correctness (did the LLM judge mark it as correct?)
- Success@K: End-to-end success = correct answer AND relevant context in top-K
- F1: Token-level overlap between generated and expected answers (0-1)
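As an illustration of the last two metrics, a minimal token-level F1 and Success@K could be computed like this (a sketch under simple whitespace tokenization; Superbench's actual normalization and judging may differ):

```typescript
// Token-level F1: harmonic mean of precision and recall over shared tokens.
function tokenF1(generated: string, expected: string): number {
  const tokenize = (s: string) => s.toLowerCase().split(/\s+/).filter(Boolean);
  const gen = tokenize(generated);
  const exp = tokenize(expected);
  if (gen.length === 0 || exp.length === 0) return 0;

  // Multiset intersection: count overlapping tokens with multiplicity.
  const counts = new Map<string, number>();
  for (const t of exp) counts.set(t, (counts.get(t) ?? 0) + 1);
  let overlap = 0;
  for (const t of gen) {
    const c = counts.get(t) ?? 0;
    if (c > 0) {
      overlap++;
      counts.set(t, c - 1);
    }
  }
  if (overlap === 0) return 0;

  const precision = overlap / gen.length;
  const recall = overlap / exp.length;
  return (2 * precision * recall) / (precision + recall);
}

// Success@K: the answer was judged correct AND at least one relevant
// context item appeared in the top-K retrieved results.
function successAtK(
  judgedCorrect: boolean,
  retrievedIds: string[],
  relevantIds: Set<string>,
  k: number,
): boolean {
  return judgedCorrect && retrievedIds.slice(0, k).some((id) => relevantIds.has(id));
}
```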
Memory benchmarks answer: "Did access to context change behavior correctly?" - not "Was the gold passage retrieved?"
- ✅ Accuracy = End-to-end correctness (what users care about)
- ✅ Success@K = Verifies the retrieval-to-answer pipeline worked
- ✅ F1 = Captures partial recall and degradation
- ❌ Recall@K/MRR = Not primary metrics (they assume gold passages and lexical matching)
See METRICS_AND_ORCHESTRATION.md for detailed metric definitions.
```bash
superbench list
```

```bash
# Single provider, single benchmark
superbench eval --benchmarks rag-template --providers aqrag --limit 10

# Multiple providers (comparison)
superbench eval --benchmarks rag-template --providers aqrag contextual-retrieval openrouter-rag

# Custom metrics
superbench eval --benchmarks rag-template --providers aqrag --metrics accuracy f1 success_at_5 bleu_1

# With filtering
superbench eval --benchmarks rag-template --providers aqrag --limit 5 --start 0
```

```bash
# List recent runs
superbench results

# View specific run with metrics
superbench results <runId> --metrics accuracy f1

# Export to JSON
superbench export <runId> --format json -o results.json

# Export to CSV
superbench export <runId> --format csv -o results.csv
```

```
superbench/
├── core/            # Core evaluation engine
│   ├── metrics/     # Pluggable metric registry
│   ├── runner.ts    # Benchmark runner with checkpointing
│   └── results.ts   # SQLite results storage
├── providers/       # Provider implementations
│   ├── adapters/    # Provider adapters
│   ├── configs/     # Provider YAML configs
│   └── */           # Provider-specific code
├── benchmarks/      # Benchmark definitions
│   ├── configs/     # Benchmark YAML configs
│   ├── loaders/     # Data loaders
│   └── evaluators/  # Evaluation methods (LLM judge, exact-match)
└── cli/             # Command-line interface
```
See ARCHITECTURE.md for detailed architecture documentation.
- Create provider directory: `providers/YourProvider/`
- Implement the provider interface (see `providers/base/types.ts`)
- Create adapter: `providers/adapters/your-provider.ts`
- Add config: `providers/configs/your-provider.yaml`
- Register in factory: `providers/factory.ts`
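As a rough sketch of what a provider implementation might look like, here is a trivial in-memory provider. The interface below is illustrative only; the real contract lives in `providers/base/types.ts` and will differ:

```typescript
// Illustrative provider shape; see providers/base/types.ts for the real interface.
interface MemoryProvider {
  name: string;
  ingest(docs: { id: string; text: string }[]): Promise<void>;
  search(query: string, topK: number): Promise<{ id: string; text: string; score: number }[]>;
}

// Toy provider: scores each document by how many query words it contains.
class InMemoryProvider implements MemoryProvider {
  name = "in-memory";
  private docs: { id: string; text: string }[] = [];

  async ingest(docs: { id: string; text: string }[]): Promise<void> {
    this.docs.push(...docs);
  }

  async search(query: string, topK: number) {
    const qWords = new Set(query.toLowerCase().split(/\s+/));
    return this.docs
      .map((d) => ({
        ...d,
        score: d.text.toLowerCase().split(/\s+/).filter((w) => qWords.has(w)).length,
      }))
      .sort((a, b) => b.score - a.score)
      .slice(0, topK);
  }
}
```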
See PROVIDER_CUSTOMIZATION_GUIDE.md for details.
- Create benchmark data file (JSON/CSV)
- Create config: `benchmarks/configs/your-benchmark.yaml`
- Define schema mapping (itemId, question, answer, context)
- Specify evaluation method (llm-judge, exact-match, etc.)
- Set default metrics
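Putting those steps together, a benchmark config might look roughly like this (the field names are illustrative guesses; check an existing file under `benchmarks/configs/` for the exact schema):

```yaml
# benchmarks/configs/your-benchmark.yaml (illustrative field names)
name: your-benchmark
data: benchmarks/data/your-benchmark.json
schema:
  itemId: id
  question: question
  answer: expected_answer
  context: documents
evaluation:
  method: llm-judge
metrics:
  - accuracy
  - f1
  - success_at_5
```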
```bash
# Run tests (if available)
bun test

# Type check
bun run typecheck

# Lint
bun run lint
```

[Your License Here]
Contributions welcome! Please see CONTRIBUTING.md for guidelines.
- LongMemEval: Multi-session long-term memory evaluation
- LoCoMo: Long-form Conversation Memory benchmark
- MTEB: Massive Text Embedding Benchmark (inspiration)
- SWE-bench: Software Engineering Benchmark (inspiration)