Existing benchmarks (MTEB, CoIR, openbench) focus on embeddings and LLMs, but nobody was measuring what actually matters for production memory systems: how well do memory providers and code chunkers perform on real retrieval tasks?
What it does
Superbench benchmarks two things:
- Memory providers (Supermemory, Mem0, Zep) on conversational memory tasks (LongMemEval, LoCoMo)
- Code chunkers (code-chunk, LlamaIndex, LangChain, Chonkie) on code retrieval tasks (RepoEval, RepoBench-R)
Run `superbench eval --benchmarks repoeval --providers code-chunk-ast` and get nDCG, MRR, and Recall metrics.
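The three reported metrics have standard per-query definitions over a ranked list of retrieved chunk IDs and a set of relevant ("gold") IDs. A minimal sketch in TypeScript (illustrative helpers, not Superbench's actual implementation):

```typescript
// Recall@k: fraction of relevant IDs found in the top k results.
function recallAtK(ranked: string[], relevant: Set<string>, k: number): number {
  const hits = ranked.slice(0, k).filter((id) => relevant.has(id)).length;
  return relevant.size === 0 ? 0 : hits / relevant.size;
}

// MRR: reciprocal rank of the first relevant result (0 if none found).
function mrr(ranked: string[], relevant: Set<string>): number {
  const i = ranked.findIndex((id) => relevant.has(id));
  return i === -1 ? 0 : 1 / (i + 1);
}

// nDCG@k with binary relevance: DCG of the ranking divided by the ideal DCG.
function ndcgAtK(ranked: string[], relevant: Set<string>, k: number): number {
  const dcg = ranked
    .slice(0, k)
    .reduce((sum, id, i) => sum + (relevant.has(id) ? 1 / Math.log2(i + 2) : 0), 0);
  const idealHits = Math.min(relevant.size, k);
  let idcg = 0;
  for (let i = 0; i < idealHits; i++) idcg += 1 / Math.log2(i + 2);
  return idcg === 0 ? 0 : dcg / idcg;
}
```

For example, if the gold set is {"b"} and the provider returns ["a", "b", "c"], MRR is 0.5 and Recall@2 is 1.0.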
How I built it
- Bun + TypeScript for fast CLI execution
- YAML-driven configs for extensibility (add providers/benchmarks without code changes)
- Registry pattern for pluggable metrics, chunkers, and loaders
- SQLite for result storage and comparison across runs
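The registry idea above can be sketched as a small generic class: components register under a string key, and YAML configs refer to them by name, so new providers or metrics plug in without touching the runner. Names here are illustrative, not Superbench's actual API:

```typescript
// Generic registry for pluggable components (metrics, chunkers, loaders).
class Registry<T> {
  private entries = new Map<string, T>();

  register(name: string, entry: T): void {
    if (this.entries.has(name)) throw new Error(`duplicate entry: ${name}`);
    this.entries.set(name, entry);
  }

  get(name: string): T {
    const entry = this.entries.get(name);
    if (!entry) throw new Error(`unknown entry: ${name}`);
    return entry;
  }
}

// Example: a metric registry keyed by the names a YAML config would use.
type Metric = (ranked: string[], relevant: Set<string>) => number;
const metrics = new Registry<Metric>();
metrics.register("mrr", (ranked, relevant) => {
  const i = ranked.findIndex((id) => relevant.has(id));
  return i === -1 ? 0 : 1 / (i + 1);
});

// A config-driven run looks up components by string key at runtime:
const score = metrics.get("mrr")(["a", "b"], new Set(["b"])); // 0.5
```

The payoff is that adding a benchmark or provider is a registration call plus a YAML entry, rather than a change to the evaluation loop.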
Challenges
- External memory APIs re-chunk content differently, breaking metadata-based relevance matching
- Balancing paper-faithful evaluation (LongMemEval, LoCoMo specs) with practical provider differences
- Rate limits on HuggingFace and provider APIs during dataset downloads
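When a provider re-chunks content, the original chunk IDs and metadata no longer line up with the gold labels. One workaround (a sketch of the general approach, not necessarily what Superbench does) is to match retrieved text against gold passages by normalized containment instead of by ID:

```typescript
// Collapse whitespace and case so chunk-boundary and formatting
// differences don't break the comparison.
function normalize(s: string): string {
  return s.toLowerCase().replace(/\s+/g, " ").trim();
}

// A retrieved chunk counts as relevant if it contains a gold passage,
// or is itself contained in one (the provider may have split or merged
// the original chunks).
function matchesGold(retrieved: string, gold: string): boolean {
  const r = normalize(retrieved);
  const g = normalize(gold);
  return r.includes(g) || g.includes(r);
}
```

This is lossier than exact ID matching but survives arbitrary re-chunking on the provider side.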
What I learned
Memory providers vary wildly on temporal reasoning tasks. The "best" chunker depends heavily on the task. AST-aware chunkers excel at code retrieval but character-based chunkers are surprisingly competitive.
Built With
- ai (vercel-ai-sdk)
- chonkie
- benchmarks: repoeval
- crosscodeeval
- github-api
- libraries: zod
- huggingface-datasets-api
- hyparquet
- langchain
- languages: typescript
- llamaindex
- longmemeval
- mem0-api
- openrouter-api
- postgresql
- apis/services: supermemory-api
- python
- runtime: bun
- databases: sqlite
- repobench-r
- swe-bench-lite
- tree-sitter
- code-chunkers: code-chunk
- yaml
- zep-api