Existing benchmarks (MTEB, CoIR, openbench) focus on embeddings and LLMs, but nobody was measuring what actually matters for production memory systems: how well do memory providers and code chunkers perform on real retrieval tasks?
What it does
Superbench benchmarks two things:
- Memory providers (Supermemory, Mem0, Zep) on conversational memory tasks (LongMemEval, LoCoMo)
- Code chunkers (code-chunk, LlamaIndex, LangChain, Chonkie) on code retrieval tasks (RepoEval, RepoBench-R)
Run `superbench eval --benchmarks repoeval --providers code-chunk-ast` and get nDCG, MRR, and Recall metrics.
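The three reported metrics have standard per-query definitions over a ranked list of retrieved chunk IDs and a set of relevant ("gold") IDs. A minimal sketch in TypeScript (illustrative helpers, not Superbench's actual implementation):

```typescript
// Recall@k: fraction of relevant IDs found in the top k results.
function recallAtK(ranked: string[], relevant: Set<string>, k: number): number {
  const hits = ranked.slice(0, k).filter((id) => relevant.has(id)).length;
  return relevant.size === 0 ? 0 : hits / relevant.size;
}

// MRR: reciprocal rank of the first relevant result (0 if none found).
function mrr(ranked: string[], relevant: Set<string>): number {
  const i = ranked.findIndex((id) => relevant.has(id));
  return i === -1 ? 0 : 1 / (i + 1);
}

// nDCG@k with binary relevance: DCG of the ranking divided by the ideal DCG.
function ndcgAtK(ranked: string[], relevant: Set<string>, k: number): number {
  const dcg = ranked
    .slice(0, k)
    .reduce((sum, id, i) => sum + (relevant.has(id) ? 1 / Math.log2(i + 2) : 0), 0);
  const idealHits = Math.min(relevant.size, k);
  let idcg = 0;
  for (let i = 0; i < idealHits; i++) idcg += 1 / Math.log2(i + 2);
  return idcg === 0 ? 0 : dcg / idcg;
}
```

For example, if the gold set is {"b"} and the provider returns ["a", "b", "c"], MRR is 0.5 and Recall@2 is 1.0.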
How I built it
- Bun + TypeScript for fast CLI execution
- YAML-driven configs for extensibility (add providers/benchmarks without code changes)
- Registry pattern for pluggable metrics, chunkers, and loaders
- SQLite for result storage and comparison across runs
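The registry idea above can be sketched as a small generic class: components register under a string key, and YAML configs refer to them by name, so new providers or metrics plug in without touching the runner. Names here are illustrative, not Superbench's actual API:

```typescript
// Generic registry for pluggable components (metrics, chunkers, loaders).
class Registry<T> {
  private entries = new Map<string, T>();

  register(name: string, entry: T): void {
    if (this.entries.has(name)) throw new Error(`duplicate entry: ${name}`);
    this.entries.set(name, entry);
  }

  get(name: string): T {
    const entry = this.entries.get(name);
    if (!entry) throw new Error(`unknown entry: ${name}`);
    return entry;
  }
}

// Example: a metric registry keyed by the names a YAML config would use.
type Metric = (ranked: string[], relevant: Set<string>) => number;
const metrics = new Registry<Metric>();
metrics.register("mrr", (ranked, relevant) => {
  const i = ranked.findIndex((id) => relevant.has(id));
  return i === -1 ? 0 : 1 / (i + 1);
});

// A config-driven run looks up components by string key at runtime:
const score = metrics.get("mrr")(["a", "b"], new Set(["b"])); // 0.5
```

The payoff is that adding a benchmark or provider is a registration call plus a YAML entry, rather than a change to the evaluation loop.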
Challenges
- External memory APIs re-chunk content differently, breaking metadata-based relevance matching
- Balancing paper-faithful evaluation (LongMemEval, LoCoMo specs) with practical provider differences
- Rate limits on HuggingFace and provider APIs during dataset downloads
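When a provider re-chunks content, the original chunk IDs and metadata no longer line up with the gold labels. One workaround (a sketch of the general approach, not necessarily what Superbench does) is to match retrieved text against gold passages by normalized containment instead of by ID:

```typescript
// Collapse whitespace and case so chunk-boundary and formatting
// differences don't break the comparison.
function normalize(s: string): string {
  return s.toLowerCase().replace(/\s+/g, " ").trim();
}

// A retrieved chunk counts as relevant if it contains a gold passage,
// or is itself contained in one (the provider may have split or merged
// the original chunks).
function matchesGold(retrieved: string, gold: string): boolean {
  const r = normalize(retrieved);
  const g = normalize(gold);
  return r.includes(g) || g.includes(r);
}
```

This is lossier than exact ID matching but survives arbitrary re-chunking on the provider side.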
What I learned
Memory providers vary wildly on temporal reasoning tasks. The "best" chunker depends heavily on the task. AST-aware chunkers excel at code retrieval but character-based chunkers are surprisingly competitive.
Built With
- ai (vercel-ai-sdk)
- chonkie
- benchmarks: repoeval
- crosscodeeval
- github-api
- libraries: zod
- huggingface-datasets-api
- hyparquet
- langchain
- languages: typescript
- llamaindex
- longmemeval
- mem0-api
- openrouter-api
- postgresql
- apis/services: supermemory-api
- python
- runtime: bun
- databases: sqlite
- repobench-r
- swe-bench-lite
- tree-sitter
- code-chunkers: code-chunk
- yaml
- zep-api