Existing benchmarks (MTEB, CoIR, openbench) focus on embedding models and LLMs, but nobody was measuring what actually matters for production memory systems and chunkers: how well do memory providers and code chunkers perform on real retrieval tasks?

What it does

Superbench benchmarks two things:

  • Memory providers (Supermemory, Mem0, Zep) on conversational memory tasks (LongMemEval, LoCoMo)
  • Code chunkers (code-chunk, LlamaIndex, LangChain, Chonkie) on code retrieval tasks (RepoEval, RepoBench-R)

Run superbench eval --benchmarks repoeval --providers code-chunk-ast and get nDCG, MRR, and Recall metrics.
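The reported metrics follow the standard information-retrieval definitions. A minimal TypeScript sketch of how they can be computed from a ranked list of retrieved chunk IDs and a gold relevant set (function names are illustrative, not Superbench's actual internals):

```typescript
// Fraction of relevant items found in the top k results.
function recallAtK(ranked: string[], relevant: Set<string>, k: number): number {
  const hits = ranked.slice(0, k).filter((id) => relevant.has(id)).length;
  return relevant.size === 0 ? 0 : hits / relevant.size;
}

// Reciprocal rank of the first relevant item (0 if none retrieved).
function mrr(ranked: string[], relevant: Set<string>): number {
  const idx = ranked.findIndex((id) => relevant.has(id));
  return idx === -1 ? 0 : 1 / (idx + 1);
}

// Normalized discounted cumulative gain with binary relevance.
function ndcgAtK(ranked: string[], relevant: Set<string>, k: number): number {
  const dcg = ranked
    .slice(0, k)
    .reduce((sum, id, i) => sum + (relevant.has(id) ? 1 / Math.log2(i + 2) : 0), 0);
  let idcg = 0;
  for (let i = 0; i < Math.min(relevant.size, k); i++) idcg += 1 / Math.log2(i + 2);
  return idcg === 0 ? 0 : dcg / idcg;
}
```

With a single relevant chunk ranked second, Recall@3 is 1, MRR is 0.5, and nDCG@3 is 1/log2(3) ≈ 0.63.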

How I built it

  • Bun + TypeScript for fast CLI execution
  • YAML-driven configs for extensibility (add providers/benchmarks without code changes)
  • Registry pattern for pluggable metrics, chunkers, and loaders
  • SQLite for result storage and comparison across runs
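The registry pattern above can be sketched in a few lines: components register themselves under a string key, and YAML configs select them by that key, so new chunkers or metrics need no core code changes. A minimal illustration (names are assumptions, not the actual Superbench API):

```typescript
// A chunker maps source text to a list of chunks.
type Chunker = (source: string) => string[];

// Generic name -> component registry shared by metrics, chunkers, and loaders.
class Registry<T> {
  private entries = new Map<string, T>();

  register(name: string, entry: T): void {
    if (this.entries.has(name)) throw new Error(`duplicate entry: ${name}`);
    this.entries.set(name, entry);
  }

  get(name: string): T {
    const entry = this.entries.get(name);
    if (!entry) throw new Error(`unknown entry: ${name}`);
    return entry;
  }

  names(): string[] {
    return [...this.entries.keys()];
  }
}

// Components register once at startup; a YAML config then refers to
// them by key (e.g. `chunker: fixed-chars`) with no code changes.
const chunkers = new Registry<Chunker>();
chunkers.register("fixed-chars", (src) => src.match(/[\s\S]{1,400}/g) ?? []);
```

Looking a component up by its config key is then just `chunkers.get("fixed-chars")`.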

Challenges

  • External memory APIs re-chunk content differently, breaking metadata-based relevance matching
  • Balancing paper-faithful evaluation (LongMemEval, LoCoMo specs) with practical provider differences
  • Rate limits on HuggingFace and provider APIs during dataset downloads
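When a provider re-chunks content, gold chunk IDs no longer line up with what comes back, so ID-based relevance matching breaks. One workaround is to fall back to fuzzy text overlap between each returned chunk and the gold answer span. A sketch of that idea (the n-gram scheme and 0.8 threshold are illustrative assumptions, not Superbench's actual matching logic):

```typescript
// Collapse whitespace and case so different chunkers' formatting doesn't matter.
function normalize(text: string): string {
  return text.toLowerCase().replace(/\s+/g, " ").trim();
}

// Treat a retrieved chunk as relevant if it covers most of the gold span's
// word 3-grams, so shifted chunk boundaries still count as a match.
function isRelevant(chunk: string, goldSpan: string, threshold = 0.8): boolean {
  const grams = (s: string): Set<string> => {
    const words = normalize(s).split(" ");
    const out = new Set<string>();
    for (let i = 0; i + 3 <= words.length; i++) out.add(words.slice(i, i + 3).join(" "));
    return out;
  };
  const gold = grams(goldSpan);
  // Spans shorter than 3 words: fall back to plain substring containment.
  if (gold.size === 0) return normalize(chunk).includes(normalize(goldSpan));
  const chunkGrams = grams(chunk);
  let hits = 0;
  for (const g of gold) if (chunkGrams.has(g)) hits++;
  return hits / gold.size >= threshold;
}
```

This trades exactness for robustness: it tolerates boundary shifts but can over-match when the gold span is boilerplate that recurs across chunks.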

What I learned

Memory providers vary wildly on temporal reasoning tasks, and the "best" chunker depends heavily on the task: AST-aware chunkers excel at code retrieval, but character-based chunkers are surprisingly competitive.

Built With

  • Bun
  • Chonkie
  • code-chunk
  • CrossCodeEval
  • GitHub API
  • Hugging Face Datasets API
  • hyparquet
  • LangChain
  • LlamaIndex
  • LongMemEval
  • Mem0 API
  • OpenRouter API
  • PostgreSQL
  • Python
  • RepoBench-R
  • RepoEval
  • SQLite
  • Supermemory API
  • SWE-bench Lite
  • Tree-sitter
  • TypeScript
  • Vercel AI SDK (ai)
  • YAML
  • Zep API
  • Zod