Memory is the missing layer in AI. While LLMs can reason and generate, they lack persistent memory across conversations. Multiple memory providers have emerged (Supermemory, Mem0, Zep), but there's no standardized way to evaluate them. Existing provider evaluations use custom flows tied to specific benchmarks, making apples-to-apples comparisons impossible. We built MemoryBench to change that.

## What it does

MemoryBench is a unified benchmarking framework for memory layer providers. It enables fair, reproducible evaluation across providers and benchmarks with a single CLI.

Core Features:

  • Multi-provider support: Supermemory, Mem0, Zep - plug in any memory provider
  • Multiple benchmarks: LoCoMo (temporal reasoning), LongMemEval (long-term memory), ConvoMem (conversational memory)
  • Model-agnostic judging: Use GPT-4o, Claude Opus 4.5, Gemini 2.5 as judges - auto-detected from model name
  • Full pipeline: Ingest → Search → Answer → Evaluate → Report in one command
  • Parallel comparison: Run multiple providers against the same question set simultaneously
  • Sampling modes: Per-category sampling, limit-based, or full benchmark runs
  • Checkpoint/resume: Runs can be interrupted and resumed from any phase
  • Real-time Web UI: Monitor runs, view leaderboards, compare results with live WebSocket updates
  • Retrieval metrics: Hit@K, Precision, Recall, F1, MRR, NDCG
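
Two of the listed retrieval metrics are easy to show directly. The sketch below is illustrative, not the project's actual implementation: Hit@K checks whether any gold passage appears in the top K results, and MRR is the reciprocal rank of the first gold hit.

```typescript
// hitAtK: 1 if any gold passage id appears in the top-K retrieved ids, else 0.
function hitAtK(retrieved: string[], gold: Set<string>, k: number): number {
  return retrieved.slice(0, k).some((id) => gold.has(id)) ? 1 : 0;
}

// mrr: reciprocal rank of the first gold passage (0 if none was retrieved).
function mrr(retrieved: string[], gold: Set<string>): number {
  const idx = retrieved.findIndex((id) => gold.has(id));
  return idx === -1 ? 0 : 1 / (idx + 1);
}
```

For example, if the gold passage is ranked second, Hit@2 is 1 and MRR is 0.5.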

UI Pages:

  • Runs: Start new runs, view progress, inspect question-level results
  • Compare: Side-by-side provider comparison with accuracy breakdowns by question type
  • Leaderboard: Aggregated rankings across all completed runs

## How we built it

Architecture:

  • CLI-first design: All functionality accessible via bun run src/index.ts <command>
  • Provider adapter pattern: Uniform interface for ingest/search/clear across providers
  • Benchmark adapter pattern: Unified question/session format regardless of source
  • Orchestrator phases: Modular pipeline with checkpoint after each phase
  • Provider-specific prompts: Custom answer/judge prompts optimized for each provider's output format
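
The provider adapter pattern can be sketched as a small uniform interface. The method names and shapes below are assumptions for illustration, not MemoryBench's actual API; the toy in-memory provider shows how a new backend would plug in behind the same interface.

```typescript
// Hypothetical uniform interface every provider adapter implements.
interface MemoryProvider {
  name: string;
  ingest(sessionId: string, messages: { role: string; content: string }[]): Promise<void>;
  search(query: string, topK: number): Promise<{ id: string; text: string; score: number }[]>;
  clear(): Promise<void>;
}

// Toy backend: stores messages and ranks them by naive term overlap.
class InMemoryProvider implements MemoryProvider {
  name = "in-memory";
  private store: { id: string; text: string }[] = [];

  async ingest(sessionId: string, messages: { role: string; content: string }[]) {
    for (const [i, m] of messages.entries())
      this.store.push({ id: `${sessionId}:${i}`, text: m.content });
  }

  async search(query: string, topK: number) {
    const terms = query.toLowerCase().split(/\s+/);
    return this.store
      .map((e) => ({ ...e, score: terms.filter((t) => e.text.toLowerCase().includes(t)).length }))
      .filter((e) => e.score > 0)
      .sort((a, b) => b.score - a.score)
      .slice(0, topK);
  }

  async clear() { this.store = []; }
}
```

Because the orchestrator only sees `MemoryProvider`, everything downstream (answering, judging, reporting) stays provider-agnostic.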

Tech Stack:

  • Bun runtime (TypeScript)
  • Next.js 15 + TailwindCSS for UI
  • SQLite for persistence
  • WebSocket for real-time progress
  • AI SDK for multi-model support

Key Commands:

  • bun run src/index.ts run -p supermemory -b locomo -j gpt-4o -r myrun
  • bun run src/index.ts compare -p supermemory,mem0,zep -b locomo -j claude-sonnet-4
  • bun run src/index.ts serve

## Challenges we ran into

Zep Integration:

  • 10,000-character limit per episode - implemented chunking with sentence boundary detection
  • Maximum 20 episodes per batch request - built batched ingestion
  • Graph ontology required - without setting entity types (Person, Event, Location, etc.), temporal data wouldn't be returned
  • Required custom prompt to interpret valid_at timestamps correctly (event time vs. mention time)
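
The chunking workaround for the per-episode limit can be sketched as follows (a simplified version; it does not handle a single sentence longer than the cap, and the limit is lowered in the test for brevity):

```typescript
// Split text into chunks of at most maxLen characters, breaking only at
// sentence boundaries so no episode is cut mid-sentence.
function chunkBySentence(text: string, maxLen: number): string[] {
  const sentences = text.match(/[^.!?]+[.!?]*\s*/g) ?? [text];
  const chunks: string[] = [];
  let current = "";
  for (const s of sentences) {
    // Start a new chunk when adding this sentence would exceed the cap.
    if (current && current.length + s.length > maxLen) {
      chunks.push(current.trim());
      current = "";
    }
    current += s;
  }
  if (current.trim()) chunks.push(current.trim());
  return chunks;
}
```

Each chunk then becomes one episode, ingested in batches of at most 20 per request.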

Mem0 Integration:

  • Custom instructions at project-level needed for temporal information extraction
  • Async indexing with event polling - had to implement status checking loop
  • Specific metadata format required for date information to be captured
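
The status-checking loop for async indexing looks roughly like this. `getStatus` and the status strings are hypothetical stand-ins for the provider's real API; the point is the poll-until-done-or-timeout shape.

```typescript
// Poll an async indexing job until it completes, fails, or times out.
async function waitForIndexing(
  getStatus: () => Promise<"pending" | "done" | "failed">,
  { intervalMs = 1000, timeoutMs = 60_000 } = {},
): Promise<void> {
  const deadline = Date.now() + timeoutMs;
  while (Date.now() < deadline) {
    const status = await getStatus();
    if (status === "done") return;
    if (status === "failed") throw new Error("indexing failed");
    await new Promise((r) => setTimeout(r, intervalMs));
  }
  throw new Error("indexing timed out");
}
```

Without this gate, searches issued immediately after ingestion silently return incomplete results.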

Benchmark Generalization:

  • Zep's official LoCoMo evaluation used custom flow incompatible with other providers
  • Mem0's evaluation had similar provider-specific assumptions
  • Built unified session format to normalize across benchmarks
  • Created provider-specific prompts while keeping pipeline generic
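
The unified session format can be sketched as a pair of types plus a per-benchmark adapter that maps raw files into them. Field names here are assumptions, not the project's exact schema:

```typescript
// Normalized conversation unit shared by all benchmarks.
type Session = {
  id: string;
  timestamp?: string; // when the conversation happened (needed for temporal questions)
  messages: { role: "user" | "assistant"; content: string }[];
};

// Normalized question shared by all benchmarks.
type Question = {
  id: string;
  text: string;
  answer: string;
  category?: string;     // e.g. "temporal", "abstention", "knowledge-update"
  evidenceIds?: string[]; // gold passages, when available, for retrieval metrics
};

// A benchmark adapter maps that benchmark's raw record shape into Question.
function normalize(raw: { qid: string; question: string; golden_answer: string }): Question {
  return { id: raw.qid, text: raw.question, answer: raw.golden_answer };
}
```

One adapter per benchmark pays this normalization cost once; every provider then consumes the same shapes.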

Temporal Reasoning:

  • Different providers interpret timestamps differently
  • Built specialized judge prompts for temporal questions with off-by-one tolerance
  • Added question-type-specific evaluation (abstention, knowledge update, preference)
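
The real judge for temporal questions is an LLM prompt, but the off-by-one tolerance rule itself is mechanical and can be expressed directly. A sketch, interpreting "off-by-one" as a one-day window on ISO dates:

```typescript
// True if two ISO date strings are at most one day apart (or unparseable → false).
function withinOneDay(predicted: string, expected: string): boolean {
  const dayMs = 24 * 60 * 60 * 1000;
  const diff = Math.abs(Date.parse(predicted) - Date.parse(expected));
  return Number.isFinite(diff) && diff <= dayMs;
}
```

This keeps a provider from being scored wrong for answering "May 2" when the gold date is "May 1" and the conversation itself is ambiguous about the boundary.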

## Accomplishments that we're proud of

  • True apples-to-apples comparison: Same questions, same judge, same answering model across providers
  • One-command comparison: compare runs all providers in parallel with shared question set
  • Resumable runs: Checkpoint system survives crashes and allows phase-by-phase debugging
  • Question-type breakdown: See exactly where each provider excels or fails
  • Retrieval metrics: Not just accuracy - full IR evaluation when ground truth passages available
  • Clean separation: Provider specifics isolated to adapter + prompts, everything else generic
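
The checkpoint system's resume logic reduces to "find the first phase not yet completed." A minimal sketch, with illustrative phase names matching the pipeline above:

```typescript
// Pipeline phases in execution order.
const PHASES = ["ingest", "search", "answer", "evaluate", "report"] as const;
type Phase = (typeof PHASES)[number];

// Given the phases a checkpoint records as completed, return the phase to
// resume from, or null if the run already finished.
function nextPhase(completed: readonly Phase[]): Phase | null {
  for (const p of PHASES) if (!completed.includes(p)) return p;
  return null;
}
```

Persisting the completed-phase list (plus per-phase outputs) after each phase is what lets a crashed run restart at, say, `answer` without re-ingesting.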

## What we learned

  • Memory providers have vastly different APIs and assumptions - standardization is essential
  • Temporal reasoning is the hardest challenge for memory systems
  • Provider-specific prompts matter more than we expected - raw JSON parsing varies wildly
  • Batching and rate limits are provider-specific gotchas that need careful handling
  • A good checkpoint system saves hours of rerunning failed evaluations

## What's next for MemoryBench

  • More providers: LangMem, custom RAG baselines, vector-only solutions
  • More benchmarks: Multi-turn reasoning, entity tracking, preference learning
  • Automated leaderboard: Public results with verified reproducibility
  • Provider optimization hints: Suggest prompt/config improvements based on failure patterns
  • Hosted version: One-click evaluation without local setup
