Memory is the missing layer in AI. While LLMs can reason and generate, they lack persistent memory across conversations. Multiple memory providers have emerged (Supermemory, Mem0, Zep), but there's no standardized way to evaluate them. Existing provider evaluations use custom flows tied to specific benchmarks, making apples-to-apples comparisons impossible. We built MemoryBench to change that.

## What it does

MemoryBench is a unified benchmarking framework for memory layer providers. It enables fair, reproducible evaluation across providers and benchmarks with a single CLI.

Core Features:

  • Multi-provider support: Supermemory, Mem0, Zep - plug in any memory provider
  • Multiple benchmarks: LoCoMo (temporal reasoning), LongMemEval (long-term memory), ConvoMem (conversational memory)
  • Model-agnostic judging: Use GPT-4o, Claude Opus 4.5, Gemini 2.5 as judges - auto-detected from model name
  • Full pipeline: Ingest → Search → Answer → Evaluate → Report in one command
  • Parallel comparison: Run multiple providers against the same question set simultaneously
  • Sampling modes: Per-category sampling, limit-based, or full benchmark runs
  • Checkpoint/resume: Runs can be interrupted and resumed from any phase
  • Real-time Web UI: Monitor runs, view leaderboards, compare results with live WebSocket updates
  • Retrieval metrics: Hit@K, Precision, Recall, F1, MRR, NDCG
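
Two of the listed retrieval metrics are easy to show directly. The sketch below is illustrative, not the project's actual implementation: Hit@K checks whether any gold passage appears in the top K results, and MRR is the reciprocal rank of the first gold hit.

```typescript
// hitAtK: 1 if any gold passage id appears in the top-K retrieved ids, else 0.
function hitAtK(retrieved: string[], gold: Set<string>, k: number): number {
  return retrieved.slice(0, k).some((id) => gold.has(id)) ? 1 : 0;
}

// mrr: reciprocal rank of the first gold passage (0 if none was retrieved).
function mrr(retrieved: string[], gold: Set<string>): number {
  const idx = retrieved.findIndex((id) => gold.has(id));
  return idx === -1 ? 0 : 1 / (idx + 1);
}
```

For example, if the gold passage is ranked second, Hit@2 is 1 and MRR is 0.5.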

UI Pages:

  • Runs: Start new runs, view progress, inspect question-level results
  • Compare: Side-by-side provider comparison with accuracy breakdowns by question type
  • Leaderboard: Aggregated rankings across all completed runs

## How we built it

Architecture:

  • CLI-first design: All functionality accessible via bun run src/index.ts <command>
  • Provider adapter pattern: Uniform interface for ingest/search/clear across providers
  • Benchmark adapter pattern: Unified question/session format regardless of source
  • Orchestrator phases: Modular pipeline with checkpoint after each phase
  • Provider-specific prompts: Custom answer/judge prompts optimized for each provider's output format
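
The provider adapter pattern can be sketched as a small uniform interface. The method names and shapes below are assumptions for illustration, not MemoryBench's actual API; the toy in-memory provider shows how a new backend would plug in behind the same interface.

```typescript
// Hypothetical uniform interface every provider adapter implements.
interface MemoryProvider {
  name: string;
  ingest(sessionId: string, messages: { role: string; content: string }[]): Promise<void>;
  search(query: string, topK: number): Promise<{ id: string; text: string; score: number }[]>;
  clear(): Promise<void>;
}

// Toy backend: stores messages and ranks them by naive term overlap.
class InMemoryProvider implements MemoryProvider {
  name = "in-memory";
  private store: { id: string; text: string }[] = [];

  async ingest(sessionId: string, messages: { role: string; content: string }[]) {
    for (const [i, m] of messages.entries())
      this.store.push({ id: `${sessionId}:${i}`, text: m.content });
  }

  async search(query: string, topK: number) {
    const terms = query.toLowerCase().split(/\s+/);
    return this.store
      .map((e) => ({ ...e, score: terms.filter((t) => e.text.toLowerCase().includes(t)).length }))
      .filter((e) => e.score > 0)
      .sort((a, b) => b.score - a.score)
      .slice(0, topK);
  }

  async clear() { this.store = []; }
}
```

Because the orchestrator only sees `MemoryProvider`, everything downstream (answering, judging, reporting) stays provider-agnostic.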

Tech Stack:

  • Bun runtime (TypeScript)
  • Next.js 15 + TailwindCSS for UI
  • SQLite for persistence
  • WebSocket for real-time progress
  • AI SDK for multi-model support

Key Commands:

  • bun run src/index.ts run -p supermemory -b locomo -j gpt-4o -r myrun
  • bun run src/index.ts compare -p supermemory,mem0,zep -b locomo -j claude-sonnet-4
  • bun run src/index.ts serve

## Challenges we ran into

Zep Integration:

  • 10,000-character limit per episode - implemented chunking with sentence boundary detection
  • Maximum 20 episodes per batch request - built batched ingestion
  • Graph ontology required - without setting entity types (Person, Event, Location, etc.), temporal data wouldn't be returned
  • Required custom prompt to interpret valid_at timestamps correctly (event time vs. mention time)
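
The chunking workaround for the per-episode limit can be sketched as follows (a simplified version; it does not handle a single sentence longer than the cap, and the limit is lowered in the test for brevity):

```typescript
// Split text into chunks of at most maxLen characters, breaking only at
// sentence boundaries so no episode is cut mid-sentence.
function chunkBySentence(text: string, maxLen: number): string[] {
  const sentences = text.match(/[^.!?]+[.!?]*\s*/g) ?? [text];
  const chunks: string[] = [];
  let current = "";
  for (const s of sentences) {
    // Start a new chunk when adding this sentence would exceed the cap.
    if (current && current.length + s.length > maxLen) {
      chunks.push(current.trim());
      current = "";
    }
    current += s;
  }
  if (current.trim()) chunks.push(current.trim());
  return chunks;
}
```

Each chunk then becomes one episode, ingested in batches of at most 20 per request.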

Mem0 Integration:

  • Custom instructions at project-level needed for temporal information extraction
  • Async indexing with event polling - had to implement status checking loop
  • Specific metadata format required for date information to be captured
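
The status-checking loop for async indexing looks roughly like this. `getStatus` and the status strings are hypothetical stand-ins for the provider's real API; the point is the poll-until-done-or-timeout shape.

```typescript
// Poll an async indexing job until it completes, fails, or times out.
async function waitForIndexing(
  getStatus: () => Promise<"pending" | "done" | "failed">,
  { intervalMs = 1000, timeoutMs = 60_000 } = {},
): Promise<void> {
  const deadline = Date.now() + timeoutMs;
  while (Date.now() < deadline) {
    const status = await getStatus();
    if (status === "done") return;
    if (status === "failed") throw new Error("indexing failed");
    await new Promise((r) => setTimeout(r, intervalMs));
  }
  throw new Error("indexing timed out");
}
```

Without this gate, searches issued immediately after ingestion silently return incomplete results.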

Benchmark Generalization:

  • Zep's official LoCoMo evaluation used custom flow incompatible with other providers
  • Mem0's evaluation had similar provider-specific assumptions
  • Built unified session format to normalize across benchmarks
  • Created provider-specific prompts while keeping pipeline generic
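
The unified session format can be sketched as a pair of types plus a per-benchmark adapter that maps raw files into them. Field names here are assumptions, not the project's exact schema:

```typescript
// Normalized conversation unit shared by all benchmarks.
type Session = {
  id: string;
  timestamp?: string; // when the conversation happened (needed for temporal questions)
  messages: { role: "user" | "assistant"; content: string }[];
};

// Normalized question shared by all benchmarks.
type Question = {
  id: string;
  text: string;
  answer: string;
  category?: string;     // e.g. "temporal", "abstention", "knowledge-update"
  evidenceIds?: string[]; // gold passages, when available, for retrieval metrics
};

// A benchmark adapter maps that benchmark's raw record shape into Question.
function normalize(raw: { qid: string; question: string; golden_answer: string }): Question {
  return { id: raw.qid, text: raw.question, answer: raw.golden_answer };
}
```

One adapter per benchmark pays this normalization cost once; every provider then consumes the same shapes.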

Temporal Reasoning:

  • Different providers interpret timestamps differently
  • Built specialized judge prompts for temporal questions with off-by-one tolerance
  • Added question-type-specific evaluation (abstention, knowledge update, preference)
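
The real judge for temporal questions is an LLM prompt, but the off-by-one tolerance rule itself is mechanical and can be expressed directly. A sketch, interpreting "off-by-one" as a one-day window on ISO dates:

```typescript
// True if two ISO date strings are at most one day apart (or unparseable → false).
function withinOneDay(predicted: string, expected: string): boolean {
  const dayMs = 24 * 60 * 60 * 1000;
  const diff = Math.abs(Date.parse(predicted) - Date.parse(expected));
  return Number.isFinite(diff) && diff <= dayMs;
}
```

This keeps a provider from being scored wrong for answering "May 2" when the gold date is "May 1" and the conversation itself is ambiguous about the boundary.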

## Accomplishments that we're proud of

  • True apples-to-apples comparison: Same questions, same judge, same answering model across providers
  • One-command comparison: compare runs all providers in parallel with shared question set
  • Resumable runs: Checkpoint system survives crashes and allows phase-by-phase debugging
  • Question-type breakdown: See exactly where each provider excels or fails
  • Retrieval metrics: Not just accuracy - full IR evaluation when ground truth passages available
  • Clean separation: Provider specifics isolated to adapter + prompts, everything else generic
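
The checkpoint system's resume logic reduces to "find the first phase not yet completed." A minimal sketch, with illustrative phase names matching the pipeline above:

```typescript
// Pipeline phases in execution order.
const PHASES = ["ingest", "search", "answer", "evaluate", "report"] as const;
type Phase = (typeof PHASES)[number];

// Given the phases a checkpoint records as completed, return the phase to
// resume from, or null if the run already finished.
function nextPhase(completed: readonly Phase[]): Phase | null {
  for (const p of PHASES) if (!completed.includes(p)) return p;
  return null;
}
```

Persisting the completed-phase list (plus per-phase outputs) after each phase is what lets a crashed run restart at, say, `answer` without re-ingesting.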

## What we learned

  • Memory providers have vastly different APIs and assumptions - standardization is essential
  • Temporal reasoning is the hardest challenge for memory systems
  • Provider-specific prompts matter more than we expected - raw JSON parsing varies wildly
  • Batching and rate limits are provider-specific gotchas that need careful handling
  • A good checkpoint system saves hours of rerunning failed evaluations

## What's next for MemoryBench

  • More providers: LangMem, custom RAG baselines, vector-only solutions
  • More benchmarks: Multi-turn reasoning, entity tracking, preference learning
  • Automated leaderboard: Public results with verified reproducibility
  • Provider optimization hints: Suggest prompt/config improvements based on failure patterns
  • Hosted version: One-click evaluation without local setup
