Comparing changes

feather-store/feather: master@{1day}...master
  • 6 commits
  • 17 files changed
  • 1 contributor

Commits on Apr 26, 2026

  1. arXiv paper: add LongMemEval section + 4 new refs; draft blog post

    Paper updates (docs/featherdb_paper.tex + .pdf):
    - New §4.7 'End-to-End Memory Benchmark: LongMemEval' with the 0.657
      S-variant headline, configuration, per-axis breakdown, three
      supported claims, and a reproduction pointer.
    - Comparison table referencing Mem0/Zep/Supermemory/full-context numbers.
    - Bibliography: new entries for Xu et al. 2024 (LongMemEval),
      Rasmussen et al. 2025 (Zep), Mem0 token-efficient blog,
      Supermemory research page.
    - Recompiled with tectonic (97KB).
    
    docs/blog/longmemeval-results.md:
    - Public-facing draft post. Headline: 'Feather DB beats GPT-4o
      full-context on LongMemEval — using a free-tier model'.
    - Three claims, three caveats, reproduction command, per-axis table
      vs Zep + Supermemory, and a pointer to docs/benchmarks/longmemeval.md
      for the long-form report.
    ashwath007 committed Apr 26, 2026 · a5f0958
  2. Add Azure OpenAI chat provider for benchmark answerer/judge

    - bench/providers_azure.py: AzureChatProvider implementing the
      LLMProvider.complete() interface, env-driven via
      AZURE_OPENAI_CHAT_{ENDPOINT,API_KEY,DEPLOYMENT,API_VERSION}.
      Falls back to AZURE_OPENAI_{ENDPOINT,API_KEY} so a single Azure
      resource works for both embeddings and chat without renaming.
    - bench/judges_llm.py: accept 'azure' / 'azure-openai' as valid
      provider names in _provider_from_name(). A lazy import keeps the
      openai SDK optional for users on Gemini/Claude.
    - bench/__main__.py: extend --judge-provider / --answerer-provider
      choices.
    
    Smoke test: GPT-4o on Azure replies '43' to '75 minus 32'. Ready to run
    LongMemEval_S with the GPT-4o answerer to measure how much of the
    0.657 -> 0.816 gap to Supermemory is down to model class.
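
    A minimal sketch of the provider shape, assuming LLMProvider.complete()
    takes a prompt string and returns the completion text. The env-var names
    are from this commit; the class internals are illustrative, not the
    actual module:

      import os
      from openai import AzureOpenAI  # the real module lazy-imports this

      class AzureChatProvider:
          """Chat provider backed by an Azure OpenAI deployment."""

          def __init__(self):
              # Prefer chat-specific vars; fall back to the shared Azure
              # vars so one resource serves both embeddings and chat.
              endpoint = (os.environ.get("AZURE_OPENAI_CHAT_ENDPOINT")
                          or os.environ["AZURE_OPENAI_ENDPOINT"])
              api_key = (os.environ.get("AZURE_OPENAI_CHAT_API_KEY")
                         or os.environ["AZURE_OPENAI_API_KEY"])
              self.deployment = os.environ["AZURE_OPENAI_CHAT_DEPLOYMENT"]
              self.client = AzureOpenAI(
                  azure_endpoint=endpoint,
                  api_key=api_key,
                  api_version=os.environ.get(
                      "AZURE_OPENAI_CHAT_API_VERSION", "2024-06-01"),
              )

          def complete(self, prompt: str) -> str:
              resp = self.client.chat.completions.create(
                  model=self.deployment,  # Azure routes on deployment name
                  messages=[{"role": "user", "content": prompt}],
              )
              return resp.choices[0].message.content

    Hypothetical invocation (exact subcommand and remaining flags per
    bench/__main__.py): python -m bench ... --answerer-provider azure
    --judge-provider azure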
    ashwath007 committed Apr 26, 2026 · 83af5a0
  3. LongMemEval_S with GPT-4o answerer: 0.693 (+3.6pp over gemini-flash)

    Same retrieval pipeline (Feather + Azure text-embedding-3-small +
    adaptive decay), GPT-4o answerer instead of gemini-2.5-flash. Wall-clock
    ~272 min, 5/500 failures (same embedder context-length issue as
    before), ~$7-9 total.
    
    Per-axis vs Gemini-Flash run (same retrieval):
                              flash    gpt-4o   Δ
      overall                 0.657    0.693    +3.6pp
      information-extraction  0.896    0.942    +4.6pp
      knowledge-updates       0.714    0.714    +0.0pp  (unchanged)
      multi-session           0.583    0.606    +2.3pp
      temporal-reasoning      0.417    0.477    +6.0pp
    
    By question_type:
                                flash    gpt-4o   Δ
      single-session-user       0.941    1.000    +5.9pp  (PERFECT)
      single-session-assistant  0.964    0.964    TIE
      single-session-preference 0.667    0.767    +10.0pp
      knowledge-update          0.714    0.714    UNCHANGED
      multi-session             0.583    0.606    +2.3pp
      temporal-reasoning        0.417    0.477    +6.0pp
    
    vs Supermemory + GPT-4o (same model class):
                                feather  superm.  Δ
      overall                   0.693    0.816    -12.3pp  Supermemory leads
      single-session-user       1.000    0.971    +2.9pp   WE WIN
      single-session-assistant  0.964    0.964    TIE
      single-session-preference 0.767    0.700    +6.7pp   WE WIN
      knowledge-update          0.714    0.885    -17.1pp
      multi-session             0.606    0.714    -10.8pp
      temporal-reasoning        0.477    0.767    -29.0pp
    
    Diagnostic: Supermemory's lead is concentrated in the three reasoning
    axes (KU + multi-session + temporal). Knowledge-update is unchanged
    across model classes for us, indicating it's a *structural* gap (lack
    of LLM fact extraction at ingest), not an answerer-capability gap.
    Closing the gap requires Phase 9 (LLM extractors) and decay-aware
    retrieval (surfacing old and new facts in parallel for temporal
    questions).
    
    Updates:
    - bench/results/longmemeval__s__*.json: GPT-4o run added.
    - docs/benchmarks/longmemeval.md: TL;DR, results table, comparison
      table, and 'what we don't beat' section all reflect both runs.
    - docs/featherdb_paper.tex: §4.7 results paragraph + table updated
      with GPT-4o numbers. PDF recompiled.
    - README.md: Benchmarks table now lists both runs, GPT-4o first.
    ashwath007 committed Apr 26, 2026 · bb57d62
  4. Marketing pack for v0.8.0 LongMemEval launch

    Four assets ready for Claude Cowork to execute:
    
    docs/marketing/gtm-plan.md
    - Positioning, ICP, three-claim core message, channel strategy,
      90-day KPIs, founder talking points, asset checklist.
    - Conversion goal: Cloud waitlist email capture.
    - Explicitly: no Supermemory head-to-head in launch creative.
    
    docs/blog/longmemeval-publish.md
    - Public-ready article (800-1200 words). Headline: 'You don't need
      GPT-4o full-context for AI memory — Feather DB beats it for $2.40'.
    - Lists Feather + GPT-4o (0.693), Feather + Gemini-Flash (0.657),
      full-context ceilings, naive RAG. Does NOT list Mem0/Zep/Supermemory.
    - Includes reproduce command, per-axis tables, Phase 9 + Cloud teaser.
    - Note: docs/blog/longmemeval-results.md (the original draft, with
      Supermemory) is left in place as the internal-only / detailed version.
    
    docs/marketing/twitter-thread.md
    - 7-tweet thread, image spec for the headline chart, posting timing,
      reply templates for predictable Qs.
    
    docs/marketing/hn-submission.md
    - Title (70-char), submission URL placeholder, first-comment context
      template, 7 reply templates for predictable HN questions.
    
    GTM hand-off: this pack is what the marketing function works from.
    ashwath007 committed Apr 26, 2026 · 786a705
  5. Publish round 2: HF Dataset + README badges + arXiv submission package

    HF Dataset created: https://huggingface.co/datasets/Hawky-ai/feather-db-benchmarks
    - 22 result JSONs (LongMemEval oracle/S, SIFT1M, synthetic)
    - Dataset card with schema documentation, headline numbers, reproduce
      command. Loadable via datasets.load_dataset(...).
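
    A minimal load sketch (dataset id from the URL above; assumes the
    card's data files resolve to a default config):

      from datasets import load_dataset

      # Pulls the result JSONs as rows; schema per the dataset card.
      ds = load_dataset("Hawky-ai/feather-db-benchmarks")
      print(ds)  # inspect available splits and columns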
    
    README header badges:
    - LongMemEval_S: 0.693 (GPT-4o) and 0.657 (Gemini-Flash)
    - SIFT1M p50 = 0.19ms
    - Recall@10 = 0.972
    - HF benchmarks dataset link
    - Updated HF Space link from Sri-Vigneshwar-DJ to Hawky-ai (org-owned)
    
    docs/arxiv-submission/:
    - featherdb_paper.tex + featherdb_paper.pdf (verified compile)
    - SUBMISSION_GUIDE.md: manual upload runbook for arxiv.org since the
      arXiv replace-article flow is web-form-only.
    ashwath007 committed Apr 26, 2026 · 5df6473
  6. Add CONTENT_INDEX.json + front-matter for website auto-pull

    The Feather website's Claude-code agent can now watch
    docs/CONTENT_INDEX.json as the single manifest for all publishable
    content. The manifest lists every blog post, marketing asset,
    technical report, and paper with:
      - canonical raw_url (raw.githubusercontent.com)
      - status (ready / internal)
      - channels (where to publish)
      - tags, summary, cover_image_spec
      - do_not_mention rules (e.g. no Supermemory head-to-heads)
      - headline_metrics (single source of truth for numbers)
    
    Also added rules_for_consumers (governance), watcher_recipe (how to
    detect changes), and always_include_links (canonical URLs).
    
    YAML front-matter added to docs/blog/longmemeval-publish.md so the
    blog post is also self-describing for any consumer that fetches it
    directly. Other files will get front-matter as they become
    publish-ready.
    
    The website agent flow (a minimal sketch follows the list):
      1. Watch CONTENT_INDEX.json for SHA changes (15-min cadence).
      2. For each item with status=ready, fetch raw_url.
      3. Parse front-matter + body, render per channel.
      4. Re-publish on any detected SHA change.
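
    A minimal Python sketch of that loop, assuming the manifest fields named
    above (status, raw_url) sit under a top-level 'items' key; publish() is
    a hypothetical per-channel render hook:

      import hashlib, json, time, urllib.request

      INDEX_URL = ("https://raw.githubusercontent.com/feather-store/"
                   "feather/master/docs/CONTENT_INDEX.json")

      def fetch(url):
          with urllib.request.urlopen(url) as resp:
              return resp.read().decode("utf-8")

      def publish(item, front_matter, body):
          ...  # hypothetical: render per item["channels"], push to site

      published = {}  # raw_url -> SHA of last published content
      while True:
          manifest = json.loads(fetch(INDEX_URL))
          for item in manifest.get("items", []):  # 'items' key is assumed
              if item.get("status") != "ready":
                  continue  # internal-only assets never leave the repo
              text = fetch(item["raw_url"])
              sha = hashlib.sha256(text.encode()).hexdigest()
              if published.get(item["raw_url"]) == sha:
                  continue  # no change since last publish
              if text.startswith("---\n"):  # split YAML front-matter
                  _, front_matter, body = text.split("---\n", 2)
              else:
                  front_matter, body = "", text
              publish(item, front_matter, body)
              published[item["raw_url"]] = sha
          time.sleep(15 * 60)  # 15-minute cadence from the watcher_recipe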
    ashwath007 committed Apr 26, 2026 · 7f812d2