Comparing changes

feather-store/feather: master@{1day}...master
  • 6 commits
  • 17 files changed
  • 1 contributor

Commits on Apr 26, 2026

  1. arXiv paper: add LongMemEval section + 4 new refs; draft blog post

    Paper updates (docs/featherdb_paper.tex + .pdf):
    - New §4.7 'End-to-End Memory Benchmark: LongMemEval' with the 0.657
      S-variant headline, configuration, per-axis breakdown, three
      supported claims, and a reproduction pointer.
    - Comparison table referencing Mem0/Zep/Supermemory/full-context numbers.
    - Bibliography: new entries for Xu et al. 2024 (LongMemEval),
      Rasmussen et al. 2025 (Zep), Mem0 token-efficient blog,
      Supermemory research page.
    - Recompiled with tectonic (97KB).
    
    docs/blog/longmemeval-results.md:
    - Public-facing draft post. Headline: 'Feather DB beats GPT-4o
      full-context on LongMemEval — using a free-tier model'.
    - Three claims, three caveats, reproduction command, per-axis table
      vs Zep + Supermemory, and a pointer to docs/benchmarks/longmemeval.md
      for the long-form report.
    ashwath007 committed Apr 26, 2026 · a5f0958
  2. Add Azure OpenAI chat provider for benchmark answerer/judge

    - bench/providers_azure.py: AzureChatProvider implementing the
      LLMProvider.complete() interface, env-driven via
      AZURE_OPENAI_CHAT_{ENDPOINT,API_KEY,DEPLOYMENT,API_VERSION}.
      Falls back to AZURE_OPENAI_{ENDPOINT,API_KEY} so a single Azure
      resource works for both embeddings and chat without renaming.
    - bench/judges_llm.py: accept 'azure' / 'azure-openai' as valid
      provider names in _provider_from_name(). A lazy import keeps the
      openai SDK optional for users on Gemini/Claude.
    - bench/__main__.py: extend --judge-provider / --answerer-provider
      choices.
    
    Smoke test: GPT-4o on Azure replies '43' to '75 minus 32'. Ready to run
    LongMemEval_S with the GPT-4o answerer to measure how much of the
    0.657 -> 0.816 gap to Supermemory is down to model class.
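
    A minimal sketch of the provider shape, assuming LLMProvider.complete()
    takes a prompt string and returns the completion text. The env-var names
    are from this commit; the class internals are illustrative, not the
    actual module:

      import os
      from openai import AzureOpenAI  # the real module lazy-imports this

      class AzureChatProvider:
          """Chat provider backed by an Azure OpenAI deployment."""

          def __init__(self):
              # Prefer chat-specific vars; fall back to the shared Azure
              # vars so one resource serves both embeddings and chat.
              endpoint = (os.environ.get("AZURE_OPENAI_CHAT_ENDPOINT")
                          or os.environ["AZURE_OPENAI_ENDPOINT"])
              api_key = (os.environ.get("AZURE_OPENAI_CHAT_API_KEY")
                         or os.environ["AZURE_OPENAI_API_KEY"])
              self.deployment = os.environ["AZURE_OPENAI_CHAT_DEPLOYMENT"]
              self.client = AzureOpenAI(
                  azure_endpoint=endpoint,
                  api_key=api_key,
                  api_version=os.environ.get(
                      "AZURE_OPENAI_CHAT_API_VERSION", "2024-06-01"),
              )

          def complete(self, prompt: str) -> str:
              resp = self.client.chat.completions.create(
                  model=self.deployment,  # Azure routes on deployment name
                  messages=[{"role": "user", "content": prompt}],
              )
              return resp.choices[0].message.content

    Hypothetical invocation (exact subcommand and remaining flags per
    bench/__main__.py): python -m bench ... --answerer-provider azure
    --judge-provider azure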
    ashwath007 committed Apr 26, 2026 · 83af5a0
  3. LongMemEval_S with GPT-4o answerer: 0.693 (+3.6pp over gemini-flash)

    Same retrieval pipeline (Feather + Azure text-embedding-3-small +
    adaptive decay), GPT-4o answerer instead of gemini-2.5-flash. Wall-clock
    ~272 min, 5/500 failures (same embedder context-length issue as
    before), ~$7-9 total.
    
    Per-axis vs Gemini-Flash run (same retrieval):
                              flash    gpt-4o   Δ
      overall                 0.657    0.693    +3.6pp
      information-extraction  0.896    0.942    +4.6pp
      knowledge-updates       0.714    0.714    +0.0pp  (unchanged)
      multi-session           0.583    0.606    +2.3pp
      temporal-reasoning      0.417    0.477    +6.0pp
    
    By question_type:
                                flash    gpt-4o   Δ
      single-session-user       0.941    1.000    +5.9pp  (PERFECT)
      single-session-assistant  0.964    0.964    TIE
      single-session-preference 0.667    0.767    +10.0pp
      knowledge-update          0.714    0.714    UNCHANGED
      multi-session             0.583    0.606    +2.3pp
      temporal-reasoning        0.417    0.477    +6.0pp
    
    vs Supermemory + GPT-4o (same model class):
                                feather  superm.  Δ
      overall                   0.693    0.816    -12.3pp  Supermemory leads
      single-session-user       1.000    0.971    +2.9pp   WE WIN
      single-session-assistant  0.964    0.964    TIE
      single-session-preference 0.767    0.700    +6.7pp   WE WIN
      knowledge-update          0.714    0.885    -17.1pp
      multi-session             0.606    0.714    -10.8pp
      temporal-reasoning        0.477    0.767    -29.0pp
    
    Diagnostic: Supermemory's lead is concentrated in the three reasoning
    axes (KU + multi-session + temporal). Knowledge-update is unchanged
    across model classes for us, indicating it's a *structural* gap (lack
    of LLM fact extraction at ingest), not an answerer-capability gap.
    Closing the gap requires Phase 9 (LLM extractors) and decay-aware
    retrieval (surfacing old and new facts in parallel for temporal
    questions).
    
    Updates:
    - bench/results/longmemeval__s__*.json: GPT-4o run added.
    - docs/benchmarks/longmemeval.md: TL;DR, results table, comparison
      table, and 'what we don't beat' section all reflect both runs.
    - docs/featherdb_paper.tex: §4.7 results paragraph + table updated
      with GPT-4o numbers. PDF recompiled.
    - README.md: Benchmarks table now lists both runs, GPT-4o first.
    ashwath007 committed Apr 26, 2026 · bb57d62
  4. Marketing pack for v0.8.0 LongMemEval launch

    Four assets ready for Claude Cowork to execute:
    
    docs/marketing/gtm-plan.md
    - Positioning, ICP, three-claim core message, channel strategy,
      90-day KPIs, founder talking points, asset checklist.
    - Conversion goal: Cloud waitlist email capture.
    - Explicitly: no Supermemory head-to-head in launch creative.
    
    docs/blog/longmemeval-publish.md
    - Public-ready article (800-1200 words). Headline: 'You don't need
      GPT-4o full-context for AI memory — Feather DB beats it for $2.40'.
    - Lists Feather + GPT-4o (0.693), Feather + Gemini-Flash (0.657),
      full-context ceilings, naive RAG. Does NOT list Mem0/Zep/Supermemory.
    - Includes reproduce command, per-axis tables, Phase 9 + Cloud teaser.
    - Note: docs/blog/longmemeval-results.md (the original draft, with
      Supermemory) is left in place as the internal-only / detailed version.
    
    docs/marketing/twitter-thread.md
    - 7-tweet thread, image spec for the headline chart, posting timing,
      reply templates for predictable Qs.
    
    docs/marketing/hn-submission.md
    - Title (70-char), submission URL placeholder, first-comment context
      template, 7 reply templates for predictable HN questions.
    
    GTM hand-off: this pack is what the marketing function works from.
    ashwath007 committed Apr 26, 2026 · 786a705
  5. Publish round 2: HF Dataset + README badges + arXiv submission package

    HF Dataset created: https://huggingface.co/datasets/Hawky-ai/feather-db-benchmarks
    - 22 result JSONs (LongMemEval oracle/S, SIFT1M, synthetic)
    - Dataset card with schema documentation, headline numbers, reproduce
      command. Loadable via datasets.load_dataset(...).
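
    A minimal load sketch (dataset id from the URL above; assumes the
    card's data files resolve to a default config):

      from datasets import load_dataset

      # Pulls the result JSONs as rows; schema per the dataset card.
      ds = load_dataset("Hawky-ai/feather-db-benchmarks")
      print(ds)  # inspect available splits and columns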
    
    README header badges:
    - LongMemEval_S: 0.693 (GPT-4o) and 0.657 (Gemini-Flash)
    - SIFT1M p50 = 0.19ms
    - Recall@10 = 0.972
    - HF benchmarks dataset link
    - Updated HF Space link from Sri-Vigneshwar-DJ to Hawky-ai (org-owned)
    
    docs/arxiv-submission/:
    - featherdb_paper.tex + featherdb_paper.pdf (verified compile)
    - SUBMISSION_GUIDE.md: manual upload runbook for arxiv.org since the
      arXiv replace-article flow is web-form-only.
    ashwath007 committed Apr 26, 2026 · 5df6473
  6. Add CONTENT_INDEX.json + front-matter for website auto-pull

    The Feather website's Claude-code agent can now watch
    docs/CONTENT_INDEX.json as the single manifest for all publishable
    content. The manifest lists every blog post, marketing asset,
    technical report, and paper with:
      - canonical raw_url (raw.githubusercontent.com)
      - status (ready / internal)
      - channels (where to publish)
      - tags, summary, cover_image_spec
      - do_not_mention rules (e.g. no Supermemory head-to-heads)
      - headline_metrics (single source of truth for numbers)
    
    Also added rules_for_consumers (governance), watcher_recipe (how to
    detect changes), and always_include_links (canonical URLs).
    
    YAML front-matter added to docs/blog/longmemeval-publish.md so the
    blog post is also self-describing for any consumer that fetches it
    directly. Other files will get front-matter as they become
    publish-ready.
    
    The website agent flow (a minimal sketch follows the list):
      1. Watch CONTENT_INDEX.json for SHA changes (15-min cadence).
      2. For each item with status=ready, fetch raw_url.
      3. Parse front-matter + body, render per channel.
      4. Re-publish on any detected SHA change.
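
    A minimal Python sketch of that loop, assuming the manifest fields named
    above (status, raw_url) sit under a top-level 'items' key; publish() is
    a hypothetical per-channel render hook:

      import hashlib, json, time, urllib.request

      INDEX_URL = ("https://raw.githubusercontent.com/feather-store/"
                   "feather/master/docs/CONTENT_INDEX.json")

      def fetch(url):
          with urllib.request.urlopen(url) as resp:
              return resp.read().decode("utf-8")

      def publish(item, front_matter, body):
          ...  # hypothetical: render per item["channels"], push to site

      published = {}  # raw_url -> SHA of last published content
      while True:
          manifest = json.loads(fetch(INDEX_URL))
          for item in manifest.get("items", []):  # 'items' key is assumed
              if item.get("status") != "ready":
                  continue  # internal-only assets never leave the repo
              text = fetch(item["raw_url"])
              sha = hashlib.sha256(text.encode()).hexdigest()
              if published.get(item["raw_url"]) == sha:
                  continue  # no change since last publish
              if text.startswith("---\n"):  # split YAML front-matter
                  _, front_matter, body = text.split("---\n", 2)
              else:
                  front_matter, body = "", text
              publish(item, front_matter, body)
              published[item["raw_url"]] = sha
          time.sleep(15 * 60)  # 15-minute cadence from the watcher_recipe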
    ashwath007 committed Apr 26, 2026 · 7f812d2