# backlog.yaml (439 lines, 14.6 KB)
project: openexp-v2
goal: Persistent memory for Claude Code that learns from experience
created: 2026-04-08
stage_0_cleanup:
name: Cleanup v1 dead code
status: DONE
tickets:
- id: S0-01
title: Delete observation pipeline (PostToolUse hook + ingest code)
status: DONE
description: 'Removed post-tool-use.sh hook, observation.py, filters.py, session_summary.py,
reward.py. Removed from settings.local.json.
'
done_at: 2026-04-08
- id: S0-02
title: Create transcript.py — store full conversations
status: DONE
description: 'New module openexp/ingest/transcript.py. Parses Claude Code JSONL,
embeds user/assistant messages, batch upserts to Qdrant.
'
done_at: 2026-04-08
- id: S0-03
title: Wire transcript ingest into session-end.sh
status: DONE
description: 'Added Phase 2d to session-end.sh — calls ingest_transcript() after
decision extraction.
'
done_at: 2026-04-08
- id: S0-04
title: Backfill all historical transcripts
status: DONE
description: '158 sessions, 13,154 messages ingested into Qdrant. Replaced 284K
noise observations with 16K clean conversation data.
'
done_at: 2026-04-08
- id: S0-05
title: Fix broken tests after cleanup
status: DONE
description: 'Deleted 3 test files, removed 6 tests from 3 files. 256 passed,
0 failed.
'
done_at: 2026-04-08
- id: S0-06
title: Delete all old observations from Qdrant
status: DONE
priority: P0
description: 'Remove all points where source != "transcript" and type != "decision".
Keep only conversation transcripts and extracted decisions. User explicitly
asked to remove all old observations.
'
done_at: '2026-04-09'
- id: S0-07
title: Commit and PR all cleanup changes
status: DONE
priority: P0
description: 'Branch cleanup/v2-prep. All changes from S0-01 through S0-05. Run
tests, verify, PR, merge.
'
done_at: '2026-04-09'
stage_1_store:
name: Reliable transcript storage
status: DONE
definition_of_done: 'Every session''s full conversation is stored exactly once in
Qdrant. Re-running ingest on the same session is a no-op. CLI can ingest any transcript
by path or session ID.
'
tickets:
- id: S1-01
title: Add idempotency guard to transcript ingest
status: DONE
priority: P0
description: 'Before ingesting, check if session_id already has points in Qdrant.
If yes — skip. Prevents duplicates on re-run. Implementation: scroll with filter
session_id=X, if count > 0 skip.
'
tests:
- test_ingest_same_session_twice_is_noop
- test_ingest_new_session_stores_messages
done_at: '2026-04-09'
- id: S1-02
title: Add dedup check for backfill (detect existing duplicates)
status: DONE
priority: P1
description: 'Scan Qdrant for duplicate session_ids. Report count. Optionally
delete duplicates keeping newest batch.
'
tests:
- test_find_duplicate_sessions
done_at: '2026-04-09'
- id: S1-03
title: Improve transcript parsing — handle edge cases
status: DONE
priority: P1
description: 'Handle: empty messages, very long messages (>5000 chars → chunk),
messages with only tool calls (skip), image blocks (skip). Add content-type
metadata to each point.
'
tests:
- test_parse_empty_message_skipped
- test_parse_long_message_chunked
- test_parse_tool_only_message_skipped
done_at: '2026-04-09'
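The >5000-char chunking rule from S1-03 can be sketched as a pure helper; the fixed-width split (no overlap, no word-boundary handling) is an illustrative simplification, not the actual implementation:

```python
def chunk_message(text: str, max_chars: int = 5000) -> list[str]:
    """Split a long message into chunks of at most max_chars characters.

    Short messages pass through as a single chunk; empty messages
    produce no chunks (they are skipped entirely, per S1-03).
    """
    if not text:
        return []
    if len(text) <= max_chars:
        return [text]
    # Fixed-width split; a real implementation might respect word boundaries.
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]
```

Each chunk would then be embedded as its own Qdrant point, with content-type metadata marking it as a fragment of a longer message.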
- id: S1-04
title: 'CLI: openexp ingest --all (bulk with idempotency)'
status: DONE
priority: P1
description: 'Ingest all transcripts from all project dirs. Skip already-ingested
sessions. Show progress bar.
'
tests:
- test_cli_ingest_all_skips_existing
done_at: '2026-04-09'
- id: S1-05
title: Add transcript ingest tests
status: DONE
priority: P0
description: 'Unit tests for parse_transcript() and ingest_transcript(). Mock
Qdrant client. Test JSONL parsing, system-reminder filtering, message extraction,
batch upsert logic.
'
tests:
- test_parse_transcript_user_messages
- test_parse_transcript_assistant_messages
- test_parse_transcript_filters_system_reminders
- test_ingest_transcript_batch_upsert
- test_ingest_transcript_dry_run
done_at: '2026-04-09'
- id: S1-06
title: Reset Q-cache (all zeros → empty)
status: DONE
priority: P2
description: 'Q-cache has 100K entries all at 0.0, 12MB file. Reset to empty.
Q-values will rebuild from v2 reward system.
'
done_at: '2026-04-09'
stage_2_search:
name: Fast, accurate memory retrieval
status: IN_PROGRESS
definition_of_done: 'search_memory returns relevant conversation fragments. Scoring:
vector 50% + BM25 15% + recency 20% + importance 15%. No Q-value in scoring until
Stage 4 proves it works. p50 latency < 200ms for top-10 results.
'
tickets:
- id: S2-01
title: Simplify scoring formula — remove Q-value weight
status: DONE
priority: P1
description: 'Current: vector 30% + BM25 10% + recency 15% + importance 15% +
Q 30%. New: vector 50% + BM25 15% + recency 20% + importance 15%. Q-value weight
= 0 until Stage 4. Keep Q infrastructure, just zero the weight.
'
tests:
- test_scoring_without_q_value
- test_scoring_weights_sum_to_1
done_at: '2026-04-09'
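The new weighting from S2-01 can be sketched as a plain function. Component scores are assumed to be normalized to [0, 1] before combining, and the way the Q weight is reintroduced later (proportional rescaling of the base weights here) is an assumption, not the project's confirmed approach:

```python
def combined_score(vector: float, bm25: float, recency: float, importance: float,
                   q_value: float = 0.0, q_weight: float = 0.0) -> float:
    """Blend component scores: vector 50% + BM25 15% + recency 20% + importance 15%.

    q_weight stays 0.0 until Stage 4; when raised, the base weights are
    scaled down proportionally so the total still sums to 1.
    """
    base = {"vector": 0.50, "bm25": 0.15, "recency": 0.20, "importance": 0.15}
    scale = 1.0 - q_weight  # shrink base weights to make room for Q
    score = (base["vector"] * vector + base["bm25"] * bm25
             + base["recency"] * recency + base["importance"] * importance) * scale
    return score + q_weight * q_value
```

With `q_weight=0.0` this reduces exactly to the four-component formula the ticket specifies, which is what `test_scoring_weights_sum_to_1` would check.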
- id: S2-02
title: Add conversation-aware search filters
status: DONE
priority: P1
description: 'Filter by: source (transcript/decision), role (user/assistant),
date range, project, session_id. All via Qdrant payload filters.
'
tests:
- test_search_filter_by_role
- test_search_filter_by_date_range
- test_search_filter_by_session
done_at: '2026-04-09'
- id: S2-03
title: Benchmark search quality on real queries
status: TODO
priority: P2
description: 'Create 20 test queries with expected results. Measure recall@10
and MRR. Baseline for future improvements.
'
- id: S2-04
title: Tune BM25 parameters
status: TODO
priority: P3
description: 'Current BM25 uses defaults. Test k1=1.2..2.0 and b=0.5..0.9 on the
benchmark set from S2-03.
'
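For reference when running the S2-04 sweep, this is the textbook BM25 term weight that k1 and b parameterize — a standard formulation (Robertson/Sparck Jones IDF), not the project's actual BM25 implementation:

```python
import math

def bm25_term_score(tf: float, df: int, n_docs: int, doc_len: float,
                    avg_doc_len: float, k1: float = 1.2, b: float = 0.75) -> float:
    """BM25 contribution of one query term for one document.

    k1 controls term-frequency saturation; b controls document-length
    normalization (b=0 disables it, b=1 fully normalizes).
    """
    idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
    norm = k1 * (1 - b + b * doc_len / avg_doc_len)
    return idf * tf * (k1 + 1) / (tf + norm)
```

Raising k1 toward 2.0 lets repeated terms keep contributing; raising b toward 0.9 penalizes long transcript messages more aggressively — which is why both ranges should be swept against the S2-03 benchmark rather than tuned by feel.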
stage_3_interface:
name: 'Hippocampus model: write everything, retrieve on demand'
status: DONE
definition_of_done: 'Write path: every session auto-ingested (SessionEnd hook).
Read path: /recall skill for on-demand retrieval. MCP: 3 core tools (search, add,
stats) + 2 reward (predict, outcome). No auto-injection on every message (UserPromptSubmit
removed).
'
tickets:
- id: S3-01
title: Reduce MCP tools to 5 (hippocampus model)
status: DONE
priority: P0
description: "NEW MODEL: Write everything automatically, retrieve on demand.\n\
Keep 5 tools:\n search_memory — core retrieval (used by /recall skill and hooks)\n\
\ add_memory — explicit memory capture (decisions, facts)\n memory_stats —\
\ system health check\n log_prediction — reward loop input\n log_outcome —\
\ reward loop output\n\nRemove 11 tools: explain_q, calibrate_experience_q,\
\ protect_memory, reload_q_cache, resolve_outcomes, experience_info, experience_insights,\
\ experience_top_memories, reflect, memory_reward_history, reward_detail.\n\
Also remove get_agent_context (dead).\n"
tests:
- test_mcp_lists_exactly_5_tools
- test_each_tool_responds
done_at: '2026-04-13'
- id: S3-02
title: Simplify SessionStart hook
status: DONE
priority: P1
description: 'Simplified to: search top-10 → format as additionalContext → return.
'
done_at: '2026-04-09'
- id: S3-03
title: Remove UserPromptSubmit hook (hippocampus model)
status: DONE
priority: P0
description: 'OLD: search top-5 on EVERY user message, inject as REMINDER. Problem:
noise, slow, fills context with low-relevance results.
NEW: No auto-recall per message. Retrieval is on-demand via /recall. SessionStart
still injects broad context at session start.
Action: remove UserPromptSubmit hook from settings.local.json. Keep the script
file for reference but deactivate the hook.
'
done_at: '2026-04-13'
- id: S3-04
title: Simplify SessionEnd hook
status: DONE
priority: P1
description: 'Two steps: (1) extract decisions, (2) ingest transcript. This is
the WRITE path — runs automatically on every session end.
'
done_at: '2026-04-09'
- id: S3-06
title: Create /recall skill — on-demand hippocampus retrieval
status: TODO
priority: P0
description: "The KEY new piece. A Claude Code skill that:\nUser says: /recall\
\ Acme contract Skill does:\n 1. search_memory(\"Acme contract\", limit=20)\n\
\ 2. Group results by session/date\n 3. Format as structured context with\
\ scores\n 4. Return to Claude for reasoning\n\nUser says: /recall --session\
\ abc123 Skill does: retrieve all messages from that session\nUser says: /recall\
\ --last-week pipeline decisions Skill does: search with date_from filter, type=decision\n\
SKILL.md frontmatter:\n name: recall\n description: Search hippocampus memory\
\ on demand\n user_invocable: true\n arguments: query text + optional flags\n\
\nImplementation: as a Claude Code skill with SKILL.md.\n"
- id: S3-07
title: Decide SessionStart hook fate (keep vs remove)
status: TODO
priority: P2
description: "With /recall available, do we still need SessionStart auto-injection?\n\
Arguments FOR keeping:\n - Gives baseline context without user asking\n -\
\ Cheap (one search at session start)\n\nArguments AGAINST:\n - May inject\
\ irrelevant context\n - /recall is more targeted\n\nDecision: keep for now\
\ but make it opt-out via .openexp.yaml. Revisit after /recall is used for 2\
\ weeks.\n"
stage_4_reward:
name: Working Q-learning loop
status: TODO
definition_of_done: 'ONE reward path works end-to-end: prediction → outcome → Q-value
update. Q-values actually change from defaults. Search results improve with accumulated
rewards.
'
tickets:
- id: S4-01
title: Implement prediction→outcome reward path
status: TODO
priority: P1
description: 'log_prediction stores prediction with memory_ids. log_outcome matches
prediction, computes reward delta, updates Q-values of linked memories. This
is the ONLY reward path in v2.
'
tests:
- test_prediction_logged_with_memory_ids
- test_outcome_updates_q_values
- test_prediction_without_outcome_no_change
- id: S4-02
title: Add Q-value weight back to scoring
status: DONE
priority: P1
description: 'Once predictions prove Q-values move meaningfully, add Q back to
scoring. Start with 10% weight, tune up.
'
depends_on: S4-01
tests:
- test_scoring_with_q_value_weight
done_at: '2026-04-13'
- id: S4-03
title: CRM outcome resolver (optional, if CRM still used)
status: TODO
priority: P3
description: 'Keep crm_csv resolver but as optional plugin. Only wire in if CRM
CSVs exist.
'
- id: S4-04
title: Q-value decay for stale memories
status: TODO
priority: P3
description: 'Memories not retrieved for 30+ days slowly decay toward 0. Prevents
permanently high Q from one lucky prediction.
'
tests:
- test_q_decay_after_30_days
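One way to express the S4-04 decay rule, assuming exponential decay after the 30-day grace period; the half-life value is an illustrative choice, not specified by the ticket:

```python
def decayed_q(q: float, days_since_retrieval: float,
              grace_days: float = 30.0, half_life_days: float = 30.0) -> float:
    """Decay a Q-value toward 0 once a memory goes unretrieved past the grace period."""
    idle = days_since_retrieval - grace_days
    if idle <= 0:
        return q  # retrieved recently: no decay
    # Halve the Q-value for every half_life_days of idle time past the grace period.
    return q * 0.5 ** (idle / half_life_days)
```

This keeps one lucky prediction from pinning a memory's Q permanently high: unless the memory keeps getting retrieved, its advantage erodes on its own.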
- id: S4-05
title: Reward dashboard / CLI report
status: TODO
priority: P3
description: 'CLI command: openexp stats --rewards. Shows total predictions, resolved
%, avg reward, and top-Q memories.
'
stage_5_experience_library:
name: Experience Library — structured experience from conversation data
status: DONE
definition_of_done: 'Full pipeline: chunk → topics → threads → experience labels → Qdrant.
269 experience labels across 35 threads. Searchable via search_memory(type="experience").
Skills /experience and /label-thread working.
'
done_at: '2026-04-14'
tickets:
- id: S5-01
title: Chunking pipeline
status: DONE
description: 'Fetch all transcripts from Qdrant, group by session, sort chronologically,
split into ~200K token chunks. Output: 18 chunks from 156 sessions.
'
done_at: '2026-04-13'
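The grouping step in S5-01 can be sketched as a greedy packer over chronologically sorted sessions. The ~4-chars-per-token estimate and the never-split-a-session rule are assumptions for illustration:

```python
def pack_sessions(sessions: list[tuple[str, str]],
                  budget_tokens: int = 200_000) -> list[list[str]]:
    """Greedily pack (session_id, text) pairs, already sorted chronologically,
    into chunks of roughly budget_tokens each (estimating ~4 chars per token)."""
    chunks: list[list[str]] = []
    current: list[str] = []
    used = 0
    for session_id, text in sessions:
        cost = max(1, len(text) // 4)  # crude token estimate
        if current and used + cost > budget_tokens:
            # Close the current chunk rather than splitting a session across two.
            chunks.append(current)
            current, used = [], 0
        current.append(session_id)
        used += cost
    if current:
        chunks.append(current)
    return chunks
```

Applied to the 156 sessions this yielded 18 chunks, each small enough for Opus to process in one pass during topic extraction (S5-02).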
- id: S5-02
title: Topic extraction per chunk
status: DONE
description: 'Opus extracts topics per chunk. 170 topics across 18 chunks.
'
done_at: '2026-04-13'
- id: S5-03
title: Thread grouping across chunks
status: DONE
description: 'Opus groups 170 topics into 36 work threads spanning multiple chunks.
'
done_at: '2026-04-14'
- id: S5-04
title: Experience labeling (pilot thread)
status: DONE
description: 'Validated the approach on thread #4 (pilot). 19 timeline events, 8
experience labels in context→actions→outcome format.
'
done_at: '2026-04-14'
- id: S5-05
title: add_experience() in Qdrant
status: DONE
description: 'Store experience labels in Qdrant with search-optimized embedding
(situation + insight + applies_when). type="experience", source="experience_library".
'
done_at: '2026-04-14'
- id: S5-06
title: Batch label all 36 threads
status: DONE
description: '269 unique experience labels across 35 threads (1 low_data skip).
All stored in Qdrant. Smoke tests pass for all 5 categories.
'
done_at: '2026-04-14'
- id: S5-07
title: /experience skill — retrieve past experience
status: DONE
description: 'Skill searches Qdrant for type="experience", formats advice.
'
done_at: '2026-04-14'
- id: S5-08
title: /label-thread skill — repeatable labeling
status: DONE
description: '7-step labeling process encoded as a skill. Tested on the Mercury thread.
'
done_at: '2026-04-14'
stage_6_next:
name: Experience Library — adoption and integration
status: TODO
tickets:
- id: S6-01
title: Auto-experience in SessionStart hook
status: TODO
priority: P1
description: 'Search type="experience" on each session start. Inject top 3 relevant
experiences into context alongside regular memories.
'
- id: S6-02
title: Experience compression via compresr.ai
status: TODO
priority: P2
description: 'Compress all 269 experience labels to fit in context window. Partnership
with external compression service.
'
- id: S6-03
title: LoRA training data export
status: TODO
priority: P3
description: 'Export experience labels as training pairs for LoRA fine-tuning.
Format: instruction (situation) → response (actions + reasoning).
'