# backlog.yaml (439 lines, 14.6 KB)
project: openexp-v2
goal: Persistent memory for Claude Code that learns from experience
created: 2026-04-08
stage_0_cleanup:
name: Cleanup v1 dead code
status: DONE
tickets:
- id: S0-01
title: Delete observation pipeline (PostToolUse hook + ingest code)
status: DONE
description: 'Removed post-tool-use.sh hook, observation.py, filters.py, session_summary.py,
reward.py. Removed from settings.local.json.
'
done_at: 2026-04-08
- id: S0-02
title: Create transcript.py — store full conversations
status: DONE
description: 'New module openexp/ingest/transcript.py. Parses Claude Code JSONL,
embeds user/assistant messages, batch upserts to Qdrant.
'
done_at: 2026-04-08
- id: S0-03
title: Wire transcript ingest into session-end.sh
status: DONE
description: 'Added Phase 2d to session-end.sh — calls ingest_transcript() after
decision extraction.
'
done_at: 2026-04-08
- id: S0-04
title: Backfill all historical transcripts
status: DONE
description: '158 sessions, 13,154 messages ingested into Qdrant. Replaced 284K
noise observations with 16K clean conversation data.
'
done_at: 2026-04-08
- id: S0-05
title: Fix broken tests after cleanup
status: DONE
description: 'Deleted 3 test files, removed 6 tests from 3 files. 256 passed,
0 failed.
'
done_at: 2026-04-08
- id: S0-06
title: Delete all old observations from Qdrant
status: DONE
priority: P0
description: 'Remove all points where source != "transcript" and type != "decision".
Keep only conversation transcripts and extracted decisions. User explicitly
asked to remove all old observations.
'
done_at: '2026-04-09'
- id: S0-07
title: Commit and PR all cleanup changes
status: DONE
priority: P0
description: 'Branch cleanup/v2-prep. All changes from S0-01 through S0-05. Run
tests, verify, PR, merge.
'
done_at: '2026-04-09'
stage_1_store:
name: Reliable transcript storage
status: DONE
definition_of_done: 'Every session''s full conversation is stored exactly once in
Qdrant. Re-running ingest on the same session is a no-op. CLI can ingest any transcript
by path or session ID.
'
tickets:
- id: S1-01
title: Add idempotency guard to transcript ingest
status: DONE
priority: P0
description: 'Before ingesting, check if session_id already has points in Qdrant.
If yes — skip. Prevents duplicates on re-run. Implementation: scroll with filter
session_id=X, if count > 0 skip.
'
tests:
- test_ingest_same_session_twice_is_noop
- test_ingest_new_session_stores_messages
done_at: '2026-04-09'
- id: S1-02
title: Add dedup check for backfill (detect existing duplicates)
status: DONE
priority: P1
description: 'Scan Qdrant for duplicate session_ids. Report count. Optionally
delete duplicates keeping newest batch.
'
tests:
- test_find_duplicate_sessions
done_at: '2026-04-09'
- id: S1-03
title: Improve transcript parsing — handle edge cases
status: DONE
priority: P1
description: 'Handle: empty messages, very long messages (>5000 chars → chunk),
messages with only tool calls (skip), image blocks (skip). Add content-type
metadata to each point.
'
tests:
- test_parse_empty_message_skipped
- test_parse_long_message_chunked
- test_parse_tool_only_message_skipped
done_at: '2026-04-09'
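The >5000-char chunking rule from S1-03 can be sketched as a pure helper; the fixed-width split (no overlap, no word-boundary handling) is an illustrative simplification, not the actual implementation:

```python
def chunk_message(text: str, max_chars: int = 5000) -> list[str]:
    """Split a long message into chunks of at most max_chars characters.

    Short messages pass through as a single chunk; empty messages
    produce no chunks (they are skipped entirely, per S1-03).
    """
    if not text:
        return []
    if len(text) <= max_chars:
        return [text]
    # Fixed-width split; a real implementation might respect word boundaries.
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]
```

Each chunk would then be embedded as its own Qdrant point, with content-type metadata marking it as a fragment of a longer message.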
- id: S1-04
title: 'CLI: openexp ingest --all (bulk with idempotency)'
status: DONE
priority: P1
description: 'Ingest all transcripts from all project dirs. Skip already-ingested
sessions. Show progress bar.
'
tests:
- test_cli_ingest_all_skips_existing
done_at: '2026-04-09'
- id: S1-05
title: Add transcript ingest tests
status: DONE
priority: P0
description: 'Unit tests for parse_transcript() and ingest_transcript(). Mock
Qdrant client. Test JSONL parsing, system-reminder filtering, message extraction,
batch upsert logic.
'
tests:
- test_parse_transcript_user_messages
- test_parse_transcript_assistant_messages
- test_parse_transcript_filters_system_reminders
- test_ingest_transcript_batch_upsert
- test_ingest_transcript_dry_run
done_at: '2026-04-09'
- id: S1-06
title: Reset Q-cache (all zeros → empty)
status: DONE
priority: P2
description: 'Q-cache has 100K entries all at 0.0, 12MB file. Reset to empty.
Q-values will rebuild from v2 reward system.
'
done_at: '2026-04-09'
stage_2_search:
name: Fast, accurate memory retrieval
status: IN_PROGRESS
definition_of_done: 'search_memory returns relevant conversation fragments. Scoring:
vector 50% + BM25 15% + recency 20% + importance 15%. No Q-value in scoring until
Stage 4 proves it works. p50 latency < 200ms for top-10 results.
'
tickets:
- id: S2-01
title: Simplify scoring formula — remove Q-value weight
status: DONE
priority: P1
description: 'Current: vector 30% + BM25 10% + recency 15% + importance 15% +
Q 30%. New: vector 50% + BM25 15% + recency 20% + importance 15%. Q-value weight
= 0 until Stage 4. Keep Q infrastructure, just zero the weight.
'
tests:
- test_scoring_without_q_value
- test_scoring_weights_sum_to_1
done_at: '2026-04-09'
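The new weighting from S2-01 can be sketched as a plain function. Component scores are assumed to be normalized to [0, 1] before combining, and the way the Q weight is reintroduced later (proportional rescaling of the base weights here) is an assumption, not the project's confirmed approach:

```python
def combined_score(vector: float, bm25: float, recency: float, importance: float,
                   q_value: float = 0.0, q_weight: float = 0.0) -> float:
    """Blend component scores: vector 50% + BM25 15% + recency 20% + importance 15%.

    q_weight stays 0.0 until Stage 4; when raised, the base weights are
    scaled down proportionally so the total still sums to 1.
    """
    base = {"vector": 0.50, "bm25": 0.15, "recency": 0.20, "importance": 0.15}
    scale = 1.0 - q_weight  # shrink base weights to make room for Q
    score = (base["vector"] * vector + base["bm25"] * bm25
             + base["recency"] * recency + base["importance"] * importance) * scale
    return score + q_weight * q_value
```

With `q_weight=0.0` this reduces exactly to the four-component formula the ticket specifies, which is what `test_scoring_weights_sum_to_1` would check.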
- id: S2-02
title: Add conversation-aware search filters
status: DONE
priority: P1
description: 'Filter by: source (transcript/decision), role (user/assistant),
date range, project, session_id. All via Qdrant payload filters.
'
tests:
- test_search_filter_by_role
- test_search_filter_by_date_range
- test_search_filter_by_session
done_at: '2026-04-09'
- id: S2-03
title: Benchmark search quality on real queries
status: TODO
priority: P2
description: 'Create 20 test queries with expected results. Measure recall@10
and MRR. Baseline for future improvements.
'
- id: S2-04
title: Tune BM25 parameters
status: TODO
priority: P3
description: 'Current BM25 uses defaults. Test k1=1.2..2.0 and b=0.5..0.9 on the
benchmark set from S2-03.
'
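For reference when running the S2-04 sweep, this is the textbook BM25 term weight that k1 and b parameterize — a standard formulation (Robertson/Sparck Jones IDF), not the project's actual BM25 implementation:

```python
import math

def bm25_term_score(tf: float, df: int, n_docs: int, doc_len: float,
                    avg_doc_len: float, k1: float = 1.2, b: float = 0.75) -> float:
    """BM25 contribution of one query term for one document.

    k1 controls term-frequency saturation; b controls document-length
    normalization (b=0 disables it, b=1 fully normalizes).
    """
    idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
    norm = k1 * (1 - b + b * doc_len / avg_doc_len)
    return idf * tf * (k1 + 1) / (tf + norm)
```

Raising k1 toward 2.0 lets repeated terms keep contributing; raising b toward 0.9 penalizes long transcript messages more aggressively — which is why both ranges should be swept against the S2-03 benchmark rather than tuned by feel.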
stage_3_interface:
name: 'Hippocampus model: write everything, retrieve on demand'
status: DONE
definition_of_done: 'Write path: every session auto-ingested (SessionEnd hook).
Read path: /recall skill for on-demand retrieval. MCP: 3 core tools (search, add,
stats) + 2 reward (predict, outcome). No auto-injection on every message (UserPromptSubmit
removed).
'
tickets:
- id: S3-01
title: Reduce MCP tools to 5 (hippocampus model)
status: DONE
priority: P0
description: "NEW MODEL: Write everything automatically, retrieve on demand.\n\
Keep 5 tools:\n search_memory — core retrieval (used by /recall skill and hooks)\n\
\ add_memory — explicit memory capture (decisions, facts)\n memory_stats —\
\ system health check\n log_prediction — reward loop input\n log_outcome —\
\ reward loop output\n\nRemove 11 tools: explain_q, calibrate_experience_q,\
\ protect_memory, reload_q_cache, resolve_outcomes, experience_info, experience_insights,\
\ experience_top_memories, reflect, memory_reward_history, reward_detail.\n\
Also remove get_agent_context (dead).\n"
tests:
- test_mcp_lists_exactly_5_tools
- test_each_tool_responds
done_at: '2026-04-13'
- id: S3-02
title: Simplify SessionStart hook
status: DONE
priority: P1
description: 'Simplified to: search top-10 → format as additionalContext → return.
'
done_at: '2026-04-09'
- id: S3-03
title: Remove UserPromptSubmit hook (hippocampus model)
status: DONE
priority: P0
description: 'OLD: search top-5 on EVERY user message, inject as REMINDER. Problem:
noise, slow, fills context with low-relevance results.
NEW: No auto-recall per message. Retrieval is on-demand via /recall. SessionStart
still injects broad context at session start.
Action: remove UserPromptSubmit hook from settings.local.json. Keep the script
file for reference but deactivate the hook.
'
done_at: '2026-04-13'
- id: S3-04
title: Simplify SessionEnd hook
status: DONE
priority: P1
description: 'Two steps: (1) extract decisions, (2) ingest transcript. This is
the WRITE path — runs automatically on every session end.
'
done_at: '2026-04-09'
- id: S3-06
title: Create /recall skill — on-demand hippocampus retrieval
status: TODO
priority: P0
description: "The KEY new piece. A Claude Code skill that:\nUser says: /recall\
\ Acme contract Skill does:\n 1. search_memory(\"Acme contract\", limit=20)\n\
\ 2. Group results by session/date\n 3. Format as structured context with\
\ scores\n 4. Return to Claude for reasoning\n\nUser says: /recall --session\
\ abc123 Skill does: retrieve all messages from that session\nUser says: /recall\
\ --last-week pipeline decisions Skill does: search with date_from filter, type=decision\n\
SKILL.md frontmatter:\n name: recall\n description: Search hippocampus memory\
\ on demand\n user_invocable: true\n arguments: query text + optional flags\n\
\nImplementation: as a Claude Code skill with SKILL.md.\n"
- id: S3-07
title: Decide SessionStart hook fate (keep vs remove)
status: TODO
priority: P2
description: "With /recall available, do we still need SessionStart auto-injection?\n\
Arguments FOR keeping:\n - Gives baseline context without user asking\n -\
\ Cheap (one search at session start)\n\nArguments AGAINST:\n - May inject\
\ irrelevant context\n - /recall is more targeted\n\nDecision: keep for now\
\ but make it opt-out via .openexp.yaml. Revisit after /recall is used for 2\
\ weeks.\n"
stage_4_reward:
name: Working Q-learning loop
status: TODO
definition_of_done: 'ONE reward path works end-to-end: prediction → outcome → Q-value
update. Q-values actually change from defaults. Search results improve with accumulated
rewards.
'
tickets:
- id: S4-01
title: Implement prediction→outcome reward path
status: TODO
priority: P1
description: 'log_prediction stores prediction with memory_ids. log_outcome matches
prediction, computes reward delta, updates Q-values of linked memories. This
is the ONLY reward path in v2.
'
tests:
- test_prediction_logged_with_memory_ids
- test_outcome_updates_q_values
- test_prediction_without_outcome_no_change
- id: S4-02
title: Add Q-value weight back to scoring
status: DONE
priority: P1
description: 'Once predictions prove Q-values move meaningfully, add Q back to
scoring. Start with 10% weight, tune up.
'
depends_on: S4-01
tests:
- test_scoring_with_q_value_weight
done_at: '2026-04-13'
- id: S4-03
title: CRM outcome resolver (optional, if CRM still used)
status: TODO
priority: P3
description: 'Keep crm_csv resolver but as optional plugin. Only wire in if CRM
CSVs exist.
'
- id: S4-04
title: Q-value decay for stale memories
status: TODO
priority: P3
description: 'Memories not retrieved for 30+ days slowly decay toward 0. Prevents
permanently high Q from one lucky prediction.
'
tests:
- test_q_decay_after_30_days
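One way to express the S4-04 decay rule, assuming exponential decay after the 30-day grace period; the half-life value is an illustrative choice, not specified by the ticket:

```python
def decayed_q(q: float, days_since_retrieval: float,
              grace_days: float = 30.0, half_life_days: float = 30.0) -> float:
    """Decay a Q-value toward 0 once a memory goes unretrieved past the grace period."""
    idle = days_since_retrieval - grace_days
    if idle <= 0:
        return q  # retrieved recently: no decay
    # Halve the Q-value for every half_life_days of idle time past the grace period.
    return q * 0.5 ** (idle / half_life_days)
```

This keeps one lucky prediction from pinning a memory's Q permanently high: unless the memory keeps getting retrieved, its advantage erodes on its own.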
- id: S4-05
title: Reward dashboard / CLI report
status: TODO
priority: P3
description: 'CLI command: openexp stats --rewards. Shows total predictions, resolved
%, avg reward, and top-Q memories.
'
stage_5_experience_library:
name: Experience Library — structured experience from conversation data
status: DONE
definition_of_done: 'Full pipeline: chunk → topics → threads → experience labels → Qdrant.
269 experience labels across 35 threads. Searchable via search_memory(type="experience").
Skills /experience and /label-thread working.
'
done_at: '2026-04-14'
tickets:
- id: S5-01
title: Chunking pipeline
status: DONE
description: 'Fetch all transcripts from Qdrant, group by session, sort chronologically,
split into ~200K token chunks. Output: 18 chunks from 156 sessions.
'
done_at: '2026-04-13'
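The grouping step in S5-01 can be sketched as a greedy packer over chronologically sorted sessions. The ~4-chars-per-token estimate and the never-split-a-session rule are assumptions for illustration:

```python
def pack_sessions(sessions: list[tuple[str, str]],
                  budget_tokens: int = 200_000) -> list[list[str]]:
    """Greedily pack (session_id, text) pairs, already sorted chronologically,
    into chunks of roughly budget_tokens each (estimating ~4 chars per token)."""
    chunks: list[list[str]] = []
    current: list[str] = []
    used = 0
    for session_id, text in sessions:
        cost = max(1, len(text) // 4)  # crude token estimate
        if current and used + cost > budget_tokens:
            # Close the current chunk rather than splitting a session across two.
            chunks.append(current)
            current, used = [], 0
        current.append(session_id)
        used += cost
    if current:
        chunks.append(current)
    return chunks
```

Applied to the 156 sessions this yielded 18 chunks, each small enough for Opus to process in one pass during topic extraction (S5-02).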
- id: S5-02
title: Topic extraction per chunk
status: DONE
description: 'Opus extracts topics per chunk. 170 topics across 18 chunks.
'
done_at: '2026-04-13'
- id: S5-03
title: Thread grouping across chunks
status: DONE
description: 'Opus groups 170 topics into 36 work threads spanning multiple chunks.
'
done_at: '2026-04-14'
- id: S5-04
title: Experience labeling (pilot thread)
status: DONE
description: 'Validated the approach on thread #4 (pilot). 19 timeline events, 8
experience labels in context→actions→outcome format.
'
done_at: '2026-04-14'
- id: S5-05
title: add_experience() in Qdrant
status: DONE
description: 'Store experience labels in Qdrant with search-optimized embedding
(situation + insight + applies_when). type="experience", source="experience_library".
'
done_at: '2026-04-14'
- id: S5-06
title: Batch label all 36 threads
status: DONE
description: '269 unique experience labels across 35 threads (1 low_data skip).
All stored in Qdrant. Smoke tests pass for all 5 categories.
'
done_at: '2026-04-14'
- id: S5-07
title: /experience skill — retrieve past experience
status: DONE
description: 'Skill searches Qdrant for type="experience", formats advice.
'
done_at: '2026-04-14'
- id: S5-08
title: /label-thread skill — repeatable labeling
status: DONE
description: '7-step labeling process encoded as a skill. Tested on the Mercury thread.
'
done_at: '2026-04-14'
stage_6_next:
name: Experience Library — adoption and integration
status: TODO
tickets:
- id: S6-01
title: Auto-experience in SessionStart hook
status: TODO
priority: P1
description: 'Search type="experience" on each session start. Inject top 3 relevant
experiences into context alongside regular memories.
'
- id: S6-02
title: Experience compression via compresr.ai
status: TODO
priority: P2
description: 'Compress all 269 experience labels to fit in context window. Partnership
with external compression service.
'
- id: S6-03
title: LoRA training data export
status: TODO
priority: P3
description: 'Export experience labels as training pairs for LoRA fine-tuning.
Format: instruction (situation) → response (actions + reasoning).
'