MMR post-retrieval dedup for brain_search (R86 Q4) #242
Conversation
📝 Walkthrough
The pull request centralizes enrichment time-window defaults across CLI, MCP, and handler layers by introducing a shared, env-configurable `DEFAULT_REALTIME_ENRICH_SINCE_HOURS` constant.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant HybridSearch as hybrid_search()
    participant RRF as RRF Scoring
    participant MMR as MMR Reranking
    participant Embeddings as Embedding Loader
    participant Similarity as Cosine Similarity
    participant Output as Final Results
    HybridSearch->>HybridSearch: Fetch candidates (3x over-fetch)
    HybridSearch->>RRF: Combine semantic + FTS5 scores
    RRF->>HybridSearch: Return scored candidates
    HybridSearch->>MMR: Call _mmr_rerank_scored_results()
    MMR->>Embeddings: Load candidate embeddings
    Embeddings->>MMR: Return float embeddings
    MMR->>Similarity: Compute cosine similarity matrix
    Similarity->>MMR: Return similarity scores
    MMR->>MMR: Iteratively select diverse subset (MMR algorithm)
    MMR->>HybridSearch: Return reranked candidates
    HybridSearch->>Output: Truncate to n_results & return
```
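The "iteratively select diverse subset" step in the diagram is a classic greedy MMR loop. A minimal, self-contained sketch (illustrative only: `lam` stands in for the repo's `_MMR_LAMBDA`, and the normalized-relevance and zero-diagonal cosine inputs are assumptions mirroring the implementation under review):

```python
import numpy as np

def mmr_select(relevance: np.ndarray, cosine: np.ndarray, k: int, lam: float = 0.7) -> list[int]:
    """Greedy MMR: trade relevance against similarity to already-picked items.

    relevance: (n,) normalized relevance scores.
    cosine: (n, n) pairwise similarity with a zeroed diagonal.
    Returns the indices of up to k selected candidates, in selection order.
    """
    selected = [int(np.argmax(relevance))]  # seed with the most relevant item
    remaining = set(range(len(relevance))) - set(selected)
    while remaining and len(selected) < k:
        idx = sorted(remaining)
        # Penalty: each candidate's worst-case similarity to the picked set
        penalty = cosine[idx][:, selected].max(axis=1)
        scores = lam * relevance[idx] - (1.0 - lam) * penalty
        best = idx[int(np.argmax(scores))]
        selected.append(best)
        remaining.remove(best)
    return selected
```

With two near-duplicates and one distinct item, the second duplicate is deferred in favor of the distinct candidate, which is exactly the dedup behavior this PR targets.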
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~50 minutes
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (1 warning)
Please review this PR, especially commit 00e08f5 (Fix enrichment default since-hours for MCP/CLI). Root cause: both the MCP and CLI defaults used a 24-hour window, which capped candidates to ~40 rows vs the expected ~19K. This commit bumps the default since-hours to 8760 and makes it env-configurable.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: b983f953c5
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
```python
            remaining.remove(best_idx)

        reranked = [mmr_candidates[idx] for idx in selected]
        return reranked + fallback_candidates + tail_candidates
```
Preserve ranking for high-score chunks without embeddings
The new return order reranked + fallback_candidates + tail_candidates forces every chunk missing a float embedding to rank below all MMR-eligible candidates, even when its fused score is higher. In hybrid_search, this can hide top FTS-only results (e.g., deferred-embedding chunks with no chunk_vectors row) from the requested n_results, which is a retrieval-quality regression introduced by this reranker. Please keep fallback items in score-respecting order relative to the reranked set so missing-vector hits are not systematically suppressed.
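One score-respecting alternative to the criticized return order, sketched under the assumption that each candidate is a tuple whose first element is the fused score (this is a hypothetical helper, not the repo's code):

```python
def merge_preserving_scores(reranked, fallback, tail):
    # Stable sort: fallback (no-embedding) candidates re-enter at their
    # score-appropriate positions instead of being appended wholesale.
    # Caveat: across unequal scores this re-imposes score order over the
    # MMR picks; the MMR ordering survives only among equal scores.
    merged = sorted(reranked + fallback, key=lambda c: c[0], reverse=True)
    return merged + tail
```

A high-scoring FTS-only hit such as a deferred-embedding chunk would then stay visible within the requested `n_results` instead of being pushed behind every vector-backed candidate.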
Actionable comments posted: 4
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/brainlayer/cli/__init__.py`:
- Around line 27-29: Move the DEFAULT_REALTIME_ENRICH_SINCE_HOURS constant into
a single canonical module (e.g., enrichment_controller) and replace the three
duplicate definitions by importing that constant where needed; specifically,
define DEFAULT_REALTIME_ENRICH_SINCE_HOURS once in enrichment_controller (using
the same os.environ fallback logic) and in modules that currently declare it
(the modules containing DEFAULT_REALTIME_ENRICH_SINCE_HOURS in
mcp.enrich_handler, mcp.__init__, and cli.__init__) remove the local definition
and add from ..enrichment_controller import DEFAULT_REALTIME_ENRICH_SINCE_HOURS
(or the correct relative import) so all consumers use the single source of
truth.
In `@src/brainlayer/mcp/__init__.py`:
- Around line 61-63: Replace the duplicated constant definition
DEFAULT_REALTIME_ENRICH_SINCE_HOURS in __init__.py by importing it from
enrich_handler (which is already imported in this module) so there is a single
source of truth; after switching to the import, remove the local assignment and
also drop the now-unused import os at top of the file if nothing else uses it.
In `@src/brainlayer/search_repo.py`:
- Around line 155-188: The current return (reranked + fallback_candidates +
tail_candidates) always places all no-embedding candidates after all
vector-backed ones; instead merge reranked and fallback_candidates back together
and stable-sort by the original relevance score (candidate[0]) so fallback items
keep position relative to scores but the reranked items keep their MMR ordering
when scores are equal. Concretely, after computing reranked, form merged =
reranked + fallback_candidates, then do merged_sorted = sorted(merged,
key=lambda c: c[0], reverse=True) (stable sort preserves reranked/tie order),
and return merged_sorted + tail_candidates; references: _load_chunk_embeddings,
mmr_candidates, fallback_candidates, reranked, tail_candidates.
In `@tests/test_hybrid_search.py`:
- Around line 293-341: The test currently only calls the private helper
_mmr_rerank_scored_results, so a regression in the public wiring (hybrid_search)
could break behavior unnoticed; add a call to store.hybrid_search(...) using the
same query/context (e.g., "oauth token rotation") with parameters that trigger
MMR and n_results=3, then assert the returned result ids mirror the expectations
from the private test (top result is "dup-primary" and the top two are not both
the duplicate pair, and "distinct-relevant" appears in the top two) to ensure
the public hybrid_search path forwards the correct candidate pool into
_mmr_rerank_scored_results.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: b338ad88-c72b-431e-afc0-3fe1d7ee76d0
📒 Files selected for processing (5)
- src/brainlayer/cli/__init__.py
- src/brainlayer/mcp/__init__.py
- src/brainlayer/mcp/enrich_handler.py
- src/brainlayer/search_repo.py
- tests/test_hybrid_search.py
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
- GitHub Check: test (3.11)
- GitHub Check: test (3.12)
- GitHub Check: test (3.13)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
**/*.py: Flag risky DB or concurrency changes explicitly and do not hand-wave lock behavior
Enforce one-write-at-a-time concurrency constraint; reads are safe but brain_digest is write-heavy and must not run in parallel with other MCP work
Run pytest before claiming behavior changed safely; current test suite has 929 tests
**/*.py: Use `paths.py:get_db_path()` for all database path resolution; all scripts and CLI must use this function rather than hardcoding paths
When performing bulk database operations: stop enrichment workers first, checkpoint WAL before and after, drop FTS triggers before bulk deletes, batch deletes in 5-10K chunks, and checkpoint every 3 batches
Files:
- src/brainlayer/mcp/enrich_handler.py
- tests/test_hybrid_search.py
- src/brainlayer/search_repo.py
- src/brainlayer/cli/__init__.py
- src/brainlayer/mcp/__init__.py
src/brainlayer/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
src/brainlayer/**/*.py: Use retry logic on `SQLITE_BUSY` errors; each worker must use its own database connection to handle concurrency safely
Classification must preserve `ai_code`, `stack_trace`, and `user_message` verbatim; skip `noise` entries entirely and summarize `build_log` and `dir_listing` entries (structure only)
Use AST-aware chunking via tree-sitter; never split stack traces; mask large tool output
For enrichment backend selection: use Groq as primary backend (cloud, configured in launchd plist), Gemini as fallback via `enrichment_controller.py`, and Ollama as offline last-resort; allow override via `BRAINLAYER_ENRICH_BACKEND` env var
Configure enrichment rate via `BRAINLAYER_ENRICH_RATE` environment variable (default 0.2 = 12 RPM)
Implement chunk lifecycle columns `superseded_by`, `aggregated_into`, `archived_at` on the chunks table; exclude lifecycle-managed chunks from default search; allow `include_archived=True` to show history
Implement `brain_supersede` with safety gate for personal data (journals, notes, health/finance); use soft-delete for `brain_archive` with timestamp
Add `supersedes` parameter to `brain_store` for atomic store-and-replace operations
Run linting and formatting with: `ruff check src/ && ruff format src/`
Run tests with `pytest`
Use `PRAGMA wal_checkpoint(FULL)` before and after bulk database operations to prevent WAL bloat
Files:
- src/brainlayer/mcp/enrich_handler.py
- src/brainlayer/search_repo.py
- src/brainlayer/cli/__init__.py
- src/brainlayer/mcp/__init__.py
🧠 Learnings (17)
📓 Common learnings
Learnt from: CR
Repo: EtanHey/brainlayer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-04-02T23:32:14.543Z
Learning: Applies to src/brainlayer/*enrichment*.py : Enrichment rate configurable via `BRAINLAYER_ENRICH_RATE` environment variable (default 0.2 = 12 RPM)
📚 Learning: 2026-04-02T23:32:14.543Z
Learnt from: CR
Repo: EtanHey/brainlayer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-04-02T23:32:14.543Z
Learning: Applies to src/brainlayer/*enrichment*.py : Enrichment rate configurable via `BRAINLAYER_ENRICH_RATE` environment variable (default 0.2 = 12 RPM)
Applied to files:
- src/brainlayer/mcp/enrich_handler.py
- src/brainlayer/cli/__init__.py
- src/brainlayer/mcp/__init__.py
📚 Learning: 2026-04-06T08:40:13.531Z
Learnt from: CR
Repo: EtanHey/brainlayer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-04-06T08:40:13.531Z
Learning: Applies to src/brainlayer/**/*.py : Configure enrichment rate via `BRAINLAYER_ENRICH_RATE` environment variable (default 0.2 = 12 RPM)
Applied to files:
- src/brainlayer/mcp/enrich_handler.py
- src/brainlayer/cli/__init__.py
- src/brainlayer/mcp/__init__.py
📚 Learning: 2026-04-11T16:54:45.631Z
Learnt from: EtanHey
Repo: EtanHey/brainlayer PR: 0
File: :0-0
Timestamp: 2026-04-11T16:54:45.631Z
Learning: Applies to `src/brainlayer/enrichment_controller.py` and `src/brainlayer/pipeline/rate_limiter.py`: Gemini API calls in the enrichment pipeline are gated by a token bucket rate limiter. The rate is controlled by `BRAINLAYER_ENRICH_RATE` (default `5/s`, burst `10`) to keep throughput inside the Gemini Flex intended envelope. This default supersedes the earlier 0.2 (12 RPM) default for the Gemini Flex integration path.
Applied to files:
- src/brainlayer/mcp/enrich_handler.py
- src/brainlayer/cli/__init__.py
- src/brainlayer/mcp/__init__.py
📚 Learning: 2026-03-22T15:55:22.017Z
Learnt from: EtanHey
Repo: EtanHey/brainlayer PR: 100
File: src/brainlayer/enrichment_controller.py:175-199
Timestamp: 2026-03-22T15:55:22.017Z
Learning: In `src/brainlayer/enrichment_controller.py`, the `parallel` parameter in `enrich_local()` is intentionally kept in the function signature (currently unused, suppressed with `# noqa: ARG001`) for API stability. Parallel local enrichment via a thread pool or process pool is planned for a future iteration. Do not flag this as dead code requiring removal.
Applied to files:
src/brainlayer/mcp/enrich_handler.py
📚 Learning: 2026-04-11T23:47:49.756Z
Learnt from: EtanHey
Repo: EtanHey/brainlayer PR: 0
File: :0-0
Timestamp: 2026-04-11T23:47:49.756Z
Learning: Applies to `src/brainlayer/enrichment_controller.py`: `service_tier="flex"` is the intentional default for all Gemini enrichment calls. Pass-2 enrichment is asynchronous backlog work where 1–15 minute latency is acceptable, and the 50% Gemini Flex Inference discount materially reduces backlog cost. This is locked by R84b design (§8 Q2). The `BRAINLAYER_GEMINI_SERVICE_TIER` environment variable is purely an operational escape hatch (e.g. `standard`), not the intended runtime default. Do not flag `service_tier="flex"` as a concern on this code path.
Applied to files:
- src/brainlayer/mcp/enrich_handler.py
- src/brainlayer/cli/__init__.py
📚 Learning: 2026-04-04T23:24:03.159Z
Learnt from: CR
Repo: EtanHey/brainlayer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-04-04T23:24:03.159Z
Learning: Applies to src/brainlayer/{vector_store,search}*.py : Chunk lifecycle: implement columns `superseded_by`, `aggregated_into`, `archived_at` on chunks table; exclude lifecycle-managed chunks from default search
Applied to files:
src/brainlayer/search_repo.py
📚 Learning: 2026-04-06T08:40:13.531Z
Learnt from: CR
Repo: EtanHey/brainlayer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-04-06T08:40:13.531Z
Learning: Applies to src/brainlayer/**/*.py : Implement chunk lifecycle columns: `superseded_by`, `aggregated_into`, `archived_at` on chunks table; exclude lifecycle-managed chunks from default search; allow `include_archived=True` to show history
Applied to files:
src/brainlayer/search_repo.py
📚 Learning: 2026-04-04T23:49:56.485Z
Learnt from: CR
Repo: EtanHey/brainlayer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-04-04T23:49:56.485Z
Learning: Applies to src/brainlayer/vector_store.py : Use sqlite-vec with APSW for vector storage and retrieval
Applied to files:
src/brainlayer/search_repo.py
📚 Learning: 2026-04-04T15:22:02.740Z
Learnt from: EtanHey
Repo: EtanHey/brainlayer PR: 198
File: hooks/brainlayer-prompt-search.py:241-259
Timestamp: 2026-04-04T15:22:02.740Z
Learning: In `hooks/brainlayer-prompt-search.py` (Python), `record_injection_event()` is explicitly best-effort telemetry: silent `except sqlite3.Error: pass` is intentional — table non-existence or lock failures are acceptable silent failures. `sqlite3.connect(timeout=2)` is the file-open timeout; `PRAGMA busy_timeout` governs per-statement lock-wait. The `DEADLINE_MS` (450ms) guard applies only to the FTS search phase, not to this side-channel write.
Applied to files:
- src/brainlayer/search_repo.py
- src/brainlayer/mcp/__init__.py
📚 Learning: 2026-04-06T00:22:38.541Z
Learnt from: CR
Repo: EtanHey/brainlayer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-04-06T00:22:38.541Z
Learning: Applies to src/**/*.py : Store and resolve database path using `paths.py:get_db_path()` — supports environment variable override or defaults to canonical path `~/.local/share/brainlayer/brainlayer.db`
Applied to files:
src/brainlayer/cli/__init__.py
📚 Learning: 2026-04-03T11:34:19.303Z
Learnt from: CR
Repo: EtanHey/brainlayer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-04-03T11:34:19.303Z
Learning: Applies to src/brainlayer/cli.py : Use Typer CLI framework for command-line interface in `src/brainlayer/`
Applied to files:
src/brainlayer/cli/__init__.py
📚 Learning: 2026-04-06T08:40:13.531Z
Learnt from: CR
Repo: EtanHey/brainlayer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-04-06T08:40:13.531Z
Learning: Applies to src/brainlayer/**/*.py : For enrichment backend selection: use Groq as primary backend (cloud, configured in launchd plist), Gemini as fallback via `enrichment_controller.py`, and Ollama as offline last-resort; allow override via `BRAINLAYER_ENRICH_BACKEND` env var
Applied to files:
- src/brainlayer/cli/__init__.py
- src/brainlayer/mcp/__init__.py
📚 Learning: 2026-04-01T01:24:44.281Z
Learnt from: CR
Repo: EtanHey/brainlayer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-04-01T01:24:44.281Z
Learning: Applies to src/brainlayer/*enrichment*.py : Enrichment backend priority: Groq (primary/cloud) → Gemini (fallback) → Ollama (offline last-resort), configurable via `BRAINLAYER_ENRICH_BACKEND` environment variable
Applied to files:
- src/brainlayer/cli/__init__.py
- src/brainlayer/mcp/__init__.py
📚 Learning: 2026-04-01T01:24:44.281Z
Learnt from: CR
Repo: EtanHey/brainlayer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-04-01T01:24:44.281Z
Learning: Applies to src/brainlayer/mcp/*.py : MCP tools include: brain_search, brain_store, brain_recall, brain_entity, brain_expand, brain_update, brain_digest, brain_get_person, brain_tags, brain_supersede, brain_archive (legacy brainlayer_* aliases still supported)
Applied to files:
src/brainlayer/mcp/__init__.py
📚 Learning: 2026-03-17T01:04:22.497Z
Learnt from: EtanHey
Repo: EtanHey/brainlayer PR: 0
File: :0-0
Timestamp: 2026-03-17T01:04:22.497Z
Learning: Applies to src/brainlayer/mcp/**/*.py and brain-bar/Sources/BrainBar/MCPRouter.swift: The 8 required MCP tools are `brain_search`, `brain_store`, `brain_recall`, `brain_entity`, `brain_expand`, `brain_update`, `brain_digest`, `brain_tags`. `brain_tags` is the 8th tool, replacing `brain_get_person`, as defined in the Phase B spec merged in PR `#72`. The Python MCP server already implements `brain_tags`. Legacy `brainlayer_*` aliases must be maintained for backward compatibility.
Applied to files:
src/brainlayer/mcp/__init__.py
📚 Learning: 2026-03-14T02:20:54.656Z
Learnt from: CR
Repo: EtanHey/brainlayer PR: 0
File: AGENTS.md:0-0
Timestamp: 2026-03-14T02:20:54.656Z
Learning: Applies to **/*.py : Enforce one-write-at-a-time concurrency constraint; reads are safe but brain_digest is write-heavy and must not run in parallel with other MCP work
Applied to files:
src/brainlayer/mcp/__init__.py
🔇 Additional comments (6)
src/brainlayer/search_repo.py (1)
15-15: NumPy is already declared as a runtime dependency in `pyproject.toml` (line 34: `"numpy>=1.22,<3.0"`), so the import on line 15 does not create an import-time failure.
> Likely an incorrect or invalid review comment.

src/brainlayer/mcp/enrich_handler.py (2)
5-5: LGTM! Well-structured environment-configurable default. The constant is cleanly defined and makes the 8760-hour (1-year) default configurable via `BRAINLAYER_DEFAULT_ENRICH_SINCE_HOURS`. This is consistent with the project's pattern for environment-based configuration. Also applies to: 12-14
19-26: LGTM! Default correctly applied to the handler function. The `since_hours` parameter default is now sourced from the module-level constant, ensuring consistency between the handler, MCP schema, and CLI.

src/brainlayer/mcp/__init__.py (2)
1050-1058: LGTM! Tool schema correctly updated with dynamic default and descriptive help text. The schema now reflects the configurable default value, and the description includes the actual default for user clarity.
1347-1355: LGTM! Call routing fallback correctly uses the constant. The `brain_enrich` handler correctly falls back to `DEFAULT_REALTIME_ENRICH_SINCE_HOURS` when `since_hours` is not provided in arguments.

src/brainlayer/cli/__init__.py (1)
885-896: LGTM! CLI option correctly updated with dynamic default. The `--since-hours` option now uses the configurable constant for its default, and the help text clearly displays the current default value.
```python
DEFAULT_REALTIME_ENRICH_SINCE_HOURS = int(
    os.environ.get("BRAINLAYER_DEFAULT_ENRICH_SINCE_HOURS", "8760")
)
```
🧹 Nitpick | 🔵 Trivial
Centralize DEFAULT_REALTIME_ENRICH_SINCE_HOURS to avoid triple definition.
This constant is now defined identically in three files:
- src/brainlayer/mcp/enrich_handler.py (lines 12-14)
- src/brainlayer/mcp/__init__.py (lines 61-63)
- src/brainlayer/cli/__init__.py (lines 27-29)
Consider defining it once in a shared location (e.g., enrichment_controller.py since it's already the canonical location for enrichment logic, or a dedicated constants module) and importing it where needed. This ensures consistency if the default or env var name ever changes.
♻️ Proposed approach
1. Add the constant to `enrichment_controller.py`:

```python
DEFAULT_REALTIME_ENRICH_SINCE_HOURS = int(
    os.environ.get("BRAINLAYER_DEFAULT_ENRICH_SINCE_HOURS", "8760")
)
```

2. Import in all consuming modules:

```python
from ..enrichment_controller import DEFAULT_REALTIME_ENRICH_SINCE_HOURS
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/brainlayer/cli/__init__.py` around lines 27 - 29, Move the
DEFAULT_REALTIME_ENRICH_SINCE_HOURS constant into a single canonical module
(e.g., enrichment_controller) and replace the three duplicate definitions by
importing that constant where needed; specifically, define
DEFAULT_REALTIME_ENRICH_SINCE_HOURS once in enrichment_controller (using the
same os.environ fallback logic) and in modules that currently declare it (the
modules containing DEFAULT_REALTIME_ENRICH_SINCE_HOURS in mcp.enrich_handler,
mcp.__init__, and cli.__init__) remove the local definition and add from
..enrichment_controller import DEFAULT_REALTIME_ENRICH_SINCE_HOURS (or the
correct relative import) so all consumers use the single source of truth.
```python
DEFAULT_REALTIME_ENRICH_SINCE_HOURS = int(
    os.environ.get("BRAINLAYER_DEFAULT_ENRICH_SINCE_HOURS", "8760")
)
```
🧹 Nitpick | 🔵 Trivial
Consider importing the constant from enrich_handler instead of redefining it.
This constant is already defined identically in enrich_handler.py (line 12-14), which this module already imports from (line 38). Importing it would eliminate duplication and ensure a single source of truth.
♻️ Proposed refactor

```diff
-DEFAULT_REALTIME_ENRICH_SINCE_HOURS = int(
-    os.environ.get("BRAINLAYER_DEFAULT_ENRICH_SINCE_HOURS", "8760")
-)
+from .enrich_handler import DEFAULT_REALTIME_ENRICH_SINCE_HOURS
```

And remove the `import os` addition on line 5 if it's no longer needed elsewhere in this file.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```diff
-DEFAULT_REALTIME_ENRICH_SINCE_HOURS = int(
-    os.environ.get("BRAINLAYER_DEFAULT_ENRICH_SINCE_HOURS", "8760")
-)
+from .enrich_handler import DEFAULT_REALTIME_ENRICH_SINCE_HOURS
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/brainlayer/mcp/__init__.py` around lines 61 - 63, Replace the duplicated
constant definition DEFAULT_REALTIME_ENRICH_SINCE_HOURS in __init__.py by
importing it from enrich_handler (which is already imported in this module) so
there is a single source of truth; after switching to the import, remove the
local assignment and also drop the now-unused import os at top of the file if
nothing else uses it.
```python
embeddings_by_id = self._load_chunk_embeddings([candidate[1] for candidate in top_candidates])
mmr_candidates = [candidate for candidate in top_candidates if candidate[1] in embeddings_by_id]
if len(mmr_candidates) < 2:
    return scored

fallback_candidates = [candidate for candidate in top_candidates if candidate[1] not in embeddings_by_id]
relevance = np.array([candidate[0] for candidate in mmr_candidates], dtype=np.float32)
rel_max = float(relevance.max())
if rel_max > 0.0:
    normalized_relevance = relevance / rel_max
else:
    normalized_relevance = np.ones_like(relevance)

matrix = np.stack([embeddings_by_id[candidate[1]] for candidate in mmr_candidates]).astype(np.float32, copy=False)
norms = np.linalg.norm(matrix, axis=1, keepdims=True)
norms = np.maximum(norms, 1e-12)
cosine = np.clip((matrix / norms) @ (matrix / norms).T, -1.0, 1.0)
np.fill_diagonal(cosine, 0.0)

selected: list[int] = [int(np.argmax(normalized_relevance))]
remaining = set(range(len(mmr_candidates))) - set(selected)

while remaining:
    remaining_indices = sorted(remaining)
    diversity_penalty = cosine[remaining_indices][:, selected].max(axis=1)
    mmr_scores = (_MMR_LAMBDA * normalized_relevance[remaining_indices]) - (
        (1.0 - _MMR_LAMBDA) * diversity_penalty
    )
    best_idx = remaining_indices[int(np.argmax(mmr_scores))]
    selected.append(best_idx)
    remaining.remove(best_idx)

reranked = [mmr_candidates[idx] for idx in selected]
return reranked + fallback_candidates + tail_candidates
```
Preserve score order for candidates without float vectors.
`return reranked + fallback_candidates + tail_candidates` demotes every no-embedding candidate behind all embedding-backed ones. A result that was originally ranked near the top can disappear from the visible `n_results` solely because `_load_chunk_embeddings()` missed its vector.
💡 Proposed fix

```diff
 embeddings_by_id = self._load_chunk_embeddings([candidate[1] for candidate in top_candidates])
 mmr_candidates = [candidate for candidate in top_candidates if candidate[1] in embeddings_by_id]
 if len(mmr_candidates) < 2:
     return scored
-fallback_candidates = [candidate for candidate in top_candidates if candidate[1] not in embeddings_by_id]
+embedding_backed_ids = {candidate[1] for candidate in mmr_candidates}
 relevance = np.array([candidate[0] for candidate in mmr_candidates], dtype=np.float32)
 rel_max = float(relevance.max())
 if rel_max > 0.0:
     normalized_relevance = relevance / rel_max
 else:
@@
-reranked = [mmr_candidates[idx] for idx in selected]
-return reranked + fallback_candidates + tail_candidates
+reranked_iter = iter(mmr_candidates[idx] for idx in selected)
+reranked_top = [
+    next(reranked_iter) if candidate[1] in embedding_backed_ids else candidate
+    for candidate in top_candidates
+]
+return reranked_top + tail_candidates
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/brainlayer/search_repo.py` around lines 155 - 188, The current return
(reranked + fallback_candidates + tail_candidates) always places all
no-embedding candidates after all vector-backed ones; instead merge reranked and
fallback_candidates back together and stable-sort by the original relevance
score (candidate[0]) so fallback items keep position relative to scores but the
reranked items keep their MMR ordering when scores are equal. Concretely, after
computing reranked, form merged = reranked + fallback_candidates, then do
merged_sorted = sorted(merged, key=lambda c: c[0], reverse=True) (stable sort
preserves reranked/tie order), and return merged_sorted + tail_candidates;
references: _load_chunk_embeddings, mmr_candidates, fallback_candidates,
reranked, tail_candidates.
```python
def test_mmr_rerank_dedupes_near_duplicates(self, store):
    def embedding(primary: float, secondary: float = 0.0) -> list[float]:
        vector = [0.0] * 1024
        vector[0] = primary
        vector[1] = secondary
        return vector

    _insert_chunk(
        store,
        chunk_id="dup-primary",
        content="oauth token rotation incident rollback and session repair",
        embedding=embedding(1.0, 0.0),
        importance=6.0,
    )
    _insert_chunk(
        store,
        chunk_id="dup-secondary",
        content="oauth token rotation incident rollback and session repair duplicate notes",
        embedding=embedding(0.999, 0.001),
        importance=6.0,
    )
    _insert_chunk(
        store,
        chunk_id="distinct-relevant",
        content="oauth token rotation migration checklist and recovery guide",
        embedding=embedding(0.72, 0.69),
        importance=5.5,
    )
    _insert_chunk(
        store,
        chunk_id="distinct-supporting",
        content="oauth token rotation audit trail and operator runbook",
        embedding=embedding(0.65, 0.75),
        importance=5.0,
    )

    scored = [
        (0.99, "dup-primary", "oauth token rotation incident rollback and session repair", {}, 0.01),
        (0.98, "dup-secondary", "oauth token rotation incident rollback and session repair duplicate notes", {}, 0.02),
        (0.94, "distinct-relevant", "oauth token rotation migration checklist and recovery guide", {}, 0.12),
        (0.9, "distinct-supporting", "oauth token rotation audit trail and operator runbook", {}, 0.15),
    ]

    reranked = store._mmr_rerank_scored_results(scored, n_results=3)
    ids = [item[1] for item in reranked[:3]]

    assert ids[0] == "dup-primary", ids
    assert "distinct-relevant" in ids[:2], ids
    assert set(ids[:2]) != {"dup-primary", "dup-secondary"}, ids
```
🧹 Nitpick | 🔵 Trivial
Cover the public hybrid_search() path as well.
This regression test only exercises the private helper, so it will still pass if the `hybrid_search()` wiring regresses and never feeds the expected candidate pool into MMR. Please add one assertion through `store.hybrid_search(...)` to protect the behavior that actually changed.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tests/test_hybrid_search.py` around lines 293 - 341, The test currently only
calls the private helper _mmr_rerank_scored_results, so a regression in the
public wiring (hybrid_search) could break behavior unnoticed; add a call to
store.hybrid_search(...) using the same query/context (e.g., "oauth token
rotation") with parameters that trigger MMR and n_results=3, then assert the
returned result ids mirror the expectations from the private test (top result is
"dup-primary" and the top two are not both the duplicate pair, and
"distinct-relevant" appears in the top two) to ensure the public hybrid_search
path forwards the correct candidate pool into _mmr_rerank_scored_results.
```python
from ..paths import get_db_path

DEFAULT_REALTIME_ENRICH_SINCE_HOURS = int(os.environ.get("BRAINLAYER_DEFAULT_ENRICH_SINCE_HOURS", "8760"))
```
🟢 Low: `cli/__init__.py:27`
At module import time, `int(os.environ.get("BRAINLAYER_DEFAULT_ENRICH_SINCE_HOURS", "8760"))` throws `ValueError` when the environment variable contains a non-numeric string. Since this happens during CLI initialization, the error appears as an internal traceback rather than a user-friendly configuration message. Consider validating the value and raising a clear `typer.BadParameter` or `SystemExit` with a descriptive message.
```diff
+_raw_since_hours = os.environ.get("BRAINLAYER_DEFAULT_ENRICH_SINCE_HOURS", "8760")
+try:
+    DEFAULT_REALTIME_ENRICH_SINCE_HOURS = int(_raw_since_hours)
+except ValueError:
+    console.print(f"[bold red]Error:[/] BRAINLAYER_DEFAULT_ENRICH_SINCE_HOURS must be an integer, got: {_raw_since_hours!r}")
+    raise typer.Exit(1)
+del _raw_since_hours
```

Also found in 2 other location(s):
- src/brainlayer/mcp/__init__.py:61 -- Same issue: the `int()` conversion of `BRAINLAYER_DEFAULT_ENRICH_SINCE_HOURS` at module import time will raise an unhelpful `ValueError` if the environment variable is set to a non-numeric string, crashing the MCP server initialization.
- src/brainlayer/mcp/enrich_handler.py:12 -- If the `BRAINLAYER_DEFAULT_ENRICH_SINCE_HOURS` environment variable is set to a non-integer string (e.g., `"abc"` or an empty string `""`), `int()` will raise an unhandled `ValueError` at module import time, crashing the application with no helpful error message about the misconfiguration.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: d545514c00
```python
# 1. Semantic search leg — prefer binary vectors, fall back to float vectors
# when the binary index is unavailable (for example readonly live DBs).
candidate_fetch_count = max(n_results * 3, _MMR_CANDIDATE_LIMIT)
```
Avoid multiplying the new candidate floor into 500-KNN scans
`hybrid_search` now forces `candidate_fetch_count` to at least 50, but the binary path still applies its existing filtered-query overfetch (`effective_k = n_results * 10` in `_binary_search`). For `entity_id`/`source_filter_like`/correction-filtered searches, this turns the intended 50-candidate pool into 500-vector KNN probes, which is a significant latency regression and can trigger MCP timeouts on larger DBs. The MMR floor should be decoupled from the binary leg's internal overfetch multiplier.
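One way to decouple the two knobs, sketched under the assumption that the binary leg can take an explicit k (the names `plan_fetch_counts` and `overfetch` are hypothetical, not the repo's actual API):

```python
_MMR_CANDIDATE_LIMIT = 50  # MMR pool floor introduced by this PR

def plan_fetch_counts(n_results: int, overfetch: int = 10) -> tuple[int, int]:
    """Size the MMR pool and the binary leg's filtered-query KNN probe
    independently: the overfetch multiplier applies to the caller's
    n_results, not to the floored MMR pool, so a request for 5 results
    probes max(5 * 10, 50) = 50 vectors instead of 50 * 10 = 500."""
    mmr_pool = max(n_results * 3, _MMR_CANDIDATE_LIMIT)  # candidates MMR re-ranks
    binary_k = max(n_results * overfetch, mmr_pool)      # KNN probe size
    return mmr_pool, binary_k
```

With `n_results=5` this yields a 50-candidate pool and a 50-vector probe, whereas passing the floored count through the binary leg's internal multiplier yields 500.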
```python
from ..paths import get_db_path


DEFAULT_REALTIME_ENRICH_SINCE_HOURS = int(os.environ.get("BRAINLAYER_DEFAULT_ENRICH_SINCE_HOURS", "8760"))
```
Guard env-based enrich default parsing at import time
This import-time `int(...)` cast will raise `ValueError` if `BRAINLAYER_DEFAULT_ENRICH_SINCE_HOURS` is set to a non-integer string, which prevents the CLI module from loading at all (including unrelated commands). Because this is user-controlled environment input, failing hard during module import is brittle; parse defensively (or defer validation to command execution) so one bad env value does not brick the entire CLI.
Design Summary
- MMR re-ranking added inside `SearchMixin.hybrid_search`.
- Relevance/diversity trade-off `lambda=0.65`.
- Uses `chunk_vectors` for the similarity matrix so near-duplicate chunks lose top slots without changing existing filters or MCP response formatting.

Test Output
Note
Add MMR post-retrieval deduplication to `hybrid_search` in `SearchMixin`:

- Adds `_mmr_rerank_scored_results` to `search_repo.py`, implementing Maximal Marginal Relevance re-ranking that balances relevance and diversity using cosine similarity over chunk embeddings (`_MMR_LAMBDA = 0.65`).
- Adds a `_load_chunk_embeddings` helper to fetch float32 embeddings from the `chunk_vectors` table for MMR input.
- `hybrid_search` now fetches a minimum of 50 candidates per leg (`_MMR_CANDIDATE_LIMIT`) instead of only `n_results * 3`, increasing the pool available for re-ranking.
- Raises the default `since_hours` for `brain_enrich` across the CLI, MCP tool schema, and handler from 24 to 8760 (1 year), configurable via `BRAINLAYER_DEFAULT_ENRICH_SINCE_HOURS`.
- `hybrid_search` result ordering will change for callers that relied on pre-MMR ranking; the minimum candidate fetch size also increases, which may affect latency.

📊 Macroscope summarized d545514. 5 files reviewed, 4 issues evaluated, 2 issues filtered, 1 comment posted
🗂️ Filtered Issues

- `src/brainlayer/mcp/__init__.py` — 0 comments posted, 1 evaluated, 1 filtered
  - The `int()` conversion of `BRAINLAYER_DEFAULT_ENRICH_SINCE_HOURS` at module import time will raise an unhelpful `ValueError` if the environment variable is set to a non-numeric string, crashing the MCP server initialization. [Cross-file consolidated]
- `src/brainlayer/mcp/enrich_handler.py` — 0 comments posted, 1 evaluated, 1 filtered
  - If the `BRAINLAYER_DEFAULT_ENRICH_SINCE_HOURS` environment variable is set to a non-integer string (e.g., `"abc"` or an empty string `""`), `int()` will raise an unhandled `ValueError` at module import time, crashing the application with no helpful error message about the misconfiguration. [Cross-file consolidated]

Summary by CodeRabbit
New Features
Configuration & Enhancements
- Enrichment time-window defaults are now configurable via the `BRAINLAYER_DEFAULT_ENRICH_SINCE_HOURS` environment variable (defaults to 8760 hours, approximately one year).