# Rush Monodex

Semantic search indexer for Rush monorepos using the Qdrant vector database.
## Overview
monodex is a CLI tool that indexes Rush monorepo source code and documentation into a Qdrant vector database for scalable semantic search. It supports label-based indexing, allowing you to maintain multiple queryable snapshots (branches, commits) within a single catalog.
See CHANGELOG.md for release history.
## Features
- Label-based indexing: Maintain multiple queryable filesets (branches, commits) within a catalog
- Commit-based crawling: Reads directly from Git objects, not working tree (deterministic, reproducible)
- AST-based chunking: Tree-sitter powered intelligent splitting for TypeScript/TSX files
- Breadcrumb context: Full symbol paths like `@rushstack/node-core-library:JsonFile.ts:JsonFile.load`
- Oversized chunk handling: Functions split at natural AST boundaries (statement blocks, if/else, try/catch)
- Local embeddings: Uses jina-embeddings-v2-base-code with ONNX Runtime (no external APIs)
- Qdrant integration: Direct batch uploads to Qdrant vector database
- Incremental sync: Content-hash based change detection for fast re-indexing
- Intelligent deduplication: Identical content at same path across labels shares chunks
- Rush-optimized: Smart exclusion rules for Rush monorepo patterns
## Agent Usage Guide
This tool is designed for AI assistants. The indexed database provides a complete, internally consistent snapshot of the codebase as it existed at crawl time — independent of any local file changes, branches, or whether the repo is even cloned. This makes it more than a replacement for grep; it can be the primary way an agent learns about a codebase.
Typical workflow:

- Set default context (optional but recommended)
- Start with semantic search to find relevant code
- View full chunks using the `file_id:chunk_ordinal` from search results
- Get surrounding context by viewing adjacent chunks
- Reconstruct entire files by viewing all chunks
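Put together, the workflow above might look like this. The `use` command and the `file_id:chunk_ordinal` format are documented below; the `search` and `view` command names and the chunk id `a1b2c3d4` are illustrative assumptions:

```shell
# Set the default catalog and label
monodex use rushstack:main

# Semantic search for relevant code
monodex search "where is JSON schema validation implemented"

# View one chunk from a result, then its neighbors, then the whole file
monodex view a1b2c3d4:2
monodex view a1b2c3d4:1-3
monodex view a1b2c3d4
```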
Output format: Search results prefix code lines with `>`, making them easy to distinguish from your own output and preventing injection attacks.
## Prerequisites

- Rust: 1.91+ (for edition 2024)
- Qdrant: Vector database running on `localhost:6333` (only needed for crawling/searching). Installation instructions can be found in the Qdrant Quickstart documentation.
- Model: `jina-embeddings-v2-base-code` (auto-downloaded from HuggingFace to `models/` on first use)
## Installation

### From crates.io
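Assuming the published crate name matches the binary name `monodex`:

```shell
cargo install monodex
```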
### Build from Source

```shell
cargo build --release
# Binary will be at ./target/release/monodex
```
## Qdrant Setup
Create the collection before first use:
Verify the collection exists:
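A sketch using Qdrant's REST API with the default endpoint; replace `monodex` with the collection name from your config:

```shell
# Create the collection (768-dim vectors, cosine distance)
curl -X PUT "http://localhost:6333/collections/monodex" \
  -H "Content-Type: application/json" \
  -d '{"vectors": {"size": 768, "distance": "Cosine"}}'

# Verify the collection exists
curl "http://localhost:6333/collections/monodex"
```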
The collection uses:
- 768 dimensions (jina-embeddings-v2-base-code output size)
- Cosine distance (best for semantic similarity)
## Usage

### Global Options

```shell
# Use a custom config file location (flag name assumed)
monodex --config /path/to/config.json <command>

# Enable verbose debug logging for network requests
monodex --debug <command>

# Show help for any command
monodex <command> --help

# Show version
monodex --version
```
### Debug Mode

The `--debug` flag enables verbose logging for troubleshooting:
- Logs HTTP request/response details for Qdrant API calls
- Shows batch sizes and payload sizes during uploads
- Useful for diagnosing connectivity or payload issues
Example:
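For example (the crawl arguments shown are illustrative):

```shell
monodex --debug crawl rushstack --label main --commit HEAD
```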
## Configuration

Create `~/.config/monodex/config.json`:
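A minimal example (the catalog name and paths are illustrative; field meanings are described in the table below):

```shell
mkdir -p ~/.config/monodex
cat > ~/.config/monodex/config.json <<'EOF'
{
  "qdrant": {
    "url": "http://localhost:6333",
    "collection": "monodex"
  },
  "catalogs": {
    "sparo": {
      "type": "monorepo",
      "path": "/home/user/repos/sparo"
    }
  }
}
EOF
```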
Note: We use the Sparo monorepo for development testing, since it's a small open-source Rush monorepo.
Fields:

| Field | Required | Description |
|---|---|---|
| `qdrant.url` | No | Qdrant server URL (default: `http://localhost:6333`) |
| `qdrant.collection` | Yes | Qdrant collection name |
| `qdrant.maxUploadBytes` | No | Max upload payload size in bytes (default: 30MB) |
| `catalogs.<name>.type` | Yes | Catalog type: `"monorepo"` |
| `catalogs.<name>.path` | Yes | Absolute path to the repository root |
Catalog types:

- `monorepo`: Walks upward to find the nearest `package.json` for package name resolution. Breadcrumbs show `@scope/package-name:File.ts:Symbol`.
## Label-Based Indexing
A label is a named, queryable fileset within a catalog. Labels typically represent branches or specific commits:
- `rushstack:main` - main branch
- `rushstack:feature-x` - feature branch
- `rushstack:v1.0.0` - specific release tag
Key concept: Chunks are immutable content. Labels track which chunks belong to which fileset. When you crawl a new commit under a label, membership is updated but identical content is reused.
### Set Default Context

The `use` command manages the default catalog and label for subsequent commands:
```shell
# Show current context
monodex use

# Set default catalog and label (argument form assumed)
monodex use rushstack:main

# Now you can omit --label in subsequent commands
monodex search "incremental sync"
```
Default context is stored in `~/.config/monodex/context.json`. Explicit flags always override defaults.
### Index a Repository
```shell
# (the positional catalog argument is assumed in the commands below)

# Index working directory changes
monodex crawl rushstack --label wip --working-dir

# Index HEAD commit under the "main" label
monodex crawl rushstack --label main --commit HEAD

# Index a specific branch
monodex crawl rushstack --label feature-x --commit feature-x

# Index a specific commit SHA
monodex crawl rushstack --label main --commit <sha>
```
Required arguments: The crawl command requires --label and either --working-dir or --commit. This prevents accidental overwrites of important labels.
Incremental sync: The crawl is incremental — unchanged files are skipped. You can safely CTRL+C and resume later.
Commit-based: Crawling with --commit reads from Git objects, not the working tree. This ensures deterministic, reproducible indexing.
Working directory mode: Use --working-dir to index uncommitted changes. This reads directly from the filesystem instead of Git objects. The label metadata will show source_kind = "working-directory" and commit_oid = "". Working directory labels are mutable — re-crawling updates the indexed content.
Label reassignment: When you re-crawl a label with a new commit, chunks from the old commit that no longer exist are removed from that label's membership.
Incremental warnings: By default, files with chunking warnings are always re-processed. Use --incremental-warnings to allow them to be skipped if unchanged (useful for large codebases with known chunking issues).
### Search the Database
```shell
# (the `search` command name is assumed)

# Semantic search (uses default context if set)
monodex search "how does incremental sync work"

# With explicit label (full format: catalog:label)
monodex search "how does incremental sync work" --label rushstack:main
```
### View Full Chunks
```shell
# (the `view` command name, flag names, and chunk ids below are assumed;
#  the file_id:chunk_ordinal format comes from search results)

# View a specific chunk by ordinal
monodex view a1b2c3d4:2

# View a range of chunks
monodex view a1b2c3d4:2-4

# View from chunk 3 to the end
monodex view a1b2c3d4:3-

# View all chunks in a file (reconstruct entire file)
monodex view a1b2c3d4

# View chunks from multiple files
monodex view a1b2c3d4 e5f6a7b8

# Show full filesystem paths
monodex view a1b2c3d4 --full-paths

# Omit catalog preamble (chunks only)
monodex view a1b2c3d4 --no-preamble

# Filter by label (full format: catalog:label)
monodex view a1b2c3d4 --label rushstack:main
```
### Debug Chunking Algorithm
```shell
# (flag names below are assumed; the dump-chunks and audit-chunks
#  command names are documented)

# See how a file gets chunked (AST-only mode, reveals partitioner issues)
monodex dump-chunks src/JsonFile.ts

# Include fallback line-based splitting (production behavior)
monodex dump-chunks src/JsonFile.ts --fallback

# Visualize mode - show full chunk contents
monodex dump-chunks src/JsonFile.ts --visualize

# Debug mode - show partitioning decisions
monodex dump-chunks src/JsonFile.ts --debug

# Custom target chunk size (default: 6000 chars)
monodex dump-chunks src/JsonFile.ts --target-size 4000

# Audit chunking quality across multiple files (AST-only mode)
monodex audit-chunks src/
```
Chunk Quality Score: 0-100%, higher is better. Scores below 95% may indicate chunking issues. Note: dump-chunks and audit-chunks use AST-only mode (fallback disabled) to accurately measure partitioner quality.
### Purge Data
```shell
# (command form assumed)

# Purge all chunks from a catalog (all labels)
monodex purge rushstack

# Purge entire collection (all catalogs)
monodex purge --all
```
Note: Purge operates at the catalog level. To remove a specific label's chunks, re-crawl that label with a different commit or manually update the `active_label_ids` field.
## Development
When making a pull request, add a bullet under "## Unreleased" in CHANGELOG.md describing the change from an end-user perspective. See CHANGELOG.md for the version history and publishing instructions.
Run CI checks using Just (recommended):

```shell
# Install just
cargo install just

# Run all CI checks (format, clippy, check, test); recipe names assumed
just ci

# Individual commands
just fmt
just clippy
just test
```

Or run directly with cargo:

```shell
# Run all CI checks
cargo fmt --check && cargo clippy && cargo check && cargo test

# Build
cargo build

# Run with logging (use sparo for testing, not rushstack)
RUST_LOG=debug cargo run -- <command>
```
## Architecture
```
monodex/
├── src/
│ ├── main.rs # CLI entry point
│ └── engine/ # Reusable indexing engine
│ ├── mod.rs # Module exports
│ ├── crawl_config.rs # Crawl config loading, validation, and pattern matching
│ ├── config.rs # Legacy compatibility wrapper (delegates to crawl_config)
│ ├── chunker.rs # File chunking dispatcher
│ ├── partitioner.rs # AST-based TypeScript chunking
│ ├── markdown_partitioner.rs # Markdown heading-based chunking
│ ├── git_ops.rs # Git tree enumeration and blob reading
│ ├── parallel_embedder.rs # Parallel embedding with multiple ONNX sessions
│ ├── package_lookup.rs # Package name resolution (walk up to package.json)
│ ├── uploader.rs # Qdrant HTTP client
│ └── util.rs # Hash utilities for chunk IDs
├── Cargo.toml # Dependencies
├── DESIGN.md # Design documentation
└── README.md
```
## Chunking Strategy
TypeScript/TSX files are chunked using AST-aware partitioning:
- Splits at semantic boundaries (functions, classes, methods, enums)
- Includes preceding JSDoc/TSDoc comments with each symbol
- Handles oversized functions by splitting at statement blocks
- Full breadcrumb context: `package:file:Class.method`
Quality indicators in breadcrumbs:

- No marker: Successful AST split with good chunk geometry
- `:[degraded-ast-split]`: AST split with poor geometry (tiny chunks)
- `:[fallback-split]`: No AST split found, used line-based recovery (failure mode)
Markdown files are split by heading hierarchy.
JSON files are skipped (low value for semantic search).
Exclusions: Folders like `node_modules` and files like `*.test.ts` are automatically skipped. Exclusion rules can be customized via `monodex-crawl.json` (see Crawl Configuration).
## Crawl Configuration
The crawl behavior (which files to index and how to chunk them) can be customized via configuration files.
### Config Discovery
Configs are loaded in this precedence order:

1. `<repo-root>/monodex-crawl.json` (repo-local)
2. `~/.config/monodex/crawl.json` (user-global)
3. Embedded default (compiled into binary)

No merging occurs — exactly one config is used.
### Config Schema

JSON schemas are available in the `schemas/` directory for IDE autocomplete and validation. Copy the appropriate schema file to your project or reference it locally:

| Config File | Schema File |
|---|---|
| `config.json` | `schemas/config.schema.json` |
| `monodex-crawl.json` | `schemas/crawl.schema.json` |
| `context.json` | `schemas/context.schema.json` |

Create a `monodex-crawl.json` file:
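For instance (the extension keys and pattern values are illustrative; the strategy names and pattern fields are documented below):

```shell
cat > monodex-crawl.json <<'EOF'
{
  "version": 1,
  "fileTypes": {
    ".ts": "typescript",
    ".tsx": "typescript",
    ".md": "markdown"
  },
  "patternsToExclude": [
    "node_modules/",
    "**/*.test.ts"
  ],
  "patternsToKeep": [
    "src/**/*.test.ts"
  ]
}
EOF
```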
Fields:

| Field | Type | Description |
|---|---|---|
| `version` | number | Must be 1 |
| `fileTypes` | object | Maps file extension to chunking strategy |
| `patternsToExclude` | array | Glob patterns for paths to skip |
| `patternsToKeep` | array | Glob patterns that override exclusions |
Chunking strategies:

| Strategy | Description |
|---|---|
| `typescript` | AST-based semantic chunking (TS/TSX) |
| `markdown` | Split by heading hierarchy |
| `lineBased` | Generic line-based chunking |
Evaluation rule:

`shouldCrawl = matchesFileType && (matchesPatternsToKeep || !matchesPatternsToExclude)`

- `fileTypes` is the primary filter — unsupported file types are never crawled
- `patternsToKeep` overrides `patternsToExclude` (useful for keeping test files in `src/`)
- Directory patterns (ending in `/`) match anywhere in the path
Pattern syntax:

- Glob patterns use the standard syntax: `**` for recursive, `*` for wildcard
- Directory patterns end with `/` (e.g., `node_modules/`)
- Example: `**/*.test.ts` matches test files at any depth
### Chunk Size Target

- Target: 6000 characters (text only)
- Fits: 8192-token embedding model limit (`jina-embeddings-v2-base-code`)
- Breadcrumb: Extra overhead for navigation context
## Status
This project is under active development. The crate is published to reserve the name. Expect breaking changes between versions.
## License
MIT
## Related
- Qdrant - Vector similarity search engine
- ONNX Runtime - Cross-platform ML inference
- Rush Stack - Monorepo toolkit for JavaScript/TypeScript