Current version: v0.70.0 · Schema: v17 · Test suite: ~364 tests
This document is a concise reference for implemented features, grouped by area. For the full development roadmap and planned phases, see docs/PLAN.md.
- Indexing
- Search
- History / Temporal
- Change Detection
- Clustering
- Branch / Merge
- Analysis
- Visualization (HTML)
- HTTP API Server
- MCP Tools
- Protocol Servers (tools subcommand)
- Maintenance & DB
- Configuration
- Strategic Productization Backlog
- Planned / In Progress
All indexing is content-addressed: a blob (file snapshot) is embedded exactly once per SHA-1 hash, regardless of how many commits or paths reference it.
One database can hold embeddings from multiple embedding models simultaneously. Each embedding row is attributed to its embedding config via the embed_config table.
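
The content-addressing idea can be sketched in a few lines. This is a minimal illustration, assuming the SHA-1 in question is the standard Git blob object ID; `EmbeddingCache` and its `embed` callable are hypothetical stand-ins, not gitsema's actual storage layer.

```python
import hashlib


def git_blob_sha1(content: bytes) -> str:
    """Return the Git blob object ID: SHA-1 over b'blob <size>\\0' + content."""
    header = b"blob %d\x00" % len(content)
    return hashlib.sha1(header + content).hexdigest()


class EmbeddingCache:
    """Hypothetical in-memory stand-in for an embeddings table keyed by blob hash."""

    def __init__(self, embed):
        self._embed = embed      # assumed callable: bytes -> list[float]
        self._by_hash = {}
        self.embed_calls = 0     # counts how often the model is actually invoked

    def get_or_embed(self, content: bytes):
        key = git_blob_sha1(content)
        if key not in self._by_hash:
            # First time this content is seen, whatever path or commit it came from.
            self.embed_calls += 1
            self._by_hash[key] = self._embed(content)
        return self._by_hash[key]
```

Identical file contents reached through different commits or paths hash to the same key, so the embedding call runs once.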
| Feature | Flag / command |
|---|---|
| Index coverage status (read-only, multi-model aware) | gitsema index |
| Start indexing (HEAD-first, then history) | gitsema index start |
| Incremental (default when run after prior index) | `gitsema index start --since <ref |
| Parallel embedding | --concurrency <n> (default 4) |
| Batch embedding requests | --embed-batch-size <n> |
| Extension filter | --ext ".ts,.py" |
| Path exclusion | --exclude "node_modules,dist" |
| Max blob size cap | --max-size 200kb |
| Glob-based selective indexing | --include-glob "src/**" |
| Specific file indexing from HEAD | --file <paths...> |
| Chunking strategies | `--chunker file |
| Fixed-window chunk tuning | --window-size <n>, --overlap <n> |
| VSS / HNSW index build after indexing | --auto-build-vss [threshold] |
| Int8 scalar quantization | --quantize |
| Cap commits per run | --max-commits <n> |
| Mixed-model index guard | --allow-mixed |
| Model override for a run | --model <name> |
| Index bundle export / import | gitsema index export/import |
| Automated hooks (post-commit, post-merge) | gitsema config set hooks.enabled true |
| Module-level embeddings (directory centroids) | gitsema index update-modules |
| Remote-repo indexing via HTTP server | gitsema remote-index <url> |
| Multi-repo registry | gitsema repos add/list/remove |
| Profile presets (Phase 63) | --profile speed\|balanced\|quality |
| Auto-batch detection (Phase 63) | Auto-enables embedBatch() when provider supports it |
| First-run CPU profiling | Enabled by GITSEMA_PROFILE_FIRST_RUN or index.profileFirstRun (default: true). Profiles written to .gitsema/profiles/embedeer-profile-<timestamp>.cpuprofile. Precedence: env GITSEMA_PROFILE_FIRST_RUN overrides repo config index.profileFirstRun. Recommended: disable in CI by setting GITSEMA_PROFILE_FIRST_RUN=0 in your CI environment. |
| Adaptive batch controller (Phase 63) | In-flight batch size adjustment based on observed latency |
| Post-run maintenance recommendations (Phase 63) | VSS, FTS backfill, vacuum suggestions after each run |
| BatchingProvider sub-batch chunking (Phase 62) | Transparent sub-batch split + retry wrapper for any provider (buildBatchingProvider()) |
| Ollama true-batch endpoint (Phase 62) | OllamaProvider uses /api/embed (Ollama ≥ 0.1.34) for native string[] batch; falls back to serial on 404 |
| Pipelined read/embed/store (Phase 69) | AsyncQueue-based overlap of batch stages; activated on the batch path |
| Per-repo project metadata | gitsema project (2D projections) |
Chunking fallback chain: whole-file → function boundaries → fixed windows (1500 chars) → fixed windows (800 chars) when a blob exceeds the embedding model's context limit.
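
The fallback chain above can be sketched as a loop over strategies that stops at the first one whose chunks all fit. This is only an illustration: the context limit is measured in characters here (real limits are usually in tokens), `split_functions` is a hypothetical regex-based boundary splitter, and window overlap is ignored.

```python
import re


def split_functions(text: str) -> list[str]:
    """Hypothetical splitter: cut before lines starting with 'def ' or 'function '."""
    parts = re.split(r"(?m)^(?=(?:def|function)\s)", text)
    return [p for p in parts if p.strip()]


def fixed_windows(text: str, size: int) -> list[str]:
    """Cut text into consecutive windows of at most `size` characters."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def chunk_with_fallback(text: str, limit: int) -> tuple[str, list[str]]:
    """Try each strategy in order; return the first whose chunks all fit `limit`."""
    strategies = [
        ("file", lambda t: [t]),
        ("function", split_functions),
        ("fixed-1500", lambda t: fixed_windows(t, 1500)),
        ("fixed-800", lambda t: fixed_windows(t, 800)),
    ]
    for name, split in strategies:
        chunks = split(text)
        if chunks and all(len(c) <= limit for c in chunks):
            return name, chunks
    raise ValueError("no chunking strategy fits the context limit")
```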
Every search command embeds the query with the text embedding model rather than the code model, because queries are usually written in natural language.
| Feature | Flag / command |
|---|---|
| Vector similarity search | gitsema search <query> |
| Top-k results | -k / --top <n> |
| Symbol / chunk-level code search | gitsema code-search <query> |
| Hybrid search (vector + BM25) | --hybrid, --bm25-weight <n> |
| Query expansion (BM25 keywords pre-embedding) | --expand-query |
| Recency-blended ranking | --recent, --alpha <n> |
| Three-signal ranking | --weight-vector, --weight-recency, --weight-path |
| Date range filter | --before <date>, --after <date> |
| Branch-scoped search | --branch <name> |
| Group results | `--group file |
| Include chunk results | --chunks |
| Contrastive / negative-example search | --not-like <query> |
| Lambda contrastive parameter | --lambda <n> |
| Result explanation | --explain |
| Boolean queries | --or, --and; inline A AND B / A OR B |
| LLM narrative summary | --narrate (requires GITSEMA_LLM_URL) |
| HNSW approximate-nearest-neighbor search | --vss (requires built VSS index) |
| HTML output | --html [file] |
| Multi-repo search | gitsema repos + MCP multi_repo_search |
| Early-cut (Phase 64) | --early-cut <n> — random-sample candidate pool for speed on large indexes |
| LLM provenance citations (Phase 64) | --explain-llm — structured citation block for LLM prompt grounding |
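
As a rough illustration of what the three-signal ranking flags (--weight-vector, --weight-recency, --weight-path) might control, the sketch below blends three scores with a weighted sum. The formulas are assumptions for illustration only: the exponential recency decay, its half-life, the path-term matching, and the default weights are all hypothetical and not taken from gitsema.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    path: str
    cosine: float    # vector similarity to the query, assumed in [0, 1]
    age_days: float  # age of the newest commit touching the blob


def recency_score(age_days: float, half_life_days: float = 180.0) -> float:
    """Hypothetical decay: 1.0 for brand-new content, halving every half-life."""
    return 0.5 ** (age_days / half_life_days)


def path_score(path: str, query: str) -> float:
    """Hypothetical path signal: fraction of query terms that occur in the path."""
    terms = query.lower().split()
    if not terms:
        return 0.0
    lowered = path.lower()
    return sum(1 for t in terms if t in lowered) / len(terms)


def rank(cands, query, w_vector=0.7, w_recency=0.2, w_path=0.1):
    """Sort candidates by the weighted sum of the three signals, best first."""
    def score(c: Candidate) -> float:
        return (w_vector * c.cosine
                + w_recency * recency_score(c.age_days)
                + w_path * path_score(c.path, query))
    return sorted(cands, key=score, reverse=True)
```

With the recency and path weights set to zero, this reduces to plain vector ranking.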
| Feature | Flag / command |
|---|---|
| Find concept origin (first-seen chronologically) | gitsema first-seen <query> |
| Single-file semantic drift timeline | gitsema file-evolution <path> |
| Concept drift timeline across history | gitsema evolution <query> |
| Semantic diff between two refs | gitsema diff <ref1> <ref2> <query> |
| Semantic diff of a file between two refs | gitsema file-diff <ref1> <ref2> <path> |
| Per-block nearest-neighbor attribution | gitsema blame <file> (alias: semantic-blame) |
| Concept lifecycle (birth, growth, plateau, decay) | gitsema lifecycle <query> |
| Semantic bisect (find regressions) | gitsema bisect <good> <bad> <query> |
| Dead-concept detection (deleted blobs) | gitsema dead-concepts |
| Evolution alerts (largest jumps) | --alerts [n] on file-evolution |
| Structured JSON / HTML dump | --dump [file], --html [file] (legacy; prefer --out) |
| Include stored content in dumps | --include-content |
| LLM narrative | --narrate on evolution, diff, file-evolution |
| Unified output system (Phase 70) | --out <format>[:<file>] (repeatable) on search, evolution, triage, policy check, ownership, workflow run; formats: text\|json\|html\|markdown\|sarif |
| Feature | Flag / command |
|---|---|
| Concept-level change points across history | gitsema change-points <query> |
| Single-file semantic change points | gitsema file-change-points <path> |
| Cluster-structure change points | gitsema cluster-change-points |
| Threshold tuning | --threshold <n> (cosine distance, default 0.3) |
| Show top-N jumps | --top-points <n> |
| Date range | --since <ref>, --until <ref> |
| Commit cap (for large repos) | --max-commits <n> on cluster-change-points |
| Structured JSON dump | --dump [file] |
| LLM narrative | --narrate |
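
To make the --threshold and --top-points options concrete, here is a minimal sketch of threshold-based change-point detection over a chronological list of embeddings. It assumes a change point is a consecutive pair whose cosine distance meets the threshold; the function names and the input shape are illustrative, not gitsema's internals.

```python
import math


def cosine_distance(a, b) -> float:
    """1 - cosine similarity; returns 1.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    if na == 0 or nb == 0:
        return 1.0
    return 1.0 - dot / (na * nb)


def change_points(timeline, threshold: float = 0.3):
    """timeline: chronological list of (label, embedding) pairs.

    Returns (previous_label, label, distance) for each consecutive jump
    whose cosine distance is at least `threshold`.
    """
    points = []
    for (prev_label, prev_vec), (label, vec) in zip(timeline, timeline[1:]):
        d = cosine_distance(prev_vec, vec)
        if d >= threshold:
            points.append((prev_label, label, d))
    return points


def top_points(points, n: int):
    """Keep the n largest jumps, largest first."""
    return sorted(points, key=lambda p: p[2], reverse=True)[:n]
```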
| Feature | Flag / command |
|---|---|
| K-means cluster snapshot | gitsema clusters |
| Temporal cluster diff (two refs) | gitsema cluster-diff <ref1> <ref2> |
| Multi-step cluster timeline | gitsema cluster-timeline |
| Number of clusters | --k <n> (default 8) |
| Timeline steps | --steps <n> |
| Date range | --since <ref>, --until <ref> |
| HTML interactive output | --html [file] |
| LLM narrative | --narrate |
| HNSW warm-start k-means | built into build-vss pipeline |
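
For readers unfamiliar with k-means, the sketch below shows plain Lloyd's algorithm on small vectors. It is a teaching example under stated assumptions: it uses squared Euclidean distance and a deterministic "first k points" initialization, and it does not model the HNSW warm start or whatever distance gitsema actually uses.

```python
def sq_dist(a, b) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k: int, iterations: int = 20):
    """Lloyd's algorithm; returns (assignment, centroids)."""
    if k <= 0 or len(points) < k:
        raise ValueError("need at least k points and k >= 1")
    # Deterministic init: the first k points become the starting centroids.
    centroids = [list(p) for p in points[:k]]
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        for i, p in enumerate(points):
            assignment[i] = min(range(k), key=lambda c: sq_dist(p, centroids[c]))
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [points[i] for i in range(len(points)) if assignment[i] == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centroids
```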
| Feature | Flag / command |
|---|---|
| Branch semantic summary vs base | gitsema branch-summary <branch> |
| Semantic collision detection before merge | gitsema merge-audit <branch-a> <branch-b> |
| Pre-merge concept landscape preview | gitsema merge-preview <branch> |
| Cherry-pick suggestions based on semantic similarity | gitsema cherry-pick-suggest <query> |
| CI diff (post to PR as GitHub review comment) | gitsema ci-diff --github-token <token> |
| Branch filter on search/evolution | --branch <name> |
| Feature | Flag / command |
|---|---|
| Semantic authorship attribution | gitsema author <query> |
| Cross-module coupling / refactor impact | gitsema impact <path> |
| Refactor candidates (cross-cutting duplication) | gitsema refactor-candidates |
| Documentation gap analysis | gitsema doc-gap |
| Contributor profile (per-author concept map) | gitsema contributor-profile <author> |
| Security scan (vulnerability pattern similarity) | gitsema security-scan (results are similarity scores, not confirmed CVEs) |
| Health timeline (churn rate, dead-concept ratio) | gitsema health |
| Technical debt scoring (isolation, age, frequency) | gitsema debt |
| Experts / reviewer suggestions (Phase 61) | gitsema experts |
| Semantic PR report (Phase 61) | gitsema pr-report |
| Retrieval evaluation harness (Phase 64) | gitsema eval <file.jsonl> |
| Incident triage bundle (Phase 65) | gitsema triage <query> [--ref1] [--ref2] [--file] [--top] [--dump] |
| Policy checks for CI (Phase 66) | gitsema policy check [--max-drift] [--max-debt-score] [--min-security-score] [--query] |
| Ownership heatmap by concept (Phase 67) | gitsema ownership <query> [--top] [--window] [--dump] |
| Workflow templates (Phase 68) | gitsema workflow run <pr-review\|incident\|release-audit> [--format] [--dump] |
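
The retrieval evaluation harness reports P@k, R@k, and MRR. The sketch below gives the standard textbook definitions of those metrics; the input shapes (a ranked list of IDs plus a set of relevant IDs) are assumptions and say nothing about the JSONL format that gitsema eval expects.

```python
def precision_at_k(ranked, relevant, k: int) -> float:
    """Fraction of the top-k results that are relevant."""
    if k <= 0:
        return 0.0
    return sum(1 for item in ranked[:k] if item in relevant) / k


def recall_at_k(ranked, relevant, k: int) -> float:
    """Fraction of all relevant items that appear in the top-k results."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & set(relevant)) / len(relevant)


def mean_reciprocal_rank(runs) -> float:
    """runs: list of (ranked, relevant) pairs, one per query.

    Each query contributes 1/rank of its first relevant hit, or 0 if none.
    """
    if not runs:
        return 0.0
    total = 0.0
    for ranked, relevant in runs:
        for rank, item in enumerate(ranked, start=1):
            if item in relevant:
                total += 1.0 / rank
                break
    return total / len(runs)
```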
Interactive single-file HTML outputs; no external dependencies required.
| Renderer | Command(s) |
|---|---|
| Evolution / concept-evolution timeline | gitsema evolution --html |
| Cluster snapshot | gitsema clusters --html |
| Cluster diff | gitsema cluster-diff --html |
| Cluster timeline | gitsema cluster-timeline --html |
| Search results | gitsema search --html |
| Author attribution | gitsema author --html |
| First-seen results | gitsema first-seen --html |
| Impact heatmap | gitsema impact --html |
| Semantic diff | gitsema diff --html |
| Codebase map (2D scatter) | gitsema map |
| Temporal heatmap | gitsema heatmap |
| Web UI (served inline) | gitsema tools serve --ui |
Start with gitsema tools serve [--port n] [--key token] [--ui].
| Route prefix | Endpoints |
|---|---|
| GET /api/v1/status | Index statistics |
| POST /api/v1/blobs/check | Check if blobs are already indexed |
| POST /api/v1/blobs | Write blob + embedding |
| POST /api/v1/commits, POST /api/v1/commits/mark-indexed | Commit metadata |
| POST /api/v1/search, POST /api/v1/search/first-seen | Search |
| POST /api/v1/evolution/file, POST /api/v1/evolution/concept | Evolution |
| POST /api/v1/remote/index | Remote repo indexing |
| GET /api/v1/remote/jobs/metrics, GET /api/v1/remote/jobs/:id/progress | Job progress |
| POST /api/v1/analysis/clusters | Clustering |
| POST /api/v1/analysis/change-points | Change-point detection |
| POST /api/v1/analysis/author | Author attribution |
| POST /api/v1/analysis/impact | Impact analysis |
| POST /api/v1/analysis/semantic-diff | Semantic diff |
| POST /api/v1/analysis/semantic-blame | Semantic blame |
| POST /api/v1/analysis/dead-concepts | Dead-concept detection |
| POST /api/v1/analysis/merge-audit | Merge audit |
| POST /api/v1/analysis/merge-preview | Merge preview |
| POST /api/v1/analysis/branch-summary | Branch summary |
| POST /api/v1/analysis/experts | Experts / reviewer suggestions (Phase 61) |
| POST /api/v1/analysis/security-scan | Vulnerability pattern similarity scan (Phase 43) |
| POST /api/v1/analysis/health | Time-bucketed health timeline (Phase 44) |
| POST /api/v1/analysis/debt | Technical debt scoring (Phase 45) |
| POST /api/v1/analysis/doc-gap | Documentation gap analysis (Phase 38) |
| POST /api/v1/analysis/contributor-profile | Contributor semantic profile (Phase 39) |
| POST /api/v1/analysis/triage | Incident triage bundle (Phase 65) |
| POST /api/v1/analysis/policy-check | Automated CI gate checks (Phase 66) |
| POST /api/v1/analysis/ownership | Ownership heatmap by concept (Phase 67) |
| POST /api/v1/analysis/workflow | Workflow template runner: pr-review \| incident \| release-audit (Phase 68) |
| POST /api/v1/analysis/eval | Inline retrieval evaluation harness: P@k, R@k, MRR (Phase 64) |
| POST /api/v1/analysis/multi-repo-search | Search across multiple registered repos |
| GET /api/v1/capabilities | Capabilities manifest (Phase 64) |
| GET /ui | Embedded 2D codebase map UI (requires --ui) |
| GET /metrics | Prometheus metrics scrape endpoint (P2) |
| GET /openapi.json | OpenAPI 3.1 JSON specification (P2) |
| GET /docs | Swagger UI (P2) |
Authentication: optional Bearer token via --key <token> / GITSEMA_SERVE_KEY. Per-repo scoped tokens can be minted with gitsema repos token add <repo-id> and are stored as SHA-256 hashes at rest (review7 §4.1) — the plaintext is never persisted in the database.
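
Storing only a hash of each token means a leaked database does not reveal usable credentials. The sketch below shows the general pattern with Python's standard library; the function names and token length are hypothetical, and this is not gitsema's actual minting code.

```python
import hashlib
import hmac
import secrets


def mint_token() -> tuple[str, str]:
    """Return (plaintext, sha256_hex). Only the digest would be persisted."""
    plaintext = secrets.token_urlsafe(32)
    digest = hashlib.sha256(plaintext.encode("utf-8")).hexdigest()
    return plaintext, digest


def verify_token(presented: str, stored_digest: str) -> bool:
    """Hash the presented token and compare digests in constant time."""
    candidate = hashlib.sha256(presented.encode("utf-8")).hexdigest()
    return hmac.compare_digest(candidate, stored_digest)


def parse_bearer(header_value: str):
    """Extract the token from an 'Authorization: Bearer <token>' header value."""
    scheme, _, token = header_value.partition(" ")
    token = token.strip()
    if scheme.lower() == "bearer" and token:
        return token
    return None
```

The plaintext is shown to the user once at mint time; later requests are checked by hashing whatever token they present.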
- Prometheus metrics (GET /metrics): exposes HTTP latency histograms, index size gauges, embedding error counters, query cache hit/miss counters, and Node.js default metrics. Protected by auth by default; set GITSEMA_METRICS_PUBLIC=1 to allow unauthenticated scraping.
- Rate limiting: per-token when auth is enabled, per-IP otherwise. Returns 429 Too Many Requests with Retry-After header. Configure via GITSEMA_RATE_LIMIT_RPM (default 300) and GITSEMA_RATE_LIMIT_BURST.
- OpenAPI spec (GET /openapi.json): machine-readable OpenAPI 3.1 spec generated from Zod route schemas.
- Swagger UI (GET /docs): interactive API explorer loaded from CDN.
- Deployment guide: docs/deploy.md covers systemd, Docker/Ollama sidecar, secrets, backups, model rotation, recommended settings, and team operations (token rotation, audit logging, backup/restore drills).
- Playbooks: docs/playbooks.md provides role-based quickstart recipes for solo developers, PR reviewers, security engineers, and release managers.
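
One common way to implement a requests-per-minute limit with a burst allowance is a token bucket, sketched below. Treat this as an assumption about the general technique rather than a description of gitsema's limiter: the class name, the default burst size, and the way Retry-After is computed are all illustrative.

```python
import math


class TokenBucket:
    """Token bucket for one client key (an API token or an IP address)."""

    def __init__(self, rpm: int = 300, burst=None):
        self.rate = rpm / 60.0  # tokens refilled per second
        # Assumed default: burst capacity equals the per-minute limit.
        self.capacity = float(burst if burst is not None else rpm)
        self.tokens = self.capacity
        self.updated = 0.0

    def allow(self, now: float):
        """Return (allowed, retry_after_seconds) for a request at time `now`."""
        elapsed = max(0.0, now - self.updated)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True, 0
        # Whole seconds until one token is available, for a Retry-After header.
        return False, math.ceil((1.0 - self.tokens) / self.rate)
```

A server would keep one bucket per key and answer 429 Too Many Requests whenever `allow` returns False.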
Start with gitsema tools mcp. All tools share the same core logic as the CLI.
| Tool name | Description |
|---|---|
| semantic_search | Vector similarity search |
| code_search | Symbol / chunk-level code search |
| search_history | Vector search enriched with Git history metadata |
| first_seen | Find when a concept first appeared (chronological sort) |
| evolution | Single-file semantic drift timeline |
| concept_evolution | Concept drift across codebase history |
| index | Trigger incremental (or full) re-indexing |
| branch_summary | Semantic summary of a branch vs base |
| merge_audit | Detect semantic collisions between two branches |
| merge_preview | Predict concept-landscape shift after merge |
| clusters | K-means cluster snapshot |
| change_points | Concept-level change-point detection |
| semantic_diff | Conceptual diff across two git refs |
| semantic_blame | Semantic origin of each logical block |
| file_change_points | Change points for a single file |
| cluster_diff | Compare cluster snapshots at two refs |
| cluster_timeline | Multi-step cluster drift timeline |
| author | Authorship attribution for a concept |
| impact | Cross-module coupling / refactor-impact analysis |
| dead_concepts | Find deleted semantic blobs |
| security_scan | Vulnerability-pattern similarity scan |
| health_timeline | Time-bucketed codebase health metrics |
| debt_score | Technical debt scoring |
| multi_repo_search | Search across multiple registered gitsema repos |
| Subcommand | Description |
|---|---|
| gitsema tools mcp | MCP stdio server (preferred entry point for AI clients) |
| gitsema tools lsp [--tcp <port>] | LSP semantic hover server (JSON-RPC over stdio or TCP) |
| gitsema tools serve [--port n] [--key token] [--ui] | HTTP API server |
Legacy top-level aliases gitsema mcp, gitsema lsp, and gitsema serve still work but emit a deprecation warning.
| Feature | Command |
|---|---|
| Index statistics | gitsema status [file] |
| DB integrity check | gitsema index doctor |
| SQLite VACUUM + ANALYZE | gitsema index vacuum |
| Garbage-collect orphan embeddings | gitsema index gc |
| Rebuild FTS5 index | gitsema index rebuild-fts |
| Backfill FTS5 content for pre-Phase-11 blobs | gitsema index backfill-fts |
| Build / rebuild HNSW VSS index | gitsema index build-vss |
| Remove embeddings for a specific model | gitsema index clear-model <model> |
| Recalculate module-level embeddings | gitsema index update-modules |
| Export index bundle (tar.gz) | gitsema index export |
| Import index bundle | gitsema index import |
| Saved semantic watches | gitsema watch add/list/remove/run |
Model profiles allow different models to use different providers, base URLs, and API keys. Profiles are stored in .gitsema/config.json (local) or ~/.config/gitsema/config.json (global, --global).
Per-model settings override the global GITSEMA_PROVIDER / GITSEMA_HTTP_URL / GITSEMA_API_KEY environment variables, so Ollama and OpenAI models can coexist in the same index.
| Feature | Command |
|---|---|
| List configured profiles + indexed models | gitsema models list [--json] |
| Show model info (config + index stats) | gitsema models info <name> |
| Configure a model's provider settings | gitsema models add <name> [--provider] [--url] [--key] |
| Set as default / text / code model | gitsema models add <name> --set-default (or --set-text, --set-code) |
| Remove a model profile | gitsema models remove <name> |
| Remove profile + purge index data | gitsema models remove <name> --purge-index |
Example:

    # Add OpenAI model with dedicated API key
    gitsema models add text-embedding-3-small \
      --provider http --url https://api.openai.com --key sk-... --set-text
    # Use Ollama for code, OpenAI for prose
    gitsema models add nomic-embed-text --provider ollama --set-code
    # Then index; the right provider is chosen per model automatically
    gitsema index start

Persistent configuration lives in .gitsema/config.json (repo-level) or ~/.config/gitsema/config.json (global, --global).

    gitsema config set provider http
    gitsema config set model text-embedding-3-small
    gitsema config set index.concurrency 8
    gitsema config set hooks.enabled true   # auto-install git hooks
    gitsema config list                     # show all active values + sources

Environment variables always override config-file values. See README.md for the full env-var reference.
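
The precedence rules described in this section (per-model profile settings beat the global GITSEMA_PROVIDER variable, and environment variables beat config files) can be sketched as a simple lookup chain. The dictionary shapes, the `provider` key, and the ordering of repo-level before global config are assumptions made for illustration; only the two precedence rules come from the text.

```python
def resolve_provider(model, profiles, env, repo_config, global_config, fallback):
    """Return (provider, source) for `model` using the first layer that sets it.

    Assumed order: model profile, environment, repo config, global config,
    then the caller-supplied fallback.
    """
    profile = profiles.get(model, {})
    if profile.get("provider"):
        return profile["provider"], "model-profile"
    if env.get("GITSEMA_PROVIDER"):
        return env["GITSEMA_PROVIDER"], "env"
    if repo_config.get("provider"):
        return repo_config["provider"], "repo-config"
    if global_config.get("provider"):
        return global_config["provider"], "global-config"
    return fallback, "fallback"
```

Returning the source alongside the value mirrors the idea of showing active values together with where they came from.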
Detailed rationale is documented in docs/review4.md. High-value productizations proposed from the current codebase:
- Add experts parity to MCP and HTTP (/analysis/experts + MCP tool). ✅ Phase 61
- Add a machine-readable capabilities manifest across CLI/MCP/HTTP. ✅ Phase 64
- Add pipelined batch indexing (overlap read/embed/store stages). ✅ Phase 69 (AsyncQueue-based pipeline)
- Add speed/quality/balanced search profile presets. ✅ Phase 63
- Add top-K early-cut scoring mode for large candidate sets. ✅ Phase 64
- Add semantic PR report generation for CI and code review. ✅ Phase 61 (gitsema pr-report)
- Add incident triage bundles (bisect+change-points+first-seen). ✅ Phase 65 (gitsema triage)
- Add concept ownership heatmap and ownership-shift tracking. ✅ Phase 67 (gitsema ownership)
- Add policy-style CI gates for drift/debt/security thresholds. ✅ Phase 66 (gitsema policy check)
- Add AI-oriented provenance explain mode for prompt grounding. ✅ Phase 64 (--explain-llm)
- Add saved workflow templates (pr-review, incident, release-audit). ✅ Phase 68 (gitsema workflow run)
- Add retrieval quality evaluation harness for AI workflows. ✅ Phase 64 (gitsema eval)
All 12 original productization proposals from review4 are now shipped. See docs/review5.md for the next set of priorities.
This section is intentionally brief. The canonical roadmap is in docs/PLAN.md.