A content-addressed semantic index synchronized with Git's object model.
Gitsema walks your Git history, embeds every blob, and lets you semantically search your codebase — including across time. It treats blob hashes as the unit of identity, so identical content is only embedded once regardless of how many commits reference it.
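To make the deduplication idea concrete, here is a minimal Python sketch; it is not gitsema's actual code. `git_blob_hash` reproduces the object ID Git computes for a blob in a SHA-1 repository. `EmbedOnceCache` and `fake_embed` are hypothetical names introduced only for this illustration.

```python
import hashlib


def git_blob_hash(content: bytes) -> str:
    """Object ID Git assigns to a blob (SHA-1 repositories)."""
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


class EmbedOnceCache:
    """Hypothetical cache keyed by blob hash: identical content is embedded once."""

    def __init__(self, embed):
        self._embed = embed      # any callable: bytes -> list[float]
        self._vectors = {}       # blob hash -> embedding
        self.embed_calls = 0

    def get(self, content: bytes):
        key = git_blob_hash(content)
        if key not in self._vectors:
            self.embed_calls += 1
            self._vectors[key] = self._embed(content)
        return self._vectors[key]


def fake_embed(content: bytes):
    """Stand-in for a real embedding backend (assumption for the demo)."""
    return [float(len(content))]


cache = EmbedOnceCache(fake_embed)
cache.get(b"console.log('hi')\n")   # first commit touching the file
cache.get(b"console.log('hi')\n")   # same content in a later commit: cache hit
print(cache.embed_calls)            # 1
```

Because the key depends only on content, renames and reverts that restore earlier content would hit the cache as well.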
- Node.js 20+
- Git (must be on PATH)
- An embedding backend, either:
  - Ollama (local, default): ollama.ai with nomic-embed-text pulled
  - HTTP / OpenAI-compatible API: any endpoint that speaks the OpenAI embeddings API
Install from npm (requires Node.js >=20):
npm install -g gitsema
Or install from source:
git clone https://github.com/jsilvanus/gitsema.git
cd gitsema
pnpm install
pnpm build # compiles TypeScript → dist/
# Optional: put `gitsema` on your PATH
pnpm setup # one-time setup; then open a new terminal
pnpm link --global
To use without linking, prefix commands with node dist/cli/index.js instead of gitsema.
cd /path/to/your/git/repo
# 1. Start indexing (uses Ollama by default)
gitsema index start
# 2. Search
gitsema search "authentication middleware"
# 3. Check index coverage (per-model, multi-model aware)
gitsema index
All configuration is done through environment variables. Set them in your shell or in a .env file loaded before running gitsema.
| Variable | Default | Description |
|---|---|---|
| GITSEMA_PROVIDER | ollama | Embedding backend: ollama, http, or embedeer |
| Variable | Default | Description |
|---|---|---|
| GITSEMA_MODEL | nomic-embed-text | Ollama model to use for embeddings |
| GITSEMA_TEXT_MODEL | value of GITSEMA_MODEL | Model used for text/prose files |
| GITSEMA_CODE_MODEL | value of GITSEMA_TEXT_MODEL | Model used for source code files (overrides text model) |
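The defaults in the table form a fallback chain: code model, then text model, then GITSEMA_MODEL, then nomic-embed-text. A small Python sketch of that resolution follows. The function name `resolve_models` is hypothetical, and treating an empty string as unset is an assumption, not documented behaviour.

```python
import os

DEFAULT_MODEL = "nomic-embed-text"


def resolve_models(env=None):
    """Resolve text and code models using the documented fallback chain."""
    env = os.environ if env is None else env
    base = env.get("GITSEMA_MODEL") or DEFAULT_MODEL
    text = env.get("GITSEMA_TEXT_MODEL") or base
    code = env.get("GITSEMA_CODE_MODEL") or text   # code falls back to the text model
    return {"text": text, "code": code}


# "prose-model" is a made-up model name used only for the demo.
print(resolve_models({"GITSEMA_TEXT_MODEL": "prose-model"}))
# {'text': 'prose-model', 'code': 'prose-model'}
```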
Ollama is assumed to be running at http://localhost:11434. Pull the model first:
ollama pull nomic-embed-text
| Variable | Default | Description |
|---|---|---|
| GITSEMA_HTTP_URL | (required) | Base URL of the embeddings API, e.g. https://api.openai.com |
| GITSEMA_MODEL | nomic-embed-text | Model name passed in the request body |
| GITSEMA_TEXT_MODEL | value of GITSEMA_MODEL | Model for text files |
| GITSEMA_CODE_MODEL | value of GITSEMA_TEXT_MODEL | Model for code files |
| GITSEMA_API_KEY | (optional) | Bearer token sent as Authorization: Bearer <key> |
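For orientation, the sketch below shows how these variables could map onto an OpenAI-style embeddings request. It only builds the request and never sends it. The `/v1/embeddings` path is OpenAI's public endpoint; whether gitsema appends exactly this path to GITSEMA_HTTP_URL is an assumption, and `build_embedding_request` is a hypothetical helper.

```python
import json
import urllib.request


def build_embedding_request(base_url, model, text, api_key=None):
    """Build (but do not send) an OpenAI-style embeddings request."""
    headers = {"Content-Type": "application/json"}
    if api_key:
        headers["Authorization"] = f"Bearer {api_key}"   # GITSEMA_API_KEY
    body = json.dumps({"model": model, "input": text}).encode("utf-8")
    # Assumption: the endpoint path is OpenAI's /v1/embeddings.
    url = base_url.rstrip("/") + "/v1/embeddings"
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


req = build_embedding_request(
    "https://api.openai.com", "text-embedding-3-small", "hello", api_key="sk-example"
)
print(req.full_url)   # https://api.openai.com/v1/embeddings
```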
Example for OpenAI:
export GITSEMA_PROVIDER=http
export GITSEMA_HTTP_URL=https://api.openai.com
export GITSEMA_MODEL=text-embedding-3-small
export GITSEMA_API_KEY=sk-...
gitsema index start
Gitsema can generate a CPU profile during the first successful indexing run to help tune embedding concurrency and batchSize.
- Environment variable: GITSEMA_PROFILE_FIRST_RUN (truthy enables, falsy disables)
- Repo config: index.profileFirstRun (use gitsema config set index.profileFirstRun false --local to disable)
- Profiles are written into the indexed repo at .gitsema/profiles/embedeer-profile-<timestamp>.cpuprofile
- Precedence: the GITSEMA_PROFILE_FIRST_RUN environment variable overrides the repo config index.profileFirstRun.
- Recommended: disable profiling in CI. Example (GitHub Actions):
env:
  GITSEMA_PROFILE_FIRST_RUN: '0'
By default profiling is enabled on the first run when no prior embeddings exist. If an index attempt fails, a partial profile is still saved but the "profile-done" marker is only written after a successful, full indexing run.
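The precedence rules above can be summarised in a few lines of Python. This is an illustrative sketch, not the tool's implementation: `profiling_enabled` is a hypothetical name, and the exact set of strings treated as falsy is an assumption because the text only says "truthy enables, falsy disables".

```python
FALSY = {"", "0", "false", "no", "off"}   # assumption: exact parsing is not documented


def profiling_enabled(env, repo_config, has_prior_embeddings):
    """Decide whether a first-run CPU profile should be captured."""
    if has_prior_embeddings:
        return False                      # only the first run is profiled
    raw = env.get("GITSEMA_PROFILE_FIRST_RUN")
    if raw is not None:                   # the environment variable wins
        return raw.strip().lower() not in FALSY
    # Fall back to repo config; profiling is on by default.
    return bool(repo_config.get("index.profileFirstRun", True))


# CI sets the variable to '0', so the repo config is ignored.
print(profiling_enabled({"GITSEMA_PROFILE_FIRST_RUN": "0"},
                        {"index.profileFirstRun": True}, False))   # False
```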
| Variable | Default | Description |
|---|---|---|
| GITSEMA_VERBOSE | off | Set to 1 for debug logging (same as --verbose) |
| GITSEMA_REMOTE | (optional) | Default remote gitsema tools serve URL; overridden per-command by --remote |
| GITSEMA_LLM_URL | (optional) | OpenAI-compatible URL for --narrate LLM summaries |
| GITSEMA_LOG_MAX_BYTES | 1048576 | Log rotation threshold (1 MB) |
Commands are organised into groups. See docs/features.md for the full feature catalog.
| Group | Commands |
|---|---|
| Setup | quickstart, config, status, models, repos |
| Indexing | index (status), index start, index doctor, index vacuum, index backfill-fts, index rebuild-fts, index gc, index clear-model, index update-modules, index build-vss, index export, index import, remote-index, watch |
| Protocol Servers | tools mcp, tools serve, tools lsp |
| Search & Discovery | search, code-search, first-seen, dead-concepts, repl |
| File History | file-evolution, file-diff, blame, impact, file-change-points |
| Concept History | evolution, diff, author, lifecycle |
| Cluster Analysis | clusters, cluster-diff, cluster-timeline |
| Change Detection | change-points, file-change-points, cluster-change-points |
| Branch / Merge | branch-summary, merge-audit, merge-preview, cherry-pick-suggest, ci-diff, bisect |
| Code Quality | code-review, security-scan, health, debt, doc-gap, refactor-candidates |
| Analysis | author, contributor-profile, triage, policy, ownership, eval, cross-repo-similarity, pr-report |
| Workflows | workflow run <template>, workflow list |
| Visualization | map, heatmap, project |
Backward-compatible aliases: concept-evolution → evolution, semantic-blame → blame. gitsema mcp / gitsema serve / gitsema lsp are deprecated; use gitsema tools mcp / gitsema tools serve / gitsema tools lsp instead. The old DB maintenance commands (gitsema doctor, gitsema vacuum, gitsema gc, gitsema backfill-fts, gitsema rebuild-fts, gitsema update-modules, gitsema build-vss, gitsema clear-model) still work as hidden deprecated aliases and print a migration hint; use the gitsema index <subcommand> forms instead.
Not sure which command to use? Search by what you want to accomplish:
| I want to… | Command(s) |
|---|---|
| Get started with guided setup | gitsema quickstart |
| Index and search | gitsema index start, gitsema search "query" |
| See what's indexed / coverage | gitsema index, gitsema status |
| Find where a concept first appeared | gitsema first-seen "query" |
| Track how a file changed semantically over time | gitsema file-evolution path/to/file |
| Compare two versions of a file | gitsema file-diff <ref1> <ref2> path/to/file |
| Understand how a concept evolved | gitsema evolution "query" |
| Find functions or classes by meaning | gitsema code-search "query" |
| Detect when major semantic shifts happened | gitsema change-points "query" |
| See which commits diverged most semantically | gitsema cluster-diff <ref1> <ref2> |
| Understand who "owns" a concept | gitsema author "query" |
| Find stale or dead concepts | gitsema dead-concepts |
| Assess code health over time | gitsema health, gitsema debt |
| Find security-pattern matches | gitsema security-scan |
| Review a PR semantically | gitsema code-review, gitsema branch-summary, gitsema merge-audit |
| Find refactor candidates | gitsema refactor-candidates |
| Find doc coverage gaps | gitsema doc-gap |
| Triage an incident | gitsema triage "query" |
| Run a full analysis workflow | gitsema workflow run <template> |
| Run an interactive search session | gitsema repl |
| Set up a team server | gitsema tools serve --port 4242 --key <token> |
| Expose to Claude / AI assistants | gitsema tools mcp |
| Search across multiple repos | gitsema repos add, gitsema search "query" --repos <ids> |
| Add narrated summaries to any output | Append --narrate to most commands |
| Output JSON / HTML / Markdown | Append --out json, --out html, or --out markdown |
See docs/playbooks.md for role-based recipes (solo dev, PR reviewer, security engineer, release manager).
Interactive setup wizard. Detects your environment, walks through provider configuration (Ollama or HTTP), runs a test embedding, and records settings to .gitsema/config.json.
gitsema quickstart
Use this the first time you set up gitsema in a new repo or on a new machine.
Show index statistics and database path. Also displays embed config provenance (provider, model, dimensions, chunker) recorded from previous index runs.
gitsema status
Show index coverage status — read-only, no writes. Displays Git-reachable blob counts and per-embedding-model coverage, including file-level, chunk-level, symbol-level and module-level stats.
One database can hold embeddings from multiple models simultaneously; this command reports coverage for each.
Output includes:
DB path and schema version
Git-reachable blob count (true 100% denominator — all refs)
DB blob count (what gitsema has seen)
Per embed-config / model:
file blobs embedded + coverage %
chunks, symbols, modules embedded (where present)
Walk the Git history and embed all blobs into the index. Starts from HEAD first (fastest time-to-first-results) then walks history. Already-indexed blobs are skipped automatically (content-addressed deduplication).
Uses the currently configured embedding model (GITSEMA_MODEL / gitsema config) unless overridden by --model.
Options:
--since <ref> Only index commits after this point.
Accepts a date (2024-01-01), tag (v1.0), or commit hash.
Use "all" to force a full re-index.
--max-commits <n> Stop after indexing this many commits.
--concurrency <n> Parallel embedding calls (default: 4). Increase on fast
hardware; decrease if the embedding server throttles.
--embed-batch-size <n> Batch size for embedding API calls.
--ext <extensions> Only index files with these extensions, e.g. ".ts,.js,.py"
--include-glob <patterns> Only index paths matching these glob patterns (comma-separated).
--max-size <size> Skip blobs larger than this (e.g. "200kb", "1mb"; default: 200kb)
--exclude <patterns> Skip blobs whose path contains any of these substrings.
--chunker <strategy> Chunking strategy: file (default), function, or fixed.
--level <granularity> Alias for --chunker: blob/file, function, fixed, multi.
--window-size <n> Characters per chunk for the fixed chunker (default: 1500).
--overlap <n> Character overlap between adjacent fixed chunks (default: 200).
--file <paths...> Index specific file(s) from HEAD (repeatable).
--model <model> Override all embedding models for this run.
--text-model <model> Override the text/prose embedding model.
--code-model <model> Override the code embedding model.
--quantize Enable Int8 scalar quantization of stored vectors.
--build-vss Build the HNSW vector index immediately after indexing.
--auto-build-vss [n] Auto-build VSS when total blobs exceed n (default: 10000).
--remote <url> Proxy embedding calls to a remote gitsema server.
--branch <name> Tag indexed blobs as belonging to this branch.
--profile <preset> Apply a preset: speed, balanced, or quality.
--allow-mixed Skip embed-config compatibility check (allow mixing
different models/dimensions in the same index).
Examples:
# Start full index from HEAD first, then walk history
gitsema index start
# Only TypeScript files added since a tag
gitsema index start --since v1.2.0 --ext ".ts,.tsx"
# Use function-level chunking with higher concurrency
gitsema index start --chunker function --concurrency 8
# Index specific files from HEAD
gitsema index start --file docs/PLAN.md src/cli/commands/index.ts --concurrency 2
# Force full re-index with a different model
gitsema index start --since all --model text-embedding-3-small
Ask a remote gitsema tools serve instance to clone and index a Git repository.
Populate FTS5 content for blobs indexed before Phase 11. Required to use --hybrid search on older index entries.
Run integrity checks and report the health of the index database.
gitsema index doctor
Checks performed:
- Schema version vs expected version
- Blob / embedding / FTS row counts
- Missing FTS rows (suggests gitsema index backfill-fts)
- Orphan embeddings (suggests gitsema index gc)
- SQLite integrity check (PRAGMA integrity_check)
- Stored embed config provenance (provider, model, dimensions, chunker)
Exits with code 1 if critical issues (integrity failures or schema mismatch) are detected.
Run VACUUM and ANALYZE on the SQLite index database. Compacts the file and refreshes query planner statistics. Safe to run at any time.
gitsema index vacuum
Rebuild the FTS5 full-text search index from stored data. Use after bulk deletions or if hybrid search returns stale results.
gitsema index rebuild-fts # prompts for confirmation
gitsema index rebuild-fts --yes # skip confirmation
Garbage-collect unreachable blob records from the DB (blobs not reachable from any Git ref).
gitsema index gc
gitsema index gc --dry-run # preview what would be removed
Delete all stored embeddings and cache entries for a specific model.
gitsema index clear-model nomic-embed-text
gitsema index clear-model text-embedding-3-small --yes
Recalculate module (directory) centroid embeddings from stored whole-file embeddings.
gitsema index update-modules
Build a usearch HNSW ANN index from stored embeddings for fast approximate search. Requires the usearch optional package.
gitsema index build-vss
gitsema index build-vss --model text-embedding-3-small
Note: The old top-level forms (gitsema doctor, gitsema vacuum, gitsema backfill-fts, etc.) still work as deprecated aliases and will print a migration hint.
Manage embedding model configurations. Different models can use different providers, base URLs, and API keys. Model profiles are stored in .gitsema/config.json (local) or ~/.config/gitsema/config.json (global, --global).
Subcommands:
| Subcommand | Description |
|---|---|
| gitsema models list | List all configured profiles and indexed models |
| gitsema models info <name> | Show provider config + index stats for a model |
| gitsema models add <name> | Configure provider settings for a model |
| gitsema models remove <name> | Remove a model profile from config |
# List all models (from index + config profiles)
gitsema models list
# Show detailed info for a model
gitsema models info text-embedding-3-small
# Add an OpenAI model with its own provider config
gitsema models add text-embedding-3-small \
--provider http \
--url https://api.openai.com \
--key sk-... \
--set-text # also set as default text model
# Add a local Ollama model
gitsema models add nomic-embed-text --provider ollama --set-default
# Remove a profile (keep index data)
gitsema models remove text-embedding-3-small
# Remove a profile AND purge all its embeddings from the index
gitsema models remove text-embedding-3-small --purge-index
Per-model provider settings override the global GITSEMA_PROVIDER / GITSEMA_HTTP_URL / GITSEMA_API_KEY environment variables, so you can use Ollama for one model and OpenAI for another in the same repo.
The --level flag on gitsema index start is a convenience alias for --chunker:
| --level | --chunker equivalent | Description |
|---|---|---|
| blob or file | file (default) | One embedding per file |
| function | function | Function and class boundaries |
| fixed | fixed | Fixed-size sliding windows |
gitsema index start --level function # embed at function granularity
gitsema index start --level blob # one embedding per file (default)
gitsema search "auth middleware" --level function # search function-level embeddings
Tip: Use --level function on index start and --level function on search together for function-granularity semantic search.
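As a rough illustration of the fixed strategy, the Python sketch below cuts text into overlapping character windows using the documented defaults (--window-size 1500, --overlap 200). The function name `fixed_chunks` and the handling of the final window are assumptions; the real chunker may differ in detail.

```python
def fixed_chunks(text, window_size=1500, overlap=200):
    """Split text into fixed-size character windows with overlap.

    Returns a list of (start_offset, chunk) pairs.
    """
    if not 0 <= overlap < window_size:
        raise ValueError("overlap must be >= 0 and smaller than window_size")
    step = window_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append((start, text[start:start + window_size]))
        if start + window_size >= len(text):
            break   # this window already reaches the end of the text
    return chunks


starts = [start for start, _ in fixed_chunks("x" * 4000)]
print(starts)   # [0, 1300, 2600]
```

Each window begins 1300 characters after the previous one, so the last 200 characters of a chunk reappear at the start of the next.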
Start the gitsema MCP server over stdio. Allows AI assistants (Claude, VS Code Copilot, etc.) to query the semantic index via the Model Context Protocol.
gitsema tools mcp
Alias: gitsema mcp still works but is deprecated. Use gitsema tools mcp.
Start the LSP semantic hover server. Responds to hover requests with nearest-neighbor blobs.
gitsema tools lsp # stdio (default)
gitsema tools lsp --tcp 7777
Start the gitsema HTTP API server so remote machines can delegate embedding and storage to a central host. Replaces the deprecated top-level gitsema serve command.
Options:
--port <n> Port to listen on (default: 4242)
--key <token> Require this Bearer token on all requests
--ui Serve the embedded 2D codebase map web UI at /ui
P2 operational features exposed by the HTTP server:
| Endpoint | Description |
|---|---|
| GET /metrics | Prometheus metrics scrape (protected by auth; set GITSEMA_METRICS_PUBLIC=1 to bypass) |
| GET /openapi.json | OpenAPI 3.1 spec (always public) |
| GET /docs | Swagger UI (always public) |
Rate limiting env vars:
| Variable | Default | Description |
|---|---|---|
| GITSEMA_RATE_LIMIT_RPM | 300 | Requests per minute per token/IP |
| GITSEMA_RATE_LIMIT_BURST | value of GITSEMA_RATE_LIMIT_RPM | Per-window burst allowance |
| GITSEMA_METRICS_PUBLIC | off | Set to 1 to expose /metrics without auth |
| GITSEMA_MAX_BODY_SIZE | 1mb | Max request body size (e.g. 2mb, 512kb) |
For full deployment instructions (systemd, Docker, secrets, backups) see docs/deploy.md.
Alias: gitsema serve still works but is deprecated. Use gitsema tools serve.
Semantically search the index.
Options:
-k, --top <n> Number of results (default: 10)
--level <granularity> Search at: file, chunk, or symbol level (default: symbol)
--threshold <n> Minimum similarity score 0–1 to include a result (default: 0)
--recent Blend cosine similarity with a recency score
--alpha <n> Cosine weight in blended score (0–1, default: 0.8)
--before <date> Only blobs first seen before this date (YYYY-MM-DD)
--after <date> Only blobs first seen after this date (YYYY-MM-DD)
--weight-vector <n> Vector weight in three-signal ranking (default: 0.7)
--weight-recency <n> Recency weight (default: 0.2)
--weight-path <n> Path-relevance weight (default: 0.1)
--group <mode> Group results by: file, module, or commit
--chunks Include chunk-level embeddings in results
--hybrid Combine vector similarity with BM25 keyword matching
--bm25-weight <n> BM25 weight in hybrid score (default: 0.3)
--branch <name> Restrict results to blobs seen on this branch
--model <model> Override query embedding model
--vss Use the HNSW approximate nearest-neighbour index
--repos <ids> Comma-separated repo IDs for multi-repo search
--narrate Generate an LLM summary of the results
--out <spec> Output format (repeatable): text, json[:file], html[:file],
markdown[:file]
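The weighting options can be read as coefficients of simple linear blends. The Python sketch below shows one plausible interpretation using the documented defaults. The formulas themselves are assumptions (the text names the weights but not how they are combined), all inputs are taken to be normalized to the range 0 to 1, and the function names are hypothetical.

```python
def blended_score(cosine, recency, alpha=0.8):
    """--recent / --alpha: mix cosine similarity with a recency score."""
    return alpha * cosine + (1 - alpha) * recency


def three_signal_score(vector, recency, path, weights=(0.7, 0.2, 0.1)):
    """--weight-vector / --weight-recency / --weight-path as a weighted sum."""
    w_vec, w_rec, w_path = weights
    return w_vec * vector + w_rec * recency + w_path * path


def hybrid_score(vector, bm25, bm25_weight=0.3):
    """--hybrid / --bm25-weight: mix vector similarity with a normalized BM25 score."""
    return (1 - bm25_weight) * vector + bm25_weight * bm25


print(round(blended_score(0.9, 0.5), 3))            # 0.82
print(round(three_signal_score(0.9, 0.5, 1.0), 3))  # 0.83
print(round(hybrid_score(0.9, 0.4), 3))             # 0.75
```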
Examples:
gitsema search "authentication middleware"
gitsema search "database connection pool" --top 20
gitsema search "rate limiting" --recent --after 2024-01-01
gitsema search "error handling" --hybrid
Find when a concept first appeared in the codebase, sorted chronologically.
Options:
-k, --top <n> Number of results (default: 10)
--hybrid Combine vector + BM25 search
--bm25-weight <n> BM25 weight in hybrid score (default: 0.3)
--include-commits Also search commit messages
--branch <name> Restrict to this branch
--model <model> Override query embedding model
--narrate Generate an LLM summary
--dump [file] Output JSON to file or stdout
--out <spec> Output format (repeatable): text, json[:file], html[:file],
markdown[:file]
gitsema first-seen "JWT token validation"
gitsema first-seen "rate limiting" --hybrid --include-commits
Find historical concepts that no longer exist in HEAD but are semantically similar to current code.
Options:
-k, --top <n> Number of results (default: 10)
--since <date> Only consider blobs whose latest commit is on or after this date
--branch <name> Restrict to this branch
--dump [file] Output structured JSON
--out <spec> Output format (repeatable)
Interactive semantic exploration REPL. Provides a persistent session where you can run search, first-seen, evolution, and other queries without re-embedding the query each time.
gitsema repl
Inside the REPL, type a query to search, or prefix it with a command name (e.g. first-seen auth, evolution "error handling"). Type help for available commands, exit to quit.
Track the semantic drift of a file across its Git history.
See also: file-diff, evolution
Options:
--threshold <n> Cosine distance above which a version change is flagged (default: 0.3)
--dump [file] Output structured JSON; writes to <file> or stdout if omitted
--include-content Include stored file content in the JSON dump (requires --dump)
--alerts [n] Show the top-N largest semantic jumps (default: 5)
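To show what the --threshold option measures, here is a small Python sketch that computes the cosine distance between consecutive versions of a file and flags those above 0.3. The toy two-dimensional vectors and the function names are made up for illustration, and the vectors are assumed to be non-zero.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity; 0 means the vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def flag_jumps(versions, threshold=0.3):
    """Return (index, distance) for versions that drifted past the threshold
    relative to the version immediately before them."""
    flagged = []
    for i in range(1, len(versions)):
        distance = cosine_distance(versions[i - 1], versions[i])
        if distance > threshold:
            flagged.append((i, round(distance, 3)))
    return flagged


history = [[1.0, 0.0], [0.9, 0.1], [0.1, 0.9]]   # toy embeddings, oldest first
print(flag_jumps(history))   # only the third version is flagged
```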
gitsema file-evolution src/core/auth/middleware.ts
gitsema file-evolution src/core/auth/middleware.ts --dump evolution.json
Compute the semantic diff between two versions of a file.
See also: file-evolution, cluster-diff, diff
Options:
--neighbors <n> Number of nearest-neighbour blobs to show for each version (default: 0)
gitsema file-diff HEAD~10 HEAD src/api/router.ts
Alias: gitsema semantic-blame (backward-compatible)
Show the semantic origin of each logical block in a file — nearest-neighbour blame.
See also: file-evolution, impact
Options:
-k, --top <n> Number of nearest-neighbor blobs to show per block (default: 3)
--dump [file] Output structured JSON
Compute semantically similar blobs across the codebase to highlight refactor impact.
Options:
-k, --top <n> Number of similar blobs to return (default: 10)
--chunks Include chunk-level embeddings for finer-grained coupling
--dump [file] Output structured JSON
Alias: gitsema concept-evolution (backward-compatible)
Show how a semantic concept evolved across the entire commit history.
See also: file-evolution, first-seen, diff
Options:
-k, --top <n> Number of top-matching blobs to include (default: 50)
--threshold <n> Cosine distance threshold for flagging large changes (default: 0.3)
--dump [file] Output structured JSON
--html [file] Output an interactive HTML visualization
--include-content Include stored file content in the JSON dump (requires --dump)
gitsema evolution "authentication"
gitsema concept-evolution "authentication" # backward-compatible alias
Compute a conceptual/semantic diff of a topic across two git refs. Shows which blobs matching the topic were gained (new in ref2), lost (in ref1 but no longer in ref2), and stable (present in both). Each group is ranked by topic relevance, so the most relevant files for the topic appear at the top.
See also: evolution, file-diff, cluster-diff
Arguments:
query Topic or concept to compare across the two refs
Options:
-k, --top <n> Max results per group (gained/lost/stable) (default: 10)
--dump [file] Output structured JSON
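The gained/lost/stable grouping amounts to set arithmetic over the topic matches at each ref. A minimal Python sketch follows, assuming each ref's matches arrive as a mapping from an identifier to a relevance score. The function name, the input shape, and ranking the stable group by its ref2 score are all assumptions.

```python
def concept_diff(ref1_hits, ref2_hits, top=10):
    """Classify topic matches from two refs.

    Each argument maps an identifier (path or blob hash) to its relevance
    score for the query at that ref.
    """
    def ranked(ids, score):
        return sorted(ids, key=score, reverse=True)[:top]

    gained = ranked(ref2_hits.keys() - ref1_hits.keys(), ref2_hits.get)
    lost = ranked(ref1_hits.keys() - ref2_hits.keys(), ref1_hits.get)
    stable = ranked(ref1_hits.keys() & ref2_hits.keys(), ref2_hits.get)
    return {"gained": gained, "lost": lost, "stable": stable}


# Hypothetical match scores for the query "authentication".
old = {"src/auth/session.ts": 0.81, "src/auth/basic.ts": 0.64}
new = {"src/auth/session.ts": 0.79, "src/auth/jwt.ts": 0.88}
print(concept_diff(old, new))
# {'gained': ['src/auth/jwt.ts'], 'lost': ['src/auth/basic.ts'], 'stable': ['src/auth/session.ts']}
```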
gitsema diff v1.0.0 HEAD "authentication"
gitsema diff 2024-01-01 2024-06-01 "error handling" --top 5
gitsema diff HEAD~20 HEAD "database access" --dump diff.json
Cluster all blob embeddings into semantic regions using k-means++ and display a concept graph.
See also: cluster-diff, cluster-timeline
Options:
--k <n> Number of clusters (default: 8)
--top <n> Top representative paths per cluster (default: 5)
--iterations <n> Max k-means iterations (default: 20)
--edge-threshold <n> Cosine similarity threshold for concept graph edges (default: 0.3)
--dump [file] Output structured JSON
--html [file] Output an interactive HTML visualization
--enhanced-labels Enhance cluster labels using TF-IDF path and identifier analysis
Compare semantic clusters between two points in history (temporal clustering).
See also: clusters, cluster-timeline, file-diff
gitsema cluster-diff v1.0.0 HEAD
gitsema cluster-diff 2024-01-01 2024-06-01
Show how semantic clusters shifted over the commit history as a multi-step timeline.
See also: clusters, cluster-diff
Options:
--k <n> Number of clusters per step (default: 8)
--steps <n> Number of evenly-spaced time checkpoints (default: 5)
--since <ref> Start date or git ref for the timeline
--until <ref> End date or git ref for the timeline
--html [file] Output an interactive HTML visualization
Detect conceptual change points for a semantic query across the entire commit history. For each indexed commit the command builds a weighted centroid from the top-k matching blobs visible at that point in time and reports commits where the centroid shifted sharply.
See also: concept-evolution, cluster-change-points
Options:
-k, --top <n> Top-k blobs used to define concept state per commit (default: 50)
--threshold <n> Cosine distance threshold to flag a change point (default: 0.3)
--top-points <n> Show top-N largest jumps (default: 5)
--since <ref> Limit commits from this point; accepts date (YYYY-MM-DD), tag, or hash
--until <ref> Limit commits up to this point; accepts date (YYYY-MM-DD), tag, or hash
--dump [file] Output structured JSON; writes to <file> or stdout if omitted
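The detection loop described above can be sketched in Python as follows. This is an illustration under assumptions, not the shipped algorithm: the weights are taken to be each blob's similarity to the query, the commit IDs and vectors are toy data, and the function names are hypothetical.

```python
import math


def weighted_centroid(vectors, weights):
    """Weighted mean of equally sized vectors."""
    total = sum(weights)
    dims = len(vectors[0])
    return [sum(w * v[d] for v, w in zip(vectors, weights)) / total for d in range(dims)]


def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return 1.0 - dot / (math.hypot(*a) * math.hypot(*b))


def change_points(commits, threshold=0.3):
    """commits: list of (commit_id, top_k_vectors, weights), oldest first."""
    centroids = [(cid, weighted_centroid(vs, ws)) for cid, vs, ws in commits]
    points = []
    for (before, c0), (after, c1) in zip(centroids, centroids[1:]):
        distance = cosine_distance(c0, c1)
        if distance > threshold:   # centroid shifted sharply between these commits
            points.append({"before": before, "after": after, "distance": round(distance, 3)})
    return points


commits = [
    ("a1b2c3d", [[1.0, 0.0], [0.8, 0.2]], [0.9, 0.7]),
    ("c0ffee1", [[0.9, 0.1], [0.8, 0.2]], [0.9, 0.8]),
    ("e4f5a6b", [[0.1, 0.9], [0.2, 0.8]], [0.9, 0.8]),
]
print(change_points(commits))   # one change point, between c0ffee1 and e4f5a6b
```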
gitsema change-points "authentication middleware"
gitsema change-points "database connection" --threshold 0.4 --top-points 3
gitsema change-points "error handling" --since 2024-01-01 --dump changes.json
Example JSON output (--dump):
{
"type": "concept-change-points",
"query": "authentication middleware",
"k": 50,
"threshold": 0.3,
"range": { "since": null, "until": null },
"points": [
{
"before": { "commit": "a1b2c3d", "date": "2023-06-15", "timestamp": 1686787200, "topPaths": ["src/auth/session.ts"] },
"after": { "commit": "e4f5a6b", "date": "2023-09-20", "timestamp": 1695168000, "topPaths": ["src/auth/jwt.ts"] },
"distance": 0.412
}
]
}
Detect semantic change points in a single file's Git history. Reports commits where the embedding distance between consecutive file versions exceeded the threshold.
See also: file-evolution, change-points
Options:
--threshold <n> Cosine distance threshold (default: 0.3)
--top-points <n> Show top-N largest jumps (default: 5)
--since <ref> Limit commits from this point; accepts date (YYYY-MM-DD), tag, or hash
--until <ref> Limit commits up to this point; accepts date (YYYY-MM-DD), tag, or hash
--dump [file] Output structured JSON; writes to <file> or stdout if omitted
gitsema file-change-points src/core/auth/middleware.ts
gitsema file-change-points src/api/router.ts --threshold 0.4 --top-points 3
gitsema file-change-points src/db/schema.ts --since v1.0 --dump schema-changes.json
Example JSON output (--dump):
{
"type": "file-change-points",
"path": "src/core/auth/middleware.ts",
"threshold": 0.3,
"range": { "since": null, "until": null },
"points": [
{
"before": { "commit": "a1b2c3d", "date": "2023-06-15", "timestamp": 1686787200, "blobHash": "abc1234..." },
"after": { "commit": "e4f5a6b", "date": "2023-09-20", "timestamp": 1695168000, "blobHash": "def5678..." },
"distance": 0.524
}
]
}
Detect change points in the repo's cluster structure across commit history. For each sampled commit the command runs k-means clustering over visible blobs, matches clusters between consecutive steps using greedy centroid similarity, and reports steps where the mean centroid shift score exceeded the threshold.
See also: cluster-timeline, change-points
Performance note: By default every indexed commit is evaluated. On large repositories use --max-commits to cap the number of commits sampled (they are selected evenly across the since–until range).
Options:
--k <n> Number of clusters per step (default: 8)
--threshold <n> Mean centroid shift threshold (default: 0.3)
--top-points <n> Show top-N largest shifts (default: 5)
--since <ref> Limit commits from this point; accepts date (YYYY-MM-DD), tag, or hash
--until <ref> Limit commits up to this point; accepts date (YYYY-MM-DD), tag, or hash
--max-commits <n> Cap commits evaluated; sampled evenly (omit to evaluate every commit)
--dump [file] Output structured JSON; writes to <file> or stdout if omitted
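A possible reading of "greedy centroid similarity" matching, sketched in Python: pair the most similar centroids first, then average the cosine distance over the matched pairs. The function names are hypothetical, and using the mean cosine distance as the shift score is an assumption about how the score is defined.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def greedy_match(prev, curr):
    """Pair centroids from two consecutive steps, most similar pairs first."""
    pairs = sorted(
        ((cosine_similarity(p, c), i, j)
         for i, p in enumerate(prev) for j, c in enumerate(curr)),
        reverse=True,
    )
    used_prev, used_curr, matches = set(), set(), []
    for sim, i, j in pairs:
        if i not in used_prev and j not in used_curr:   # each centroid is used once
            used_prev.add(i)
            used_curr.add(j)
            matches.append((i, j, sim))
    return matches


def mean_shift(prev, curr):
    """Mean cosine distance over the matched centroid pairs."""
    matches = greedy_match(prev, curr)
    return sum(1.0 - sim for _, _, sim in matches) / len(matches)


prev = [[1.0, 0.0], [0.0, 1.0]]
curr = [[0.1, 0.9], [0.9, 0.1]]   # same clusters, listed in a different order
print(mean_shift(prev, curr) > 0.3)   # False: small shift, no change point
```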
gitsema cluster-change-points
gitsema cluster-change-points --k 6 --threshold 0.4 --top-points 3
gitsema cluster-change-points --max-commits 200 --dump cluster-changes.json
Rank contributors by the number of distinct blobs they introduced and show which semantic clusters/concepts they worked on. No embedding provider required; it uses data already in the index.
Tip: Run gitsema clusters first to populate cluster labels. Without clusters, semantic areas are shown as cluster-<id>.
See also: author, contributor-profile
Options:
--top <n> Number of top contributors to show (default: 10)
--since <ref> Only count commits at or after this date (YYYY-MM-DD or ISO 8601)
--until <ref> Only count commits at or before this date (YYYY-MM-DD or ISO 8601)
--min-blobs <n> Suppress contributors with fewer than this many blobs (default: 1)
--top-clusters <n> Max semantic areas to show per contributor (default: 5)
--dump [file] Output structured JSON; writes to <file> or stdout if omitted
--html [file] Output an interactive HTML report; writes to <file> or experts.html
# Top 10 contributors overall
gitsema experts
# Top 5 contributors since 2024, with JSON output
gitsema experts --top 5 --since 2024-01-01 --dump experts.json
# Interactive HTML report
gitsema experts --html experts.html
Example text output:
Top 3 contributors by semantic area (since 2024-01-01)
1. Alice <[email protected]>
Blobs: 142
Semantic areas:
· auth-module [38 blobs] (src/auth/jwt.ts, src/auth/session.ts)
· api-routes [31 blobs] (src/routes/auth.ts)
· db-layer [12 blobs] (src/db/users.ts)
2. Bob <[email protected]>
Blobs: 97
Semantic areas:
· db-layer [44 blobs] (src/db/schema.ts, src/db/migrations.ts)
· tests [29 blobs] (tests/integration/db.test.ts)
Generates a semantic PR report combining semantic diff, impacted modules, change-point highlights, and reviewer suggestions. Designed for CI/bot ingestion.
| Flag | Default | Description |
|---|---|---|
| --ref1 <ref> | HEAD~1 | Earlier git ref |
| --ref2 <ref> | HEAD | Later git ref |
| --file <path> | (none) | File to compute semantic diff and impact for |
| --query <q> | (none) | Topic query for change-point highlights |
| -k, --top <n> | 10 | Top-k results per section |
| --since <date> | (none) | Only include reviewer activity after this date |
| --until <date> | (none) | Only include reviewer activity before this date |
| --dump [file] | (none) | Output JSON to <file> or stdout if no file given |
gitsema pr-report --file src/auth.ts
gitsema pr-report --ref1 main --ref2 feature/auth --dump report.json
Retrieval evaluation harness: measures search quality (P@k, R@k, MRR, latency) against a JSONL file of evaluation cases.
Each line of the JSONL file must be: { "query": "...", "expectedPaths": ["src/foo.ts"] }
| Flag | Default | Description |
|---|---|---|
| -k, --top <n> | 10 | Top-k results per query |
| --dump [file] | (none) | Write full JSON results to <file> or stdout |
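For readers unfamiliar with the metrics, the Python sketch below computes precision at k, recall at k, and reciprocal rank for one evaluation case in the JSONL shape shown above. The `results` list is a made-up ranking, and the harness itself may compute these slightly differently (for example, precision is divided by k here even when fewer than k results come back).

```python
import json


def precision_at_k(results, expected, k):
    """Fraction of the top-k slots filled by expected paths."""
    return sum(1 for path in results[:k] if path in expected) / k


def recall_at_k(results, expected, k):
    """Fraction of expected paths that appear in the top-k results."""
    top = set(results[:k])
    return sum(1 for path in expected if path in top) / len(expected)


def reciprocal_rank(results, expected):
    """1 / rank of the first expected path, or 0.0 if none was returned."""
    for rank, path in enumerate(results, start=1):
        if path in expected:
            return 1.0 / rank
    return 0.0


line = '{"query": "auth middleware", "expectedPaths": ["src/foo.ts", "src/bar.ts"]}'
case = json.loads(line)
expected = set(case["expectedPaths"])
results = ["src/baz.ts", "src/foo.ts", "src/qux.ts"]   # hypothetical ranked output

print(recall_at_k(results, expected, 3))    # 0.5
print(reciprocal_rank(results, expected))   # 0.5
```

MRR is the mean of the reciprocal rank over all cases in the file.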
gitsema eval eval-cases.jsonl --top 10
gitsema eval eval-cases.jsonl --dump eval-results.json
Semantic code review assistant. Compares the diff between two refs and surfaces analogous blobs from history (prior implementations, related patterns, and known-good precedents) to inform a review.
Options:
--base <ref> Base ref (default: main)
--head <ref> Head ref (default: HEAD)
--diff-file <file> Read diff from a file instead of computing from refs
--top <n> Analogues to show per hunk (default: 5)
--threshold <n> Minimum similarity score (default: 0.75)
--format <fmt> Output format: text (default) or json
gitsema code-review
gitsema code-review --base main --head feature/auth
Run a productized analysis workflow. Each template bundles multiple commands into a coherent, narrated report.
| Template | Description |
|---|---|
| pr-review | Semantic PR review: diff, analogues, reviewer suggestions |
| incident | Incident triage: first-seen, change-points, bisect, experts |
| onboarding | Codebase orientation: clusters, experts, concept map |
| release-audit | Release readiness: health, debt, security, dead-concepts |
| ownership-intel | Ownership heatmap and contributor profiles |
| arch-drift | Architectural drift detection via cluster timeline |
| knowledge-portal | Knowledge discovery portal for a concept area |
| regression-forecast | Predict regression risk from semantic change signals |
Options:
--query <text> Concept or topic to focus the workflow on
--file <path> File to analyze (used by pr-review)
--base <ref> Base git ref (used by pr-review, regression-forecast)
--role <topic> Alias for --query
-k, --top <n> Result limit per section (default: 5)
--format <fmt> Output format: markdown (default) or json
--out <spec> Output format (repeatable)
--dump [file] Output JSON to file or stdout
gitsema workflow run pr-review --base main
gitsema workflow run incident --query "payment timeout"
gitsema workflow run release-audit
List all available workflow templates with short descriptions.
gitsema workflow list
Incident triage bundle. Runs first-seen, change-points, semantic bisect, and expert suggestions in one pass, then assembles a structured report.
Options:
--top <n> Top results per section (default: 10)
--ref1 <ref> Earlier bound for bisect / change-points
--ref2 <ref> Later bound for bisect / change-points
--file <path> File to include semantic diff for
--dump [file] Output JSON
--out <spec> Output format (repeatable)
gitsema triage "payment timeout error"
gitsema triage "auth regression" --ref1 v2.0 --ref2 HEAD --dump triage.json
CI policy gates. Checks drift, debt, and security thresholds and exits non-zero when any gate fails, which makes it suitable for CI pipelines.
# Run all policy checks with defaults
gitsema policy check
# Override individual thresholds
gitsema policy check --max-debt-score 0.4 --max-drift 0.3
Options:
--max-debt-score <n> Fail if mean debt score exceeds this (default: 0.6)
--min-security-score <n> Fail if security similarity score drops below this
--max-drift <n> Fail if concept drift exceeds this threshold
--query <q> Query to scope drift and change-point checks
--dump [file] Output JSON report
The CLI exits with code 1 when any gate fails and 0 when all gates pass. The equivalent HTTP route returns 422 and 200 respectively.
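Because the result is reported through the exit code, a CI step can branch on it directly. The following is a minimal sketch, not part of gitsema: the `run_policy_gate` function and the `GITSEMA_BIN` variable are hypothetical names introduced here so the binary can be swapped out, and the thresholds are the ones from the override example above.

```shell
# Hypothetical CI wrapper. GITSEMA_BIN is an assumption of this sketch,
# not a gitsema setting; it defaults to the gitsema binary on PATH.
GITSEMA_BIN="${GITSEMA_BIN:-gitsema}"

run_policy_gate() {
  # A zero exit code means every gate passed; non-zero means at least one failed.
  if "$GITSEMA_BIN" policy check --max-debt-score 0.4 --max-drift 0.3; then
    echo "policy gates passed"
    return 0
  else
    echo "policy gates failed" >&2
    return 1
  fi
}
```

In a pipeline script, `run_policy_gate || exit 1` would stop the job when a gate fails.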
Ownership heatmap. Shows which authors own blobs that are semantically related to a query, weighted by recency and volume.
Options:
--top <n> Top blobs to consider (default: 20)
--window-days <n> Rolling window for recency weighting
--branch <name> Restrict to this branch
--dump [file] Output JSON
--out <spec> Output format (repeatable)
gitsema ownership "authentication middleware"
gitsema ownership "database migrations" --window-days 90
--early-cut <n> limits the candidate pool to n randomly sampled blobs before scoring. Useful for very large indexes (>100K blobs) to trade recall for speed.
gitsema search "authentication middleware" --early-cut 5000
--explain-llm outputs a provenance citation block for each result, formatted for injection into LLM prompts. Each block includes the file path, blob hash, first-seen date, score signals, and a content snippet.
gitsema search "authentication middleware" --explain-llm
--profile <name> applies a preset indexing profile that sets coherent defaults for concurrency, embed batch size, and chunker strategy.
| Profile | Concurrency | Batch size | Chunker | Best for |
|---|---|---|---|---|
| speed | 8 | 32 | file | Fast indexing on fast hardware |
| balanced | 4 | 16 | file | Default (auto-tuned) |
| quality | 2 | 4 | function | Deep chunk/symbol indexing |
gitsema index start --profile speed
gitsema index start --profile quality
Most commands support --out for controlling output format. The flag is repeatable so you can emit multiple formats at once.
| Value | Description |
|---|---|
| text | Human-readable terminal output (default) |
| json | JSON to stdout |
| json:<file> | JSON written to <file> |
| html | Interactive HTML to stdout |
| html:<file> | Interactive HTML written to <file> |
| markdown | Markdown to stdout |
| markdown:<file> | Markdown written to <file> |
gitsema search "auth" --out json:results.json --out text
gitsema clusters --out html:clusters.html
gitsema evolution "error handling" --out markdown:report.md
--dump [file] is a legacy alias for --out json[:file] and is still accepted.
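Each --out value has the shape `format` or `format:file`. As an illustration of how such a value splits, here is a small sketch using POSIX parameter expansion. The function names `out_format` and `out_target` are hypothetical, and this is not gitsema's own parser.

```shell
# Illustrative only: split an --out style spec into its two parts.

# Everything before the first colon is the format (the whole spec if there is no colon).
out_format() {
  printf '%s\n' "${1%%:*}"
}

# Everything after the first colon is the file; without a colon the output goes to stdout.
out_target() {
  case "$1" in
    *:*) printf '%s\n' "${1#*:}" ;;
    *)   printf '%s\n' "stdout" ;;
  esac
}
```

Under this reading, `--dump results.json` corresponds to the spec `json:results.json`, and a bare `--dump` to `json`.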
Appending --narrate to any command that supports it generates a plain-language summary of the results using an LLM. Configure the LLM endpoint with GITSEMA_LLM_URL (OpenAI-compatible).
gitsema evolution "authentication" --narrate
gitsema clusters --narrate
gitsema health --narrate
GET /api/v1/capabilities returns a machine-readable JSON manifest of all features supported by the running server, including version, provider models, and enabled features. Useful for client auto-configuration.
curl http://localhost:4242/api/v1/capabilities
All analysis commands available in the CLI are also exposed over HTTP. Authentication via GITSEMA_SERVE_KEY applies to all routes.
| Route | Description | Key request fields |
|---|---|---|
| POST /analysis/clusters | K-means cluster snapshot | k, topKeywords, branch |
| POST /analysis/change-points | Concept change-point detection | query, topK, threshold |
| POST /analysis/author | Author attribution for a concept | query, topK, topAuthors |
| POST /analysis/impact | Cross-module coupling for a file | file, topK |
| POST /analysis/semantic-diff | Semantic diff between two refs | ref1, ref2, query |
| POST /analysis/semantic-blame | Semantic origin of code blocks | filePath, content, topK |
| POST /analysis/dead-concepts | Deleted semantic blobs | topK, since |
| POST /analysis/merge-audit | Semantic collision detection before merge | branchA, branchB, threshold |
| POST /analysis/merge-preview | Merge semantic impact preview | branch, into |
| POST /analysis/branch-summary | Branch semantic summary vs base | branch, baseBranch |
| POST /analysis/experts | Reviewer / expert suggestions | topN, since, until |
| POST /analysis/security-scan | Vulnerability pattern similarity scan | top |
| POST /analysis/health | Time-bucketed codebase health timeline | buckets, branch |
| POST /analysis/debt | Technical debt scoring | top, branch |
| POST /analysis/doc-gap | Documentation gap analysis | top, threshold, branch |
| POST /analysis/contributor-profile | Contributor semantic profile | author, top, branch |
| POST /analysis/triage | Incident triage bundle (first-seen + change-points + bisect + experts) | query, top, ref1, ref2, file |
| POST /analysis/policy-check | Automated CI gate (debt / security / drift thresholds) | maxDebtScore, minSecurityScore, maxDrift, query |
| POST /analysis/ownership | Ownership heatmap by concept | query, top, windowDays |
| POST /analysis/workflow | Workflow template runner | template (pr-review\|incident\|release-audit), query, file, top |
| POST /analysis/eval | Inline retrieval evaluation (P@k, R@k, MRR) | cases (array of {query, expectedPaths}), top |
| POST /analysis/multi-repo-search | Search across multiple registered repos | query, repoIds, topK |
Note on security-scan: Results are semantic similarity scores, not confirmed vulnerabilities. Always perform manual review.
Note on policy-check: Returns HTTP 200 when all gates pass and 422 when any gate fails, which is convenient for CI integration.
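The status-code contract of policy-check can be turned into a CI exit code with a few lines of shell. This is a hedged sketch: both function names are hypothetical, the base URL (including any prefix such as the /api/v1 seen in the capabilities example) is passed in by the caller, and authentication is left out because the way GITSEMA_SERVE_KEY is sent is not described in this section. The request fields come from the route table above.

```shell
# Map the documented HTTP statuses to a CI-style exit code.
policy_status_to_exit() {
  case "$1" in
    200) return 0 ;;  # all gates passed
    422) return 1 ;;  # at least one gate failed
    *)   return 2 ;;  # anything else is treated as an infrastructure error (assumption)
  esac
}

# Hypothetical helper: POST thresholds to the server and translate the status.
# $1 is the base URL under which the /analysis routes are served.
check_policy_over_http() {
  status=$(curl -s -o /dev/null -w '%{http_code}' \
    -X POST -H 'Content-Type: application/json' \
    -d '{"maxDebtScore": 0.4, "maxDrift": 0.3}' \
    "$1/analysis/policy-check")
  policy_status_to_exit "$status"
}
```

Keeping the status mapping in its own function makes it easy to distinguish a failed gate (exit 1) from an unreachable or misconfigured server (exit 2).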
You can keep the semantic index in sync with your repository automatically by
installing the provided Git hook scripts. Once installed, gitsema index start runs
in the background after every git commit and every git pull / git merge,
with no manual intervention required.
| Hook | Trigger | Command run |
|---|---|---|
| post-commit | After every git commit | gitsema index start --since HEAD~1 |
| post-merge | After every git pull / git merge | gitsema index start --since ORIG_HEAD |
Both hooks are safe no-ops when:
- gitsema is not on your PATH, or
- the index has not been initialised yet (run gitsema index start once first).
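The behaviour described above can be sketched as a short post-commit hook. This is an illustration under stated assumptions, not the script shipped in scripts/hooks/: the `should_run_hook` helper is a hypothetical name, and treating the presence of .gitsema/index.db as "the index has been initialised" is a guess based on the storage location documented below.

```shell
#!/bin/sh
# Sketch of a post-commit hook; the real scripts in scripts/hooks/ may differ.

should_run_hook() {
  # No-op when gitsema is not installed or not on PATH.
  command -v gitsema >/dev/null 2>&1 || return 1
  # No-op when no index exists yet (assumed check: the SQLite file is present).
  [ -f .gitsema/index.db ] || return 1
}

if should_run_hook; then
  # Index only the newest commit, in the background, without blocking git.
  gitsema index start --since HEAD~1 >/dev/null 2>&1 &
fi
```

A post-merge variant would pass `--since ORIG_HEAD` instead, matching the table above.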
Copy the scripts into your repository's .git/hooks/ directory and make
them executable:
cp scripts/hooks/post-commit .git/hooks/post-commit
cp scripts/hooks/post-merge .git/hooks/post-merge
chmod +x .git/hooks/post-commit .git/hooks/post-merge
Alternatively, use symlinks so the scripts stay in sync whenever you pull
updates to the scripts/hooks/ directory:
ln -s ../../scripts/hooks/post-commit .git/hooks/post-commit
ln -s ../../scripts/hooks/post-merge .git/hooks/post-merge
The gitsema config command can install or remove the hooks automatically,
with no manual file copying required:
# Install hooks for the current repository (symlinks into .git/hooks/)
gitsema config set hooks.enabled true
# Remove the managed hooks
gitsema config set hooks.enabled false
The config value is persisted in .gitsema/config.json so hooks are
re-enabled automatically when you run gitsema config set hooks.enabled true
again after a re-clone. The manual copy/symlink steps above remain a valid
alternative if you prefer not to use the config command.
The index is stored in .gitsema/index.db (SQLite) in the root of the repository. Add it to .gitignore to avoid committing it:
.gitsema/
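If you script repository setup, the ignore entry can be added idempotently. The helper below is a small sketch (the function name `ensure_gitsema_ignored` is hypothetical) that appends the line only when it is not already present. It assumes it runs from the repository root and that an existing .gitignore ends with a newline.

```shell
# Append ".gitsema/" to .gitignore unless an identical line already exists.
ensure_gitsema_ignored() {
  # -x matches whole lines, -F treats the pattern as a fixed string, -q suppresses output.
  grep -qxF '.gitsema/' .gitignore 2>/dev/null || echo '.gitsema/' >> .gitignore
}
```

Running it more than once leaves a single entry in the file.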
See docs/features.md for the complete, grouped catalog of implemented features including indexing options, all search flags, history/temporal commands, clustering, branch/merge tools, the HTTP API route list, and all MCP tools.
For the latest deep review of bottlenecks, missing features, productization ideas, and AI-assisted coding workflows, see docs/review7.md.
A reusable AI-operator playbook is available at skill/gitsema-ai-assistant.md. Use it as a prompt scaffold for coding assistants that interact with gitsema.
See docs/PLAN.md for the full development roadmap, phase history, and backlog of planned features.
