gitsema

A content-addressed semantic index synchronized with Git's object model.

Gitsema walks your Git history, embeds every blob, and lets you semantically search your codebase — including across time. It treats blob hashes as the unit of identity, so identical content is only embedded once regardless of how many commits reference it.
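
As a rough illustration of why blob hashes make deduplication cheap, the sketch below computes the object ID Git assigns to a blob in a SHA-1 repository and uses it as a cache key. The `EmbeddingCache` class and `fake_embed` function are hypothetical stand-ins, not gitsema's actual storage layer or backend.

```python
import hashlib
from typing import Callable, Dict, List


def git_blob_hash(content: bytes) -> str:
    """Return the SHA-1 object ID Git assigns to a blob with this content."""
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


class EmbeddingCache:
    """Hypothetical in-memory store keyed by blob hash (not gitsema's schema)."""

    def __init__(self, embed: Callable[[bytes], List[float]]) -> None:
        self._embed = embed
        self._vectors: Dict[str, List[float]] = {}
        self.embed_calls = 0

    def get(self, content: bytes) -> List[float]:
        key = git_blob_hash(content)
        if key not in self._vectors:
            # Only unseen content reaches the embedding backend.
            self.embed_calls += 1
            self._vectors[key] = self._embed(content)
        return self._vectors[key]


def fake_embed(content: bytes) -> List[float]:
    # Stand-in for a real embedding backend.
    return [float(len(content))]


cache = EmbeddingCache(fake_embed)
for blob in [b"hello\n", b"hello\n", b"world\n"]:
    cache.get(blob)
print(cache.embed_calls)  # 2
```

The same file content committed a hundred times still maps to one key, so the embedding cost scales with distinct content rather than with commit count.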

Requirements

- Node.js 20+
- Git (must be on PATH)
- An embedding backend, either:
  - Ollama (local, default): ollama.ai with nomic-embed-text pulled
  - HTTP / OpenAI-compatible API: any endpoint that speaks the OpenAI embeddings API

Installation

Install from npm (requires Node.js >=20):

npm install -g gitsema

Or install from source:

git clone https://github.com/jsilvanus/gitsema.git
cd gitsema
pnpm install
pnpm build           # compiles TypeScript → dist/

# Optional: put `gitsema` on your PATH
pnpm setup           # one-time setup; then open a new terminal
pnpm link --global

To run without linking, invoke node dist/cli/index.js in place of gitsema.

Quick start

cd /path/to/your/git/repo

# 1. Start indexing (uses Ollama by default)
gitsema index start

# 2. Search
gitsema search "authentication middleware"

# 3. Check index coverage (per-model, multi-model aware)
gitsema index

Configuration (environment variables)

The settings below are read from environment variables. Set them in your shell or in a .env file loaded before running gitsema.

Provider selection

Variable	Default	Description
`GITSEMA_PROVIDER`	`ollama`	Embedding backend: `ollama`, `http`, or `embedeer`

Ollama provider (`GITSEMA_PROVIDER=ollama`)

Variable	Default	Description
`GITSEMA_MODEL`	`nomic-embed-text`	Ollama model to use for embeddings
`GITSEMA_TEXT_MODEL`	value of `GITSEMA_MODEL`	Model used for text/prose files
`GITSEMA_CODE_MODEL`	value of `GITSEMA_TEXT_MODEL`	Model used for source code files (overrides text model)

Ollama is assumed to be running at http://localhost:11434. Pull the model first:

ollama pull nomic-embed-text
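
The default chain in the table above (code model falls back to text model, which falls back to `GITSEMA_MODEL`) can be sketched as follows. The `resolve_models` helper is hypothetical, and treating an empty string as unset is an assumption rather than documented behaviour.

```python
from typing import Dict, Mapping

DEFAULT_MODEL = "nomic-embed-text"


def resolve_models(env: Mapping[str, str]) -> Dict[str, str]:
    """Resolve text and code models using the documented fallback order."""
    # Assumption: an empty value is treated the same as an unset variable.
    base = env.get("GITSEMA_MODEL") or DEFAULT_MODEL
    text = env.get("GITSEMA_TEXT_MODEL") or base
    code = env.get("GITSEMA_CODE_MODEL") or text
    return {"text": text, "code": code}


print(resolve_models({"GITSEMA_TEXT_MODEL": "prose-model"}))
# {'text': 'prose-model', 'code': 'prose-model'}
```

Setting only `GITSEMA_CODE_MODEL` leaves prose files on the base model while code files use the override.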

HTTP / OpenAI-compatible provider (`GITSEMA_PROVIDER=http`)

Variable	Default	Description
`GITSEMA_HTTP_URL`	(required)	Base URL of the embeddings API, e.g. `https://api.openai.com`
`GITSEMA_MODEL`	`nomic-embed-text`	Model name passed in the request body
`GITSEMA_TEXT_MODEL`	value of `GITSEMA_MODEL`	Model for text files
`GITSEMA_CODE_MODEL`	value of `GITSEMA_TEXT_MODEL`	Model for code files
`GITSEMA_API_KEY`	(optional)	Bearer token sent as `Authorization: Bearer <key>`

Example for OpenAI:

export GITSEMA_PROVIDER=http
export GITSEMA_HTTP_URL=https://api.openai.com
export GITSEMA_MODEL=text-embedding-3-small
export GITSEMA_API_KEY=sk-...
gitsema index start
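
For orientation, an OpenAI-style embeddings request carries the model name and input strings in a JSON body and the key in an `Authorization` header. The sketch below builds such a request without sending it. Appending `/v1/embeddings` to the base URL follows OpenAI's public API; whether gitsema constructs the path exactly this way is an assumption, and `build_embedding_request` is a hypothetical helper.

```python
import json
import urllib.request
from typing import List, Optional


def build_embedding_request(
    base_url: str,
    model: str,
    inputs: List[str],
    api_key: Optional[str] = None,
) -> urllib.request.Request:
    """Build (but do not send) an OpenAI-style embeddings request."""
    headers = {"Content-Type": "application/json"}
    if api_key:
        headers["Authorization"] = f"Bearer {api_key}"
    body = json.dumps({"model": model, "input": inputs}).encode("utf-8")
    # Assumption: the endpoint path is appended to the configured base URL.
    url = base_url.rstrip("/") + "/v1/embeddings"
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


req = build_embedding_request(
    "https://api.openai.com", "text-embedding-3-small", ["hello"], "sk-test"
)
print(req.full_url)  # https://api.openai.com/v1/embeddings
```

Passing the request to `urllib.request.urlopen` would send it; in the OpenAI response format each vector is found under `data[i]["embedding"]`.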

First-run profiling

Gitsema can generate a CPU profile during the first successful indexing run to help tune embedding concurrency and batchSize.

- Environment variable: GITSEMA_PROFILE_FIRST_RUN (truthy enables, falsy disables)
- Repo config: index.profileFirstRun (use gitsema config set index.profileFirstRun false --local to disable)
- Profiles are written into the indexed repo at .gitsema/profiles/embedeer-profile-<timestamp>.cpuprofile
- Precedence: the GITSEMA_PROFILE_FIRST_RUN environment variable overrides the repo config index.profileFirstRun.
Recommended: disable profiling in CI. Example (GitHub Actions):
```
env:
  GITSEMA_PROFILE_FIRST_RUN: '0'
```

By default, profiling is enabled on the first run when no prior embeddings exist. If an indexing attempt fails, a partial profile is still saved, but the "profile-done" marker is written only after a successful, full indexing run.
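
The precedence rule can be read as a three-step lookup: environment variable, then repo config, then the default (enabled). The sketch below is one way to express that. The set of strings counted as falsy and the `profile_first_run` helper are assumptions made for illustration; the first-run and "profile-done" checks are left out.

```python
from typing import Mapping

ENV_VAR = "GITSEMA_PROFILE_FIRST_RUN"
CONFIG_KEY = "index.profileFirstRun"
# Assumption: these spellings count as falsy; gitsema may accept others.
FALSY_STRINGS = {"", "0", "false", "no", "off"}


def _as_bool(value: object) -> bool:
    if isinstance(value, str):
        return value.strip().lower() not in FALSY_STRINGS
    return bool(value)


def profile_first_run(env: Mapping[str, str], repo_config: Mapping[str, object]) -> bool:
    """Environment variable wins over repo config; default is enabled."""
    if ENV_VAR in env:
        return _as_bool(env[ENV_VAR])
    if CONFIG_KEY in repo_config:
        return _as_bool(repo_config[CONFIG_KEY])
    return True


print(profile_first_run({ENV_VAR: "0"}, {CONFIG_KEY: True}))  # False
```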

Operational settings

Variable	Default	Description
`GITSEMA_VERBOSE`	off	Set to `1` for debug logging (same as `--verbose`)
`GITSEMA_REMOTE`	(optional)	Default remote `gitsema tools serve` URL; overridden per-command by `--remote`
`GITSEMA_LLM_URL`	(optional)	OpenAI-compatible URL for `--narrate` LLM summaries
`GITSEMA_LOG_MAX_BYTES`	`1048576`	Log rotation threshold (1 MB)

Commands

Commands are organised into groups. See docs/features.md for the full feature catalog.

Group	Commands
Setup	`quickstart`, `config`, `status`, `models`, `repos`
Indexing	`index` (status), `index start`, `index doctor`, `index vacuum`, `index backfill-fts`, `index rebuild-fts`, `index gc`, `index clear-model`, `index update-modules`, `index build-vss`, `index export`, `index import`, `remote-index`, `watch`
Protocol Servers	`tools mcp`, `tools serve`, `tools lsp`
Search & Discovery	`search`, `code-search`, `first-seen`, `dead-concepts`, `repl`
File History	`file-evolution`, `file-diff`, `blame`, `impact`, `file-change-points`
Concept History	`evolution`, `diff`, `author`, `lifecycle`
Cluster Analysis	`clusters`, `cluster-diff`, `cluster-timeline`
Change Detection	`change-points`, `file-change-points`, `cluster-change-points`
Branch / Merge	`branch-summary`, `merge-audit`, `merge-preview`, `cherry-pick-suggest`, `ci-diff`, `bisect`
Code Quality	`code-review`, `security-scan`, `health`, `debt`, `doc-gap`, `refactor-candidates`
Analysis	`author`, `contributor-profile`, `triage`, `policy`, `ownership`, `eval`, `cross-repo-similarity`, `pr-report`
Workflows	`workflow run <template>`, `workflow list`
Visualization	`map`, `heatmap`, `project`

Backward-compatible aliases: concept-evolution → evolution, semantic-blame → blame. The top-level gitsema mcp, gitsema serve, and gitsema lsp commands are deprecated; use gitsema tools mcp, gitsema tools serve, and gitsema tools lsp instead. The old DB maintenance commands (gitsema doctor, gitsema vacuum, gitsema gc, gitsema backfill-fts, gitsema rebuild-fts, gitsema update-modules, gitsema build-vss, gitsema clear-model) still work as hidden deprecated aliases and print a migration hint; use the gitsema index <subcommand> forms instead.

Find the right command by goal

Not sure which command to use? Search by what you want to accomplish:

I want to…	Command(s)
Get started with guided setup	`gitsema quickstart`
Index and search	`gitsema index start`, `gitsema search "query"`
See what's indexed / coverage	`gitsema index`, `gitsema status`
Find where a concept first appeared	`gitsema first-seen "query"`
Track how a file changed semantically over time	`gitsema file-evolution path/to/file`
Compare two versions of a file	`gitsema file-diff <ref1> <ref2> path/to/file`
Understand how a concept evolved	`gitsema evolution "query"`
Find functions or classes by meaning	`gitsema code-search "query"`
Detect when major semantic shifts happened	`gitsema change-points "query"`
See which commits diverged most semantically	`gitsema cluster-diff <ref1> <ref2>`
Understand who "owns" a concept	`gitsema author "query"`
Find stale or dead concepts	`gitsema dead-concepts`
Assess code health over time	`gitsema health`, `gitsema debt`
Find security-pattern matches	`gitsema security-scan`
Review a PR semantically	`gitsema code-review`, `gitsema branch-summary`, `gitsema merge-audit`
Find refactor candidates	`gitsema refactor-candidates`
Find doc coverage gaps	`gitsema doc-gap`
Triage an incident	`gitsema triage "query"`
Run a full analysis workflow	`gitsema workflow run <template>`
Run an interactive search session	`gitsema repl`
Set up a team server	`gitsema tools serve --port 4242 --key <token>`
Expose to Claude / AI assistants	`gitsema tools mcp`
Search across multiple repos	`gitsema repos add`, `gitsema search "query" --repos <ids>`
Add narrated summaries to any output	Append `--narrate` to most commands
Output JSON / HTML / Markdown	Append `--out json`, `--out html`, or `--out markdown`

See docs/playbooks.md for role-based recipes (solo dev, PR reviewer, security engineer, release manager).

Setup & Infrastructure

`gitsema quickstart`

Interactive setup wizard. Detects your environment, walks through provider configuration (Ollama or HTTP), runs a test embedding, and records settings to .gitsema/config.json.

gitsema quickstart

Use this the first time you set up gitsema in a new repo or on a new machine.

`gitsema status`

Show index statistics and database path. Also displays embed config provenance (provider, model, dimensions, chunker) recorded from previous index runs.

gitsema status

`gitsema index`

Show index coverage status — read-only, no writes. Displays Git-reachable blob counts and per-embedding-model coverage, including file-level, chunk-level, symbol-level and module-level stats.

One database can hold embeddings from multiple models simultaneously; this command reports coverage for each.

Output includes:
  DB path and schema version
  Git-reachable blob count (true 100% denominator — all refs)
  DB blob count (what gitsema has seen)
  Per embed-config / model:
    file blobs embedded + coverage %
    chunks, symbols, modules embedded (where present)

`gitsema index start [options]`

Walk the Git history and embed all blobs into the index. Starts from HEAD first (fastest time-to-first-results) then walks history. Already-indexed blobs are skipped automatically (content-addressed deduplication).

Uses the currently configured embedding model (GITSEMA_MODEL / gitsema config) unless overridden by --model.

Options:
  --since <ref>              Only index commits after this point.
                             Accepts a date (2024-01-01), tag (v1.0), or commit hash.
                             Use "all" to force a full re-index.
  --max-commits <n>          Stop after indexing this many commits.
  --concurrency <n>          Parallel embedding calls (default: 4). Increase on fast
                             hardware; decrease if the embedding server throttles.
  --embed-batch-size <n>     Batch size for embedding API calls.
  --ext <extensions>         Only index files with these extensions, e.g. ".ts,.js,.py"
  --include-glob <patterns>  Only index paths matching these glob patterns (comma-separated).
  --max-size <size>          Skip blobs larger than this (e.g. "200kb", "1mb"; default: 200kb)
  --exclude <patterns>       Skip blobs whose path contains any of these substrings.
  --chunker <strategy>       Chunking strategy: file (default), function, or fixed.
  --level <granularity>      Alias for --chunker: blob/file, function, fixed, multi.
  --window-size <n>          Characters per chunk for the fixed chunker (default: 1500).
  --overlap <n>              Character overlap between adjacent fixed chunks (default: 200).
  --file <paths...>          Index specific file(s) from HEAD (repeatable).
  --model <model>            Override all embedding models for this run.
  --text-model <model>       Override the text/prose embedding model.
  --code-model <model>       Override the code embedding model.
  --quantize                 Enable Int8 scalar quantization of stored vectors.
  --build-vss                Build the HNSW vector index immediately after indexing.
  --auto-build-vss [n]       Auto-build VSS when total blobs exceed n (default: 10000).
  --remote <url>             Proxy embedding calls to a remote gitsema server.
  --branch <name>            Tag indexed blobs as belonging to this branch.
  --profile <preset>         Apply a preset: speed, balanced, or quality.
  --allow-mixed              Skip embed-config compatibility check (allow mixing
                             different models/dimensions in the same index).

Examples:

# Start full index from HEAD first, then walk history
gitsema index start

# Only TypeScript files added since a tag
gitsema index start --since v1.2.0 --ext ".ts,.tsx"

# Use function-level chunking with higher concurrency
gitsema index start --chunker function --concurrency 8

# Index specific files from HEAD
gitsema index start --file docs/PLAN.md src/cli/commands/index.ts --concurrency 2

# Force full re-index with a different model
gitsema index start --since all --model text-embedding-3-small

`gitsema remote-index <repoUrl>`

Ask a remote gitsema tools serve instance to clone and index a Git repository.

`gitsema index backfill-fts`

Populate FTS5 content for blobs indexed before Phase 11. Required to use --hybrid search on older index entries.

`gitsema index doctor`

Run integrity checks and report the health of the index database.

gitsema index doctor

Checks performed:

- Schema version vs expected version
- Blob / embedding / FTS row counts
- Missing FTS rows (suggests gitsema index backfill-fts)
- Orphan embeddings (suggests gitsema index gc)
- SQLite integrity check (PRAGMA integrity_check)
- Stored embed config provenance (provider, model, dimensions, chunker)

Exits with code 1 if critical issues (integrity failures or schema mismatch) are detected.

`gitsema index vacuum`

Run VACUUM and ANALYZE on the SQLite index database. Compacts the file and refreshes query planner statistics. Safe to run at any time.

gitsema index vacuum

`gitsema index rebuild-fts`

Rebuild the FTS5 full-text search index from stored data. Use after bulk deletions or if hybrid search returns stale results.

gitsema index rebuild-fts        # prompts for confirmation
gitsema index rebuild-fts --yes  # skip confirmation

`gitsema index gc`

Garbage collect unreachable blob records from the DB (blobs not reachable from any Git ref).

gitsema index gc
gitsema index gc --dry-run  # preview what would be removed

`gitsema index clear-model <model>`

Delete all stored embeddings and cache entries for a specific model.

gitsema index clear-model nomic-embed-text
gitsema index clear-model text-embedding-3-small --yes

`gitsema index update-modules`

Recalculate module (directory) centroid embeddings from stored whole-file embeddings.

gitsema index update-modules
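
Conceptually, a module centroid summarises a directory by combining the embeddings of the files in it. A minimal sketch, assuming the centroid is the plain arithmetic mean and that a module is a file's immediate parent directory (gitsema's actual grouping and weighting may differ):

```python
import posixpath
from collections import defaultdict
from typing import Dict, List


def module_centroids(file_vectors: Dict[str, List[float]]) -> Dict[str, List[float]]:
    """Average whole-file embeddings per parent directory."""
    grouped: Dict[str, List[List[float]]] = defaultdict(list)
    for path, vector in file_vectors.items():
        # Files at the repository root are grouped under ".".
        grouped[posixpath.dirname(path) or "."].append(vector)
    return {
        module: [sum(column) / len(vectors) for column in zip(*vectors)]
        for module, vectors in grouped.items()
    }


print(module_centroids({"src/a.ts": [1.0, 0.0], "src/b.ts": [0.0, 1.0]}))
# {'src': [0.5, 0.5]}
```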

`gitsema index build-vss`

Build a usearch HNSW ANN index from stored embeddings for fast approximate search. Requires the usearch optional package.

gitsema index build-vss
gitsema index build-vss --model text-embedding-3-small

Note: The old top-level forms (gitsema doctor, gitsema vacuum, gitsema backfill-fts, etc.) still work as deprecated aliases and will print a migration hint.

`gitsema models`

Manage embedding model configurations. Different models can use different providers, base URLs, and API keys. Model profiles are stored in .gitsema/config.json (local) or ~/.config/gitsema/config.json (global, --global).

Subcommands:

Subcommand	Description
`gitsema models list`	List all configured profiles and indexed models
`gitsema models info <name>`	Show provider config + index stats for a model
`gitsema models add <name>`	Configure provider settings for a model
`gitsema models remove <name>`	Remove a model profile from config

# List all models (from index + config profiles)
gitsema models list

# Show detailed info for a model
gitsema models info text-embedding-3-small

# Add an OpenAI model with its own provider config
gitsema models add text-embedding-3-small \
  --provider http \
  --url https://api.openai.com \
  --key sk-... \
  --set-text                        # also set as default text model

# Add a local Ollama model
gitsema models add nomic-embed-text --provider ollama --set-default

# Remove a profile (keep index data)
gitsema models remove text-embedding-3-small

# Remove a profile AND purge all its embeddings from the index
gitsema models remove text-embedding-3-small --purge-index

Per-model provider settings override global GITSEMA_PROVIDER / GITSEMA_HTTP_URL / GITSEMA_API_KEY environment variables, so you can use Ollama for one model and OpenAI for another in the same repo.

`gitsema index start --level <level>`

The --level flag on gitsema index start is a convenience alias for --chunker:

`--level`	`--chunker` equivalent	Description
`blob` or `file`	`file` (default)	One embedding per file
`function`	`function`	Function and class boundaries
`fixed`	`fixed`	Fixed-size sliding windows

gitsema index start --level function     # embed at function granularity
gitsema index start --level blob         # one embedding per file (default)
gitsema search "auth middleware" --level function  # search function-level embeddings

Tip: Use --level function on index start and --level function on search together for function-granularity semantic search.
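
To make the `fixed` level concrete, the sketch below splits text into overlapping character windows using the documented defaults (1500 characters per window, 200 characters of overlap). It illustrates the sliding-window idea only; how gitsema handles line boundaries or the final short window is not stated here, so those details are assumptions.

```python
from typing import List


def fixed_chunks(text: str, window_size: int = 1500, overlap: int = 200) -> List[str]:
    """Split text into windows of window_size characters that share overlap characters."""
    if window_size <= 0:
        raise ValueError("window_size must be positive")
    if not 0 <= overlap < window_size:
        raise ValueError("overlap must be in [0, window_size)")
    step = window_size - overlap
    chunks: List[str] = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + window_size])
        if start + window_size >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks


print(fixed_chunks("abcdefghij", window_size=4, overlap=1))
# ['abcd', 'defg', 'ghij']
```

A larger overlap reduces the chance that a relevant passage is cut in half at a window boundary, at the cost of more chunks to embed.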

`gitsema tools mcp`

Start the gitsema MCP server over stdio. Allows AI assistants (Claude, VS Code Copilot, etc.) to query the semantic index via the Model Context Protocol.

gitsema tools mcp

Alias: gitsema mcp still works but is deprecated. Use gitsema tools mcp.

`gitsema tools lsp [--tcp <port>]`

Start the LSP semantic hover server. Responds to hover requests with nearest-neighbor blobs.

gitsema tools lsp          # stdio (default)
gitsema tools lsp --tcp 7777

`gitsema tools serve [options]`

Start the gitsema HTTP API server so remote machines can delegate embedding and storage to a central host. Replaces the deprecated top-level gitsema serve command.

Options:
  --port <n>      Port to listen on (default: 4242)
  --key <token>   Require this Bearer token on all requests
  --ui            Serve the embedded 2D codebase map web UI at /ui

Operational endpoints exposed by the HTTP server:

Endpoint	Description
`GET /metrics`	Prometheus metrics scrape (protected by auth; set `GITSEMA_METRICS_PUBLIC=1` to bypass)
`GET /openapi.json`	OpenAPI 3.1 spec (always public)
`GET /docs`	Swagger UI (always public)

Rate limiting and request-handling environment variables:

Variable	Default	Description
`GITSEMA_RATE_LIMIT_RPM`	`300`	Requests per minute per token/IP
`GITSEMA_RATE_LIMIT_BURST`	value of `GITSEMA_RATE_LIMIT_RPM`	Per-window burst allowance
`GITSEMA_METRICS_PUBLIC`	off	Set to `1` to expose `/metrics` without auth
`GITSEMA_MAX_BODY_SIZE`	`1mb`	Max request body size (e.g. `2mb`, `512kb`)

For full deployment instructions (systemd, Docker, secrets, backups) see docs/deploy.md.

Alias: gitsema serve still works but is deprecated. Use gitsema tools serve.

Search & Discovery

`gitsema search <query> [options]`

Semantically search the index.

Options:
  -k, --top <n>           Number of results (default: 10)
  --level <granularity>   Search at: file, chunk, or symbol level (default: symbol)
  --threshold <n>         Minimum similarity score 0–1 to include a result (default: 0)
  --recent                Blend cosine similarity with a recency score
  --alpha <n>             Cosine weight in blended score (0–1, default: 0.8)
  --before <date>         Only blobs first seen before this date (YYYY-MM-DD)
  --after <date>          Only blobs first seen after this date (YYYY-MM-DD)
  --weight-vector <n>     Vector weight in three-signal ranking (default: 0.7)
  --weight-recency <n>    Recency weight (default: 0.2)
  --weight-path <n>       Path-relevance weight (default: 0.1)
  --group <mode>          Group results by: file, module, or commit
  --chunks                Include chunk-level embeddings in results
  --hybrid                Combine vector similarity with BM25 keyword matching
  --bm25-weight <n>       BM25 weight in hybrid score (default: 0.3)
  --branch <name>         Restrict results to blobs seen on this branch
  --model <model>         Override query embedding model
  --vss                   Use the HNSW approximate nearest-neighbour index
  --repos <ids>           Comma-separated repo IDs for multi-repo search
  --narrate               Generate an LLM summary of the results
  --out <spec>            Output format (repeatable): text, json[:file], html[:file],
                          markdown[:file]

Examples:

gitsema search "authentication middleware"
gitsema search "database connection pool" --top 20
gitsema search "rate limiting" --recent --after 2024-01-01
gitsema search "error handling" --hybrid
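
The weighting options above suggest a linear combination of signals. The sketch below shows one plausible reading: `--alpha` as the cosine share of a two-signal blend, and the three `--weight-*` values as coefficients of a weighted sum. The exact formulas and the way gitsema derives recency and path scores are assumptions; here each signal is simply taken as a number in 0–1.

```python
def blended_score(cosine: float, recency: float, alpha: float = 0.8) -> float:
    """Two-signal blend: alpha weights cosine, the remainder weights recency."""
    return alpha * cosine + (1.0 - alpha) * recency


def three_signal_score(
    vector: float,
    recency: float,
    path: float,
    w_vector: float = 0.7,
    w_recency: float = 0.2,
    w_path: float = 0.1,
) -> float:
    """Weighted sum of vector similarity, recency, and path relevance."""
    return w_vector * vector + w_recency * recency + w_path * path


# (path, cosine similarity, recency score) for two hypothetical results.
results = [("old.ts", 0.90, 0.1), ("new.ts", 0.85, 0.9)]
ranked = sorted(results, key=lambda r: blended_score(r[1], r[2]), reverse=True)
print([path for path, _, _ in ranked])  # ['new.ts', 'old.ts']
```

With the default alpha of 0.8, a slightly less similar but much more recent blob can outrank an older one, which is the intent of `--recent`.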

`gitsema first-seen <query> [options]`

Find when a concept first appeared in the codebase, sorted chronologically.

See also: search, evolution

Options:
  -k, --top <n>           Number of results (default: 10)
  --hybrid                Combine vector + BM25 search
  --bm25-weight <n>       BM25 weight in hybrid score (default: 0.3)
  --include-commits       Also search commit messages
  --branch <name>         Restrict to this branch
  --model <model>         Override query embedding model
  --narrate               Generate an LLM summary
  --dump [file]           Output JSON to file or stdout
  --out <spec>            Output format (repeatable): text, json[:file], html[:file],
                          markdown[:file]

gitsema first-seen "JWT token validation"
gitsema first-seen "rate limiting" --hybrid --include-commits

`gitsema dead-concepts [options]`

Find historical concepts that no longer exist in HEAD but are semantically similar to current code.

See also: search, evolution

Options:
  -k, --top <n>       Number of results (default: 10)
  --since <date>      Only consider blobs whose latest commit is on or after this date
  --branch <name>     Restrict to this branch
  --dump [file]       Output structured JSON
  --out <spec>        Output format (repeatable)

`gitsema repl`

Interactive semantic exploration REPL. Provides a persistent session where you can run search, first-seen, evolution, and other queries without re-embedding the query each time.

gitsema repl

Inside the REPL, type a query to search, or prefix with a command name (e.g. first-seen auth, evolution "error handling"). Type help for available commands, exit to quit.

File History

`gitsema file-evolution <path> [options]`

Track the semantic drift of a file across its Git history.

See also: file-diff, evolution

Options:
  --threshold <n>       Cosine distance above which a version change is flagged (default: 0.3)
  --dump [file]         Output structured JSON; writes to <file> or stdout if omitted
  --include-content     Include stored file content in the JSON dump (requires --dump)
  --alerts [n]          Show the top-N largest semantic jumps (default: 5)

gitsema file-evolution src/core/auth/middleware.ts
gitsema file-evolution src/core/auth/middleware.ts --dump evolution.json

`gitsema file-diff <ref1> <ref2> <path>`

Compute the semantic diff between two versions of a file.

See also: file-evolution, cluster-diff, diff

Options:
  --neighbors <n>   Number of nearest-neighbour blobs to show for each version (default: 0)

gitsema file-diff HEAD~10 HEAD src/api/router.ts

`gitsema blame <file> [options]`

Alias: gitsema semantic-blame (backward-compatible)

Show the semantic origin of each logical block in a file — nearest-neighbour blame.

See also: file-evolution, impact

Options:
  -k, --top <n>   Number of nearest-neighbor blobs to show per block (default: 3)
  --dump [file]   Output structured JSON

`gitsema impact <path> [options]`

Find blobs across the codebase that are semantically similar to the given file, to highlight what a refactor may affect.

See also: blame, file-diff

Options:
  -k, --top <n>   Number of similar blobs to return (default: 10)
  --chunks        Include chunk-level embeddings for finer-grained coupling
  --dump [file]   Output structured JSON

Concept History

`gitsema evolution <query> [options]`

Alias: gitsema concept-evolution (backward-compatible)

Show how a semantic concept evolved across the entire commit history.

See also: file-evolution, first-seen, diff

Options:
  -k, --top <n>         Number of top-matching blobs to include (default: 50)
  --threshold <n>       Cosine distance threshold for flagging large changes (default: 0.3)
  --dump [file]         Output structured JSON
  --html [file]         Output an interactive HTML visualization
  --include-content     Include stored file content in the JSON dump (requires --dump)

gitsema evolution "authentication"
gitsema concept-evolution "authentication"   # backward-compatible alias

`gitsema diff <ref1> <ref2> <query> [options]`

Compute a semantic diff of a topic across two Git refs. Blobs matching the topic are grouped as gained (present only at ref2), lost (present only at ref1), or stable (present at both). Each group is ranked by topic relevance, so the most relevant files appear at the top.

See also: evolution, file-diff, cluster-diff

Arguments:
  query             Topic or concept to compare across the two refs

Options:
  -k, --top <n>     Max results per group (gained/lost/stable) (default: 10)
  --dump [file]     Output structured JSON

gitsema diff v1.0.0 HEAD "authentication"
gitsema diff 2024-01-01 2024-06-01 "error handling" --top 5
gitsema diff HEAD~20 HEAD "database access" --dump diff.json
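
The grouping can be pictured as set operations over the blobs that match the topic at each ref. In the sketch below, each input maps a blob hash to its relevance score for the query; the `concept_diff` helper and its input shape are hypothetical, chosen only to illustrate the gained / lost / stable split.

```python
from typing import Dict, Iterable, List


def concept_diff(
    ref1_scores: Dict[str, float],
    ref2_scores: Dict[str, float],
    top: int = 10,
) -> Dict[str, List[str]]:
    """Split matching blobs into gained, lost, and stable, ranked by relevance."""

    def ranked(hashes: Iterable[str], scores: Dict[str, float]) -> List[str]:
        return sorted(hashes, key=lambda h: scores[h], reverse=True)[:top]

    at_ref1, at_ref2 = set(ref1_scores), set(ref2_scores)
    return {
        "gained": ranked(at_ref2 - at_ref1, ref2_scores),
        "lost": ranked(at_ref1 - at_ref2, ref1_scores),
        "stable": ranked(at_ref1 & at_ref2, ref2_scores),
    }


diff = concept_diff(
    {"a": 0.9, "b": 0.5, "c": 0.7},
    {"b": 0.5, "c": 0.7, "d": 0.8},
)
print(diff)  # {'gained': ['d'], 'lost': ['a'], 'stable': ['c', 'b']}
```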

Cluster Analysis

`gitsema clusters [options]`

Cluster all blob embeddings into semantic regions using k-means++ and display a concept graph.

See also: cluster-diff, cluster-timeline

Options:
  --k <n>                 Number of clusters (default: 8)
  --top <n>               Top representative paths per cluster (default: 5)
  --iterations <n>        Max k-means iterations (default: 20)
  --edge-threshold <n>    Cosine similarity threshold for concept graph edges (default: 0.3)
  --dump [file]           Output structured JSON
  --html [file]           Output an interactive HTML visualization
  --enhanced-labels       Enhance cluster labels using TF-IDF path and identifier analysis
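
For readers unfamiliar with k-means++, its distinguishing step is the seeding: each new initial centroid is drawn with probability proportional to its squared distance from the nearest centroid already chosen, which spreads the seeds out. The sketch below shows that seeding step in plain Python using squared Euclidean distance. The distance metric and random source gitsema uses are not stated, so treat both as assumptions.

```python
import random
from typing import List, Sequence

Vector = Sequence[float]


def squared_distance(a: Vector, b: Vector) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_pp_init(points: List[Vector], k: int, seed: int = 0) -> List[Vector]:
    """Pick k initial centroids using k-means++ (D-squared) sampling."""
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and the number of points")
    rng = random.Random(seed)
    centroids: List[Vector] = [rng.choice(points)]
    while len(centroids) < k:
        # Weight each point by its squared distance to the nearest chosen centroid.
        weights = [min(squared_distance(p, c) for c in centroids) for p in points]
        if sum(weights) == 0:
            # Every point coincides with a chosen centroid; fall back to uniform choice.
            centroids.append(rng.choice(points))
        else:
            centroids.append(rng.choices(points, weights=weights, k=1)[0])
    return centroids


seeds = kmeans_pp_init([(0.0, 0.0), (1.0, 1.0)], k=2)
print(sorted(seeds))  # [(0.0, 0.0), (1.0, 1.0)]
```

After seeding, standard k-means alternates between assigning points to the nearest centroid and recomputing centroids, up to the `--iterations` limit.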

`gitsema cluster-diff <ref1> <ref2>`

Compare semantic clusters between two points in history (temporal clustering).

See also: clusters, cluster-timeline, file-diff

gitsema cluster-diff v1.0.0 HEAD
gitsema cluster-diff 2024-01-01 2024-06-01

`gitsema cluster-timeline`

Show how semantic clusters shifted over the commit history — multi-step timeline.

See also: clusters, cluster-diff

Options:
  --k <n>         Number of clusters per step (default: 8)
  --steps <n>     Number of evenly-spaced time checkpoints (default: 5)
  --since <ref>   Start date or git ref for the timeline
  --until <ref>   End date or git ref for the timeline
  --html [file]   Output an interactive HTML visualization

Change Detection

`gitsema change-points <query> [options]`

Detect conceptual change points for a semantic query across the entire commit history. For each indexed commit the command builds a weighted centroid from the top-k matching blobs visible at that point in time and reports commits where the centroid shifted sharply.

See also: evolution, cluster-change-points

Options:
  -k, --top <n>       Top-k blobs used to define concept state per commit (default: 50)
  --threshold <n>     Cosine distance threshold to flag a change point (default: 0.3)
  --top-points <n>    Show top-N largest jumps (default: 5)
  --since <ref>       Limit commits from this point; accepts date (YYYY-MM-DD), tag, or hash
  --until <ref>       Limit commits up to this point; accepts date (YYYY-MM-DD), tag, or hash
  --dump [file]       Output structured JSON; writes to <file> or stdout if omitted

gitsema change-points "authentication middleware"
gitsema change-points "database connection" --threshold 0.4 --top-points 3
gitsema change-points "error handling" --since 2024-01-01 --dump changes.json

Example JSON output (--dump):

{
  "type": "concept-change-points",
  "query": "authentication middleware",
  "k": 50,
  "threshold": 0.3,
  "range": { "since": null, "until": null },
  "points": [
    {
      "before": { "commit": "a1b2c3d", "date": "2023-06-15", "timestamp": 1686787200, "topPaths": ["src/auth/session.ts"] },
      "after":  { "commit": "e4f5a6b", "date": "2023-09-20", "timestamp": 1695168000, "topPaths": ["src/auth/jwt.ts"] },
      "distance": 0.412
    }
  ]
}
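
The detection idea described above (a weighted centroid per commit, then a distance check between consecutive centroids) can be sketched as follows. Using each blob's similarity score as its weight, and flagging a point when the distance is strictly greater than the threshold, are assumptions; the helper names are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]


def weighted_centroid(vectors: List[Vector], weights: List[float]) -> List[float]:
    """Weighted mean of vectors (weights are, for example, similarity scores)."""
    total = sum(weights)
    dims = len(vectors[0])
    return [sum(w * v[i] for v, w in zip(vectors, weights)) / total for i in range(dims)]


def cosine_distance(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm if norm else 1.0


def change_points(
    centroids: List[Tuple[str, Vector]], threshold: float = 0.3
) -> List[Dict[str, object]]:
    """Report consecutive commits whose centroids are further apart than threshold."""
    points = []
    for (prev_commit, prev), (commit, cur) in zip(centroids, centroids[1:]):
        distance = cosine_distance(prev, cur)
        if distance > threshold:
            points.append({"before": prev_commit, "after": commit, "distance": distance})
    return sorted(points, key=lambda p: p["distance"], reverse=True)


history = [("a1b2c3d", [1.0, 0.0]), ("c0ffee1", [1.0, 0.1]), ("e4f5a6b", [0.0, 1.0])]
for point in change_points(history):
    print(point["before"], "->", point["after"])  # c0ffee1 -> e4f5a6b
```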

`gitsema file-change-points <path> [options]`

Detect semantic change points in a single file's Git history. Reports commits where the embedding distance between consecutive file versions exceeded the threshold.

See also: file-evolution, change-points

Options:
  --threshold <n>     Cosine distance threshold (default: 0.3)
  --top-points <n>    Show top-N largest jumps (default: 5)
  --since <ref>       Limit commits from this point; accepts date (YYYY-MM-DD), tag, or hash
  --until <ref>       Limit commits up to this point; accepts date (YYYY-MM-DD), tag, or hash
  --dump [file]       Output structured JSON; writes to <file> or stdout if omitted

gitsema file-change-points src/core/auth/middleware.ts
gitsema file-change-points src/api/router.ts --threshold 0.4 --top-points 3
gitsema file-change-points src/db/schema.ts --since v1.0 --dump schema-changes.json

Example JSON output (--dump):

{
  "type": "file-change-points",
  "path": "src/core/auth/middleware.ts",
  "threshold": 0.3,
  "range": { "since": null, "until": null },
  "points": [
    {
      "before": { "commit": "a1b2c3d", "date": "2023-06-15", "timestamp": 1686787200, "blobHash": "abc1234..." },
      "after":  { "commit": "e4f5a6b", "date": "2023-09-20", "timestamp": 1695168000, "blobHash": "def5678..." },
      "distance": 0.524
    }
  ]
}
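
A dump in the shape shown above is straightforward to post-process, for instance to fail a CI step when a file changed too sharply. The sketch below filters the `points` array by distance, using the example output as input. The `large_jumps` helper and the 0.5 cut-off are illustrative choices, not part of gitsema.

```python
import json
from typing import List, Tuple

# The example dump from above, embedded as a string for a self-contained demo.
DUMP_TEXT = """
{
  "type": "file-change-points",
  "path": "src/core/auth/middleware.ts",
  "threshold": 0.3,
  "range": { "since": null, "until": null },
  "points": [
    {
      "before": { "commit": "a1b2c3d", "date": "2023-06-15", "timestamp": 1686787200, "blobHash": "abc1234..." },
      "after":  { "commit": "e4f5a6b", "date": "2023-09-20", "timestamp": 1695168000, "blobHash": "def5678..." },
      "distance": 0.524
    }
  ]
}
"""


def large_jumps(dump: dict, min_distance: float) -> List[Tuple[str, str, float]]:
    """Return (before commit, after commit, distance) for jumps at or above min_distance."""
    return [
        (p["before"]["commit"], p["after"]["commit"], p["distance"])
        for p in dump["points"]
        if p["distance"] >= min_distance
    ]


dump = json.loads(DUMP_TEXT)
print(large_jumps(dump, 0.5))  # [('a1b2c3d', 'e4f5a6b', 0.524)]
```

In practice the input would come from a file written with `--dump`, loaded via `json.load`.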

`gitsema cluster-change-points [options]`

Detect change points in the repo's cluster structure across commit history. For each sampled commit the command runs k-means clustering over visible blobs, matches clusters between consecutive steps using greedy centroid similarity, and reports steps where the mean centroid shift score exceeded the threshold.

See also: cluster-timeline, change-points

Performance note: By default every indexed commit is evaluated. On large repositories use --max-commits to cap the number of commits sampled (they are selected evenly across the since–until range).

Options:
  --k <n>             Number of clusters per step (default: 8)
  --threshold <n>     Mean centroid shift threshold (default: 0.3)
  --top-points <n>    Show top-N largest shifts (default: 5)
  --since <ref>       Limit commits from this point; accepts date (YYYY-MM-DD), tag, or hash
  --until <ref>       Limit commits up to this point; accepts date (YYYY-MM-DD), tag, or hash
  --max-commits <n>   Cap commits evaluated; sampled evenly (omit to evaluate every commit)
  --dump [file]       Output structured JSON; writes to <file> or stdout if omitted

gitsema cluster-change-points
gitsema cluster-change-points --k 6 --threshold 0.4 --top-points 3
gitsema cluster-change-points --max-commits 200 --dump cluster-changes.json
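
Greedy centroid matching, as described above, pairs clusters from one step with clusters from the next by repeatedly taking the most similar unmatched pair. A minimal sketch, assuming the per-pair shift is one minus cosine similarity and the step score is the mean over matched pairs (the exact scoring gitsema uses is not specified here):

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def greedy_match(prev: List[Vector], cur: List[Vector]) -> List[Tuple[int, int, float]]:
    """Pair clusters across steps, most similar pair first, each cluster used once."""
    candidates = sorted(
        ((cosine_similarity(p, c), i, j) for i, p in enumerate(prev) for j, c in enumerate(cur)),
        reverse=True,
    )
    used_prev, used_cur = set(), set()
    matches = []
    for similarity, i, j in candidates:
        if i in used_prev or j in used_cur:
            continue
        used_prev.add(i)
        used_cur.add(j)
        matches.append((i, j, similarity))
    return matches


def mean_centroid_shift(prev: List[Vector], cur: List[Vector]) -> float:
    """Mean of (1 - similarity) over matched cluster pairs."""
    matches = greedy_match(prev, cur)
    return sum(1.0 - sim for _, _, sim in matches) / len(matches)


# Same clusters in a different order: no shift after matching.
print(mean_centroid_shift([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]]))  # 0.0
```

Matching before measuring matters because k-means assigns arbitrary cluster indices at each step; comparing cluster 0 to cluster 0 directly would report spurious shifts.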

Repo Insights

`gitsema experts [options]`

Rank contributors by the number of distinct blobs they introduced and show which semantic clusters/concepts they worked on. No embedding provider required — uses data already in the index.

Tip: Run gitsema clusters first to populate cluster labels. Without clusters, semantic areas are shown as cluster-<id>.

See also: author, contributor-profile

Options:
  --top <n>           Number of top contributors to show (default: 10)
  --since <ref>       Only count commits at or after this date (YYYY-MM-DD or ISO 8601)
  --until <ref>       Only count commits at or before this date (YYYY-MM-DD or ISO 8601)
  --min-blobs <n>     Suppress contributors with fewer than this many blobs (default: 1)
  --top-clusters <n>  Max semantic areas to show per contributor (default: 5)
  --dump [file]       Output structured JSON; writes to <file> or stdout if omitted
  --html [file]       Output an interactive HTML report; writes to <file> or experts.html

# Top 10 contributors overall
gitsema experts

# Top 5 contributors since 2024, with JSON output
gitsema experts --top 5 --since 2024-01-01 --dump experts.json

# Interactive HTML report
gitsema experts --html experts.html

Example text output:

Top 3 contributors by semantic area (since 2024-01-01)

1. Alice <alice@example.com>
   Blobs: 142
   Semantic areas:
     · auth-module  [38 blobs]  (src/auth/jwt.ts, src/auth/session.ts)
     · api-routes   [31 blobs]  (src/routes/auth.ts)
     · db-layer     [12 blobs]  (src/db/users.ts)

2. Bob <bob@example.com>
   Blobs: 97
   Semantic areas:
     · db-layer     [44 blobs]  (src/db/schema.ts, src/db/migrations.ts)
     · tests        [29 blobs]  (tests/integration/db.test.ts)

`gitsema pr-report [options]`

Generates a semantic PR report combining semantic diff, impacted modules, change-point highlights, and reviewer suggestions. Designed for CI/bot ingestion.

Flag	Default	Description
`--ref1 <ref>`	`HEAD~1`	Earlier git ref
`--ref2 <ref>`	`HEAD`	Later git ref
`--file <path>`	—	File to compute semantic diff and impact for
`--query <q>`	—	Topic query for change-point highlights
`-k, --top <n>`	`10`	Top-k results per section
`--since <date>`	—	Only include reviewer activity after this date
`--until <date>`	—	Only include reviewer activity before this date
`--dump [file]`	—	Output JSON to `<file>` or stdout if no file given

gitsema pr-report --file src/auth.ts
gitsema pr-report --ref1 main --ref2 feature/auth --dump report.json

`gitsema eval <file> [options]`

Retrieval evaluation harness — measures search quality (P@k, R@k, MRR, latency) against a JSONL file of evaluation cases.

Each line of the JSONL file must be: { "query": "...", "expectedPaths": ["src/foo.ts"] }

Flag	Default	Description
`-k, --top <n>`	`10`	Top-k results per query
`--dump [file]`	—	Write full JSON results to `<file>` or stdout

gitsema eval eval-cases.jsonl --top 10
gitsema eval eval-cases.jsonl --dump eval-results.json
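
The metrics above have standard definitions. The sketch below computes them for a single case, assuming the harness compares the ranked result paths against expectedPaths. The `EvalCase` type mirrors the JSONL line format; `scoreCase` and the sample ranking are hypothetical and not taken from gitsema's source. MRR is the mean of the per-case reciprocal rank over all cases.

```typescript
// Illustrative only: standard P@k, R@k and reciprocal rank for one evaluation case.
// `scoreCase` is a hypothetical helper, not part of gitsema's API.
interface EvalCase {
  query: string;
  expectedPaths: string[];
}

interface CaseMetrics {
  precisionAtK: number;
  recallAtK: number;
  reciprocalRank: number;
}

function scoreCase(evalCase: EvalCase, rankedPaths: string[], k: number): CaseMetrics {
  const expected = new Set(evalCase.expectedPaths);
  const topK = rankedPaths.slice(0, k);
  // Number of distinct expected paths that appear in the top k results.
  const hits = new Set(topK.filter((p) => expected.has(p))).size;
  // Zero-based index of the first relevant result, or -1 if none was retrieved.
  const firstHit = topK.findIndex((p) => expected.has(p));
  return {
    precisionAtK: k > 0 ? hits / k : 0,
    recallAtK: expected.size > 0 ? hits / expected.size : 0,
    reciprocalRank: firstHit >= 0 ? 1 / (firstHit + 1) : 0,
  };
}

// Parse one JSONL line and score it against a made-up ranking.
const line = '{ "query": "auth middleware", "expectedPaths": ["src/foo.ts"] }';
const parsed = JSON.parse(line) as EvalCase;
const metrics = scoreCase(parsed, ["src/bar.ts", "src/foo.ts", "src/baz.ts"], 10);
console.log(metrics);
```

Averaging `reciprocalRank` across every line of the JSONL file gives the MRR for the whole run.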

Code Quality

`gitsema code-review [options]`

Semantic code review assistant. Compares the diff between two refs and surfaces analogous blobs from history — prior implementations, related patterns, and known-good precedents — to inform a review.

Options:
  --base <ref>          Base ref (default: main)
  --head <ref>          Head ref (default: HEAD)
  --diff-file <file>    Read diff from a file instead of computing from refs
  --top <n>             Analogues to show per hunk (default: 5)
  --threshold <n>       Minimum similarity score (default: 0.75)
  --format <fmt>        Output format: text (default) or json

gitsema code-review
gitsema code-review --base main --head feature/auth

Workflows

`gitsema workflow run <template> [options]`

Run a productized analysis workflow. Each template bundles multiple commands into a coherent, narrated report.

Template	Description
`pr-review`	Semantic PR review: diff, analogues, reviewer suggestions
`incident`	Incident triage: first-seen, change-points, bisect, experts
`onboarding`	Codebase orientation: clusters, experts, concept map
`release-audit`	Release readiness: health, debt, security, dead-concepts
`ownership-intel`	Ownership heatmap and contributor profiles
`arch-drift`	Architectural drift detection via cluster timeline
`knowledge-portal`	Knowledge discovery portal for a concept area
`regression-forecast`	Predict regression risk from semantic change signals

Options:
  --query <text>      Concept or topic to focus the workflow on
  --file <path>       File to analyze (used by pr-review)
  --base <ref>        Base git ref (used by pr-review, regression-forecast)
  --role <topic>      Alias for --query
  -k, --top <n>       Result limit per section (default: 5)
  --format <fmt>      Output format: markdown (default) or json
  --out <spec>        Output format (repeatable)
  --dump [file]       Output JSON to file or stdout

gitsema workflow run pr-review --base main
gitsema workflow run incident --query "payment timeout"
gitsema workflow run release-audit

`gitsema workflow list`

List all available workflow templates with short descriptions.

gitsema workflow list

Triage & Policy

`gitsema triage <query> [options]`

Incident triage bundle. Runs first-seen, change-points, semantic bisect, and expert suggestions in one pass, then assembles a structured report.

Options:
  --top <n>           Top results per section (default: 10)
  --ref1 <ref>        Earlier bound for bisect / change-points
  --ref2 <ref>        Later bound for bisect / change-points
  --file <path>       File to include semantic diff for
  --dump [file]       Output JSON
  --out <spec>        Output format (repeatable)

gitsema triage "payment timeout error"
gitsema triage "auth regression" --ref1 v2.0 --ref2 HEAD --dump triage.json

`gitsema policy [subcommand] [options]`

CI policy gates. Checks drift, debt, and security thresholds and exits non-zero when any gate fails — suitable for CI pipelines.

# Run all policy checks with defaults
gitsema policy check

# Override individual thresholds
gitsema policy check --max-debt-score 0.4 --max-drift 0.3

Options:
  --max-debt-score <n>    Fail if mean debt score exceeds this (default: 0.6)
  --min-security-score <n> Fail if security similarity score drops below this
  --max-drift <n>         Fail if concept drift exceeds this threshold
  --query <q>             Query to scope drift and change-point checks
  --dump [file]           Output JSON report

The CLI exits with code 1 when any gate fails and 0 when all gates pass. The equivalent HTTP route (POST /analysis/policy-check) returns 422 and 200 respectively.
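
The gate logic described by the options can be sketched as follows. All type and function names are hypothetical, and treating an unset threshold as "gate skipped" is an assumption; only the comparison directions and the 0.6 debt default come from the option descriptions.

```typescript
// Illustrative gate evaluation; names are hypothetical, not gitsema internals.
interface PolicyThresholds {
  maxDebtScore: number;      // documented default: 0.6
  minSecurityScore?: number; // ASSUMPTION: gate is skipped when unset
  maxDrift?: number;         // ASSUMPTION: gate is skipped when unset
}

interface PolicyMeasurements {
  meanDebtScore: number;
  securityScore: number;
  drift: number;
}

interface GateResult {
  gate: string;
  passed: boolean;
}

function evaluateGates(measured: PolicyMeasurements, limits: PolicyThresholds): GateResult[] {
  // Debt fails when the mean score exceeds the maximum.
  const results: GateResult[] = [
    { gate: "debt", passed: measured.meanDebtScore <= limits.maxDebtScore },
  ];
  // Security fails when the score drops below the minimum.
  if (limits.minSecurityScore !== undefined) {
    results.push({ gate: "security", passed: measured.securityScore >= limits.minSecurityScore });
  }
  // Drift fails when it exceeds the maximum.
  if (limits.maxDrift !== undefined) {
    results.push({ gate: "drift", passed: measured.drift <= limits.maxDrift });
  }
  return results;
}

// Exit code 0 when every gate passes, 1 otherwise.
function exitCodeFor(results: GateResult[]): number {
  return results.every((r) => r.passed) ? 0 : 1;
}

const gateResults = evaluateGates(
  { meanDebtScore: 0.45, securityScore: 0.9, drift: 0.35 },
  { maxDebtScore: 0.4, maxDrift: 0.3 },
);
console.log(gateResults, exitCodeFor(gateResults));
```

In a CI job, a non-zero exit code from the policy step is what stops the pipeline, so no extra parsing of the report is needed for a simple pass/fail gate.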

`gitsema ownership <query> [options]`

Ownership heatmap. Shows which authors own blobs that are semantically related to a query, weighted by recency and volume.

Options:
  --top <n>           Top blobs to consider (default: 20)
  --window-days <n>   Rolling window for recency weighting
  --branch <name>     Restrict to this branch
  --dump [file]       Output JSON
  --out <spec>        Output format (repeatable)

gitsema ownership "authentication middleware"
gitsema ownership "database migrations" --window-days 90
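
The exact weighting formula is not documented here. As one possible reading of "weighted by recency and volume", the sketch below uses a linear recency decay over the window and sums the weights per author. The formula, the `OwnedBlob` shape, and `ownershipScores` are all assumptions made for illustration.

```typescript
// ASSUMPTION: linear-decay recency weighting, summed per author.
// This is not gitsema's documented formula; it only illustrates the idea.
interface OwnedBlob {
  author: string;
  similarity: number; // semantic similarity to the query, 0..1
  ageDays: number;    // days since the blob was last touched by this author
}

function ownershipScores(blobs: OwnedBlob[], windowDays: number): Map<string, number> {
  if (windowDays <= 0) throw new Error("windowDays must be positive");
  const scores = new Map<string, number>();
  for (const blob of blobs) {
    // Recency falls linearly from 1 (today) to 0 (at or beyond the window edge).
    const recency = Math.max(0, 1 - blob.ageDays / windowDays);
    const weight = blob.similarity * recency;
    // Summing weights means authors with more matching blobs score higher (volume).
    scores.set(blob.author, (scores.get(blob.author) ?? 0) + weight);
  }
  return scores;
}

const heat = ownershipScores(
  [
    { author: "Alice", similarity: 1, ageDays: 0 },
    { author: "Alice", similarity: 1, ageDays: 45 },
    { author: "Bob", similarity: 1, ageDays: 90 },
  ],
  90,
);
console.log(Array.from(heat.entries()));
```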

Search Performance & AI Reliability

`--early-cut <n>` (on `gitsema search`)

Limits the candidate pool to n randomly sampled blobs before scoring. On very large indexes (more than 100K blobs) this trades recall for speed, because blobs outside the sample are never scored.

gitsema search "authentication middleware" --early-cut 5000
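
A minimal sketch of the idea, assuming candidates are scored by cosine similarity (the scoring function is an assumption here) and sampled uniformly with a partial Fisher-Yates shuffle. `Candidate`, `sampleCandidates`, and `searchWithEarlyCut` are hypothetical names, not gitsema's implementation.

```typescript
// Illustrative sketch of sampling before scoring; not gitsema's actual code.
interface Candidate {
  blobHash: string;
  vector: number[];
}

// Pick up to n items uniformly at random using a partial Fisher-Yates shuffle.
function sampleCandidates<T>(items: T[], n: number, random: () => number = Math.random): T[] {
  const pool = items.slice();
  const count = Math.min(Math.max(n, 0), pool.length);
  for (let i = 0; i < count; i++) {
    const j = i + Math.floor(random() * (pool.length - i));
    const tmp = pool[i] as T;
    pool[i] = pool[j] as T;
    pool[j] = tmp;
  }
  return pool.slice(0, count);
}

// ASSUMPTION: cosine similarity is used as the score.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  const len = Math.min(a.length, b.length);
  for (let i = 0; i < len; i++) {
    const x = a[i] ?? 0;
    const y = b[i] ?? 0;
    dot += x * y;
    normA += x * x;
    normB += y * y;
  }
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  return denom === 0 ? 0 : dot / denom;
}

// Score only the sampled candidates, then keep the topK best.
function searchWithEarlyCut(
  query: number[],
  index: Candidate[],
  earlyCut: number,
  topK: number,
): { blobHash: string; score: number }[] {
  return sampleCandidates(index, earlyCut)
    .map((c) => ({ blobHash: c.blobHash, score: cosineSimilarity(query, c.vector) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, topK);
}
```

Scoring cost scales with the sample size rather than the index size, which is where the speedup comes from. A relevant blob that is not sampled cannot appear in the results.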

`--explain-llm` (on `gitsema search`)

Outputs a provenance citation block for each result, formatted for injection into LLM prompts. Each block includes the file path, blob hash, first-seen date, score signals, and a content snippet.

gitsema search "authentication middleware" --explain-llm

`--profile <name>` (on `gitsema index start`)

Applies a preset indexing profile that sets coherent defaults for concurrency, embed batch size, and chunker strategy.

Profile	Concurrency	Batch size	Chunker	Best for
`speed`	8	32	file	Fast indexing on fast hardware
`balanced`	4	16	file	Default (auto-tuned)
`quality`	2	4	function	Deep chunk/symbol indexing

gitsema index start --profile speed
gitsema index start --profile quality
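
The profile table can be read as a lookup from profile name to defaults. The sketch below encodes it that way. Letting explicit flags override the profile values is an assumption, and `PROFILES` and `resolveSettings` are hypothetical names.

```typescript
// Values copied from the profile table; the override behavior is an ASSUMPTION.
type ProfileName = "speed" | "balanced" | "quality";

interface ProfileSettings {
  concurrency: number;
  batchSize: number;
  chunker: "file" | "function";
}

const PROFILES: Record<ProfileName, ProfileSettings> = {
  speed: { concurrency: 8, batchSize: 32, chunker: "file" },
  // The table marks balanced as "auto-tuned", so real values may differ at runtime.
  balanced: { concurrency: 4, batchSize: 16, chunker: "file" },
  quality: { concurrency: 2, batchSize: 4, chunker: "function" },
};

// Start from the profile defaults and let explicitly passed values win.
function resolveSettings(
  profile: ProfileName,
  overrides: Partial<ProfileSettings> = {},
): ProfileSettings {
  return { ...PROFILES[profile], ...overrides };
}

console.log(resolveSettings("quality"));
console.log(resolveSettings("speed", { concurrency: 2 }));
```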

`--out <spec>` — unified output (most commands)

Most commands support --out for controlling output format. The flag is repeatable so you can emit multiple formats at once.

Value	Description
`text`	Human-readable terminal output (default)
`json`	JSON to stdout
`json:<file>`	JSON written to `<file>`
`html`	Interactive HTML to stdout
`html:<file>`	Interactive HTML written to `<file>`
`markdown`	Markdown to stdout
`markdown:<file>`	Markdown written to `<file>`

gitsema search "auth" --out json:results.json --out text
gitsema clusters --out html:clusters.html
gitsema evolution "error handling" --out markdown:report.md

--dump [file] is a legacy alias for --out json[:file] and is still accepted.
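
The spec grammar in the table is small enough to sketch as a parser. `parseOutSpec` and `dumpToOutSpec` are hypothetical helpers, not gitsema's code. Rejecting `text:<file>` is an assumption based on the table listing no such form.

```typescript
// Hypothetical parser for --out specs such as "json:results.json"; illustrative only.
type OutFormat = "text" | "json" | "html" | "markdown";

interface OutSpec {
  format: OutFormat;
  file?: string; // absent means stdout
}

const KNOWN_FORMATS: readonly OutFormat[] = ["text", "json", "html", "markdown"];

function parseOutSpec(spec: string): OutSpec {
  // Split on the first colon only, so file paths may themselves contain colons.
  const sep = spec.indexOf(":");
  const formatPart = sep === -1 ? spec : spec.slice(0, sep);
  const file = sep === -1 ? undefined : spec.slice(sep + 1);

  const format = KNOWN_FORMATS.find((f) => f === formatPart);
  if (format === undefined) {
    throw new Error(`Unknown output format: ${formatPart}`);
  }
  if (file === undefined) {
    return { format };
  }
  if (file.length === 0) {
    throw new Error(`Missing file name in spec: ${spec}`);
  }
  // ASSUMPTION: the table lists no text:<file> form, so this sketch rejects it.
  if (format === "text") {
    throw new Error("text output does not take a file in this sketch");
  }
  return { format, file };
}

// --dump [file] maps onto --out json[:file].
function dumpToOutSpec(file?: string): OutSpec {
  return file === undefined ? { format: "json" } : { format: "json", file };
}

console.log(["json:results.json", "text"].map(parseOutSpec));
```

Because the flag is repeatable, a command line such as `--out json:results.json --out text` would yield one parsed spec per occurrence.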

`--narrate` — LLM summaries (most commands)

Appending --narrate to any supporting command generates a plain-language narrative summary of the results using an LLM. Configure the LLM endpoint with GITSEMA_LLM_URL (OpenAI-compatible).

gitsema evolution "authentication" --narrate
gitsema clusters --narrate
gitsema health --narrate

`GET /api/v1/capabilities` (HTTP server)

Returns a machine-readable JSON manifest of all features supported by the running server, including version, provider models, and enabled features. Useful for client auto-configuration.

curl http://localhost:4242/api/v1/capabilities

HTTP Analysis Routes (`POST /api/v1/analysis/...`)

All analysis commands available in the CLI are also exposed over HTTP. Routes in the table below are relative to /api/v1. Authentication via GITSEMA_SERVE_KEY applies to all routes.

Route	Description	Key request fields
`POST /analysis/clusters`	K-means cluster snapshot	`k`, `topKeywords`, `branch`
`POST /analysis/change-points`	Concept change-point detection	`query`, `topK`, `threshold`
`POST /analysis/author`	Author attribution for a concept	`query`, `topK`, `topAuthors`
`POST /analysis/impact`	Cross-module coupling for a file	`file`, `topK`
`POST /analysis/semantic-diff`	Semantic diff between two refs	`ref1`, `ref2`, `query`
`POST /analysis/semantic-blame`	Semantic origin of code blocks	`filePath`, `content`, `topK`
`POST /analysis/dead-concepts`	Deleted semantic blobs	`topK`, `since`
`POST /analysis/merge-audit`	Semantic collision detection before merge	`branchA`, `branchB`, `threshold`
`POST /analysis/merge-preview`	Merge semantic impact preview	`branch`, `into`
`POST /analysis/branch-summary`	Branch semantic summary vs base	`branch`, `baseBranch`
`POST /analysis/experts`	Reviewer / expert suggestions	`topN`, `since`, `until`
`POST /analysis/security-scan`	Vulnerability pattern similarity scan	`top`
`POST /analysis/health`	Time-bucketed codebase health timeline	`buckets`, `branch`
`POST /analysis/debt`	Technical debt scoring	`top`, `branch`
`POST /analysis/doc-gap`	Documentation gap analysis	`top`, `threshold`, `branch`
`POST /analysis/contributor-profile`	Contributor semantic profile	`author`, `top`, `branch`
`POST /analysis/triage`	Incident triage bundle (first-seen + change-points + bisect + experts)	`query`, `top`, `ref1`, `ref2`, `file`
`POST /analysis/policy-check`	Automated CI gate (debt / security / drift thresholds)	`maxDebtScore`, `minSecurityScore`, `maxDrift`, `query`
`POST /analysis/ownership`	Ownership heatmap by concept	`query`, `top`, `windowDays`
`POST /analysis/workflow`	Workflow template runner	`template` (`pr-review\|incident\|release-audit`), `query`, `file`, `top`
`POST /analysis/eval`	Inline retrieval evaluation (P@k, R@k, MRR)	`cases` (array of `{query, expectedPaths}`), `top`
`POST /analysis/multi-repo-search`	Search across multiple registered repos	`query`, `repoIds`, `topK`

Note on security-scan: Results are semantic similarity scores, not confirmed vulnerabilities. Always perform manual review.

Note on policy-check: Returns HTTP 200 when all gates pass, 422 when any gate fails — convenient for CI integration.
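
A client for the policy-check route might be sketched like this. The request fields and the 200/422 meaning come from the table and note above. Sending the key as a Bearer token is an assumption (the auth scheme is not specified in this section), and `buildPolicyCheckRequest` and `gatesPassed` are hypothetical helpers.

```typescript
// Illustrative request builder for POST /api/v1/analysis/policy-check.
interface PolicyCheckBody {
  maxDebtScore?: number;
  minSecurityScore?: number;
  maxDrift?: number;
  query?: string;
}

interface PreparedRequest {
  url: string;
  method: "POST";
  headers: Record<string, string>;
  body: string;
}

function buildPolicyCheckRequest(
  baseUrl: string,
  body: PolicyCheckBody,
  serveKey?: string,
): PreparedRequest {
  const headers: Record<string, string> = { "Content-Type": "application/json" };
  // ASSUMPTION: the serve key is sent as a Bearer token; check the server docs
  // for the scheme gitsema actually expects.
  if (serveKey !== undefined) {
    headers["Authorization"] = `Bearer ${serveKey}`;
  }
  return {
    url: `${baseUrl.replace(/\/+$/, "")}/api/v1/analysis/policy-check`,
    method: "POST",
    headers,
    body: JSON.stringify(body),
  };
}

// Map the documented status codes to a pass/fail result.
function gatesPassed(httpStatus: number): boolean {
  if (httpStatus === 200) return true;
  if (httpStatus === 422) return false;
  throw new Error(`Unexpected status ${httpStatus}`);
}

const prepared = buildPolicyCheckRequest("http://localhost:4242", { maxDebtScore: 0.4 });
console.log(prepared.url);
// To send it: const res = await fetch(prepared.url, prepared); gatesPassed(res.status);
```

Treating any status other than 200 or 422 as an error keeps server faults from being mistaken for a failed gate.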

Automated Indexing (Git Hooks)

You can keep the semantic index in sync with your repository automatically by installing the provided Git hook scripts. Once installed, gitsema index start runs in the background after every git commit and every git pull / git merge, with no manual intervention required.

How it works

Hook	Trigger	Command run
`post-commit`	After every `git commit`	`gitsema index start --since HEAD~1`
`post-merge`	After every `git pull` / `git merge`	`gitsema index start --since ORIG_HEAD`

Both hooks are safe no-ops when:

- gitsema is not on your PATH, or
- the index has not been initialised yet (run gitsema index start once first).

Installation (manual)

Copy the scripts into your repository's .git/hooks/ directory and make them executable:

cp scripts/hooks/post-commit  .git/hooks/post-commit
cp scripts/hooks/post-merge   .git/hooks/post-merge
chmod +x .git/hooks/post-commit .git/hooks/post-merge

Alternatively, use symlinks so the scripts stay in sync whenever you pull updates to the scripts/hooks/ directory:

ln -s ../../scripts/hooks/post-commit  .git/hooks/post-commit
ln -s ../../scripts/hooks/post-merge   .git/hooks/post-merge

Toggle via `gitsema config`

The gitsema config command can install or remove the hooks automatically — no manual file copying required:

# Install hooks for the current repository (symlinks into .git/hooks/)
gitsema config set hooks.enabled true

# Remove the managed hooks
gitsema config set hooks.enabled false

The setting is persisted in .gitsema/config.json. Git does not copy .git/hooks/ when cloning, so after a re-clone run gitsema config set hooks.enabled true again to reinstall the hooks. The manual copy/symlink steps above remain a valid alternative if you prefer not to use the config command.

Data storage

The index is stored in .gitsema/index.db (SQLite) in the root of the repository. Add it to .gitignore to avoid committing it:

.gitsema/

Feature catalog

See docs/features.md for the complete, grouped catalog of implemented features including indexing options, all search flags, history/temporal commands, clustering, branch/merge tools, the HTTP API route list, and all MCP tools.

Strategic review

For the latest deep review of bottlenecks, missing features, productization ideas, and AI-assisted coding workflows, see docs/review7.md.

AI skill

A reusable AI-operator playbook is available at skill/gitsema-ai-assistant.md. Use it as a prompt scaffold for coding assistants that interact with gitsema.

Roadmap / Plans

See docs/PLAN.md for the full development roadmap, phase history, and backlog of planned features.
