gitsema

A content-addressed semantic index synchronized with Git's object model.

Gitsema walks your Git history, embeds every blob, and lets you semantically search your codebase — including across time. It treats blob hashes as the unit of identity, so identical content is only embedded once regardless of how many commits reference it.
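The deduplication idea can be sketched in a few lines. This is an illustration only, not gitsema's implementation: `EmbeddingCache` and `fake_embed` are hypothetical names, the cache is an in-memory dict, and the hash shown is the SHA-1 blob ID used by repositories in Git's default object format.

```python
import hashlib


def git_blob_hash(content: bytes) -> str:
    """Return the SHA-1 object ID Git assigns to a blob with this content."""
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


class EmbeddingCache:
    """Hypothetical in-memory index keyed by blob hash."""

    def __init__(self, embed):
        self._embed = embed      # callable: bytes -> list of floats
        self._vectors = {}       # blob hash -> embedding
        self.embed_calls = 0

    def index(self, content: bytes) -> str:
        blob_id = git_blob_hash(content)
        if blob_id not in self._vectors:   # identical content is embedded once
            self._vectors[blob_id] = self._embed(content)
            self.embed_calls += 1
        return blob_id


def fake_embed(content: bytes) -> list:
    # Stand-in for a real embedding backend (assumption for this sketch).
    return [float(len(content))]


cache = EmbeddingCache(fake_embed)
# The same file content referenced by three different commits:
ids = [cache.index(b"hello\n") for _ in range(3)]
print(ids[0], cache.embed_calls)
```

Because the key is derived from content alone, renames and repeated references across commits never trigger a second embedding call.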

Requirements

  • Node.js 20+
  • Git (must be on PATH)
  • An embedding backend — either:
    • Ollama (local, default): ollama.ai with nomic-embed-text pulled
    • HTTP / OpenAI-compatible API: any endpoint that speaks the OpenAI embeddings API

Installation

Install from npm (requires Node.js >=20):

npm install -g gitsema

Or install from source:

git clone https://github.com/jsilvanus/gitsema.git
cd gitsema
pnpm install
pnpm build           # compiles TypeScript → dist/

# Optional: put `gitsema` on your PATH
pnpm setup           # one-time setup; then open a new terminal
pnpm link --global

To use without linking, prefix commands with node dist/cli/index.js instead of gitsema.

Quick start

cd /path/to/your/git/repo

# 1. Start indexing (uses Ollama by default)
gitsema index start

# 2. Search
gitsema search "authentication middleware"

# 3. Check index coverage (per-model, multi-model aware)
gitsema index

Configuration (environment variables)

All configuration is done through environment variables. Set them in your shell or in a .env file loaded before running gitsema.

Provider selection

| Variable | Default | Description |
| --- | --- | --- |
| GITSEMA_PROVIDER | ollama | Embedding backend: ollama, http, or embedeer |

Ollama provider (GITSEMA_PROVIDER=ollama)

| Variable | Default | Description |
| --- | --- | --- |
| GITSEMA_MODEL | nomic-embed-text | Ollama model to use for embeddings |
| GITSEMA_TEXT_MODEL | value of GITSEMA_MODEL | Model used for text/prose files |
| GITSEMA_CODE_MODEL | value of GITSEMA_TEXT_MODEL | Model used for source code files (overrides text model) |

Ollama is assumed to be running at http://localhost:11434. Pull the model first:

ollama pull nomic-embed-text

HTTP / OpenAI-compatible provider (GITSEMA_PROVIDER=http)

| Variable | Default | Description |
| --- | --- | --- |
| GITSEMA_HTTP_URL | (required) | Base URL of the embeddings API, e.g. https://api.openai.com |
| GITSEMA_MODEL | nomic-embed-text | Model name passed in the request body |
| GITSEMA_TEXT_MODEL | value of GITSEMA_MODEL | Model for text files |
| GITSEMA_CODE_MODEL | value of GITSEMA_TEXT_MODEL | Model for code files |
| GITSEMA_API_KEY | (optional) | Bearer token sent as Authorization: Bearer <key> |

Example for OpenAI:

export GITSEMA_PROVIDER=http
export GITSEMA_HTTP_URL=https://api.openai.com
export GITSEMA_MODEL=text-embedding-3-small
export GITSEMA_API_KEY=sk-...
gitsema index start
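For reference, a request in the OpenAI embeddings format looks roughly like the sketch below. The `/v1/embeddings` path appended to the base URL and the helper name `build_embedding_request` are assumptions for illustration; how gitsema itself builds the URL is not documented here. The request is constructed but not sent.

```python
import json
import urllib.request


def build_embedding_request(base_url, model, text, api_key=None):
    """Build (but do not send) an OpenAI-style embeddings request."""
    url = base_url.rstrip("/") + "/v1/embeddings"   # assumed path
    headers = {"Content-Type": "application/json"}
    if api_key:
        # Matches the documented "Authorization: Bearer <key>" header.
        headers["Authorization"] = f"Bearer {api_key}"
    body = json.dumps({"model": model, "input": text}).encode("utf-8")
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


req = build_embedding_request(
    "https://api.openai.com",
    "text-embedding-3-small",
    "authentication middleware",
    api_key="sk-...",   # placeholder, not a real key
)
print(req.get_method(), req.full_url)
print(json.loads(req.data))
```

Any server that accepts this request shape and returns the vectors under `data[i].embedding` should be usable as an HTTP provider.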

First-run profiling

Gitsema can generate a CPU profile during the first successful indexing run to help tune embedding concurrency and batchSize.

  • Environment variable: GITSEMA_PROFILE_FIRST_RUN (truthy enables, falsy disables)

  • Repo config: index.profileFirstRun (use gitsema config set index.profileFirstRun false --local to disable)

  • Profiles are written into the indexed repo at .gitsema/profiles/embedeer-profile-<timestamp>.cpuprofile

  • Precedence: the GITSEMA_PROFILE_FIRST_RUN environment variable overrides the repo config index.profileFirstRun.

  • Recommended: disable profiling in CI. Example (GitHub Actions):

    env:
      GITSEMA_PROFILE_FIRST_RUN: '0'

By default profiling is enabled on the first run when no prior embeddings exist. If an index attempt fails, a partial profile is still saved but the "profile-done" marker is only written after a successful, full indexing run.
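The precedence rule can be summarized as a small resolver. This is a sketch under assumptions: the set of strings treated as falsy and the flat, dotted-key shape of the repo config are guesses made for illustration, and `profile_first_run_enabled` is a hypothetical name.

```python
# Strings treated as "falsy" here are an assumption, not gitsema's documented list.
FALSY = {"", "0", "false", "no", "off"}


def profile_first_run_enabled(env, repo_config):
    """Resolve the setting: environment variable first, then repo config, then default."""
    raw = env.get("GITSEMA_PROFILE_FIRST_RUN")
    if raw is not None:
        return raw.strip().lower() not in FALSY   # env var wins when present
    configured = repo_config.get("index.profileFirstRun")
    if configured is not None:
        return bool(configured)
    return True   # documented default: profiling is on for the first run


# CI sets the variable to '0', which overrides a repo config that enables profiling.
print(profile_first_run_enabled({"GITSEMA_PROFILE_FIRST_RUN": "0"},
                                {"index.profileFirstRun": True}))
```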

Operational settings

| Variable | Default | Description |
| --- | --- | --- |
| GITSEMA_VERBOSE | off | Set to 1 for debug logging (same as --verbose) |
| GITSEMA_REMOTE | (optional) | Default remote gitsema tools serve URL; overridden per-command by --remote |
| GITSEMA_LLM_URL | (optional) | OpenAI-compatible URL for --narrate LLM summaries |
| GITSEMA_LOG_MAX_BYTES | 1048576 | Log rotation threshold (1 MB) |

Commands

Commands are organised into groups. See docs/features.md for the full feature catalog.

| Group | Commands |
| --- | --- |
| Setup | quickstart, config, status, models, repos |
| Indexing | index (status), index start, index doctor, index vacuum, index backfill-fts, index rebuild-fts, index gc, index clear-model, index update-modules, index build-vss, index export, index import, remote-index, watch |
| Protocol Servers | tools mcp, tools serve, tools lsp |
| Search & Discovery | search, code-search, first-seen, dead-concepts, repl |
| File History | file-evolution, file-diff, blame, impact, file-change-points |
| Concept History | evolution, diff, author, lifecycle |
| Cluster Analysis | clusters, cluster-diff, cluster-timeline |
| Change Detection | change-points, file-change-points, cluster-change-points |
| Branch / Merge | branch-summary, merge-audit, merge-preview, cherry-pick-suggest, ci-diff, bisect |
| Code Quality | code-review, security-scan, health, debt, doc-gap, refactor-candidates |
| Analysis | author, contributor-profile, triage, policy, ownership, eval, cross-repo-similarity, pr-report |
| Workflows | workflow run <template>, workflow list |
| Visualization | map, heatmap, project |

Backward-compatible aliases: concept-evolution → evolution, semantic-blame → blame. The top-level gitsema mcp / gitsema serve / gitsema lsp commands still work but are deprecated; use gitsema tools mcp / gitsema tools serve / gitsema tools lsp instead. The old DB maintenance commands (gitsema doctor, gitsema vacuum, gitsema gc, gitsema backfill-fts, gitsema rebuild-fts, gitsema update-modules, gitsema build-vss, gitsema clear-model) still work as hidden deprecated aliases and print a migration hint; use the gitsema index <subcommand> forms instead.


Find the right command by goal

Not sure which command to use? Search by what you want to accomplish:

| I want to… | Command(s) |
| --- | --- |
| Get started with guided setup | gitsema quickstart |
| Index and search | gitsema index start, gitsema search "query" |
| See what's indexed / coverage | gitsema index, gitsema status |
| Find where a concept first appeared | gitsema first-seen "query" |
| Track how a file changed semantically over time | gitsema file-evolution path/to/file |
| Compare two versions of a file | gitsema file-diff <ref1> <ref2> path/to/file |
| Understand how a concept evolved | gitsema evolution "query" |
| Find functions or classes by meaning | gitsema code-search "query" |
| Detect when major semantic shifts happened | gitsema change-points "query" |
| See which commits diverged most semantically | gitsema cluster-diff <ref1> <ref2> |
| Understand who "owns" a concept | gitsema author "query" |
| Find stale or dead concepts | gitsema dead-concepts |
| Assess code health over time | gitsema health, gitsema debt |
| Find security-pattern matches | gitsema security-scan |
| Review a PR semantically | gitsema code-review, gitsema branch-summary, gitsema merge-audit |
| Find refactor candidates | gitsema refactor-candidates |
| Find doc coverage gaps | gitsema doc-gap |
| Triage an incident | gitsema triage "query" |
| Run a full analysis workflow | gitsema workflow run <template> |
| Run an interactive search session | gitsema repl |
| Set up a team server | gitsema tools serve --port 4242 --key <token> |
| Expose to Claude / AI assistants | gitsema tools mcp |
| Search across multiple repos | gitsema repos add, gitsema search "query" --repos <ids> |
| Add narrated summaries to any output | Append --narrate to most commands |
| Output JSON / HTML / Markdown | Append --out json, --out html, or --out markdown |

See docs/playbooks.md for role-based recipes (solo dev, PR reviewer, security engineer, release manager).


Setup & Infrastructure

gitsema quickstart

Interactive setup wizard. Detects your environment, walks through provider configuration (Ollama or HTTP), runs a test embedding, and records settings to .gitsema/config.json.

gitsema quickstart

Use this the first time you set up gitsema in a new repo or on a new machine.


gitsema status

Show index statistics and database path. Also displays embed config provenance (provider, model, dimensions, chunker) recorded from previous index runs.

gitsema status

gitsema index

Show index coverage status — read-only, no writes. Displays Git-reachable blob counts and per-embedding-model coverage, including file-level, chunk-level, symbol-level and module-level stats.

One database can hold embeddings from multiple models simultaneously; this command reports coverage for each.

Output includes:
  DB path and schema version
  Git-reachable blob count (true 100% denominator — all refs)
  DB blob count (what gitsema has seen)
  Per embed-config / model:
    file blobs embedded + coverage %
    chunks, symbols, modules embedded (where present)

gitsema index start [options]

Walk the Git history and embed all blobs into the index. Starts from HEAD first (fastest time-to-first-results) then walks history. Already-indexed blobs are skipped automatically (content-addressed deduplication).

Uses the currently configured embedding model (GITSEMA_MODEL / gitsema config) unless overridden by --model.

Options:
  --since <ref>              Only index commits after this point.
                             Accepts a date (2024-01-01), tag (v1.0), or commit hash.
                             Use "all" to force a full re-index.
  --max-commits <n>          Stop after indexing this many commits.
  --concurrency <n>          Parallel embedding calls (default: 4). Increase on fast
                             hardware; decrease if the embedding server throttles.
  --embed-batch-size <n>     Batch size for embedding API calls.
  --ext <extensions>         Only index files with these extensions, e.g. ".ts,.js,.py"
  --include-glob <patterns>  Only index paths matching these glob patterns (comma-separated).
  --max-size <size>          Skip blobs larger than this (e.g. "200kb", "1mb"; default: 200kb)
  --exclude <patterns>       Skip blobs whose path contains any of these substrings.
  --chunker <strategy>       Chunking strategy: file (default), function, or fixed.
  --level <granularity>      Alias for --chunker: blob/file, function, fixed, multi.
  --window-size <n>          Characters per chunk for the fixed chunker (default: 1500).
  --overlap <n>              Character overlap between adjacent fixed chunks (default: 200).
  --file <paths...>          Index specific file(s) from HEAD (repeatable).
  --model <model>            Override all embedding models for this run.
  --text-model <model>       Override the text/prose embedding model.
  --code-model <model>       Override the code embedding model.
  --quantize                 Enable Int8 scalar quantization of stored vectors.
  --build-vss                Build the HNSW vector index immediately after indexing.
  --auto-build-vss [n]       Auto-build VSS when total blobs exceed n (default: 10000).
  --remote <url>             Proxy embedding calls to a remote gitsema server.
  --branch <name>            Tag indexed blobs as belonging to this branch.
  --profile <preset>         Apply a preset: speed, balanced, or quality.
  --allow-mixed              Skip embed-config compatibility check (allow mixing
                             different models/dimensions in the same index).

Examples:

# Start full index from HEAD first, then walk history
gitsema index start

# Only TypeScript files added since a tag
gitsema index start --since v1.2.0 --ext ".ts,.tsx"

# Use function-level chunking with higher concurrency
gitsema index start --chunker function --concurrency 8

# Index specific files from HEAD
gitsema index start --file docs/PLAN.md src/cli/commands/index.ts --concurrency 2

# Force full re-index with a different model
gitsema index start --since all --model text-embedding-3-small
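To make --window-size and --overlap concrete, here is a minimal sketch of a fixed-size sliding-window chunker using the documented defaults (1500 characters per chunk, 200 characters of overlap). The function name `fixed_chunks` and the handling of the final short chunk are assumptions; gitsema's own chunker may differ in such details.

```python
def fixed_chunks(text, window_size=1500, overlap=200):
    """Split text into overlapping windows; returns (start_offset, chunk) pairs."""
    if not 0 <= overlap < window_size:
        raise ValueError("overlap must be >= 0 and smaller than window_size")
    step = window_size - overlap          # each window starts this far after the last
    chunks = []
    for start in range(0, len(text), step):
        chunks.append((start, text[start:start + window_size]))
        if start + window_size >= len(text):
            break                         # this window already reaches the end
    return chunks


sample = "".join(chr(97 + i % 26) for i in range(3000))
for start, chunk in fixed_chunks(sample):
    print(start, len(chunk))
```

With the defaults, each window starts 1300 characters after the previous one, so text near a chunk boundary appears in both neighbouring chunks.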

gitsema remote-index <repoUrl>

Ask a remote gitsema tools serve instance to clone and index a Git repository.


gitsema index backfill-fts

Populate FTS5 content for blobs indexed before Phase 11. Required to use --hybrid search on older index entries.


gitsema index doctor

Run integrity checks and report the health of the index database.

gitsema index doctor

Checks performed:

  • Schema version vs expected version
  • Blob / embedding / FTS row counts
  • Missing FTS rows (suggests gitsema index backfill-fts)
  • Orphan embeddings (suggests gitsema index gc)
  • SQLite integrity check (PRAGMA integrity_check)
  • Stored embed config provenance (provider, model, dimensions, chunker)

Exits with code 1 if critical issues (integrity failures or schema mismatch) are detected.


gitsema index vacuum

Run VACUUM and ANALYZE on the SQLite index database. Compacts the file and refreshes query planner statistics. Safe to run at any time.

gitsema index vacuum

gitsema index rebuild-fts

Rebuild the FTS5 full-text search index from stored data. Use after bulk deletions or if hybrid search returns stale results.

gitsema index rebuild-fts        # prompts for confirmation
gitsema index rebuild-fts --yes  # skip confirmation

gitsema index gc

Garbage collect unreachable blob records from the DB (blobs not reachable from any Git ref).

gitsema index gc
gitsema index gc --dry-run  # preview what would be removed

gitsema index clear-model <model>

Delete all stored embeddings and cache entries for a specific model.

gitsema index clear-model nomic-embed-text
gitsema index clear-model text-embedding-3-small --yes

gitsema index update-modules

Recalculate module (directory) centroid embeddings from stored whole-file embeddings.

gitsema index update-modules
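Conceptually, a module centroid is the average of the whole-file embeddings in a directory. The sketch below assumes an unweighted mean over files grouped by their immediate parent directory; whether gitsema weights files, normalizes the result, or includes nested directories is not stated here, and `module_centroids` is a hypothetical name.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def module_centroids(file_embeddings):
    """Map each directory to the element-wise mean of its files' embeddings."""
    groups = defaultdict(list)
    for path, vector in file_embeddings.items():
        groups[str(PurePosixPath(path).parent)].append(vector)
    return {
        module: [sum(column) / len(vectors) for column in zip(*vectors)]
        for module, vectors in groups.items()
    }


centroids = module_centroids({
    "src/auth/jwt.ts": [1.0, 0.0],
    "src/auth/session.ts": [0.0, 1.0],
    "src/db/schema.ts": [0.25, 0.75],
})
print(centroids)
```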

gitsema index build-vss

Build a usearch HNSW ANN index from stored embeddings for fast approximate search. Requires the usearch optional package.

gitsema index build-vss
gitsema index build-vss --model text-embedding-3-small

Note: The old top-level forms (gitsema doctor, gitsema vacuum, gitsema backfill-fts, etc.) still work as deprecated aliases and will print a migration hint.


gitsema models

Manage embedding model configurations. Different models can use different providers, base URLs, and API keys. Model profiles are stored in .gitsema/config.json (local) or ~/.config/gitsema/config.json (global, --global).

Subcommands:

| Subcommand | Description |
| --- | --- |
| gitsema models list | List all configured profiles and indexed models |
| gitsema models info <name> | Show provider config + index stats for a model |
| gitsema models add <name> | Configure provider settings for a model |
| gitsema models remove <name> | Remove a model profile from config |

Examples:

# List all models (from index + config profiles)
gitsema models list

# Show detailed info for a model
gitsema models info text-embedding-3-small

# Add an OpenAI model with its own provider config
gitsema models add text-embedding-3-small \
  --provider http \
  --url https://api.openai.com \
  --key sk-... \
  --set-text                        # also set as default text model

# Add a local Ollama model
gitsema models add nomic-embed-text --provider ollama --set-default

# Remove a profile (keep index data)
gitsema models remove text-embedding-3-small

# Remove a profile AND purge all its embeddings from the index
gitsema models remove text-embedding-3-small --purge-index

Per-model provider settings override global GITSEMA_PROVIDER / GITSEMA_HTTP_URL / GITSEMA_API_KEY environment variables, so you can use Ollama for one model and OpenAI for another in the same repo.


gitsema index start --level <level>

The --level flag on gitsema index start is a convenience alias for --chunker:

| --level | --chunker equivalent | Description |
| --- | --- | --- |
| blob or file | file (default) | One embedding per file |
| function | function | Function and class boundaries |
| fixed | fixed | Fixed-size sliding windows |

Examples:

gitsema index start --level function     # embed at function granularity
gitsema index start --level blob         # one embedding per file (default)
gitsema search "auth middleware" --level function  # search function-level embeddings

Tip: Use --level function on index start and --level function on search together for function-granularity semantic search.


gitsema tools mcp

Start the gitsema MCP server over stdio. Allows AI assistants (Claude, VS Code Copilot, etc.) to query the semantic index via the Model Context Protocol.

gitsema tools mcp

Alias: gitsema mcp still works but is deprecated. Use gitsema tools mcp.

gitsema tools lsp [--tcp <port>]

Start the LSP semantic hover server. Responds to hover requests with nearest-neighbor blobs.

gitsema tools lsp          # stdio (default)
gitsema tools lsp --tcp 7777

gitsema tools serve [options]

Start the gitsema HTTP API server so remote machines can delegate embedding and storage to a central host. Replaces the deprecated top-level gitsema serve command.

Options:
  --port <n>      Port to listen on (default: 4242)
  --key <token>   Require this Bearer token on all requests
  --ui            Serve the embedded 2D codebase map web UI at /ui

P2 operational features exposed by the HTTP server:

| Endpoint | Description |
| --- | --- |
| GET /metrics | Prometheus metrics scrape (protected by auth; set GITSEMA_METRICS_PUBLIC=1 to bypass) |
| GET /openapi.json | OpenAPI 3.1 spec (always public) |
| GET /docs | Swagger UI (always public) |

Rate limiting env vars:

| Variable | Default | Description |
| --- | --- | --- |
| GITSEMA_RATE_LIMIT_RPM | 300 | Requests per minute per token/IP |
| GITSEMA_RATE_LIMIT_BURST | = RPM | Per-window burst allowance |
| GITSEMA_METRICS_PUBLIC | off | Set to 1 to expose /metrics without auth |
| GITSEMA_MAX_BODY_SIZE | 1mb | Max request body size (e.g. 2mb, 512kb) |

For full deployment instructions (systemd, Docker, secrets, backups) see docs/deploy.md.

Alias: gitsema serve still works but is deprecated. Use gitsema tools serve.


Search & Discovery

gitsema search <query> [options]

Semantically search the index.

Options:
  -k, --top <n>           Number of results (default: 10)
  --level <granularity>   Search at: file, chunk, or symbol level (default: symbol)
  --threshold <n>         Minimum similarity score 0–1 to include a result (default: 0)
  --recent                Blend cosine similarity with a recency score
  --alpha <n>             Cosine weight in blended score (0–1, default: 0.8)
  --before <date>         Only blobs first seen before this date (YYYY-MM-DD)
  --after <date>          Only blobs first seen after this date (YYYY-MM-DD)
  --weight-vector <n>     Vector weight in three-signal ranking (default: 0.7)
  --weight-recency <n>    Recency weight (default: 0.2)
  --weight-path <n>       Path-relevance weight (default: 0.1)
  --group <mode>          Group results by: file, module, or commit
  --chunks                Include chunk-level embeddings in results
  --hybrid                Combine vector similarity with BM25 keyword matching
  --bm25-weight <n>       BM25 weight in hybrid score (default: 0.3)
  --branch <name>         Restrict results to blobs seen on this branch
  --model <model>         Override query embedding model
  --vss                   Use the HNSW approximate nearest-neighbour index
  --repos <ids>           Comma-separated repo IDs for multi-repo search
  --narrate               Generate an LLM summary of the results
  --out <spec>            Output format (repeatable): text, json[:file], html[:file],
                          markdown[:file]

Examples:

gitsema search "authentication middleware"
gitsema search "database connection pool" --top 20
gitsema search "rate limiting" --recent --after 2024-01-01
gitsema search "error handling" --hybrid
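The weighting options are easiest to read as linear combinations of per-result signals. The formulas below are assumptions inferred from the option descriptions and defaults (they are not taken from gitsema's source), with every signal assumed to be normalized to the range 0–1.

```python
def blended_score(cosine, recency, alpha=0.8):
    """--recent / --alpha: alpha weights cosine similarity, the rest goes to recency."""
    return alpha * cosine + (1.0 - alpha) * recency


def three_signal_score(vector, recency, path,
                       w_vector=0.7, w_recency=0.2, w_path=0.1):
    """--weight-vector / --weight-recency / --weight-path as a weighted sum."""
    return w_vector * vector + w_recency * recency + w_path * path


def hybrid_score(vector, bm25, bm25_weight=0.3):
    """--hybrid / --bm25-weight: mix vector similarity with a normalized BM25 score."""
    return (1.0 - bm25_weight) * vector + bm25_weight * bm25


# An old but very similar blob versus a recent, slightly less similar one.
old = blended_score(cosine=0.90, recency=0.10)
new = blended_score(cosine=0.80, recency=0.95)
print(round(old, 2), round(new, 2))
```

Under these assumptions, lowering --alpha shifts the ranking toward recently seen blobs, and raising --bm25-weight favours exact keyword matches over semantic similarity.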

gitsema first-seen <query> [options]

Find when a concept first appeared in the codebase, sorted chronologically.

See also: search, evolution

Options:
  -k, --top <n>           Number of results (default: 10)
  --hybrid                Combine vector + BM25 search
  --bm25-weight <n>       BM25 weight in hybrid score (default: 0.3)
  --include-commits       Also search commit messages
  --branch <name>         Restrict to this branch
  --model <model>         Override query embedding model
  --narrate               Generate an LLM summary
  --dump [file]           Output JSON to file or stdout
  --out <spec>            Output format (repeatable): text, json[:file], html[:file],
                          markdown[:file]

Examples:

gitsema first-seen "JWT token validation"
gitsema first-seen "rate limiting" --hybrid --include-commits

gitsema dead-concepts [options]

Find historical concepts that no longer exist in HEAD but are semantically similar to current code.

See also: search, evolution

Options:
  -k, --top <n>       Number of results (default: 10)
  --since <date>      Only consider blobs whose latest commit is on or after this date
  --branch <name>     Restrict to this branch
  --dump [file]       Output structured JSON
  --out <spec>        Output format (repeatable)

gitsema repl

Interactive semantic exploration REPL. Provides a persistent session where you can run search, first-seen, evolution, and other queries without re-embedding the query each time.

gitsema repl

Inside the REPL, type a query to search, or prefix with a command name (e.g. first-seen auth, evolution "error handling"). Type help for available commands, exit to quit.


File History

gitsema file-evolution <path> [options]

Track the semantic drift of a file across its Git history.

See also: file-diff, evolution

Options:
  --threshold <n>       Cosine distance above which a version change is flagged (default: 0.3)
  --dump [file]         Output structured JSON; writes to <file> or stdout if omitted
  --include-content     Include stored file content in the JSON dump (requires --dump)
  --alerts [n]          Show the top-N largest semantic jumps (default: 5)

Examples:

gitsema file-evolution src/core/auth/middleware.ts
gitsema file-evolution src/core/auth/middleware.ts --dump evolution.json
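The --threshold option compares consecutive versions by cosine distance. A minimal sketch of that check follows, assuming the usual definition (1 minus cosine similarity) and a list of versions ordered oldest first; `flag_jumps` is a hypothetical helper, not part of gitsema.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity; 0 means same direction, 1 means orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine distance is undefined for zero vectors")
    return 1.0 - dot / (norm_a * norm_b)


def flag_jumps(versions, threshold=0.3):
    """versions: list of (commit, embedding), oldest first."""
    jumps = []
    for (prev_commit, prev_vec), (commit, vec) in zip(versions, versions[1:]):
        distance = cosine_distance(prev_vec, vec)
        if distance > threshold:
            jumps.append((prev_commit, commit, distance))
    return jumps


history = [
    ("a1b2c3d", [1.0, 0.0]),
    ("b2c3d4e", [0.9, 0.1]),   # small edit: direction barely changes
    ("e4f5a6b", [0.0, 1.0]),   # rewrite: direction changes sharply
]
for before, after, distance in flag_jumps(history):
    print(before, "->", after, round(distance, 3))
```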

gitsema file-diff <ref1> <ref2> <path>

Compute the semantic diff between two versions of a file.

See also: file-evolution, cluster-diff, diff

Options:
  --neighbors <n>   Number of nearest-neighbour blobs to show for each version (default: 0)

Example:

gitsema file-diff HEAD~10 HEAD src/api/router.ts

gitsema blame <file> [options]

Alias: gitsema semantic-blame (backward-compatible)

Show the semantic origin of each logical block in a file — nearest-neighbour blame.

See also: file-evolution, impact

Options:
  -k, --top <n>   Number of nearest-neighbor blobs to show per block (default: 3)
  --dump [file]   Output structured JSON

gitsema impact <path> [options]

Compute semantically similar blobs across the codebase to highlight refactor impact.

See also: blame, file-diff

Options:
  -k, --top <n>   Number of similar blobs to return (default: 10)
  --chunks        Include chunk-level embeddings for finer-grained coupling
  --dump [file]   Output structured JSON

Concept History

gitsema evolution <query> [options]

Alias: gitsema concept-evolution (backward-compatible)

Show how a semantic concept evolved across the entire commit history.

See also: file-evolution, first-seen, diff

Options:
  -k, --top <n>         Number of top-matching blobs to include (default: 50)
  --threshold <n>       Cosine distance threshold for flagging large changes (default: 0.3)
  --dump [file]         Output structured JSON
  --html [file]         Output an interactive HTML visualization
  --include-content     Include stored file content in the JSON dump (requires --dump)

Examples:

gitsema evolution "authentication"
gitsema concept-evolution "authentication"   # backward-compatible alias

gitsema diff <ref1> <ref2> <query> [options]

Compute a conceptual/semantic diff of a topic across two git refs. Shows which blobs matching the topic were gained (new in ref2), lost (removed from ref1), and stable (present in both), each ranked by topic relevance — most relevant files for the topic appear at the top of each group.

See also: evolution, file-diff, cluster-diff

Arguments:
  query             Topic or concept to compare across the two refs

Options:
  -k, --top <n>     Max results per group (gained/lost/stable) (default: 10)
  --dump [file]     Output structured JSON

Examples:

gitsema diff v1.0.0 HEAD "authentication"
gitsema diff 2024-01-01 2024-06-01 "error handling" --top 5
gitsema diff HEAD~20 HEAD "database access" --dump diff.json
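The gained / lost / stable grouping is a set comparison followed by a relevance sort. The sketch below assumes each ref has already been reduced to a mapping from blob identifier to topic relevance score; `concept_diff` and the choice to rank stable blobs by their ref2 score are illustrative assumptions.

```python
def concept_diff(ref1_matches, ref2_matches, top=10):
    """Each argument maps a blob identifier to its topic relevance at that ref."""

    def ranked(keys, scores):
        # Highest relevance first; ties broken by identifier for stable output.
        return sorted(keys, key=lambda k: (-scores[k], k))[:top]

    gained = ref2_matches.keys() - ref1_matches.keys()   # new in ref2
    lost = ref1_matches.keys() - ref2_matches.keys()     # removed since ref1
    stable = ref1_matches.keys() & ref2_matches.keys()   # present in both
    return {
        "gained": ranked(gained, ref2_matches),
        "lost": ranked(lost, ref1_matches),
        "stable": ranked(stable, ref2_matches),
    }


result = concept_diff(
    {"src/auth/session.ts": 0.81, "src/auth/middleware.ts": 0.77},
    {"src/auth/jwt.ts": 0.88, "src/auth/middleware.ts": 0.79},
)
print(result)
```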

Cluster Analysis

gitsema clusters [options]

Cluster all blob embeddings into semantic regions using k-means++ and display a concept graph.

See also: cluster-diff, cluster-timeline

Options:
  --k <n>                 Number of clusters (default: 8)
  --top <n>               Top representative paths per cluster (default: 5)
  --iterations <n>        Max k-means iterations (default: 20)
  --edge-threshold <n>    Cosine similarity threshold for concept graph edges (default: 0.3)
  --dump [file]           Output structured JSON
  --html [file]           Output an interactive HTML visualization
  --enhanced-labels       Enhance cluster labels using TF-IDF path and identifier analysis

gitsema cluster-diff <ref1> <ref2>

Compare semantic clusters between two points in history (temporal clustering).

See also: clusters, cluster-timeline, file-diff

gitsema cluster-diff v1.0.0 HEAD
gitsema cluster-diff 2024-01-01 2024-06-01

gitsema cluster-timeline

Show how semantic clusters shifted over the commit history — multi-step timeline.

See also: clusters, cluster-diff

Options:
  --k <n>         Number of clusters per step (default: 8)
  --steps <n>     Number of evenly-spaced time checkpoints (default: 5)
  --since <ref>   Start date or git ref for the timeline
  --until <ref>   End date or git ref for the timeline
  --html [file]   Output an interactive HTML visualization

Change Detection

gitsema change-points <query> [options]

Detect conceptual change points for a semantic query across the entire commit history. For each indexed commit the command builds a weighted centroid from the top-k matching blobs visible at that point in time and reports commits where the centroid shifted sharply.

See also: evolution, cluster-change-points

Options:
  -k, --top <n>       Top-k blobs used to define concept state per commit (default: 50)
  --threshold <n>     Cosine distance threshold to flag a change point (default: 0.3)
  --top-points <n>    Show top-N largest jumps (default: 5)
  --since <ref>       Limit commits from this point; accepts date (YYYY-MM-DD), tag, or hash
  --until <ref>       Limit commits up to this point; accepts date (YYYY-MM-DD), tag, or hash
  --dump [file]       Output structured JSON; writes to <file> or stdout if omitted

Examples:

gitsema change-points "authentication middleware"
gitsema change-points "database connection" --threshold 0.4 --top-points 3
gitsema change-points "error handling" --since 2024-01-01 --dump changes.json

Example JSON output (--dump):

{
  "type": "concept-change-points",
  "query": "authentication middleware",
  "k": 50,
  "threshold": 0.3,
  "range": { "since": null, "until": null },
  "points": [
    {
      "before": { "commit": "a1b2c3d", "date": "2023-06-15", "timestamp": 1686787200, "topPaths": ["src/auth/session.ts"] },
      "after":  { "commit": "e4f5a6b", "date": "2023-09-20", "timestamp": 1695168000, "topPaths": ["src/auth/jwt.ts"] },
      "distance": 0.412
    }
  ]
}
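The centroid-shift idea behind these points can be sketched as follows. This is a simplified illustration under assumptions: each match's similarity score is used as its weight, distance is 1 minus cosine similarity, and the names `weighted_centroid` and `concept_change_points` are hypothetical.

```python
import math


def weighted_centroid(matches):
    """matches: list of (weight, embedding); returns the weighted mean vector."""
    total = sum(weight for weight, _ in matches)
    dim = len(matches[0][1])
    return [sum(w * vec[i] for w, vec in matches) / total for i in range(dim)]


def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return 1.0 - dot / (math.sqrt(sum(x * x for x in a)) *
                        math.sqrt(sum(y * y for y in b)))


def concept_change_points(states, threshold=0.3):
    """states: list of (commit, matches), oldest first."""
    centroids = [(commit, weighted_centroid(m)) for commit, m in states]
    points = []
    for (c1, v1), (c2, v2) in zip(centroids, centroids[1:]):
        distance = cosine_distance(v1, v2)
        if distance > threshold:
            points.append({"before": c1, "after": c2, "distance": distance})
    return points


states = [
    ("a1b2c3d", [(0.9, [1.0, 0.0]), (0.1, [0.0, 1.0])]),   # session-based auth dominates
    ("e4f5a6b", [(0.5, [0.0, 1.0]), (0.5, [0.0, 1.0])]),   # JWT-based auth dominates
]
print(concept_change_points(states))
```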

gitsema file-change-points <path> [options]

Detect semantic change points in a single file's Git history. Reports commits where the embedding distance between consecutive file versions exceeded the threshold.

See also: file-evolution, change-points

Options:
  --threshold <n>     Cosine distance threshold (default: 0.3)
  --top-points <n>    Show top-N largest jumps (default: 5)
  --since <ref>       Limit commits from this point; accepts date (YYYY-MM-DD), tag, or hash
  --until <ref>       Limit commits up to this point; accepts date (YYYY-MM-DD), tag, or hash
  --dump [file]       Output structured JSON; writes to <file> or stdout if omitted

Examples:

gitsema file-change-points src/core/auth/middleware.ts
gitsema file-change-points src/api/router.ts --threshold 0.4 --top-points 3
gitsema file-change-points src/db/schema.ts --since v1.0 --dump schema-changes.json

Example JSON output (--dump):

{
  "type": "file-change-points",
  "path": "src/core/auth/middleware.ts",
  "threshold": 0.3,
  "range": { "since": null, "until": null },
  "points": [
    {
      "before": { "commit": "a1b2c3d", "date": "2023-06-15", "timestamp": 1686787200, "blobHash": "abc1234..." },
      "after":  { "commit": "e4f5a6b", "date": "2023-09-20", "timestamp": 1695168000, "blobHash": "def5678..." },
      "distance": 0.524
    }
  ]
}
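A dump in this shape is straightforward to post-process. The sketch below relies only on the fields shown in the example output; the inline sample stands in for a file written with --dump, and `largest_jumps` is a hypothetical helper.

```python
import json


def largest_jumps(dump, top_points=5):
    """Return (before_commit, after_commit, distance) tuples, largest distance first."""
    points = sorted(dump["points"], key=lambda p: p["distance"], reverse=True)
    return [
        (p["before"]["commit"], p["after"]["commit"], p["distance"])
        for p in points[:top_points]
    ]


# Inline sample mirroring the documented output; in practice, load the dumped
# file instead, e.g. json.load(open("schema-changes.json")).
sample = json.loads("""
{
  "type": "file-change-points",
  "path": "src/core/auth/middleware.ts",
  "threshold": 0.3,
  "range": { "since": null, "until": null },
  "points": [
    {
      "before": { "commit": "a1b2c3d", "date": "2023-06-15", "timestamp": 1686787200 },
      "after":  { "commit": "e4f5a6b", "date": "2023-09-20", "timestamp": 1695168000 },
      "distance": 0.524
    }
  ]
}
""")
for before, after, distance in largest_jumps(sample):
    print(f"{before} -> {after}: {distance}")
```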

gitsema cluster-change-points [options]

Detect change points in the repo's cluster structure across commit history. For each sampled commit the command runs k-means clustering over visible blobs, matches clusters between consecutive steps using greedy centroid similarity, and reports steps where the mean centroid shift score exceeded the threshold.

See also: cluster-timeline, change-points

Performance note: By default every indexed commit is evaluated. On large repositories use --max-commits to cap the number of commits sampled (they are selected evenly across the since–until range).

Options:
  --k <n>             Number of clusters per step (default: 8)
  --threshold <n>     Mean centroid shift threshold (default: 0.3)
  --top-points <n>    Show top-N largest shifts (default: 5)
  --since <ref>       Limit commits from this point; accepts date (YYYY-MM-DD), tag, or hash
  --until <ref>       Limit commits up to this point; accepts date (YYYY-MM-DD), tag, or hash
  --max-commits <n>   Cap commits evaluated; sampled evenly (omit to evaluate every commit)
  --dump [file]       Output structured JSON; writes to <file> or stdout if omitted

Examples:

gitsema cluster-change-points
gitsema cluster-change-points --k 6 --threshold 0.4 --top-points 3
gitsema cluster-change-points --max-commits 200 --dump cluster-changes.json
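The matching step described above can be illustrated with a small greedy matcher. This is a sketch of one reasonable reading of "greedy centroid similarity", not gitsema's code: all centroid pairs are ranked by cosine similarity, the best unused pair is taken each time, and the shift score is assumed to be the mean of 1 minus similarity over matched pairs.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(y * y for y in b)))


def greedy_match(prev_centroids, curr_centroids):
    """Pair clusters across two steps, taking the most similar unused pair first."""
    candidates = sorted(
        ((cosine_similarity(p, c), i, j)
         for i, p in enumerate(prev_centroids)
         for j, c in enumerate(curr_centroids)),
        reverse=True,
    )
    used_prev, used_curr, matches = set(), set(), []
    for similarity, i, j in candidates:
        if i in used_prev or j in used_curr:
            continue
        used_prev.add(i)
        used_curr.add(j)
        matches.append((i, j, similarity))
    return matches


def mean_centroid_shift(prev_centroids, curr_centroids):
    matches = greedy_match(prev_centroids, curr_centroids)
    return sum(1.0 - similarity for _, _, similarity in matches) / len(matches)


prev = [[1.0, 0.0], [0.0, 1.0]]
curr = [[1.0, 1.0], [1.0, 0.0]]   # cluster 0 kept its direction, cluster 1 drifted
print(greedy_match(prev, curr))
print(round(mean_centroid_shift(prev, curr), 3))
```

A step would be reported as a change point when this mean shift exceeds --threshold.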

Repo Insights

gitsema experts [options]

Rank contributors by the number of distinct blobs they introduced and show which semantic clusters/concepts they worked on. No embedding provider required — uses data already in the index.

Tip: Run gitsema clusters first to populate cluster labels. Without clusters, semantic areas are shown as cluster-<id>.

See also: author, contributor-profile

Options:
  --top <n>           Number of top contributors to show (default: 10)
  --since <ref>       Only count commits at or after this date (YYYY-MM-DD or ISO 8601)
  --until <ref>       Only count commits at or before this date (YYYY-MM-DD or ISO 8601)
  --min-blobs <n>     Suppress contributors with fewer than this many blobs (default: 1)
  --top-clusters <n>  Max semantic areas to show per contributor (default: 5)
  --dump [file]       Output structured JSON; writes to <file> or stdout if omitted
  --html [file]       Output an interactive HTML report; writes to <file> or experts.html

Examples:

# Top 10 contributors overall
gitsema experts

# Top 5 contributors since 2024, with JSON output
gitsema experts --top 5 --since 2024-01-01 --dump experts.json

# Interactive HTML report
gitsema experts --html experts.html

Example text output:

Top 3 contributors by semantic area (since 2024-01-01)

1. Alice <[email protected]>
   Blobs: 142
   Semantic areas:
     · auth-module  [38 blobs]  (src/auth/jwt.ts, src/auth/session.ts)
     · api-routes   [31 blobs]  (src/routes/auth.ts)
     · db-layer     [12 blobs]  (src/db/users.ts)

2. Bob <[email protected]>
   Blobs: 97
   Semantic areas:
     · db-layer     [44 blobs]  (src/db/schema.ts, src/db/migrations.ts)
     · tests        [29 blobs]  (tests/integration/db.test.ts)

gitsema pr-report [options]

Generates a semantic PR report combining semantic diff, impacted modules, change-point highlights, and reviewer suggestions. Designed for CI/bot ingestion.

| Flag | Default | Description |
| --- | --- | --- |
| --ref1 <ref> | HEAD~1 | Earlier git ref |
| --ref2 <ref> | HEAD | Later git ref |
| --file <path> | | File to compute semantic diff and impact for |
| --query <q> | | Topic query for change-point highlights |
| -k, --top <n> | 10 | Top-k results per section |
| --since <date> | | Only include reviewer activity after this date |
| --until <date> | | Only include reviewer activity before this date |
| --dump [file] | | Output JSON to <file> or stdout if no file given |

Examples:

gitsema pr-report --file src/auth.ts
gitsema pr-report --ref1 main --ref2 feature/auth --dump report.json

gitsema eval <file> [options]

Retrieval evaluation harness — measures search quality (P@k, R@k, MRR, latency) against a JSONL file of evaluation cases.

Each line of the JSONL file must be: { "query": "...", "expectedPaths": ["src/foo.ts"] }
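One way to produce and sanity-check such a file is sketched below in Python. The helper names write_eval_cases and load_eval_cases are illustrative and not part of gitsema; only the query and expectedPaths fields come from the format above.

```python
import json
from pathlib import Path


def write_eval_cases(path, cases):
    """Write one JSON object per line (JSONL)."""
    with Path(path).open("w", encoding="utf-8") as fh:
        for case in cases:
            fh.write(json.dumps(case) + "\n")


def load_eval_cases(path):
    """Read a JSONL file and check that each line has the two documented fields."""
    cases = []
    with Path(path).open(encoding="utf-8") as fh:
        for lineno, line in enumerate(fh, start=1):
            if not line.strip():
                continue  # this sketch tolerates blank lines
            case = json.loads(line)
            if not isinstance(case, dict) or not isinstance(case.get("query"), str):
                raise ValueError(f"line {lineno}: 'query' must be a string")
            paths = case.get("expectedPaths")
            if not isinstance(paths, list) or not all(isinstance(p, str) for p in paths):
                raise ValueError(f"line {lineno}: 'expectedPaths' must be a list of strings")
            cases.append(case)
    return cases
```

Validating the file before running gitsema eval catches malformed lines early, independently of how the CLI itself reports them.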

Flag           Default  Description
-k, --top <n>  10       Top-k results per query
--dump [file]           Write full JSON results to <file> or stdout

gitsema eval eval-cases.jsonl --top 10
gitsema eval eval-cases.jsonl --dump eval-results.json
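For reference, the ranking metrics named above have standard definitions, sketched here for a single query. This is not gitsema's implementation: the function names are illustrative, and exact path matching plus a plain mean across cases are assumptions.

```python
def precision_at_k(ranked, expected, k):
    """P@k: fraction of the top-k result slots filled by expected paths."""
    expected = set(expected)
    hits = sum(1 for path in ranked[:k] if path in expected)
    return hits / k


def recall_at_k(ranked, expected, k):
    """R@k: fraction of the expected paths that appear in the top-k results."""
    expected = set(expected)
    if not expected:
        return 0.0
    return len(expected & set(ranked[:k])) / len(expected)


def reciprocal_rank(ranked, expected):
    """1 / rank of the first expected path, or 0.0 if none was returned."""
    expected = set(expected)
    for rank, path in enumerate(ranked, start=1):
        if path in expected:
            return 1.0 / rank
    return 0.0


def mean(values):
    """Plain average; MRR is the mean of reciprocal ranks over all cases."""
    values = list(values)
    return sum(values) / len(values) if values else 0.0
```

With ranked results ["a", "b", "c", "d"] and expected paths {"b", "d", "x"}, P@2 is 0.5, R@2 is 1/3, and the reciprocal rank is 0.5.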

Code Quality

gitsema code-review [options]

Semantic code review assistant. Compares the diff between two refs and surfaces analogous blobs from history — prior implementations, related patterns, and known-good precedents — to inform a review.

Options:
  --base <ref>          Base ref (default: main)
  --head <ref>          Head ref (default: HEAD)
  --diff-file <file>    Read diff from a file instead of computing from refs
  --top <n>             Analogues to show per hunk (default: 5)
  --threshold <n>       Minimum similarity score (default: 0.75)
  --format <fmt>        Output format: text (default) or json

gitsema code-review
gitsema code-review --base main --head feature/auth

Workflows

gitsema workflow run <template> [options]

Run a productized analysis workflow. Each template bundles multiple commands into a coherent, narrated report.

Template             Description
pr-review            Semantic PR review: diff, analogues, reviewer suggestions
incident             Incident triage: first-seen, change-points, bisect, experts
onboarding           Codebase orientation: clusters, experts, concept map
release-audit        Release readiness: health, debt, security, dead-concepts
ownership-intel      Ownership heatmap and contributor profiles
arch-drift           Architectural drift detection via cluster timeline
knowledge-portal     Knowledge discovery portal for a concept area
regression-forecast  Predict regression risk from semantic change signals

Options:
  --query <text>      Concept or topic to focus the workflow on
  --file <path>       File to analyze (used by pr-review)
  --base <ref>        Base git ref (used by pr-review, regression-forecast)
  --role <topic>      Alias for --query
  -k, --top <n>       Result limit per section (default: 5)
  --format <fmt>      Output format: markdown (default) or json
  --out <spec>        Output format (repeatable)
  --dump [file]       Output JSON to file or stdout

gitsema workflow run pr-review --base main
gitsema workflow run incident --query "payment timeout"
gitsema workflow run release-audit

gitsema workflow list

List all available workflow templates with short descriptions.

gitsema workflow list

Triage & Policy

gitsema triage <query> [options]

Incident triage bundle. Runs first-seen, change-points, semantic bisect, and expert suggestions in one pass, then assembles a structured report.

Options:
  --top <n>           Top results per section (default: 10)
  --ref1 <ref>        Earlier bound for bisect / change-points
  --ref2 <ref>        Later bound for bisect / change-points
  --file <path>       File to include semantic diff for
  --dump [file]       Output JSON
  --out <spec>        Output format (repeatable)

gitsema triage "payment timeout error"
gitsema triage "auth regression" --ref1 v2.0 --ref2 HEAD --dump triage.json

gitsema policy [subcommand] [options]

CI policy gates. Checks drift, debt, and security thresholds and exits non-zero when any gate fails — suitable for CI pipelines.

# Run all policy checks with defaults
gitsema policy check

# Override individual thresholds
gitsema policy check --max-debt-score 0.4 --max-drift 0.3

Options:
  --max-debt-score <n>      Fail if mean debt score exceeds this (default: 0.6)
  --min-security-score <n>  Fail if security similarity score drops below this
  --max-drift <n>           Fail if concept drift exceeds this threshold
  --query <q>               Query to scope drift and change-point checks
  --dump [file]             Output JSON report

The CLI exits with code 1 when any gate fails and 0 when all pass; the equivalent HTTP route returns 422 and 200 respectively.
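The gate logic described above amounts to comparing measured values against thresholds and mapping the outcome to an exit code. A minimal sketch, assuming hypothetical metric keys (debt_score, security_score, drift) and assuming that a gate without a documented default is skipped unless its threshold is given:

```python
def evaluate_gates(metrics, max_debt_score=0.6, min_security_score=None, max_drift=None):
    """Return (exit_code, failures): 1 if any enabled gate fails, else 0."""
    failures = []
    if max_debt_score is not None and metrics["debt_score"] > max_debt_score:
        failures.append(f"debt score {metrics['debt_score']} exceeds {max_debt_score}")
    if min_security_score is not None and metrics["security_score"] < min_security_score:
        failures.append(f"security score {metrics['security_score']} is below {min_security_score}")
    if max_drift is not None and metrics["drift"] > max_drift:
        failures.append(f"drift {metrics['drift']} exceeds {max_drift}")
    return (1 if failures else 0), failures
```

A CI wrapper would typically print the failures and pass the first element to sys.exit.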


gitsema ownership <query> [options]

Ownership heatmap. Shows which authors own blobs that are semantically related to a query, weighted by recency and volume.

Options:
  --top <n>           Top blobs to consider (default: 20)
  --window-days <n>   Rolling window for recency weighting
  --branch <name>     Restrict to this branch
  --dump [file]       Output JSON
  --out <spec>        Output format (repeatable)

gitsema ownership "authentication middleware"
gitsema ownership "database migrations" --window-days 90
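The README does not spell out the weighting formula. As one plausible reading, the sketch below sums each author's blob similarities with an exponential recency decay whose half-life equals --window-days; the decay form, the input tuple shape, and the name ownership_scores are all assumptions.

```python
from collections import defaultdict


def ownership_scores(blobs, window_days=90.0):
    """blobs: iterable of (author, similarity, age_days). Returns author -> share of ownership."""
    raw = defaultdict(float)
    for author, similarity, age_days in blobs:
        recency = 0.5 ** (age_days / window_days)  # assumed decay: weight halves every window
        raw[author] += similarity * recency        # summing makes volume count as well
    total = sum(raw.values())
    if total == 0:
        return {}
    return {author: score / total for author, score in raw.items()}
```

Under this assumed weighting, a blob that is one window old counts half as much as a blob committed today.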

Search Performance & AI Reliability

--early-cut <n> (on gitsema search)

Limits the candidate pool to n randomly-sampled blobs before scoring. Useful for very large indexes (>100K blobs) to trade recall for speed.

gitsema search "authentication middleware" --early-cut 5000
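The trade-off can be illustrated with a small brute-force search: sampling shrinks the number of vectors scored, at the cost of possibly dropping the best match. This is a sketch under assumptions (an in-memory dict of blob hash to vector, cosine scoring, uniform sampling), not gitsema's actual search path.

```python
import math
import random


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def search(query_vec, index, top=10, early_cut=None, rng=random):
    """index: dict of blob hash -> vector. Optionally sample the pool before scoring."""
    hashes = list(index)
    if early_cut is not None and early_cut < len(hashes):
        hashes = rng.sample(hashes, early_cut)  # random subset: faster, lower recall
    scored = [(h, cosine(query_vec, index[h])) for h in hashes]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top]
```

Scoring cost grows with the size of the candidate pool, which is why a cut of 5000 on an index of several hundred thousand blobs can be much faster while occasionally missing a relevant blob.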

--explain-llm (on gitsema search)

Outputs a provenance citation block for each result, formatted for injection into LLM prompts. Each block includes the file path, blob hash, first-seen date, score signals, and a content snippet.

gitsema search "authentication middleware" --explain-llm
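The exact block layout is not documented here. The sketch below shows one hypothetical way to render the listed fields into a prompt-ready block; the dictionary keys, the [source] label, and the citation_block name are assumptions.

```python
def citation_block(result, snippet_chars=200):
    """Render one search result as a plain-text provenance block for an LLM prompt."""
    signals = ", ".join(f"{name}={value:.3f}" for name, value in sorted(result["signals"].items()))
    return "\n".join([
        f"[source] {result['path']}",
        f"blob: {result['blob_hash']}",
        f"first seen: {result['first_seen']}",
        f"signals: {signals}",
        "snippet:",
        result["content"][:snippet_chars],  # truncate long content to keep prompts small
    ])
```

Keeping the blob hash in every block lets a downstream tool verify a citation against the index, since the hash identifies the content regardless of path.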

--profile <name> (on gitsema index start)

Applies a preset indexing profile that sets coherent defaults for concurrency, embed batch size, and chunker strategy.

Profile   Concurrency  Batch size  Chunker   Best for
speed     8            32          file      Fast indexing on fast hardware
balanced  4            16          file      Default (auto-tuned)
quality   2            4           function  Deep chunk/symbol indexing

gitsema index start --profile speed
gitsema index start --profile quality
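The presets in the table can be modelled as a lookup with optional overrides. In this sketch the values come from the table above, while the setting names and the rule that explicitly passed values win over the preset are assumptions.

```python
PROFILES = {
    "speed":    {"concurrency": 8, "batch_size": 32, "chunker": "file"},
    "balanced": {"concurrency": 4, "batch_size": 16, "chunker": "file"},
    "quality":  {"concurrency": 2, "batch_size": 4,  "chunker": "function"},
}


def resolve_settings(profile="balanced", **overrides):
    """Start from the preset, then apply any explicitly provided (non-None) overrides."""
    if profile not in PROFILES:
        raise ValueError(f"unknown profile: {profile!r}")
    settings = dict(PROFILES[profile])
    settings.update({key: value for key, value in overrides.items() if value is not None})
    return settings
```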

--out <spec> — unified output (most commands)

Most commands support --out for controlling output format. The flag is repeatable so you can emit multiple formats at once.

Value            Description
text             Human-readable terminal output (default)
json             JSON to stdout
json:<file>      JSON written to <file>
html             Interactive HTML to stdout
html:<file>      Interactive HTML written to <file>
markdown         Markdown to stdout
markdown:<file>  Markdown written to <file>

gitsema search "auth" --out json:results.json --out text
gitsema clusters --out html:clusters.html
gitsema evolution "error handling" --out markdown:report.md

--dump [file] is a legacy alias for --out json[:file] and is still accepted.
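A spec is a format name optionally followed by a colon and a file path. A minimal parser sketch, assuming only the forms listed in the table are valid (so text takes no file); parse_out_spec and dump_to_out_spec are hypothetical helper names:

```python
FORMATS_WITH_FILE = {"json", "html", "markdown"}
FORMATS = FORMATS_WITH_FILE | {"text"}


def parse_out_spec(spec):
    """Split 'format[:file]' into (format, file or None)."""
    fmt, sep, target = spec.partition(":")  # split on the first colon only
    if fmt not in FORMATS:
        raise ValueError(f"unknown output format: {fmt!r}")
    if sep and (not target or fmt not in FORMATS_WITH_FILE):
        raise ValueError(f"invalid output spec: {spec!r}")
    return fmt, (target or None)


def dump_to_out_spec(file=None):
    """Translate the legacy --dump [file] flag into its --out equivalent."""
    return "json" if file is None else f"json:{file}"
```

Because the flag is repeatable, a command line such as --out json:results.json --out text would yield two parsed specs, one per occurrence.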


--narrate — LLM summaries (most commands)

Appending --narrate to any supporting command generates a plain-language narrative summary of the results using an LLM. Configure the LLM endpoint with GITSEMA_LLM_URL (OpenAI-compatible).

gitsema evolution "authentication" --narrate
gitsema clusters --narrate
gitsema health --narrate

GET /api/v1/capabilities (HTTP server)

Returns a machine-readable JSON manifest of all features supported by the running server, including version, provider models, and enabled features. Useful for client auto-configuration.

curl http://localhost:4242/api/v1/capabilities

HTTP Analysis Routes (POST /api/v1/analysis/...)

All analysis commands available in the CLI are also exposed over HTTP. Authentication via GITSEMA_SERVE_KEY applies to all routes.

Route                               Description                                                             Key request fields
POST /analysis/clusters             K-means cluster snapshot                                                k, topKeywords, branch
POST /analysis/change-points        Concept change-point detection                                          query, topK, threshold
POST /analysis/author               Author attribution for a concept                                        query, topK, topAuthors
POST /analysis/impact               Cross-module coupling for a file                                        file, topK
POST /analysis/semantic-diff        Semantic diff between two refs                                          ref1, ref2, query
POST /analysis/semantic-blame       Semantic origin of code blocks                                          filePath, content, topK
POST /analysis/dead-concepts        Deleted semantic blobs                                                  topK, since
POST /analysis/merge-audit          Semantic collision detection before merge                               branchA, branchB, threshold
POST /analysis/merge-preview        Merge semantic impact preview                                           branch, into
POST /analysis/branch-summary       Branch semantic summary vs base                                         branch, baseBranch
POST /analysis/experts              Reviewer / expert suggestions                                           topN, since, until
POST /analysis/security-scan        Vulnerability pattern similarity scan                                   top
POST /analysis/health               Time-bucketed codebase health timeline                                  buckets, branch
POST /analysis/debt                 Technical debt scoring                                                  top, branch
POST /analysis/doc-gap              Documentation gap analysis                                              top, threshold, branch
POST /analysis/contributor-profile  Contributor semantic profile                                            author, top, branch
POST /analysis/triage               Incident triage bundle (first-seen + change-points + bisect + experts)  query, top, ref1, ref2, file
POST /analysis/policy-check         Automated CI gate (debt / security / drift thresholds)                  maxDebtScore, minSecurityScore, maxDrift, query
POST /analysis/ownership            Ownership heatmap by concept                                            query, top, windowDays
POST /analysis/workflow             Workflow template runner                                                template (pr-review|incident|release-audit), query, file, top
POST /analysis/eval                 Inline retrieval evaluation (P@k, R@k, MRR)                             cases (array of {query, expectedPaths}), top
POST /analysis/multi-repo-search    Search across multiple registered repos                                 query, repoIds, topK

Note on security-scan: Results are semantic similarity scores, not confirmed vulnerabilities. Always perform manual review.

Note on policy-check: Returns HTTP 200 when all gates pass, 422 when any gate fails — convenient for CI integration.
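A CI job could call the policy-check route and turn the status code into an exit code. The sketch below builds the request with Python's standard library; the port is taken from the capabilities example above, the field names from the table, and sending GITSEMA_SERVE_KEY as a Bearer token is an assumption that should be checked against the server documentation.

```python
import json
import urllib.request


def build_policy_check_request(base_url, serve_key=None, **gates):
    """Build (but do not send) a POST request for /api/v1/analysis/policy-check."""
    headers = {"Content-Type": "application/json"}
    if serve_key:
        # Assumption: the key is sent as a Bearer token; the README does not say how.
        headers["Authorization"] = f"Bearer {serve_key}"
    return urllib.request.Request(
        f"{base_url.rstrip('/')}/api/v1/analysis/policy-check",
        data=json.dumps(gates).encode("utf-8"),
        headers=headers,
        method="POST",
    )


def ci_exit_code(http_status):
    """Map the documented status codes to a process exit code."""
    if http_status == 200:
        return 0  # all gates passed
    if http_status == 422:
        return 1  # at least one gate failed
    raise RuntimeError(f"unexpected HTTP status: {http_status}")
```

Note that urllib.request.urlopen raises urllib.error.HTTPError for a 422 response, so a caller would catch that exception and pass its code attribute to ci_exit_code.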


Automated Indexing (Git Hooks)

You can keep the semantic index in sync with your repository automatically by installing the provided Git hook scripts. Once installed, gitsema index start runs in the background after every git commit and every git pull / git merge, with no manual intervention required.

How it works

Hook         Trigger                           Command run
post-commit  After every git commit            gitsema index start --since HEAD~1
post-merge   After every git pull / git merge  gitsema index start --since ORIG_HEAD

Both hooks are safe no-ops when:

  • gitsema is not on your PATH, or
  • the index has not been initialised yet (run gitsema index start once first).
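The guard logic can be sketched as follows. The shipped hooks are shell scripts whose exact checks are not shown here; this Python version assumes that an initialised index means .gitsema/index.db exists, and hook_command is a hypothetical name.

```python
import shutil
from pathlib import Path


def hook_command(repo_root, since_ref, which=shutil.which):
    """Return the indexing command to run, or None when the hook should do nothing."""
    if which("gitsema") is None:
        return None  # gitsema is not on PATH
    if not (Path(repo_root) / ".gitsema" / "index.db").exists():
        return None  # assumed signal that the index was never initialised
    return ["gitsema", "index", "start", "--since", since_ref]
```

A post-commit hook would pass HEAD~1 as since_ref and a post-merge hook ORIG_HEAD, matching the table above.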

Installation (manual)

Copy the scripts into your repository's .git/hooks/ directory and make them executable:

cp scripts/hooks/post-commit  .git/hooks/post-commit
cp scripts/hooks/post-merge   .git/hooks/post-merge
chmod +x .git/hooks/post-commit .git/hooks/post-merge

Alternatively, use symlinks so the scripts stay in sync whenever you pull updates to the scripts/hooks/ directory:

ln -s ../../scripts/hooks/post-commit  .git/hooks/post-commit
ln -s ../../scripts/hooks/post-merge   .git/hooks/post-merge

Toggle via gitsema config

The gitsema config command can install or remove the hooks automatically — no manual file copying required:

# Install hooks for the current repository (symlinks into .git/hooks/)
gitsema config set hooks.enabled true

# Remove the managed hooks
gitsema config set hooks.enabled false

The config value is persisted in .gitsema/config.json. Git does not copy .git/hooks/ when cloning, so after a re-clone run gitsema config set hooks.enabled true again to reinstall the hooks. The manual copy/symlink steps above remain a valid alternative if you prefer not to use the config command.


Data storage

The index is stored in .gitsema/index.db (SQLite) in the root of the repository. Add it to .gitignore to avoid committing it:

.gitsema/
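If you script repository setup, the ignore entry can be added idempotently. A small sketch (ensure_ignored is a hypothetical helper, not a gitsema command):

```python
from pathlib import Path


def ensure_ignored(repo_root, entry=".gitsema/"):
    """Append `entry` to .gitignore unless it is already listed. Returns True if added."""
    gitignore = Path(repo_root) / ".gitignore"
    existing = gitignore.read_text(encoding="utf-8") if gitignore.exists() else ""
    if entry in (line.strip() for line in existing.splitlines()):
        return False
    # Start on a fresh line if the file does not end with a newline.
    prefix = "" if existing == "" or existing.endswith("\n") else "\n"
    with gitignore.open("a", encoding="utf-8") as fh:
        fh.write(f"{prefix}{entry}\n")
    return True
```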

Feature catalog

See docs/features.md for the complete, grouped catalog of implemented features including indexing options, all search flags, history/temporal commands, clustering, branch/merge tools, the HTTP API route list, and all MCP tools.


Strategic review

For the latest deep review of bottlenecks, missing features, productization ideas, and AI-assisted coding workflows, see docs/review7.md.


AI skill

A reusable AI-operator playbook is available at skill/gitsema-ai-assistant.md. Use it as a prompt scaffold for coding assistants that interact with gitsema.


Roadmap / Plans

See docs/PLAN.md for the full development roadmap, phase history, and backlog of planned features.
