Semantic linting CLI for AI-generated code redundancy
Echo-Guard is a semantic linting CLI designed to catch the subtle, functional duplication that AI coding agents often introduce.
Unlike traditional linters that focus on syntax errors or style, Echo-Guard analyzes the logic and intent of your code. It identifies "echoes"—blocks of code that perform the same task but might look slightly different—across your entire project, regardless of the file or service they live in.
AI-assisted development (Cursor, Claude Code, Copilot) is incredibly fast, but it has a "memory" problem. Agents often generate fresh code for a task that has already been solved elsewhere in your codebase.
Use Echo-Guard to:
- Kill Hidden Redundancy: Catch duplicate business logic that "grep" or simple string matching would miss.
- Prevent "AI Rot": Stop your codebase from bloating with slightly different versions of the same utility functions.
- Keep Your Data Local: Built for privacy-conscious teams. Echo-Guard runs entirely on your machine—no code is ever uploaded for analysis. Optional, consent-based feedback sharing improves detection for everyone.
- Scale Across Languages: Maintain a DRY (Don't Repeat Yourself) architecture even in polyglot repositories.
pip install "echo-guard[languages,mcp]"
To upgrade:
pip install --upgrade "echo-guard[languages,mcp]"
echo-guard setup
The setup wizard handles everything:
- Directory selection — choose which directories to scan (interactive arrow-key selector)
- Language detection — auto-detects languages in your selected directories
- MCP registration — detects Claude Code and registers the MCP server automatically
- GitHub Action — optionally generates .github/workflows/echo-guard-ci.yml for PR checks
- Initial index + scan — indexes your codebase and runs the first scan
- Data sharing — choose your feedback consent level (defaults based on repo visibility)
One command, fully configured. The wizard generates echo-guard.yml with all settings.
If you prefer to skip the wizard:
echo-guard index # Index your codebase
echo-guard scan # Scan for duplicates
echo-guard review # Walk through findings interactively
echo-guard add-mcp # Register MCP server with Claude Code
echo-guard add-action # Generate GitHub Action for PR checks
Echo Guard — Scan Results
18 EXTRACT · 28 REVIEW (892 raw pairs)
Top refactoring targets:
fetchJson() — 13 copies
timeAgo() — 4 copies
schemaTypes() — 4 copies
━━━ EXTRACT NOW (18) ━━━
3+ copies — real DRY violations
● #1 T1/T2 Exact — fetchJson() x13
components/UserList.tsx:10 fetchJson()
components/TeamList.tsx:8 fetchJson()
lib/api.ts:15 fetchJson()
...
→ Extract to shared module under lib/
━━━ WORTH NOTING (28) ━━━
2 exact copies — fix if complex, defer per Rule of Three
● #1 T1/T2 Exact — validate_email() (100%)
services/auth/utils.py:12 → import from services/user/validators.py:8
Echo Guard uses a two-tier detection pipeline:
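Tier 1 (structural hashing): Tree-sitter parses each function, normalizes identifiers, and computes a structural hash. Two functions with the same hash are exact or renamed clones. The pass is O(n) in the number of functions, with 100% recall and zero false positives for these clone types.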
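The normalize-then-hash idea can be illustrated with a minimal sketch. It is not Echo Guard's implementation: it uses Python's built-in ast module as a stand-in for Tree-sitter, handles only Python source, and the names structural_hash, _Normalizer, and _local_names are hypothetical. Parameters and assigned variables are renamed to placeholders, while builtins and other free names are left alone so that len(x) and max(x) still hash differently.

```python
import ast
import hashlib


def _local_names(tree):
    """Collect parameters and assigned variables in a deterministic traversal order."""
    names = []
    for node in ast.walk(tree):
        if isinstance(node, ast.arg):
            name = node.arg
        elif isinstance(node, ast.Name) and isinstance(node.ctx, ast.Store):
            name = node.id
        else:
            continue
        if name not in names:
            names.append(name)
    return names


class _Normalizer(ast.NodeTransformer):
    """Rename local identifiers to placeholders (v0, v1, ...)."""

    def __init__(self, local_names):
        self.mapping = {name: f"v{i}" for i, name in enumerate(local_names)}

    def visit_FunctionDef(self, node):
        node.name = "f"  # the function's own name should not affect the hash
        self.generic_visit(node)
        return node

    def visit_arg(self, node):
        node.arg = self.mapping.get(node.arg, node.arg)
        node.annotation = None  # ignore type annotations on parameters
        return node

    def visit_Name(self, node):
        node.id = self.mapping.get(node.id, node.id)
        return node


def structural_hash(source):
    """Hash the normalized AST; equal hashes mean exact or renamed clones."""
    tree = ast.parse(source)
    normalized = _Normalizer(_local_names(tree)).visit(tree)
    return hashlib.sha256(ast.dump(normalized).encode("utf-8")).hexdigest()
```

Because each function is hashed once and clones are found by grouping equal hashes, the work grows linearly with the number of functions.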
Tier 2 (embedding similarity): A configurable code encoder (default: CodeSage-small; CodeSage-base and UniXcoder are also supported) encodes each function into an embedding vector. Cosine similarity search finds modified clones (same structure, different statements) and semantic clones (same intent, completely different implementation). Encoding takes ~15ms per function, and search takes ~2ms at 100K functions.
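As a rough illustration of the similarity search, the sketch below scores a query vector against a small in-memory index using cosine similarity. The three-dimensional vectors are toy values rather than real model output, the 0.85 threshold is an assumed number chosen for the example, and find_similar is a hypothetical helper. A brute-force loop like this is only meant to show the scoring step, not how search at 100K functions is achieved.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def find_similar(query, index, threshold=0.85):
    """Return (name, score) pairs at or above the threshold, best match first."""
    scored = [(name, cosine_similarity(query, vec)) for name, vec in index.items()]
    matches = [pair for pair in scored if pair[1] >= threshold]
    return sorted(matches, key=lambda pair: pair[1], reverse=True)


# Toy embeddings (assumed values, not produced by any real encoder).
index = {
    "fetch_json": [0.9, 0.1, 0.0],
    "load_json": [0.8, 0.2, 0.1],
    "time_ago": [0.0, 0.1, 0.9],
}
matches = find_similar([0.9, 0.1, 0.0], index)
```

Here the two JSON helpers score close to the query and time_ago falls well below the threshold.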
Intent filters suppress structural false positives (CRUD boilerplate, UI wrapper patterns, observer callbacks, framework-required exports) after candidates are found.
Severity is based on actionability, not just clone confidence:
| Severity | Meaning | CI Behavior |
|---|---|---|
| extract | 3+ copies, or multiple duplicates in the same file — extract to shared module | Fails fail_on: extract |
| review | 2 copies — worth noting, defer per Rule of Three | Fails fail_on: review |
Report sections are grouped by action type: Extract Now (extract), Worth Noting (review), Cross-Service, and Cross-Language.
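The severity rules and the fail_on gate can be expressed as a small sketch. The functions classify and fails_ci are hypothetical, and reading "multiple duplicates in the same file" as "at least two copies share a file path" is an assumption made for this example.

```python
SEVERITY_RANK = {"review": 1, "extract": 2}


def classify(paths):
    """Classify one duplicate group from the file path of each copy."""
    same_file = len(set(paths)) < len(paths)  # assumed reading of "same file"
    return "extract" if len(paths) >= 3 or same_file else "review"


def fails_ci(severities, fail_on="extract"):
    """True if any finding is at or above the fail_on severity."""
    if fail_on == "none":  # advisory only
        return False
    threshold = SEVERITY_RANK[fail_on]
    return any(SEVERITY_RANK[s] >= threshold for s in severities)
```

For example, classify(["services/auth/utils.py", "services/user/validators.py"]) gives "review", which would pass a check configured with fail_on: extract.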
Echo Guard ships a first-class VS Code extension that provides real-time duplicate detection directly in the editor.
- Install the echo-guard Python package: pip install "echo-guard[languages]"
- Install the extension from the VS Code Marketplace (search "Echo Guard")
- Open a workspace — the extension activates automatically when echo-guard.yml is present
- Real-time squiggles — diagnostics update 1.5s after each file save (configurable debounce)
- Code actions (Ctrl+.) — mark as intentional, dismiss, jump to duplicate, show side-by-side diff, or send to AI for refactoring
- Findings tree view — sidebar panel showing redundancy clusters grouped by severity, with top refactoring targets and hotspot files
- Review panel — "Echo Guard: Review All Findings" webview with severity badges, clone types, similarity scores, and inline verdicts
- Cross-language CodeLens — grey annotations above functions showing matches in other languages (e.g., "↔ Python: handler() in file.py:42")
- Status bar — shows daemon state (Starting/Indexing/Ready/Stopped) with finding count; click to open review panel
- Branch-switch reindex — watches .git/HEAD and automatically reindexes when you switch branches
- Periodic reindex — incremental reindex every 5 minutes to catch external changes
The extension spawns a long-lived Python daemon (echo-guard daemon) that communicates via JSON-RPC 2.0 over stdin/stdout. The daemon holds the function index and ONNX model in memory, keeping per-save checks under 500ms. It auto-restarts with exponential backoff (max 5 restarts) if it crashes.
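To make the two mechanisms in this paragraph concrete, here is a sketch of a JSON-RPC 2.0 request envelope and an exponential backoff schedule capped at five restarts. The envelope fields follow the JSON-RPC 2.0 specification, but everything else is assumed for illustration: the method name check_file, its params, the one-JSON-object-per-line framing, and the 1-second base delay with a doubling factor are not taken from Echo Guard.

```python
import json
from itertools import count

_request_ids = count(1)


def make_request(method, params):
    """Serialize a JSON-RPC 2.0 request as a single line of JSON."""
    return json.dumps({
        "jsonrpc": "2.0",
        "id": next(_request_ids),
        "method": method,  # hypothetical method name, chosen by the caller
        "params": params,
    })


def restart_delays(max_restarts=5, base_seconds=1.0, factor=2.0):
    """Delay before each restart attempt; base and factor are assumed values."""
    return [base_seconds * factor ** attempt for attempt in range(max_restarts)]


line = make_request("check_file", {"path": "lib/api.ts"})
delays = restart_delays()
```

With these assumed defaults the delays are 1, 2, 4, 8, and 16 seconds, after which the supervisor would stop retrying.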
The "Send to AI" action composes a refactoring prompt with both function sources, caller information, and consolidation guidance, then sends it to the terminal (Claude Code / Codex) or copies it to the clipboard.
When the VS Code extension is running, the MCP server routes resolve_finding calls through the daemon, so when an AI agent marks a finding as resolved, the VS Code diagnostic clears immediately. The recheck_file MCP tool re-checks a file after an agent modifies it.
Echo Guard includes a built-in MCP server so AI agents can check for duplicates before generating new functions. Supported agents:
- Claude Code — auto-detected and registered via claude mcp add
- Codex — auto-detected and registered via codex mcp add
The MCP server is registered automatically during echo-guard setup, or manually via echo-guard add-mcp. It provides:
| Tool | Description |
|---|---|
| check_for_duplicates | Check code for duplicates (before/after writing) |
| resolve_finding | Record verdict: resolved, intentional, or dismissed |
| recheck_file | Re-check a file after it's been modified (syncs VS Code too) |
| respond_to_probe | Evaluate a low-confidence match for training data |
| get_finding_resolutions | View resolution history and stats |
| search_functions | Search index by function name, keyword, or language |
| suggest_refactor | Get consolidation suggestions for two functions |
| get_index_stats | View index statistics |
| get_codebase_clusters | Understand code grouping by dependency domain |
| ping | Health check (returns "pong") |
Manual MCP registration
# Claude Code
claude mcp add echo-guard -- python -m echo_guard.mcp_server
# Codex
codex mcp add echo-guard -- python -m echo_guard.mcp_server
Python, JavaScript, TypeScript, Go, Rust, Java, Ruby, C, C++
Cross-language matching is supported.
| Command | Description |
|---|---|
| echo-guard setup | Interactive setup wizard |
| echo-guard scan | Scan for redundant code |
| echo-guard scan -v | Show detailed match table |
| echo-guard check FILES | Check specific files (fast path for pre-commit) |
| echo-guard review | Interactive review of all findings |
| echo-guard index | Index codebase (incremental; --full for rebuild) |
| echo-guard watch | Watch files in real time |
| echo-guard health | Codebase health score (A-F grade, --history) |
| echo-guard stats | Index statistics and dependency graph info |
| echo-guard languages | List supported languages and file extensions |
| echo-guard add-mcp | Register MCP server (Claude/Codex) |
| echo-guard add-action | Generate GitHub Action workflow |
| echo-guard install-hook | Install pre-commit hook configuration |
| echo-guard daemon | Start JSON-RPC daemon (for VS Code extension) |
| echo-guard acknowledge | Acknowledge a single finding by ID |
| echo-guard prune | Remove stale finding suppressions |
| echo-guard consent | View or change feedback data sharing level |
| echo-guard feedback-preview | Preview exactly what data would be uploaded |
| echo-guard training-data | View/export collected training data |
| echo-guard clear-index | Clear index |
Everything lives in echo-guard.yml, generated by echo-guard setup:
# Detection settings
min_function_lines: 3 # Skip functions shorter than this
max_function_lines: 500 # Skip functions longer than this
# Embedding model (default: codesage-small)
# model: codesage-base # Higher Type-4 recall, ~3x slower (~341MB)
# model: unixcoder # Legacy (768-dim, ~125MB)
# Languages to scan
languages:
- python
- javascript
- typescript
# CI behavior (used by GitHub Action)
fail_on: extract # extract, review, or none
# Directories to exclude from scanning
ignore:
- docs/
- tests/
- benchmarks/
# Service boundaries for monorepo-aware suggestions
# service_boundaries:
# - services/worker
# - services/dashboard
# Data sharing: public (code pairs), private (features only), none
feedback_consent: private
# Acknowledged findings — suppressed in CI and future scans
# Run `echo-guard review` to add entries interactively
acknowledged:
- echo_guard/cli.py:scan||echo_guard/cli.py:check

| Setting | Default | Description |
|---|---|---|
| min_function_lines | 3 | Functions shorter than this are skipped (getters, one-liners). |
| max_function_lines | 500 | Functions longer than this are skipped (generated code, data dumps). |
| model | codesage-small | Embedding model: codesage-small (default, best Type-3 recall), codesage-base (higher Type-4 recall, ~3x slower), unixcoder (768-dim, legacy), or a local path to a fine-tuned model. |
| languages | all 9 | Which languages to scan. Restricting this speeds up indexing. |
| fail_on | extract | Minimum severity that fails the CI check. none = advisory only. |
| ignore | [] | Directories/patterns to exclude from scanning (gitignore-style). |
| feedback_consent | smart default | public (public repos), private (private repos), or none. Controls what feedback data is shared to improve detection. |
| acknowledged | [] | Finding IDs that have been reviewed and accepted. These are suppressed in CI and in echo-guard review. |
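Two of these settings lend themselves to a short sketch: the function-length limits and the acknowledged list. The ID format is taken from the sample entry in echo-guard.yml (two path:function halves joined by "||"). The helper names are hypothetical, and treating the two halves as order-insensitive when matching is an assumption, not documented behavior.

```python
def within_size_limits(line_count, min_lines=3, max_lines=500):
    """Keep functions whose length is inside the configured limits."""
    return min_lines <= line_count <= max_lines


def parse_finding_id(finding_id):
    """Split 'path:func||path:func' into two (path, function) tuples."""
    left, right = finding_id.split("||")
    return tuple(tuple(side.rsplit(":", 1)) for side in (left, right))


def is_acknowledged(pair, acknowledged):
    """True if the pair matches an acknowledged ID, in either order (assumed)."""
    wanted = frozenset(pair)
    return any(frozenset(parse_finding_id(fid)) == wanted for fid in acknowledged)


acknowledged = ["echo_guard/cli.py:scan||echo_guard/cli.py:check"]
pair = (("echo_guard/cli.py", "check"), ("echo_guard/cli.py", "scan"))
suppressed = is_acknowledged(pair, acknowledged)
```

Splitting each half with rsplit(":", 1) keeps any colons in the path portion attached to the path.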
Local artifacts are stored in .echo-guard/ (gitignored):
.echo-guard/
├── index.duckdb # Function metadata and training data
├── embeddings.npy # Code embedding vectors
├── embedding_meta.json # Embedding store metadata
├── scan-results.txt # Latest scan report
└── model_cache/ # Cached ONNX model (~200MB for CodeSage-small, downloaded on first use)
Generated automatically by echo-guard setup, or add manually to .github/workflows/echo-guard-ci.yml:
name: Echo Guard
on: [pull_request]
permissions:
  contents: read
  pull-requests: write
jobs:
  echo-guard:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - uses: jwizenfeld04/[email protected] # Pin to your installed version
        with:
          fail-on: "extract" # Only 3+ copy DRY violations fail the check
          comment: "true"

Tip: Pin the action version to match your installed echo-guard version. Run echo-guard --version to check.
When Echo Guard flags intentional duplication that blocks your PR:
echo-guard review
This walks through each finding with code previews:
- a = acknowledge (intentional duplication, suppress in CI)
- f = false positive (not a real clone, suppress and record as training data)
- s = skip (leave unresolved)
Acknowledged findings are saved to the acknowledged list in echo-guard.yml. Commit the file to suppress them in future CI runs.
Echo Guard runs entirely on your machine — the embedding model, AST analysis, and all detection happen locally via ONNX Runtime. No code is sent anywhere for analysis.
Echo Guard collects two kinds of anonymous data to improve detection quality. You choose your
sharing level during echo-guard setup:
Scan events (both tiers) — aggregate counts after every scan and check: total findings,
severity breakdown, function count. No code, paths, or names — just numbers.
Verdict feedback — when you review findings (mark as true positive, false positive, or ignore):
| Level | What's shared | What's NOT shared |
|---|---|---|
| Public (default for public repos) | Scan events + anonymized code pairs + your verdict | File paths, repo name, function names |
| Private (default for private repos) | Scan events + structural features only: language, line counts, param counts, similarity score, verdict | Source code, file paths, function names — nothing that could identify your code |
| None | Nothing. All data stays local. | — |
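One way to picture the three levels is as a filter over a verdict record. The sketch below is illustrative only: the field names are invented for the example and the real field-level schema lives in docs/FEEDBACK_SCHEMA.md. It follows the table literally, so the public tier carries the anonymized code pair plus verdict, and the private tier carries structural features only.

```python
# Hypothetical field names; consult docs/FEEDBACK_SCHEMA.md for the real schema.
TIER_FIELDS = {
    "public": ("anonymized_code_pair", "verdict"),
    "private": ("language", "line_count", "param_count", "similarity", "verdict"),
}


def payload_for_tier(record, tier):
    """Return only the fields a consent tier allows, or None when sharing is off."""
    if tier == "none":
        return None
    if tier not in TIER_FIELDS:
        raise ValueError(f"unknown consent tier: {tier}")
    return {key: record[key] for key in TIER_FIELDS[tier]}


record = {
    "file_path": "services/auth/utils.py",
    "function_name": "validate_email",
    "anonymized_code_pair": ["def f(v0): ...", "def f(v0): ..."],
    "language": "python",
    "line_count": 12,
    "param_count": 1,
    "similarity": 1.0,
    "verdict": "true_positive",
}
private_payload = payload_for_tier(record, "private")
```

Using an allow-list per tier means a newly added record field is never shared unless it is explicitly listed.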
What this data is used for:
- Understand detection volume and noise levels (scan events)
- Calibrate per-language similarity thresholds (private tier is sufficient)
- Train a false-positive classifier on real decision patterns (private tier)
- Fine-tune the CodeSage embedding model on real clone pairs (requires public tier)
Transparency guarantees:
- Run echo-guard feedback-preview to see exactly what would be uploaded
- Run echo-guard consent to view or change your tier at any time
- Uploads are logged: "↑ 3 feedback records uploaded" appears after each session
- Set DO_NOT_TRACK=1 or ECHO_GUARD_NO_UPLOAD=1 to disable uploads via environment
- All collection code is open source in echo_guard/upload.py
- Full field-level schema in docs/FEEDBACK_SCHEMA.md
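The environment opt-out can be sketched as a single check. This is an assumed shape rather than the code in echo_guard/upload.py: the function name uploads_disabled is hypothetical, and the sketch only recognizes the exact value "1" shown above.

```python
import os


def uploads_disabled(env=None):
    """True if either opt-out variable is set to "1"."""
    env = os.environ if env is None else env
    return env.get("DO_NOT_TRACK") == "1" or env.get("ECHO_GUARD_NO_UPLOAD") == "1"
```

Passing a plain dict as env makes the check easy to exercise without touching the real process environment.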
No cloud dependencies for core functionality: scanning, indexing, and detection never require network access. Data sharing is optional, and private repos default to the features-only tier, which shares no source code.
- GitHub Action — PR annotations, summary comments, severity-based gating
- Semantic detection — CodeSage-small embeddings for Type-3/Type-4 clone detection
- Intent-aware filtering — domain-aware rules suppress CRUD boilerplate, UI wrappers, and observer patterns; severity is DRY-based
- VS Code extension — Real-time diagnostics, findings tree, code actions, AI refactoring, daemon architecture
- Consent-based feedback — Three-tier data sharing with smart defaults, automatic uploads, transparency tools
- Intra-function detection — Block-level clone detection within function bodies (sliding window AST matching)
- Finding history — Track finding lifecycle, stale detection, regression alerts, trend dashboard
- v1.0 publishing — VS Code Marketplace + GitHub Marketplace once feature set is stable
See ROADMAP.md for the full plan with details and rationale.
- Architecture — Two-tier detection pipeline, clone types, storage, scaling
- Benchmarks — BigCloneBench, GPTCloneBench, POJ-104 results
- Roadmap — Development phases and planned features
- Changelog
MIT
