Echo-Guard

Semantic linting CLI for AI-generated code redundancy

What is Echo-Guard?

Echo-Guard is a semantic linting CLI designed to catch the subtle, functional duplication that AI coding agents often introduce.

Unlike traditional linters that focus on syntax errors or style, Echo-Guard analyzes the logic and intent of your code. It identifies "echoes"—blocks of code that perform the same task but might look slightly different—across your entire project, regardless of the file or service they live in.

Why Echo-Guard?

AI-assisted development (Cursor, Claude Code, Copilot) is incredibly fast, but it has a "memory" problem. Agents often generate fresh code for a task that has already been solved elsewhere in your codebase.

Use Echo-Guard to:

Kill Hidden Redundancy: Catch duplicate business logic that "grep" or simple string matching would miss.
Prevent "AI Rot": Stop your codebase from bloating with slightly different versions of the same utility functions.
Keep Your Data Local: Built for privacy-conscious teams. Echo-Guard runs entirely on your machine—no code is ever uploaded for analysis. Optional, consent-based feedback sharing improves detection for everyone.
Scale Across Languages: Maintain a DRY (Don't Repeat Yourself) architecture even in polyglot repositories.

Install

```
pip install "echo-guard[languages,mcp]"
```

To upgrade:

```
pip install --upgrade "echo-guard[languages,mcp]"
```

Getting Started

```
echo-guard setup
```

The setup wizard handles everything:

Directory selection — choose which directories to scan (interactive arrow-key selector)
Language detection — auto-detects languages in your selected directories
MCP registration — detects Claude Code and registers the MCP server automatically
GitHub Action — optionally generates .github/workflows/echo-guard-ci.yml for PR checks
Initial index + scan — indexes your codebase and runs the first scan
Data sharing — choose your feedback consent level (defaults based on repo visibility)

One command, fully configured. The wizard generates echo-guard.yml with all settings.

Manual workflow

If you prefer to skip the wizard:

```
echo-guard index        # Index your codebase
echo-guard scan         # Scan for duplicates
echo-guard review       # Walk through findings interactively
echo-guard add-mcp      # Register MCP server with Claude Code
echo-guard add-action   # Generate GitHub Action for PR checks
```

Example Output

Echo Guard — Scan Results

  18 EXTRACT · 28 REVIEW  (892 raw pairs)

  Top refactoring targets:
    fetchJson()  —  13 copies
    timeAgo()  —  4 copies
    schemaTypes()  —  4 copies

  ━━━ EXTRACT NOW (18) ━━━
  3+ copies — real DRY violations

  ● #1  T1/T2 Exact — fetchJson() x13
       components/UserList.tsx:10  fetchJson()
       components/TeamList.tsx:8  fetchJson()
       lib/api.ts:15  fetchJson()
       ...
       → Extract to shared module under lib/

  ━━━ WORTH NOTING (28) ━━━
  2 exact copies — fix if complex, defer per Rule of Three

  ● #1  T1/T2 Exact — validate_email()  (100%)
       services/auth/utils.py:12  →  import from services/user/validators.py:8

How It Works

Echo Guard uses a two-tier detection pipeline:

Tier 1 — AST Hash Matching (Type-1/Type-2)

Tree-sitter parses functions, normalizes identifiers, and computes structural hashes. Two functions with the same hash are exact or renamed clones. The pass is O(n) in the number of functions and, for Type-1/Type-2 clones, is reported to give 100% recall with zero false positives.
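A minimal sketch of the idea, assuming Python's built-in `ast` module as a stand-in for Tree-sitter. The name `structural_hash` is hypothetical and is not Echo Guard's API. The sketch renames the function, its parameters, and locally assigned variables to positional placeholders, then hashes the dumped tree, so renamed copies collide while structurally different code does not:

```python
import ast
import hashlib


def structural_hash(source: str) -> str:
    """Hash one function definition with its local identifiers normalized."""
    func = ast.parse(source).body[0]

    # Collect names bound inside the function: parameters and assignment targets.
    local = [n.arg for n in ast.walk(func) if isinstance(n, ast.arg)]
    local += [
        n.id
        for n in ast.walk(func)
        if isinstance(n, ast.Name) and isinstance(n.ctx, ast.Store)
    ]
    # Placeholder per name, numbered by first appearance.
    alias = {name: f"v{i}" for i, name in enumerate(dict.fromkeys(local))}

    for node in ast.walk(func):
        if isinstance(node, ast.FunctionDef):
            node.name = "f"
        elif isinstance(node, ast.arg):
            node.arg = alias[node.arg]
        elif isinstance(node, ast.Name):
            # Globals and builtins (len, sum, ...) are left untouched.
            node.id = alias.get(node.id, node.id)

    # ast.dump omits line and column info, so layout and comments do not matter.
    return hashlib.sha256(ast.dump(func).encode()).hexdigest()


original = "def total(items):\n    acc = 0\n    for it in items:\n        acc += it\n    return acc\n"
renamed = "def sum_all(xs):\n    s = 0\n    for x in xs:\n        s += x\n    return s\n"
different = "def count(xs):\n    return len(xs)\n"
```

Here `original` and `renamed` hash to the same value and `different` does not. Literals are kept as-is in this sketch, so two copies that differ only in a constant would not collide; whether the real normalizer treats literals that way is not stated above.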

Tier 2 — Code Embeddings (Type-3/Type-4)

A configurable code encoder (default: CodeSage-small, also supports CodeSage-base and UniXcoder) encodes each function into an embedding vector. Cosine similarity search finds modified clones (same structure, different statements) and semantic clones (same intent, completely different implementation). Encoding takes ~15ms per function, and a search takes ~2ms against an index of 100K functions.
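The similarity step can be illustrated with a small standard-library sketch. The vectors below are made-up three-dimensional toys rather than real CodeSage embeddings, the `0.9` threshold is an arbitrary assumption, and the brute-force pairwise loop is only for clarity; it says nothing about how Echo Guard indexes vectors internally:

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def similar_pairs(embeddings, threshold):
    """Return (name_a, name_b, score) for every pair at or above threshold."""
    names = sorted(embeddings)
    pairs = []
    for i, left in enumerate(names):
        for right in names[i + 1:]:
            score = cosine(embeddings[left], embeddings[right])
            if score >= threshold:
                pairs.append((left, right, round(score, 3)))
    # Highest-scoring candidates first.
    return sorted(pairs, key=lambda pair: -pair[2])


# Hypothetical function names with toy vectors.
toy = {
    "fetch_json": [0.9, 0.1, 0.0],
    "get_json": [0.8, 0.2, 0.1],
    "time_ago": [0.0, 0.1, 0.9],
}
candidates = similar_pairs(toy, threshold=0.9)
```

With these toy vectors only the `fetch_json` / `get_json` pair clears the threshold.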

Intent filters suppress structural false positives (CRUD boilerplate, UI wrapper patterns, observer callbacks, framework-required exports) after candidates are found.

Severity Model (DRY-based)

Severity is based on actionability, not just clone confidence:

Severity	Meaning	CI Behavior
`extract`	3+ copies, or multiple duplicates in the same file; extract to a shared module	Fails under `fail_on: extract` and `fail_on: review`
`review`	2 copies; worth noting, defer per Rule of Three	Fails only under `fail_on: review`

Report sections are grouped by action type: Extract Now (extract), Worth Noting (review), Cross-Service, and Cross-Language.
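The severity rules and the `fail_on` gate can be sketched as follows. This is an illustration built from the table above, not Echo Guard's code: `classify` and `ci_fails` are hypothetical names, and reading "multiple duplicates in the same file" as "two or more copies share a file" is an assumption.

```python
from collections import Counter

# Higher rank means more severe.
SEVERITY_RANK = {"review": 1, "extract": 2}


def classify(copy_locations):
    """copy_locations holds one file path per copy of the duplicated function."""
    per_file = Counter(copy_locations)
    if len(copy_locations) >= 3 or any(n > 1 for n in per_file.values()):
        return "extract"
    return "review"


def ci_fails(severities, fail_on):
    """fail_on is the minimum severity that fails the check; 'none' is advisory only."""
    if fail_on == "none":
        return False
    floor = SEVERITY_RANK[fail_on]
    return any(SEVERITY_RANK[s] >= floor for s in severities)
```

For example, `classify(["a.py", "b.py"])` gives `"review"`, which passes under `fail_on: extract` but fails under `fail_on: review`.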

VS Code Extension

Echo Guard ships a first-class VS Code extension that provides real-time duplicate detection directly in the editor.

Installation

Install the echo-guard Python package:
```
pip install "echo-guard[languages]"
```
Install the extension from the VS Code Marketplace (search "Echo Guard")
Open a workspace — the extension activates automatically when echo-guard.yml is present

What you get

Real-time squiggles — diagnostics update 1.5s after each file save (configurable debounce)
Code actions (Ctrl+.) — mark as intentional, dismiss, jump to duplicate, show side-by-side diff, or send to AI for refactoring
Findings tree view — sidebar panel showing redundancy clusters grouped by severity, with top refactoring targets and hotspot files
Review panel — "Echo Guard: Review All Findings" webview with severity badges, clone types, similarity scores, and inline verdicts
Cross-language CodeLens — grey annotations above functions showing matches in other languages (e.g., "↔ Python: handler() in file.py:42")
Status bar — shows daemon state (Starting/Indexing/Ready/Stopped) with finding count; click to open review panel
Branch-switch reindex — watches .git/HEAD and automatically reindexes when you switch branches
Periodic reindex — incremental reindex every 5 minutes to catch external changes

Daemon architecture

The extension spawns a long-lived Python daemon (echo-guard daemon) that communicates via JSON-RPC 2.0 over stdin/stdout. The daemon holds the function index and ONNX model in memory, keeping per-save checks under 500ms. It auto-restarts with exponential backoff (max 5 restarts) if it crashes.
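A rough sketch of the two mechanisms described above. The JSON-RPC 2.0 envelope is standard, but everything else is an assumption for illustration: the newline-delimited framing, the `check_file` method name, and the one-second base delay are not taken from Echo Guard.

```python
import json


def make_request(request_id, method, params):
    """Serialize a JSON-RPC 2.0 request as one line (framing is an assumption)."""
    message = {"jsonrpc": "2.0", "id": request_id, "method": method, "params": params}
    return json.dumps(message) + "\n"


def parse_response(line):
    """Return the result of a JSON-RPC 2.0 response, raising on an error object."""
    message = json.loads(line)
    if message.get("jsonrpc") != "2.0":
        raise ValueError("not a JSON-RPC 2.0 message")
    if "error" in message:
        raise RuntimeError(message["error"].get("message", "unknown error"))
    return message["result"]


def restart_delays(max_restarts=5, base_seconds=1.0):
    """Exponential backoff schedule: the delay doubles after each crash."""
    return [base_seconds * 2 ** attempt for attempt in range(max_restarts)]


# "check_file" is a hypothetical method name.
wire = make_request(1, "check_file", {"path": "lib/api.ts"})
```

With the assumed defaults, `restart_delays()` yields five delays of 1, 2, 4, 8, and 16 seconds, after which the daemon would stay stopped.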

AI refactoring integration

The "Send to AI" action composes a refactoring prompt with both function sources, caller information, and consolidation guidance, then sends it to the terminal (Claude Code / Codex) or copies it to the clipboard.

MCP sync

When the VS Code extension is running, the MCP server routes resolve_finding calls through the daemon — so when an AI agent marks a finding as resolved, the VS Code diagnostic clears immediately. The recheck_file MCP tool re-checks a file after an agent modifies it.

MCP Integration

Echo Guard includes a built-in MCP server so AI agents can check for duplicates before generating new functions. Supported agents:

Claude Code — auto-detected and registered via claude mcp add
Codex — auto-detected and registered via codex mcp add

The MCP server is registered automatically during echo-guard setup, or manually via echo-guard add-mcp. It provides:

Tool	Description
`check_for_duplicates`	Check code for duplicates (before/after writing)
`resolve_finding`	Record verdict: `resolved`, `intentional`, or `dismissed`
`recheck_file`	Re-check a file after it's been modified (syncs VS Code too)
`respond_to_probe`	Evaluate a low-confidence match for training data
`get_finding_resolutions`	View resolution history and stats
`search_functions`	Search index by function name, keyword, or language
`suggest_refactor`	Get consolidation suggestions for two functions
`get_index_stats`	View index statistics
`get_codebase_clusters`	Understand code grouping by dependency domain
`ping`	Health check (returns "pong")
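MCP clients invoke tools with a JSON-RPC `tools/call` request that carries the tool name and an `arguments` object. The sketch below shows the shape of such a request for two of the tools listed above. The argument keys (`code`, `language`, `finding_id`, `verdict`) are guesses for illustration; the real input schemas are whatever the server advertises.

```python
import json

# Verdicts accepted by resolve_finding, per the table above.
VERDICTS = {"resolved", "intentional", "dismissed"}


def tool_call(request_id, tool, arguments):
    """Build an MCP tools/call request envelope."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool, "arguments": arguments},
    }


def resolve_finding_call(request_id, finding_id, verdict):
    """Reject unknown verdicts before building the request (argument keys are hypothetical)."""
    if verdict not in VERDICTS:
        raise ValueError(f"verdict must be one of {sorted(VERDICTS)}")
    return tool_call(request_id, "resolve_finding", {"finding_id": finding_id, "verdict": verdict})


check = tool_call(
    1,
    "check_for_duplicates",
    {"code": "def slugify(text): ...", "language": "python"},  # hypothetical keys
)
print(json.dumps(check, indent=2))
```

In practice an agent's MCP client builds these messages itself; the sketch only makes the payload visible.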

Manual MCP registration

```
# Claude Code
claude mcp add echo-guard -- python -m echo_guard.mcp_server

# Codex
codex mcp add echo-guard -- python -m echo_guard.mcp_server
```

Supported Languages

Python, JavaScript, TypeScript, Go, Rust, Java, Ruby, C, C++

Cross-language matching is supported.

CLI Reference

Command	Description
`echo-guard setup`	Interactive setup wizard
`echo-guard scan`	Scan for redundant code
`echo-guard scan -v`	Show detailed match table
`echo-guard check FILES`	Check specific files (fast path for pre-commit)
`echo-guard review`	Interactive review of all findings
`echo-guard index`	Index codebase (incremental; `--full` for rebuild)
`echo-guard watch`	Watch files in real time
`echo-guard health`	Codebase health score (A-F grade, `--history`)
`echo-guard stats`	Index statistics and dependency graph info
`echo-guard languages`	List supported languages and file extensions
`echo-guard add-mcp`	Register MCP server (Claude/Codex)
`echo-guard add-action`	Generate GitHub Action workflow
`echo-guard install-hook`	Install pre-commit hook configuration
`echo-guard daemon`	Start JSON-RPC daemon (for VS Code extension)
`echo-guard acknowledge`	Acknowledge a single finding by ID
`echo-guard prune`	Remove stale finding suppressions
`echo-guard consent`	View or change feedback data sharing level
`echo-guard feedback-preview`	Preview exactly what data would be uploaded
`echo-guard training-data`	View/export collected training data
`echo-guard clear-index`	Clear index
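As one way to use the `check FILES` fast path, the sketch below builds an `echo-guard check` command from the files staged in Git. `echo-guard install-hook` is the supported route; this is only an illustration. The suffix list is an assumption (run `echo-guard languages` for the real one), and treating a non-zero exit code as "findings present" is also an assumption.

```python
import subprocess
from pathlib import PurePosixPath

# Assumed suffixes; `echo-guard languages` lists the real ones.
SOURCE_SUFFIXES = {".py", ".js", ".ts", ".go", ".rs", ".java", ".rb", ".c", ".cpp"}


def build_check_command(paths):
    """Return the argv for `echo-guard check`, or [] when nothing is scannable."""
    targets = [p for p in paths if PurePosixPath(p).suffix in SOURCE_SUFFIXES]
    return ["echo-guard", "check", *targets] if targets else []


def staged_files():
    """List added, copied, or modified files in the Git index."""
    result = subprocess.run(
        ["git", "diff", "--cached", "--name-only", "--diff-filter=ACM"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.splitlines()


def main():
    """Run the check on staged files and pass its exit code through."""
    command = build_check_command(staged_files())
    return subprocess.run(command).returncode if command else 0
```

Calling `main()` from a hook script and exiting with its return value would block the commit whenever the check exits non-zero.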

Configuration

Everything lives in echo-guard.yml, generated by echo-guard setup:

```
# Detection settings
min_function_lines: 3 # Skip functions shorter than this
max_function_lines: 500 # Skip functions longer than this

# Embedding model (default: codesage-small)
# model: codesage-base   # Higher Type-4 recall, ~3x slower (~341MB)
# model: unixcoder       # Legacy (768-dim, ~125MB)

# Languages to scan
languages:
  - python
  - javascript
  - typescript

# CI behavior (used by GitHub Action)
fail_on: extract # extract, review, or none

# Directories to exclude from scanning
ignore:
  - docs/
  - tests/
  - benchmarks/

# Service boundaries for monorepo-aware suggestions
# service_boundaries:
#   - services/worker
#   - services/dashboard

# Data sharing: public (code pairs), private (features only), none
feedback_consent: private

# Acknowledged findings: suppressed in CI and future scans
# Run `echo-guard review` to add entries interactively
acknowledged:
  - echo_guard/cli.py:scan||echo_guard/cli.py:check
```

What each setting does

Setting	Default	Description
`min_function_lines`	`3`	Functions shorter than this are skipped (getters, one-liners).
`max_function_lines`	`500`	Functions longer than this are skipped (generated code, data dumps).
`model`	`codesage-small`	Embedding model: `codesage-small` (default, best Type-3 recall), `codesage-base` (higher Type-4 recall, ~3x slower), `unixcoder` (768-dim, legacy), or a local path to a fine-tuned model.
`languages`	all 9	Which languages to scan. Restricting this speeds up indexing.
`fail_on`	`extract`	Minimum severity that fails the CI check. `none` = advisory only.
`ignore`	`[]`	Directories/patterns to exclude from scanning (gitignore-style).
`feedback_consent`	smart default	`public` (public repos), `private` (private repos), or `none`. Controls what feedback data is shared to improve detection.
`acknowledged`	`[]`	Finding IDs that have been reviewed and accepted. These are suppressed in CI and in `echo-guard review`.
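To make the defaults concrete, here is a small sketch of how overrides from `echo-guard.yml` could be layered over the defaults in the table, with the function-length filter applied afterwards. `resolve_settings` and `in_size_range` are hypothetical names; the sketch starts from an already-parsed dictionary and covers only settings whose defaults are stated above.

```python
# Defaults copied from the settings table.
DEFAULTS = {
    "min_function_lines": 3,
    "max_function_lines": 500,
    "model": "codesage-small",
    "fail_on": "extract",
    "ignore": [],
    "acknowledged": [],
}
VALID_FAIL_ON = {"extract", "review", "none"}


def resolve_settings(overrides):
    """Layer user overrides on top of the defaults and sanity-check them."""
    settings = {**DEFAULTS, **overrides}
    if settings["fail_on"] not in VALID_FAIL_ON:
        raise ValueError(f"fail_on must be one of {sorted(VALID_FAIL_ON)}")
    if settings["min_function_lines"] > settings["max_function_lines"]:
        raise ValueError("min_function_lines exceeds max_function_lines")
    return settings


def in_size_range(line_count, settings):
    """Functions shorter than the minimum or longer than the maximum are skipped."""
    return settings["min_function_lines"] <= line_count <= settings["max_function_lines"]


settings = resolve_settings({"fail_on": "review", "ignore": ["docs/"]})
```

Under the defaults a 2-line getter is skipped and a 3-line function is indexed.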

Local artifacts are stored in .echo-guard/ (gitignored):

```
.echo-guard/
├── index.duckdb        # Function metadata and training data
├── embeddings.npy      # Code embedding vectors
├── embedding_meta.json # Embedding store metadata
├── scan-results.txt    # Latest scan report
└── model_cache/        # Cached ONNX model (~200MB for CodeSage-small, downloaded on first use)
```

CI Integration

GitHub Action

Generated automatically by echo-guard setup, or add manually to .github/workflows/echo-guard-ci.yml:

name: Echo Guard
on: [pull_request]
permissions:
  contents: read
  pull-requests: write
jobs:
  echo-guard:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - uses: jwizenfeld04/[email protected] # Pin to your installed version
        with:
          fail-on: "extract" # Only 3+ copy DRY violations fail the check
          comment: "true"

Tip: Pin the action version to match your installed echo-guard version. Run echo-guard --version to check.

Acknowledging Findings

When Echo Guard flags intentional duplication that blocks your PR:

```
echo-guard review
```

This walks through each finding with code previews:

a = acknowledge (intentional duplication, suppress in CI)
f = false positive (not a real clone, suppress and record as training data)
s = skip (leave unresolved)

Acknowledged findings are saved to the acknowledged list in echo-guard.yml. Commit the file to suppress them in future CI runs.
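The entry shown in the configuration example (`echo_guard/cli.py:scan||echo_guard/cli.py:check`) suggests that a finding ID is two `path:function` references joined by `||`. Assuming that format, suppression could be sketched like this. The function names are hypothetical, and treating the pair as order-insensitive is an assumption:

```python
def parse_finding_id(finding_id):
    """Split 'path:func||path:func' into two (path, function) tuples."""
    left, separator, right = finding_id.partition("||")
    if not separator:
        raise ValueError(f"expected 'a||b', got {finding_id!r}")

    def split_ref(ref):
        # rpartition keeps any earlier colons inside the path.
        path, _, name = ref.rpartition(":")
        return path, name

    return split_ref(left), split_ref(right)


def unacknowledged(finding_ids, acknowledged):
    """Drop findings whose pair appears in the acknowledged list, in either order."""
    accepted = {frozenset(parse_finding_id(entry)) for entry in acknowledged}
    return [f for f in finding_ids if frozenset(parse_finding_id(f)) not in accepted]


acknowledged = ["echo_guard/cli.py:scan||echo_guard/cli.py:check"]
findings = [
    "a.py:f||b.py:g",  # hypothetical finding
    "echo_guard/cli.py:check||echo_guard/cli.py:scan",
]
remaining = unacknowledged(findings, acknowledged)
```

Only the hypothetical `a.py:f||b.py:g` finding remains after filtering.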

Privacy & Data Sharing

Echo Guard runs entirely on your machine — the embedding model, AST analysis, and all detection happen locally via ONNX Runtime. No code is sent anywhere for analysis.

Feedback collection

Echo Guard collects two kinds of anonymous data to improve detection quality. You choose your sharing level during echo-guard setup:

Scan events (public and private tiers): aggregate counts after every scan and check, covering total findings, severity breakdown, and function count. No code, paths, or names are included, only numbers.

Verdict feedback — when you review findings (mark as true positive, false positive, or ignore):

Level	What's shared	What's NOT shared
Public (default for public repos)	Scan events + anonymized code pairs + your verdict	File paths, repo name, function names
Private (default for private repos)	Scan events + structural features only: language, line counts, param counts, similarity score, verdict	Source code, file paths, function names — nothing that could identify your code
None	Nothing. All data stays local.	—

What this data is used for:

Understand detection volume and noise levels (scan events)
Calibrate per-language similarity thresholds (private tier is sufficient)
Train a false-positive classifier on real decision patterns (private tier)
Fine-tune the CodeSage embedding model on real clone pairs (requires public tier)

Transparency guarantees:

Run echo-guard feedback-preview to see exactly what would be uploaded
Run echo-guard consent to view or change your tier at any time
Uploads are logged: ↑ 3 feedback records uploaded appears after each session
Set DO_NOT_TRACK=1 or ECHO_GUARD_NO_UPLOAD=1 to disable uploads via environment
All collection code is open source in echo_guard/upload.py
Full field-level schema in docs/FEEDBACK_SCHEMA.md
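The opt-out rules above can be summarized in a short sketch. It is not the logic in echo_guard/upload.py: the function name is hypothetical, and letting the environment variables override the configured consent level is an assumption.

```python
def uploads_enabled(consent, environ):
    """Decide whether feedback may be uploaded.

    consent is the feedback_consent value: 'public', 'private', or 'none'.
    environ is a mapping such as os.environ.
    """
    # Either environment variable disables uploads regardless of consent.
    if environ.get("DO_NOT_TRACK") == "1" or environ.get("ECHO_GUARD_NO_UPLOAD") == "1":
        return False
    return consent in {"public", "private"}
```

For instance, `uploads_enabled("private", {"DO_NOT_TRACK": "1"})` is `False`, while `uploads_enabled("private", {})` is `True`.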

No cloud dependencies for core functionality: scanning, indexing, and detection never require network access. Data sharing is optional, and for private repos the default tier shares structural features only, never source code.

Roadmap

GitHub Action — PR annotations, summary comments, severity-based gating
Semantic detection — CodeSage-small embeddings for Type-3/Type-4 clone detection
Intent-aware filtering — domain-aware rules suppress CRUD boilerplate, UI wrappers, observer patterns, DRY-based severity
VS Code extension — Real-time diagnostics, findings tree, code actions, AI refactoring, daemon architecture
Consent-based feedback — Three-tier data sharing with smart defaults, automatic uploads, transparency tools
Intra-function detection — Block-level clone detection within function bodies (sliding window AST matching)
Finding history — Track finding lifecycle, stale detection, regression alerts, trend dashboard
v1.0 publishing — VS Code Marketplace + GitHub Marketplace once feature set is stable

See ROADMAP.md for the full plan with details and rationale.

Documentation

Architecture — Two-tier detection pipeline, clone types, storage, scaling
Benchmarks — BigCloneBench, GPTCloneBench, POJ-104 results
Roadmap — Development phases and planned features
Changelog

License

MIT
