Skip to content

jwizenfeld04/Echo-Guard

Repository files navigation

Echo-Guard Logo

Echo-Guard

Semantic linting CLI for AI-generated code redundancy

PyPI Python License CI

What is Echo-Guard?

Echo-Guard is a semantic linting CLI designed to catch the subtle, functional duplication that AI coding agents often introduce.

Unlike traditional linters that focus on syntax errors or style, Echo-Guard analyzes the logic and intent of your code. It identifies "echoes"—blocks of code that perform the same task but might look slightly different—across your entire project, regardless of the file or service they live in.

Why Echo-Guard?

AI-assisted development (Cursor, Claude Code, Copilot) is incredibly fast, but it has a "memory" problem. Agents often generate fresh code for a task that has already been solved elsewhere in your codebase.

Use Echo-Guard to:

  • Kill Hidden Redundancy: Catch duplicate business logic that "grep" or simple string matching would miss.
  • Prevent "AI Rot": Stop your codebase from bloating with slightly different versions of the same utility functions.
  • Keep Your Data Local: Built for privacy-conscious teams. Echo-Guard runs entirely on your machine—no code is ever uploaded for analysis. Optional, consent-based feedback sharing improves detection for everyone.
  • Scale Across Languages: Maintain a DRY (Don't Repeat Yourself) architecture even in polyglot repositories.

Install

pip install "echo-guard[languages,mcp]"

To upgrade:

pip install --upgrade "echo-guard[languages,mcp]"

Getting Started

echo-guard setup

The setup wizard handles everything:

  1. Directory selection — choose which directories to scan (interactive arrow-key selector)
  2. Language detection — auto-detects languages in your selected directories
  3. MCP registration — detects Claude Code and registers the MCP server automatically
  4. GitHub Action — optionally generates .github/workflows/echo-guard-ci.yml for PR checks
  5. Initial index + scan — indexes your codebase and runs the first scan
  6. Data sharing — choose your feedback consent level (defaults based on repo visibility)

One command, fully configured. The wizard generates echo-guard.yml with all settings.

Manual workflow

If you prefer to skip the wizard:

echo-guard index        # Index your codebase
echo-guard scan         # Scan for duplicates
echo-guard review       # Walk through findings interactively
echo-guard add-mcp      # Register MCP server with Claude Code
echo-guard add-action   # Generate GitHub Action for PR checks

Example Output

Echo Guard — Scan Results

  18 EXTRACT · 28 REVIEW  (892 raw pairs)

  Top refactoring targets:
    fetchJson()  —  13 copies
    timeAgo()  —  4 copies
    schemaTypes()  —  4 copies

  ━━━ EXTRACT NOW (18) ━━━
  3+ copies — real DRY violations

  ● #1  T1/T2 Exact — fetchJson() x13
       components/UserList.tsx:10  fetchJson()
       components/TeamList.tsx:8  fetchJson()
       lib/api.ts:15  fetchJson()
       ...
       → Extract to shared module under lib/

  ━━━ WORTH NOTING (28) ━━━
  2 exact copies — fix if complex, defer per Rule of Three

  ● #1  T1/T2 Exact — validate_email()  (100%)
       services/auth/utils.py:12  →  import from services/user/validators.py:8

How It Works

Echo Guard uses a two-tier detection pipeline:

Tier 1 — AST Hash Matching (Type-1/Type-2)

Tree-sitter parses functions, normalizes identifiers, and computes structural hashes. Two functions with the same hash are exact or renamed clones. O(n) — 100% recall, zero false positives.

Tier 2 — Code Embeddings (Type-3/Type-4)

A configurable code encoder (default: CodeSage-small, also supports CodeSage-base and UniXcoder) encodes each function into an embedding vector. Cosine similarity search finds modified clones (same structure, different statements) and semantic clones (same intent, completely different implementation). ~15ms per function, ~2ms search at 100K functions.

Intent filters suppress structural false positives (CRUD boilerplate, UI wrapper patterns, observer callbacks, framework-required exports) after candidates are found.

Severity Model (DRY-based)

Severity is based on actionability, not just clone confidence:

Severity Meaning CI Behavior
extract 3+ copies, or multiple duplicates in the same file — extract to shared module Fails fail_on: extract
review 2 copies — worth noting, defer per Rule of Three Fails fail_on: review

Report sections are grouped by action type: Extract Now (extract), Worth Noting (review), Cross-Service, and Cross-Language.

VS Code Extension

Echo Guard ships a first-class VS Code extension that provides real-time duplicate detection directly in the editor.

Installation

  1. Install the echo-guard Python package:
    pip install "echo-guard[languages]"
  2. Install the extension from the VS Code Marketplace (search "Echo Guard")
  3. Open a workspace — the extension activates automatically when echo-guard.yml is present

What you get

  • Real-time squiggles — diagnostics update 1.5s after each file save (configurable debounce)
  • Code actions (Ctrl+.) — mark as intentional, dismiss, jump to duplicate, show side-by-side diff, or send to AI for refactoring
  • Findings tree view — sidebar panel showing redundancy clusters grouped by severity, with top refactoring targets and hotspot files
  • Review panel — "Echo Guard: Review All Findings" webview with severity badges, clone types, similarity scores, and inline verdicts
  • Cross-language CodeLens — grey annotations above functions showing matches in other languages (e.g., "↔ Python: handler() in file.py:42")
  • Status bar — shows daemon state (Starting/Indexing/Ready/Stopped) with finding count; click to open review panel
  • Branch-switch reindex — watches .git/HEAD and automatically reindexes when you switch branches
  • Periodic reindex — incremental reindex every 5 minutes to catch external changes

Daemon architecture

The extension spawns a long-lived Python daemon (echo-guard daemon) that communicates via JSON-RPC 2.0 over stdin/stdout. The daemon holds the function index and ONNX model in memory, keeping per-save checks under 500ms. It auto-restarts with exponential backoff (max 5 restarts) if it crashes.

AI refactoring integration

The "Send to AI" action composes a refactoring prompt with both function sources, caller information, and consolidation guidance, then sends it to the terminal (Claude Code / Codex) or copies to clipboard. When the AI resolves a finding via MCP, the VS Code diagnostic clears immediately.

MCP sync

When the VS Code extension is running, the MCP server routes resolve_finding calls through the daemon — so when an AI agent marks a finding as resolved, the VS Code diagnostic clears immediately. The recheck_file MCP tool re-checks a file after an agent modifies it.


MCP Integration

Echo Guard includes a built-in MCP server so AI agents can check for duplicates before generating new functions. Supported agents:

  • Claude Code — auto-detected and registered via claude mcp add
  • Codex — auto-detected and registered via codex mcp add

The MCP server is registered automatically during echo-guard setup, or manually via echo-guard add-mcp. It provides:

Tool Description
check_for_duplicates Check code for duplicates (before/after writing)
resolve_finding Record verdict: resolved, intentional, or dismissed
recheck_file Re-check a file after it's been modified (syncs VS Code too)
respond_to_probe Evaluate a low-confidence match for training data
get_finding_resolutions View resolution history and stats
search_functions Search index by function name, keyword, or language
suggest_refactor Get consolidation suggestions for two functions
get_index_stats View index statistics
get_codebase_clusters Understand code grouping by dependency domain
ping Health check (returns "pong")
Manual MCP registration
# Claude Code
claude mcp add echo-guard -- python -m echo_guard.mcp_server

# Codex
codex mcp add echo-guard -- python -m echo_guard.mcp_server

Supported Languages

Python, JavaScript, TypeScript, Go, Rust, Java, Ruby, C, C++

Cross-language matching is supported.

CLI Reference

Command Description
echo-guard setup Interactive setup wizard
echo-guard scan Scan for redundant code
echo-guard scan -v Show detailed match table
echo-guard check FILES Check specific files (fast path for pre-commit)
echo-guard review Interactive review of all findings
echo-guard index Index codebase (incremental; --full for rebuild)
echo-guard watch Watch files in real time
echo-guard health Codebase health score (A-F grade, --history)
echo-guard stats Index statistics and dependency graph info
echo-guard languages List supported languages and file extensions
echo-guard add-mcp Register MCP server (Claude/Codex)
echo-guard add-action Generate GitHub Action workflow
echo-guard install-hook Install pre-commit hook configuration
echo-guard daemon Start JSON-RPC daemon (for VS Code extension)
echo-guard acknowledge Acknowledge a single finding by ID
echo-guard prune Remove stale finding suppressions
echo-guard consent View or change feedback data sharing level
echo-guard feedback-preview Preview exactly what data would be uploaded
echo-guard training-data View/export collected training data
echo-guard clear-index Clear index

Configuration

Everything lives in echo-guard.yml, generated by echo-guard setup:

# Detection settings
min_function_lines: 3 # Skip functions shorter than this
max_function_lines: 500 # Skip functions longer than this

# Embedding model (default: codesage-small)
# model: codesage-base   # Higher Type-4 recall, ~3x slower (~341MB)
# model: unixcoder       # Legacy (768-dim, ~125MB)

# Languages to scan
languages:
  - python
  - javascript
  - typescript

# CI behavior (used by GitHub Action)
fail_on: extract # extract, review, or none

# Directories to exclude from scanning
ignore:
  - docs/
  - tests/
  - benchmarks/

# Service boundaries for monorepo-aware suggestions
# service_boundaries:
#   - services/worker
#   - services/dashboard

# Data sharing: public (code pairs), private (features only), none
feedback_consent: private

# Acknowledged findings — suppressed in CI and future scans
# Run `echo-guard review` to add entries interactively
acknowledged:
  - echo_guard/cli.py:scan||echo_guard/cli.py:check

What each setting does

Setting Default Description
min_function_lines 3 Functions shorter than this are skipped (getters, one-liners).
max_function_lines 500 Functions longer than this are skipped (generated code, data dumps).
model codesage-small Embedding model: codesage-small (default, best Type-3 recall), codesage-base (higher Type-4 recall, ~3x slower), unixcoder (768-dim, legacy), or a local path to a fine-tuned model.
languages all 9 Which languages to scan. Restricting this speeds up indexing.
fail_on extract Minimum severity that fails the CI check. none = advisory only.
ignore [] Directories/patterns to exclude from scanning (gitignore-style).
feedback_consent smart default public (public repos), private (private repos), or none. Controls what feedback data is shared to improve detection.
acknowledged [] Finding IDs that have been reviewed and accepted. These are suppressed in CI and in echo-guard review.

Local artifacts are stored in .echo-guard/ (gitignored):

.echo-guard/
├── index.duckdb        # Function metadata and training data
├── embeddings.npy      # Code embedding vectors
├── embedding_meta.json # Embedding store metadata
├── scan-results.txt    # Latest scan report
└── model_cache/        # Cached ONNX model (~200MB for CodeSage-small, downloaded on first use)

CI Integration

GitHub Action

Generated automatically by echo-guard setup, or add manually to .github/workflows/echo-guard-ci.yml:

name: Echo Guard
on: [pull_request]
permissions:
  contents: read
  pull-requests: write
jobs:
  echo-guard:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - uses: jwizenfeld04/[email protected] # Pin to your installed version
        with:
          fail-on: "extract" # Only 3+ copy DRY violations fail the check
          comment: "true"

Tip: Pin the action version to match your installed echo-guard version. Run echo-guard --version to check.

Acknowledging Findings

When Echo Guard flags intentional duplication that blocks your PR:

echo-guard review

This walks through each finding with code previews:

  • a = acknowledge (intentional duplication, suppress in CI)
  • f = false positive (not a real clone, suppress and record as training data)
  • s = skip (leave unresolved)

Acknowledged findings are saved to the acknowledged list in echo-guard.yml. Commit the file to suppress them in future CI runs.

Privacy & Data Sharing

Echo Guard runs entirely on your machine — the embedding model, AST analysis, and all detection happen locally via ONNX Runtime. No code is sent anywhere for analysis.

Feedback collection

Echo Guard collects two kinds of anonymous data to improve detection quality. You choose your sharing level during echo-guard setup:

Scan events (both tiers) — aggregate counts after every scan and check: total findings, severity breakdown, function count. No code, paths, or names — just numbers.

Verdict feedback — when you review findings (mark as true positive, false positive, or ignore):

Level What's shared What's NOT shared
Public (default for public repos) Scan events + anonymized code pairs + your verdict File paths, repo name, function names
Private (default for private repos) Scan events + structural features only: language, line counts, param counts, similarity score, verdict Source code, file paths, function names — nothing that could identify your code
None Nothing. All data stays local.

What this data is used for:

  • Understand detection volume and noise levels (scan events)
  • Calibrate per-language similarity thresholds (private tier is sufficient)
  • Train a false-positive classifier on real decision patterns (private tier)
  • Fine-tune the CodeSage embedding model on real clone pairs (requires public tier)

Transparency guarantees:

  • Run echo-guard feedback-preview to see exactly what would be uploaded
  • Run echo-guard consent to view or change your tier at any time
  • Uploads are logged: ↑ 3 feedback records uploaded appears after each session
  • Set DO_NOT_TRACK=1 or ECHO_GUARD_NO_UPLOAD=1 to disable uploads via environment
  • All collection code is open source in echo_guard/upload.py
  • Full field-level schema in docs/FEEDBACK_SCHEMA.md

No cloud dependencies for core functionality — scanning, indexing, and detection never require network access. Data sharing is optional and off by default for private repos.

Roadmap

  • GitHub Action — PR annotations, summary comments, severity-based gating
  • Semantic detection — CodeSage-small embeddings for Type-3/Type-4 clone detection
  • Intent-aware filtering — domain-aware rules suppress CRUD boilerplate, UI wrappers, observer patterns, DRY-based severity
  • VS Code extension — Real-time diagnostics, findings tree, code actions, AI refactoring, daemon architecture
  • Consent-based feedback — Three-tier data sharing with smart defaults, automatic uploads, transparency tools
  • Intra-function detection — Block-level clone detection within function bodies (sliding window AST matching)
  • Finding history — Track finding lifecycle, stale detection, regression alerts, trend dashboard
  • v1.0 publishing — VS Code Marketplace + GitHub Marketplace once feature set is stable

See ROADMAP.md for the full plan with details and rationale.

Documentation

  • Architecture — Two-tier detection pipeline, clone types, storage, scaling
  • Benchmarks — BigCloneBench, GPTCloneBench, POJ-104 results
  • Roadmap — Development phases and planned features
  • Changelog

License

MIT

About

Semantic linting CLI that detects codebase redundancy created by AI coding agents.

Topics

Resources

License

Contributing

Stars

Watchers

Forks

Packages

 
 
 

Contributors