vlomeli/CloneGuard

CloneGuard

CloneGuard is a Python command-line tool that inspects a public GitHub repository for suspicious or high-risk patterns using the GitHub API, without cloning files locally.

It is designed to help with fast risk triage, not to prove a repository is safe.

Important Disclaimer

CloneGuard does not guarantee safety. It only flags suspicious behavior based on rule-based heuristics and optional AI summarization.

Features

  • Prompt-based CLI workflow for repository checks
  • GitHub URL validation
  • GitHub API token-based repository inspection
  • Selectable scan mode: quick (faster, reduced coverage) or deep (full coverage)
  • Live progress output with percentages during fetch and scan phases
  • Recursive rule-based scanner with modular pattern definitions
  • Detection of common suspicious patterns, including:
    • curl | bash style execution
    • wget download-and-run flows
    • Encoded PowerShell execution
    • eval / exec with encoded payloads
    • Base64 decode then execute patterns
    • Risky install hooks (postinstall, install)
    • Potential hardcoded secrets
    • Suspicious subprocess shell usage
    • Potential obfuscation/exfiltration indicators
    • Suspicious filenames
  • Risk score + risk level (LOW, MEDIUM, HIGH)
  • Condensed findings output grouped by file and rule to reduce duplicate noise
  • Clean terminal output with colored status messages
  • Optional AI summary when GEMINI_API_KEY is set
  • Optional terminal risk chat with Gemini
  • Optional ElevenLabs cloned-voice text-to-speech for chat responses
  • Graceful error handling for invalid URLs, missing API keys, API failures, and scan issues

Project Structure

  • cloneguard/main.py - CLI entrypoint and workflow orchestration
  • cloneguard/scanner.py - Recursive file scanning and finding generation
  • cloneguard/patterns.py - Regex rules and suspicious filename definitions
  • cloneguard/repo_utils.py - GitHub API and URL utilities
  • cloneguard/risk.py - Risk scoring model
  • cloneguard/formatter.py - Output formatting helpers
  • cloneguard/ai_summary.py - Optional AI-based summary with local fallback
  • cloneguard/risk_chat.py - Optional terminal risk chat (Gemini + ElevenLabs TTS)
  • requirements.txt - Dependencies
  • .gitignore - Ignored local/generated files
  • .env.example - API key template
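
To make the split between rule definitions and the scanner concrete, the sketch below shows one way modular regex rules could be declared and applied line by line. The `Rule` class, the two regexes, the severities, and `scan_text` are all hypothetical; only the rule themes (curl piped to a shell, encoded PowerShell) and the quoted reason text come from this README.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    name: str
    severity: str  # "LOW", "MEDIUM", or "HIGH"
    pattern: re.Pattern[str]
    reason: str


# Illustrative rules only; the real patterns may be broader or stricter.
RULES = [
    Rule(
        name="curl-pipe-shell",
        severity="HIGH",
        pattern=re.compile(r"\bcurl\b[^\n|]*\|\s*(?:sudo\s+)?(?:ba|z)?sh\b"),
        reason="Piping a download straight into a shell runs unreviewed code.",
    ),
    Rule(
        name="encoded-powershell",
        severity="HIGH",
        pattern=re.compile(
            r"powershell(?:\.exe)?\b.*?\s-e(?:nc|ncodedcommand)?\s",
            re.IGNORECASE,
        ),
        reason="Encoded PowerShell commands can hide malicious behavior.",
    ),
]


def scan_text(path: str, text: str) -> list[dict]:
    """Return one finding per (line, rule) match in the given file content."""
    findings = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for rule in RULES:
            if rule.pattern.search(line):
                findings.append(
                    {
                        "file": path,
                        "line": lineno,
                        "rule": rule.name,
                        "severity": rule.severity,
                        "reason": rule.reason,
                    }
                )
    return findings
```

Keeping rules as plain data means a new pattern is a new list entry, with no change to the scanning loop.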

Installation

  1. Ensure Python 3.10+ is installed.
  2. Create and activate a virtual environment:
python3 -m venv .venv
source .venv/bin/activate
  3. Install dependencies:
pip install -r requirements.txt
  4. Optional AI summary setup:
cp .env.example .env
# Then edit .env and add your key

If your shell does not auto-load .env, export the variables manually:

export GEMINI_API_KEY="your_key_here"
export GITHUB_API_KEY="your_github_token_here"
export ELEVENLABS_API_KEY="your_elevenlabs_key_here"
export ELEVENLABS_VOICE_ID="your_voice_id_here"
# Optional:
export ELEVENLABS_MODEL_ID="eleven_multilingual_v2"
export CLONEGUARD_CHAT_LANGUAGE="English"

Usage

Run from the project root:

python -m cloneguard.main

Or pass mode directly:

python -m cloneguard.main --mode quick
python -m cloneguard.main --mode deep
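
A minimal sketch of how a `--mode` flag with an interactive fallback could be wired up using `argparse`. The function names and prompt text are assumptions; only the flag name and its two values come from the commands above.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="cloneguard")
    parser.add_argument(
        "--mode",
        choices=("quick", "deep"),
        default=None,
        help="quick: faster, reduced coverage; deep: full coverage",
    )
    return parser


def resolve_mode(argv: list[str], ask=input) -> str:
    """Use --mode when given; otherwise prompt until a valid mode is typed."""
    args = build_parser().parse_args(argv)
    if args.mode is not None:
        return args.mode
    while True:
        answer = ask("Scan mode (quick/deep): ").strip().lower()
        if answer in ("quick", "deep"):
            return answer
```

Passing `ask` as a parameter keeps the prompt testable without patching `input`.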

Then:

  1. Paste a public GitHub repository URL.
  2. Choose scan mode (quick or deep) if not passed via CLI.
  3. CloneGuard uses GitHub API calls to read repository files in memory and scan them.
  4. It shows progress percentages while fetching/scanning.
  5. It prints findings, risk score, and risk level.
  6. It saves a markdown report in output/.
  7. It automatically opens the generated markdown file.
  8. Optional terminal risk chat lets you type questions.
  9. If chat is enabled, choose mode: text only or voice (text + ElevenLabs playback).
  10. It asks whether to clone the repository.
  11. If yes, it prompts for destination folder and validates:
    • The path exists
    • The path is a directory
    • The target repo folder does not already exist there
  12. It exits and clears only output/audio/ (whether you clone or not).
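
The three destination checks in the clone step can be sketched with `pathlib`. The function name, return shape, and messages are hypothetical; the checks themselves mirror the list above.

```python
from pathlib import Path


def validate_clone_destination(dest: str, repo_name: str) -> tuple[bool, str]:
    """Return (True, target_path) when cloning is safe, else (False, reason)."""
    base = Path(dest).expanduser()
    if not base.exists():
        return False, f"Path does not exist: {base}"
    if not base.is_dir():
        return False, f"Path is not a directory: {base}"
    target = base / repo_name
    if target.exists():
        return False, f"Target folder already exists: {target}"
    return True, str(target)
```

Checking before invoking `git clone` gives a clear message up front instead of a git error halfway through.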

Risk Chat + Clone Voice

After scan/report output, CloneGuard can start terminal risk chat:

  • Uses Gemini to answer questions about the scanned repository risks.
  • You type questions directly in terminal.
  • Uses ElevenLabs TTS to speak responses using your cloned voice.
  • Saves generated voice audio files in output/audio/.
  • Before playback, you can choose per response:
    • normal speed
    • 2x speed
  • While audio is playing, press s + Enter to skip immediately.

To use voice output, set both:

  • ELEVENLABS_API_KEY
  • ELEVENLABS_VOICE_ID

Optionally, set CLONEGUARD_CHAT_LANGUAGE to choose the response language (e.g. English).

If ElevenLabs vars are missing, chat still works in text-only mode.
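
The text-only fallback could look roughly like this. `resolve_chat_mode` is a hypothetical helper written for this example; the environment variable names are the ones listed above.

```python
import os
from collections.abc import Mapping


def resolve_chat_mode(requested: str, env: Mapping[str, str] = os.environ) -> str:
    """Return "voice" only if it was requested and both ElevenLabs vars are set."""
    has_voice = bool(env.get("ELEVENLABS_API_KEY")) and bool(
        env.get("ELEVENLABS_VOICE_ID")
    )
    if requested == "voice" and has_voice:
        return "voice"
    return "text"  # fall back to text-only chat
```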

Example Output

Scanning repository...
Warning: Suspicious patterns detected

Risk Score: 78
Risk Level: HIGH

Findings:
1. Encoded PowerShell execution found in scripts/setup.ps1
   Reason: Encoded PowerShell commands can hide malicious behavior.
Do you want to clone this repository anyway? (yes/no)

When no major issues are detected:

Scanning repository...
No major suspicious patterns detected.
Risk Score: 8
Risk Level: LOW
Do you want to clone this repository? (yes/no)

Risk Scoring Model

Each grouped finding contributes points by severity:

  • LOW = 3 points
  • MEDIUM = 8 points
  • HIGH = 18 points

Context weighting is applied before scoring:

  • DOC files (README/docs): lower weight
  • CI files (.github/workflows): medium weight
  • RUNTIME scripts/code: full weight

Repeated matches are capped so one repeated pattern does not dominate the score.

Risk level thresholds:

  • LOW: score < 20
  • MEDIUM: score 20-49
  • HIGH: score >= 50, or multiple high-severity findings

The model is intentionally simple and interpretable so new rules can be added easily.
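
The scoring rules above can be sketched as follows. The severity points and level thresholds are taken from this section. The numeric context weights, the repeat cap of 3, the choice to multiply by the capped repeat count, and reading "multiple" as two or more HIGH findings are all assumptions made for illustration, since the README does not state them.

```python
from dataclasses import dataclass

SEVERITY_POINTS = {"LOW": 3, "MEDIUM": 8, "HIGH": 18}  # from this README

# Assumed values: the README only says "lower", "medium", and "full" weight.
CONTEXT_WEIGHT = {"DOC": 0.25, "CI": 0.5, "RUNTIME": 1.0}
MAX_REPEATS = 3  # assumed cap on repeated matches per grouped finding


@dataclass
class GroupedFinding:
    severity: str  # "LOW", "MEDIUM", or "HIGH"
    context: str   # "DOC", "CI", or "RUNTIME"
    count: int = 1  # matches of the same rule in the same file


def score_findings(findings: list[GroupedFinding]) -> tuple[int, str]:
    """Return (risk score, risk level) for a list of grouped findings."""
    total = 0.0
    for f in findings:
        repeats = min(f.count, MAX_REPEATS)  # cap so one pattern cannot dominate
        total += SEVERITY_POINTS[f.severity] * CONTEXT_WEIGHT[f.context] * repeats
    score = round(total)
    high_count = sum(1 for f in findings if f.severity == "HIGH")
    if score >= 50 or high_count >= 2:
        level = "HIGH"
    elif score >= 20:
        level = "MEDIUM"
    else:
        level = "LOW"
    return score, level
```

Under these assumptions, a single MEDIUM finding repeated ten times in a runtime script scores 8 × 1.0 × 3 = 24, which lands in the MEDIUM band rather than growing without bound.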

API Key and Secret Hygiene

  • Never hardcode API keys in source code.
  • Use GEMINI_API_KEY from environment variables.
  • Use GITHUB_API_KEY or GITHUB_TOKEN from environment variables.
  • Use ELEVENLABS_API_KEY and ELEVENLABS_VOICE_ID for voice chat.
  • Keep .env local and uncommitted.
  • .env is already listed in .gitignore.
  • Do not print keys in logs or terminal output.

Limitations

  • Rule-based scanning can miss novel or heavily obfuscated threats.
  • Regex pattern matching may produce false positives.
  • Some behavior only appears at runtime and cannot be identified statically.
  • Large/binary files are skipped for safety and performance.
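
To illustrate the last limitation, here is one way large and binary files might be filtered out before scanning. It assumes file entries shaped like the `tree` array of GitHub's "Get a tree" REST response (`path`, `type`, `size`); whether CloneGuard uses that endpoint, and the size limit and suffix list below, are assumptions.

```python
MAX_FILE_BYTES = 512 * 1024  # assumed size limit
BINARY_SUFFIXES = (".png", ".jpg", ".gif", ".zip", ".exe", ".pdf", ".so")  # assumed


def select_scannable(tree: list[dict]) -> list[str]:
    """Pick blob paths that are small enough and not obviously binary."""
    selected = []
    for entry in tree:
        if entry.get("type") != "blob":
            continue  # skip directories ("tree") and submodules ("commit")
        if entry.get("size", 0) > MAX_FILE_BYTES:
            continue  # skip large files
        if entry["path"].lower().endswith(BINARY_SUFFIXES):
            continue  # skip by extension before downloading anything
        selected.append(entry["path"])
    return selected


def looks_binary(data: bytes) -> bool:
    """Heuristic: a NUL byte in the first 1 KiB suggests binary content."""
    return b"\x00" in data[:1024]
```

Filtering on tree metadata avoids fetching content that would be skipped anyway; `looks_binary` is a second check for files whose extension gives no hint.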

Future Improvements

  • Add more CLI flags (--url, --json, --strict)
  • Add per-language rule packs
  • Add allowlist/ignore configuration
  • Add unit tests and fixture repositories
  • Add SARIF/JSON report output for CI integration
