CloneGuard is a Python command-line tool that inspects a public GitHub repository for suspicious or high-risk patterns using the GitHub API, without cloning files locally.
It is designed to help with fast risk triage, not to prove a repository is safe.
CloneGuard does not guarantee safety. It only flags suspicious behavior based on rule-based heuristics and optional AI summarization.
- Prompt-based CLI workflow for repository checks
- GitHub URL validation
- GitHub API token-based repository inspection
- Selectable scan mode: `quick` (faster, reduced coverage) or `deep` (full coverage)
- Live progress output with percentages during fetch and scan phases
- Recursive rule-based scanner with modular pattern definitions
- Detection of common suspicious patterns, including:
  - `curl | bash`-style execution
  - `wget` download-and-run flows
  - Encoded PowerShell execution
  - `eval`/`exec` with encoded payloads
  - Base64 decode-then-execute patterns
  - Risky install hooks (`postinstall`, `install`)
  - Potential hardcoded secrets
  - Suspicious subprocess shell usage
  - Potential obfuscation/exfiltration indicators
  - Suspicious filenames
- Risk score + risk level (`LOW`, `MEDIUM`, `HIGH`)
- Condensed findings output grouped by file and rule to reduce duplicate noise
- Clean terminal output with colored status messages
- Optional AI summary when `GEMINI_API_KEY` is set
- Optional terminal risk chat with Gemini
- Optional ElevenLabs cloned-voice text-to-speech for chat responses
- Graceful error handling for invalid URLs, missing API keys, API failures, and scan issues
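A few of the rule-based patterns above can be sketched as compiled regexes. The names, severities, and reasons below are a hypothetical subset for illustration; the real definitions live in `cloneguard/patterns.py`:

```python
import re

# Hypothetical subset of rule definitions; the actual rules in
# cloneguard/patterns.py are more extensive and tuned.
SUSPICIOUS_PATTERNS = {
    "curl_pipe_bash": (
        re.compile(r"curl\s+[^\n|]*\|\s*(ba)?sh"),
        "HIGH",
        "Piping a remote script straight into a shell hides its contents.",
    ),
    "base64_decode_exec": (
        re.compile(r"base64\s+(-d|--decode)[^\n]*\|\s*(ba)?sh"),
        "HIGH",
        "Decoding base64 and executing the result is a common obfuscation trick.",
    ),
    "encoded_powershell": (
        re.compile(r"powershell[^\n]*-enc(odedcommand)?\s", re.IGNORECASE),
        "HIGH",
        "Encoded PowerShell commands can hide malicious behavior.",
    ),
}

def scan_text(text: str):
    """Return (rule_name, severity) for every rule that matches the text."""
    return [
        (name, severity)
        for name, (pattern, severity, _reason) in SUSPICIOUS_PATTERNS.items()
        if pattern.search(text)
    ]
```

Keeping each rule as a (pattern, severity, reason) tuple is what makes the scanner modular: adding a detection is a one-entry change.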
- `cloneguard/main.py` - CLI entrypoint and workflow orchestration
- `cloneguard/scanner.py` - Recursive file scanning and finding generation
- `cloneguard/patterns.py` - Regex rules and suspicious filename definitions
- `cloneguard/repo_utils.py` - GitHub API and URL utilities
- `cloneguard/risk.py` - Risk scoring model
- `cloneguard/formatter.py` - Output formatting helpers
- `cloneguard/ai_summary.py` - Optional AI-based summary with local fallback
- `cloneguard/risk_chat.py` - Optional terminal risk chat (Gemini + ElevenLabs TTS)
- `requirements.txt` - Dependencies
- `.gitignore` - Ignored local/generated files
- `.env.example` - API key template
- Ensure Python 3.10+ is installed.
- Create and activate a virtual environment:

```bash
python3 -m venv .venv
source .venv/bin/activate
```

- Install dependencies:

```bash
pip install -r requirements.txt
```

- Optional AI summary setup:

```bash
cp .env.example .env
# Then edit .env and add your key
```

If your shell does not auto-load `.env`, export the keys manually:

```bash
export GEMINI_API_KEY="your_key_here"
export GITHUB_API_KEY="your_github_token_here"
export ELEVENLABS_API_KEY="your_elevenlabs_key_here"
export ELEVENLABS_VOICE_ID="your_voice_id_here"
# Optional:
export ELEVENLABS_MODEL_ID="eleven_multilingual_v2"
export CLONEGUARD_CHAT_LANGUAGE="English"
```

Run from the project root:

```bash
python -m cloneguard.main
```

Or pass the mode directly:

```bash
python -m cloneguard.main --mode quick
python -m cloneguard.main --mode deep
```

Then:
- Paste a public GitHub repository URL.
- Choose scan mode (`quick` or `deep`) if not passed via CLI.
- CloneGuard uses GitHub API calls to read repository files in memory and scan them.
- It shows progress percentages while fetching/scanning.
- It prints findings, risk score, and risk level.
- It saves a markdown report in `output/`.
- It automatically opens the generated markdown file.
- Optional terminal risk chat lets you type questions.
- If chat is enabled, choose a mode: `text` only or `voice` (text + ElevenLabs playback).
- It asks whether to clone the repository.
- If yes, it prompts for destination folder and validates:
- The path exists
- The path is a directory
- The target repo folder does not already exist there
- It exits and clears only `output/audio/` (whether you clone or not).
After scan/report output, CloneGuard can start terminal risk chat:
- Uses Gemini to answer questions about the scanned repository risks.
- You type questions directly in terminal.
- Uses ElevenLabs TTS to speak responses using your cloned voice.
- Saves generated voice audio files in `output/audio/`.
- Before playback, you can choose per response:
  - normal speed
  - `2x` speed
- While audio is playing, press `s` + Enter to skip immediately.
To use voice output, set both:

- `ELEVENLABS_API_KEY`
- `ELEVENLABS_VOICE_ID`
- Optional response language (e.g. `English`): `CLONEGUARD_CHAT_LANGUAGE`
If ElevenLabs vars are missing, chat still works in text-only mode.
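That fallback can be sketched as a simple environment check (hypothetical helper name; the actual logic lives in `cloneguard/risk_chat.py`):

```python
import os

def voice_enabled(env=None) -> bool:
    """True only when both ElevenLabs variables are set and non-empty;
    otherwise the chat should fall back to text-only mode."""
    env = os.environ if env is None else env
    return bool(env.get("ELEVENLABS_API_KEY")) and bool(env.get("ELEVENLABS_VOICE_ID"))
```

Passing the environment in as a parameter keeps the check trivially testable without touching real keys.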
When suspicious patterns are detected:

```text
Scanning repository...
Warning: Suspicious patterns detected
Risk Score: 78
Risk Level: HIGH
Findings:
1. Encoded PowerShell execution found in scripts/setup.ps1
   Reason: Encoded PowerShell commands can hide malicious behavior.
Do you want to clone this repository anyway? (yes/no)
```
When no major issues are detected:

```text
Scanning repository...
No major suspicious patterns detected.
Risk Score: 8
Risk Level: LOW
Do you want to clone this repository? (yes/no)
```
Each grouped finding contributes points by severity:

- `LOW` = 3 points
- `MEDIUM` = 8 points
- `HIGH` = 18 points
Context weighting is applied before scoring:

- `DOC` files (README/docs): lower weight
- `CI` files (`.github/workflows`): medium weight
- `RUNTIME` scripts/code: full weight
Repeated matches are capped so one repeated pattern does not dominate the score.
Risk level thresholds:

- `LOW`: score < 20
- `MEDIUM`: score 20-49
- `HIGH`: score >= 50, or multiple high-severity findings
The model is intentionally simple and interpretable so new rules can be added easily.
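Under those rules, the score computation can be sketched as below. The severity points and thresholds come from the text above; the `DOC`/`CI` weights and the repeat cap of 3 are illustrative assumptions, and the real model lives in `cloneguard/risk.py`:

```python
SEVERITY_POINTS = {"LOW": 3, "MEDIUM": 8, "HIGH": 18}
# DOC and CI weights are illustrative; only "full weight" for RUNTIME
# is stated in the text.
CONTEXT_WEIGHT = {"DOC": 0.3, "CI": 0.6, "RUNTIME": 1.0}

def risk_score(findings, max_repeats=3):
    """findings: list of (severity, context) tuples, one per grouped finding.
    Identical findings beyond max_repeats are capped so one repeated
    pattern does not dominate the score."""
    score = 0.0
    seen = {}
    for severity, context in findings:
        key = (severity, context)
        seen[key] = seen.get(key, 0) + 1
        if seen[key] > max_repeats:
            continue  # cap repeated matches
        score += SEVERITY_POINTS[severity] * CONTEXT_WEIGHT[context]
    return round(score)

def risk_level(score, high_findings=0):
    """Map a score (and a count of high-severity findings) to a level."""
    if score >= 50 or high_findings > 1:
        return "HIGH"
    if score >= 20:
        return "MEDIUM"
    return "LOW"
```

For example, one `HIGH` runtime finding plus one `MEDIUM` CI finding would score 18 + 8 × 0.6 ≈ 23, landing in `MEDIUM` under these assumed weights.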
- Never hardcode API keys in source code.
- Use `GEMINI_API_KEY` from environment variables.
- Use `GITHUB_API_KEY` or `GITHUB_TOKEN` from environment variables.
- Use `ELEVENLABS_API_KEY` and `ELEVENLABS_VOICE_ID` for voice chat.
- Keep `.env` local and uncommitted; `.env` is already listed in `.gitignore`.
- Do not print keys in logs or terminal output.
- Rule-based scanning can miss novel or heavily obfuscated threats.
- Regex pattern matching may produce false positives.
- Some behavior only appears at runtime and cannot be identified statically.
- Large/binary files are skipped for safety and performance.
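The large/binary-file skip can be sketched as a helper like this. The size limit and null-byte heuristic are illustrative assumptions, not CloneGuard's exact thresholds:

```python
def should_skip(data: bytes, max_bytes: int = 1_000_000) -> bool:
    """Skip files that are too large, or that look binary because a
    null byte appears in the first 1 KiB (a common text/binary heuristic)."""
    return len(data) > max_bytes or b"\x00" in data[:1024]
```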
- Add CLI flags (`--url`, `--json`, `--strict`)
- Add per-language rule packs
- Add allowlist/ignore configuration
- Add unit tests and fixture repositories
- Add SARIF/JSON report output for CI integration