AI-powered deduplication and cleanup plugin. Runs near the end of the snapshot pipeline to analyze all extractor outputs, identify duplicates and redundant files, and keep only the best version of each.
Default behavior: Finds groups of similar outputs (e.g. multiple HTML extractions), compares quality, deletes the inferior versions, and writes a detailed report explaining what was removed and why.
| Dependency | Provided by | Notes |
|---|---|---|
claude CLI |
claudecode plugin |
Must have CLAUDECODE_ENABLED=true |
ANTHROPIC_API_KEY |
Environment | Required |
Each variable falls back to the corresponding CLAUDECODE_* default if unset.
| Variable | Type | Default | Fallback | Description |
|---|---|---|---|---|
CLAUDECODECLEANUP_ENABLED |
bool | false |
— | Enable AI cleanup. |
CLAUDECODECLEANUP_PROMPT |
string | (see below) | — | The prompt defining cleanup behavior. |
CLAUDECODECLEANUP_TIMEOUT |
int | 180 |
CLAUDECODE_TIMEOUT |
Timeout in seconds. |
CLAUDECODECLEANUP_MODEL |
string | claude-sonnet-4-6 |
CLAUDECODE_MODEL |
Claude model to use. |
CLAUDECODECLEANUP_MAX_TURNS |
int | 50 |
CLAUDECODE_MAX_TURNS |
Max agentic turns per invocation. |
Default prompt:
Analyze all the extractor output directories in this snapshot. Look for duplicate or redundant outputs across plugins (e.g. multiple HTML extractions, multiple text extractions, multiple URL extraction outputs, etc.). For each group of similar outputs, inspect the content and determine which version is the best quality. Delete the inferior/redundant versions, keeping only the best one. Also remove any unnecessary temporary files, empty directories, or incomplete outputs. Write a summary of what you cleaned up to cleanup_report.txt in your output directory.
| Hook | Event | Priority | Description |
|---|---|---|---|
on_Snapshot__92_claudecodecleanup.py |
Snapshot |
92 | Runs near the end of the pipeline, after all extractors but before hashes (priority 93). |
- Full access (read, write, rename, move, delete) within the snapshot directory (
SNAP_DIR) - The agent cannot access files outside the snapshot directory
- Protected items:
hashes/directory and.jsonmetadata files are never deleted
Files are written to SNAP_DIR/claudecodecleanup/:
| File | Description |
|---|---|
cleanup_report.txt |
Required — detailed report of what was cleaned up and why (always created, even if nothing was removed) |
response.txt |
Raw text response from Claude |
session.json |
Full conversation log (JSON) |
# Enable the cleanup plugin
export CLAUDECODE_ENABLED=true
export CLAUDECODECLEANUP_ENABLED=true
export ANTHROPIC_API_KEY=sk-ant-...
# Use a more capable model for complex cleanup decisions
export CLAUDECODECLEANUP_MODEL=claude-opus-4-6
# Custom prompt example: aggressive cleanup
export CLAUDECODECLEANUP_PROMPT="Delete all extractor outputs except singlefile/ and readability/. Remove any files larger than 10MB. Write a report of what was removed."