
THE REFINERY

Video deduplication and organization for large local libraries. Safe by default, strict on visual matches, simple to run.

THE REFINERY scans a video folder, removes exact duplicates, isolates risky visual duplicates, quarantines broken files, and organizes surviving files by resolution.

The current codebase uses a hexagonal split, but the real priority is behavioral correctness:

  • abort cleanly when pre-flight checks fail
  • avoid false-positive visual matches
  • keep resume state accurate in SQLite
  • preserve practical throughput on large libraries

What It Does

The pipeline runs in six phases:

  • Phase 0: pre-flight checks and database sync
  • Phase 1: indexing
  • Phase 2: exact duplicate detection with SHA-256
  • Phase 3: metadata extraction and perceptual hashing
  • Phase 4: visual duplicate matching
  • Phase 5: organization of surviving files

What happens to files:

  • exact duplicates are deleted in live mode
  • visual duplicates are moved to __TRASH_BIN__/
  • broken files are moved to __QUARANTINE__/
  • surviving files are moved into Organized/

Current Scope

What this project is today:

  • local CLI tool
  • Python project with dependencies
  • FFmpeg/FFprobe-based media inspection
  • SQLite-backed resumable workflow

What it is not today:

  • zero-dependency
  • packaged for one-command installation
  • backed by tracked media-fixture integration tests in the repository
  • designed for remote storage, cloud buckets, or distributed workers

Project Structure

PROJECT_REFINERY/
├── src/
│   ├── domain/
│   │   ├── model.py          # Enums, config, dataclasses
│   │   ├── ports.py          # Domain interfaces
│   │   └── service.py        # Phase orchestration and matching rules
│   ├── infrastructure/
│   │   ├── analyzer.py       # FFmpeg/Pillow/imagehash adapter
│   │   ├── database.py       # SQLite repository adapter
│   │   └── file_manager.py   # Filesystem adapter
│   └── entrypoints/
│       └── cli.py            # Active CLI entrypoint
├── archive/
│   └── the_refinery.py       # Archived monolithic reference
├── memory-bank/              # Project memory/context docs
├── requirements.txt          # Python dependencies
├── README.md
└── .gitignore

Directory meaning:

  • src/domain/ holds the real rules.
  • src/infrastructure/ talks to SQLite, FFmpeg, Pillow, imagehash, and the filesystem.
  • src/entrypoints/cli.py wires everything together.
  • archive/the_refinery.py is not the active runtime path; it exists as the old reference implementation.

Requirements

Required before running:

  • Python 3.11+
  • ffmpeg on PATH
  • ffprobe on PATH
  • Python packages from requirements.txt

requirements.txt currently contains:

  • Pillow>=10.0.0
  • imagehash>=4.3.0
  • rich>=13.0.0

Installation

  1. Create a virtual environment.

python -m venv .venv

  2. Activate it.

Linux/macOS:

source .venv/bin/activate

Windows PowerShell:

.venv\Scripts\Activate.ps1

  3. Install Python dependencies.

pip install -r requirements.txt

  4. Install FFmpeg if needed.

Linux:

sudo apt install ffmpeg

Windows:

Download a release build from ffmpeg.org (or install one via a package manager such as winget or Chocolatey) and make sure its bin directory is on PATH.

Usage

Dry run:

python -m src.entrypoints.cli --path "/path/to/videos"

Live mode:

python -m src.entrypoints.cli --path "/path/to/videos" --no-dry-run

Custom database path:

python -m src.entrypoints.cli --path "/path/to/videos" --db "/path/to/refinery.db"

Custom log file:

python -m src.entrypoints.cli --path "/path/to/videos" --log refinery.log

Custom pHash threshold:

python -m src.entrypoints.cli --path "/path/to/videos" --threshold 10

CLI Options

Supported CLI flags:

  • --path, -p: source video directory, required
  • --db, -d: custom SQLite database path
  • --log, -l: log file path
  • --no-dry-run: enable live mode
  • --threshold: pHash average-distance threshold, default 10

Exit behavior:

  • exit 0: successful run
  • exit 1: invalid config, invalid path, missing dependency, failed pre-check, or fatal runtime error
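
The documented flags map naturally onto argparse. The sketch below is illustrative wiring only, not the actual contents of src/entrypoints/cli.py; the real parser may differ in help text and defaults beyond what the README states.

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    # Hypothetical parser mirroring the documented flags.
    parser = argparse.ArgumentParser(prog="refinery")
    parser.add_argument("--path", "-p", required=True,
                        help="source video directory")
    parser.add_argument("--db", "-d", help="custom SQLite database path")
    parser.add_argument("--log", "-l", help="log file path")
    parser.add_argument("--no-dry-run", action="store_true",
                        help="enable live mode")
    parser.add_argument("--threshold", type=int, default=10,
                        help="pHash average-distance threshold")
    return parser

args = build_parser().parse_args(["--path", "/videos", "--threshold", "8"])
print(args.threshold, args.no_dry_run)  # → 8 False
```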

Supported Input

Currently recognized video extensions:

  • .mp4
  • .mkv
  • .avi
  • .mov
  • .wmv
  • .flv
  • .webm
  • .m4v
  • .mpeg
  • .mpg
  • .3gp

The scanner skips these generated directories automatically:

  • __TRASH_BIN__/
  • __QUARANTINE__/
  • Organized/
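
The two rules above (extension allowlist, generated-directory skip) can be expressed as a single predicate. This is a sketch, not the project's actual scanner; the extension and directory sets are copied from the lists above.

```python
from pathlib import Path

VIDEO_EXTENSIONS = {".mp4", ".mkv", ".avi", ".mov", ".wmv", ".flv",
                    ".webm", ".m4v", ".mpeg", ".mpg", ".3gp"}
SKIPPED_DIRS = {"__TRASH_BIN__", "__QUARANTINE__", "Organized"}

def is_candidate(path: Path) -> bool:
    # Skip anything inside a generated output directory,
    # then filter by (case-insensitive) extension.
    if any(part in SKIPPED_DIRS for part in path.parts):
        return False
    return path.suffix.lower() in VIDEO_EXTENSIONS

print(is_candidate(Path("Videos/clip.MP4")))                  # → True
print(is_candidate(Path("Videos/__TRASH_BIN__/dupe.mkv")))    # → False
```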

Output Layout

The tool writes these outputs inside the target library (only the database location changes when a custom --db path is supplied):

  • refinery.db: SQLite state database
  • __TRASH_BIN__/: visual duplicates moved aside for review
  • __QUARANTINE__/: broken or unreadable files
  • Organized/: final organized survivors

Organization buckets:

  • HD_1080p+
  • HD_720p
  • SD_480p
  • Low_Quality
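
A bucket mapping like the one above usually keys off frame height. The cutoffs in this sketch are an assumption; the real mapping lives in the domain layer and may use different boundaries.

```python
def quality_bucket(height: int) -> str:
    # Illustrative cutoffs only; bucket names come from the README.
    if height >= 1080:
        return "HD_1080p+"
    if height >= 720:
        return "HD_720p"
    if height >= 480:
        return "SD_480p"
    return "Low_Quality"

print(quality_bucket(2160), quality_bucket(720), quality_bucket(360))
```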

Example before:

Videos/
├── clip_a.mp4
├── clip_a_copy.mp4
├── clip_a_720p.mkv
├── broken_file.mp4
└── subdir/
    └── movie_part1.mp4

Example after a live run:

Videos/
├── __TRASH_BIN__/
│   └── clip_a_720p.mkv
├── __QUARANTINE__/
│   └── broken_file.mp4
├── Organized/
│   └── HD_1080p+/
│       ├── clip_a.mp4
│       └── movie_part1.mp4
└── refinery.db

Decision Rules

Exact Duplicates

Phase 2:

  • each file gets a SHA-256 hash
  • files with the same SHA-256 are exact duplicates
  • one file is kept, the others are deleted in live mode

Filename keeper heuristic:

  • the tool prefers the most human-readable filename
  • filenames with useful words, dates, or quality markers score higher
  • filenames that look like copies or numbered duplicates score lower
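
One way to score "human-readable" names is a few regex checks. Everything in this sketch (the patterns, the weights) is a hypothetical illustration of the idea, not the project's actual heuristic.

```python
import re

def name_score(filename: str) -> int:
    # Illustrative heuristic only; the real rules live in the domain layer.
    score = 0
    stem = filename.rsplit(".", 1)[0]
    # Reward quality markers and four-digit years.
    if re.search(r"(1080p|720p|2160p|4k)", stem, re.IGNORECASE):
        score += 2
    if re.search(r"\b(19|20)\d{2}\b", stem):
        score += 2
    # Penalize copy-style names and numbered duplicates.
    if re.search(r"(copy|\(\d+\)|_\d+$)", stem, re.IGNORECASE):
        score -= 3
    # Wordier names tend to be more descriptive (capped bonus).
    score += min(len(re.findall(r"[A-Za-z]{3,}", stem)), 5)
    return score

print(name_score("Holiday.Trip.2023.1080p.mp4") > name_score("video (2).mp4"))
```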

Visual Duplicates

Phase 4 only considers a pair a safe visual match if all of these hold:

  • durations are within the configured duration tolerance
  • file sizes are within the configured size tolerance
  • average pHash distance is below the configured threshold
  • no single sampled frame diverges too far

The pHash signature is built from 3 fixed timeline points:

  • 15%
  • 50%
  • 85%

A file can still fail visual matching even if two of the three frames look close. This is intentional.

Winner Selection

When two files are treated as the same content, the survivor is chosen in this order:

  1. higher resolution
  2. higher bitrate
  3. larger file size
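
Because the three criteria are a strict priority order, survivor selection reduces to a lexicographic sort key. Field names in this sketch are illustrative.

```python
def pick_survivor(files):
    # Tuple comparison mirrors the documented order:
    # resolution first, then bitrate, then file size.
    return max(files, key=lambda f: (f["width"] * f["height"],
                                     f["bitrate"], f["size"]))

a = {"name": "a.mkv", "width": 1920, "height": 1080,
     "bitrate": 4_000_000, "size": 700}
b = {"name": "b.mp4", "width": 1280, "height": 720,
     "bitrate": 8_000_000, "size": 900}
print(pick_survivor([a, b])["name"])  # → a.mkv
```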

File Status Lifecycle

Each file is tracked in SQLite with one of these states:

  • NEW: indexed but not hashed yet
  • HASHED: SHA-256 completed successfully
  • METADATA_EXTRACTED: metadata and pHash are present
  • PROCESSED: final survivor has been organized
  • DELETED: duplicate removed or moved aside
  • BROKEN: file exists but could not be processed safely
  • MISSING: file existed in prior state but no longer exists on disk
  • PROCESSING_ERROR: a later processing step failed after basic metadata work

Important distinction:

  • BROKEN means the file is there, but unreadable or unusable
  • MISSING means the file is gone outside the tool and Phase 0 synced the DB state back to reality
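
The lifecycle states and the Phase-0 reconciliation can be sketched like this. The enum values come from the list above; the sync function is a simplified illustration of the MISSING rule, not the project's actual code.

```python
from enum import Enum
from pathlib import Path

class FileStatus(Enum):
    NEW = "NEW"
    HASHED = "HASHED"
    METADATA_EXTRACTED = "METADATA_EXTRACTED"
    PROCESSED = "PROCESSED"
    DELETED = "DELETED"
    BROKEN = "BROKEN"
    MISSING = "MISSING"
    PROCESSING_ERROR = "PROCESSING_ERROR"

def sync_status(path: Path, current: FileStatus) -> FileStatus:
    # Phase-0 style reconciliation: a tracked file that vanished from
    # disk becomes MISSING; everything else keeps its recorded state.
    if not path.exists() and current not in (FileStatus.DELETED,
                                             FileStatus.MISSING):
        return FileStatus.MISSING
    return current

print(sync_status(Path("/nonexistent/clip.mp4"), FileStatus.HASHED).name)
```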

Runtime Behavior

Dry Run vs Live Mode

Dry run:

  • no file is deleted, moved, or organized
  • the pipeline still computes matches and reports what would happen
  • recommended first pass for every new library

Live mode:

  • exact duplicates are deleted
  • visual duplicates are moved to __TRASH_BIN__/
  • broken files are moved to __QUARANTINE__/
  • survivors are moved to Organized/

Resume Model

The workflow is resumable because file state is stored in SQLite.

On rerun:

  • already-processed rows remain in the database
  • externally removed files are marked MISSING during Phase 0
  • processed survivors can still be used as comparison targets for new visual candidates

Parallelism

The expensive I/O-heavy stages run in parallel worker threads:

  • Phase 2 hashing
  • Phase 3 metadata extraction and pHash generation

This is meant to preserve practical throughput on large libraries without changing the core match rules.
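
The hashing stage, for instance, fits a thread pool naturally because SHA-256 over large files is I/O-bound. A minimal sketch of the pattern (not the project's worker code):

```python
import hashlib
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

def sha256_of(path: Path) -> str:
    # Stream in 1 MiB chunks so large video files never load fully into memory.
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

def hash_all(paths, workers=4):
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(zip(paths, pool.map(sha256_of, paths)))

# Demo: two identical files share a hash, a third does not.
tmp = Path(tempfile.mkdtemp())
(tmp / "a.mp4").write_bytes(b"same bytes")
(tmp / "b.mp4").write_bytes(b"same bytes")
(tmp / "c.mp4").write_bytes(b"different")
hashes = hash_all(sorted(tmp.iterdir()))
print(hashes[tmp / "a.mp4"] == hashes[tmp / "b.mp4"])  # → True
```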

Configuration Defaults

Current defaults from RefineryConfig:

  • pHash threshold: 10
  • duration tolerance: 1.0 second
  • size tolerance ratio: 0.3
  • thumbnail points: 0.15, 0.50, 0.85
  • FFmpeg timeout: 30 seconds
  • DB batch size: 50
  • index batch size: 1000
  • minimum free disk space: 1 GiB
  • SQLite cache size pragma: -64000
  • SQLite busy timeout pragma: 5000 ms
  • default mode: DRY_RUN=True
  • worker count: os.cpu_count(), falling back to 4 when the CPU count cannot be determined

Validation rules:

  • pHash threshold must be > 0
  • duration tolerance must be >= 0
  • batch size must be > 0
  • FFmpeg timeout must be > 0
  • size tolerance ratio must be in (0, 1]
  • exactly 3 thumbnail points are required
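
Validation like this is commonly enforced in a dataclass __post_init__. This sketch mirrors the documented defaults and rules but is not the actual RefineryConfig; field names are assumptions.

```python
from dataclasses import dataclass

@dataclass
class Config:
    # Defaults taken from the README; names are illustrative.
    phash_threshold: int = 10
    duration_tolerance: float = 1.0
    size_tolerance_ratio: float = 0.3
    batch_size: int = 50
    ffmpeg_timeout: int = 30
    thumbnail_points: tuple = (0.15, 0.50, 0.85)

    def __post_init__(self):
        if self.phash_threshold <= 0:
            raise ValueError("pHash threshold must be > 0")
        if self.duration_tolerance < 0:
            raise ValueError("duration tolerance must be >= 0")
        if self.batch_size <= 0 or self.ffmpeg_timeout <= 0:
            raise ValueError("batch size and FFmpeg timeout must be > 0")
        if not 0 < self.size_tolerance_ratio <= 1:
            raise ValueError("size tolerance ratio must be in (0, 1]")
        if len(self.thumbnail_points) != 3:
            raise ValueError("exactly 3 thumbnail points are required")

Config()  # defaults pass validation
```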

Database Notes

The database contains a single main table:

  • files

Stored columns:

  • id
  • path
  • filename
  • extension
  • size_bytes
  • sha256
  • duration
  • width
  • height
  • bitrate
  • phash
  • status

Indexes:

  • idx_status
  • idx_sha256
  • idx_duration
  • idx_status_duration

SQLite runtime notes:

  • WAL mode is enabled
  • synchronous=NORMAL is used
  • temp_store=MEMORY is used
  • auto_vacuum=INCREMENTAL is enabled

Practical implication:

  • while active, SQLite may also create refinery.db-wal and refinery.db-shm
  • these are normal WAL companion files, not corruption
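
The pragmas above can be applied at connection time with Python's sqlite3 module. This is a sketch of the documented settings, not the project's database adapter.

```python
import os
import sqlite3
import tempfile

def connect(db_path: str) -> sqlite3.Connection:
    # Pragma values mirror the documented runtime settings.
    conn = sqlite3.connect(db_path, timeout=5.0)
    conn.execute("PRAGMA journal_mode=WAL")
    conn.execute("PRAGMA synchronous=NORMAL")
    conn.execute("PRAGMA temp_store=MEMORY")
    conn.execute("PRAGMA auto_vacuum=INCREMENTAL")
    conn.execute("PRAGMA cache_size=-64000")   # ~64 MB page cache
    conn.execute("PRAGMA busy_timeout=5000")   # 5 s lock wait
    return conn

db_path = os.path.join(tempfile.mkdtemp(), "refinery.db")
conn = connect(db_path)
mode = conn.execute("PRAGMA journal_mode").fetchone()[0]
print(mode)  # → wal
```

Note that WAL mode only takes effect for on-disk databases; an in-memory database reports journal_mode "memory" regardless.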

Dependencies and External Tools

Python packages used directly by the active code:

  • Pillow
  • imagehash
  • rich

System tools used directly by the active code:

  • ffmpeg
  • ffprobe

Media analysis details:

  • ffprobe is used for metadata
  • ffmpeg is used to extract thumbnails for pHash sampling
  • if bitrate is missing from metadata, bitrate is estimated from size_bytes / duration
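
The fallback estimate in the last bullet is plain arithmetic. This sketch is one plausible reading that converts to bits per second; whether the real adapter reports bits or bytes per second is not specified by the README.

```python
def estimate_bitrate(size_bytes: int, duration: float) -> int:
    # Fallback when ffprobe reports no bitrate: total bits over total seconds.
    # Guard against zero/negative durations from broken files.
    if duration <= 0:
        return 0
    return int(size_bytes * 8 / duration)

# 90 MB over 60 s ≈ 12 Mbit/s.
print(estimate_bitrate(90_000_000, 60.0))  # → 12000000
```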

Failure Modes and Safety Notes

Common failure paths:

  • FFmpeg or FFprobe missing from PATH
  • invalid source path
  • insufficient free disk space in Phase 0
  • unreadable or corrupt media files
  • pHash generation failing because too few valid thumbnails were extracted

Safety choices by design:

  • dry run is default
  • visual duplicates go to trash instead of permanent deletion
  • broken files go to quarantine instead of being silently ignored
  • missing files are explicitly marked in the database instead of being misclassified as broken

Known Limitations

Current limitations:

  • this is not zero-dependency yet
  • exact duplicate deletion is permanent in live mode
  • visual matching is strict, but no automatic media matcher can guarantee perfect results for every edge case
  • subtitle burn-ins, alternate intros/outros, and partial edits can still affect matching outcomes
  • there is no tracked integration-test media corpus in the repository
  • packaging is still basic; installation is manual rather than release-grade
  • the tool assumes a local filesystem and is not built for cloud/object storage workflows

Known Issues to Keep in Mind

Operational realities to expect:

  • very large libraries can still take time because FFmpeg and hashing are real I/O work
  • WAL mode may leave -wal and -shm files next to the database while the DB is active
  • moving files outside the tool between runs is supported, but it changes DB state to MISSING
  • live mode changes the library layout, so external tools pointing at old paths may need to rescan

Roadmap

Near-term priorities:

  • keep architecture simple and behavior-safe
  • tighten docs around schema and operational expectations
  • reduce dependency surface where it makes sense
  • improve packaging and installation ergonomics
  • add real tracked integration coverage with sample media fixtures
  • explore a future leaner distribution path for non-media-core parts of the project

Disclaimer

Even with dry-run defaults and reversible handling for visual duplicates, this tool makes automated decisions about real files. Keep backups of anything you cannot afford to lose.
