Video deduplication and organization for large local libraries. Safe by default, strict on visual matches, simple to run.
THE REFINERY scans a video folder, removes exact duplicates, isolates risky visual duplicates, quarantines broken files, and organizes surviving files by resolution.
The current codebase uses a hexagonal split, but the real priority is behavioral correctness:
- abort cleanly when pre-flight checks fail
- avoid false-positive visual matches
- keep resume state accurate in SQLite
- preserve practical throughput on large libraries
The pipeline runs in 6 stages:
- Phase 0: pre-flight checks and database sync
- Phase 1: indexing
- Phase 2: exact duplicate detection with SHA-256
- Phase 3: metadata extraction and perceptual hashing
- Phase 4: visual duplicate matching
- Phase 5: organization of surviving files
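As a sketch, the orchestration reduces to running the phases in order and aborting cleanly when pre-flight fails. Names here are illustrative, not the actual identifiers in `src/domain/service.py`:

```python
from enum import Enum, auto

class Phase(Enum):
    """Illustrative phase names; the real identifiers may differ."""
    PREFLIGHT = auto()      # Phase 0: pre-flight checks and DB sync
    INDEX = auto()          # Phase 1: indexing
    EXACT_DEDUPE = auto()   # Phase 2: SHA-256 exact duplicates
    ANALYZE = auto()        # Phase 3: metadata + perceptual hashing
    VISUAL_DEDUPE = auto()  # Phase 4: visual duplicate matching
    ORGANIZE = auto()       # Phase 5: organize survivors

def run_pipeline(handlers):
    """Run phases in order; abort cleanly (exit 1) if pre-flight fails."""
    for phase in Phase:
        ok = handlers[phase]()
        if phase is Phase.PREFLIGHT and not ok:
            raise SystemExit(1)
```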
What happens to files:
- exact duplicates are deleted in live mode
- visual duplicates are moved to `__TRASH_BIN__/`
- broken files are moved to `__QUARANTINE__/`
- surviving files are moved into `Organized/`
What this project is today:
- local CLI tool
- Python project with dependencies
- FFmpeg/FFprobe-based media inspection
- SQLite-backed resumable workflow
What it is not today:
- zero-dependency
- packaged for one-command installation
- backed by tracked media-fixture integration tests in the repository
- designed for remote storage, cloud buckets, or distributed workers
```
PROJECT_REFINERY/
├── src/
│   ├── domain/
│   │   ├── model.py          # Enums, config, dataclasses
│   │   ├── ports.py          # Domain interfaces
│   │   └── service.py        # Phase orchestration and matching rules
│   ├── infrastructure/
│   │   ├── analyzer.py       # FFmpeg/Pillow/imagehash adapter
│   │   ├── database.py       # SQLite repository adapter
│   │   └── file_manager.py   # Filesystem adapter
│   └── entrypoints/
│       └── cli.py            # Active CLI entrypoint
├── archive/
│   └── the_refinery.py       # Archived monolithic reference
├── memory-bank/              # Project memory/context docs
├── requirements.txt          # Python dependencies
├── README.md
└── .gitignore
```
Directory meaning:
- `src/domain/` holds the real rules.
- `src/infrastructure/` talks to SQLite, FFmpeg, Pillow, imagehash, and the filesystem.
- `src/entrypoints/cli.py` wires everything together.
- `archive/the_refinery.py` is not the active runtime path; it exists as the old reference implementation.
Required before running:
- Python 3.11+
- `ffmpeg` on `PATH`
- `ffprobe` on `PATH`
- Python packages from `requirements.txt`
requirements.txt currently contains:
```
Pillow>=10.0.0
imagehash>=4.3.0
rich>=13.0.0
```
- Create a virtual environment.

```
python -m venv .venv
```

- Activate it.

Linux/macOS:

```
source .venv/bin/activate
```

Windows PowerShell:

```
.venv\Scripts\Activate.ps1
```

- Install Python dependencies.

```
pip install -r requirements.txt
```

- Install FFmpeg if needed.

Linux:

```
sudo apt install ffmpeg
```

Windows:

- Download FFmpeg from https://ffmpeg.org/download.html
- Add the `bin` directory to `PATH`
Dry run:
```
python -m src.entrypoints.cli --path "/path/to/videos"
```

Live mode:

```
python -m src.entrypoints.cli --path "/path/to/videos" --no-dry-run
```

Custom database path:

```
python -m src.entrypoints.cli --path "/path/to/videos" --db "/path/to/refinery.db"
```

Custom log file:

```
python -m src.entrypoints.cli --path "/path/to/videos" --log refinery.log
```

Custom pHash threshold:

```
python -m src.entrypoints.cli --path "/path/to/videos" --threshold 10
```

Supported CLI flags:
- `--path`, `-p`: source video directory, required
- `--db`, `-d`: custom SQLite database path
- `--log`, `-l`: log file path
- `--no-dry-run`: enable live mode
- `--threshold`: pHash average-distance threshold, default `10`
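For orientation, the documented flag set corresponds to an argparse parser roughly like the following. This is a sketch; the actual wiring in `cli.py` may differ in details:

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    """Mirror the documented CLI flags (names taken from this README)."""
    p = argparse.ArgumentParser(prog="refinery")
    p.add_argument("--path", "-p", required=True, help="source video directory")
    p.add_argument("--db", "-d", help="custom SQLite database path")
    p.add_argument("--log", "-l", help="log file path")
    p.add_argument("--no-dry-run", action="store_true", help="enable live mode")
    p.add_argument("--threshold", type=int, default=10,
                   help="pHash average-distance threshold")
    return p
```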
Exit behavior:
- exit `0`: successful run
- exit `1`: invalid config, invalid path, missing dependency, failed pre-check, or fatal runtime error
Currently recognized video extensions:
`.mp4`, `.mkv`, `.avi`, `.mov`, `.wmv`, `.flv`, `.webm`, `.m4v`, `.mpeg`, `.mpg`, `.3gp`
The scanner skips these generated directories automatically:
`__TRASH_BIN__/`, `__QUARANTINE__/`, `Organized/`
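A minimal sketch of how such a scan might work. The extension set and skip list come from this README; the function name and traversal details are assumptions:

```python
from pathlib import Path

# Extension set and generated-directory names as documented above
VIDEO_EXTS = {".mp4", ".mkv", ".avi", ".mov", ".wmv", ".flv",
              ".webm", ".m4v", ".mpeg", ".mpg", ".3gp"}
SKIP_DIRS = {"__TRASH_BIN__", "__QUARANTINE__", "Organized"}

def scan_videos(root: Path):
    """Yield video files under root, skipping the tool's own output dirs."""
    for path in root.rglob("*"):
        if any(part in SKIP_DIRS for part in path.parts):
            continue
        if path.is_file() and path.suffix.lower() in VIDEO_EXTS:
            yield path
```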
The tool writes these outputs inside the target library unless a custom DB path is supplied:
- `refinery.db`: SQLite state database
- `__TRASH_BIN__/`: visual duplicates moved aside for review
- `__QUARANTINE__/`: broken or unreadable files
- `Organized/`: final organized survivors
Organization buckets:
`HD_1080p+`, `HD_720p`, `SD_480p`, `Low_Quality`
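The bucket mapping can be sketched as a simple threshold function over frame height. The exact cutoff values are assumptions inferred from the bucket names, not confirmed from the code:

```python
def resolution_bucket(height: int) -> str:
    """Map a video's frame height to an organization bucket.
    Cutoffs are assumed from the bucket names (1080p+, 720p, 480p)."""
    if height >= 1080:
        return "HD_1080p+"
    if height >= 720:
        return "HD_720p"
    if height >= 480:
        return "SD_480p"
    return "Low_Quality"
```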
Example before:
```
Videos/
├── clip_a.mp4
├── clip_a_copy.mp4
├── clip_a_720p.mkv
├── broken_file.mp4
└── subdir/
    └── movie_part1.mp4
```

Example after a live run:

```
Videos/
├── __TRASH_BIN__/
│   └── clip_a_720p.mkv
├── __QUARANTINE__/
│   └── broken_file.mp4
├── Organized/
│   └── HD_1080p+/
│       ├── clip_a.mp4
│       └── movie_part1.mp4
└── refinery.db
```
Phase 2:
- each file gets a SHA-256 hash
- files with the same SHA-256 are exact duplicates
- one file is kept, the others are deleted in live mode
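A sketch of the chunked hashing and digest grouping this phase implies. The 1 MiB chunk size and function names are assumptions for illustration:

```python
import hashlib
from collections import defaultdict
from pathlib import Path

def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in fixed-size chunks so large videos never load fully into memory."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        while chunk := f.read(chunk_size):
            digest.update(chunk)
    return digest.hexdigest()

def exact_duplicate_groups(paths):
    """Group files by digest; any group with two or more members is an exact-duplicate set."""
    groups = defaultdict(list)
    for p in paths:
        groups[sha256_of(p)].append(p)
    return [g for g in groups.values() if len(g) > 1]
```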
Filename keeper heuristic:
- the tool prefers the most human-readable filename
- filenames with useful words, dates, or quality markers score higher
- filenames that look like copies or numbered duplicates score lower
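An illustrative scoring function in the spirit of this heuristic. The patterns and weights below are invented for the example, not taken from the codebase:

```python
import re

# Hypothetical patterns: names that look like copies score down,
# descriptive markers score up. Weights are made up for illustration.
COPY_PATTERNS = (r"\bcopy\b", r"\(\d+\)", r"_\d+$", r"\bdup(licate)?\b")
QUALITY_MARKERS = ("1080p", "720p", "2160p", "x264", "x265", "bluray")

def keeper_score(stem: str) -> int:
    """Higher score = more human-readable, more likely to be kept."""
    s = stem.lower()
    score = 0
    score += 2 * len(re.findall(r"[a-z]{3,}", s))           # real words
    score += 3 * sum(m in s for m in QUALITY_MARKERS)       # quality markers
    score += 3 if re.search(r"\b(19|20)\d{2}\b", s) else 0  # a plausible year
    score -= 5 * sum(bool(re.search(p, s)) for p in COPY_PATTERNS)
    return score
```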
Phase 4 only considers a pair a safe visual match if all of these hold:
- durations are within the configured duration tolerance
- file sizes are within the configured size tolerance
- average pHash distance is below the configured threshold
- no single sampled frame diverges too far
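The four conditions above can be sketched as one all-or-nothing check. The per-frame divergence cap and the size-ratio formula are assumptions, since only the tolerances themselves are documented:

```python
def bit_distance(h1: int, h2: int) -> int:
    """Hamming distance between two integer pHashes."""
    return bin(h1 ^ h2).count("1")

def is_safe_visual_match(a, b, *, max_avg_dist=10, max_frame_dist=20,
                         duration_tol=1.0, size_ratio=0.3):
    """All conditions must hold. `a`/`b` are dicts with duration, size,
    and per-frame pHash values; max_frame_dist and the size-ratio
    formula are assumptions for this sketch."""
    if abs(a["duration"] - b["duration"]) > duration_tol:
        return False
    bigger, smaller = max(a["size"], b["size"]), min(a["size"], b["size"])
    if (bigger - smaller) / bigger > size_ratio:
        return False
    dists = [bit_distance(x, y) for x, y in zip(a["phashes"], b["phashes"])]
    if max(dists) > max_frame_dist:  # a single diverging frame fails the pair
        return False
    return sum(dists) / len(dists) < max_avg_dist
```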
The pHash signature is built from 3 fixed timeline points:
`15%`, `50%`, `85%`
A file can still fail visual matching even if two of the three frames look close. This is intentional.
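A sketch of how the three sample points might translate into ffmpeg frame grabs. It assumes `ffmpeg` on `PATH`; the extracted PNG bytes would then be opened with Pillow and hashed with `imagehash.phash`:

```python
import subprocess

SAMPLE_POINTS = (0.15, 0.50, 0.85)  # the three fixed timeline points

def sample_timestamps(duration: float):
    """Offsets in seconds for the three fixed timeline points."""
    return [duration * p for p in SAMPLE_POINTS]

def extract_frame(path: str, ts: float, timeout: float = 30.0) -> bytes:
    """Grab one frame at timestamp `ts` as PNG bytes via ffmpeg."""
    cmd = ["ffmpeg", "-ss", f"{ts:.3f}", "-i", path,
           "-frames:v", "1", "-f", "image2", "-c:v", "png", "pipe:1"]
    return subprocess.run(cmd, capture_output=True, timeout=timeout,
                          check=True).stdout
```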
When two files are treated as the same content, the survivor is chosen in this order:
- higher resolution
- higher bitrate
- larger file size
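As a sketch, that ordering is a lexicographic max over (resolution, bitrate, size); field names here are illustrative:

```python
def choose_survivor(files):
    """Pick the keeper: highest resolution first, then bitrate, then size.
    Each file is a dict with width/height/bitrate/size_bytes."""
    return max(files, key=lambda f: (f["width"] * f["height"],
                                     f["bitrate"], f["size_bytes"]))
```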
Each file is tracked in SQLite with one of these states:
- `NEW`: indexed but not hashed yet
- `HASHED`: SHA-256 completed successfully
- `METADATA_EXTRACTED`: metadata and pHash are present
- `PROCESSED`: final survivor has been organized
- `DELETED`: duplicate removed or moved aside
- `BROKEN`: file exists but could not be processed safely
- `MISSING`: file existed in prior state but no longer exists on disk
- `PROCESSING_ERROR`: a later processing step failed after basic metadata work
Important distinction:
- `BROKEN` means the file is there, but unreadable or unusable
- `MISSING` means the file is gone outside the tool and Phase 0 synced the DB state back to reality
Dry run:
- no file is deleted, moved, or organized
- the pipeline still computes matches and reports what would happen
- recommended first pass for every new library
Live mode:
- exact duplicates are deleted
- visual duplicates are moved to `__TRASH_BIN__/`
- broken files are moved to `__QUARANTINE__/`
- survivors are moved to `Organized/`
The workflow is resumable because file state is stored in SQLite.
On rerun:
- already-processed rows remain in the database
- externally removed files are marked `MISSING` during Phase 0
- processed survivors can still be used as comparison targets for new visual candidates
The expensive I/O-heavy stages run in parallel worker threads:
- Phase 2 hashing
- Phase 3 metadata extraction and pHash generation
This is meant to preserve practical throughput on large libraries without changing the core match rules.
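A sketch of that worker pattern using `concurrent.futures`; function and variable names are illustrative:

```python
import os
from concurrent.futures import ThreadPoolExecutor, as_completed

def parallel_map(fn, items, workers=None):
    """Run an I/O-heavy step (hashing, ffprobe calls) across worker
    threads, collecting per-item errors instead of aborting the batch."""
    workers = workers or os.cpu_count() or 4
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(fn, item): item for item in items}
        for fut in as_completed(futures):
            item = futures[fut]
            try:
                results[item] = fut.result()
            except Exception as exc:
                errors[item] = exc
    return results, errors
```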
Current defaults from RefineryConfig:
- pHash threshold: `10`
- duration tolerance: `1.0` second
- size tolerance ratio: `0.3`
- thumbnail points: `0.15`, `0.50`, `0.85`
- FFmpeg timeout: `30` seconds
- DB batch size: `50`
- index batch size: `1000`
- minimum free disk space: `1 GiB`
- SQLite cache size pragma: `-64000`
- SQLite busy timeout pragma: `5000` ms
- default mode: `DRY_RUN=True`
- worker count: `os.cpu_count()` from CLI, fallback `4`
Validation rules:
- pHash threshold must be `> 0`
- duration tolerance must be `>= 0`
- batch size must be `> 0`
- FFmpeg timeout must be `> 0`
- size tolerance ratio must be in `(0, 1]`
- exactly 3 thumbnail points are required
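The defaults and validation rules above can be captured in a dataclass sketch. Field names are illustrative and may not match `RefineryConfig` exactly:

```python
from dataclasses import dataclass

@dataclass
class Config:
    """Sketch mirroring the documented defaults and validation rules."""
    phash_threshold: int = 10
    duration_tolerance: float = 1.0
    size_tolerance_ratio: float = 0.3
    thumbnail_points: tuple = (0.15, 0.50, 0.85)
    ffmpeg_timeout: float = 30.0
    batch_size: int = 50

    def __post_init__(self):
        if self.phash_threshold <= 0:
            raise ValueError("pHash threshold must be > 0")
        if self.duration_tolerance < 0:
            raise ValueError("duration tolerance must be >= 0")
        if self.batch_size <= 0 or self.ffmpeg_timeout <= 0:
            raise ValueError("batch size and FFmpeg timeout must be > 0")
        if not (0 < self.size_tolerance_ratio <= 1):
            raise ValueError("size tolerance ratio must be in (0, 1]")
        if len(self.thumbnail_points) != 3:
            raise ValueError("exactly 3 thumbnail points are required")
```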
The database contains a single main table:
files
Stored columns:
`id`, `path`, `filename`, `extension`, `size_bytes`, `sha256`, `duration`, `width`, `height`, `bitrate`, `phash`, `status`
Indexes:
`idx_status`, `idx_sha256`, `idx_duration`, `idx_status_duration`
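Putting the documented columns and indexes together, the schema is roughly the following. Types and constraints here are assumptions; the real DDL lives in `database.py`:

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS files (
    id          INTEGER PRIMARY KEY,
    path        TEXT NOT NULL UNIQUE,
    filename    TEXT NOT NULL,
    extension   TEXT,
    size_bytes  INTEGER,
    sha256      TEXT,
    duration    REAL,
    width       INTEGER,
    height      INTEGER,
    bitrate     INTEGER,
    phash       TEXT,
    status      TEXT NOT NULL DEFAULT 'NEW'
);
CREATE INDEX IF NOT EXISTS idx_status ON files(status);
CREATE INDEX IF NOT EXISTS idx_sha256 ON files(sha256);
CREATE INDEX IF NOT EXISTS idx_duration ON files(duration);
CREATE INDEX IF NOT EXISTS idx_status_duration ON files(status, duration);
"""

def open_db(path: str = ":memory:") -> sqlite3.Connection:
    """Open a connection and ensure the schema exists."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn
```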
SQLite runtime notes:
- WAL mode is enabled
- `synchronous=NORMAL` is used
- `temp_store=MEMORY` is used
- `auto_vacuum=INCREMENTAL` is enabled
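These settings are applied per connection; a sketch consistent with the documented pragma values:

```python
import sqlite3

def tune_connection(conn: sqlite3.Connection) -> None:
    """Apply the documented runtime pragmas to a fresh connection."""
    conn.execute("PRAGMA journal_mode=WAL")
    conn.execute("PRAGMA synchronous=NORMAL")
    conn.execute("PRAGMA temp_store=MEMORY")
    conn.execute("PRAGMA auto_vacuum=INCREMENTAL")
    conn.execute("PRAGMA cache_size=-64000")   # ~64 MiB page cache
    conn.execute("PRAGMA busy_timeout=5000")   # 5000 ms
```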
Practical implication:
- while active, SQLite may also create `refinery.db-wal` and `refinery.db-shm`
- these are normal WAL companion files, not corruption
Python packages used directly by the active code:
`Pillow`, `imagehash`, `rich`
System tools used directly by the active code:
`ffmpeg`, `ffprobe`
Media analysis details:
- `ffprobe` is used for metadata
- `ffmpeg` is used to extract thumbnails for pHash sampling
- if bitrate is missing from metadata, bitrate is estimated from `size_bytes / duration`
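A sketch of the probe-then-fallback flow. It assumes `ffprobe` on `PATH` and applies the `size_bytes / duration` estimate literally as described above; function names are illustrative:

```python
import json
import subprocess

def ffprobe_json(path: str, timeout: float = 30.0) -> dict:
    """Full metadata dump as JSON (ffprobe assumed on PATH)."""
    cmd = ["ffprobe", "-v", "quiet", "-print_format", "json",
           "-show_format", "-show_streams", path]
    out = subprocess.run(cmd, capture_output=True, timeout=timeout,
                         check=True).stdout
    return json.loads(out)

def effective_bitrate(meta: dict, size_bytes: int):
    """Use ffprobe's bit_rate when present; otherwise estimate from
    size_bytes / duration as described above. Returns None if neither
    bitrate nor duration is available."""
    fmt = meta.get("format", {})
    if fmt.get("bit_rate"):
        return float(fmt["bit_rate"])
    duration = float(fmt.get("duration", 0) or 0)
    return size_bytes / duration if duration > 0 else None
```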
Common failure paths:
- FFmpeg or FFprobe missing from `PATH`
- invalid source path
- insufficient free disk space in Phase 0
- unreadable or corrupt media files
- pHash generation failing because too few valid thumbnails were extracted
Safety choices by design:
- dry run is default
- visual duplicates go to trash instead of permanent deletion
- broken files go to quarantine instead of being silently ignored
- missing files are explicitly marked in the database instead of being misclassified as broken
Current limitations:
- this is not zero-dependency yet
- exact duplicate deletion is permanent in live mode
- visual matching is strict, but no automatic media matcher can guarantee perfect results for every edge case
- subtitle burn-ins, alternate intros/outros, and partial edits can still affect matching outcomes
- there is no tracked integration-test media corpus in the repository
- packaging is still basic; installation is manual rather than release-grade
- the tool assumes a local filesystem and is not built for cloud/object storage workflows
Operational realities to expect:
- very large libraries can still take time because FFmpeg and hashing are real I/O work
- WAL mode may leave `-wal` and `-shm` files next to the database while the DB is active
- moving files outside the tool between runs is supported, but it changes DB state to `MISSING`
- live mode changes the library layout, so external tools pointing at old paths may need to rescan
Near-term priorities:
- keep architecture simple and behavior-safe
- tighten docs around schema and operational expectations
- reduce dependency surface where it makes sense
- improve packaging and installation ergonomics
- add real tracked integration coverage with sample media fixtures
- explore a future leaner distribution path for non-media-core parts of the project
Relevant official docs:
- Python `concurrent.futures`: https://docs.python.org/3/library/concurrent.futures.html
- Python `pathlib`: https://docs.python.org/3/library/pathlib.html
- SQLite WAL: https://sqlite.org/wal.html
- FFmpeg docs: https://ffmpeg.org/documentation.html
Even with dry-run defaults and reversible handling for visual duplicates, this tool makes automated decisions about real files. Keep backups of anything you cannot afford to lose.