Video deduplication and organization for large local libraries. Safe by default, strict on visual matches, simple to run.
THE REFINERY scans a video folder, removes exact duplicates, isolates risky visual duplicates, quarantines broken files, and organizes surviving files by resolution.
The current codebase uses a hexagonal split, but the real priority is behavioral correctness:
- abort cleanly when pre-flight checks fail
- avoid false-positive visual matches
- keep resume state accurate in SQLite
- preserve practical throughput on large libraries
The pipeline runs in 6 stages:
- Phase 0: pre-flight checks and database sync
- Phase 1: indexing
- Phase 2: exact duplicate detection with SHA-256
- Phase 3: metadata extraction and perceptual hashing
- Phase 4: visual duplicate matching
- Phase 5: organization of surviving files
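As a sketch, the orchestration reduces to running the phases in order and aborting cleanly when pre-flight fails. Names here are illustrative, not the actual identifiers in `src/domain/service.py`:

```python
from enum import Enum, auto

class Phase(Enum):
    """Illustrative phase names; the real identifiers may differ."""
    PREFLIGHT = auto()      # Phase 0: pre-flight checks and DB sync
    INDEX = auto()          # Phase 1: indexing
    EXACT_DEDUPE = auto()   # Phase 2: SHA-256 exact duplicates
    ANALYZE = auto()        # Phase 3: metadata + perceptual hashing
    VISUAL_DEDUPE = auto()  # Phase 4: visual duplicate matching
    ORGANIZE = auto()       # Phase 5: organize survivors

def run_pipeline(handlers):
    """Run phases in order; abort cleanly (exit 1) if pre-flight fails."""
    for phase in Phase:
        ok = handlers[phase]()
        if phase is Phase.PREFLIGHT and not ok:
            raise SystemExit(1)
```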
What happens to files:
- exact duplicates are deleted in live mode
- visual duplicates are moved to `__TRASH_BIN__/`
- broken files are moved to `__QUARANTINE__/`
- surviving files are moved into `Organized/`
What this project is today:
- local CLI tool
- Python project with dependencies
- FFmpeg/FFprobe-based media inspection
- SQLite-backed resumable workflow
What it is not today:
- zero-dependency
- packaged for one-command installation
- backed by tracked media-fixture integration tests in the repository
- designed for remote storage, cloud buckets, or distributed workers
```
PROJECT_REFINERY/
├── src/
│   ├── domain/
│   │   ├── model.py          # Enums, config, dataclasses
│   │   ├── ports.py          # Domain interfaces
│   │   └── service.py        # Phase orchestration and matching rules
│   ├── infrastructure/
│   │   ├── analyzer.py       # FFmpeg/Pillow/imagehash adapter
│   │   ├── database.py       # SQLite repository adapter
│   │   └── file_manager.py   # Filesystem adapter
│   └── entrypoints/
│       └── cli.py            # Active CLI entrypoint
├── archive/
│   └── the_refinery.py       # Archived monolithic reference
├── memory-bank/              # Project memory/context docs
├── requirements.txt          # Python dependencies
├── README.md
└── .gitignore
```
Directory meaning:
- `src/domain/` holds the real rules.
- `src/infrastructure/` talks to SQLite, FFmpeg, Pillow, imagehash, and the filesystem.
- `src/entrypoints/cli.py` wires everything together.
- `archive/the_refinery.py` is not the active runtime path; it exists as the old reference implementation.
Required before running:
- Python 3.11+
- `ffmpeg` on `PATH`
- `ffprobe` on `PATH`
- Python packages from `requirements.txt`
requirements.txt currently contains:
```
Pillow>=10.0.0
imagehash>=4.3.0
rich>=13.0.0
```
- Create a virtual environment.

```
python -m venv .venv
```

- Activate it.

Linux/macOS:

```
source .venv/bin/activate
```

Windows PowerShell:

```
.venv\Scripts\Activate.ps1
```

- Install Python dependencies.

```
pip install -r requirements.txt
```

- Install FFmpeg if needed.

Linux:

```
sudo apt install ffmpeg
```

Windows:

- Download FFmpeg from https://ffmpeg.org/download.html
- Add the `bin` directory to `PATH`
Dry run:
```
python -m src.entrypoints.cli --path "/path/to/videos"
```

Live mode:

```
python -m src.entrypoints.cli --path "/path/to/videos" --no-dry-run
```

Custom database path:

```
python -m src.entrypoints.cli --path "/path/to/videos" --db "/path/to/refinery.db"
```

Custom log file:

```
python -m src.entrypoints.cli --path "/path/to/videos" --log refinery.log
```

Custom pHash threshold:

```
python -m src.entrypoints.cli --path "/path/to/videos" --threshold 10
```

Supported CLI flags:
- `--path`, `-p`: source video directory, required
- `--db`, `-d`: custom SQLite database path
- `--log`, `-l`: log file path
- `--no-dry-run`: enable live mode
- `--threshold`: pHash average-distance threshold, default `10`
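For orientation, the documented flag set corresponds to an argparse parser roughly like the following. This is a sketch; the actual wiring in `cli.py` may differ in details:

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    """Mirror the documented CLI flags (names taken from this README)."""
    p = argparse.ArgumentParser(prog="refinery")
    p.add_argument("--path", "-p", required=True, help="source video directory")
    p.add_argument("--db", "-d", help="custom SQLite database path")
    p.add_argument("--log", "-l", help="log file path")
    p.add_argument("--no-dry-run", action="store_true", help="enable live mode")
    p.add_argument("--threshold", type=int, default=10,
                   help="pHash average-distance threshold")
    return p
```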
Exit behavior:
- exit `0`: successful run
- exit `1`: invalid config, invalid path, missing dependency, failed pre-check, or fatal runtime error
Currently recognized video extensions:
`.mp4`, `.mkv`, `.avi`, `.mov`, `.wmv`, `.flv`, `.webm`, `.m4v`, `.mpeg`, `.mpg`, `.3gp`
The scanner skips these generated directories automatically:
`__TRASH_BIN__/`, `__QUARANTINE__/`, `Organized/`
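A minimal sketch of how such a scan might work. The extension set and skip list come from this README; the function name and traversal details are assumptions:

```python
from pathlib import Path

# Extension set and generated-directory names as documented above
VIDEO_EXTS = {".mp4", ".mkv", ".avi", ".mov", ".wmv", ".flv",
              ".webm", ".m4v", ".mpeg", ".mpg", ".3gp"}
SKIP_DIRS = {"__TRASH_BIN__", "__QUARANTINE__", "Organized"}

def scan_videos(root: Path):
    """Yield video files under root, skipping the tool's own output dirs."""
    for path in root.rglob("*"):
        if any(part in SKIP_DIRS for part in path.parts):
            continue
        if path.is_file() and path.suffix.lower() in VIDEO_EXTS:
            yield path
```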
The tool writes these outputs inside the target library unless a custom DB path is supplied:
- `refinery.db`: SQLite state database
- `__TRASH_BIN__/`: visual duplicates moved aside for review
- `__QUARANTINE__/`: broken or unreadable files
- `Organized/`: final organized survivors
Organization buckets:
`HD_1080p+`, `HD_720p`, `SD_480p`, `Low_Quality`
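The bucket mapping can be sketched as a simple threshold function over frame height. The exact cutoff values are assumptions inferred from the bucket names, not confirmed from the code:

```python
def resolution_bucket(height: int) -> str:
    """Map a video's frame height to an organization bucket.
    Cutoffs are assumed from the bucket names (1080p+, 720p, 480p)."""
    if height >= 1080:
        return "HD_1080p+"
    if height >= 720:
        return "HD_720p"
    if height >= 480:
        return "SD_480p"
    return "Low_Quality"
```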
Example before:
```
Videos/
├── clip_a.mp4
├── clip_a_copy.mp4
├── clip_a_720p.mkv
├── broken_file.mp4
└── subdir/
    └── movie_part1.mp4
```

Example after a live run:

```
Videos/
├── __TRASH_BIN__/
│   └── clip_a_720p.mkv
├── __QUARANTINE__/
│   └── broken_file.mp4
├── Organized/
│   └── HD_1080p+/
│       ├── clip_a.mp4
│       └── movie_part1.mp4
└── refinery.db
```
Phase 2:
- each file gets a SHA-256 hash
- files with the same SHA-256 are exact duplicates
- one file is kept, the others are deleted in live mode
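A sketch of the chunked hashing and digest grouping this phase implies. The 1 MiB chunk size and function names are assumptions for illustration:

```python
import hashlib
from collections import defaultdict
from pathlib import Path

def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in fixed-size chunks so large videos never load fully into memory."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        while chunk := f.read(chunk_size):
            digest.update(chunk)
    return digest.hexdigest()

def exact_duplicate_groups(paths):
    """Group files by digest; any group with two or more members is an exact-duplicate set."""
    groups = defaultdict(list)
    for p in paths:
        groups[sha256_of(p)].append(p)
    return [g for g in groups.values() if len(g) > 1]
```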
Filename keeper heuristic:
- the tool prefers the most human-readable filename
- filenames with useful words, dates, or quality markers score higher
- filenames that look like copies or numbered duplicates score lower
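An illustrative scoring function in the spirit of this heuristic. The patterns and weights below are invented for the example, not taken from the codebase:

```python
import re

# Hypothetical patterns: names that look like copies score down,
# descriptive markers score up. Weights are made up for illustration.
COPY_PATTERNS = (r"\bcopy\b", r"\(\d+\)", r"_\d+$", r"\bdup(licate)?\b")
QUALITY_MARKERS = ("1080p", "720p", "2160p", "x264", "x265", "bluray")

def keeper_score(stem: str) -> int:
    """Higher score = more human-readable, more likely to be kept."""
    s = stem.lower()
    score = 0
    score += 2 * len(re.findall(r"[a-z]{3,}", s))           # real words
    score += 3 * sum(m in s for m in QUALITY_MARKERS)       # quality markers
    score += 3 if re.search(r"\b(19|20)\d{2}\b", s) else 0  # a plausible year
    score -= 5 * sum(bool(re.search(p, s)) for p in COPY_PATTERNS)
    return score
```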
Phase 4 only considers a pair a safe visual match if all of these hold:
- durations are within the configured duration tolerance
- file sizes are within the configured size tolerance
- average pHash distance is below the configured threshold
- no single sampled frame diverges too far
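The four conditions above can be sketched as one all-or-nothing check. The per-frame divergence cap and the size-ratio formula are assumptions, since only the tolerances themselves are documented:

```python
def bit_distance(h1: int, h2: int) -> int:
    """Hamming distance between two integer pHashes."""
    return bin(h1 ^ h2).count("1")

def is_safe_visual_match(a, b, *, max_avg_dist=10, max_frame_dist=20,
                         duration_tol=1.0, size_ratio=0.3):
    """All conditions must hold. `a`/`b` are dicts with duration, size,
    and per-frame pHash values; max_frame_dist and the size-ratio
    formula are assumptions for this sketch."""
    if abs(a["duration"] - b["duration"]) > duration_tol:
        return False
    bigger, smaller = max(a["size"], b["size"]), min(a["size"], b["size"])
    if (bigger - smaller) / bigger > size_ratio:
        return False
    dists = [bit_distance(x, y) for x, y in zip(a["phashes"], b["phashes"])]
    if max(dists) > max_frame_dist:  # a single diverging frame fails the pair
        return False
    return sum(dists) / len(dists) < max_avg_dist
```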
The pHash signature is built from 3 fixed timeline points:
`15%`, `50%`, `85%`
A file can still fail visual matching even if two of the three frames look close. This is intentional.
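A sketch of how the three sample points might translate into ffmpeg frame grabs. It assumes `ffmpeg` on `PATH`; the extracted PNG bytes would then be opened with Pillow and hashed with `imagehash.phash`:

```python
import subprocess

SAMPLE_POINTS = (0.15, 0.50, 0.85)  # the three fixed timeline points

def sample_timestamps(duration: float):
    """Offsets in seconds for the three fixed timeline points."""
    return [duration * p for p in SAMPLE_POINTS]

def extract_frame(path: str, ts: float, timeout: float = 30.0) -> bytes:
    """Grab one frame at timestamp `ts` as PNG bytes via ffmpeg."""
    cmd = ["ffmpeg", "-ss", f"{ts:.3f}", "-i", path,
           "-frames:v", "1", "-f", "image2", "-c:v", "png", "pipe:1"]
    return subprocess.run(cmd, capture_output=True, timeout=timeout,
                          check=True).stdout
```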
When two files are treated as the same content, the survivor is chosen in this order:
- higher resolution
- higher bitrate
- larger file size
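As a sketch, that ordering is a lexicographic max over (resolution, bitrate, size); field names here are illustrative:

```python
def choose_survivor(files):
    """Pick the keeper: highest resolution first, then bitrate, then size.
    Each file is a dict with width/height/bitrate/size_bytes."""
    return max(files, key=lambda f: (f["width"] * f["height"],
                                     f["bitrate"], f["size_bytes"]))
```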
Each file is tracked in SQLite with one of these states:
- `NEW`: indexed but not hashed yet
- `HASHED`: SHA-256 completed successfully
- `METADATA_EXTRACTED`: metadata and pHash are present
- `PROCESSED`: final survivor has been organized
- `DELETED`: duplicate removed or moved aside
- `BROKEN`: file exists but could not be processed safely
- `MISSING`: file existed in prior state but no longer exists on disk
- `PROCESSING_ERROR`: a later processing step failed after basic metadata work
Important distinction:
- `BROKEN` means the file is there, but unreadable or unusable
- `MISSING` means the file is gone outside the tool and Phase 0 synced the DB state back to reality
Dry run:
- no file is deleted, moved, or organized
- the pipeline still computes matches and reports what would happen
- recommended first pass for every new library
Live mode:
- exact duplicates are deleted
- visual duplicates are moved to `__TRASH_BIN__/`
- broken files are moved to `__QUARANTINE__/`
- survivors are moved to `Organized/`
The workflow is resumable because file state is stored in SQLite.
On rerun:
- already-processed rows remain in the database
- externally removed files are marked `MISSING` during Phase 0
- processed survivors can still be used as comparison targets for new visual candidates
The expensive I/O-heavy stages run in parallel worker threads:
- Phase 2 hashing
- Phase 3 metadata extraction and pHash generation
This is meant to preserve practical throughput on large libraries without changing the core match rules.
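A sketch of that worker pattern using `concurrent.futures`; function and variable names are illustrative:

```python
import os
from concurrent.futures import ThreadPoolExecutor, as_completed

def parallel_map(fn, items, workers=None):
    """Run an I/O-heavy step (hashing, ffprobe calls) across worker
    threads, collecting per-item errors instead of aborting the batch."""
    workers = workers or os.cpu_count() or 4
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(fn, item): item for item in items}
        for fut in as_completed(futures):
            item = futures[fut]
            try:
                results[item] = fut.result()
            except Exception as exc:
                errors[item] = exc
    return results, errors
```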
Current defaults from RefineryConfig:
- pHash threshold: `10`
- duration tolerance: `1.0` second
- size tolerance ratio: `0.3`
- thumbnail points: `0.15`, `0.50`, `0.85`
- FFmpeg timeout: `30` seconds
- DB batch size: `50`
- index batch size: `1000`
- minimum free disk space: `1 GiB`
- SQLite cache size pragma: `-64000`
- SQLite busy timeout pragma: `5000` ms
- default mode: `DRY_RUN=True`
- worker count: `os.cpu_count()` from CLI, fallback `4`
Validation rules:
- pHash threshold must be `> 0`
- duration tolerance must be `>= 0`
- batch size must be `> 0`
- FFmpeg timeout must be `> 0`
- size tolerance ratio must be in `(0, 1]`
- exactly 3 thumbnail points are required
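The defaults and validation rules above can be captured in a dataclass sketch. Field names are illustrative and may not match `RefineryConfig` exactly:

```python
from dataclasses import dataclass

@dataclass
class Config:
    """Sketch mirroring the documented defaults and validation rules."""
    phash_threshold: int = 10
    duration_tolerance: float = 1.0
    size_tolerance_ratio: float = 0.3
    thumbnail_points: tuple = (0.15, 0.50, 0.85)
    ffmpeg_timeout: float = 30.0
    batch_size: int = 50

    def __post_init__(self):
        if self.phash_threshold <= 0:
            raise ValueError("pHash threshold must be > 0")
        if self.duration_tolerance < 0:
            raise ValueError("duration tolerance must be >= 0")
        if self.batch_size <= 0 or self.ffmpeg_timeout <= 0:
            raise ValueError("batch size and FFmpeg timeout must be > 0")
        if not (0 < self.size_tolerance_ratio <= 1):
            raise ValueError("size tolerance ratio must be in (0, 1]")
        if len(self.thumbnail_points) != 3:
            raise ValueError("exactly 3 thumbnail points are required")
```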
The database contains a single main table:
files
Stored columns:
`id`, `path`, `filename`, `extension`, `size_bytes`, `sha256`, `duration`, `width`, `height`, `bitrate`, `phash`, `status`
Indexes:
`idx_status`, `idx_sha256`, `idx_duration`, `idx_status_duration`
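Putting the documented columns and indexes together, the schema is roughly the following. Types and constraints here are assumptions; the real DDL lives in `database.py`:

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS files (
    id          INTEGER PRIMARY KEY,
    path        TEXT NOT NULL UNIQUE,
    filename    TEXT NOT NULL,
    extension   TEXT,
    size_bytes  INTEGER,
    sha256      TEXT,
    duration    REAL,
    width       INTEGER,
    height      INTEGER,
    bitrate     INTEGER,
    phash       TEXT,
    status      TEXT NOT NULL DEFAULT 'NEW'
);
CREATE INDEX IF NOT EXISTS idx_status ON files(status);
CREATE INDEX IF NOT EXISTS idx_sha256 ON files(sha256);
CREATE INDEX IF NOT EXISTS idx_duration ON files(duration);
CREATE INDEX IF NOT EXISTS idx_status_duration ON files(status, duration);
"""

def open_db(path: str = ":memory:") -> sqlite3.Connection:
    """Open a connection and ensure the schema exists."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn
```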
SQLite runtime notes:
- WAL mode is enabled
- `synchronous=NORMAL` is used
- `temp_store=MEMORY` is used
- `auto_vacuum=INCREMENTAL` is enabled
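These settings are applied per connection; a sketch consistent with the documented pragma values:

```python
import sqlite3

def tune_connection(conn: sqlite3.Connection) -> None:
    """Apply the documented runtime pragmas to a fresh connection."""
    conn.execute("PRAGMA journal_mode=WAL")
    conn.execute("PRAGMA synchronous=NORMAL")
    conn.execute("PRAGMA temp_store=MEMORY")
    conn.execute("PRAGMA auto_vacuum=INCREMENTAL")
    conn.execute("PRAGMA cache_size=-64000")   # ~64 MiB page cache
    conn.execute("PRAGMA busy_timeout=5000")   # 5000 ms
```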
Practical implication:
- while active, SQLite may also create `refinery.db-wal` and `refinery.db-shm`
- these are normal WAL companion files, not corruption
Python packages used directly by the active code:
`Pillow`, `imagehash`, `rich`
System tools used directly by the active code:
`ffmpeg`, `ffprobe`
Media analysis details:
- `ffprobe` is used for metadata
- `ffmpeg` is used to extract thumbnails for pHash sampling
- if bitrate is missing from metadata, bitrate is estimated from `size_bytes / duration`
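A sketch of the probe-then-fallback flow. It assumes `ffprobe` on `PATH` and applies the `size_bytes / duration` estimate literally as described above; function names are illustrative:

```python
import json
import subprocess

def ffprobe_json(path: str, timeout: float = 30.0) -> dict:
    """Full metadata dump as JSON (ffprobe assumed on PATH)."""
    cmd = ["ffprobe", "-v", "quiet", "-print_format", "json",
           "-show_format", "-show_streams", path]
    out = subprocess.run(cmd, capture_output=True, timeout=timeout,
                         check=True).stdout
    return json.loads(out)

def effective_bitrate(meta: dict, size_bytes: int):
    """Use ffprobe's bit_rate when present; otherwise estimate from
    size_bytes / duration as described above. Returns None if neither
    bitrate nor duration is available."""
    fmt = meta.get("format", {})
    if fmt.get("bit_rate"):
        return float(fmt["bit_rate"])
    duration = float(fmt.get("duration", 0) or 0)
    return size_bytes / duration if duration > 0 else None
```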
Common failure paths:
- FFmpeg or FFprobe missing from `PATH`
- invalid source path
- insufficient free disk space in Phase 0
- unreadable or corrupt media files
- pHash generation failing because too few valid thumbnails were extracted
Safety choices by design:
- dry run is default
- visual duplicates go to trash instead of permanent deletion
- broken files go to quarantine instead of being silently ignored
- missing files are explicitly marked in the database instead of being misclassified as broken
Current limitations:
- this is not zero-dependency yet
- exact duplicate deletion is permanent in live mode
- visual matching is strict, but no automatic media matcher can guarantee perfect results for every edge case
- subtitle burn-ins, alternate intros/outros, and partial edits can still affect matching outcomes
- there is no tracked integration-test media corpus in the repository
- packaging is still basic; installation is manual rather than release-grade
- the tool assumes a local filesystem and is not built for cloud/object storage workflows
Operational realities to expect:
- very large libraries can still take time because FFmpeg and hashing are real I/O work
- WAL mode may leave `-wal` and `-shm` files next to the database while the DB is active
- moving files outside the tool between runs is supported, but it changes DB state to `MISSING`
- live mode changes the library layout, so external tools pointing at old paths may need to rescan
Near-term priorities:
- keep architecture simple and behavior-safe
- tighten docs around schema and operational expectations
- reduce dependency surface where it makes sense
- improve packaging and installation ergonomics
- add real tracked integration coverage with sample media fixtures
- explore a future leaner distribution path for non-media-core parts of the project
Relevant official docs:
- Python `concurrent.futures`: https://docs.python.org/3/library/concurrent.futures.html
- Python `pathlib`: https://docs.python.org/3/library/pathlib.html
- SQLite WAL: https://sqlite.org/wal.html
- FFmpeg docs: https://ffmpeg.org/documentation.html
Even with dry-run defaults and reversible handling for visual duplicates, this tool makes automated decisions about real files. Keep backups of anything you cannot afford to lose.