Tool that automatically aligns and merges humorous audio commentary with video files based on spoken word matching.
After you choose a video file and a commentary track, pick the spoken-word language to match on and a subtitle track to use for finding matching sections; an optional manual fine-tuning step lets you make the sync even more precise.
Finally, the commentary is merged with the chosen audio track at the correct offset, creating a new audio track in the original video file.
I set this up because I was using Apple Music for the riff and VLC for the video, and that is not a way to have fun.
- Extract audio from both video and commentary files
- Transcribe both using OpenAI Whisper (or subtitles if available)
- Align by matching phrases between the two transcripts
- Tune alignment interactively with spectrogram visualization
- Merge commentary with video audio, creating final output
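The alignment step above can be sketched as follows: index short word sequences (n-grams) from the video transcript by their timestamps, look up the same sequences in the commentary transcript, and take the most common time difference as the offset. This is an illustrative sketch, not auto-riffer's actual `align.py` code; `estimate_offset` and the `(text, start_seconds)` word format are assumptions.

```python
# Sketch: estimate the riff-to-video offset by matching word n-grams
# between the two transcripts. Not auto-riffer's real API.
from collections import Counter

def estimate_offset(video_words, riff_words, n=3):
    """Each word is a (text, start_seconds) tuple; returns the most
    common start-time difference across matching n-grams, or None."""
    index = {}
    for i in range(len(video_words) - n + 1):
        key = tuple(w for w, _ in video_words[i:i + n])
        index.setdefault(key, video_words[i][1])
    diffs = Counter()
    for i in range(len(riff_words) - n + 1):
        key = tuple(w for w, _ in riff_words[i:i + n])
        if key in index:
            # the riff repeats the movie line some seconds later
            diffs[round(riff_words[i][1] - index[key], 1)] += 1
    return diffs.most_common(1)[0][0] if diffs else None

video = [("i", 10.0), ("am", 10.2), ("your", 10.4), ("father", 10.6)]
riff  = [("i", 52.5), ("am", 52.7), ("your", 52.9), ("father", 53.1)]
print(estimate_offset(video, riff))  # → 42.5
```

Taking the most common difference rather than the first match makes the estimate robust to a few spurious n-gram collisions.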
```sh
# Clone and install
git clone https://github.com/jareklupinski/auto-riffer.git
cd auto-riffer
pip install -r requirements.txt

# System dependencies (macOS)
brew install ffmpeg tesseract

# For interactive mode (recommended)
pip install PyQt6
```

```sh
# Basic - inserts riff track into original video
python main.py movie.mkv riff.mp3

# Custom output file
python main.py movie.mkv riff.mp3 -o movie_with_riff.mkv

# Audio-only output
python main.py movie.mkv riff.mp3 --audio-only -o merged.m4a

# Skip interactive tuning
python main.py movie.mkv riff.mp3 --no-interactive

# Manual offset (skip alignment detection)
python main.py movie.mkv riff.mp3 --offset 42.5

# Better accuracy with larger model
python main.py movie.mkv riff.mp3 --whisper-model medium
```

| Option | Description |
|---|---|
| `-o, --output` | Output file path |
| `--audio-only` | Output merged audio only (no video) |
| `--whisper-model` | Whisper model: tiny, base, small, medium, large |
| `--video-volume` | Volume for video audio (default: 0.7) |
| `--riff-volume` | Volume for commentary (default: 1.0) |
| `--language` | Language code for recognition (default: en) |
| `--offset` | Manual offset in seconds |
| `--no-interactive` | Skip interactive fine-tuning UI |
| `--audio-track` | Audio track index (0-based) |
| `--subtitle-track` | Subtitle track index (0-based) |
```
auto-riffer/
├── main.py         # CLI and orchestration
├── models.py       # Data classes (Word, AudioTrack, etc.)
├── utils.py        # Utilities and dependency checking
├── media.py        # Media probing and audio extraction
├── subtitles.py    # Subtitle extraction and OCR
├── transcribe.py   # Whisper speech recognition
├── align.py        # Phrase matching and offset detection
├── interactive.py  # Interactive tuning UI
└── merge.py        # Audio merging and output creation
```
- FFmpeg with ffprobe
- Tesseract OCR (for bitmap subtitles)
- Python 3.10+
- openai-whisper
- torch
- pytesseract
- Pillow
- matplotlib
- scipy
- PyQt6 (recommended for interactive mode)
- Subtitles help: If your video has subtitles, they'll be used instead of transcribing audio, which is faster and often more accurate.
- GPU acceleration: Whisper automatically uses CUDA or Apple MPS if available.
- Interactive tuning: The spectrogram view shows both audio tracks. Adjust the offset until the patterns align, then use "Play Preview" to verify.
- Works best when the disembaudios repeat movie dialogue, enabling effective phrase matching.
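The spectrogram view from the tuning tip can be sketched with the scipy and matplotlib dependencies listed above: render both tracks' spectrograms, shifting the riff by the current offset so the patterns line up visually. Synthetic tones stand in for the real extracted audio; this is a sketch of the idea, not `interactive.py`'s actual code.

```python
# Sketch: side-by-side spectrograms of the video and riff tracks,
# with the riff shifted by the current offset. Synthetic audio only.
import numpy as np
from scipy.signal import spectrogram
import matplotlib
matplotlib.use("Agg")  # headless here; the real UI renders into PyQt6
import matplotlib.pyplot as plt

sr = 8000
t = np.arange(0, 3.0, 1 / sr)
video_audio = np.sin(2 * np.pi * 440 * t)         # stand-in movie track
riff_audio = np.sin(2 * np.pi * 440 * (t - 0.5))  # same tone, displaced
offset = 0.5                                      # current tuning value

fig, axes = plt.subplots(2, 1, sharex=True)
tracks = [("video", video_audio, 0.0), ("riff", riff_audio, offset)]
for ax, (label, audio, shift) in zip(axes, tracks):
    f, times, Sxx = spectrogram(audio, fs=sr)
    # shift the time axis so an aligned riff overlays the video
    ax.pcolormesh(times + shift, f, 10 * np.log10(Sxx + 1e-12),
                  shading="auto")
    ax.set_ylabel(f"{label} (Hz)")
fig.savefig("alignment_preview.png")
```

When the bright regions in both panels line up vertically, the offset is correct; the real UI adds the "Play Preview" button to confirm by ear.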


