Generate engaging videos from text using real YouTube clips with word-level precision.
Key Features:
- 🎯 Intelligent phrase matching - Combines consecutive words from the same video for smoother output
- 🎵 Professional audio enhancement - Optional Auphonic integration for noise reduction and normalization
- 📝 Interactive subtitles - Click any word to jump to that moment in the video
- ⚡ Fast parallel processing - Concurrent download and processing for speed
- 🔍 Full-text search - Find and preview clips before generating

Built with FastAPI, React + Vite, FFmpeg, and SQLite FTS5.
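The full-text search is backed by SQLite FTS5. A minimal sketch of how a phrase lookup against such an index works (the table name `words_fts` and its columns are illustrative, not the project's actual schema):

```python
import sqlite3

# Illustrative in-memory FTS5 index; the real database is data/youglish.db.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE VIRTUAL TABLE words_fts USING fts5(video_id, text);
    INSERT INTO words_fts VALUES ('abc123', 'hello world this is a test');
    INSERT INTO words_fts VALUES ('def456', 'another hello from a clip');
""")

def search_phrase(phrase: str, limit: int = 20):
    """Return video ids whose transcript contains the exact phrase."""
    rows = conn.execute(
        "SELECT video_id FROM words_fts WHERE words_fts MATCH ? LIMIT ?",
        (f'"{phrase}"', limit),  # double quotes make it an FTS5 phrase query
    )
    return [r[0] for r in rows]

print(search_phrase("hello world"))  # -> ['abc123']
```

Quoting the query string turns it into an FTS5 phrase match, which is what lets consecutive-word lookups find multi-word runs in a single transcript.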
```bash
# 1. Backend setup
cd backend
python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
cp .env.example .env        # Add your YouTube API key

# 2. Initialize database
python ingest.py            # Fetch video transcripts
python ingest_whisperx.py   # Import word-level timing data

# 3. Start servers (in separate terminals)
python run.py                            # Backend on http://localhost:8000
cd ../frontend && npm i && npm run dev   # Frontend on http://localhost:5173
```

Make sure you're logged into YouTube in Chrome - the backend uses your browser cookies for authentication.
- Phrase matching: Automatically finds multi-word phrases from the same video
- Word-level precision: Falls back to individual words when phrases aren't available
- Smart clip extraction: Adds subtle padding for natural transitions
- Placeholder cards: Creates title cards for missing words
- Auphonic integration: Professional noise reduction and normalization
- Side-by-side comparison: Keep original audio to hear the difference
- Automatic processing: Extract → enhance → merge workflow
- Free tier available: 2 hours/month on Auphonic
- Real-time sync: Highlights current word as video plays
- Clickable words: Jump to any word instantly
- Auto-scroll: Keeps active word visible
- Accurate timing: Only includes clips that made it into the final video
- Parallel processing: Concurrent downloads and video processing
- Batch concatenation: Fast FFmpeg-based stitching
- Smart caching: Reuses processed segments
- Configurable workers: Adjust parallelism for your hardware
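The download/processing parallelism above can be sketched with two thread pools, one per stage. `download_segment` and `process_segment` here are hypothetical stand-ins for the project's yt-dlp and FFmpeg steps:

```python
from concurrent.futures import ThreadPoolExecutor

def download_segment(clip):
    # Placeholder for the yt-dlp time-range download.
    return f"raw_{clip['video_id']}.mp4"

def process_segment(path):
    # Placeholder for the FFmpeg normalize/re-encode step.
    return path.replace("raw_", "processed_")

def build_segments(clips, max_download_workers=3, max_processing_workers=4):
    # Stage 1: fetch raw segments concurrently (I/O-bound).
    with ThreadPoolExecutor(max_workers=max_download_workers) as pool:
        raw = list(pool.map(download_segment, clips))
    # Stage 2: process them concurrently (CPU/FFmpeg-bound).
    with ThreadPoolExecutor(max_workers=max_processing_workers) as pool:
        return list(pool.map(process_segment, raw))

clips = [{"video_id": "abc"}, {"video_id": "def"}]
print(build_segments(clips))  # -> ['processed_abc.mp4', 'processed_def.mp4']
```

`pool.map` preserves input order, which matters here: segments must be concatenated in the order the words appear in the text.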
- Python 3.8+
- Node.js 16+
- FFmpeg installed (`brew install ffmpeg` on macOS)
- Chrome browser with YouTube login (for cookie authentication)
- Create `.env` file:

```bash
cd backend
cp .env.example .env
```

- Edit `.env` with your settings:
```bash
# Required
YOUTUBE_API_KEY=your_youtube_api_key_here
SEED_CHANNEL_IDS=UCYO_jab_esuFRV4b17AJtAw,UC8butISFwT-Wl7EV0hUK0BQ

# Optional - Audio Enhancement
AUPHONIC_API_TOKEN=your_auphonic_token   # Get 2 free hours/month

# Optional - Cookie Source
COOKIES_FROM_BROWSER=chrome              # or firefox, safari, etc.
```

- Initialize database:
```bash
python ingest.py            # Fetch transcripts from YouTube
python ingest_whisperx.py   # Import word-level timing data (526k words)
```

Frontend setup:

```bash
cd frontend
npm install
```

Set a custom API URL if needed:

```bash
VITE_API=http://localhost:8000 npm run dev
```

- Open the web interface at http://localhost:5173
- Enter your text (e.g., "don't worry it's working")
- Adjust phrase length slider (1-50 words)
- Optional: Enable audio enhancement
- Click Generate Video
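The same flow can be scripted against the backend with the standard library. The endpoint path `/generate` and `localhost:8000` are assumptions based on the API section below; a running backend is required for the actual call:

```python
import json
import urllib.request

def build_payload(text, max_phrase_length=10, enhance_audio=False):
    # Fields mirror the request schema documented in the API section.
    return {
        "text": text,
        "max_phrase_length": max_phrase_length,
        "enhance_audio": enhance_audio,
        "keep_original_audio": True,
        "add_subtitles": False,
        "aspect_ratio": "16:9",
    }

def generate(text, base="http://localhost:8000"):
    # Assumed endpoint path; adjust if the backend routes it differently.
    req = urllib.request.Request(
        f"{base}/generate",
        data=json.dumps(build_payload(text)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# result = generate("don't worry it's working")
# print(result["video_url"])
```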
Generate a stitched video from text.
Request:

```json
{
  "text": "hello world",
  "max_phrase_length": 10,
  "enhance_audio": false,
  "keep_original_audio": true,
  "add_subtitles": false,
  "aspect_ratio": "16:9"
}
```

Response:
```json
{
  "status": "success",
  "video_url": "/videos/generated_1234567890.mp4",
  "word_timings": [
    {"word": "hello", "start": 0.0, "end": 0.5},
    {"word": "world", "start": 0.5, "end": 1.2}
  ],
  "original_video_url": "/videos/generated_1234567890_original.mp4"
}
```

Search for phrases in video transcripts.
```bash
curl 'http://localhost:8000/search?q=hello%20world&lang=en&limit=20'
```

Full set of generation options:

```
{
  "text": "your text here",
  "max_phrase_length": 10,       # 1-50, longer = smoother videos
  "clip_padding_start": 0.15,    # Seconds before word
  "clip_padding_end": 0.15,      # Seconds after word
  "add_subtitles": false,        # Burn-in subtitles
  "aspect_ratio": "16:9",        # or "9:16", "1:1"
  "watermark_text": null,        # Optional watermark
  "intro_text": null,            # Optional intro card
  "outro_text": null,            # Optional outro card
  "enhance_audio": false,        # Auphonic enhancement
  "keep_original_audio": true,   # Save comparison file
  "max_download_workers": 3,     # Parallel downloads
  "max_processing_workers": 4    # Parallel processing
}
```

- Get Auphonic API token: Sign up (2 free hours/month)
- Add to `.env`: `AUPHONIC_API_TOKEN=your_token`
- Enable in UI: Toggle "Audio Enhancement" when generating videos
What it does:
- Noise reduction (dynamic/speech_isolation method)
- Hum removal (50/60 Hz mains)
- Volume leveling
- Loudness normalization (-16 LUFS)
- De-reverb and de-breath processing
📖 See AUPHONIC_SETUP.md for detailed configuration options.
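Auphonic does this processing in the cloud. As a local point of reference for the -16 LUFS normalization step, FFmpeg's `loudnorm` filter targets the same loudness; this is an illustrative stand-in, not the project's Auphonic client:

```python
import subprocess

def loudnorm_cmd(src, dst, target_lufs=-16):
    # Build an FFmpeg command targeting the same integrated loudness
    # (-16 LUFS) that the Auphonic preset uses. TP/LRA values here are
    # common defaults, not taken from the project's configuration.
    return [
        "ffmpeg", "-y", "-i", src,
        "-af", f"loudnorm=I={target_lufs}:TP=-1.5:LRA=11",
        dst,
    ]

# subprocess.run(loudnorm_cmd("in.wav", "out.wav"), check=True)
print(loudnorm_cmd("in.wav", "out.wav"))
```

This covers only loudness; noise reduction, hum removal, and de-reverb remain Auphonic-side features.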
```bash
# Generate video via CLI
python -m video_stitcher.cli \
    --text "hello world" \
    --database data/youglish.db \
    --output test.mp4 \
    --max-phrase-length 10 \
    --enhance-audio \
    --verbose

# Test phrase matching
python test_phrase_matching.py
```

Project layout:

```
backend/
├── app.py                  # FastAPI server
├── db.py                   # Database queries
├── ingest.py               # YouTube transcript fetcher
├── ingest_whisperx.py      # Word-level data importer
├── video_stitcher/         # Video generation engine
│   ├── video_stitcher.py   # Main orchestrator
│   ├── database.py         # Word/phrase lookup
│   ├── downloader.py       # yt-dlp integration
│   ├── video_processor.py  # FFmpeg operations
│   ├── concatenator.py     # Video stitching
│   └── auphonic_client.py  # Audio enhancement
└── data/
    ├── youglish.db         # SQLite + FTS5 index
    └── live_whisperx_526k_with_seeks.jsonl

frontend/
└── src/
    └── App.jsx             # React UI with interactive subtitles
```
- Phrase Matching: Searches for longest consecutive word sequences in the same video
- Fallback: Uses individual words when phrases aren't found
- Download: Extracts specific time segments using yt-dlp
- Processing: Normalizes audio, re-encodes for consistency, adds optional effects
- Enhancement: Optional Auphonic processing for professional audio
- Concatenation: Stitches all segments using FFmpeg
- Subtitles: Generates word timings for interactive playback
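The matching-and-fallback logic in the first two steps amounts to a greedy longest-match scan. A sketch, where `find_phrase` is a hypothetical stand-in for the FTS5 lookup:

```python
def plan_clips(words, max_phrase_length, find_phrase):
    """Greedily cover the word list with the longest available phrases."""
    clips, i = [], 0
    while i < len(words):
        # Try the longest run of consecutive words first, then shrink.
        for n in range(min(max_phrase_length, len(words) - i), 0, -1):
            phrase = " ".join(words[i:i + n])
            hit = find_phrase(phrase)  # a clip record, or None if absent
            if hit is not None:
                clips.append(hit)
                i += n
                break
        else:
            # No clip even for the single word: placeholder title card.
            clips.append({"placeholder": words[i]})
            i += 1
    return clips

# Toy index standing in for the database lookup.
index = {"hello world": {"video": "a"}, "again": {"video": "b"}}
print(plan_clips(["hello", "world", "again"], 10, index.get))
# -> [{'video': 'a'}, {'video': 'b'}]
```

Longer `max_phrase_length` values let the scan absorb more words per clip, which is why raising the slider tends to produce smoother output.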
"Failed to download segment"
- Ensure you're logged into YouTube in Chrome
- Try `COOKIES_FROM_BROWSER=firefox` in `.env` if using Firefox
- See YOUTUBE_COOKIE_SETUP.md for detailed auth setup
"No clips found for word"
- Word may not exist in the 526k word database
- Try different phrasing or check spelling
- System will create placeholder title cards for missing words
"Auphonic API token not set"
- Add `AUPHONIC_API_TOKEN` to `.env`
- Get token from https://auphonic.com/
- Run `pip install python-dotenv` if missing
Slow generation?
- Increase `max_download_workers` (default: 3)
- Increase `max_processing_workers` (default: 4)
- Disable `enhance_audio` for faster results
- Use a shorter `max_phrase_length` for quicker lookups
Out of memory?
- Decrease worker counts
- Process videos sequentially (set workers to 1)
- Clear the temp directory: `rm -rf backend/temp/*`
Backend:
- FastAPI - Modern Python web framework
- SQLite + FTS5 - Full-text search index
- yt-dlp - YouTube video downloader
- FFmpeg - Video/audio processing
- Auphonic API - Professional audio enhancement
Frontend:
- React 18 + Vite - Fast development and build
- Tailwind CSS - Utility-first styling
- YouTube IFrame API - Video playback
Data:
- WhisperX transcriptions - Word-level timing accuracy
- 526k word database - Comprehensive clip coverage