An autonomous pipeline for creating covers with any RVC v2 trained AI voice, using YouTube videos or a local audio file as input. It is aimed at developers who want to add singing functionality to their AI assistant, chatbot, or VTuber, and at people who want to hear their favourite characters sing their favourite songs.
Hyper-RVC/
├── app.py # Main WebUI entry point (Gradio)
├── main.py # Legacy entry point (redirects to app.py)
├── core.py # Backward-compatible shim (re-exports from main)
├── cli.py # CLI interface
│
├── main/ # Core processing modules
│   ├── __init__.py # Package init + audioop 3.13 shim
│   ├── core.py # Pipeline orchestrator (full_inference_program)
│   │
│   ├── uvr/ # Audio separation (vocal, karaoke, dereverb, deecho, denoise)
│   │   ├── __init__.py
│   │   ├── separator.py # High-level separation functions
│   │   └── models/ # Separation model architectures
│   │       ├── bs_roformer/ # BS-Roformer & Mel-Band-Roformer
│   │       ├── bandit/ # Band-Split RNN v1
│   │       ├── bandit_v2/ # Band-Split RNN v2
│   │       ├── scnet/ # SCNet
│   │       ├── scnet_unofficial/ # SCNet unofficial variant
│   │       ├── demucs4ht.py # Demucs v4 hybrid transformer
│   │       ├── mdx23c_tfc_tdf_v3.py # MDX23C TFC-TDF
│   │       ├── segm_models.py # Segmentation models
│   │       ├── torchseg_models.py # Torch segmentation models
│   │       ├── upernet_swin_transformers.py # UperNet Swin
│   │       ├── ensemble.py # Model ensembling
│   │       ├── inference.py # Inference engine
│   │       └── utils.py # Audio processing utilities
│   │
│   ├── rvc/ # RVC voice conversion
│   │   ├── __init__.py
│   │   ├── converter.py # High-level RVC conversion wrapper
│   │   └── engine/ # Applio RVC inference engine
│   │       ├── configs/config.py
│   │       ├── infer/infer.py
│   │       ├── infer/pipeline.py
│   │       └── lib/
│   │           ├── algorithm/ # Neural network architectures (11 modules)
│   │           ├── predictors/ # F0 extractors (CREPE, FCPE, RMVPE)
│   │           ├── utils.py
│   │           └── tools/ # Model download, TTS, audio split
│   │
│   ├── tts/ # Text-to-Speech (Edge TTS + RVC)
│   │   ├── __init__.py
│   │   └── synthesis.py # TTS generation + optional RVC conversion
│   │
│   ├── whisper/ # Whisper transcription
│   │   ├── __init__.py
│   │   ├── transcriber.py # High-level transcription wrapper
│   │   └── diarization/ # Speaker diarization engine
│   │       ├── whisper.py # Whisper model wrapper
│   │       ├── speechbrain.py # SpeechBrain integration
│   │       ├── ECAPA_TDNN.py # Speaker embedding model
│   │       ├── encoder.py # Speaker encoder
│   │       ├── features.py # Audio feature extraction
│   │       ├── segment.py # Voice activity segmentation
│   │       ├── embedding.py # Speaker embeddings
│   │       ├── audio.py # Audio preprocessing
│   │       └── parameter_transfer.py # Model weight transfer
│   │
│   └── tools/ # Shared utilities
│       ├── __init__.py
│       ├── variables.py # Model definitions, FP16 config
│       ├── config.py # Application configuration management
│       ├── file_utils.py # File search, model lookup, downloads
│       ├── audio_utils.py # Audio effects (Pedalboard), merging (pydub)
│       ├── downloader.py # Model & music download orchestration
│       ├── gdown.py # Google Drive download handler
│       ├── hf.py # HuggingFace download handler
│       ├── mediafire.py # MediaFire download handler
│       └── logger.py # Logging utilities
│
├── tabs/ # Gradio UI tabs
│   ├── full_inference.py # Voice Conversion tab
│   ├── tts_inference.py # TTS Generation tab
│   ├── whisper_transcription.py # Transcription tab
│   ├── download_music.py # Download Music tab
│   ├── download_model.py # Download Model tab
│   └── settings.py # Settings tab
│
├── assets/ # Static assets
│   ├── themes/ # Gradio themes
│   ├── i18n/ # Internationalization (8 languages)
│   ├── config.json # User settings
│   ├── logo.ico # Favicon
│   └── colab.ipynb # Google Colab notebook
│
├── docs/ # Documentation
├── tests/ # Test suite
├── requirements.txt
├── run.sh / run.bat # Launch scripts
└── update.sh / update.bat # Update scripts
# Install dependencies
pip install -r requirements.txt
# Start the WebUI
python app.py
# With custom options
python app.py --port 8080 --share --open
python cli.py list-models
python cli.py download-model --link https://huggingface.co/username/model
python cli.py download-music --link https://youtube.com/watch?v=...
python cli.py convert --model-path /path/to/model.pth --input-audio song.mp3
python cli.py convert --model-path model.pth --index-path index.pth \
  --input-audio song.mp3 --pitch 12 --reverb --denoise \
  --vocal-model "Mel-Roformer by KimberleyJSN" \
  --export-format-final mp3
python cli.py add-effects input.wav --room-size 0.8 --wet 0.4 --output-path output.wav
python cli.py merge \
  --vocals vocals.flac \
  --instrumental instrumental.flac \
  --backing-vocals backing.flac \
  --format mp3

The main/uvr/ module handles all audio source separation tasks using state-of-the-art deep learning models, including the Mel-Roformer, BS-Roformer, MDX23C, Demucs v4, Band-Split RNN (Bandit), and SCNet architectures:
- Vocal/instrumental separation
- Karaoke (lead + backing vocal) separation
- Dereverb processing
- Deecho processing
- Denoise processing
- Model ensembling for improved separation quality
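
The ensembling item in the list above can be pictured as combining the stems that several separation models produce for the same track. The sketch below assumes plain weighted averaging of time-aligned mono waveforms; `ensemble_average` and its arguments are hypothetical names, and the project's own ensemble.py may use a different strategy.

```python
from typing import List, Optional, Sequence


def ensemble_average(
    stems: Sequence[Sequence[float]],
    weights: Optional[Sequence[float]] = None,
) -> List[float]:
    """Weighted average of time-aligned mono waveforms (hypothetical helper)."""
    if not stems:
        raise ValueError("at least one stem is required")
    if weights is None:
        weights = [1.0] * len(stems)
    if len(weights) != len(stems):
        raise ValueError("one weight per stem is required")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    # Truncate to the shortest stem so every sample index exists in all stems.
    length = min(len(stem) for stem in stems)
    return [
        sum(w * stem[i] for w, stem in zip(weights, stems)) / total
        for i in range(length)
    ]


# Two models' vocal estimates for the same three samples (made-up values).
model_a = [0.2, 0.4, -0.1]
model_b = [0.4, 0.0, -0.3]
blended = ensemble_average([model_a, model_b], weights=[3.0, 1.0])
```

Giving one model a larger weight lets a stronger separator dominate while the second still smooths out its artifacts.
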
The main/rvc/ module wraps the Applio RVC inference engine for high-quality voice conversion, with support for multiple pitch extractors (CREPE, FCPE, RMVPE), embedder models, and various export formats. The engine includes a full pipeline architecture with attention-based generators, discriminators, and synthesizer modules.
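
To make the effect of a pitch setting concrete, the sketch below assumes (as is conventional for RVC tools, though not stated here) that a value such as `--pitch 12` is a shift in equal-tempered semitones applied to the extracted F0 contour. The function names are hypothetical and are not taken from the engine.

```python
from typing import List


def semitones_to_ratio(semitones: float) -> float:
    """Frequency ratio for a shift in equal-tempered semitones."""
    return 2.0 ** (semitones / 12.0)


def shift_f0(f0_hz: List[float], semitones: float) -> List[float]:
    """Scale an F0 contour; 0.0 marks an unvoiced frame and stays 0.0."""
    ratio = semitones_to_ratio(semitones)
    return [f * ratio for f in f0_hz]


# A tiny contour in Hz with one unvoiced frame in the middle.
contour = [220.0, 0.0, 246.94]
shifted = shift_f0(contour, 12)  # 12 semitones = one octave up
```

Under that assumption, a shift of 12 doubles every voiced frequency and a shift of -12 halves it, which is the usual way to move a voice model between registers.
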
The main/tts/ module integrates Microsoft Edge TTS, with 400+ voices across 11 languages, and can optionally run RVC voice conversion on the generated audio, so an AI cover can be created from text input alone.
The main/whisper/ module provides OpenAI Whisper-based speech-to-text with word-level timestamps, multi-language support, and speaker diarization powered by SpeechBrain and ECAPA-TDNN speaker embeddings. It supports SRT, VTT, and JSON export formats.
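
As an illustration of the SRT export mentioned above, here is a minimal sketch that renders transcription segments as SRT cues. It assumes each segment is a dict with `start`, `end` (seconds), and `text` keys; the helper names are hypothetical and the project's transcriber may structure its output differently.

```python
from typing import Dict, List


def srt_timestamp(seconds: float) -> str:
    """Format a time in seconds as the SRT timestamp HH:MM:SS,mmm."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(segments: List[Dict]) -> str:
    """Render segments as numbered SRT cues separated by blank lines."""
    cues = []
    for index, seg in enumerate(segments, start=1):
        cues.append(
            f"{index}\n"
            f"{srt_timestamp(seg['start'])} --> {srt_timestamp(seg['end'])}\n"
            f"{seg['text'].strip()}\n"
        )
    return "\n".join(cues)


# Made-up segments for demonstration only.
segments = [
    {"start": 0.0, "end": 2.5, "text": " Hello there."},
    {"start": 2.5, "end": 3725.042, "text": " Still singing."},
]
srt_text = segments_to_srt(segments)
```

SRT uses a comma before the milliseconds, whereas VTT uses a period, so the two exports differ mainly in the timestamp separator and the file header.
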
The main/tools/ package holds shared helpers used across all modules:
- variables: Model definitions, FP16 hardware detection
- config: Application configuration management
- file_utils: File search, model metadata lookup, file downloads
- audio_utils: Reverb effects (Pedalboard), audio merging (pydub), FP16 config patching
- downloader: RVC model download and YouTube music download orchestration
- gdown / hf / mediafire: Platform-specific download handlers
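
One way to picture how a downloader could choose among the platform-specific handlers listed above is to dispatch on the URL's host. This is only a sketch: the `HANDLERS` mapping, the `pick_handler` function, and the host names it matches are assumptions for illustration, not the project's actual routing logic.

```python
from urllib.parse import urlparse

# Hypothetical mapping from a domain to the handler module named above.
HANDLERS = {
    "drive.google.com": "gdown",
    "huggingface.co": "hf",
    "mediafire.com": "mediafire",
}


def pick_handler(url: str) -> str:
    """Return the handler name for a download URL, or raise if unsupported."""
    host = (urlparse(url).hostname or "").lower()
    for domain, handler in HANDLERS.items():
        # Match the domain itself or any subdomain such as www.mediafire.com.
        if host == domain or host.endswith("." + domain):
            return handler
    raise ValueError(f"unsupported download host: {host!r}")


handler = pick_handler("https://huggingface.co/username/model")
```

Matching on the parsed hostname rather than on a substring of the URL avoids misrouting links that merely mention a platform name in their path or query string.
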
The full_inference_program() function in main/core.py coordinates the complete audio processing pipeline by calling the specialized sub-modules in sequence: vocal separation → karaoke separation → dereverb → deecho → denoise → RVC conversion → backing vocals → reverb → pitch adjust → merge.
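
The sequencing described above can be sketched as a loop over named stages. Everything in this example is hypothetical: the stage names mirror the order in the text, the stages are stand-ins that only record that they ran, and the `skip` option assumes that individual steps can be turned off, which the CLI flags suggest but the text does not confirm.

```python
from typing import Callable, Dict, Iterable

# Stage names in the order the text lists them.
STAGE_ORDER = [
    "vocal_separation",
    "karaoke_separation",
    "dereverb",
    "deecho",
    "denoise",
    "rvc_conversion",
    "backing_vocals",
    "reverb",
    "pitch_adjust",
    "merge",
]

Stage = Callable[[dict], dict]


def make_stub_stage(name: str) -> Stage:
    """Build a stand-in stage that only records that it ran."""
    def stage(state: dict) -> dict:
        return {**state, "history": state["history"] + [name]}
    return stage


def run_pipeline(state: dict, stages: Dict[str, Stage],
                 skip: Iterable[str] = ()) -> dict:
    """Run the stages in STAGE_ORDER, skipping any the caller disabled."""
    skipped = set(skip)
    for name in STAGE_ORDER:
        if name in skipped:
            continue
        state = stages[name](state)
    return state


stages = {name: make_stub_stage(name) for name in STAGE_ORDER}
result = run_pipeline({"input": "song.mp3", "history": []}, stages,
                      skip={"dereverb", "deecho"})
```

Keeping the order in one list and the implementations in a separate mapping makes it easy to test the sequencing without loading any models.
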
| Role | Member | Description |
|---|---|---|
| 👑 Base Project Owner | ShiromiyaG | Owner of RVC-AI-Cover-Maker-UI which this project is based on |
| 🔧 Base Project Contributor | Eddycrack864 | Contributor to RVC-AI-Cover-Maker-UI |
| 🧩 Fork Owner | BF667-IDLE | Hyper RVC fork owner & maintainer |
| 🧪 Colab UI | Nick088 | Start UI cells in Colab & Kaggle, local setup guide |
| 🧪 QA Testing | FullmatheusBallZ | Google Colab testing & quality assurance |

| Author | Role |
|---|---|
| ShiromiyaG | Original UI framework & cover pipeline design |
| IAHispano | RVC inference engine, pitch extraction & model management |
| beveradb | Python audio source separation wrapping UVR models |
| Anjok07 | Gold standard vocal removal with pretrained model weights |
| ZFTurbo | BS-Roformer, Mel-Band-Roformer, SCNet, MDX23C, Bandit, Demucs |
| SociallyIneptWeeb | AI cover generation pipeline & processing concepts |
| PhamHuynhAnh16 | Base RVC library code, additional F0 predictors & method fixes |

| Author | Purpose |
|---|---|
| OpenAI | Speech recognition & transcription |
| SpeechBrain Team | Speaker diarization & ECAPA-TDNN embeddings |
| Meta AI | Deep learning framework for all neural networks |
| HuggingFace | Model loading & pretrained model utilities |
| NumPy Team | Numerical computing & array operations |
| Microsoft | High-performance model inference |

| Author | Purpose |
|---|---|
| rany2 | 400+ voices in 11 languages via Microsoft Edge |
| Max Morrison | Neural pitch estimation (F0 extraction) |
| OpenVPI | Robust vocal pitch estimation |
| SCToolsystem | Fundamental frequency contour extraction |
| Meta Research | Voice embedding similarity search & retrieval |
| Max Morrison | PyTorch-native CREPE implementation |

| Author | Purpose |
|---|---|
| Spotify | Studio-quality reverb, EQ & audio effects |
| James Robert | Audio manipulation, format conversion & merging |
| librosa Team | Music & audio analysis, feature extraction |
| FFmpeg Project | Audio/video encoding, decoding & processing |
| Bastian Bechtold | Audio file I/O via libsndfile |
| SciPy Team | Signal processing & scientific computing |

| Author | Purpose |
|---|---|
| yt-dlp contributors | YouTube & 1000+ site audio/video downloader |
| HuggingFace | Model & dataset hosting for pretrained RVC models |
| Kentaro Wada | Google Drive file downloader |
| Kenneth Reitz | HTTP library for Python |
| Casper da Costa-Luis | Progress bars for downloads & processing |

| Author | Purpose |
|---|---|
| HuggingFace | Web UI framework with tabs, sliders & file uploads |
| Python Software Foundation | Core language runtime |
| Freepik | Cyber-themed cover image for the WebUI |

Built with ❤️ by the Hyper-RVC community · Open Source under MIT License