yoongi43/NoiseRobustVRVQ

Towards Bitrate-Efficient and Noise-Robust Speech Coding with Variable Bitrate RVQ

Official implementation of “Towards Bitrate-Efficient and Noise-Robust Speech Coding with Variable Bitrate RVQ” (INTERSPEECH 2025).

Paper Link: arXiv

🔊 Audio Samples

Model outputs (audio samples & importance maps) are available here:
sample_link


⚙️ Environment Setup

| Item | Command / Details |
| --- | --- |
| Python | 3.12 |
| Create Conda env | conda create -n noise_vrvq python=3.12 |
| Activate env | conda activate noise_vrvq |
| PyTorch | conda install pytorch==2.2.2 torchvision==0.17.2 torchaudio==2.2.2 pytorch-cuda=12.1 -c pytorch -c nvidia |
| Mamba-SSM | pip install mamba-ssm==1.2.0.post1 --no-build-isolation |
| Other deps | pip install -r requirements.txt |
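
After installation, a quick check of the pinned versions can catch a mismatched environment before training starts. The following is a minimal sketch, not part of the repository: the helper name `check_env` is hypothetical, and the distribution names (`torch`, `torchaudio`, `mamba-ssm`) are assumed to be the names under which the packages register their metadata.

```python
from importlib import metadata

# Versions taken from the setup commands above.
# The distribution names used as keys are assumptions.
EXPECTED = {
    "torch": "2.2.2",
    "torchaudio": "2.2.2",
    "mamba-ssm": "1.2.0.post1",
}


def check_env(expected=EXPECTED):
    """Return {package: (installed_version_or_None, matches_expected)}."""
    report = {}
    for name, wanted in expected.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None
        # Builds may carry a local suffix such as "+cu121"; compare the public part only.
        ok = installed is not None and installed.split("+")[0] == wanted
        report[name] = (installed, ok)
    return report


if __name__ == "__main__":
    for name, (installed, ok) in check_env().items():
        status = "OK" if ok else "MISMATCH"
        print(f"{name}: installed={installed} [{status}]")
```

Run it inside the activated `noise_vrvq` environment; any line marked `MISMATCH` points at a package to reinstall.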

📂 Dataset


🚀 Training Example

Run the training scripts with the commands below. Each script takes the experiment configuration name followed by the GPU index.

Example: 16 kHz VBR model training on GPU index 0

## Stage 1 training
$ bash scripts/stage1_train_clean_script.sh VBR_16k 0

## Stage 2 training
$ bash scripts/stage2_train_feat_dn_script.sh VBR_feat_denoise_16k 0

Other experiment configurations can be found in the conf/stage1_clean_recon and conf/stage2_denoising directories.
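
To launch both stages unattended, the two commands can be chained from a small wrapper. This is a sketch under assumptions, not a script shipped with the repository: it assumes stage 2 should start only after stage 1 exits successfully, and the helper names `build_commands` and `run_stages` are hypothetical.

```python
import subprocess

# Script paths are taken from the training commands above.
STAGE1_SCRIPT = "scripts/stage1_train_clean_script.sh"
STAGE2_SCRIPT = "scripts/stage2_train_feat_dn_script.sh"


def build_commands(stage1_conf, stage2_conf, gpu_idx):
    """Build both training commands as argument lists (config name, then GPU index)."""
    gpu = str(gpu_idx)
    return [
        ["bash", STAGE1_SCRIPT, stage1_conf, gpu],
        ["bash", STAGE2_SCRIPT, stage2_conf, gpu],
    ]


def run_stages(stage1_conf, stage2_conf, gpu_idx, dry_run=False):
    """Run stage 1, then stage 2; stop at the first failing stage."""
    commands = build_commands(stage1_conf, stage2_conf, gpu_idx)
    for cmd in commands:
        print("$", " ".join(cmd))
        if not dry_run:
            # check=True raises CalledProcessError on a non-zero exit code.
            subprocess.run(cmd, check=True)
    return commands


if __name__ == "__main__":
    # dry_run=True only prints the commands; set it to False to actually train.
    run_stages("VBR_16k", "VBR_feat_denoise_16k", 0, dry_run=True)
```

Run it from the repository root so the relative script paths resolve.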

📂 Checkpoints

Trained model checkpoints can be downloaded from the following links:

conf/stage2_denoising/VBR_feat_denoise_16k_modulation: Google Drive

conf/stage2_denoising/VBR_feat_denoise_48k_modulation: Google Drive

🧪 Evaluation

To evaluate on the EARS-WHAM test set, run:

Example: 16 kHz VBR model evaluation on GPU index 0

$ bash evaluate/eval_test.sh VBR_feat_denoise_16k 0

Results (CSV of evaluation metrics) will be saved to:

evaluate/results/${EXPNAME}/evaluation_results.csv
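
The CSV can be summarized with a few lines of standard-library Python. The sketch below makes assumptions about the file layout, which is not documented here: it expects a header row with one result per following row, and it averages every column whose values all parse as numbers. The function name `summarize_results` is hypothetical.

```python
import csv
from statistics import mean


def summarize_results(csv_path):
    """Return {column: mean} for every column whose non-empty cells are all numeric."""
    with open(csv_path, newline="") as f:
        rows = list(csv.DictReader(f))
    summary = {}
    if not rows:
        return summary
    for column in rows[0].keys():
        if column is None:  # cells beyond the header width; ignore them
            continue
        values = []
        numeric = True
        for row in rows:
            cell = (row.get(column) or "").strip()
            if not cell:
                continue  # skip empty cells
            try:
                values.append(float(cell))
            except ValueError:
                numeric = False  # non-numeric column, e.g. file names
                break
        if numeric and values:
            summary[column] = mean(values)
    return summary


if __name__ == "__main__":
    # Assumes EXPNAME matches the name passed to the evaluation script.
    path = "evaluate/results/VBR_feat_denoise_16k/evaluation_results.csv"
    for column, value in summarize_results(path).items():
        print(f"{column}: {value:.4f}")
```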

🎧 Inference example

See evaluate/infer_audio.py for the inference script. By default it processes samples from the test loader, but you can adapt it to run on a single file. The script:

  1. Reconstructs clean audio from noisy input

  2. Saves output audio to evaluate/output_audio/${EXPNAME}

  3. Saves importance maps at each quantization level

To run the provided example:

bash evaluate/infer_audio_example.sh VBR_feat_denoise_16k 0

📌 Code Base

The code is mainly based on DAC [2], VRVQ [3], and SEMamba [4].

📝 References

[1] Julius Richter, Yi-Chiao Wu, Steven Krenn, Simon Welker, Bunlong Lay, Shinji Watanabe, Alexander Richard, Timo Gerkmann, EARS: An Anechoic Fullband Speech Dataset Benchmarked for Speech Enhancement and Dereverberation, INTERSPEECH 2024. Paper Link

[2] Rithesh Kumar, Prem Seetharaman, Alejandro Luebs, Ishaan Kumar, Kundan Kumar, High-Fidelity Audio Compression with Improved RVQGAN, Advances in Neural Information Processing Systems, 2023. Paper Link

[3] Yunkee Chae, Woosung Choi, Yuhta Takida, Junghyun Koo, Yukara Ikemiya, Zhi Zhong, Kin Wai Cheuk, Marco A. Martínez-Ramírez, Kyogu Lee, Wei-Hsiang Liao, Yuki Mitsufuji, Variable Bitrate Residual Vector Quantization for Audio Coding, ICASSP 2025. Paper Link

[4] Rong Chao, Wen-Huang Cheng, Moreno La Quatra, Sabato Marco Siniscalchi, Chao-Han Huck Yang, Szu-Wei Fu, Yu Tsao, An Investigation of Incorporating Mamba for Speech Enhancement, ICASSP 2025. Paper Link

📚 Citation

If you find our work useful, please cite (To Be Updated):
