NRSER

This repo is the implementation of the paper:
Noise robust speech emotion recognition with signal-to-noise ratio adapting speech enhancement [Arxiv]

Dataset:

Speech emotion dataset:
MSP-PODCAST Database Release 1.10 (May 3, 2022) from The University of Texas at Dallas, Multimodal Signal Processing (MSP) Laboratory

Background noise dataset:
Audioset
The lists of wav files used for training, validation, and testing are in audioset-train-80.txt, audioset-train-20.txt, and audioset-val.txt.
The labels excluded in the environmental noise experiment are listed in human_generated_noise.csv.
Note: some YouTube videos were unavailable when we downloaded the data; see the lists above for the files used in this study.

Data preprocessing & preparation

(1) Generate the enhanced signals for all training data in advance (this saves time during model training).

python CMGAN/enhanced_speech_cpu.py --test_dir /path/to/wavfiles/dir #if you use cpu
python CMGAN/enhanced_speech_gpu.py --test_dir /path/to/wavfiles/dir #if you use gpu

The enhanced signals will be saved in the ./data/{dir}_en directory.

See CMGAN for more details.

(2) Prepare the training text files

For SNR-level detection:

  • noise_train.txt
  • noise_val.txt
wavpath_1; snr_level_1
wavpath_2; snr_level_2
e.g.
audio/clean_sampleA.wav; 1
audio/clean_sampleB.wav; 1
noise/_-lXVZ9QpO8.wav; 0
noise/_0bOQtWbqVc.wav; 0

Note: a sample whose values are all 0 will cause an error during training.
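
The list format and the all-zero check above can be sketched in a few lines of Python. This is only an illustration and is not part of the repo: the helper names parse_snr_list and is_all_zero are hypothetical, the SNR level is kept as a string because its exact type is not stated here, and the check assumes 16-bit PCM wav files.

```python
import wave


def parse_snr_list(text):
    """Parse 'wavpath; snr_level' lines into (path, level) tuples.

    The level is returned as a string (assumption: its type is not
    specified, so no conversion is attempted).
    """
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        path, level = [field.strip() for field in line.split(";", 1)]
        entries.append((path, level))
    return entries


def is_all_zero(wav_path):
    """Return True if every sample in a 16-bit PCM wav file is 0.

    A file with no frames at all is also reported as all-zero.
    """
    with wave.open(wav_path, "rb") as wav:
        if wav.getsampwidth() != 2:
            raise ValueError("this sketch only handles 16-bit PCM")
        frames = wav.readframes(wav.getnframes())
    # For signed 16-bit PCM, silence is encoded as all-zero bytes.
    return not any(frames)
```

Running is_all_zero over every path returned by parse_snr_list before training is one way to drop the problematic files from noise_train.txt and noise_val.txt.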

For emotion recognition:

  • emotion_train.txt
  • emotion_val.txt
wavpath_1; emotion_category_1; A:arousal_1; V:valence_1; D:dominance_1;
wavpath_2; emotion_category_2; A:arousal_2; V:valence_2; D:dominance_2;
e.g. 
audio_noisy/noisy_sampleA.wav; N; A:4.500000; V:4.500000; D:5.000000;
audio_noisy/noisy_sampleB.wav; N; A:4.500000; V:4.500000; D:5.000000;
audio/clean_sampleA.wav; N; A:4.500000; V:4.500000; D:5.000000;
audio/clean_sampleB.wav; N; A:4.500000; V:4.500000; D:5.000000;
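
For reference, one line of this format can be parsed as follows. This is a minimal sketch, not the loader used by emotion_model.py; the function name parse_emotion_line is hypothetical.

```python
def parse_emotion_line(line):
    """Split one datalist line into (wavpath, category, scores).

    Expected shape: 'wavpath; category; A:x; V:y; D:z;'
    The trailing ';' yields an empty field, which is dropped.
    """
    fields = [f.strip() for f in line.split(";") if f.strip()]
    wavpath, category = fields[0], fields[1]
    scores = {}
    for field in fields[2:]:
        key, value = field.split(":", 1)
        scores[key.strip()] = float(value)
    return wavpath, category, scores


if __name__ == "__main__":
    example = "audio/clean_sampleA.wav; N; A:4.500000; V:4.500000; D:5.000000;"
    print(parse_emotion_line(example))
```

The scores dictionary is keyed by the letters used in the file (A for arousal, V for valence, D for dominance).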

Note: for each sample in the datalist, there must be a corresponding enhanced signal with the same name in the "{dir}_en" directory.

For example:
audio_noisy_en/noisy_sampleA.wav #enhanced wav of "audio_noisy/noisy_sampleA.wav"

audio_en/clean_sampleA.wav #enhanced wav of "audio/clean_sampleA.wav"
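
The naming rule above can be checked before training with a short script. This is a sketch under assumptions: the helper names enhanced_path and missing_enhanced are hypothetical, datalist paths are assumed to use forward slashes and to contain a directory component, and the root argument stands for wherever the "{dir}_en" directories actually live on your machine.

```python
import os
import posixpath


def enhanced_path(wavpath):
    """Map 'dir/name.wav' to 'dir_en/name.wav' (same file name)."""
    directory, filename = posixpath.split(wavpath)
    return posixpath.join(directory + "_en", filename)


def missing_enhanced(wavpaths, root="."):
    """Return the datalist entries whose enhanced file is not found under root."""
    return [
        path for path in wavpaths
        if not os.path.isfile(os.path.join(root, enhanced_path(path)))
    ]


if __name__ == "__main__":
    print(enhanced_path("audio_noisy/noisy_sampleA.wav"))
    print(enhanced_path("audio/clean_sampleA.wav"))
```

An empty list from missing_enhanced means every sample in the datalist has its enhanced counterpart.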

(3) Download SSL model

Download hubert_base_ls960.pt: HuBERT Base

Source code of NRSER

Training:

python snr_model.py                                                           #Training phase1: training of SNR-level detection block
python emotion_model.py  --fairseq_base_model /path/to/hubert_base_ls960.pt   #Training phase2: training of emotion recognition block
python nrser.py --fairseq_base_model /path/to/hubert_base_ls960.pt            #Training phase3: fine-tuning the model

Testing - emotion recognition:

e.g.
python test_gpu.py --datadir ./test_samples --ckptdir emotion_model_s2_ccc2_r1-snr_model_r1 #if you use gpu
python test_cpu.py --datadir ./test_samples --ckptdir emotion_model_s2_ccc2_r1-snr_model_r1 #if you use cpu

Testing - SNR-level detection (uses the SNR-level detector that has not been fine-tuned with emotion recognition):

e.g.
python test_snr.py --datadir ./test_samples --ckptdir snr_model_r1 

Evaluation code

evaluation_metric.py # calculates the Concordance Correlation Coefficient (CCC).
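
For readers unfamiliar with the metric, the Concordance Correlation Coefficient between predictions and labels is commonly defined as 2*cov(x, y) / (var(x) + var(y) + (mean(x) - mean(y))^2). The sketch below implements that standard definition with population statistics; it is an illustration only, and evaluation_metric.py may differ in details. The function name concordance_cc is hypothetical.

```python
def concordance_cc(pred, target):
    """Concordance Correlation Coefficient of two equal-length sequences.

    Uses population (divide-by-n) variance and covariance. The result is
    undefined (division by zero) when both inputs are constant and equal.
    """
    if len(pred) != len(target) or not pred:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(pred)
    mean_p = sum(pred) / n
    mean_t = sum(target) / n
    var_p = sum((p - mean_p) ** 2 for p in pred) / n
    var_t = sum((t - mean_t) ** 2 for t in target) / n
    cov = sum((p - mean_p) * (t - mean_t) for p, t in zip(pred, target)) / n
    return 2 * cov / (var_p + var_t + (mean_p - mean_t) ** 2)


if __name__ == "__main__":
    # Hypothetical arousal predictions and labels, for illustration only.
    print(concordance_cc([4.5, 3.0, 5.0], [4.0, 3.5, 5.5]))
```

Unlike Pearson correlation, the metric penalizes a constant offset between predictions and labels, which is why the squared mean difference appears in the denominator.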

Pretrained model

Previous Google Drive

Citation

If you use the code in your research, please cite:
.... Thanks :-)!

License

  • The NRSER work is released under MIT License. See LICENSE for more details.

Acknowledgments

  • Speech Lab, CS, Columbia University, New York, United States
  • Bio-ASP Lab, CITI, Academia Sinica, Taipei, Taiwan
