Welcome to Predicting-Speaker-Quality! This repository contains the code used for my Bachelor's thesis, titled "Predicting Speaker Quality Using Embeddings". All of it is research code written by an inexperienced undergraduate student, so please don't expect perfect documentation. However, if you run into any trouble, or want to improve or add to the code base, don't hesitate to reach out to me. Found a mistake? Let me know as well.
Besides reading this README file, a good way to delve into the topic is to read the resulting thesis itself, which is included in this repository as `Predicting Speaker Quality Using Embeddings.pdf`.
To set up the project, follow these steps:
- Clone this repository.
- Install the requirements from `requirements.txt` using `pip install -r requirements.txt` if they are not already satisfied. If you like, you can do this in a virtual environment to keep things tidy.
- Download the Spoken Wikipedia Corpus (German, with audio) from https://nats.gitlab.io/swc/ and replace the directory `german` with it.
- Navigate into the main project directory and execute the `split.sh` script using `bash split.sh -m 10 -d 10 -p`, which will generate up to 10 samples of length 10 seconds from each audio file in the `wavs` directory and its subdirectories. This may take a while. To see all available options, type `bash split.sh -h`.
- Generate the GE2E and TRILL embeddings by running the `update_embeddings.py` script once. If you want to create new embeddings, for example because you have new .wav files in your demo folder, just run it again. It will remember which embeddings have already been created and delete embeddings that are no longer needed.
- Navigate into the `feature-scripts` directory and execute the `update_audio_features.sh` script using `bash update_audio_features.sh`. Just like the previous script, this one does all the bookkeeping for you and tracks new and deleted .wav files.
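The bookkeeping described for the embedding and feature scripts can be pictured as a set difference between the .wav files on disk and the outputs already stored. The following is a minimal sketch of that idea, not the repository's actual implementation: the function name `sync_embeddings`, the `.emb` file extension, the mirrored directory layout, and the `compute` callback are all assumptions made for illustration.

```python
from pathlib import Path


def sync_embeddings(wav_dir, emb_dir, compute):
    """Bring emb_dir in line with the .wav files found under wav_dir.

    `compute` is a hypothetical callback that takes the path of a .wav file
    and returns the embedding as bytes. Returns two sorted lists of relative
    names (without extension): those created and those deleted.
    """
    wav_dir, emb_dir = Path(wav_dir), Path(emb_dir)
    emb_dir.mkdir(parents=True, exist_ok=True)

    # Identify every file by its path relative to its root, minus the extension.
    wavs = {p.relative_to(wav_dir).with_suffix("") for p in wav_dir.rglob("*.wav")}
    embs = {p.relative_to(emb_dir).with_suffix("") for p in emb_dir.rglob("*.emb")}

    created = sorted(rel.as_posix() for rel in wavs - embs)
    deleted = sorted(rel.as_posix() for rel in embs - wavs)

    # New .wav files: compute and store an embedding for each one.
    for name in created:
        target = emb_dir / (name + ".emb")
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_bytes(compute(wav_dir / (name + ".wav")))

    # Embeddings whose .wav file no longer exists: remove them.
    for name in deleted:
        (emb_dir / (name + ".emb")).unlink()

    return created, deleted
```

Calling the function a second time without changing the .wav files does nothing, which is the property that makes re-running such a script cheap.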
- In order to train and evaluate the neural network models (DNNs and LSTMs), simply run the `keras_regressors.py` script. All parameters like network architecture, learning rate, etc. can be modified inside the file itself.
- For the kNN and random forest regressor, use the `sklearn_regressors.py` file. Like before, all parameters can be set inside the script itself.
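To give a rough idea of what a kNN regressor does with embeddings, here is a minimal sketch using only the Python standard library. It is not the code in `sklearn_regressors.py`, which (going by its name) presumably relies on scikit-learn. The function name `knn_predict`, the toy vectors, the ratings, and the choice of Euclidean distance with an unweighted mean are assumptions for illustration.

```python
import math


def knn_predict(train_x, train_y, query, k=3):
    """Predict a score for `query` as the mean target of its k nearest training vectors."""
    if not 1 <= k <= len(train_x):
        raise ValueError("k must be between 1 and the number of training samples")
    # Rank training samples by Euclidean distance to the query vector.
    order = sorted(range(len(train_x)), key=lambda i: math.dist(train_x[i], query))
    nearest = order[:k]
    # Unweighted average of the neighbours' targets.
    return sum(train_y[i] for i in nearest) / k


# Toy data: 2-dimensional stand-ins for embeddings, with made-up quality ratings.
embeddings = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
ratings = [1.0, 2.0, 3.0, 10.0]
prediction = knn_predict(embeddings, ratings, [0.1, 0.1], k=3)
```

Real GE2E or TRILL embeddings have far more dimensions, but the prediction step works the same way: find the closest known recordings and average their ratings.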
If you want to create plots from the resulting predictions (just like the ones seen in the thesis), take a look at the individual plotting scripts inside `plot-scripts`.
To evaluate the audio recordings inside `wavs/demo`, use the `demo.py` script.
The code in the `encoder` directory, which generates the GE2E embeddings, is forked from Corentin Jemine's Real-Time-Voice-Cloning repository (https://github.com/CorentinJ/Real-Time-Voice-Cloning) and is available in a better documented form under the name Resemblyzer.