This repository contains code for synthesizing speech audio from silently mouthed words captured with electromyography (EMG). It is the official repository for the papers Digital Voicing of Silent Speech at EMNLP 2020, An Improved Model for Voicing Silent Speech at ACL 2021, and the dissertation Voicing Silent Speech. The current commit contains only the most recent model, but the versions from prior papers can be found in the commit history. On an ASR-based open vocabulary evaluation, the latest model achieves a WER of approximately 36%. Audio samples can be found here.
The repository also includes code for directly converting silent speech to text. See the section labeled Silent Speech Recognition.
Compared to the original code, several changes have been made to improve the usability of the codebase:
- We added a script to download the data directly into the expected directory structure.
- The environment setup instructions have been updated to use `uv` to simplify installation of PyTorch with CUDA support. You can still use your preferred method to install dependencies via the `requirements.txt` or `pyproject.toml` files.
- The audio cleaning step has been improved by running the cleaning in parallel and saving the cleaned audio files to avoid re-sampling on each training run.
- Instead of building the dataset on the fly at the start of every training run, a script has been added to build the HDF5 dataset once and save it to disk for faster loading during training. This can save a significant amount of time depending on your hardware.
- Additional flags have been added to resume training from a checkpoint (in this way you can continue training from a pre-trained model to check performance improvements on downstream tasks).
- The DeepSpeech library has been deprecated in favor of SpeechBrain for ASR evaluation using Wav2Vec2 models due to compatibility issues (mainly CPU architecture incompatibilities).
- The CTC beam search decoder has been changed to use torchaudio's built-in decoder instead of the original ctcdecode library due to compatibility issues.
- Additional instructions have been added to download the KenLM language model and generate the lexicon needed for decoding.
- TensorBoard logging has been added to monitor training progress (see the example after this list).
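For example, assuming TensorBoard logs are written under the output directory configured for training (the path below is just a placeholder), training curves can be viewed with:
```
tensorboard --logdir ./models/transduction_model
```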
There are some additional improvements that could further enhance usability:
- Following Stanford - MONA LISA, `hydra` could be used for configuration management to simplify hyperparameter tuning and experiment tracking. Additionally, `pytorch-lightning` could be used to simplify the training loop and improve reproducibility.
- The dataset building process could be further optimized by using a more efficient data storage format or by implementing a more efficient data loading pipeline. Currently, the HDF5 dataset is built for a single-GPU setup; further work could be done to optimize for multi-GPU setups (e.g., using a data sampler suitable for distributed training).
- Training progress could be monitored more easily using tools like Weights & Biases to store experiment results and visualize training metrics over time with less setup required.
The EMG and audio data can be downloaded from https://doi.org/10.5281/zenodo.4064408. The scripts expect the data to be located in an `emg_data` subdirectory by default, but the location can be overridden with flags (see the data download script section below).
Force-aligned phonemes from the Montreal Forced Aligner are included as a git submodule, which must be updated using the process described in "Environment Setup" below. Note that no exception is raised if the alignment directory is missing, but logged phoneme prediction accuracies of 100% are a sign that the alignments have not been loaded correctly.
Note: A script to download the data directly into the expected directory structure is provided below.
The code has been tested with Python 3.10 and requires a number of Python packages, including PyTorch with CUDA support.
Set up a virtual environment and install the required packages. We suggest using uv to install all the dependencies, including PyTorch with the appropriate CUDA version.
```
uv sync
```
Alternatively, you can manually install the dependencies listed in `pyproject.toml`, making sure to install the correct version of PyTorch for your CUDA setup from https://pytorch.org/get-started/locally/.
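For example, with CUDA 12.1 you could install PyTorch manually with pip before the rest of the dependencies (adjust the index URL to your CUDA version as shown on the PyTorch website):
```
pip install torch --index-url https://download.pytorch.org/whl/cu121
```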
You will also need to pull git submodules for Hifi-GAN and the phoneme alignment data, using the following commands:
```
git submodule init
git submodule update
tar -xvzf text_alignments/text_alignments.tar.gz
```
Note: Due to compatibility issues, the `DeepSpeech` library has been deprecated in favor of SpeechBrain for ASR evaluation using Wav2Vec2 models. These models will be downloaded automatically when running the evaluation script and cached in your Hugging Face cache directory.
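If you would rather not use the default `~/.cache/huggingface` location, you can redirect the Hugging Face cache before running the evaluation, for example:
```
export HF_HOME=/path/to/your/cache
```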
For convenience, a script is provided to download the data directly into the expected directory structure.
Before running the script, check the configuration files in the config directory to ensure the data paths are set correctly. You will need to change the $DATA_PATH variable to your desired data directory path or bind it to a valid path in your environment.
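For example, a minimal way to do this (assuming you bind `$DATA_PATH` in your shell environment rather than editing the config files) is:
```
export DATA_PATH=/path/to/your/data
```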
Then run:
```
python download_data.py
```
This is an optional step. Training will be faster if you re-run the audio cleaning, which will save re-sampled audio so it doesn't have to be re-sampled every training run. In order to run the cleaning script, use the following command:
```
python data_collection/clean_audio.py
```
This script will run in parallel and may take a few minutes to complete, depending on your hardware. It will save cleaned audio files with the correct sample rate in the same directories as the original audio files, with filenames prefixed by `cleaned_`.
Note: If you do not run this step, the code will re-sample the original audio files on the fly during training, which takes more time because the audio is resampled on the CPU, creating many CPU-GPU transfers during training.
In the original code, the dataset was built on the fly for every training run, which can be time-consuming. To build the HDF5 dataset from the raw EMG and audio files, run the following command, replacing the output file path as needed:
```
python build_hdf5.py
```
In this way, the dataset only needs to be built once and can be loaded quickly during training.
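If you want to sanity-check the resulting file, a minimal sketch using `h5py` is shown below (the file name is a placeholder; the internal layout depends on the build script and configuration):
```python
import h5py

# Open the generated HDF5 file read-only and print every group/dataset path it contains.
with h5py.File("emg_dataset.hdf5", "r") as f:
    f.visit(print)
```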
Pre-trained models for the vocoder and transduction model are available at https://doi.org/10.5281/zenodo.6747411.
To train an EMG to speech feature transduction model, use the following command:
```
python transduction_model.py
```
All the training parameters can be adjusted in the `config/transduction_model.json` file. You can specify an output directory for saving models and logs in the configuration file. You can also start the training from a pre-trained model.
At the end of training, an ASR evaluation will be run on the validation set if a HiFi-GAN model checkpoint is provided.
To evaluate a model on the test set, use
```
python evaluate.py --model ./models/transduction_model/model.pt
```
The test set file can be changed in the configuration file.
This section is about converting silent speech directly to text rather than synthesizing speech audio. The speech-to-text model uses the same neural architecture but with a CTC decoder, and achieves a WER of approximately 28% (as described in the dissertation Voicing Silent Speech).
Due to compatibility issues, the original ctcdecode library has been replaced with torchaudio's built-in CTC beam search decoder.
In order to use the CTC beam search decoder with a KenLM language model, you will need to download the KenLM language model and generate the lexicon inferred from the data.
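For reference, these two files feed into torchaudio's decoder roughly as follows. This is a minimal sketch under assumed file names and settings, not the repository's exact code; the token list is a placeholder for the recognition model's output alphabet:
```python
from torchaudio.models.decoder import ctc_decoder

# Build a beam-search decoder backed by a KenLM language model and a lexicon.
# "KenLM/lm.bin" is produced by the download step below; the lexicon file name
# and token set here are illustrative placeholders.
decoder = ctc_decoder(
    lexicon="lexicon.txt",             # word -> token-sequence mapping generated from the data
    tokens=["-", "|", "a", "b", "c"],  # placeholder; use the model's actual output alphabet
    lm="KenLM/lm.bin",                 # binary KenLM language model
    beam_size=50,
    lm_weight=2.0,
)

# emissions: a (batch, time, num_tokens) tensor of log-probabilities on CPU
# hypotheses = decoder(emissions)
```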
Downloading the KenLM language model can be done using the provided script in the KenLM subdirectory.
```
cd KenLM
python download_LM.py --output_directory .
```
In this way, the language model will be downloaded to the current directory as `lm.bin`.
After downloading the language model, you will need to generate the lexicon file used for decoding. The lexicon can be generated using the provided script get_lexicon.py:
```
python get_lexicon.py
```
Pre-trained model weights can be downloaded from https://doi.org/10.5281/zenodo.7183877.
To train a model, run
```
python recognition_model.py
```
To run a test set evaluation on a saved model, use
```
python recognition_model.py --evaluate_saved "./models/recognition_model/model.pt"
```