PyTorch implementation of the method described in the paper Voice Synthesis for in-the-Wild Speakers via a Phonological Loop.
VoiceLoop is a neural text-to-speech (TTS) system that can transform text to speech in voices that are sampled in the wild. Some demo samples can be found here.
- Demo Samples (Updated 10/25/2017)
- Quick Start
- Setup
- Training
Follow the instructions in Setup and then simply execute:
python generate.py --npz data/vctk/numpy_features_valid/p318_212.npz --spkr 13 --checkpoint models/vctk/bestmodel.pth
Results will be placed in models/vctk/results. It will generate 2 samples:
- The generated sample will be saved with the gen_10.wav extension.
- Its ground-truth (test) sample is also generated and is saved with the orig.wav extension.
You can also generate the same text but with a different speaker, specifically:
python generate.py --npz data/vctk/numpy_features_valid/p318_212.npz --spkr 18 --checkpoint models/vctk/bestmodel.pth
This will generate the following sample.
Here is the corresponding attention plot:
Legend: the X-axis is output time (acoustic samples) and the Y-axis is input (text/phonemes). The left figure is speaker 10, the right is speaker 14.
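The two commands above differ only in the --spkr value, so a small wrapper can help when comparing several voices on the same utterance. The following is a minimal sketch and not part of the repository: the helper names build_generate_cmd and generate_for_speakers are hypothetical, and only the generate.py flags shown above (--npz, --spkr, --checkpoint) are used.

```python
import subprocess


def build_generate_cmd(npz_path, speaker_id, checkpoint):
    """Return the argument list for one generate.py call (hypothetical helper)."""
    return [
        "python", "generate.py",
        "--npz", npz_path,
        "--spkr", str(speaker_id),
        "--checkpoint", checkpoint,
    ]


def generate_for_speakers(npz_path, speaker_ids, checkpoint, dry_run=True):
    """Build (and optionally run) one generate.py command per speaker id."""
    commands = [build_generate_cmd(npz_path, s, checkpoint) for s in speaker_ids]
    if not dry_run:
        for cmd in commands:
            # Assumes the current directory is the repository root,
            # where generate.py lives.
            subprocess.check_call(cmd)
    return commands


if __name__ == "__main__":
    # Dry run: only print the commands that would be executed.
    for cmd in generate_for_speakers(
        "data/vctk/numpy_features_valid/p318_212.npz",
        [13, 18],
        "models/vctk/bestmodel.pth",
    ):
        print(" ".join(cmd))
```

With dry_run=True (the default) nothing is executed, which makes it easy to check the commands before starting generation.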
Finally, free text is also supported:
python generate.py --text "hello world" --spkr 1 --checkpoint models/vctk/bestmodel.pth
Requirements: Linux/OSX, Python 2.7 and PyTorch 0.1.12. The current version of the code requires CUDA support for training. Generation can be done on the CPU.
git clone https://github.com/facebookresearch/loop.git
cd loop
pip install -r scripts/requirements.txt
The data used to train the models in the paper can be downloaded via:
bash scripts/download_data.sh
The script downloads and preprocesses a subset of VCTK. This subset contains speakers with American accents.
The dataset was preprocessed using Merlin: from each audio clip we extracted vocoder features using the WORLD vocoder. After downloading, the dataset will be located under the data subfolder as follows:
loop
├── data
│   └── vctk
│       ├── norm_info
│       │   └── norm.dat
│       ├── numpy_features
│       │   ├── p294_001.npz
│       │   ├── p294_002.npz
│       │   └── ...
│       └── numpy_features_valid
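Each utterance is stored as a NumPy .npz archive. The array names inside these files are not documented here, so the sketch below makes no assumption about them and simply lists whatever arrays an archive contains. The function name summarize_npz is hypothetical; np.load and the .files attribute are standard NumPy.

```python
import numpy as np


def summarize_npz(path):
    """Map each array name in an .npz archive to (shape, dtype string)."""
    summary = {}
    # np.load on an .npz file returns a lazy NpzFile; close it when done.
    archive = np.load(path)
    try:
        for name in archive.files:
            try:
                array = archive[name]
            except ValueError:
                # Recent NumPy versions refuse pickled object arrays
                # unless allow_pickle=True is passed to np.load.
                summary[name] = (None, "unreadable without allow_pickle")
                continue
            summary[name] = (array.shape, str(array.dtype))
    finally:
        archive.close()
    return summary
```

For example, calling summarize_npz("data/vctk/numpy_features/p294_001.npz") after the download step would return a dictionary describing the arrays stored for that utterance.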
The preprocessing pipeline can be executed using the following script by Kyle Kastner: https://gist.github.com/kastnerkyle/cc0ac48d34860c5bb3f9112f4d9a0300.
Pretrained models can be downloaded via:
bash scripts/download_models.sh
After downloading, the models will be located under the models subfolder as follows:
loop
├── data
├── models
│   ├── vctk
│   │   ├── args.pth
│   │   └── bestmodel.pth
│   └── vctk_alt
Update 10/25/2017: Single speaker model coming soon
Finally, speech generation requires SPTK 3.9 and the WORLD vocoder, as in Merlin. To download the executables:
bash scripts/download_tools.sh
This results in the following subdirectories:
loop
├── data
├── models
├── tools
│   ├── SPTK-3.9
│   └── WORLD
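Before generating or training, it can be useful to confirm that all three download scripts produced the expected layout. The following is a minimal sketch under the assumption that the directory listings in this README are accurate; the names EXPECTED_PATHS and missing_paths are hypothetical and not part of the repository.

```python
import os

# Paths taken from the directory listings in this README, relative to the
# repository root. Adjust the list if your layout differs.
EXPECTED_PATHS = [
    os.path.join("data", "vctk", "norm_info", "norm.dat"),
    os.path.join("data", "vctk", "numpy_features_valid"),
    os.path.join("models", "vctk", "args.pth"),
    os.path.join("models", "vctk", "bestmodel.pth"),
    os.path.join("tools", "SPTK-3.9"),
    os.path.join("tools", "WORLD"),
]


def missing_paths(repo_root, expected=EXPECTED_PATHS):
    """Return the expected paths that do not exist under repo_root."""
    return [p for p in expected
            if not os.path.exists(os.path.join(repo_root, p))]


if __name__ == "__main__":
    missing = missing_paths(".")
    if missing:
        print("Missing paths (re-run the matching download script):")
        for path in missing:
            print("  " + path)
    else:
        print("All expected files and directories are present.")
```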
To train a new model on VCTK, first train the model using a noise level of 4 and an input sequence length of 100:
python train.py --expName vctk --data data/vctk --noise 4 --seq-len 100 --epochs 90
Then, continue training the model using a noise level of 2, on full sequences:
python train.py --expName vctk_noise_2 --data data/vctk --checkpoint checkpoints/vctk/bestmodel.pth --noise 2 --seq-len 1000 --epochs 90
If you find this code useful in your research, please cite:
@article{taigman2017voice,
title = {Voice Synthesis for in-the-Wild Speakers via a Phonological Loop},
author = {Taigman, Yaniv and Wolf, Lior and Polyak, Adam and Nachmani, Eliya},
journal = {ArXiv e-prints},
archivePrefix = "arXiv",
eprinttype = {arxiv},
eprint = {1705.03122},
primaryClass = "cs.CL",
year = {2017},
month = July,
}
Loop has a CC-BY-NC license.


