qianhwan/KaldiBasedSpeakerVerification
Folders and files
| Name | Name | Last commit date | ||
|---|---|---|---|---|
Repository files navigation
KaldiBasedSpeakerVerification ======================================== Author: Qianhui Wan Version: 1.0.0 Date : 2018-01-23 Prerequisite ------------ 1. Kaldi 5.3, as well as Altas and OpenFst required by Kaldi. https://github.com/kaldi-asr/kaldi 2. libfvad, Voice activity detection (VAD) library, based on WebRTC's VAD engine. https://github.com/dpirch/libfvad Installation ------------ 1. Install Kaldi 5.3: git clone https://github.com/kaldi-asr/kaldi.git kaldi --origin upstream cd kaldi 2. Install Kaldi's required libraries: cd to /kaldi/tools and follow INSTALL instructions there. 3. Compile and finish Kaldi install: cd to /kaldi/src and follow INSTALL instructions there. 4. Install libfvad: git clone https://github.com/dpirch/libfvad cd libfvad ./bootstrap ./configure make make install (perhaps sudo at this command) 5. Install KaldiBasedSpeakerVerification cd KaldiBasedSpeakerVerification/src *edit makefile; provide the correct locations for this project and the libraries. make (This will output 3 executables under /src: enroll, identifySpeaker and extractFeatures) Project file structure (under KaldiBasedSpeakerVerification folder) ---------------------------------- /examples contains enroll and test examples, along with example data /examples/iv contains i-vector features extracted from enrollment.(this can be empty before enrolling speakers, must have 2 files before testing) /examples/mat contains background model data, must have six files. /scripts contains scripts mainly used to create background model. /src contains code for 3 applications: creating a background model, enrolling speakers and speaker identification. Main applications ------------------------------------------------- /src/enroll.cpp This program is used to extract speech features from one speaker. Usage: enroll speakerId wavefile The output should look like: Not registered speaker: speakerId. Created a new spkid or Found registered speaker: speakerId. Updated speaker model The wavefile should be in .wav format. This will create/update two files in /iv: train_iv.ark and train_num_utts.ark. /src/identifySpeaker.cpp This program process a given audio clip and output person identification every ~3.2 seconds. Usage: identifySpeaker wavefile The output should look like: Family membmer detected! Speaker: 225 Family membmer detected! Speaker: 225 Stanger detected! Family membmer detected! Speaker: 227 Family membmer detected! Speaker: 227 ... It will also output the probability score for each segments -> this could be used to adjust the decision threshold due to different audio condition. Examples ------------------------------------------------- After installing all required applications, you can run the following examples to test if your installation is right. 1. make sure there is three folder in /examples /example_data /iv /mat (due to the file size limit of GitHub, final.ie was zipped into several parts. To unzip, do: cat iepart* -> final.ie) 2. run ./test1Enroll.sh This will enroll all speech files in /example_data/enroll The output should look like: The total active speech is 1.61 seconds. No registered speaker: 174. Create a new spkid Done. The total active speech is 15 seconds. Found registered speaker: 174. Update speaker model Done. The total active speech is 0.88 seconds. No registered speaker: 84. Create a new spkid Done. The total active speech is 3.47 seconds. Found registered speaker: 84. Update speaker model Done. 3. run ./test1Test.sh This will test speech /example_data/test/84/84-121550-0030.wav against all registered speaker The output should look like: Effective speech length: 2.605s.No family member detected. (score: 4.97931) Effective speech length: 5.685s.Family member detected! Speaker: 84 (score: 33.7779) Speech data is finished! Done. *Note: There will also be outputs of kaldi log which look like: LOG ([5.3.96~1-7ee7]:ComputeDerivedVars():ivector-extractor.cc:183) Computing derived variables for iVector extractor LOG ([5.3.96~1-7ee7]:ComputeDerivedVars():ivector-extractor.cc:204) Done. This tells you one audio segment has been processed and can be omitted by setting kaldi verbose level. Background Model Training ------------------------------------- /src/extractFeatures The program extracts 20-dim MFCC (with energy), append deltas and double deltas, and apply CMVN Usage: extractFeatures wav.scp ark,scp:feat.ark,feat.scp Input: wav.scp, a text list of speech file name and path Output: feat.ark, feat.scp -> same as kaldi. /scripts/data_prep.sh usage: data_prep.sh path_to_speech path_to_info prepare useful text file for later process, please refer to data_prep.sh for details /scripts/utt2spk_to_spk2utt.pl usage: utt2spk_to_spk2utt.pl utt2spk > spk2utt create the spk2utt file with given utt2spk file /scripts/train_ubm.sh usage: train_ubm.sh path_to_feat path_to_mat output: final.dubm, final.ubm please refer to train_ubm.sh for details /scripts/train_ivextractor.sh usage: train_ivextractor.sh path_to_feat path_to_mat output: final.ie please refer to train_ivextractor.sh for details /scripts/train_comp_plda.sh usage: train_comp_plda.sh path_to_feat path_to_mat output: final.plda, transform.mat, mean_vec please refer to train_comp_plda.sh for details The following folders will be created during running: /dev_data contains development dataset speech information, MFCC features and i-vectors /mat contains all trained models: final.dubm, final.ubm, final.ie, final.plda, transform.mat, mean_vec Note: The whole process can take several hours (e.g. 5 to 6 hours from VirtualBox-run CentOS version). Note: All scripts need to modified manually for the path (same as examples), this can be avoided if you add all paths to environmental variables.