This repo can:
- Transcribe audio in parallel with multiple Vosk ASR models on CPU.
- Compare each transcription against an utterance file containing all references, match results at the sentence level, and save matches to a pruned folder. Punctuation, capitalization, and repeated whitespace are ignored; the similarity threshold is adjustable.
- Generate a report of each model's WER/CER.
- Combine the results of multiple Vosk models by sentence-level voting to pick the most common result. If there is no common result, fall back to a model-priority approach (models with a lower WER published on the Vosk website get higher priority).
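The reference matching described above might look like the following sketch (`normalize` and `best_match` are illustrative names, not the repo's actual functions): lowercase, strip punctuation, collapse whitespace, then keep a hypothesis only if some reference clears an adjustable similarity threshold.

```python
import re
from difflib import SequenceMatcher

def normalize(text: str) -> str:
    """Lowercase, strip punctuation, and collapse whitespace."""
    text = re.sub(r"[^\w\s]", "", text.lower())
    return re.sub(r"\s+", " ", text).strip()

def best_match(hypothesis: str, references: list[str], threshold: float = 0.8):
    """Return the reference most similar to the hypothesis,
    or None if no reference clears the similarity threshold."""
    hyp = normalize(hypothesis)
    best_ref, best_score = None, 0.0
    for ref in references:
        score = SequenceMatcher(None, hyp, normalize(ref)).ratio()
        if score > best_score:
            best_ref, best_score = ref, score
    return best_ref if best_score >= threshold else None
```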
Models used in this example:
- vosk-model-small-en-us-0.15, 40 MB, 9.85 WER on LibriSpeech
- vosk-model-en-us-0.22, 1.8 GB, 5.69 WER on LibriSpeech
- vosk-model-en-us-0.22-lgraph, 127 MB, 7.82 WER on LibriSpeech
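For context, the WER figures above (and the per-model report) are word-level edit distance — substitutions, insertions, and deletions — divided by the number of reference words; CER is the same at the character level. The repo computes these with jiwer; a minimal pure-Python sketch of the metric for illustration:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance divided by
    reference word count (what jiwer computes for the report)."""
    ref, hyp = reference.split(), hypothesis.split()
    # classic dynamic-programming edit distance over words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[len(ref)][len(hyp)] / len(ref)
```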
Setup:

```shell
python3.11 -m venv venvs/vosk
source venvs/vosk/bin/activate
module load FFmpeg
pip install --upgrade pip
pip install requests vosk jiwer numpy
```

Run:

```shell
python transcribe.py data/task1/Nexdata_demo
```

Notes:
- More models can be integrated, e.g. ASR models from the NeMo toolkit, which also have promising WER. I tried this in my experiments, but loading NeMo models takes longer than I expected; if you have an idea how to tackle this, feel free to contact me.
- For the combined result, I currently use only sentence-level voting. I have also tried word-level voting and Vosk's word-level confidence scores, but the results were not good: words end up being dropped or inserted. If you have an idea how word-level voting could be implemented, feel free to raise it.
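A minimal sketch of the sentence-level vote with priority fallback (the `combine` function and the exact `MODEL_PRIORITY` ordering are illustrative assumptions, derived from the published LibriSpeech WERs — lower WER, higher priority):

```python
from collections import Counter

# Assumed priority order: models sorted by published LibriSpeech WER
MODEL_PRIORITY = [
    "vosk-model-en-us-0.22",          # 5.69 WER
    "vosk-model-en-us-0.22-lgraph",   # 7.82 WER
    "vosk-model-small-en-us-0.15",    # 9.85 WER
]

def combine(sentences_by_model: dict[str, str]) -> str:
    """Sentence-level vote across models; fall back to the
    highest-priority model's output when there is no majority."""
    counts = Counter(sentences_by_model.values())
    sentence, votes = counts.most_common(1)[0]
    if votes > 1:
        return sentence
    for model in MODEL_PRIORITY:  # no agreement: trust the best model
        if model in sentences_by_model:
            return sentences_by_model[model]
    raise ValueError("no model output available")
```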
For other models, visit the Vosk website.
For the original repo, visit the Vosk repo.