
improve the librispeech recipe#354

Merged
sw005320 merged 5 commits into espnet:master from sw005320:improve_librispeech
Aug 24, 2018

Conversation

Contributor

@sw005320 sw005320 commented Aug 14, 2018

I'm now improving the librispeech recipe, motivated by the RWTH setup (thanks to Rohit Prabhavalkar and Kazuki Irie):

  • sentencepiece model as the default
  • VGG-BLSTM encoder
  • shallow and wide network (3-layer BLSTM with 1024 units for the encoder, 1024-dim attention, unidirectional LSTM with 1024 units for the decoder)
  • fast convergence (the maximum number of epochs is reduced from 15 to 10)
  • significant WER improvement (from 7.2 to 5.1 on test_clean)
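For quick reference, the new defaults above can be summarized as a plain configuration sketch (the dict keys below are illustrative for readability only, not the actual ESPnet flag names):

```python
# Illustrative summary of the new librispeech recipe defaults listed above.
# NOTE: these key names are made up for this sketch; they are NOT the real
# ESPnet configuration flags.
config = {
    "token_unit": "sentencepiece",  # subword units as the default
    "encoder": {"type": "vgg-blstm", "layers": 3, "units": 1024},
    "attention_dim": 1024,
    "decoder": {"type": "unidirectional-lstm", "layers": 1, "units": 1024},
    "max_epochs": 10,  # reduced from 15 for faster convergence
}

print(config["encoder"]["type"], config["max_epochs"])
```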

TODO

  • check other configurations (adim and eprojs may not have to be so large?)
  • increase the number of encoder layers from 3 to x
  • increase the number of decoder layers from 1 to 2
  • tune some search parameters (rnnlm weight)
  • upload a model (Is there any already trained model? #322)

This was referenced Aug 14, 2018
@chenzhehuai

It is consistent with my observation except that adding more layers up to 6 with 1024 units still obtains some improvement.

@sw005320
Contributor Author

Thanks. I will follow your suggestions (when GPUs are available). Could you share your WER results? If they are better than https://github.com/espnet/espnet/blob/d51e76c0baa556e28a3e090335944478828fbc65/egs/librispeech/asr1/RESULTS, then I may follow your network architecture or ask you to make a PR.

@sw005320
Contributor Author

The trained models can be provided through the release (e.g., https://github.com/espnet/espnet/releases/download/untagged-f5ccde023841a43380a9/librispeech_asr1.tgz)

@chenzhehuai

No, my observation is from train_100. I think your system is the best so far.

@sw005320 sw005320 changed the title [WIP] improve the librispeech recipe improve the librispeech recipe Aug 24, 2018
@sw005320 sw005320 merged commit 168a9e9 into espnet:master Aug 24, 2018
@ruizhilijhu

@sw005320 Hi Shinji, is there any published paper on the RWTH setup? I couldn't find one online.

@sw005320
Contributor Author

sw005320 commented Sep 7, 2018

https://arxiv.org/pdf/1805.03294
This is not exactly the same as what we're now using, but our setup is based on discussions with them about what would be most effective given our current implementation.

@ruizhilijhu

From this paper, their system experimented with a subsampling factor of 32 in pretraining and 8 for fine-tuning, and they showed some improvement.

In our babel-10 setting, we used a subsampling factor of 4, and we allowed the minimum number of frames in an utterance to be 10.

Is there any specific reason to use our current setting?

@sw005320
Contributor Author

We don't have a specific reason, and we may test more aggressive subsampling like the RWTH paper, but I internally found that further subsampling (a factor of 8) slightly degrades performance on the Librispeech task.
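The trade-off discussed here is easy to see with a back-of-the-envelope frame count (a simplified model that just keeps every n-th frame; the actual VGG front end subsamples via strided convolution/pooling, but the arithmetic is the same):

```python
import math

def subsampled_length(num_frames, factor):
    """Encoder time steps remaining after frame-rate reduction,
    in a simplified keep-every-`factor`-th-frame model."""
    return math.ceil(num_frames / factor)

# With the 10-frame minimum utterance length mentioned for the babel-10
# setting, a larger factor leaves very few encoder steps:
print(subsampled_length(10, 4))    # → 3
print(subsampled_length(10, 8))    # → 2
print(subsampled_length(1000, 8))  # → 125 for a ~10 s utterance
```

This illustrates why aggressive subsampling can hurt short utterances even when it speeds up training overall.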
