improve the librispeech recipe #354
Conversation
It is consistent with my observation, except that adding more layers (up to 6 with 1024 units) still obtains some improvement.

Thanks. I will follow your suggestions (when GPUs are available). Could you share your WER results? If they are better than https://github.com/espnet/espnet/blob/d51e76c0baa556e28a3e090335944478828fbc65/egs/librispeech/asr1/RESULTS, then I may follow your network architecture, or ask you to make a PR.

The trained models can be provided through the release (e.g., https://github.com/espnet/espnet/releases/download/untagged-f5ccde023841a43380a9/librispeech_asr1.tgz).

No, my observation is from train_100. I think your system is the best so far.

@sw005320 Hi Shinji, for the RWTH setup, is there any published paper on this? I couldn't find one online.

https://arxiv.org/pdf/1805.03294

From this paper, their system used a subsampling factor of 32 during pretraining and 8 for fine-tuning, and they showed some improvement. In our babel-10 setting, we used a subsampling factor of 4, and we allowed a minimum of 10 frames per utterance. Is there any specific reason to use our current setting?

We don't have a specific reason, and we may test further subsampling like the RWTH paper, but I internally found that further subsampling (8) slightly degrades the performance on the LibriSpeech task.
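As a rough illustration of why the minimum utterance length interacts with the subsampling factor, here is a small sketch. The function name and the ceil-per-stage arithmetic are illustrative assumptions, not ESPnet's exact front-end implementation, which may differ by a frame or two:

```python
import math

def output_frames(n_frames: int, subsampling: int) -> int:
    """Approximate encoder output length after temporal subsampling.

    Assumption: subsampling is a power of 2, applied as repeated
    halving stages, each keeping ceil(n / 2) frames.
    """
    stages = int(math.log2(subsampling))
    for _ in range(stages):
        n_frames = math.ceil(n_frames / 2)
    return n_frames

# A 10-frame utterance (the babel-10 minimum) still yields a few
# encoder states with factor 4, but factor 32 would collapse it to
# a single state, losing almost all temporal resolution.
for factor in (4, 8, 32):
    print(factor, output_frames(10, factor))
```

With factor 4 a 10-frame utterance keeps 3 encoder states, with factor 8 only 2, and with factor 32 just 1, which is one intuition for why aggressive subsampling can hurt on short utterances.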
Now I'm working on improving the librispeech recipe, motivated by the RWTH setup (thanks to Rohit Prabhavalkar and Kazuki Irie).
test_clean) TODO
(adim, and eprojs may not have to be large?)