
Adapting transducer greedy decoding#2975

Merged
TParcollet merged 10 commits intospeechbrain:developfrom
younessdkhissi:patch-1
Nov 2, 2025

Conversation

@younessdkhissi
Contributor

@younessdkhissi younessdkhissi commented Sep 25, 2025

Adds a decoding loop at each decoding timestep. This is better adapted to the RNN-T loss and significantly improves results (especially for streaming recipes).

What does this PR do?

Fixes the transducer greedy decoding (this probably fixes the issue #2753)

Before submitting
  • Did you read the contributor guideline?
  • Did you make sure your PR does only one thing, instead of bundling different changes together?
  • Did you make sure to update the documentation with your changes? (if necessary)
  • Did you write any new necessary tests? (not for typos and docs)
  • Did you verify new and existing tests pass locally with your changes?
  • Did you list all the breaking changes introduced by this pull request?
  • Does your code adhere to project-specific code style and conventions?

PR review

Reviewer checklist
  • Is this pull request ready for review? (if not, please submit in draft mode)
  • Check that all items from Before submitting are resolved
  • Make sure the title is self-explanatory and the description concisely explains the PR
  • Add labels and milestones (and optionally projects) to the PR so it can be classified
  • Confirm that the changes adhere to compatibility requirements (e.g., Python version, platform)
  • Review the self-review checklist to ensure the code is ready for review

Adding a decoding loop at each decoding timestep: it is better adapted to the RNN-T loss and significantly improves results (especially for streaming recipes)
Set buffer_chunk_size to -1 to prevent dropping the first chunks. This happens especially when a small chunk size is used (e.g., 160 ms)
@Adel-Moumen
Collaborator

Hi @younessdkhissi,

Thanks for the PR.

Could you please elaborate a bit more on your suggested changes, and why you think they are necessary? Furthermore, can you report the overall improvements you got in a streaming setting? Ideally, we would like to improve our transducer decoding interfaces by aligning them more closely with the literature, so it would be great if you could provide some references as well.

Thanks Youness!

@younessdkhissi
Contributor Author

Thanks @Adel-Moumen for taking the time to look at my PR.
The idea behind the changes to the greedy search is to cover all the alignments we learn with the RNN-T loss during training. In the figure below, we see that many RNN-T alignment paths assume that several tokens can be decoded at the same timestep.
With the current SpeechBrain greedy search, we can decode only one token per timestep (similar to CTC), which differs from the RNN-Transducer objective.
In streaming ASR, a transducer tends to delay its predictions until it is very confident, by visiting more future frames (http://arxiv.org/abs/2111.01690). This pushes the model to learn to decode many tokens at the end. With the current algorithm, a number of deletions is therefore expected at the end of the transcriptions. That is why I mentioned issue #2753, where there is a difference between the evaluation inside the training script and the evaluation using the StreamingASR class: in that class, a number of zero chunks is injected at the end of the input stream, which gives the model more chances to decode the remaining tokens.
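To make the lattice argument concrete (toy numbers, not from the PR): an RNN-T alignment is a monotone path through the T×U lattice, interleaving U label emissions with T blank (frame-advance) steps, so the number of distinct alignments is the binomial coefficient C(T+U, U) — and most of those paths emit more than one token on some frame.

```python
from math import comb

# An RNN-T alignment interleaves U label emissions with T blank
# (frame-advance) steps: a monotone path through the T x U lattice.
T, U = 4, 3  # toy sizes: 4 encoder frames, 3 output tokens
n_paths = comb(T + U, U)
print(n_paths)  # 35 distinct alignments for this tiny lattice
```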

I have tried to implement the following paper, "DUAL-MODE ASR: UNIFY AND IMPROVE STREAMING ASR WITH FULL-CONTEXT MODELING" (https://openreview.net/pdf?id=Pz_dcqfcKW8) (that could be a future PR), and these are the results I get on LibriSpeech in streaming mode:

  • Dual-mode ASR (streaming):
    • Using the current greedy search:
      • test-clean: 4.32%
      • test-other: 11.04%
    • Using the greedy search proposed in the PR:
      • test-clean: 3.88%
      • test-other: 9.88%
    • Original paper results:
      • test-clean: 3.7%
      • test-other: 9.2%
[Figure: RNN-T alignment paths over the T×U lattice]

@TParcollet TParcollet added this to the v1.1.0 milestone Oct 9, 2025
@TParcollet TParcollet self-assigned this Oct 9, 2025
@TParcollet
Collaborator

Hi @younessdkhissi, thanks for the work. I am really not sure what this PR does. I see that there is a new, seemingly arbitrary for loop at each time step. Do you have a paper formally describing what is being done here? That would be important to attach to such a change to greedy decoding.

@younessdkhissi
Contributor Author

Hello @TParcollet
In this PR, I changed the greedy search so that, for each timestep, we decode until we predict the blank token, because the original Transducer paper (https://arxiv.org/abs/1211.3711) defines alignments over a T×U lattice (see the figure in the previous comment) with a blank symbol, allowing multiple label emissions in the same time step.
To avoid an infinite loop, I bounded this with a "max_iterations" for loop, so that for each timestep we can decode at most max_iterations non-blank tokens.
Here is a paper from NVIDIA where they describe their greedy search algorithm (https://www.isca-archive.org/interspeech_2024/galvez24_interspeech.pdf): line 9 of Algorithm 1 is equivalent to the for loop I added in my PR.
The choice of the max_iterations value was also inspired by this paper.
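A minimal sketch of the proposed inner loop, with a toy joint and prediction network standing in for the real modules (the names toy_joint, toy_predictor, and greedy_decode are illustrative only, not SpeechBrain's actual interfaces):

```python
BLANK = 0  # index of the blank symbol

def toy_joint(frame, pred_state):
    # Stand-in for the joint network's argmax: emits the frame's queued
    # tokens one by one, then blank once the queue is empty.
    return frame.pop(0) if frame else BLANK

def toy_predictor(pred_state, hyp):
    # Stand-in for the prediction-network update after each emission.
    return len(hyp)

def greedy_decode(encoder_frames, joint, predictor, max_iterations=10):
    """At each encoder frame, keep emitting non-blank tokens (updating
    the prediction network each time) until blank is predicted or the
    max_iterations safeguard is hit; only then advance to the next frame."""
    hyp = []
    pred_state = predictor(None, hyp)
    for frame in encoder_frames:
        for _ in range(max_iterations):
            token = joint(frame, pred_state)
            if token == BLANK:
                break  # advance to the next time step
            hyp.append(token)
            pred_state = predictor(pred_state, hyp)
    return hyp

# A frame that must emit two tokens back-to-back is handled correctly:
print(greedy_decode([[3, 5], [], [7]], toy_joint, toy_predictor))  # [3, 5, 7]
```

The max_iterations bound plays the role of the safeguard described above: a frame queued with more than max_iterations tokens is simply truncated at that cap instead of looping forever.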

@TParcollet
Collaborator

Hi @younessdkhissi and thank you. I'll take a closer look into this asap. In the meantime, could you please provide a few measurements of how the decoding speed is impacted by this change? Many thanks!

@younessdkhissi
Contributor Author

younessdkhissi commented Oct 15, 2025

Hi @TParcollet
I measured the inference time with both decoding methods and took the mean of 5 runs with different seeds.
I used an RTX 2080 Ti GPU. The measurements were made on a 12-layer Conformer transducer (an architecture similar to the existing SpeechBrain recipe) using a chunk size of 320 ms without any left context. Inference was run on the LibriSpeech test sets with a batch size of 4.

  • With the current greedy decoding (test-clean / test-other):
    • WER: 3.85% / 10.36%
    • Inference time: 2 min 29 s / 2 min 18 s
  • With the proposed greedy decoding:
    • WER: 3.79% / 10.30%
    • Inference time: 2 min 47 s / 2 min 35 s

If you want more measurements let me know :)

@TParcollet
Collaborator

@younessdkhissi thanks. Can you fix the tests and then we'll merge.

Added spaces for improved readability in the transducer.py file.
Collaborator

@TParcollet TParcollet left a comment


@younessdkhissi one last thing, could you make the max_steps an argument of the function instead?

@TParcollet
Collaborator

@younessdkhissi sorry, I should have been more precise. I think "max_steps" is a bit too generic and may be interpreted as "max number of decoding steps", which could be confusing. Thanks for adding it to the arguments, but could you give it a more adequate name?

Thanks!

@younessdkhissi
Contributor Author

@TParcollet It's my fault for making such a generic name for this variable. I propose "max_symbols_per_step" to avoid any confusion. Let me know if there are more changes to do :)

@TParcollet TParcollet merged commit 43c8ad1 into speechbrain:develop Nov 2, 2025
5 checks passed
@TParcollet
Collaborator

Thanks!
