
Adapting transducer greedy decoding#2975

Merged
TParcollet merged 10 commits intospeechbrain:developfrom
younessdkhissi:patch-1
Nov 2, 2025

Conversation

@younessdkhissi
Contributor

@younessdkhissi younessdkhissi commented Sep 25, 2025

Adds a decoding loop at each decoding timestep. This is better adapted to the RNN-T loss and significantly improves results (especially for streaming recipes).

What does this PR do?

Fixes the transducer greedy decoding (this probably fixes the issue #2753)

Before submitting
  • Did you read the contributor guideline?
  • Did you make sure your PR does only one thing, instead of bundling different changes together?
  • Did you make sure to update the documentation with your changes? (if necessary)
  • Did you write any new necessary tests? (not for typos and docs)
  • Did you verify new and existing tests pass locally with your changes?
  • Did you list all the breaking changes introduced by this pull request?
  • Does your code adhere to project-specific code style and conventions?

PR review

Reviewer checklist
  • Is this pull request ready for review? (if not, please submit in draft mode)
  • Check that all items from Before submitting are resolved
  • Make sure the title is self-explanatory and the description concisely explains the PR
  • Add labels and milestones (and optionally projects) to the PR so it can be classified
  • Confirm that the changes adhere to compatibility requirements (e.g., Python version, platform)
  • Review the self-review checklist to ensure the code is ready for review

Adding a decoding loop at each decoding timestep: it is better adapted to the RNN-T loss and significantly improves results (especially for streaming recipes)
Set buffer_chunk_size to -1 to prevent dropping the first chunks. This happens especially when a small chunk size is used (e.g., 160 ms)
@Adel-Moumen
Collaborator

Hi @younessdkhissi,

Thanks for the PR.

Could you please elaborate a bit more on your suggested changes, and why you think they are necessary? Furthermore, can you report the overall improvements you got in a streaming setting? Ideally, we would like to improve our transducer decoding interfaces by aligning them more closely with the literature, so it would be great if you could provide some references as well.

Thanks Youness!

@younessdkhissi
Contributor Author

Thanks @Adel-Moumen for taking the time to look at my PR.
The idea behind the changes to the greedy search is to cover all the alignments we learn with the RNN-T loss during training. In the figure below, we see that many RNN-T alignment paths assume that several tokens can be decoded at the same timestep.
With the current SpeechBrain greedy search, we can decode only one token per timestep (similar to CTC), which differs from the RNN-Transducer objective.
In streaming ASR, a transducer tends to delay its predictions until it is very confident, by visiting more future frames (http://arxiv.org/abs/2111.01690). This pushes the model to learn to decode many tokens at the end. With the current algorithm, a number of deletions is therefore expected at the end of the transcriptions. That is why I mentioned issue #2753, where there is a difference between the evaluation inside the training script and the evaluation using the StreamingASR class: in that class, a number of zero chunks is injected at the end of the input stream, which gives the model more chances to decode the remaining tokens.
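To make the lattice argument concrete (toy numbers, not from the PR): an RNN-T alignment is a monotone path through the T×U lattice, interleaving U label emissions with T blank (frame-advance) steps, so the number of distinct alignments is the binomial coefficient C(T+U, U) — and most of those paths emit more than one token on some frame.

```python
from math import comb

# An RNN-T alignment interleaves U label emissions with T blank
# (frame-advance) steps: a monotone path through the T x U lattice.
T, U = 4, 3  # toy sizes: 4 encoder frames, 3 output tokens
n_paths = comb(T + U, U)
print(n_paths)  # 35 distinct alignments for this tiny lattice
```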

I have tried to implement the following paper, "DUAL-MODE ASR: UNIFY AND IMPROVE STREAMING ASR WITH FULL-CONTEXT MODELING" (https://openreview.net/pdf?id=Pz_dcqfcKW8) (that could be a future PR), and these are the results I get on LibriSpeech in streaming mode:

  • Dual-mode ASR (streaming):
    • Using the current greedy search:
      • test-clean: 4.32%
      • test-other: 11.04%
    • Using the greedy search proposed in the PR:
      • test-clean: 3.88%
      • test-other: 9.88%
    • Original paper results:
      • test-clean: 3.7%
      • test-other: 9.2%
[Figure: RNN-T alignment paths over the T×U lattice]

@TParcollet TParcollet added this to the v1.1.0 milestone Oct 9, 2025
@TParcollet TParcollet self-assigned this Oct 9, 2025
@TParcollet
Collaborator

Hi @younessdkhissi, thanks for the work. I am really not sure what this PR does. I see that there is a new, seemingly arbitrary for loop at each time step. Do you have a paper formally describing what is being done here? That would be important to attach to such a change to greedy decoding.

@younessdkhissi
Contributor Author

Hello @TParcollet
In this PR, I changed the greedy search so that, for each timestep, we decode until we predict the blank token, because the original Transducer paper (https://arxiv.org/abs/1211.3711) defines alignments over a T×U lattice (see the figure in the previous comment) with a blank symbol, allowing multiple label emissions in the same time step.
To avoid an infinite loop, I bounded this with a "max_iterations" for loop, so that for each timestep we can decode at most max_iterations non-blank tokens.
Here is a paper from NVIDIA where they describe their greedy search algorithm (https://www.isca-archive.org/interspeech_2024/galvez24_interspeech.pdf): line 9 of Algorithm 1 is equivalent to the for loop I added in my PR.
The choice of the max_iterations value was also inspired by this paper.
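A minimal sketch of the proposed inner loop, with a toy joint and prediction network standing in for the real modules (the names toy_joint, toy_predictor, and greedy_decode are illustrative only, not SpeechBrain's actual interfaces):

```python
BLANK = 0  # index of the blank symbol

def toy_joint(frame, pred_state):
    # Stand-in for the joint network's argmax: emits the frame's queued
    # tokens one by one, then blank once the queue is empty.
    return frame.pop(0) if frame else BLANK

def toy_predictor(pred_state, hyp):
    # Stand-in for the prediction-network update after each emission.
    return len(hyp)

def greedy_decode(encoder_frames, joint, predictor, max_iterations=10):
    """At each encoder frame, keep emitting non-blank tokens (updating
    the prediction network each time) until blank is predicted or the
    max_iterations safeguard is hit; only then advance to the next frame."""
    hyp = []
    pred_state = predictor(None, hyp)
    for frame in encoder_frames:
        for _ in range(max_iterations):
            token = joint(frame, pred_state)
            if token == BLANK:
                break  # advance to the next time step
            hyp.append(token)
            pred_state = predictor(pred_state, hyp)
    return hyp

# A frame that must emit two tokens back-to-back is handled correctly:
print(greedy_decode([[3, 5], [], [7]], toy_joint, toy_predictor))  # [3, 5, 7]
```

The max_iterations bound plays the role of the safeguard described above: a frame queued with more than max_iterations tokens is simply truncated at that cap instead of looping forever.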

@TParcollet
Collaborator

Hi @younessdkhissi and thank you. I'll take a closer look into this asap. In the meantime, could you please provide a few measurements of how the decoding speed is impacted by this change? Many thanks!

@younessdkhissi
Contributor Author

younessdkhissi commented Oct 15, 2025

Hi @TParcollet
I measured the inference time with both decoding methods and took the mean of 5 runs with different seeds.
I used an RTX 2080 Ti GPU. The measurements were made on a 12-layer Conformer transducer (an architecture similar to the existing SpeechBrain recipe) using a chunk size of 320 ms without any left context. Inference was run on the LibriSpeech test sets with a batch size of 4.

  • With the current greedy decoding (test-clean / test-other):
    • WER: 3.85% / 10.36%
    • Inference time: 2 min 29 s / 2 min 18 s
  • With the proposed greedy decoding:
    • WER: 3.79% / 10.30%
    • Inference time: 2 min 47 s / 2 min 35 s

If you want more measurements let me know :)

@TParcollet
Collaborator

@younessdkhissi thanks. Can you fix the tests and then we'll merge.

Added spaces for improved readability in the transducer.py file.
Collaborator

@TParcollet TParcollet left a comment


@younessdkhissi one last thing, could you make the max_steps an argument of the function instead?

@TParcollet
Collaborator

@younessdkhissi sorry, I should have been more precise. I think "max_steps" is a bit too generic and may be interpreted as "max number of decoding steps", which could be confusing. Thanks for adding it to the arguments, but could you give it a more adequate name?

Thanks!

@younessdkhissi
Contributor Author

@TParcollet It's my fault for making such a generic name for this variable. I propose "max_symbols_per_step" to avoid any confusion. Let me know if there are more changes to do :)

@TParcollet TParcollet merged commit 43c8ad1 into speechbrain:develop Nov 2, 2025
5 checks passed
@TParcollet
Collaborator

Thanks!
