ASR training from scratch #2967
nata-kostina asked this question in Q&A · Unanswered
Replies: 1 comment · 2 replies
Hi, could you open an issue instead? This recipe is very heavy on the augmentation side, mostly used to showcase augmentation rather than to train a really good model. That may also be why it uses a CRDNN, which is an older model. If you open an issue, we'll take a look. Many thanks!
I am trying to retrain an ASR model on LibriSpeech from scratch.
We are training on an A100 GPU, and even with a small batch size of 8 the model consumes a very large amount of memory. Could you advise why this might be happening?
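For context on why a batch size of 8 can still exhaust memory: with long LibriSpeech utterances, the activations kept for backprop grow linearly with both batch size and utterance length. The sketch below is a rough back-of-the-envelope estimate only; the frame rate, hidden size, and layer count are illustrative assumptions, not values from the actual recipe.

```python
def activation_mb(batch_size, seconds, frames_per_sec=100,
                  hidden=2048, layers=5, bytes_per_val=4):
    """Rough MB of forward activations stored for backprop (fp32).

    Assumed dimensions (hypothetical, for illustration):
    ~100 feature frames per second, 2048 hidden units, 5 layers.
    """
    frames = seconds * frames_per_sec
    values = batch_size * frames * hidden * layers
    return values * bytes_per_val / 1024 ** 2

# Even at batch size 8, utterance length dominates:
print(activation_mb(8, 10))  # ~10 s utterances
print(activation_mb(8, 30))  # ~30 s utterances, 3x the memory
```

Because memory scales with the longest utterance in the batch, sorting or capping utterances by duration (or using duration-aware dynamic batching) usually helps far more than lowering the nominal batch size.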
We are also observing that training does not actually progress: the train loss doesn't change, and CER and WER remain very high. Any insights into what could be causing this?
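One quick sanity check for this symptom: early in CTC-style training, a model that has collapsed to emitting nothing produces empty hypotheses, which score at exactly 100 % WER, so a flat, very high WER often means "no output" rather than "wrong output". A minimal WER computation to verify a few decoded hypotheses by hand (this is an illustrative sketch, not SpeechBrain's own metric code):

```python
def wer(ref, hyp):
    """Word error rate: word-level edit distance / reference length."""
    r, h = ref.split(), hyp.split()
    # Standard dynamic-programming edit-distance table.
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,      # deletion
                          d[i][j - 1] + 1,      # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(r)][len(h)] / len(r)

print(wer("the cat sat", ""))  # empty hypothesis -> 1.0, i.e. 100 % WER
```

If the decoded hypotheses are all empty, the usual suspects are a learning rate that is too high or too low, a label/blank index mismatch, or an optimizer/scheduler misconfiguration.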
Thank you.
Below I share the ASR configuration: