This repository contains the official implementation of the following publications:
- Target Speaker Whisper — IEEE Xplore
- DiCoW: Diarization-Conditioned Whisper for Target Speaker Automatic Speech Recognition — ScienceDirect
- SE-DiCoW: Self-Enrolled Diarization-Conditioned Whisper — arXiv:2601.19194
This repository provides handy tools to prepare data, train, publish, and evaluate your models with just a few commands, as illustrated in the figure below.
DiCoW (Diarization-Conditioned Whisper) enhances Whisper for target-speaker ASR by conditioning the model on frame-level diarization probabilities.
These probabilities are converted into Silence–Target–Non-Target–Overlap (STNO) masks and injected into each encoder layer through Frame-level Diarization-Dependent Transformations (FDDT).
This approach enables Whisper to focus on the desired speaker without explicit speaker embeddings, making it robust to unseen speakers and diverse acoustic conditions.
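As a rough illustration of the STNO idea (not the repository's actual implementation), frame-level diarization probabilities can be decomposed into the four classes with simple thresholding; the probability matrix `p` and the 0.5 threshold below are assumptions for the sketch:

```python
import numpy as np

def stno_masks(p, target_idx, threshold=0.5):
    """Convert frame-level diarization probabilities into
    Silence / Target / Non-Target / Overlap (STNO) masks.

    p: (num_speakers, num_frames) array of speaker activity probabilities.
    Returns a (4, num_frames) one-hot array in STNO order.
    """
    active = p >= threshold                      # binarize speaker activity
    target = active[target_idx]                  # target speaker active?
    others = np.delete(active, target_idx, axis=0).any(axis=0)  # any other speaker active?

    silence     = ~target & ~others
    target_only = target & ~others
    non_target  = ~target & others
    overlap     = target & others
    return np.stack([silence, target_only, non_target, overlap]).astype(float)

# toy example: 2 speakers, 4 frames
p = np.array([[0.9, 0.8, 0.1, 0.0],   # target speaker
              [0.0, 0.7, 0.9, 0.1]])  # other speaker
masks = stno_masks(p, target_idx=0)
print(masks)
```

In DiCoW these per-frame masks then parameterize the FDDT transformations applied inside each encoder layer, rather than being used directly as shown here.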
SE-DiCoW (Self-Enrolled DiCoW) resolves ambiguities in overlapping speech regions by introducing a self-enrollment mechanism.
An enrollment segment—automatically selected where the diarizer predicts the target speaker as most active—is used as a reference through cross-attention conditioning at encoder layers to further bias the model toward the target speaker.
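The enrollment-selection step can be sketched as a sliding-window search over the diarizer's output; the window length, the dominance score, and the toy probabilities below are assumptions, not the repository's actual selection criterion:

```python
import numpy as np

def select_enrollment(p, target_idx, win=50):
    """Pick the window where the diarizer is most confident that the
    target speaker is active while other speakers are not.

    p: (num_speakers, num_frames) activity probabilities.
    win: window length in frames.
    Returns (start, end) frame indices of the chosen segment.
    """
    others = np.delete(p, target_idx, axis=0).max(axis=0)
    score = p[target_idx] - others              # frame-wise target dominance
    # sliding-window sums via a cumulative sum
    csum = np.concatenate([[0.0], np.cumsum(score)])
    window_scores = csum[win:] - csum[:-win]
    start = int(np.argmax(window_scores))
    return start, start + win

# toy example: target clearly dominates frames 100-160
p = np.full((2, 200), 0.1)
p[1] = 0.9
p[0, 100:160] = 0.95
p[1, 100:160] = 0.05
start, end = select_enrollment(p, target_idx=0, win=50)
print(start, end)
```

The selected segment would then serve as the cross-attention reference described above.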
Note: For inference-only usage without training, see our dedicated inference repository with a streamlined browser interface.
Note 2: For older training configurations and models, please refer to the v1 branch.
| Model | Description | Link |
|---|---|---|
| CTC Whisper large-v3-turbo | Pre-trained encoder model | Download |
| DiCoW Models | Fine-tuned diarization-conditioned Whisper models | Hugging Face Collection |
```bash
git clone https://github.com/BUTSpeechFIT/TS-ASR-Whisper
cd TS-ASR-Whisper
```

Use conda or venv:
**Conda**

```bash
conda create -n ts_asr_whisper python=3.11
conda activate ts_asr_whisper
```

**Virtual Environment**

```bash
python -m venv ts_asr_whisper
source ts_asr_whisper/bin/activate
```

```bash
pip install -r requirements.txt
```

(Optional) To accelerate training and inference:
```bash
pip install flash-attn==2.7.2.post1
```

⚠️ `flash-attn` requires `torch` to be installed beforehand and is therefore not included in `requirements.txt`.
Edit configs/local_paths.sh according to your environment.
All variables are documented directly within the script.
Ensure that ffmpeg and sox are available:
```bash
conda install -c conda-forge ffmpeg sox
# or
sudo apt install ffmpeg sox
```

If you intend to run the full pipeline including diarization, you must set up the DiariZen toolkit.
- Clone the DiariZen repository alongside this project:

  ```bash
  git clone https://github.com/BUTSpeechFIT/DiariZen.git
  ```

- Follow the installation instructions provided in the DiariZen README.
- Crucial Step: Ensure `lhotse` is installed in the DiariZen environment:

  ```bash
  # Activate your DiariZen environment first
  pip install lhotse
  ```

Before training or decoding, datasets must be prepared. We provide a dedicated repository for this purpose: 👉 mt-asr-data-prep
Follow its instructions, then update MANIFEST_DIR in configs/local_paths.sh.
The codebase uses Hydra for configuration management.
All configuration files are located in ./configs, with default parameters in configs/base.yaml.
| Mode | Description |
|---|---|
| pre-train | Pre-train the Whisper encoder using CTC |
| fine-tune | Fine-tune Whisper with diarization conditioning for target-speaker ASR |
| decode | Decode using a pre-trained or fine-tuned model |
Scripts are provided for SLURM-based systems.
To run locally, simply omit the sbatch prefix.
```bash
# Pre-train Whisper encoder
sbatch ./scripts/training/submit_slurm.sh +pretrain=turbo

# Fine-tune DiCoW
sbatch ./scripts/training/submit_slurm.sh +train=dicow_v3

# Decode with a trained model
sbatch ./scripts/training/submit_slurm.sh +decode=dicow_v3_greedy
```

Hydra configurations are modular and rely on config groups instead of direct YAML file paths. Each configuration file typically begins with:

```yaml
# @package _global_
```

This ensures that its parameters override global defaults from configs/base.yaml.
Configurations can also inherit from others using the defaults field, for example:
```yaml
# @package _global_
defaults:
  - /train/dicow_v3
```

This means the configuration inherits all parameters from /train/dicow_v3 and can override specific values.
This design ensures consistency and reusability across different training and evaluation setups.
Defined and described in configs/local_paths.sh.
All configuration options are described in src/utils/training_args.py.
Trained models can be exported directly to the Hugging Face Hub using the provided export utility.
Before running the export, make sure you have:
- Created a corresponding model card file named `<HUB_MODEL_NAME>.md` in `export_sources/readmes/`.
- Optionally updated `export_sources/generation_config.json` if your model requires custom decoding parameters.
Once prepared, run the following command:
```bash
python ./utils/export_dicow.py \
    --model_path <MODEL_DIR> \
    --model_name <HUB_MODEL_NAME> \
    --org <HUB_ORG> \
    --base_whisper_model openai/whisper-large-v3-turbo
```

Where:
- `<MODEL_DIR>`: path to the directory containing the trained model checkpoint.
- `<HUB_MODEL_NAME>`: name of the target model repository on the Hugging Face Hub.
- `<HUB_ORG>`: Hugging Face organization or user under which the model will be published.
The script packages the checkpoint, configuration, and model card, then uploads them to the specified Hub repository for easy sharing and reproducibility.
For transparent and reproducible evaluation, we host a public benchmark leaderboard on Hugging Face: 👉 EMMA JSALT25 Benchmark
This step expects the evaluated model to be available on Hugging Face Hub.
If you do not wish to export your model but still want to submit results, you can initialize it locally using the reinit_from option under the model.setup section in your YAML configuration.
When using reinit_from, make sure to specify all model initialization arguments exactly as they were during training so the model is reconstructed correctly.
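As a sketch of such a configuration (the config group `/decode/dicow_v3_greedy` is taken from the decoding example above; the checkpoint path is a placeholder, and the exact keys under `model.setup` must match those used during training):

```yaml
# @package _global_
defaults:
  - /decode/dicow_v3_greedy

model:
  setup:
    reinit_from: /path/to/your/checkpoint
```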
To generate a submission file, use the helper script:
```bash
./scripts/create_emma_submission.sh
```

This script collects all decoding hypotheses and saves them in a JSON file formatted for leaderboard submission. Once created, simply upload this file to the Hugging Face space linked above to appear on the leaderboard.
Source codes in this repository are licensed under the Apache License 2.0.
If you use our models or code, please cite the following works:
@INPROCEEDINGS{polok2026sedicow,
author={Polok, Alexander and Klement, Dominik and Cornell, Samuele and Wiesner, Matthew and Černocký, Jan and Khudanpur, Sanjeev and Burget, Lukáš},
booktitle={ICASSP 2026 - 2026 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
title={SE-DiCoW: Self-Enrolled Diarization-Conditioned Whisper},
year={2026},
}
@INPROCEEDINGS{10887683,
author={Polok, Alexander and Klement, Dominik and Wiesner, Matthew and Khudanpur, Sanjeev and Černocký, Jan and Burget, Lukáš},
booktitle={ICASSP 2025 - IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
title={Target Speaker {ASR} with {Whisper}},
year={2025},
pages={1-5},
doi={10.1109/ICASSP49660.2025.10887683}
}
@article{POLOK2026101841,
title = {{DiCoW}: Diarization-conditioned {Whisper} for target speaker automatic speech recognition},
journal = {Computer Speech & Language},
volume = {95},
pages = {101841},
year = {2026},
doi = {10.1016/j.csl.2025.101841},
url = {https://www.sciencedirect.com/science/article/pii/S088523082500066X},
author = {Alexander Polok and Dominik Klement and Martin Kocour and Jiangyu Han and Federico Landini and Bolaji Yusuf and Matthew Wiesner and Sanjeev Khudanpur and Jan Černocký and Lukáš Burget},
keywords = {Diarization-conditioned Whisper, Target-speaker ASR, Speaker diarization, Long-form ASR, Whisper adaptation}
}

Contributions are welcome. If you’d like to improve the code, add new features, or extend the training pipeline, please open an issue or submit a pull request.
For questions or collaboration, please contact:
