Target Speaker ASR with Whisper

This repository contains the official implementation of the following publications:

Target Speaker Whisper — IEEE Xplore
DiCoW: Diarization-Conditioned Whisper for Target Speaker Automatic Speech Recognition — ScienceDirect
SE-DiCoW: Self-Enrolled Diarization-Conditioned Whisper — arXiv:2601.19194

This repository provides tools to prepare data and to train, publish, and evaluate your models with just a few commands, as shown in the figure below.

🎯 Project Overview

DiCoW (Diarization-Conditioned Whisper) enhances Whisper for target-speaker ASR by conditioning the model on frame-level diarization probabilities.

These probabilities are converted into Silence–Target–Non-Target–Overlap (STNO) masks and injected into each encoder layer through Frame-level Diarization-Dependent Transformations (FDDT).
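One way to picture the conversion is as a per-frame computation that treats the speakers' activities as independent. The sketch below is plain Python written under that assumption. The function name `stno_probabilities` and the nested-list input layout are illustrative and are not taken from this repository.

```python
from math import prod


def stno_probabilities(diar_probs, target):
    """Convert frame-level diarization probabilities into STNO probabilities.

    diar_probs: list of frames, each a list with one activity probability per speaker.
    target: index of the target speaker.
    Assumes (for illustration only) that speaker activities are independent.
    """
    masks = []
    for frame in diar_probs:
        p_target = frame[target]
        # Probability that every non-target speaker is silent in this frame.
        p_others_silent = prod(1.0 - p for i, p in enumerate(frame) if i != target)
        silence = (1.0 - p_target) * p_others_silent
        target_only = p_target * p_others_silent
        non_target = (1.0 - p_target) * (1.0 - p_others_silent)
        overlap = p_target * (1.0 - p_others_silent)
        masks.append((silence, target_only, non_target, overlap))
    return masks
```

The four values in each frame sum to one, so every frame carries a soft assignment over the Silence, Target, Non-Target, and Overlap classes.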

This approach enables Whisper to focus on the desired speaker without explicit speaker embeddings, making it robust to unseen speakers and diverse acoustic conditions.
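A minimal sketch of the idea behind FDDT, assuming each of the four STNO classes owns an affine transformation and that a frame's representation is replaced by the STNO-weighted sum of the four transformed copies. The helper names, the list-based matrices, and the parameter layout are illustrative. The actual layers in this repository may be parameterized or initialized differently.

```python
def affine(weight, bias, vec):
    """Apply y = W x + b using plain Python lists."""
    return [sum(w * x for w, x in zip(row, vec)) + b for row, b in zip(weight, bias)]


def fddt_frame(frame_vec, stno, class_params):
    """Mix class-specific affine transforms of one frame by its STNO probabilities.

    frame_vec: hidden representation of a single frame.
    stno: (silence, target, non_target, overlap) probabilities for that frame.
    class_params: four (weight, bias) pairs, one per STNO class (hypothetical layout).
    """
    mixed = [0.0] * len(frame_vec)
    for prob, (weight, bias) in zip(stno, class_params):
        transformed = affine(weight, bias, frame_vec)
        mixed = [m + prob * t for m, t in zip(mixed, transformed)]
    return mixed
```

With identity weights and zero biases for all four classes, the function returns its input unchanged, which is one plausible starting point for fine-tuning a pre-trained encoder.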

SE-DiCoW (Self-Enrolled DiCoW) resolves ambiguities in overlapping speech regions by introducing a self-enrollment mechanism.

An enrollment segment—automatically selected where the diarizer predicts the target speaker as most active—is used as a reference through cross-attention conditioning at encoder layers to further bias the model toward the target speaker.
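The selection step can be illustrated with a sliding window over the target speaker's frame-level probabilities. This is only a sketch: it assumes "most active" means the fixed-length window with the largest summed target probability, and the function name and window parameter are hypothetical. The criterion used in this repository may differ.

```python
def pick_enrollment_window(target_probs, window):
    """Return (start, end) of the window where the target speaker is most active.

    target_probs: per-frame activity probabilities of the target speaker.
    window: window length in frames (illustrative parameter).
    """
    if window <= 0 or window > len(target_probs):
        raise ValueError("window must be between 1 and the number of frames")
    current = sum(target_probs[:window])
    best_sum, best_start = current, 0
    for start in range(1, len(target_probs) - window + 1):
        # Slide by one frame: add the new right edge, drop the old left edge.
        current += target_probs[start + window - 1] - target_probs[start - 1]
        if current > best_sum:
            best_sum, best_start = current, start
    return best_start, best_start + window
```

The returned indices are frame positions; the end index is exclusive, so they can be used directly as a slice.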

Note: For inference-only usage without training, see our dedicated inference repository with a streamlined browser interface.

Note 2: For older training configurations and models, please refer to the v1 branch.

📦 Checkpoints

Model	Description	Link
CTC Whisper large-v3-turbo	Pre-trained encoder model	Download
DiCoW Models	Fine-tuned diarization-conditioned Whisper models	Hugging Face Collection

⚙️ Setup and Installation

1. Clone the Repository

git clone https://github.com/BUTSpeechFIT/TS-ASR-Whisper
cd TS-ASR-Whisper

2. Create a Python Environment

Use conda or venv:

Conda

conda create -n ts_asr_whisper python=3.11
conda activate ts_asr_whisper

Virtual Environment

python -m venv ts_asr_whisper
source ts_asr_whisper/bin/activate

3. Install Dependencies

pip install -r requirements.txt

(Optional) To accelerate training and inference:

pip install flash-attn==2.7.2.post1

⚠️ flash-attn requires torch to be installed beforehand and is therefore not included in requirements.txt.
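Because of this ordering requirement, it can help to confirm that torch is importable before installing flash-attn, and afterwards that `flash_attn` (the package's import name) is visible. The helper below is a small sketch using only the standard library; `missing_modules` is a hypothetical name, not a utility shipped with this repository.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(["torch", "flash_attn"])
    if missing:
        print("Not installed:", ", ".join(missing))
    else:
        print("torch and flash_attn are both importable")
```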

4. Configure Paths

Edit configs/local_paths.sh according to your environment. All variables are documented directly within the script.

5. Install Additional Tools

Ensure that ffmpeg and sox are available:

conda install -c conda-forge ffmpeg sox
# or
sudo apt install ffmpeg sox
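To confirm that both tools are on the PATH of the active environment, a short standard-library check such as the following can be used. The function name `missing_tools` is illustrative.

```python
import shutil


def missing_tools(tools=("ffmpeg", "sox")):
    """Return the command-line tools that are not found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing tools:", ", ".join(missing))
    else:
        print("ffmpeg and sox are available")
```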

6. Set Up Diarization (Optional)

If you intend to run the full pipeline including diarization, you must set up the DiariZen toolkit.

Clone the DiariZen repository alongside this project:

git clone https://github.com/BUTSpeechFIT/DiariZen.git

Follow the installation instructions provided in the DiariZen README.
Crucial Step: Ensure lhotse is installed in the DiariZen environment:

# Activate your DiariZen environment first
pip install lhotse

🎧 Data Preparation

Before training or decoding, datasets must be prepared. We provide a dedicated repository for this purpose: 👉 mt-asr-data-prep

Follow its instructions, then update MANIFEST_DIR in configs/local_paths.sh.

🚀 Usage

The codebase uses Hydra for configuration management. All configuration files are located in ./configs, with default parameters in configs/base.yaml.

Run Modes

Mode	Description
pre-train	Pre-train the Whisper encoder using CTC
fine-tune	Fine-tune Whisper with diarization conditioning for target-speaker ASR
decode	Decode using a pre-trained or fine-tuned model

Example Commands

Scripts are provided for SLURM-based systems. To run locally, simply omit the sbatch prefix.

# Pre-train Whisper encoder
sbatch ./scripts/training/submit_slurm.sh +pretrain=turbo

# Fine-tune DiCoW
sbatch ./scripts/training/submit_slurm.sh +train=dicow_v3

# Decode with a trained model
sbatch ./scripts/training/submit_slurm.sh +decode=dicow_v3_greedy
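The relationship between the SLURM and local invocations can be sketched as a small command builder. The script path and override strings are copied from the commands above; the helper `build_command` itself is hypothetical and only illustrates that the two forms differ by the sbatch prefix.

```python
def build_command(override, use_slurm=True,
                  script="./scripts/training/submit_slurm.sh"):
    """Build the argument list for a run, with or without the sbatch prefix."""
    command = [script, override]
    return ["sbatch", *command] if use_slurm else command


if __name__ == "__main__":
    # Print both variants for the fine-tuning example.
    print(" ".join(build_command("+train=dicow_v3")))
    print(" ".join(build_command("+train=dicow_v3", use_slurm=False)))
```

The resulting list can be passed to `subprocess.run` if launching runs from Python is preferred.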

🧩 Configuration Details

Hydra configurations are modular and rely on config groups instead of direct YAML file paths. Each configuration file typically begins with:

# @package _global_

This ensures that its parameters override global defaults from configs/base.yaml.

Configurations can also inherit from others using the defaults field, for example:

# @package _global_
defaults:
  - /train/dicow_v3

This means the configuration inherits all parameters from /train/dicow_v3 and can override specific values. This design ensures consistency and reusability across different training and evaluation setups.
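Conceptually, this inheritance behaves like a recursive dictionary merge in which the child configuration wins wherever both define the same key. The sketch below illustrates that idea in plain Python. It is not Hydra's implementation, and the keys in the usage example are hypothetical rather than taken from the repository's configs.

```python
def merge_configs(base, override):
    """Recursively merge two config dicts; values from `override` take precedence."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_configs(merged[key], value)
        else:
            merged[key] = value
    return merged


if __name__ == "__main__":
    # Hypothetical keys, for illustration only.
    parent = {"training": {"learning_rate": 1e-4, "epochs": 10}, "seed": 42}
    child = {"training": {"epochs": 3}}
    print(merge_configs(parent, child))
```

Only the overridden leaf changes; sibling keys from the parent configuration are preserved.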

Bash Variables

Defined and described in configs/local_paths.sh.

YAML Config Parameters

All configuration options are described in src/utils/training_args.py.

🚢 Model Export

Trained models can be exported directly to the Hugging Face Hub using the provided export utility.

Before running the export, make sure you have:

Created a corresponding model card file named <HUB_MODEL_NAME>.md in export_sources/readmes/.
Optionally updated export_sources/generation_config.json if your model requires custom decoding parameters.

Once prepared, run the following command:

python ./utils/export_dicow.py \
  --model_path <MODEL_DIR> \
  --model_name <HUB_MODEL_NAME> \
  --org <HUB_ORG> \
  --base_whisper_model openai/whisper-large-v3-turbo

Where:

<MODEL_DIR> — path to the directory containing the trained model checkpoint.
<HUB_MODEL_NAME> — name of the target model repository on the Hugging Face Hub.
<HUB_ORG> — Hugging Face organization or user under which the model will be published.

The script packages the checkpoint, configuration, and model card, then uploads them to the specified Hub repository for easy sharing and reproducibility.
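Since the export relies on a model card being present, a quick pre-flight check can catch a missing file before the upload starts. This sketch assumes it is run against the repository root and uses the directory layout described above; the helper names are hypothetical and are not part of the export utility.

```python
from pathlib import Path


def model_card_path(repo_root, hub_model_name):
    """Return the path where the model card for `hub_model_name` is expected."""
    return Path(repo_root) / "export_sources" / "readmes" / f"{hub_model_name}.md"


def has_model_card(repo_root, hub_model_name):
    """Check whether the expected model card file exists."""
    return model_card_path(repo_root, hub_model_name).is_file()


if __name__ == "__main__":
    name = "my-model"  # placeholder for <HUB_MODEL_NAME>
    if not has_model_card(".", name):
        print(f"Missing model card: {model_card_path('.', name)}")
```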

📊 Evaluation

For transparent and reproducible evaluation, we host a public benchmark leaderboard on Hugging Face: 👉 EMMA JSALT25 Benchmark

This step expects the evaluated model to be available on Hugging Face Hub. If you do not wish to export your model but still want to submit results, you can initialize it locally using the reinit_from option under the model.setup section in your YAML configuration. When using reinit_from, make sure to specify all model initialization arguments exactly as they were during training so the model is reconstructed correctly.

To generate a submission file, use the helper script:

./scripts/create_emma_submission.sh

This script collects all decoding hypotheses and saves them in a JSON file formatted for leaderboard submission. Once created, simply upload this file to the Hugging Face space linked above to appear on the leaderboard.

📜 License

The source code in this repository is licensed under the Apache License 2.0.

📚 Citation

If you use our models or code, please cite the following works:

@INPROCEEDINGS{polok2026sedicow,
  author={Polok, Alexander and Klement, Dominik and Cornell, Samuele and Wiesner, Matthew and Černocký, Jan and Khudanpur, Sanjeev and Burget, Lukáš},
  booktitle={ICASSP 2026 - 2026 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)}, 
  title={SE-DiCoW: Self-Enrolled Diarization-Conditioned Whisper}, 
  year={2026},
}

@INPROCEEDINGS{10887683,
  author={Polok, Alexander and Klement, Dominik and Wiesner, Matthew and Khudanpur, Sanjeev and Černocký, Jan and Burget, Lukáš},
  booktitle={ICASSP 2025 - IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  title={Target Speaker {ASR} with {Whisper}},
  year={2025},
  pages={1-5},
  doi={10.1109/ICASSP49660.2025.10887683}
}

@article{POLOK2026101841,
  title = {{DiCoW}: Diarization-conditioned {Whisper} for target speaker automatic speech recognition},
  journal = {Computer Speech \& Language},
  volume = {95},
  pages = {101841},
  year = {2026},
  doi = {10.1016/j.csl.2025.101841},
  url = {https://www.sciencedirect.com/science/article/pii/S088523082500066X},
  author = {Alexander Polok and Dominik Klement and Martin Kocour and Jiangyu Han and Federico Landini and Bolaji Yusuf and Matthew Wiesner and Sanjeev Khudanpur and Jan Černocký and Lukáš Burget},
  keywords = {Diarization-conditioned Whisper, Target-speaker ASR, Speaker diarization, Long-form ASR, Whisper adaptation}
}

🤝 Contributing

Contributions are welcome. If you’d like to improve the code, add new features, or extend the training pipeline, please open an issue or submit a pull request.

📬 Contact

For questions or collaboration, please contact:
