This repository provides a robust Dysarthric Speech Quality Assessment (DSQA) model trained with data augmentation (DA) methods on the SAP dataset. More than 90% of the SAP dataset is unlabeled; our goal is to exploit this unlabeled portion to improve the robustness of the DSQA model. We also propose a way to leverage a large-scale typical-speech dataset (LibriSpeech).
For further details, please refer to our paper:
Something from Nothing: Data Augmentation for Robust Severity Level Estimation of Dysarthric Speech [Paper link].
HuggingFace: jaesungbae/da-dsqa
conda create -n da-dsqa python=3.10 -y
conda activate da-dsqa
conda install pytorch torchaudio -c pytorch -y

For a GPU build with a specific CUDA version, see pytorch.org for the appropriate command.

pip install -r requirements.txt

Note: Silero VAD is loaded automatically at runtime via `torch.hub`; no separate installation is needed.
The repository includes checkpoints for all three stages under `checkpoints/`:

- `stage1/` — Baseline checkpoint trained only with the SAP labeled dataset
- `stage2/` — Contrastive pre-training checkpoints (`checkpoint_epoch_002.pt`)
- `stage3/` — Trained probe checkpoints (`model.safetensors`)
| Checkpoint | Contrastive Loss | τ |
|---|---|---|
| `simclr_tau0.1` | SimCLR | 0.1 |
| `rank-n-contrast_tau100.0` | Rank-N-Contrast | 100.0 |
| `proposed_L_dis_tau1.0` | Proposed (L_dis) | 1.0 |
| `proposed_L_cont_tau0.1` | Proposed (L_cont) | 0.1 |
| `proposed_L_coarse_tau{0.1,1.0,10.0,50.0,100.0}` | Proposed (L_coarse) | 0.1–100.0 |
Predict severity from a single WAV file. The script handles VAD preprocessing and Whisper feature extraction internally; you may want to modify it for batch-wise inference:
python inference.py \
--wav /path/to/audio.wav \
--checkpoint ./checkpoints/stage3/proposed_L_coarse_tau10.0/average

You can also run inference directly from our HuggingFace model:
from huggingface_hub import snapshot_download
model_dir = snapshot_download("jaesungbae/da-dsqa")
from pipeline import PreTrainedPipeline
pipe = PreTrainedPipeline(model_dir)
# Single file
result = pipe("/path/to/audio.wav")
print(result)
# {"severity_score": 4.25, "raw_score": 4.2483, "model_name": "proposed_L_coarse_tau100.0"}
# Batch inference
results = pipe.batch_inference([
"/path/to/audio1.wav",
"/path/to/audio2.wav",
"/path/to/audio3.wav",
])
for r in results:
print(f"{r['file']}: {r['severity_score']}")
# Switch checkpoint
pipe.switch_model("simclr_tau0.1")
result = pipe("/path/to/audio.wav")

Each dataset requires a JSON metadata file per split (`train.json`, `dev.json`, `test.json`). The JSON is a dictionary keyed by filename:
Training dataset (SAP / pseudo-labeled) — used by probe-whisper.py and pretrain_contrastive.py:
{
"speaker1/utterance_001.wav": {
"ratings": {
"Intelligibility": 3.5,
"Naturalness": 4.0,
"Average": 3.75
}
},
"speaker2/utterance_002.wav": {
"ratings": {
"Intelligibility": 6.0,
"Naturalness": 5.5,
"Average": 5.75
}
}
}

Required keys:

- `ratings.Intelligibility` — float, 1.0 (most severe) to 7.0 (typical)
- `ratings.Naturalness` — float, 1.0 to 7.0
- `ratings.Average` — float, mean of Intelligibility and Naturalness (used as the default target)
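As a sketch of this schema, the stdlib-only snippet below (file names are made up for illustration) checks the `Average` invariant and writes a minimal `train.json` to the current directory:

```python
import json

# Hypothetical entries following the required schema (file names are
# made up for illustration).
metadata = {
    "speaker1/utterance_001.wav": {
        "ratings": {"Intelligibility": 3.5, "Naturalness": 4.0, "Average": 3.75}
    },
    "speaker2/utterance_002.wav": {
        "ratings": {"Intelligibility": 6.0, "Naturalness": 5.5, "Average": 5.75}
    },
}

# ratings.Average must be the mean of the other two ratings.
for entry in metadata.values():
    r = entry["ratings"]
    assert r["Average"] == (r["Intelligibility"] + r["Naturalness"]) / 2

with open("train.json", "w") as f:
    json.dump(metadata, f, indent=2)
```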
Typical speech dataset (LibriSpeech) — used by pretrain_contrastive.py for contrastive pre-training. All ratings are set to 1.0 (most typical):
{
"1234-5678-0001.wav": {
"ratings": {
"Intelligibility": 1.0,
"Naturalness": 1.0,
"Average": 1.0
}
}
}

Cross-domain datasets — each uses a dataset-specific native label field:
| Dataset | Label field | Type |
|---|---|---|
| DysArinVox | `mos` | float (MOS score) |
| EasyCall | `tom_score` | int (TOM 1–5, raw) |
| UASpeech | `intelligibility_pct` | float (0–100%) |
| EWA-DB | `moca` | float (MoCA score) |
| NeuroVoz | `hy_stadium` | float (Hoehn–Yahr stage, HC fallback: 0.0) |
Important
From here, we demonstrate the training process with the open datasets EasyCall and LibriSpeech dev-other. You can replicate the same process with the SAP dataset and the full LibriSpeech training sets.
Download the dataset:
# EasyCall (In our paper, we use SAP dataset)
wget http://neurolab.unife.it/easycallcorpus/EasyCall.zip
unzip EasyCall.zip
# LibriSpeech dev-other (In our paper, we use train-clean-100, train-clean-360, and train-other-500)
wget https://openslr.trmal.net/resources/12/dev-other.tar.gz
tar -xzf dev-other.tar.gz

Generate EasyCall metadata:
python data_prepare/create_metadata_easycall.py \
--easycall_dir ./EasyCall \
--output_dir_labeled ./dataset_easycall_labeled \
--output_dir_unlabeled ./dataset_easycall_unlabeled

This produces two output directories:
- `dataset_easycall_labeled/` — speakers with TOM scores plus healthy controls, split into `{train,dev,test}.json` by speaker (no overlap). To align the TOM score with the SAP score range, we remap it to `ratings.Average`: 1.0 = typical (HC), 2.0–7.0 = dysarthric.
- `dataset_easycall_unlabeled/` — speakers without TOM scores (e.g., F04), saved as `train.json`.
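The actual remapping is implemented in `data_prepare/create_metadata_easycall.py`. As an illustration only, one plausible linear remap might look like this; the formula, the assumption that higher TOM means milder dysarthria, and the function name are all guesses, not the script's actual logic:

```python
from typing import Optional

def tom_to_average(tom_score: Optional[int]) -> float:
    """Hypothetical remap of an EasyCall TOM score (1-5) into the
    SAP-style ratings.Average range described above (1.0 = typical HC,
    2.0-7.0 = dysarthric). Illustration only; see
    data_prepare/create_metadata_easycall.py for the real mapping.
    """
    if tom_score is None:  # healthy control without a TOM rating
        return 1.0
    # Guess: higher TOM = milder, so TOM 5 lands next to typical (2.0)
    # and TOM 1 at the far end of the dysarthric range (7.0).
    return 2.0 + (5 - tom_score) * (7.0 - 2.0) / 4
```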
Generate LibriSpeech metadata:
python data_prepare/create_metadata_librispeech.py \
--librispeech_dir ./LibriSpeech/dev-other \
--output_dir ./dataset_librispeech

This creates `dataset_librispeech/dev-other.json` (named after the input directory) with all ratings set to 1.0 (typical speech).
Extract Whisper encoder features with VAD preprocessing using extract_features_with_vad.py. This applies Silero VAD to strip silence, then saves the last-layer hidden states as .npy files (float16).
WARNING: Some files may not generate features correctly. You may want to filter them out from the metadata files.
The extracted .npy files mirror the same directory structure as the source wav files within each split directory.
# EasyCall (labeled)
python extract_features_with_vad.py \
--model_name whisper-large-v3 \
--wav_dir ./EasyCall \
--data ./dataset_easycall_labeled/train.json \
--dump_dir ./features/easycall/whisper_large_v3_vad/train \
--vad_threshold 0.2 \
--min_speech_duration_ms 100 \
--min_silence_duration_ms 100 \
--speech_pad_ms 30 \
--max_duration 30.0
# Repeat for dev/test splits:
# --data ./dataset_easycall_labeled/dev.json --dump_dir .../dev
# --data ./dataset_easycall_labeled/test.json --dump_dir .../test
# EasyCall (unlabeled)
python extract_features_with_vad.py \
--model_name whisper-large-v3 \
--wav_dir ./EasyCall \
--data ./dataset_easycall_unlabeled/train.json \
--dump_dir ./features/easycall/whisper_large_v3_vad/train \
--vad_threshold 0.2 \
--min_speech_duration_ms 100 \
--min_silence_duration_ms 100 \
--speech_pad_ms 30 \
--max_duration 30.0
# LibriSpeech dev-other
python extract_features_with_vad.py \
--model_name whisper-large-v3 \
--wav_dir ./LibriSpeech/dev-other \
--data ./dataset_librispeech/dev-other.json \
--dump_dir ./features/librispeech/whisper_large_v3_vad/dev-other \
--vad_threshold 0.2 \
--min_speech_duration_ms 100 \
--min_silence_duration_ms 100 \
--speech_pad_ms 30 \
--max_duration 30.0

Some files may fail during feature extraction. Use `validate_features.py` to remove entries with missing `.npy` files from the metadata:
# EasyCall (labeled)
python data_prepare/validate_features.py \
--feature_dir ./features/easycall/whisper_large_v3_vad \
--data_dir ./dataset_easycall_labeled \
--splits train dev test
# EasyCall (unlabeled)
python data_prepare/validate_features.py \
--feature_dir ./features/easycall/whisper_large_v3_vad \
--data_dir ./dataset_easycall_unlabeled \
--splits train
# LibriSpeech
python data_prepare/validate_features.py \
--feature_dir ./features/librispeech/whisper_large_v3_vad \
--data_dir ./dataset_librispeech \
--splits dev-other

Train a baseline probe on the labeled EasyCall data (no contrastive pre-training):
python probe-whisper.py \
    --feature_dir ./features/easycall/whisper_large_v3_vad \
--data_dir ./dataset_easycall_labeled \
--out_dir ./experiments/stage1 \
--exp_name baseline \
--target_type average \
--proj_dim 320 \
--dropout 0.1 \
--micro_batch_size 16 \
--accum_steps 2 \
--lr 1e-4 \
--epochs 10 \
--save_strategy epoch \
--seed 42

This trains a WhisperFeatureProbeV2 model to predict `ratings.Average` from pre-extracted Whisper features. The best checkpoint is saved to `./experiments/stage1/baseline/average/`.
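The probe itself is small: features are pooled over time and regressed to one score. The NumPy sketch below is only a guess at the general shape of such a probe (mean-pooling, a projection of width `--proj_dim`, and a scalar head), with random stand-in weights; it is not the actual WhisperFeatureProbeV2 implementation:

```python
import numpy as np

rng = np.random.default_rng(42)

# Guessed dimensions: whisper-large-v3 encoder hidden size is 1280,
# and proj_dim 320 matches the flag above. Weights are random
# stand-ins for trained parameters.
HIDDEN, PROJ_DIM = 1280, 320
W_proj = rng.standard_normal((HIDDEN, PROJ_DIM)) * 0.02
W_out = rng.standard_normal((PROJ_DIM, 1)) * 0.02

def probe(features: np.ndarray) -> float:
    """Map a (frames, HIDDEN) feature matrix to one severity score."""
    pooled = features.mean(axis=0)          # average over time frames
    h = np.maximum(pooled @ W_proj, 0.0)    # projection + ReLU
    return float((h @ W_out)[0])            # scalar regression output

# Features on disk are float16; upcast before the matmuls.
feats = rng.standard_normal((740, HIDDEN)).astype(np.float16)
score = probe(feats.astype(np.float32))
```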
Use the trained baseline model to generate pseudo-labels for the unlabeled speakers (e.g., F04):
python probe-whisper-pseudo-rating.py \
--checkpoint ./experiments/stage1/baseline/average \
--target_type average \
    --feature_dir ./features/easycall/whisper_large_v3_vad \
--data_dir ./dataset_easycall_unlabeled \
--split train \
--output_dir ./dataset_easycall_pseudo

Merge labeled and pseudo-labeled datasets for use in contrastive pre-training:
python data_prepare/merge_metadata.py \
--inputs ./dataset_easycall_labeled/train.json ./dataset_easycall_pseudo/train.json \
--output ./dataset_easycall_total/train.json
cp ./dataset_easycall_labeled/dev.json ./dataset_easycall_total/dev.json
cp ./dataset_easycall_labeled/test.json ./dataset_easycall_total/test.json

Pre-train the feature projector with contrastive losses. Config files for each method are in `configs/stage2/`:
| Config | Loss | τ |
|---|---|---|
| `simclr_tau0.1.json` | SimCLR | 0.1 |
| `rank-n-contrast_tau100.0.json` | Rank-N-Contrast | 100.0 |
| `proposed_L_dis_tau1.0.json` | Proposed L_dis | 1.0 |
| `proposed_L_cont_tau0.1.json` | Proposed L_cont | 0.1 |
| `proposed_L_coarse_tau{0.1,1.0,10.0,50.0,100.0}.json` | Proposed L_coarse | 0.1–100.0 |
python pretrain_contrastive.py \
--config ./configs/stage2/proposed_L_coarse_tau10.0.json \
--atypical_feature_dir ./features/easycall/whisper_large_v3_vad \
--atypical_data_dir ./dataset_easycall_total \
--typical_feature_dir ./features/librispeech/whisper_large_v3_vad \
--typical_data_dir ./dataset_librispeech \
--typical_splits dev-other \
--output_dir ./experiments/stage2

CLI arguments override config values. The best checkpoint is selected by training loss and saved as `pretrained_prenet.pt` under the experiment output directory. No dev set is used for model selection; the `--eval_split` and `--eval_typical_splits` options control optional t-SNE visualization of the learned representations during training.
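For reference, SimCLR's NT-Xent objective (the `simclr_tau*` configs) fits in a few lines. This didactic NumPy version only illustrates the role of the temperature τ; it is not the repository's `pretrain_contrastive.py` implementation, and the proposed L_dis / L_cont / L_coarse losses are defined in the paper:

```python
import numpy as np

def nt_xent(z1: np.ndarray, z2: np.ndarray, tau: float = 0.1) -> float:
    """NT-Xent (SimCLR) loss for paired views z1, z2 of shape (N, D)."""
    z = np.concatenate([z1, z2], axis=0)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)  # cosine similarity
    sim = (z @ z.T) / tau                             # temperature-scaled
    np.fill_diagonal(sim, -np.inf)                    # exclude self-pairs
    n = len(z1)
    # The positive for row i is the same sample's other view.
    pos = np.concatenate([np.arange(n, 2 * n), np.arange(n)])
    logits = sim - sim.max(axis=1, keepdims=True)     # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-log_prob[np.arange(2 * n), pos].mean())

rng = np.random.default_rng(0)
z1 = rng.standard_normal((8, 16))
z2 = rng.standard_normal((8, 16))
loss = nt_xent(z1, z2, tau=0.1)
```

Perfectly aligned views drive the loss toward zero, while a smaller τ sharpens the softmax and weights hard negatives more heavily.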
Fine-tune the probe with the pre-trained projector from Stage 2:
python probe-whisper.py \
--feature_dir ./features/easycall/whisper_large_v3_vad \
--data_dir ./dataset_easycall_labeled \
--out_dir ./experiments/stage3 \
--exp_name proposed_L_coarse_tau10.0 \
--pretrained_prenet ./experiments/stage2/proposed_L_coarse_tau10.0/checkpoint_epoch_002.pt \
--target_type average \
--proj_dim 320 \
--dropout 0.1 \
--micro_batch_size 16 \
--accum_steps 2 \
--lr 1e-4 \
--epochs 10 \
--save_strategy epoch \
--seed 42

By default, the loaded projector weights are frozen. To fine-tune them jointly, add `--finetune_prenet`.
Evaluate the trained probe on dev/test splits:
python probe-whisper.py \
--test_only \
--checkpoint ./experiments/stage3/proposed_L_coarse_tau10.0/average \
--feature_dir ./features/easycall/whisper_large_v3_vad \
--data_dir ./dataset_easycall_labeled \
--target_type average \
--proj_dim 320

- Jaesung Bae ([email protected])
- Xiuwen Zheng ([email protected])
If you use this code, please cite our paper with the following BibTeX. Thank you!
@misc{bae2026something,
title = {Something from Nothing: Data Augmentation for Robust Severity Level Estimation of Dysarthric Speech},
author = {Jaesung Bae and Xiuwen Zheng and Minje Kim and Chang D. Yoo and Mark Hasegawa-Johnson},
year = {2026},
eprint = {2603.15988},
archivePrefix = {arXiv},
primaryClass = {eess.AS},
url = {https://arxiv.org/abs/2603.15988}
}
