CHSER: A Dataset and Case Study on Generative Speech Error Correction for Child ASR

This repository contains the dataset, code, and models for our paper:

CHSER: A Dataset and Case Study on Generative Speech Error Correction for Child ASR, accepted at Interspeech 2025.

🗂️ Repository Structure

CHSER/
├── code/                      # Source code for dataset generation, model training, and evaluation
│   ├── analysis/              # Scripts for computing baseline WER
│   ├── dataset_gen/           # Scripts for creating the CHSER dataset from raw hypotheses
│   └── gensec/                # Core modules for generative speech error correction (GenSEC), T5- and Llama-based
├── dataset/                   # CHSER dataset splits
│   ├── dev/                   
│   ├── test/                  
│   └── train/                 
├── models/                    # Pretrained and fine-tuned model checkpoints
│   ├── 3gram/                 # n-gram baseline (for comparison or decoding)
│   ├── llama2/                # Adapter weights for Llama2 model fine-tuned on CHSER
│   ├── t5/                    # Adapter weights for T5 model fine-tuned on CHSER
│   ├── t5_myst/               # Adapter weights for T5 model fine-tuned on MyST data
│   └── transformer/           # Transformer LM baseline model (non-pretrained)

📊 Dataset

The CHSER dataset consists of child ASR hypotheses paired with human-verified reference transcripts. Hypotheses were generated using Whisper-base.en in a zero-shot beam search setting.
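Because each hypothesis is paired with a reference transcript, a baseline word error rate (WER) can be computed directly from the pairs. The snippet below is a minimal sketch in plain Python, not the code in `code/analysis/`. The function names `edit_distance` and `corpus_wer`, the lowercase-and-split normalization, and the example sentences are illustrative assumptions; the on-disk format of the dataset splits is not described here, so loading the pairs is left out.

```python
from typing import Iterable, List, Tuple


def edit_distance(ref_words: List[str], hyp_words: List[str]) -> int:
    """Word-level Levenshtein distance (substitutions, insertions, deletions)."""
    # prev[j] holds the distance between the reference prefix seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def corpus_wer(pairs: Iterable[Tuple[str, str]]) -> float:
    """Corpus-level WER: total word errors divided by total reference words.

    `pairs` yields (reference, hypothesis) strings. Normalization here is
    only lowercasing and whitespace splitting, which is an assumption.
    """
    errors = 0
    ref_len = 0
    for reference, hypothesis in pairs:
        ref_words = reference.lower().split()
        hyp_words = hypothesis.lower().split()
        errors += edit_distance(ref_words, hyp_words)
        ref_len += len(ref_words)
    if ref_len == 0:
        raise ValueError("no reference words to score against")
    return errors / ref_len


if __name__ == "__main__":
    # Made-up example pair, not taken from the dataset.
    example = [("the cat sat on the mat", "the cat sat on mat")]
    print(f"WER: {corpus_wer(example):.3f}")
```

Errors are summed over the whole corpus before dividing, so long utterances weigh more than short ones. Averaging per-utterance WER instead would give a different number.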

🧠 Models

We provide checkpoints of GenSEC models trained on adult speech (HyPoradise) and fine-tuned on CHSER. Models include:

Llama-based correction model
T5-based correction models

📜 Citation

If you find this work useful in your research, please cite:

@misc{shankar2025chser,
      title={CHSER: A Dataset and Case Study on Generative Speech Error Correction for Child ASR}, 
      author={Natarajan Balaji Shankar and Zilai Wang and Kaiyuan Zhang and Mohan Shi and Abeer Alwan},
      year={2025},
      eprint={2505.18463},
      archivePrefix={arXiv},
      primaryClass={eess.AS},
      url={https://arxiv.org/abs/2505.18463}, 
}
