End-to-end Hindi Automatic Speech Recognition pipeline: dataset engineering, Whisper-small fine-tuning, disfluency detection, and lattice-based evaluation.
This project builds a complete Hindi ASR system from scratch — from raw conversational recordings to a deployed REST API. It covers four research and engineering stages:
- Dataset Engineering — Constructing a high-quality utterance-level corpus from 104 long-form conversational recordings (~10 hrs)
- Model Fine-Tuning — Adapting Whisper-small to conversational Hindi using HuggingFace Seq2SeqTrainer
- Disfluency Detection — Rule-based pipeline for detecting fillers, repetitions, and prolongations in Hindi speech
- Lattice Evaluation — Novel multi-system consensus framework for fairer WER computation
Result: WER reduced from 48.26% → 31.51% on conversational Hindi.
Raw Audio (.wav) ──► Preprocessing Pipeline ──► Fine-tuned Whisper-small ──► Hindi Transcript
│
├── URL reconstruction
├── JSON-aligned segmentation (pydub)
├── 16kHz mono standardization (librosa)
└── Log-mel spectrogram extraction
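The JSON-aligned segmentation step can be sketched as follows. This is a minimal stdlib-only sketch: the alignment field names (`start`, `end`, `text`) are assumptions about the corpus format, and the actual pydub/librosa calls are shown as comments rather than executed.

```python
import json

def utterance_bounds_ms(seg):
    """Convert one JSON alignment entry (times in seconds) to pydub slice bounds (ms)."""
    return int(round(seg["start"] * 1000)), int(round(seg["end"] * 1000))

# Toy alignment in the assumed format: one entry per utterance.
alignment = json.loads(
    '[{"start": 0.0, "end": 2.35, "text": "आपका स्वागत है"},'
    ' {"start": 2.35, "end": 5.1, "text": "धन्यवाद"}]'
)

bounds = [utterance_bounds_ms(seg) for seg in alignment]
# With pydub, each utterance would then be cut and standardized roughly as:
#   AudioSegment.from_wav(recording_path)[lo:hi].export(out_path, format="wav")
#   librosa.load(out_path, sr=16000, mono=True)   # 16 kHz mono for Whisper
print(bounds)  # [(0, 2350), (2350, 5100)]
```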
FastAPI App
│
├── POST /transcribe ──► chunked audio ──► model.generate() ──► transcript + confidence
├── POST /transcribe/stream ──► long audio, per-chunk results with timestamps
├── POST /transcribe/batch ──► multiple files ──► list of transcripts
├── POST /detect-language ──► Whisper language detection ──► language code + warning
├── GET /metrics ──► Prometheus metrics
├── GET /health
└── GET /docs ──► interactive Swagger UI
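The `/transcribe/stream` endpoint returns per-chunk results, which implies slicing long audio into fixed-length windows before decoding each one. A stdlib sketch of such a chunker follows; the 30 s window matches Whisper's input length, but the overlap value and the function name are assumptions, not the app's actual code.

```python
def chunk_spans(n_samples, sr=16000, chunk_s=30.0, overlap_s=1.0):
    """Return (start, end) sample indices covering the audio.

    Whisper consumes 30 s windows, so 30 s chunks with a small overlap
    (to avoid cutting words at chunk boundaries) is a common choice.
    Timestamps in seconds are simply start / sr and end / sr.
    """
    size = int(chunk_s * sr)
    step = int((chunk_s - overlap_s) * sr)
    spans = []
    start = 0
    while start < n_samples:
        spans.append((start, min(start + size, n_samples)))
        if start + size >= n_samples:
            break
        start += step
    return spans

# 70 s of 16 kHz audio -> three overlapping chunks
print(chunk_spans(70 * 16000))
```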
| Model | WER (%) |
|---|---|
| Whisper-small (pretrained baseline) | 48.26 |
| Fine-tuned — Epoch 1 | 39.24 |
| Fine-tuned — Epoch 2 | 33.01 |
| Fine-tuned — Epoch 3 (best) | 31.51 |
| Fine-tuned — Epoch 4 | 32.48 |
The fine-tuned model also outperformed the pretrained baseline on an external clean read-speech Hindi benchmark, confirming cross-domain generalization.
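The WER figures above are word-level edit distance normalized by reference length; evaluation libraries such as jiwer compute the same quantity. A minimal stdlib implementation:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edits to turn the first i reference words into the first j hypothesis words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                               # deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j                               # insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution / match
    return d[-1][-1] / len(ref)

print(wer("आपका स्वागत है", "आपका स्वागत"))  # one deletion over three reference words
```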
Try it instantly in your browser — no setup needed:
🔗 huggingface.co/spaces/joshuavinith/hindi-asr-demo
Upload or record Hindi audio and get a transcript in real time.
```python
from transformers import WhisperForConditionalGeneration, WhisperProcessor
import librosa
import torch

processor = WhisperProcessor.from_pretrained("joshuavinith/whisper-small-hindi")
model = WhisperForConditionalGeneration.from_pretrained("joshuavinith/whisper-small-hindi")
model.eval()

audio, sr = librosa.load("your_audio.wav", sr=16000, mono=True)
inputs = processor(audio, sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    predicted_ids = model.generate(inputs["input_features"], language="hi", task="transcribe")

print(processor.batch_decode(predicted_ids, skip_special_tokens=True)[0])
```

```bash
# Clone and build
git clone https://github.com/joshuvavinith/hindi-asr-whisper
cd hindi-asr-whisper
docker build -t hindi-asr-api .

# Run
docker run -p 8000:8000 hindi-asr-api
```

The API is now live at http://localhost:8000
Interactive docs: http://localhost:8000/docs
Transcribe via curl:
```bash
curl -X POST "http://localhost:8000/transcribe" \
  -H "accept: application/json" \
  -F "file=@your_hindi_audio.wav"
```

Response:

```json
{
  "transcript": "आपका स्वागत है",
  "model": "joshuavinith/whisper-small-hindi"
}
```

To run locally without Docker:

```bash
pip install fastapi uvicorn transformers torch librosa python-multipart huggingface_hub
uvicorn main:app --host 0.0.0.0 --port 8000
```

hindi-asr-whisper/
├── main.py # FastAPI app
├── Dockerfile # Docker container definition
├── requirements.txt # Python dependencies
├── notebooks/
│ ├── final_josh_preprocessing.ipynb # Data pipeline
│ ├── Finetune_model_1.ipynb # Fine-tuning run 1
│ └── Finetune_model_2_new.ipynb # Fine-tuning run 2 (best)
└── README.md
| Parameter | Value |
|---|---|
| Base model | openai/whisper-small |
| Dataset | ~10 hrs conversational Hindi (5,732 utterances) |
| Epochs | 4 (best checkpoint: epoch 3) |
| Learning rate | 1e-5 |
| Optimizer | AdamW + warmup (500 steps) |
| Mixed precision | FP16 |
| Framework | HuggingFace Seq2SeqTrainer |
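The table above maps onto a `Seq2SeqTrainingArguments` configuration roughly like the following sketch. Values marked "assumed" (batch size, evaluation settings, output directory) are not stated in the table and are illustrative only.

```python
# Hyperparameters from the fine-tuning table; "assumed" values are illustrative.
training_args = dict(
    output_dir="whisper-small-hindi",   # assumed
    num_train_epochs=4,
    learning_rate=1e-5,
    warmup_steps=500,                   # AdamW warmup from the table
    fp16=True,                          # mixed precision
    per_device_train_batch_size=16,     # assumed: not given in the table
    evaluation_strategy="epoch",        # assumed: WER is reported per epoch
    load_best_model_at_end=True,        # assumed: epoch-3 checkpoint kept as best
    metric_for_best_model="wer",
    greater_is_better=False,            # lower WER is better
    predict_with_generate=True,         # decode during eval so WER can be computed
)
# These kwargs would feed transformers.Seq2SeqTrainingArguments:
#   args = Seq2SeqTrainingArguments(**training_args)
#   trainer = Seq2SeqTrainer(model=model, args=args, ...)
```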
- Custom dataset pipeline — Programmatic reconstruction of 104 conversational Hindi recordings with JSON-aligned segmentation
- Disfluency detection — Rule-based Hindi disfluency pipeline (fillers, repetitions, prolongations) producing a labeled clip-level dataset
- Spelling analysis — 65.5% of 7,457 unique transcript tokens classified as orthographically noisy using Devanagari script validation
- Lattice consensus evaluation — Novel framework aligning 6 ASR hypotheses via Levenshtein dynamic programming for fairer WER computation
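The lattice consensus framework itself is in the report, not reproduced here, but its core building block (pairwise Levenshtein alignment with backtrace, followed by position-wise voting against an anchor hypothesis) can be sketched as below. All function names are mine, and real consensus schemes such as ROVER also handle tokens inserted relative to the anchor, which this sketch drops.

```python
from collections import Counter

def align(ref, hyp):
    """Levenshtein alignment of two token lists via DP plus backtrace.

    Returns (ref_token, hyp_token) pairs; None marks a gap (insertion/deletion).
    """
    m, n = len(ref), len(hyp)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
    pairs, i, j = [], m, n
    while i > 0 or j > 0:
        diag = d[i - 1][j - 1] + (0 if ref[i - 1] == hyp[j - 1] else 1) if i > 0 and j > 0 else None
        if diag is not None and d[i][j] == diag:
            pairs.append((ref[i - 1], hyp[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and d[i][j] == d[i - 1][j] + 1:
            pairs.append((ref[i - 1], None))
            i -= 1
        else:
            pairs.append((None, hyp[j - 1]))
            j -= 1
    return pairs[::-1]

def consensus(hypotheses):
    """Position-wise majority vote of several hypotheses against the first one."""
    anchor = hypotheses[0]
    votes = [[tok] for tok in anchor]   # the anchor votes for itself
    for hyp in hypotheses[1:]:
        k = 0                           # index into the anchor
        for a_tok, h_tok in align(anchor, hyp):
            if a_tok is not None:
                if h_tok is not None:
                    votes[k].append(h_tok)
                k += 1
    return [Counter(v).most_common(1)[0][0] for v in votes]

print(consensus([["the", "cat", "sat"], ["the", "bat", "sat"], ["the", "cat", "sap"]]))
```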
📄 Full technical report (18 pages) — ArXiv preprint in preparation.
| Method | Endpoint | Description |
|---|---|---|
| GET | `/` | API status + active device (CPU/GPU) |
| GET | `/health` | Health check |
| POST | `/transcribe` | Transcribe a Hindi audio file (with confidence score) |
| POST | `/transcribe/stream` | Transcribe long audio — returns per-chunk results with timestamps |
| POST | `/transcribe/batch` | Transcribe multiple audio files in one request |
| POST | `/detect-language` | Detect audio language; warns if not Hindi |
| GET | `/metrics` | Prometheus metrics (request count, latency, …) |
| GET | `/docs` | Interactive Swagger UI |
- Python 3.9+
- Docker (for containerized deployment)
- CPU inference supported (no GPU required); GPU auto-detected at runtime
```bash
pip install -r requirements.txt
python demo.py
```

Open http://localhost:7860 in your browser, upload a .wav or .mp3 file, and see the transcript.
```bash
pip install pytest httpx pytest-asyncio
pytest tests/
```

```bash
# CPU image
docker build -t hindi-asr-api .
docker run -p 8000:8000 hindi-asr-api
```

```bash
# GPU image
docker build \
  --build-arg PYTORCH_INDEX_URL=https://download.pytorch.org/whl/cu121 \
  -t hindi-asr-api-gpu .

# Run with GPU access
docker run --gpus all -p 8000:8000 hindi-asr-api-gpu
```

The application automatically detects the GPU at startup via torch.cuda.is_available().
- Optimized for conversational Hindi; may underperform on formal/domain-specific speech
- Whisper-small (244M params) — larger Whisper variants would likely yield lower WER at higher compute cost
- No speaker diarization support
Joshuva Vinith
B.Tech — Artificial Intelligence & Data Science
📧 [email protected]
🔗 HuggingFace | LinkedIn
MIT License — see LICENSE for details.