🎙️ Hindi ASR — Fine-tuned Whisper

End-to-end Hindi Automatic Speech Recognition pipeline: dataset engineering, Whisper-small fine-tuning, disfluency detection, and lattice-based evaluation.



Overview

This project builds a complete Hindi ASR system from scratch — from raw conversational recordings to a deployed REST API. It covers four research and engineering stages:

  1. Dataset Engineering — Constructing a high-quality utterance-level corpus from 104 long-form conversational recordings (~10 hrs)
  2. Model Fine-Tuning — Adapting Whisper-small to conversational Hindi using HuggingFace Seq2SeqTrainer
  3. Disfluency Detection — Rule-based pipeline for detecting fillers, repetitions, and prolongations in Hindi speech
  4. Lattice Evaluation — Novel multi-system consensus framework for fairer WER computation

Result: WER reduced from 48.26% → 31.51% on conversational Hindi.


Architecture

```text
Raw Audio (.wav) ──► Preprocessing Pipeline ──► Fine-tuned Whisper-small ──► Hindi Transcript
                          │
                          ├── URL reconstruction
                          ├── JSON-aligned segmentation (pydub)
                          ├── 16kHz mono standardization (librosa)
                          └── Log-mel spectrogram extraction
```

```text
FastAPI App
    │
    ├── POST /transcribe          ──► chunked audio ──► model.generate() ──► transcript + confidence
    ├── POST /transcribe/stream   ──► long audio, per-chunk results with timestamps
    ├── POST /transcribe/batch    ──► multiple files ──► list of transcripts
    ├── POST /detect-language     ──► Whisper language detection ──► language code + warning
    ├── GET  /metrics             ──► Prometheus metrics
    ├── GET  /health
    └── GET  /docs                ──► interactive Swagger UI
```
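The JSON-aligned segmentation step can be sketched in isolation. This is a minimal sketch that assumes an alignment schema of `{"start", "end"}` in seconds (the repo's actual schema may differ); since pydub's `AudioSegment` is sliced by millisecond index, the helper converts second-based alignments into millisecond slice boundaries:

```python
import json

def alignment_to_slices(alignment_json: str):
    """Convert a JSON alignment (assumed schema: list of {"start": sec, "end": sec})
    into (start_ms, end_ms) pairs, the unit pydub's AudioSegment indexing uses."""
    segments = json.loads(alignment_json)
    slices = []
    for seg in segments:
        start_ms = int(seg["start"] * 1000)
        end_ms = int(seg["end"] * 1000)
        if end_ms > start_ms:  # skip empty or inverted segments
            slices.append((start_ms, end_ms))
    return slices

# With pydub, each pair would then cut one utterance clip:
#   AudioSegment.from_wav("recording.wav")[start_ms:end_ms].export("utt.wav", format="wav")

example = '[{"start": 0.0, "end": 2.5}, {"start": 2.5, "end": 2.5}, {"start": 3.0, "end": 7.25}]'
print(alignment_to_slices(example))  # [(0, 2500), (3000, 7250)]
```

The zero-length segment is dropped, which is one reason utterance counts can end up below the raw segment count after cleaning.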

Results

| Model | WER (%) |
|---|---|
| Whisper-small (pretrained baseline) | 48.26 |
| Fine-tuned — Epoch 1 | 39.24 |
| Fine-tuned — Epoch 2 | 33.01 |
| Fine-tuned — Epoch 3 (best) | 31.51 |
| Fine-tuned — Epoch 4 | 32.48 |

The fine-tuned model also outperformed the pretrained baseline on an external clean read-speech Hindi benchmark, confirming cross-domain generalization.
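WER figures like these come from word-level Levenshtein distance normalized by reference length. A minimal, dependency-free sketch of that computation (real evaluations usually apply text normalization first, which is omitted here):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between the first i reference and first j hypothesis words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution / match
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("a b c d", "a x c"))  # 0.5: one substitution + one deletion over 4 words
```

The same word-level alignment table is the building block the lattice-consensus evaluation extends to multiple hypotheses.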


Live Demo

Try it instantly in your browser — no setup needed:

🔗 huggingface.co/spaces/joshuavinith/hindi-asr-demo

Upload or record Hindi audio and get a transcript in real time.


Quickstart

Option 1 — Python (direct inference)

```python
from transformers import WhisperForConditionalGeneration, WhisperProcessor
import librosa
import torch

processor = WhisperProcessor.from_pretrained("joshuavinith/whisper-small-hindi")
model = WhisperForConditionalGeneration.from_pretrained("joshuavinith/whisper-small-hindi")
model.eval()

# Whisper expects 16 kHz mono input
audio, sr = librosa.load("your_audio.wav", sr=16000, mono=True)
inputs = processor(audio, sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    predicted_ids = model.generate(inputs["input_features"], language="hi", task="transcribe")

print(processor.batch_decode(predicted_ids, skip_special_tokens=True)[0])
```
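Whisper's encoder consumes fixed 30-second windows, so files longer than that need chunking before the `generate` call above. A sketch of the boundary arithmetic, assuming 30 s chunks with 1 s overlap so boundary words are not cut in half (the actual parameters used by `/transcribe/stream` may differ):

```python
def chunk_bounds(n_samples: int, sr: int = 16000, chunk_s: float = 30.0, overlap_s: float = 1.0):
    """Sample-index windows for long-audio transcription: fixed-length chunks
    that overlap by overlap_s seconds; the last window may be shorter."""
    chunk = int(chunk_s * sr)
    step = int((chunk_s - overlap_s) * sr)
    bounds = []
    start = 0
    while start < n_samples:
        bounds.append((start, min(start + chunk, n_samples)))
        if start + chunk >= n_samples:
            break
        start += step
    return bounds

# 75 s of 16 kHz audio -> three windows, the last one shorter than 30 s
print(chunk_bounds(75 * 16000))  # [(0, 480000), (464000, 944000), (928000, 1200000)]
```

Each `(start, end)` slice of the loaded audio array would then be run through `processor` and `model.generate` exactly as in the snippet above, and the per-chunk transcripts concatenated.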

Option 2 — REST API (Docker)

```bash
# Clone and build
git clone https://github.com/joshuvavinith/hindi-asr-whisper
cd hindi-asr-whisper
docker build -t hindi-asr-api .

# Run
docker run -p 8000:8000 hindi-asr-api
```

API is now live at http://localhost:8000

Interactive docs: http://localhost:8000/docs

Transcribe via curl:

```bash
curl -X POST "http://localhost:8000/transcribe" \
  -H "accept: application/json" \
  -F "file=@your_hindi_audio.wav"
```

Response:

```json
{
  "transcript": "आपका स्वागत है",
  "model": "joshuavinith/whisper-small-hindi"
}
```

Option 3 — Run API without Docker

```bash
pip install fastapi uvicorn transformers torch librosa python-multipart huggingface_hub
uvicorn main:app --host 0.0.0.0 --port 8000
```

Project Structure

```text
hindi-asr-whisper/
├── main.py                          # FastAPI app
├── Dockerfile                       # Docker container definition
├── requirements.txt                 # Python dependencies
├── notebooks/
│   ├── final_josh_preprocessing.ipynb   # Data pipeline
│   ├── Finetune_model_1.ipynb           # Fine-tuning run 1
│   └── Finetune_model_2_new.ipynb       # Fine-tuning run 2 (best)
└── README.md
```

Training Configuration

| Parameter | Value |
|---|---|
| Base model | openai/whisper-small |
| Dataset | ~10 hrs conversational Hindi (5,732 utterances) |
| Epochs | 4 (best checkpoint: epoch 3) |
| Learning rate | 1e-5 |
| Optimizer | AdamW + warmup (500 steps) |
| Mixed precision | FP16 |
| Framework | HuggingFace Seq2SeqTrainer |
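The table maps onto a `Seq2SeqTrainingArguments` object roughly like the following. This is a sketch: values not in the table (output directory, eval/save strategy, best-model selection) are assumptions, not the repo's actual settings.

```python
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="./whisper-small-hindi",
    learning_rate=1e-5,              # from the table
    warmup_steps=500,                # AdamW warmup, from the table
    num_train_epochs=4,
    fp16=True,                       # mixed precision
    evaluation_strategy="epoch",     # assumption: evaluate once per epoch
    save_strategy="epoch",
    load_best_model_at_end=True,     # keeps the best checkpoint (epoch 3 here)
    metric_for_best_model="wer",
    greater_is_better=False,         # lower WER is better
    predict_with_generate=True,      # decode full sequences at eval so WER can be computed
)
```

Note that recent `transformers` releases rename `evaluation_strategy` to `eval_strategy`; adjust to your installed version.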

Research Contributions

  • Custom dataset pipeline — Programmatic reconstruction of 104 conversational Hindi recordings with JSON-aligned segmentation
  • Disfluency detection — Rule-based Hindi disfluency pipeline (fillers, repetitions, prolongations) producing a labeled clip-level dataset
  • Spelling analysis — 65.5% of 7,457 unique transcript tokens classified as orthographically noisy using Devanagari script validation
  • Lattice consensus evaluation — Novel framework aligning 6 ASR hypotheses via Levenshtein dynamic programming for fairer WER computation
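Two of these contributions are straightforward to illustrate. A sketch, assuming Unicode-block script validation for the spelling analysis and adjacent-word repetition as one representative disfluency rule (the full filler and prolongation lexicon from the report is not reproduced here):

```python
import re

# Devanagari block plus zero-width (non-)joiners, which legitimately occur in Hindi text
DEVANAGARI = re.compile(r"^[\u0900-\u097F\u200c\u200d]+$")

def is_devanagari(token: str) -> bool:
    """True if every character of the token lies in the Devanagari script block."""
    return bool(DEVANAGARI.match(token))

def find_repetitions(tokens):
    """Indices where a word immediately repeats its predecessor, e.g. 'मैं मैं गया'."""
    return [i for i in range(1, len(tokens)) if tokens[i] == tokens[i - 1]]

print(is_devanagari("स्वागत"))                   # True
print(is_devanagari("स्वागतok"))                 # False: mixed-script, i.e. "noisy" token
print(find_repetitions("मैं मैं गया".split()))   # [1]
```

Tokens failing the script check would count toward the 65.5% "orthographically noisy" figure; repetition hits would feed the labeled clip-level disfluency dataset.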

📄 Full technical report (18 pages) — ArXiv preprint in preparation.


API Endpoints

| Method | Endpoint | Description |
|---|---|---|
| GET | `/` | API status + active device (CPU/GPU) |
| GET | `/health` | Health check |
| POST | `/transcribe` | Transcribe a Hindi audio file (with confidence score) |
| POST | `/transcribe/stream` | Transcribe long audio — returns per-chunk results with timestamps |
| POST | `/transcribe/batch` | Transcribe multiple audio files in one request |
| POST | `/detect-language` | Detect audio language; warns if not Hindi |
| GET | `/metrics` | Prometheus metrics (request count, latency, …) |
| GET | `/docs` | Interactive Swagger UI |

Requirements

  • Python 3.9+
  • Docker (for containerized deployment)
  • CPU inference supported (no GPU required); GPU auto-detected at runtime

Running the Gradio Demo

```bash
pip install -r requirements.txt
python demo.py
```

Open http://localhost:7860 in your browser, upload a .wav or .mp3 file, and see the transcript.


Running Tests

```bash
pip install pytest httpx pytest-asyncio
pytest tests/
```

Docker — CPU Build (default)

```bash
docker build -t hindi-asr-api .
docker run -p 8000:8000 hindi-asr-api
```

Docker — GPU Build (CUDA 12.1)

```bash
docker build \
  --build-arg PYTORCH_INDEX_URL=https://download.pytorch.org/whl/cu121 \
  -t hindi-asr-api-gpu .

# Run with GPU access
docker run --gpus all -p 8000:8000 hindi-asr-api-gpu
```

The application automatically detects the GPU at startup via torch.cuda.is_available().


Limitations

  • Optimized for conversational Hindi; may underperform on formal/domain-specific speech
  • Whisper-small (244M params) — larger variants would likely yield lower WER
  • No speaker diarization support

Author

Joshuva Vinith
B.Tech — Artificial Intelligence & Data Science

📧 [email protected]
🔗 HuggingFace | LinkedIn


License

MIT License — see LICENSE for details.
