Team Envisage | BUET CSE Fest 2026 | Kaggle Competition
| Competition | Task | Metric | Link |
|---|---|---|---|
| DL Sprint 4.0 — Bengali Long-Form Speech Recognition | ASR | Word Error Rate (WER) | Kaggle ↗ |
| DL Sprint 4.0 — Bengali Speaker Diarization | Diarization | Diarization Error Rate (DER) | Kaggle ↗ |
- Overview
- Competition Context
- Repository Structure
- Task 1: Bengali Long-Form Speech Recognition
- Task 2: Bengali Speaker Diarization
- Trained Model & Live Demo
- Conference Paper
- Tech Stack
- Getting Started
- Results
- Acknowledgements
- Citation
This repository contains Team Envisage's complete submission to the DL Sprint 4.0 competition, organized as part of BUET CSE Fest 2026. The competition consisted of two challenging Bengali speech processing tracks:
- Bengali Long-Form ASR — Transcribing long-duration Bengali audio (lectures, interviews, conversations) into accurate text.
- Bengali Speaker Diarization — Answering "who spoke when?" by producing time-stamped speaker segments from multi-speaker Bengali audio.
Bengali, despite being one of the most widely spoken languages globally, remains significantly underrepresented in long-form speech technology — making this both a technically demanding and socially impactful challenge.
- Published Model: Our fine-tuned Bengali diarization model is publicly available on Hugging Face → AdilShamim8/Bangla_Diarizz
- Live Demo: Try our Bengali Speaker Diarization system interactively → Space Demo
- IEEE Conference Paper: Full research paper included in the repository
| Detail | Info |
|---|---|
| Organizer | AI@BUET — BUET CSE Fest 2026 |
| Platform | Kaggle |
| Duration | Jan 29, 2026 – Feb 21, 2026 |
| Participants | 718 Entrants · 321 Participants · 107 Teams · 1,464 Submissions |
| Scoring | 0.70 × Online Score + 0.30 × Offline Score |
Competition Phases:
- Phase I (Online): Kaggle submission evaluated on Public/Private test sets
- Phase II (Final): Hidden test set + On-site presentation for top teams
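As a quick sanity check, the scoring blend can be computed directly. The scores below are hypothetical, for illustration only, not our actual results:

```python
# Hypothetical Phase I (online) and offline scores -- illustration only.
online_score = 80.0
offline_score = 60.0

# Final score per the competition formula: 0.70 x Online + 0.30 x Offline
final_score = 0.70 * online_score + 0.30 * offline_score
print(final_score)  # 74.0
```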
```
BUET-CSE-Fest-2026/
│
├── 📂 Bengali Long-form Speech Recognition/
│   └── bengali-long-form-speech-recognition.ipynb        # ASR inference notebook
│
├── 📂 Bengali Speaker Diarization/
│   ├── bangla-diarizz.ipynb                              # Diarization inference notebook
│   ├── bengali-diarization-training.ipynb                # Segmentation model fine-tuning
│   ├── segmentation-3-0-finetuned-bangla-*.tar.gz        # Fine-tuned segmentation weights
│   └── wespeaker-voxceleb-resnet34-lm-*.tar.gz           # Speaker embedding model weights
│
├── 📂 BUET_Conference_paper/
│   └── Bangla Diarizz.pdf                                # IEEE-format research paper
│
├── BUET CSE FEST 2026.pptx                               # On-site presentation slides
├── DL_Sprint_4.0_Team_Envisage_Submission_Summary.pdf    # Submission summary document
└── README.md
```
The fine-tuned diarization model is also hosted on Hugging Face for easy access:
AdilShamim8/Bangla_Diarizz
Given long-form Bengali audio recordings (40–87 min each), produce an accurate Bangla text transcript.
```
Audio Input: test_001.wav
        ↓
Model Output: আজ আমরা দীর্ঘ অডিও ট্রান্সক্রিপশন নিয়ে আলোচনা করব ...
```
```mermaid
graph LR
A[Raw Audio] --> B[Vocal Source Separation]
B --> C[Voice Activity Detection]
C --> D[Silence-Boundary Chunking]
D --> E[Whisper-Medium Bengali ASR]
E --> F[Unicode NFC Normalization]
F --> G[Final Transcript]
```
| Stage | Technique | Details |
|---|---|---|
| Preprocessing | Demucs (htdemucs) | Vocal source separation to isolate speech from background music/noise |
| Segmentation | Silero VAD | Voice Activity Detection for intelligent chunking at silence boundaries |
| ASR Backbone | Whisper-Medium (Bengali fine-tuned) | BengaliAI fine-tuned Whisper model via WhisperForConditionalGeneration |
| Decoding | Autoregressive | Max 256 generated tokens per window with tuned generation hyperparameters |
| Post-processing | Unicode NFC normalization | Removal of zero-width characters, whitespace cleanup, formatting standardization |
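The silence-boundary chunking stage above can be sketched as follows. This is a simplified illustration, assuming the VAD has already produced `(start, end)` speech regions in seconds; the `chunk_at_silences` helper, its greedy packing policy, and the region values are ours, not taken from the notebook:

```python
# Sketch: pack VAD speech regions greedily into chunks no longer than
# max_len seconds, so every chunk boundary falls on a silence gap
# rather than mid-word. Regions are (start, end) pairs in seconds.

def chunk_at_silences(speech_regions, max_len=30.0):
    chunks, current = [], []
    for start, end in speech_regions:
        # If adding this region would exceed max_len, flush the current chunk
        if current and end - current[0][0] > max_len:
            chunks.append((current[0][0], current[-1][1]))
            current = []
        current.append((start, end))
    if current:
        chunks.append((current[0][0], current[-1][1]))
    return chunks

# Toy VAD output: four speech regions separated by short silences
regions = [(0.0, 12.5), (13.1, 27.9), (29.0, 41.2), (42.0, 55.0)]
print(chunk_at_silences(regions, max_len=30.0))  # [(0.0, 27.9), (29.0, 55.0)]
```

Each resulting chunk is then short enough to fit in a single Whisper decoding window.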
Weighted Mean Word Error Rate (WER):
- WER computed per instance
- Weighted by word count per sentence
- All text comparisons use Unicode NFC normalization for consistency
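The weighted-mean WER described above can be sketched as follows; the helper names and the toy reference/hypothesis pairs are ours, for illustration only:

```python
import unicodedata

# Per-instance WER via word-level Levenshtein distance.
def wer(ref_words, hyp_words):
    d = [[0] * (len(hyp_words) + 1) for _ in range(len(ref_words) + 1)]
    for i in range(len(ref_words) + 1):
        d[i][0] = i
    for j in range(len(hyp_words) + 1):
        d[0][j] = j
    for i in range(1, len(ref_words) + 1):
        for j in range(1, len(hyp_words) + 1):
            cost = 0 if ref_words[i - 1] == hyp_words[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,      # deletion
                          d[i][j - 1] + 1,      # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[-1][-1] / len(ref_words)

# Weighted mean: each instance's WER is weighted by its reference word count,
# after Unicode NFC normalization of both sides.
def weighted_wer(pairs):
    total_err = total_words = 0
    for ref, hyp in pairs:
        ref_w = unicodedata.normalize("NFC", ref).split()
        hyp_w = unicodedata.normalize("NFC", hyp).split()
        total_err += wer(ref_w, hyp_w) * len(ref_w)
        total_words += len(ref_w)
    return total_err / total_words

pairs = [("the cat sat", "the cat sat"), ("a dog ran fast", "a dog ran")]
print(weighted_wer(pairs))  # 1/7 ~ 0.1428...
```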
Given multi-speaker Bengali audio, predict time-stamped segments identifying who spoke when.
```
Audio Input: conversation.wav
        ↓
Model Output:
[ 0.0 -  5.2] → SPEAKER_1
[ 5.3 - 12.8] → SPEAKER_2
[13.0 - 18.5] → SPEAKER_1
[18.6 - 25.0] → SPEAKER_3
```
```mermaid
graph LR
A[Raw Audio] --> B[Fine-tuned Segmentation Model]
B --> C[Speaker Embedding Extraction]
C --> D[Agglomerative Clustering]
D --> E[Overlap Mitigation & Segment Merging]
E --> F[RTTM Output]
```
| Component | Model / Technique | Details |
|---|---|---|
| Segmentation | pyannote/segmentation-3.0 (fine-tuned) | Fine-tuned on the official competition dataset to capture Bengali conversational patterns |
| Speaker Embeddings | wespeaker-voxceleb-resnet34-LM | ResNet34-based speaker embedding model (6.6M params) trained on VoxCeleb |
| Clustering | Centroid-based Agglomerative Clustering | Groups speaker segments by embedding similarity |
| Post-processing | Overlap mitigation + Segment merging | Heuristic refinement of segment boundaries |
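The centroid-based agglomerative clustering step can be sketched with a toy implementation. This is illustrative only: the actual pipeline relies on pyannote.audio's built-in clustering, and the 2-D "embeddings" and threshold below are made up:

```python
import numpy as np

# Toy centroid-based agglomerative clustering over speaker embeddings:
# repeatedly merge the two clusters whose (L2-normalized) centroids are
# closest in cosine distance, until the minimum distance exceeds a threshold.

def cluster_embeddings(embs, threshold=0.5):
    clusters = [[i] for i in range(len(embs))]

    def centroid(c):
        v = np.mean([embs[i] for i in c], axis=0)
        return v / np.linalg.norm(v)

    while len(clusters) > 1:
        cents = [centroid(c) for c in clusters]
        best, pair = None, None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = 1.0 - float(cents[a] @ cents[b])  # cosine distance
                if best is None or d < best:
                    best, pair = d, (a, b)
        if best > threshold:
            break  # remaining clusters are distinct speakers
        a, b = pair
        clusters[a] += clusters[b]
        del clusters[b]
    return clusters

# Four toy 2-D "embeddings": two near (1, 0) and two near (0, 1)
embs = np.array([[1.0, 0.0], [0.99, 0.05], [0.0, 1.0], [0.05, 0.99]])
print(cluster_embeddings(embs, threshold=0.5))  # [[0, 1], [2, 3]]
```

In the real pipeline the embeddings are 256-dimensional wespeaker vectors extracted per segmentation window.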
The final fine-tuned model is published at:
AdilShamim8/Bangla_Diarizz
Diarization Error Rate (DER):
| Error Type | Description |
|---|---|
| False Alarm (FA) | Predicted speech where there was silence |
| Missed Speech (MISS) | Failed to detect actual speech |
| Speaker Confusion (CONF) | Correct speech detection, wrong speaker assigned |
Score = 100 × (1 − DER), clipped to [0, 100]. Lower DER → higher score.
Speaker IDs are automatically mapped via optimal assignment — label naming does not need to match ground truth.
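The DER computation and the score mapping above can be sketched as follows; the error durations (in seconds) are hypothetical:

```python
# DER = (false alarm + missed speech + speaker confusion) / total reference
# speech time; the competition score maps DER onto [0, 100].

def der(fa, miss, conf, total_speech):
    return (fa + miss + conf) / total_speech

def score(d):
    return max(0.0, min(100.0, 100.0 * (1.0 - d)))

# Hypothetical error durations for a 200-second reference speech total
d = der(fa=3.0, miss=5.0, conf=12.0, total_speech=200.0)
print(d, score(d))  # 0.1 90.0
```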
Models were also scored on Real-Time Factor (RTF):
- RTF = processing time / audio duration
- Participants ranked by average RTF (lower = better)
- Percentile-based scoring: 1st percentile → 100 pts
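The RTF definition is a single ratio; the durations below are hypothetical:

```python
# RTF = processing time / audio duration. Values below 1.0 mean the system
# runs faster than real time.
def rtf(processing_sec, audio_sec):
    return processing_sec / audio_sec

# e.g. a 60-minute recording diarized in 6 minutes
print(rtf(processing_sec=360.0, audio_sec=3600.0))  # 0.1
```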
Our fine-tuned Bengali speaker diarization model is publicly available on the Hugging Face Hub. It is built on top of pyannote/segmentation-3.0, fine-tuned on the official DL Sprint 4.0 Bengali diarization competition dataset, and paired with wespeaker-voxceleb-resnet34-LM embeddings for speaker clustering.
| Detail | Info |
|---|---|
| 🔗 Model Hub | huggingface.co/AdilShamim8/Bangla_Diarizz |
| Base Model | pyannote/segmentation-3.0 |
| Embeddings | wespeaker-voxceleb-resnet34-LM |
| Framework | PyTorch + pyannote.audio 3.x |
| Task | Speaker Diarization (Bengali) |
| Input | Mono audio, 16kHz |
| Output | RTTM-format speaker segments |
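RTTM is a plain-text format with one SPEAKER line per segment (file id, channel, onset, duration, speaker label). A minimal writer sketch, with hypothetical segments and a helper name of our own:

```python
# Sketch of RTTM output: one "SPEAKER" line per (start, end, label) segment.
# Fields: type, file-id, channel, onset, duration, then speaker label;
# unused fields are "<NA>".

def to_rttm(file_id, segments):
    lines = []
    for start, end, speaker in segments:
        lines.append(
            f"SPEAKER {file_id} 1 {start:.3f} {end - start:.3f} "
            f"<NA> <NA> {speaker} <NA> <NA>"
        )
    return "\n".join(lines)

segments = [(0.0, 5.2, "SPEAKER_1"), (5.3, 12.8, "SPEAKER_2")]
print(to_rttm("conversation", segments))
```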
Quick Usage:

```python
from pyannote.audio import Pipeline

# Load the fine-tuned Bengali diarization pipeline
pipeline = Pipeline.from_pretrained(
    "AdilShamim8/Bangla_Diarizz",
    use_auth_token="YOUR_HF_TOKEN"
)

# Run diarization on a Bengali audio file
diarization = pipeline("bangla_audio.wav")

# Print speaker segments
for turn, _, speaker in diarization.itertracks(yield_label=True):
    print(f"{turn.start:7.1f}s - {turn.end:7.1f}s → {speaker}")

# Export to RTTM format
with open("output.rttm", "w") as rttm:
    diarization.write_rttm(rttm)
```

Try our model instantly — no setup required! Upload any Bengali audio file and get speaker-wise time segments in real time.
| Detail | Info |
|---|---|
| Space URL | huggingface.co/spaces/AdilShamim8/Bengali_Speaker_Diarization |
| Interface | Gradio Web UI |
| Input | Upload audio file (WAV, MP3, etc.) |
| Output | Time-stamped speaker segments with speaker IDs |
| Backend | AdilShamim8/Bangla_Diarizz model pipeline |
Features:
- Upload any Bengali audio (interviews, conversations, lectures, meetings)
- Real-time inference with speaker-wise timestamps
- Automatic speaker labeling & segment visualization
- Downloadable RTTM output for downstream tasks
- No local installation needed — runs entirely in-browser
An IEEE-format research paper is included in the BUET_Conference_paper/ directory, detailing our methodology, experimental design, and findings for the Bengali Speaker Diarization challenge. This was a required deliverable for the offline evaluation component.
Offline Score = 0.20 × Presentation + 0.20 × Paper + 0.60 × Novelty
| Category | Technologies |
|---|---|
| Deep Learning | PyTorch, Transformers (HuggingFace), pyannote.audio |
| ASR | OpenAI Whisper (Bengali fine-tuned), Silero VAD |
| Diarization | pyannote/segmentation-3.0, wespeaker-voxceleb-resnet34-LM |
| Audio Processing | Demucs, librosa, soundfile |
| Deployment | 🤗 Hugging Face Hub (Model) + 🤗 Spaces (Gradio Demo) |
| Compute | Kaggle Notebooks (GPU P100 / T4 × 2) |
| Language | Python 3.10+ |
```shell
pip install torch torchaudio transformers pyannote.audio
pip install demucs silero-vad librosa soundfile
pip install onnxruntime pandas numpy
```

The fastest way to get started — load the model from the Hugging Face Hub:
```python
from pyannote.audio import Pipeline

pipeline = Pipeline.from_pretrained(
    "AdilShamim8/Bangla_Diarizz",
    use_auth_token="YOUR_HF_TOKEN"
)

diarization = pipeline("your_bangla_audio.wav")
for turn, _, speaker in diarization.itertracks(yield_label=True):
    print(f"{turn.start:.1f}s - {turn.end:.1f}s → {speaker}")
```

Just visit the Space — no code, no installation:
👉 huggingface.co/spaces/AdilShamim8/Bengali_Speaker_Diarization
- Open `Bengali Long-form Speech Recognition/bengali-long-form-speech-recognition.ipynb` on Kaggle
- Attach the competition dataset
- Run all cells — outputs the submission CSV with Bengali transcripts

- Open `Bengali Speaker Diarization/bangla-diarizz.ipynb` on Kaggle
- Attach the competition dataset and model weight files:
  - `segmentation-3-0-finetuned-bangla-pytorch-default-v1.tar.gz`
  - `wespeaker-voxceleb-resnet34-lm-pytorch-default-v1.tar.gz`
- Run all cells — outputs RTTM-format speaker segments
`Bengali Speaker Diarization/bengali-diarization-training.ipynb`
This notebook walks through fine-tuning pyannote/segmentation-3.0 on the official Bengali competition dataset to better capture Bengali conversational patterns and speaker turn dynamics.
Detailed scores and leaderboard rankings can be found in `DL_Sprint_4.0_Team_Envisage_Submission_Summary.pdf`.
| Task | Metric | Description |
|---|---|---|
| ASR | WER ↓ | Weighted Mean Word Error Rate (lower is better) |
| Diarization | DER ↓ | Diarization Error Rate (lower is better) |
| Diarization | RTF ↓ | Real-Time Factor for inference efficiency |
- AI@BUET — Competition organizers & BUET CSE Fest 2026 hosts
- pyannote.audio — State-of-the-art speaker diarization toolkit by Hervé Bredin
- WeSpeaker — Speaker embedding model toolkit
- OpenAI Whisper — Multilingual ASR foundation model
- BengaliAI — Bengali Whisper fine-tuning & community resources
- Hugging Face — Model hosting, Spaces deployment, and open-source ML infrastructure
If you find this work useful, please cite the competitions and our model:
```bibtex
@misc{shamim2026bangladiarizz,
  author    = {Adil Shamim},
  title     = {Bangla Diarizz: Fine-tuned Bengali Speaker Diarization Model},
  year      = {2026},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/AdilShamim8/Bangla_Diarizz}
}

@misc{dlsprint4-asr-2026,
  author    = {Abdullah Muhammed Amimul Ehsan and Istiak Ahmmed Rifti and
               HM Shadman Tabib and Anik Saha and Masnoon Muztahid and Shahriar Kabir},
  title     = {DL Sprint 4.0 | Bengali Long-form Speech Recognition},
  year      = {2026},
  publisher = {Kaggle},
  url       = {https://kaggle.com/competitions/dl-sprint-4-0-bengali-long-form-speech-recognition}
}

@misc{dlsprint4-diarization-2026,
  author    = {Abdullah Muhammed Amimul Ehsan and Istiak Ahmmed Rifti and
               HM Shadman Tabib and Anik Saha and Masnoon Muztahid and Shahriar Kabir},
  title     = {DL Sprint 4.0 | Bengali Speaker Diarization},
  year      = {2026},
  publisher = {Kaggle},
  url       = {https://kaggle.com/competitions/dl-sprint-4-0-bengali-speaker-diarization-challenge}
}
```

Built by Team Envisage for BUET CSE Fest 2026