🇧🇩 Bangla NLP Dataset
A comprehensive collection of Bangla NLP datasets for researchers and developers
⚠️ IMPORTANT NOTICES ⚠️
THIS DATASET IS STORED WITH GIT LFS, SO YOU MUST CLONE THE REPOSITORY TO GET THE DATA!
ALL DEEP-LEARNING-BASED DATASETS WILL BE UPLOADED SOON!
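When a repository that uses Git LFS is fetched without LFS (for example as a ZIP archive), the large files usually arrive as small text pointers instead of the real data. The sketch below checks for that case using only the standard library. It relies on the Git LFS pointer format, whose first line is `version https://git-lfs.github.com/spec/v1`; the helper names `is_lfs_pointer` and `find_lfs_pointers` are illustrative and are not part of this repository or of sbnltk.

```python
from pathlib import Path

# First line of every Git LFS pointer file, per the pointer file format.
LFS_POINTER_PREFIX = b"version https://git-lfs.github.com/spec/v1"


def is_lfs_pointer(path) -> bool:
    """Return True if the file looks like a Git LFS pointer, not real data."""
    with open(path, "rb") as handle:
        head = handle.read(len(LFS_POINTER_PREFIX))
    return head == LFS_POINTER_PREFIX


def find_lfs_pointers(root):
    """List files under root that are still LFS pointers (data not fetched)."""
    return sorted(
        p for p in Path(root).rglob("*")
        if p.is_file() and is_lfs_pointer(p)
    )
```

Calling `find_lfs_pointers(".")` from the repository root should return an empty list once the LFS objects have been downloaded; any paths it does return are files whose contents still need to be fetched.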
This repository contains the sbnltk datasets, which were used to build the Bangla NLP toolkit sbnltk.
This repository also serves as a comprehensive collection of existing Bangla NLP datasets created by the amazing Bangla NLP research community.
sbnltk Dataset List (Dump & Human Evaluated) (sbnltk Dataset)
Pre-trained Language Models

| Model | Description | Parameters | Link |
| --- | --- | --- | --- |
| BanglaBERT | ELECTRA-based model, state-of-the-art Bangla NLU | 110M | 🤗 HuggingFace |
| BanglishBERT | Bilingual (Bangla+English) BERT | 110M | 🤗 HuggingFace |
| BanglaBERT (Small) | Lightweight version for resource-constrained environments | 13M | 🤗 HuggingFace |
| BanglaBERT (Large) | Large variant with enhanced performance | 335M | 🤗 HuggingFace |
| Bangla BERT Base | Another popular BERT implementation | 110M | 🤗 HuggingFace |
| Bangla Electra | ELECTRA-based model for Bangla | 13.5M | 🤗 HuggingFace |
Speech Recognition Models

| Model | Description | Performance | Link |
| --- | --- | --- | --- |
| Wav2Vec2-Bangla-300M | Self-supervised speech recognition | 17.8% WER | 🤗 HuggingFace |
| Whisper-Bangla | OpenAI Whisper fine-tuned for Bangla | Various sizes | 🤗 HuggingFace |
| BanglaASR | Fine-tuned ASR model | 14.73% WER | GitHub |
Multilingual Models with Strong Bangla Support

| Model | Description | Languages | Link |
| --- | --- | --- | --- |
| MuRIL | Google's multilingual model with Bangla support | 17 Indian | 🤗 HuggingFace |
| IndicBERT | BERT for Indian languages including Bangla | 12 Indian | 🤗 HuggingFace |
| sahajBERT | ALBERT-based model for Bangla | Bangla (18M parameters) | 🤗 HuggingFace |
Latest Research (2024-2025)
Knowledge Graphs and Semantic Analysis

- BanglaAutoKG: Automatic Bangla Knowledge Graph Construction with Semantic Neural Graph Filtering - LREC-COLING 2024 | Code
  - First framework for automatic Bangla KG construction using multilingual LLMs
  - GNN-based semantic filtering for improved accuracy
- Bangladesh Agricultural Knowledge Graph: Enabling Semantic Integration and Data-driven Analysis - IEEE Access 2024
  - FAIR-compliant agricultural knowledge graph for sustainable farming
Speech and Multimodal Processing

- BanglaDialecto: An End-to-End AI-Powered Regional Speech Standardization - arXiv 2024
  - First end-to-end pipeline for Bangla dialect standardization
  - Achieved 0.8% CER and 1.5% WER for Noakhali dialect
- Wav2Vec2-Bangla (300M) - 🤗 HuggingFace
  - Self-supervised speech model with 17.8% WER
  - Trained on OpenSLR Bangla dataset
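
The WER and CER figures quoted above are both edit-distance ratios: the number of substitutions, insertions, and deletions needed to turn the system output into the reference, divided by the reference length in words (WER) or characters (CER). The following is a minimal sketch of that calculation, not the evaluation code used by any of the cited papers. It assumes whitespace tokenization for words and Unicode code points for characters; real evaluations may normalize text or count grapheme clusters instead.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion from the reference
                curr[j - 1] + 1,     # insertion into the reference
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate, assuming whitespace-separated words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate over Unicode code points (an assumption)."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(list(reference), list(hypothesis)) / len(reference)


# One of three reference words is substituted, so WER is 1/3.
print(f"{wer('আমি ভাত খাই', 'আমি রুটি খাই'):.3f}")  # 0.333
```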
Large Language Models and Generation

- BanglaByT5: Byte-Level Modelling for Bangla - arXiv
- TituLLMs: A Family of Bangla LLMs with Comprehensive Benchmarking - arXiv
- TigerLLM: A Family of Bangla Large Language Models - arXiv
- Bangla/Bengali Seed Dataset for WMT24 - Paper

Evaluation and Benchmarking

- BLUB: A Comprehensive Evaluation Benchmark for Bangla Language Understanding - Research
  - First comprehensive Bangla NLP benchmark with 15+ tasks
- BanglaBook: Large-scale Bangla Dataset for Sentiment Analysis - ACL 2023
  - 158K+ book reviews for sentiment analysis
- Cross-lingual Transfer Learning for Bangla: What Works and What Doesn't - Findings of ACL 2024
Language Models and Pretraining
Task-Specific Research
Speech and Multimodal
Cross-lingual and Multilingual Studies
Modern NLP Tools and Libraries

| Library | Description | Features | Link |
| --- | --- | --- | --- |
| BNLP | Bengali Natural Language Processing Toolkit | Tokenization, Embedding, POS, NER | GitHub |
| BNLTK | Bangla Natural Language Processing Toolkit | Tokenization, Stemming, POS Tagging | GitHub |
| sbnltk | Bangla NLP toolkit (this repository's toolkit) | Comprehensive NLP suite | GitHub |
| bnunicode | Unicode normalization for Bangla text | Bijoy to Unicode, normalization | GitHub |
| pyBanglaKit | Comprehensive Bangla text processing | Tokenization, spell checking, sentiment | GitHub |
| Indic NLP Library | Multi-Indic language processing | Script conversion, transliteration | GitHub |
| BanglaTextProcessor | Advanced text processing pipeline | Dependency parsing, coreference | GitHub |

| Tool | Description | Features | Link |
| --- | --- | --- | --- |
| BanglaOCR | Comprehensive OCR system for Bangla | Print & handwriting recognition | GitHub |
| EasyOCR-Bangla | Ready-to-use OCR solution | Simple Python API | GitHub |
| TesseractBN | Tesseract with Bangla support | Command-line & API access | GitHub |
| BanglaHWR | Handwriting recognition system | Real-time recognition | GitHub |

| Tool | Description | Features | Link |
| --- | --- | --- | --- |
| BanglaVoice | Neural TTS system | Natural speech synthesis | GitHub |
| FastSpeech-Bangla | Fast and robust TTS | Real-time synthesis | GitHub |
| BanglaPhoneme | Phoneme analysis toolkit | IPA transcription support | GitHub |

# BNLP installation
pip install bnlp_toolkit
# BNLTK installation
pip install bnltk
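
Several of the tools listed above advertise Unicode normalization for Bangla text. As a rough illustration of why that step matters, the sketch below uses only Python's standard `unicodedata` module, not the API of bnunicode or any other library listed here. It shows that the vowel sign "ো" can be stored either as one code point (U+09CB) or as two (U+09C7 followed by U+09BE), and that NFC normalization makes the two spellings compare equal. The function name `normalize_bangla` is hypothetical.

```python
import unicodedata


def normalize_bangla(text: str) -> str:
    """Return text in Unicode NFC form so equivalent spellings compare equal."""
    return unicodedata.normalize("NFC", text)


# "কো" typed two ways: they render identically but differ as strings.
composed = "\u0995\u09cb"          # KA + VOWEL SIGN O
decomposed = "\u0995\u09c7\u09be"  # KA + VOWEL SIGN E + VOWEL SIGN AA

print(composed == decomposed)                                      # False
print(normalize_bangla(composed) == normalize_bangla(decomposed))  # True
```

Dedicated Bangla normalizers typically go further than NFC (for example, handling legacy encodings or stray joiners), so this is only the first step of a cleaning pipeline.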
Benchmarking and Evaluation
Bangla Language Understanding Benchmark (BLUB)

| Task | Dataset | Metric | Best Model | Score |
| --- | --- | --- | --- | --- |
| Sentiment Classification | SentNoB | Macro-F1 | BanglaBERT | 72.89 |
| Natural Language Inference | BNLI | Accuracy | BanglaBERT (Large) | 83.41 |
| Named Entity Recognition | MultiCoNER | Micro-F1 | BanglaBERT (Large) | 79.20 |
| Question Answering | BQA/TyDiQA | EM/F1 | BanglaBERT (Large) | 76.10/81.50 |
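
The benchmark reports Macro-F1 for sentiment classification and Micro-F1 for NER, and the two can diverge sharply on imbalanced data. The sketch below shows the difference for plain single-label classification; it is an illustration under that assumption, not the benchmark's scoring script. NER Micro-F1 in particular is normally computed over entity spans, which this example does not attempt. The function name `f1_scores` and the toy labels are made up for the example.

```python
from collections import Counter


def f1_scores(y_true, y_pred):
    """Return (macro_f1, micro_f1) for single-label classification."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted label gains a false positive
            fn[truth] += 1  # true label gains a false negative

    def f1(tp_count, fp_count, fn_count):
        denom = 2 * tp_count + fp_count + fn_count
        return 2 * tp_count / denom if denom else 0.0

    # Macro: average the per-class F1 scores, so every class counts equally.
    macro = sum(f1(tp[l], fp[l], fn[l]) for l in labels) / len(labels)
    # Micro: pool the counts first, so frequent classes dominate.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    return macro, micro


# A classifier that always predicts "pos" on a 3:1 imbalanced set.
macro, micro = f1_scores(["pos", "pos", "pos", "neg"], ["pos"] * 4)
print(f"macro={macro:.3f} micro={micro:.3f}")  # macro=0.429 micro=0.750
```

The always-positive classifier looks acceptable under Micro-F1 but is penalized under Macro-F1 because it never finds the minority class, which is why Macro-F1 is a common choice for imbalanced sentiment data.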
Recent Datasets for Benchmarking

| Dataset | Task | Size | Description | Link |
| --- | --- | --- | --- | --- |
| BanglaBook | Sentiment Analysis | 158,065 samples | Book reviews sentiment analysis | GitHub |
| SentMix-3L | Code-Mixed Sentiment | 1,007 samples | Bangla-English-Hindi code-mixed | GitHub |
| Awesome Bangla Datasets | Various | Multiple | Comprehensive collection | GitHub |

Note: I do not own the datasets listed below. This is simply a collection to help you find these amazing people and their work.
Please support them! Your support will encourage them to keep doing amazing things.
Awesome Dataset Sources
News Articles and Documents

| Dataset | Description | Link |
| --- | --- | --- |
| Wiki Articles | Wikipedia Articles in Bangla | Kaggle |
| Bangladesh Protidin | News from Bangladesh Protidin | Kaggle |
| 40k News Articles | 40k Bangla Newspaper Articles | Kaggle |
| Largest News Dataset | Bangla Largest Newspaper Dataset | Kaggle |
| Wikipedia Dumps | All types of Wikipedia Articles | Wiki Dumps |
| bdNews24 Corpus | bdNews24 largest dataset | Kaggle |

Speech to Text / Text to Speech

| Dataset | Description | Size | Link |
| --- | --- | --- | --- |
| OpenSLR Bangla | Large-scale speech corpus | 250+ hours, 2000+ speakers | OpenSLR |
| Common Voice Bangla | Crowdsourced speech data | 500+ hours (growing) | Mozilla |
| FLEURS Bangla | Cross-lingual speech corpus | 12 hours | 🤗 HuggingFace |
| BanglaASR Dataset | Fine-tuned ASR corpus | 23.8 hours | GitHub |
| Text to Speech | Bengali Text to Speech Dataset | Studio quality | Bengali.ai |
| Speech Recognition | Bengali Automatic Speech Recognition Dataset | Various speakers | Bengali.ai |
| Regional Dialect ASR | Dialect-specific speech recognition | 100+ hours, 8 dialects | GitHub |
| Multi-Speaker TTS | Multiple speaker TTS corpus | 20 hours, 10 speakers | GitHub |
| Expressive TTS Dataset | Emotional speech synthesis | 15 hours, 8 emotions | GitHub |
| Handwritten Digits | Numta Handwritten Bengali Digits | Visual recognition | Bengali.ai |

Sentiment Analysis / Sentence Classification

| Dataset | Description | Link |
| --- | --- | --- |
| BanglaBook | Large-scale book reviews (158K samples) | GitHub |
| SentMix-3L | Code-mixed sentiment (Bangla-English-Hindi) | GitHub |
| Social Media Comments | Bangla Text Dataset from Social Media | GitHub |
| Sentiment Analysis | Bengali Sentiment Text | Kaggle |
| News Classification | Classification Bengali News Articles | Kaggle |
| Drama Review | Bangla Drama Review Dataset | Figshare |
| News Comments | Bengali News Comments Sentiment | Kaggle |
| News Headlines | News Headline Classification | Kaggle |
| Big News Classification | Bangla Newspaper Article Classification (Large) | Kaggle |
| YouTube Sentiment | Bangla YouTube Sentiment/Emotion Dataset | Kaggle |
| Multilingual Sentiment | Sentiment Lexicons for 81 Languages | Kaggle |
| Twitter Dataset | Twitter Sentiment Analysis Dataset | GitHub |
| EmoNoBa | Emotion analysis on noisy Bangla texts | GitHub |
| SentiGOLD | Multi-domain sentiment analysis | GitHub |
| Bangla Emotion Corpus | Comprehensive emotion detection | GitHub |
| Social Media Sentiment | Social media specific sentiment | GitHub |
| Bangla Fake News Detection | Misinformation detection dataset | Kaggle |
| BanglaSarc | Sarcasm detection dataset | GitHub |
| Complaint Classification | Customer complaint categorization | GitHub |

Bangla Machine Translation Dataset

| Dataset | Description | Link |
| --- | --- | --- |
| 2.5M Pairs | 2.5M sentence pairs (no longer low-resource) | GitHub |
| WMT24 Seed Dataset | High-quality manual translations | Paper |
| TED Dataset | TED dataset (small) | Download |
| Bangla Dictionary | Bengali Dictionary | GitHub |
| SUPara Dataset | SUPara0.8M Balanced English-Bangla Parallel Corpus | IEEE DataPort |
| Samanantar | Large-scale parallel corpus | AI4Bharat |
| OPUS Collections | Multiple parallel corpora | OPUS |
| Indic-Indic Translation | Inter-Indic language translation | GitHub |
| BanglaDialectTranslation | Regional dialect to standard Bangla | GitHub |
| Vashantor | Multi-regional dialect corpus | GitHub |
| Legal Translation Corpus | Legal document translation | GitHub |
| Medical Translation Dataset | Healthcare translation | GitHub |

Bangla POS Tag Dataset

| Dataset | Description | Link |
| --- | --- | --- |
| 3k Sentences | 3k POS tag sentences | GitHub |
| 100k+ Words | Single word tagging 100k+ | Kaggle |

Bangla NER Dataset

| Dataset | Description | Link |
| --- | --- | --- |
| 70k Sentences | 70k sentences with 5 types of NER | GitHub |
| 400k+ Words | Word-level NER 400k+ | Kaggle |
| B-NER | Comprehensive Bangla NER dataset | GitHub |
| BanglaPersonNER | Person name extraction | GitHub |
| Complex NER Dataset | Multi-type entity recognition | GitHub |
| Medical NER Dataset | Healthcare entity recognition | GitHub |
| Financial NER Corpus | Finance domain entities | GitHub |
| Legal Entity Recognition | Legal document entity extraction | GitHub |
| Bangladesh Geographic NER | Location entity recognition | GitHub |

Question Answering Dataset

| Dataset | Description | Link |
| --- | --- | --- |
| Squad 2.0 Style | Question Answering Squad 2.0 in Bangla | Kaggle |
| BanglaRQA | Reading comprehension dataset | GitHub |
| SQuAD-BN | Bangla version of SQuAD | GitHub |
| Contextual QA Dataset | Multi-context question answering | GitHub |
| Medical QA Bangla | Healthcare question answering | GitHub |
| Legal QA Dataset | Legal question answering | GitHub |
| Educational QA Corpus | Academic question answering | GitHub |
| Bangla Conversational QA | Multi-turn question answering | GitHub |
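
SQuAD-style QA datasets such as those above are usually scored with exact match (EM) and token-level F1, the same metric pair reported for Question Answering in the benchmark table earlier. The following is a simplified sketch of both metrics, assuming whitespace tokenization. The official SQuAD script additionally lowercases and strips punctuation and English articles, and a Bangla evaluation may apply its own normalization, so treat the function names and behaviour here as illustrative only.

```python
from collections import Counter


def exact_match(prediction: str, reference: str) -> float:
    """1.0 if the answers match after whitespace normalization, else 0.0."""
    return float(prediction.split() == reference.split())


def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token precision and recall between two answers."""
    pred_tokens = prediction.split()
    ref_tokens = reference.split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most as often
    # as it appears in both answers.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


reference = "ঢাকা বাংলাদেশের রাজধানী"
prediction = "ঢাকা রাজধানী"
print(exact_match(prediction, reference))         # 0.0
print(f"{token_f1(prediction, reference):.2f}")   # 0.80
```

The partial answer scores zero on EM but 0.80 on F1 (precision 1.0, recall 2/3), which is why the two numbers are normally reported together.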
Bangla Text Summarization

| Dataset | Description | Link |
| --- | --- | --- |
| Article Summarization | Articles Summarization (extractive & abstractive) | Kaggle |
| BANSData | Dataset for Bengali Abstractive News Summarization | Kaggle |
| 3 Human Evaluated | Articles with 3 human evaluated summaries | BNLPC |
| BenSum | Bangla news summarization | GitHub |
| BanglaNewsSummarization | Extended news corpus | GitHub |
| BUSUM | Multi-document summarization | GitHub |
| Academic Paper Summarization | Research paper summarization | GitHub |
| Book Chapter Summarization | Literature summarization | GitHub |

Bangla Fake News Detection

| Dataset | Description | Link |
| --- | --- | --- |
| 50k Fake News | 50k Bangla fake news dataset | Kaggle |

Handwriting Recognition / OCR

| Dataset | Description | Link |
| --- | --- | --- |
| Ekush | Bangla Handwritten Characters | Website |
| Bayanno | Multi-purpose handwritten dataset | Mendeley |
| BN-HTRd | Document Level Offline Bangla HTR (108k words) | Mendeley |
| Bongabdo | Bangla handwritten script dataset | Research Paper |
| BanglaOCR Dataset | Comprehensive OCR training data | GitHub |
| BanglaHWR Dataset | Handwriting recognition corpus | GitHub |
| Document Layout Analysis | Document understanding dataset | GitHub |

Knowledge Graphs and Information Extraction

| Dataset | Description | Link |
| --- | --- | --- |
| BanglaAutoKG | Automatic knowledge graph construction | GitHub |
| Bangladesh Agricultural KG | Agricultural data integration | IEEE Access |
| Bangla Wikipedia Knowledge Graph | Structured Wikipedia knowledge | GitHub |
| Bangla Event Extraction | News event extraction | GitHub |
| Social Media Event Detection | Real-time event detection | GitHub |
| Bangla Relation Extraction | Entity relationship extraction | GitHub |
| Knowledge Base Relations | Structured knowledge extraction | GitHub |
| Aspect-Based Opinion Mining | Detailed opinion analysis | GitHub |
| Bangla Semantic Textual Similarity | Sentence similarity dataset | GitHub |
| Concept Mapping Dataset | Conceptual relationship mapping | GitHub |
| Bangla WordNet | Lexical semantic network | GitHub |

Corpus and Language Modeling

| Dataset | Description | Size | Link |
| --- | --- | --- | --- |
| BanglaLM | Large language modeling corpus | 27.5 GB | GitHub |
| Indic Corpus | Multi-lingual Indic corpus | 6.5 GB Bangla | AI4Bharat |
| CC-100 Bangla | CommonCrawl Bangla subset | 8.3 GB | StatMT |
| OSCAR Bangla | Web-crawled multilingual corpus | 12 GB | OSCAR |
| Bangla Poetry Corpus | Classical and modern poetry | 25,000+ poems | GitHub |
| Literary Text Collection | Bangla literature corpus | 10,000+ books | GitHub |
| Academic Text Corpus | Scholarly text collection | 50,000+ papers | GitHub |
| Bangla Morphological Analyzer | Morphological analysis dataset | 100,000+ word-morpheme pairs | GitHub |
| Phonetic Transcription Corpus | IPA transcription dataset | 50,000+ word-pronunciation pairs | GitHub |

Multimodal Datasets

| Dataset | Description | Size | Link |
| --- | --- | --- | --- |
| Bangla Image Captioning | Image description generation | 50,000+ image-caption pairs | GitHub |
| Visual Question Answering Bangla | Visual reasoning dataset | 25,000+ image-question-answer | GitHub |
| Bangla Video Captioning | Video description dataset | 5,000+ video-caption pairs | GitHub |
| Sign Language Recognition | Bangla sign language dataset | 10,000+ sign videos | GitHub |
| Music-Text Alignment | Song lyrics alignment | 2,000+ song-lyric pairs | GitHub |

| Dataset | Description | Link |
| --- | --- | --- |
| Numbers with Words | Bengali numbers with words | Kaggle |
| Image to Text | Bangla Natural Language Image to Text (BnLiT) | Kaggle |

Coming soon...
Usage and Contributing
Usage documentation and contribution guidelines are coming soon.

- For pre-trained models: visit the HuggingFace model hub links above
- For tools: install Python libraries such as BNLP or BNLTK
- For datasets: follow the individual dataset links and their instructions
- For research: check out the latest papers and benchmarks

Ways to contribute:

- Submit new datasets through pull requests
- Report issues or broken links
- Suggest improvements to the documentation
- Share your research findings

If you find this repository helpful, please give it a star!
Contributions are welcome! Feel free to submit issues and pull requests.
Questions? Open an issue or contact the maintainers.
Special thanks to all the researchers and developers who have contributed to Bangla NLP!
If this repository has been helpful to you, consider supporting the project.
Your support helps maintain and improve this resource for the Bangla NLP community!