BenNumEval is a novel benchmark designed to evaluate the numerical reasoning abilities of Large Language Models (LLMs) in the Bengali language. It introduces six diverse task categories and a high-quality dataset containing over 3,200 examples derived from educational and real-world sources.
Despite recent advances in LLMs, their performance in numerical reasoning, especially in low-resource languages like Bengali, remains significantly limited. BenNumEval addresses this gap by:
- Providing a structured evaluation framework for Bengali numerical reasoning.
- Highlighting the limitations of LLMs in handling arithmetic, logical, and domain-specific reasoning in a linguistically rich yet underrepresented language.
BenNumEval includes 3,255 curated examples divided into six task types:
| Task Type | Description | # Examples |
|---|---|---|
| Commonsense + Arithmetic (CA) | Problems combining arithmetic with common-sense knowledge | 410 |
| Domain-Specific (DS) | Problems requiring domain knowledge (e.g., physics, chemistry, CS) | 705 |
| Commonsense + Quantitative (CQ) | Simple comparisons based on everyday logic | 400 |
| Fill-in-the-Blanks (FiB) | Arithmetic word problems in fill-in-the-blank style | 665 |
| Quantitative NLI (QNLI) | Natural language inference involving numerical understanding | 425 |
| Arithmetic Word Problems (AWP) | Real-world word problems requiring arithmetic reasoning | 650 |
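A minimal loading sketch in Python, assuming the dataset ships as JSONL with `task`, `question`, and `answer` fields; the file name and field names here are illustrative, not the official schema:

```python
import json
from collections import Counter

def load_examples(path="bennumeval.jsonl"):
    """Load one JSON object per line; each record is assumed to carry
    a task tag (CA, DS, CQ, FiB, QNLI, AWP), a Bengali question, and
    a gold answer."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f]

examples = load_examples()
# Per-task counts should match the table above (e.g., 410 for CA).
print(Counter(ex["task"] for ex in examples))
```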
We evaluate LLMs using three prompting strategies (a prompt-construction sketch follows this list):
- BNaP (Bengali Native Prompting): Fully Bengali prompts and responses.
- XLP (Cross-Lingual Prompting): Bengali questions, English responses.
- XCoT (Cross-Lingual Chain-of-Thought): Bengali inputs, step-by-step English solutions using CoT reasoning.
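An illustrative sketch of how the three strategies differ at the prompt level; the instruction wording below is hypothetical, only the strategy definitions come from the benchmark:

```python
def build_prompt(question_bn: str, strategy: str) -> str:
    """Wrap a Bengali question according to one of the three strategies."""
    if strategy == "BNaP":
        # Bengali Native Prompting: instruction and answer both in Bengali.
        # The instruction reads "Answer the following question in Bengali."
        return f"নিম্নলিখিত প্রশ্নের উত্তর বাংলায় দিন।\nপ্রশ্ন: {question_bn}\nউত্তর:"
    if strategy == "XLP":
        # Cross-Lingual Prompting: Bengali question, English response.
        return f"Answer the following Bengali question in English.\nQuestion: {question_bn}\nAnswer:"
    if strategy == "XCoT":
        # Cross-Lingual Chain-of-Thought: step-by-step English reasoning.
        return ("Solve the following Bengali problem. Reason step by step "
                f"in English, then state the final answer.\nQuestion: {question_bn}\nSolution:")
    raise ValueError(f"unknown strategy: {strategy}")
```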
We benchmark five LLMs:
- GPT-4o
- Gemini-2.0-flash
- Llama-3.3-70B
- DeepSeek-R1-Distill-Llama-70B
- Mathstral-7B
Accuracy is measured using Exact-Match (EM), and we also report human baseline performance for comparison.
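A sketch of EM scoring under the assumption that gold answers are numeric strings; mapping Bengali digits (০–৯) to ASCII before comparison is our normalization choice, not necessarily the paper's protocol:

```python
# Translation table from Bengali to ASCII digits.
BN_TO_EN = str.maketrans("০১২৩৪৫৬৭৮৯", "0123456789")

def normalize(ans: str) -> str:
    return ans.strip().translate(BN_TO_EN)

def em_accuracy(preds, golds):
    """Fraction of predictions that exactly match the gold answer
    after digit normalization and whitespace stripping."""
    hits = sum(normalize(p) == normalize(g) for p, g in zip(preds, golds))
    return hits / len(golds)
```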
Key findings:
- Top models (e.g., GPT-4o, Gemini-2.0-flash) achieve ~70–80% accuracy under cross-lingual prompting.
- Human performance remains significantly higher at ~98%, revealing persistent challenges in LLM numerical reasoning.
- Domain-specific and QNLI tasks remain the hardest across all models and prompts.
We report two error metrics:
- Wrong Format (wf%): output not in the required structure.
- Wrong Calculation (wc%): incorrect answer despite correct formatting.
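One plausible way to compute the two rates, assuming an output counts as well-formatted when a final numeric answer can be extracted; the regex and the "last number wins" extraction rule are our assumptions:

```python
import re

BN_TO_EN = str.maketrans("০১২৩৪৫৬৭৮৯", "0123456789")
# Matches integers or decimals written with ASCII or Bengali digits.
NUM_RE = re.compile(r"[-+]?[\d০-৯]+(?:\.[\d০-৯]+)?")

def classify(output: str, gold: str) -> str:
    nums = NUM_RE.findall(output)
    if not nums:
        return "wf"  # Wrong Format: no parseable answer in the output
    pred = nums[-1].translate(BN_TO_EN)  # take the last number as the answer
    if pred != gold.strip().translate(BN_TO_EN):
        return "wc"  # Wrong Calculation: parseable but incorrect
    return "correct"

def error_rates(outputs, golds):
    labels = [classify(o, g) for o, g in zip(outputs, golds)]
    n = len(labels)
    return {"wf%": 100 * labels.count("wf") / n,
            "wc%": 100 * labels.count("wc") / n}
```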
Our analysis shows that advanced prompting strategies such as XCoT improve reasoning accuracy but sometimes increase formatting errors (wf%).
If you use BenNumEval in your work, please cite:
@inproceedings{ahmed2025bennumeval,
title={BenNumEval: A Benchmark to Assess LLMs' Numerical Reasoning Capabilities in Bengali},
author={Ahmed, Kawsar and Osama, Md and Sharif, Omar and Hossain, Eftekhar and Hoque, Mohammed Moshiul},
booktitle={Findings of the Association for Computational Linguistics: ACL 2025},
pages={17782--17799},
year={2025}
}

BenNumEval is released under the MIT License.