riju-talk/Quench-plus-plus


QUENCH++

Indic Language × Chain-of-Thought Stress Testing for Small Language Models

License: MIT · Python 3.9+

🎯 Research Question

How do Indic languages and chain-of-thought prompting—individually and jointly—affect general reasoning performance and reasoning structure in small language models?

QUENCH++ is a rigorous research framework that systematically tests:

  1. Language stress (English → Indic)
  2. CoT stress (No-CoT → CoT)
  3. Combined stress (both simultaneously)

🔬 Core Contributions

1. TIRD Metric (NEW)

Translation-Induced Reasoning Drift: quantifies how much the reasoning structure changes when a question is translated across languages.

TIRD = 1 - Jaccard(Steps_EN, Steps_Indic)
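In code, the metric can be sketched as follows (this assumes the reasoning steps have already been extracted and normalized into comparable labels; the exact step representation used in the notebooks may differ):

```python
def tird(steps_en, steps_indic):
    """Translation-Induced Reasoning Drift: 1 minus the Jaccard
    similarity between the reasoning-step sets of the English
    and Indic chains."""
    a, b = set(steps_en), set(steps_indic)
    if not a and not b:
        return 0.0  # two empty chains: no drift by convention
    return 1.0 - len(a & b) / len(a | b)
```

Identical step sets give 0.0 (no drift); disjoint step sets give 1.0 (complete drift).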

2. Symbolic CoT Analysis

Goes beyond accuracy to measure:

  • Faithfulness: Do reasoning steps support the answer?
  • Complexity: Reasoning depth vs verbosity
  • Stability: Cross-language reasoning consistency
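One way to operationalize the complexity axis is to separate depth (number of steps) from verbosity (token count). The formulation below is an illustrative sketch, not necessarily the one implemented in cot_complexity.py:

```python
def reasoning_complexity(steps):
    """Depth vs. verbosity for a chain of reasoning steps.
    Whitespace tokenization is used here as a rough proxy."""
    n_steps = len(steps)
    n_tokens = sum(len(s.split()) for s in steps)
    return {
        "depth": n_steps,          # how many discrete steps
        "verbosity": n_tokens,     # how many tokens overall
        "tokens_per_step": n_tokens / n_steps if n_steps else 0.0,
    }
```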

3. 2×2 Experimental Design

Systematically isolates effects:

| Setting ID | Language | Reasoning | Description          |
|------------|----------|-----------|----------------------|
| S1         | English  | No-CoT    | QUENCH baseline      |
| S2         | English  | CoT       | CoT stress test      |
| S3         | Indic    | No-CoT    | Language stress test |
| S4         | Indic    | CoT       | Full stress test     |
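The four settings can be generated programmatically from the two factors (a small sketch; the setting IDs follow the order in the table above, with language varying slowest):

```python
from itertools import product

LANGUAGES = ["English", "Indic"]
REASONING = ["No-CoT", "CoT"]

# S1..S4 in table order: (English, No-CoT), (English, CoT), ...
SETTINGS = {
    f"S{i}": {"language": lang, "reasoning": cot}
    for i, (lang, cot) in enumerate(product(LANGUAGES, REASONING), start=1)
}
```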

📁 Repository Structure

quench-plus-plus/
│
├── data/
│   ├── original/           # Original QUENCH dataset
│   └── translated/         # Hindi, Bengali, Marathi
│
├── models/
│   └── inference_wrappers/ # Qwen3, DeepSeek, Gemma2, LLaMA
│
├── evaluation/
│   ├── accuracy.py         # Accuracy calculation
│   ├── quench_gap.py       # Gap decomposition
│   ├── cot_faithfulness.py # Faithfulness metrics
│   ├── cot_complexity.py   # Complexity metrics
│   └── cot_stability.py    # TIRD calculation
│
├── notebooks/
│   ├── eng-hin.ipynb       # Hindi translation
│   ├── eng-bang.ipynb      # Bengali translation
│   ├── eng-mar.ipynb       # Marathi translation
│   ├── 01_translation_validation.ipynb
│   ├── 02_baseline_inference.ipynb
│   ├── 03_cot_inference.ipynb
│   ├── 04_cot_annotation.ipynb
│   ├── 05_symbolic_cot_metrics.ipynb
│   └── 06_final_analysis.ipynb
│
└── paper/                  # Paper-ready outputs

🚀 Quick Start

1. Setup Environment

# Create virtual environment
python -m venv venv
source venv/bin/activate  # Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

# Run setup script
python setup_project.py

2. Configure API Keys

Edit config/config.yaml:

api_keys:
  openai: "your-openai-api-key-here"
  huggingface: "your-hf-token-here"

3. Prepare Dataset

Place QUENCH dataset at: data/original/quench.json
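Loading the dataset can then be as simple as the sketch below (the JSON schema is not specified here, so the snippet only assumes a well-formed top-level JSON document):

```python
import json
from pathlib import Path

def load_quench(path="data/original/quench.json"):
    """Load the QUENCH dataset, failing with a clear message if missing."""
    p = Path(path)
    if not p.exists():
        raise FileNotFoundError(
            f"QUENCH dataset not found at {p}. "
            "Place quench.json under data/original/ first."
        )
    with p.open(encoding="utf-8") as f:
        return json.load(f)
```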

4. Run Experiments

Execute notebooks in order (01 through 06).


📊 Key Findings

  1. Language Translation Degrades Performance: 8-12% average drop
  2. CoT Shows Mixed Effects: Can help or hurt depending on model
  3. Full Stress Compounds Effects: 15-20% total degradation
  4. Translation Changes Reasoning Structure: 34% average TIRD score

🔑 Novel Contributions

  • TIRD Metric: First quantification of translation-induced reasoning drift
  • Symbolic CoT Analysis: Goes beyond accuracy to understand reasoning changes
  • 2×2 Design: Isolates individual and interaction effects
  • Research Finding: Translation changes how models reason, not just accuracy

🛠️ Model Support

  • Qwen3 (reasoning-capable)
  • DeepSeek (baseline)
  • Gemma 2 (9B)
  • LLaMA 3 (8B)

🌐 Language Support

  • Hindi (Devanagari)
  • Bengali
  • Marathi

Translation uses NLLB-200 by default.


📝 Experimental Controls

  • Answer Preservation: Answers are kept in their original form
  • No Dataset Contamination: The same questions are used across all settings
  • GPT as Annotator Only: GPT annotates reasoning steps; it is not used to judge correctness
  • Statistical Rigor: Paired t-tests and bootstrap confidence intervals
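A stdlib-only sketch of the two statistical procedures (a paired t statistic over per-item score differences, and a percentile bootstrap confidence interval for the mean difference; the notebooks may use scipy instead):

```python
import math
import random
import statistics

def paired_t_statistic(scores_a, scores_b):
    """t statistic for paired samples (e.g. per-question scores under
    two settings). Compare against a t distribution with n-1 dof."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    sd = statistics.stdev(diffs)
    return statistics.mean(diffs) / (sd / math.sqrt(n))

def bootstrap_ci(diffs, iters=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean paired difference."""
    rng = random.Random(seed)
    means = sorted(
        statistics.mean(rng.choices(diffs, k=len(diffs)))
        for _ in range(iters)
    )
    lo = means[int(iters * alpha / 2)]
    hi = means[int(iters * (1 - alpha / 2)) - 1]
    return lo, hi
```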


🎓 Citation

@inproceedings{yourname2025quenchpp,
  title={QUENCH++: Indic Language and Chain-of-Thought Stress Testing for Small Language Models},
  author={Your Name},
  booktitle={Proceedings of ACL 2025},
  year={2025}
}

📄 License

MIT License


📧 Contact

For questions: Open a GitHub issue or contact the maintainers.


Built for rigorous, culturally aware LLM evaluation.

About

Quench++ extends Indic reasoning benchmarks with bias injection, three new languages, and structured Chain-of-Thought cause-effect generation in Boolean logic. It enables robust evaluation of LLM trustworthiness, reasoning, and bias mitigation through reproducible Jupyter notebooks.
