HTW Berlin Logo

Multilingual Pipeline for Generating, Solving, and Evaluating Math Problems with LLMs

(Edu4AI Workshop @ ECAI 2025)

This repository contains the codebase, datasets, and analysis scripts accompanying the paper:

"Investigating Bias: A Multilingual Pipeline for Generating, Solving, and Evaluating Math Problems with LLMs"

Presented at: Edu4AI 2025: 2nd Workshop on Education for Artificial Intelligence,
co-located with the 28th European Conference on Artificial Intelligence (ECAI 2025),
Bologna, Italy.


Project Overview

Large Language Models (LLMs) are increasingly used to support students and teachers in education. However, most commercial models are trained on overwhelmingly English-centric data, raising concerns about language-based disparities in clarity, accuracy, and pedagogical quality.

This project introduces a scalable, automated pipeline to systematically investigate these disparities. Using 628 curriculum-aligned math exercises (Grades 2–10, German K–10 curriculum), we translate, solve, and evaluate problems across English, German, and Arabic.

Three LLMs (GPT-4o-mini, Gemini-2.5-Flash, and Qwen-Plus) produced solutions, which were then evaluated by a panel of LLM judges including Claude 3.5 Haiku.

Our results highlight persistent linguistic bias in educational AI, underscoring the need for more inclusive training and evaluation practices.


Research Team

This project was conducted at HTW Berlin – Hochschule für Technik und Wirtschaft Berlin within the KIWI Project.


Pipeline Overview

The multilingual pipeline is organized into four major stages:

1. Dataset Preparation

  • Problem generation: 628 math exercises were created in English, aligned with the German K–10 curriculum. Exercises were generated using ChatGPT-4o in a manual, iterative process guided by curriculum-aligned topic names and learning objectives.
  • Translation: Exercises were translated into German and Arabic.
  • Technical terms extraction: Key mathematical terms were identified for each exercise to support later evaluation.

2. Solution Generation

  • Three commercial LLMs — GPT-4o-mini, Gemini-2.5-Flash, and Qwen-Plus — produced step-by-step solutions for all exercises in English, German, and Arabic.

Note that each model uses its own dedicated solver script, since APIs differ in setup and calling conventions:

  • GPT-4o-mini: gpt_4o_mini/_2_solve_tasks_gpt.py
  • Gemini-2.5-Flash: gemini/_2_solve_tasks_gemini.py
  • Qwen-Plus: qwen_plus/_2_solve_tasks_qwen.py

Each script loads the corresponding API key, constructs prompts in the target language, and writes the model’s step-by-step solutions to CSV files.
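
The per-model solver scripts share the same overall shape. Below is a minimal sketch of the prompt construction and CSV output steps; the template wording, function names, and CSV columns here are illustrative assumptions, not the exact code in `_2_solve_tasks_*.py`:

```python
import csv

# Hypothetical prompt templates per target language; the actual solver
# scripts may word these prompts differently.
PROMPTS = {
    "en": "Solve the following math exercise step by step:\n\n{task}",
    "de": "Löse die folgende Mathematikaufgabe Schritt für Schritt:\n\n{task}",
    "ar": "حل التمرين الرياضي التالي خطوة بخطوة:\n\n{task}",
}

def build_prompt(task: str, language: str) -> str:
    """Construct the step-by-step solving prompt in the target language."""
    return PROMPTS[language].format(task=task)

def write_solutions(rows: list[dict], path: str) -> None:
    """Write model solutions to a CSV file, one row per exercise.
    Column names are illustrative; the repository's CSVs may differ."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["task_id", "language", "solution"])
        writer.writeheader()
        writer.writerows(rows)
```

The model-specific API call (OpenAI, Gemini, or DashScope) would sit between these two steps, which is why each model keeps its own script.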

3. Evaluation

  • Judging framework: Solutions were assessed by a panel of LLM judges.
  • Held-out strategy: The model under evaluation was excluded from the judging panel and replaced with a neutral model (Claude 3.5 Haiku).
  • Randomized presentation: Solutions were shown in shuffled order to reduce position bias.
  • Comparative assessment: Judges ranked the three solutions (1 = best, 3 = worst) and provided short justifications.
  • Majority voting: Rankings were aggregated across judges to determine the best-performing language per exercise. If no consensus was reached, the result was labeled as a tie (TIE).
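
The aggregation step above can be sketched as follows. This is a minimal illustration assuming each judge's top-ranked language casts one vote and a tie in vote counts yields "TIE"; the exact tie-breaking rules live in `_5_pairwise_results_clean.py`:

```python
from collections import Counter

def majority_vote(judge_rankings: list[dict]) -> str:
    """Aggregate per-judge rankings (language -> rank, 1 = best) into a
    single winner for one exercise, or 'TIE' when no language prevails."""
    votes = Counter(
        min(ranking, key=ranking.get)  # the language this judge ranked 1st
        for ranking in judge_rankings
    )
    (best, count), *rest = votes.most_common()
    if rest and rest[0][1] == count:
        return "TIE"  # no single language has a strict plurality
    return best
```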

4. Justifications Analysis

  • N-gram analysis, sentiment analysis, and topic modeling of judge justifications.
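
As an illustration of the n-gram part of this stage, the sketch below counts the most frequent word bigrams across justification texts. It is a simplified stand-in, not the code in `_6_justifications_analysis.py`, which additionally covers sentiment and topic modeling:

```python
import re
from collections import Counter

def top_ngrams(texts: list[str], n: int = 2, k: int = 5) -> list[tuple[str, int]]:
    """Return the k most frequent word n-grams across a list of texts."""
    counts: Counter = Counter()
    for text in texts:
        tokens = re.findall(r"\w+", text.lower())  # Unicode-aware tokenization
        counts.update(
            " ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)
        )
    return counts.most_common(k)
```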

Repository Structure

SOLUTION_EVALUATOR/
│
├── archive/                     # Old experiments, preliminary query sets, early analyses
├── assets/                      # Static assets for the repository
│
├── gpt_4o_mini/                 # Outputs and results for GPT-4o-mini
│   ├── _2_solve_tasks_gpt.py    # Solve tasks with GPT-4o-mini
│   └── (other model-specific output files...)
│
├── gemini/                      # Outputs and results for Gemini-2.5-Flash
│   ├── _2_solve_tasks_gemini.py # Solve tasks with Gemini-2.5-Flash
│   └── (other model-specific output files...)
│
├── qwen_plus/                   # Outputs and results for Qwen-Plus
│   ├── _2_solve_tasks_qwen.py   # Solve tasks with Qwen-Plus
│   └── (other model-specific output files...)
│
├── _1_translate_tasks.py        # Translate exercises into target language
├── _3_technical_terms.py        # Extract key technical terms per exercise
├── _4_pairwise_evaluation.py    # Pairwise ranking evaluation with justifications
├── _5_pairwise_results_clean.py # Majority-vote cleaning & heatmap visualization
├── _6_justifications_analysis.py # Justification text analysis (n-grams, sentiment, topics)
│
├── justifications_analysis.ipynb         # Notebook for justification analysis
├── test_and_eval_pipeline-gpt.ipynb      # Full GPT pipeline
├── test_and_eval_pipeline-gemini.ipynb   # Full Gemini pipeline
├── test_and_eval_pipeline-qwen.ipynb     # Full Qwen pipeline
│
├── math_exercises_en_en.csv     # Original English dataset
├── math_exercises_en_de.csv     # Translated German dataset
├── math_exercises_en_ar.csv     # Translated Arabic dataset
├── technical_terms_en.csv       # Extracted terms (English)
├── technical_terms_de.csv       # Extracted terms (German)
├── technical_terms_ar.csv       # Extracted terms (Arabic)
│
├── Part_C_Mathematics_2015_11_10.pdf  # Reference German curriculum
├── requirements.txt                   # Minimal dependencies
├── README.md                    # Project overview (this file)

Installation & Setup

# Clone the repository
git clone https://github.com/iug-htw/solution_evaluator.git
cd solution_evaluator

# Create a virtual environment
python -m venv venv
source venv/bin/activate  # or venv\Scripts\activate on Windows

# Install dependencies
pip install -r requirements.txt

API Keys

This project requires access to:

  • OpenAI API Key (for GPT-4o-mini and as a judge)
  • Google Gemini API Key (for Gemini-2.5-Flash and as a judge)
  • Alibaba Cloud API Key (for Qwen-Plus and as a judge)
  • Anthropic API Key (for Claude 3.5 Haiku as a judge)

Example .env File

OPENAI_API_KEY=your_openai_key_here
GEMINI_API_KEY=your_gemini_key_here
DASHSCOPE_API_KEY=your_alibaba_key_here
ANTHROPIC_API_KEY=your_anthropic_key_here
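
The scripts read these keys at startup. A minimal sketch of a .env parser is shown below for illustration; the repository's code may instead rely on a library such as python-dotenv:

```python
def parse_env(text: str) -> dict[str, str]:
    """Parse KEY=value lines from .env-style text, skipping blank lines
    and comments. A simplified sketch, not the project's actual loader."""
    env: dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        env[key.strip()] = value.strip()
    return env
```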

Reproducing the Experiment

This repository includes a detailed, step-by-step guide for reproducing the full multilingual evaluation pipeline, including translation, solution generation, LLM-based judging, and result aggregation.

➡️ See REPRODUCIBILITY.md for full instructions.


Citation

If you use this code, please cite:

@inproceedings{MahranSimbeck2025Edu4AI,
  author    = {Mariam Mahran and Katharina Simbeck},
  title     = {Investigating Bias: A Multilingual Pipeline for Generating, Solving, and Evaluating Math Problems with LLMs},
  booktitle = {Proceedings of the 2nd International Workshop on Education for Artificial Intelligence (edu4AI 2025), ECAI},
  series    = {CEUR Workshop Proceedings},
  volume    = {4114},
  year      = {2025},
  address   = {Bologna, Italy},
  url       = {https://ceur-ws.org/Vol-4114/6_paper.pdf},
  issn      = {1613-0073}
}



Acknowledgments

This work was carried out as part of the KIWI Project, generously funded by the Federal Ministry of Education and Research (BMBF).

We gratefully acknowledge their support, which enabled this research.


Multilingual Pipeline for Generating, Solving, and Evaluating Math Problems with LLMs - HTW Berlin, 2025
