An interactive Streamlit UI to run and compare reasoning models on the MMLU benchmark using multiple prompting techniques.
It supports both Frontier models (OpenAI, Google) and Small models (local, via Ollama), with a single-question transparency mode and batch evaluation with live progress.
Single Question (Transparency Mode)
- Watch sanitized reasoning steps live while the model solves a question.
- Choose from few-shot, chain-of-thought (CoT), self-consistency, or self-ask prompting (see the self-consistency sketch below).
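For reference, self-consistency samples several chain-of-thought completions and takes a majority vote over their final answers. A minimal sketch of the idea, not the app's actual code; `ask_model` and the "Answer: <letter>" convention are assumptions:

```python
# Minimal self-consistency sketch: sample several CoT completions and
# majority-vote the final answer letter. `ask_model` is a hypothetical
# callable that returns one completion ending in a line like "Answer: B".
from collections import Counter

def self_consistency(ask_model, question, choices, n_samples=5):
    votes = []
    for _ in range(n_samples):
        completion = ask_model(
            f"{question}\nChoices: {choices}\n"
            "Think step by step, then end with 'Answer: <letter>'."
        )
        # Scan from the end for the final answer line and record the letter.
        for line in reversed(completion.splitlines()):
            if line.strip().lower().startswith("answer:"):
                votes.append(line.split(":", 1)[1].strip()[:1].upper())
                break
    return Counter(votes).most_common(1)[0][0] if votes else None
```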
Batch Evaluation
- Run multiple MMLU questions per subject.
- See live per-subject progress bars and logs.
- Summarized accuracy and latency statistics.
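Conceptually, the batch run just iterates over an MMLU split and tracks correctness and wall-clock time per item. A rough sketch, assuming the Hugging Face cais/mmlu dataset and a hypothetical `solve()` helper that returns a choice index:

```python
# Sketch of a batch run over one MMLU subject. `solve` is a hypothetical
# helper returning a predicted choice index (0-3).
import time
from datasets import load_dataset

def run_batch(solve, subject="anatomy", n_items=20):
    data = load_dataset("cais/mmlu", subject, split="test").select(range(n_items))
    correct, latencies = 0, []
    for row in data:
        start = time.perf_counter()
        pred = solve(row["question"], row["choices"])
        latencies.append(time.perf_counter() - start)
        correct += int(pred == row["answer"])
    # Return mean accuracy and mean latency in seconds.
    return correct / n_items, sum(latencies) / len(latencies)
```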
Results & Analysis
- Automatic logging of all runs to results/log.csv.
- Filter results by model, provider, technique, or subject.
- Charts for accuracy and latency comparisons.
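The log can also be inspected outside the UI with pandas. The column names below are assumptions based on the filters the app exposes, so adjust them to the actual header of results/log.csv:

```python
# Inspect results/log.csv directly; column names ("provider", "technique",
# "model", "correct", "latency") are assumed.
import pandas as pd

df = pd.read_csv("results/log.csv")
cot_runs = df[(df["provider"] == "openai") & (df["technique"] == "cot")]
print(cot_runs.groupby("model")[["correct", "latency"]].mean())
```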
Follow these steps to download and run the project locally:
If you have SSH access:
git clone [email protected]:<YOUR_USERNAME>/<YOUR_REPO_NAME>.gitcd <YOUR_REPO_NAME>python3 -m venv venv
source venv/bin/activate # Mac/Linux
venv\Scripts\activate # Windowspip install -r requirements.txtYou need at least one API key to run the app with hosted models. Set them as environment variables:
Mac/Linux (bash/zsh):

```
export OPENAI_API_KEY="your-openai-key"
export GOOGLE_API_KEY="your-google-key"
```

Windows (PowerShell):
```
setx OPENAI_API_KEY "your-openai-key"
setx GOOGLE_API_KEY "your-google-key"
```
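To quickly confirm the keys are visible to Python before launching the app (note that setx only takes effect in newly opened terminals), you can run:

```python
# Quick check that the API keys are set in the current environment.
import os

for key in ("OPENAI_API_KEY", "GOOGLE_API_KEY"):
    print(key, "is set" if os.environ.get(key) else "is MISSING")
```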
If you want to use local models with Ollama, install it from https://ollama.ai and run:
```
ollama serve
ollama pull gemma2:9b
ollama pull llama3:8b
ollama pull gemma3:1b
ollama pull gemma3:270m
```

If you want to try other models from Ollama, the full list is here: https://ollama.com/search
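To verify that the Ollama server is reachable and see which models are already pulled, you can query its /api/tags endpoint (default port 11434):

```python
# List locally pulled Ollama models via the server's /api/tags endpoint.
import json
import urllib.request

with urllib.request.urlopen("http://localhost:11434/api/tags") as resp:
    tags = json.load(resp)
print([m["name"] for m in tags.get("models", [])])
```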
Finally, launch the app:

```
streamlit run app/ui.py
```

Requirements:

- Python 3.9+
- Streamlit
- datasets
- pandas
- altair
- Access to:
  - Frontier Models — API keys for OpenAI or Google
  - Small Models — Ollama installed locally
- Clone the repository (via SSH)
git clone [email protected]:YOUR_USERNAME/YOUR_REPO.git cd YOUR_REPO
- Create and activate a virtual environment
```
python3 -m venv venv
source venv/bin/activate   # macOS/Linux
venv\Scripts\activate      # Windows
```
- Install dependencies
```
pip install -r requirements.txt
```
- Set your API keys (replace with your keys)
```
export OPENAI_API_KEY="your_openai_key"
export GOOGLE_API_KEY="your_google_key"
```
- For local models, install and run Ollama:
Download Ollama -> https://ollama.com/download

Start the server:

```
ollama serve
```

Pull the model you want to use:

```
ollama pull gemma2:9b
ollama pull gemma3:1b
ollama pull gemma3:270m
ollama pull llama3:8b
```
Running the App

Launch the Streamlit UI:

```
streamlit run app/ui.py
```
How to Use

Tab 1 — Single Question (Transparency)

Select a subject, question index, model family, and prompting technique.
Optionally enable Live mode to see sanitized reasoning steps.
Click Solve to run the query.
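Live mode streams intermediate output into the page as it arrives. A minimal Streamlit pattern for that kind of incremental display (illustrative only; `stream_steps` is a hypothetical generator of sanitized reasoning chunks):

```python
# Illustrative Streamlit pattern for showing reasoning steps as they stream in.
# `stream_steps` is a hypothetical generator yielding sanitized text chunks.
import streamlit as st

def show_live(stream_steps):
    placeholder = st.empty()      # single slot updated in place
    shown = ""
    for chunk in stream_steps:
        shown += chunk
        placeholder.markdown(shown)
```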
Tab 2 — Batch Evaluation

Select one or more subjects.
Set the number of items per subject.
Click Run Batch to evaluate and view live progress.
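Under the hood, per-subject progress is just a Streamlit progress bar updated inside the evaluation loop. A sketch of the pattern (the subject list and `evaluate_item` are placeholders, not the app's real API):

```python
# Sketch: one progress bar per subject, updated as items complete.
import streamlit as st

def batch_with_progress(evaluate_item, subjects, n_items):
    for subject in subjects:
        bar = st.progress(0, text=subject)
        for i in range(n_items):
            evaluate_item(subject, i)          # hypothetical per-item call
            bar.progress((i + 1) / n_items, text=subject)
```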
Tab 3 — Results & Analysis

Explore saved runs from results/log.csv.
Filter by family, provider, model, technique, or subject.
View accuracy and latency charts.
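The charts are easy to reproduce in a notebook as well; for example, an Altair bar chart of mean accuracy per technique (column names are assumptions, as above):

```python
# Altair bar chart of mean accuracy per technique from the run log.
# Column names ("technique", "correct") are assumed.
import altair as alt
import pandas as pd

df = pd.read_csv("results/log.csv")
acc = df.groupby("technique", as_index=False)["correct"].mean()
chart = alt.Chart(acc).mark_bar().encode(x="technique", y="correct")
chart.save("accuracy_by_technique.html")
```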