An LLM benchmarking harness that generates, compiles, and races Fortran programs for computing the sum of all primes up to a configurable bound. Each program is produced by an LLM from a different algorithmic strategy, then evaluated for correctness and speed.
- Strategy selection — 10 built-in strategies describe different prime-summing algorithms (naive trial division, sieve of Eratosthenes, bit-packed sieve, wheel factorization, etc.), each with style and constraint directives that steer the LLM's output.
- Code generation — For each strategy, the LLM is prompted to return a complete, compilable Fortran program inside a fenced code block. The Fortran source is extracted via regex.
- Evaluation pipeline — Each program is compiled with
gfortran, executed, verified against a Python-computed reference sum, and benchmarked over multiple runs. - Ranking — Solutions are ranked: correct results first (sorted by median runtime), then incorrect results, then compile failures. A results table and the winning source are logged.
```
fortran-prime-arena/
├── pyproject.toml           # Project metadata and dependencies (managed by uv)
├── config.sample.yaml       # Sample configuration (copy to config.yaml)
├── fortran_prime_arena/     # Main package
│   ├── __main__.py          # CLI entry point
│   ├── arena.py             # Orchestrator — ties generation, evaluation, and ranking together
│   ├── evaluator.py         # Compile, run, verify, benchmark pipeline
│   ├── generator.py         # LLM prompting and Fortran code extraction
│   ├── llm_client.py        # LiteLLM wrapper for model-agnostic LLM calls
│   ├── models.py            # Pydantic configs, dataclasses for domain objects
│   └── strategies.py        # 10 predefined algorithmic strategy descriptors
└── tests/                   # Unit and integration tests
```
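Each entry in `strategies.py` bundles an algorithm description with the style and constraint directives mentioned above. A plausible shape for one descriptor — the class name, fields, and directive text here are illustrative assumptions, not the module's real definitions:

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Strategy:
    name: str
    description: str                 # the algorithm the LLM should implement
    style_directives: list[str] = field(default_factory=list)
    constraints: list[str] = field(default_factory=list)

# Hypothetical descriptor for the classic-sieve strategy.
SIEVE = Strategy(
    name="Classic Sieve (logical array)",
    description="Sum primes up to N using a sieve of Eratosthenes over a LOGICAL array.",
    style_directives=["Use modern free-form Fortran with IMPLICIT NONE."],
    constraints=["Print only the final sum to stdout."],
)
```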
- Python 3.11+
- uv (`brew install uv` on macOS, or `curl -LsSf https://astral.sh/uv/install.sh | sh`)
- gfortran (e.g. via `brew install gcc` on macOS)
- A valid LLM API endpoint (Azure OpenAI, OpenAI, or any provider supported by LiteLLM)
Install Python dependencies:
```sh
uv sync
```

Copy the sample config and edit it with your LLM provider, compiler path, and test parameters:
```sh
cp config.sample.yaml config.yaml
```

```yaml
llm:
  model: "gpt-4o"          # Any LiteLLM-compatible model string
  api_base: null           # Override if using a custom endpoint
  api_key: null            # Override if not using env vars (e.g. OPENAI_API_KEY)
  temperature: 1.0
  max_tokens: 2048
  retries: 2

compiler:
  path: "/opt/homebrew/bin/gfortran"
  flags: ["-O2"]
  timeout: 30

test:
  upper_bound: 1000000     # Sum primes from 2 to this value
  timeout: 60              # Max seconds per program execution
  benchmark_repeats: 3     # Number of timed runs per solution
  num_solutions: 10        # How many strategies to evaluate (max 10)
```

LiteLLM will also read standard environment variables (`OPENAI_API_KEY`, `AZURE_API_KEY`, etc.) if `api_key` is left null.
Run the arena with the default config:
```sh
uv run fortran-prime-arena
```

Or specify a custom config file:
```sh
uv run fortran-prime-arena path/to/config.yaml
```

The output is a ranked table showing compile status, correctness, median runtime, and the overall metric for each strategy, followed by the winning Fortran source code.
```sh
uv run pytest tests/
```

Integration tests that require `gfortran` are automatically skipped if the compiler is not found.
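The Python-computed reference sum that solutions are verified against can be reproduced with a simple sieve. This is a sketch of the idea, not necessarily the code in `evaluator.py`:

```python
def prime_sum(n: int) -> int:
    """Sum of all primes p with 2 <= p <= n, via a sieve of Eratosthenes."""
    if n < 2:
        return 0
    sieve = bytearray([1]) * (n + 1)  # sieve[i] == 1 while i is presumed prime
    sieve[0] = sieve[1] = 0
    for p in range(2, int(n ** 0.5) + 1):
        if sieve[p]:
            # Knock out all multiples of p starting from p*p.
            sieve[p * p :: p] = bytes(len(range(p * p, n + 1, p)))
    return sum(i for i in range(n + 1) if sieve[i])

print(prime_sum(1_000_000))  # reference value for the default upper_bound
```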
```
Rank  Strategy                        Compile  Correct  Median(s)  Metric
---------------------------------------------------------------------------
   1  Classic Sieve (logical array)   0.312s   YES      0.0098     0.0098
   2  Bit-Packed Sieve                0.287s   YES      0.0124     0.0124
   3  Segmented Sieve                 0.341s   YES      0.0131     0.0131
   4  Sqrt Trial Division             0.198s   YES      2.4510     2.4510
   5  Naive Trial Division            0.176s   NO       8.1200     8.1200
   6  Wheel Factorization             FAIL     N/A      N/A        INF
---------------------------------------------------------------------------
```