spChalk/fortran-prime-arena

Fortran Prime Arena

An LLM benchmarking harness that generates, compiles, and races Fortran programs for computing the sum of all primes up to a configurable bound. Each program is produced by an LLM from a different algorithmic strategy, then evaluated for correctness and speed.

How It Works

  1. Strategy selection — 10 built-in strategies describe different prime-summing algorithms (naive trial division, sieve of Eratosthenes, bit-packed sieve, wheel factorization, etc.), each with style and constraint directives that steer the LLM's output.
  2. Code generation — For each strategy, the LLM is prompted to return a complete, compilable Fortran program inside a fenced code block. The Fortran source is extracted via regex.
  3. Evaluation pipeline — Each program is compiled with gfortran, executed, verified against a Python-computed reference sum, and benchmarked over multiple runs.
  4. Ranking — Solutions are ranked: correct results first (sorted by median runtime), then incorrect results, then compile failures. A results table and the winning source are logged.

Project Structure

fortran-prime-arena/
├── pyproject.toml                     # Project metadata and dependencies (managed by uv)
├── config.sample.yaml                 # Sample configuration (copy to config.yaml)
├── fortran_prime_arena/               # Main package
│   ├── __main__.py                    # CLI entry point
│   ├── arena.py                       # Orchestrator — ties generation, evaluation, and ranking together
│   ├── evaluator.py                   # Compile, run, verify, benchmark pipeline
│   ├── generator.py                   # LLM prompting and Fortran code extraction
│   ├── llm_client.py                  # LiteLLM wrapper for model-agnostic LLM calls
│   ├── models.py                      # Pydantic configs, dataclasses for domain objects
│   └── strategies.py                  # 10 predefined algorithmic strategy descriptors
└── tests/                             # Unit and integration tests

Prerequisites

  • Python 3.11+
  • uv (brew install uv on macOS, or curl -LsSf https://astral.sh/uv/install.sh | sh)
  • gfortran (e.g. via brew install gcc on macOS)
  • A valid LLM API endpoint (Azure OpenAI, OpenAI, or any provider supported by LiteLLM)

Install Python dependencies:

uv sync

Configuration

Copy the sample config and edit it with your LLM provider, compiler path, and test parameters:

cp config.sample.yaml config.yaml

Example config.yaml:

llm:
  model: "gpt-4o"                # Any LiteLLM-compatible model string
  api_base: null                 # Override if using a custom endpoint
  api_key: null                  # Override if not using env vars (e.g. OPENAI_API_KEY)
  temperature: 1.0
  max_tokens: 2048
  retries: 2

compiler:
  path: "/opt/homebrew/bin/gfortran"
  flags: ["-O2"]
  timeout: 30

test:
  upper_bound: 1000000           # Sum primes from 2 to this value
  timeout: 60                    # Max seconds per program execution
  benchmark_repeats: 3           # Number of timed runs per solution

num_solutions: 10                # How many strategies to evaluate (max 10)

LiteLLM will also read standard environment variables (OPENAI_API_KEY, AZURE_API_KEY, etc.) if api_key is left null.

Usage

Run the arena with the default config:

uv run fortran-prime-arena

Or specify a custom config file:

uv run fortran-prime-arena path/to/config.yaml

The output is a ranked table showing compile status, correctness, median runtime, and the overall metric for each strategy, followed by the winning Fortran source code.

Running Tests

uv run pytest tests/

Integration tests that require gfortran are automatically skipped if the compiler is not found.
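A typical guard for such tests checks the PATH with `shutil.which`; this is a sketch of the pattern, not necessarily the repository's exact marker:

```python
import shutil
import subprocess
import pytest

GFORTRAN = shutil.which("gfortran")  # None when the compiler is not on PATH

requires_gfortran = pytest.mark.skipif(
    GFORTRAN is None, reason="gfortran not found on PATH"
)

@requires_gfortran
def test_gfortran_reports_version():
    result = subprocess.run([GFORTRAN, "--version"],
                            capture_output=True, text=True)
    assert result.returncode == 0
```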

Example Output

Rank  Strategy                      Compile   Correct  Median(s)   Metric
---------------------------------------------------------------------------
1     Classic Sieve (logical array) 0.312s    YES      0.0098      0.0098
2     Bit-Packed Sieve              0.287s    YES      0.0124      0.0124
3     Segmented Sieve               0.341s    YES      0.0131      0.0131
4     Sqrt Trial Division           0.198s    YES      2.4510      2.4510
5     Naive Trial Division          0.176s    NO       8.1200      8.1200
6     Wheel Factorization           FAIL      N/A      N/A         INF
---------------------------------------------------------------------------
