An LLM benchmarking harness that generates, compiles, and races Fortran programs for computing the sum of all primes up to a configurable bound. Each program is produced by an LLM from a different algorithmic strategy, then evaluated for correctness and speed.
- Strategy selection — 10 built-in strategies describe different prime-summing algorithms (naive trial division, sieve of Eratosthenes, bit-packed sieve, wheel factorization, etc.), each with style and constraint directives that steer the LLM's output.
- Code generation — For each strategy, the LLM is prompted to return a complete, compilable Fortran program inside a fenced code block. The Fortran source is extracted via regex.
- Evaluation pipeline — Each program is compiled with
gfortran, executed, verified against a Python-computed reference sum, and benchmarked over multiple runs. - Ranking — Solutions are ranked: correct results first (sorted by median runtime), then incorrect results, then compile failures. A results table and the winning source are logged.
```
fortran-prime-arena/
├── pyproject.toml           # Project metadata and dependencies (managed by uv)
├── config.sample.yaml       # Sample configuration (copy to config.yaml)
├── fortran_prime_arena/     # Main package
│   ├── __main__.py          # CLI entry point
│   ├── arena.py             # Orchestrator — ties generation, evaluation, and ranking together
│   ├── evaluator.py         # Compile, run, verify, benchmark pipeline
│   ├── generator.py         # LLM prompting and Fortran code extraction
│   ├── llm_client.py        # LiteLLM wrapper for model-agnostic LLM calls
│   ├── models.py            # Pydantic configs, dataclasses for domain objects
│   └── strategies.py        # 10 predefined algorithmic strategy descriptors
└── tests/                   # Unit and integration tests
```
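Each entry in `strategies.py` bundles an algorithm description with the style and constraint directives mentioned above. A plausible shape for one descriptor — the class name, fields, and directive text here are illustrative assumptions, not the module's real definitions:

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Strategy:
    name: str
    description: str                 # the algorithm the LLM should implement
    style_directives: list[str] = field(default_factory=list)
    constraints: list[str] = field(default_factory=list)

# Hypothetical descriptor for the classic-sieve strategy.
SIEVE = Strategy(
    name="Classic Sieve (logical array)",
    description="Sum primes up to N using a sieve of Eratosthenes over a LOGICAL array.",
    style_directives=["Use modern free-form Fortran with IMPLICIT NONE."],
    constraints=["Print only the final sum to stdout."],
)
```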
- Python 3.11+
- uv (`brew install uv` on macOS, or `curl -LsSf https://astral.sh/uv/install.sh | sh`)
- gfortran (e.g. via `brew install gcc` on macOS)
- A valid LLM API endpoint (Azure OpenAI, OpenAI, or any provider supported by LiteLLM)
Install Python dependencies:
```sh
uv sync
```

Copy the sample config and edit it with your LLM provider, compiler path, and test parameters:
```sh
cp config.sample.yaml config.yaml
```

```yaml
llm:
  model: "gpt-4o"          # Any LiteLLM-compatible model string
  api_base: null           # Override if using a custom endpoint
  api_key: null            # Override if not using env vars (e.g. OPENAI_API_KEY)
  temperature: 1.0
  max_tokens: 2048
  retries: 2

compiler:
  path: "/opt/homebrew/bin/gfortran"
  flags: ["-O2"]
  timeout: 30

test:
  upper_bound: 1000000     # Sum primes from 2 to this value
  timeout: 60              # Max seconds per program execution
  benchmark_repeats: 3     # Number of timed runs per solution
  num_solutions: 10        # How many strategies to evaluate (max 10)
```

LiteLLM will also read standard environment variables (`OPENAI_API_KEY`, `AZURE_API_KEY`, etc.) if `api_key` is left null.
Run the arena with the default config:
```sh
uv run fortran-prime-arena
```

Or specify a custom config file:
```sh
uv run fortran-prime-arena path/to/config.yaml
```

The output is a ranked table showing compile status, correctness, median runtime, and the overall metric for each strategy, followed by the winning Fortran source code.
```sh
uv run pytest tests/
```

Integration tests that require `gfortran` are automatically skipped if the compiler is not found.
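The Python-computed reference sum that solutions are verified against can be reproduced with a simple sieve. This is a sketch of the idea, not necessarily the code in `evaluator.py`:

```python
def prime_sum(n: int) -> int:
    """Sum of all primes p with 2 <= p <= n, via a sieve of Eratosthenes."""
    if n < 2:
        return 0
    sieve = bytearray([1]) * (n + 1)  # sieve[i] == 1 while i is presumed prime
    sieve[0] = sieve[1] = 0
    for p in range(2, int(n ** 0.5) + 1):
        if sieve[p]:
            # Knock out all multiples of p starting from p*p.
            sieve[p * p :: p] = bytes(len(range(p * p, n + 1, p)))
    return sum(i for i in range(n + 1) if sieve[i])

print(prime_sum(1_000_000))  # reference value for the default upper_bound
```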
```
Rank  Strategy                        Compile  Correct  Median(s)  Metric
---------------------------------------------------------------------------
   1  Classic Sieve (logical array)   0.312s   YES      0.0098     0.0098
   2  Bit-Packed Sieve                0.287s   YES      0.0124     0.0124
   3  Segmented Sieve                 0.341s   YES      0.0131     0.0131
   4  Sqrt Trial Division             0.198s   YES      2.4510     2.4510
   5  Naive Trial Division            0.176s   NO       8.1200     8.1200
   6  Wheel Factorization             FAIL     N/A      N/A        INF
---------------------------------------------------------------------------
```