Frontier-CS is a benchmark system for evaluating AI models on challenging computer science problems. The system provides automated evaluation infrastructure that measures model performance on two problem tracks: algorithmic problems (competitive programming) and research problems (systems optimization). The benchmark is designed to be unsolved (no perfect scores), open-ended (unbounded scoring), verifiable (automated evaluation), and diverse (multiple domains).
For installation instructions, see Installation and Setup. For a quick tutorial on running evaluations, see Quick Start Guide. For detailed architectural design, see System Architecture.
Sources: README.md:1-172
Frontier-CS evaluates solutions across two distinct tracks, each with different problem types, evaluation methods, and scoring systems.
| Track | Problems | Language | Evaluation Method | Score Type |
|---|---|---|---|---|
| Algorithmic | 172 | C++17 | Testlib-based judge with go-judge sandboxing | Partial credit (0-100) based on test cases passed |
| Research | 68 | Python (default) or C++ | Docker/SkyPilot execution with custom evaluators | Continuous (0-100+) based on performance metrics |
Algorithmic problems use competitive programming formats with binary test data (.in/.ans files), testlib-based checkers and interactors, and a Node.js judge server backed by go-judge for sandboxed execution. Problems include standard, interactive, and partial-score types across categories like graph theory, dynamic programming, and data structures.
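The partial-credit scoring described above can be pictured with a small helper; this is an illustrative sketch of the idea (score proportional to test cases passed, scaled to 0-100), not the judge's actual formula:

```python
def partial_credit_score(passed: int, total: int) -> float:
    """Hypothetical partial-credit formula: fraction of test cases
    passed, scaled to the 0-100 range used by the algorithmic track."""
    if total <= 0:
        raise ValueError("problem must define at least one test case")
    return 100.0 * passed / total
```

For example, a solution passing 43 of 50 test cases would score 86.0 under this sketch.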
Key files:
- algorithmic/problems/{id}/config.yaml
- algorithmic/judge/
- AlgorithmicLocalRunner and AlgorithmicSkyPilotRunner

Research problems require domain expertise in systems optimization, covering six categories: OS, HPC (High-Performance Computing), AI, Database, Programming Languages, and Security. Each problem has a custom Python-based evaluator that benchmarks solution performance against baselines and computes continuous scores (e.g., speedup ratios, cost reductions).
Key files:
- research/problems/{problem}/config.yaml
- research/problems/{problem}/evaluator.py
- ResearchDockerRunner and ResearchSkyPilotRunner

Sources: README.md:24-42, algorithmic/README.md:1-125, research/README.md:1-138
The system consists of five primary layers:
- Command-Line Interface: The frontier command provides four subcommands (eval, batch, list, show) implemented in src/frontier_cs/cli.py:1-897. The CLI handles argument parsing, backend selection, and result formatting.
- Evaluation Engine: Two evaluator classes coordinate execution:
  - SingleEvaluator (src/frontier_cs/single_evaluator.py) for individual solution evaluation
  - BatchEvaluator (src/frontier_cs/batch/evaluator.py:35-1076) for parallel batch evaluation with worker pools and state management
- Execution Backends: The abstract Runner base class (src/frontier_cs/runner/base.py:46-220) has four implementations:
  - AlgorithmicLocalRunner for Docker-based local execution
  - AlgorithmicSkyPilotRunner (src/frontier_cs/runner/algorithmic_skypilot.py:24-330) for cloud execution
  - ResearchDockerRunner for local Docker containers
  - ResearchSkyPilotRunner (src/frontier_cs/runner/research_skypilot.py:42-487) for cloud VMs with GPU support
- Problem Repositories: Directories containing problem definitions, each with a config.yaml file specifying runtime parameters, resource requirements, and evaluation settings.
- Results & State: Evaluation results are captured in EvaluationResult dataclasses (src/frontier_cs/runner/base.py:24-44) for single evaluations and persisted as PairResult objects in EvaluationState (src/frontier_cs/batch/state.py:114-632) for batch evaluations.
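The relationship between the runner layer and the result dataclass described above can be sketched as follows. Class and field names mirror the ones named in the text, but the exact signatures are assumptions, not the repository's actual code:

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from pathlib import Path

@dataclass
class EvaluationResult:
    # Fields follow the description of src/frontier_cs/runner/base.py;
    # defaults and types here are assumptions.
    score: float
    status: str          # e.g. "SUCCESS", "ERROR", "TIMEOUT"
    message: str = ""
    logs: str = ""
    duration_seconds: float = 0.0

class Runner(ABC):
    """Abstract execution backend. In the real system the concrete
    subclasses would be AlgorithmicLocalRunner, AlgorithmicSkyPilotRunner,
    ResearchDockerRunner, and ResearchSkyPilotRunner."""

    @abstractmethod
    def run(self, problem_id: str, solution: Path) -> EvaluationResult:
        """Execute one solution against one problem."""
```

A backend then only has to implement run() and return an EvaluationResult, which is what lets single and batch evaluation share the same interface.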
Sources: src/frontier_cs/cli.py:1-897, src/frontier_cs/batch/evaluator.py:1-1076, src/frontier_cs/runner/base.py:1-220, src/frontier_cs/batch/state.py:1-632
1. Initialization: The user invokes the frontier eval command with a track, problem ID, and solution file. The CLI (src/frontier_cs/cli.py:799-869) parses arguments and determines the backend (Docker by default for algorithmic, SkyPilot by default for research).
2. Configuration Loading: The evaluator loads the problem's config.yaml using load_problem_config() (src/frontier_cs/config.py:87-154), which specifies timeout limits, resource requirements (CPU/GPU/memory), Docker image, and dependency configuration.
3. Runner Selection: Based on track and backend, the evaluator instantiates the appropriate runner:
   - AlgorithmicLocalRunner (src/frontier_cs/runner/algorithmic_local.py)
   - AlgorithmicSkyPilotRunner (src/frontier_cs/runner/algorithmic_skypilot.py:24-330)
   - ResearchDockerRunner (src/frontier_cs/runner/research_docker.py)
   - ResearchSkyPilotRunner (src/frontier_cs/runner/research_skypilot.py:42-487)
4. Execution: The runner executes evaluator.py in a Docker container or cloud VM, which loads the solution's solve() method and runs benchmarks.
5. Scoring: Raw metrics (test results, performance measurements) are converted to 0-100 scores based on problem-specific formulas defined in the evaluator or checker.
6. Result Capture: The runner returns an EvaluationResult dataclass (src/frontier_cs/runner/base.py:24-44) containing score, status (SUCCESS/ERROR/TIMEOUT), message, logs, and execution duration.
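The backend defaults described above (Docker for the algorithmic track, SkyPilot for the research track) can be captured in a tiny helper. This is an illustrative sketch, not the CLI's actual selection code:

```python
def default_backend(track: str) -> str:
    """Return the default execution backend for a track, following the
    defaults described in the initialization step (sketch only)."""
    defaults = {"algorithmic": "docker", "research": "skypilot"}
    try:
        return defaults[track]
    except KeyError:
        raise ValueError(f"unknown track: {track}") from None
```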
Sources: src/frontier_cs/cli.py:799-869, src/frontier_cs/single_evaluator.py, src/frontier_cs/runner/base.py:46-220, src/frontier_cs/config.py:87-154
For evaluating multiple solutions at scale, BatchEvaluator (src/frontier_cs/batch/evaluator.py:35-1076) provides parallel execution with incremental state tracking.
Key Features:
| Feature | Implementation |
|---|---|
| Parallel Execution | ThreadPoolExecutor with configurable worker count (default: 10) |
| State Persistence | JSON-based EvaluationState (src/frontier_cs/batch/state.py:114-632) with atomic writes |
| Cache Invalidation | Hash-based tracking of solution and problem changes (src/frontier_cs/batch/state.py:22-78) |
| Resume Capability | Automatically skips completed pairs, retries only modified solutions |
| Cluster Pooling | For SkyPilot research evaluations, maintains a pool of reusable clusters (src/frontier_cs/batch/evaluator.py:490-589) |
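The parallel-execution row above can be illustrated with a minimal worker-pool sketch. The real BatchEvaluator layers state tracking and cluster pooling on top of this pattern; evaluate_pair here is a placeholder, not the actual evaluation call:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def evaluate_pair(pair: tuple[str, str]) -> tuple[str, float]:
    # Placeholder for a real evaluation; returns (key, score).
    solution, problem = pair
    return f"{solution}:{problem}", 0.0

def run_batch(pairs, max_workers: int = 10) -> dict[str, float]:
    """Evaluate (solution, problem) pairs in parallel, mirroring the
    default worker count of 10 noted in the table above."""
    results: dict[str, float] = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(evaluate_pair, p): p for p in pairs}
        for fut in as_completed(futures):
            key, score = fut.result()
            results[key] = score
    return results
```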
The batch system tracks evaluation progress in .state.{track}.json files, which map {solution}:{problem} keys to PairResult entries recording score, status, and content hashes. When resuming, the system compares the stored hashes with current file hashes; if either the solution code or the problem configuration has changed, the pair is marked for re-evaluation (src/frontier_cs/batch/evaluator.py:210-243).
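The hash comparison behind this resume logic can be sketched as follows. The choice of SHA-256 and the stored-dict key names are assumptions for illustration; the state file's actual format may differ:

```python
import hashlib
from pathlib import Path

def file_hash(path: Path) -> str:
    """Content hash of a solution or problem file (SHA-256 assumed)."""
    return hashlib.sha256(path.read_bytes()).hexdigest()

def needs_rerun(stored: dict, solution: Path, problem_cfg: Path) -> bool:
    """Re-evaluate a pair when either side's current hash no longer
    matches what the state file recorded, per the resume logic above."""
    return (stored.get("solution_hash") != file_hash(solution)
            or stored.get("problem_hash") != file_hash(problem_cfg))
```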
For the research track with the SkyPilot backend, BatchEvaluator groups problems by resource signature (GPU type, CPU count, memory) and creates separate cluster pools for each group (src/frontier_cs/batch/evaluator.py:490-589). This ensures GPU problems get GPU clusters while CPU-only problems use cheaper instances.
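Grouping by resource signature amounts to keying problems on a tuple of their requirements. A minimal sketch, assuming hypothetical field names (gpu, cpus, memory_gb) rather than the actual config schema:

```python
from collections import defaultdict

def group_by_resources(problems: list[dict]) -> dict:
    """Group problems by (gpu, cpus, memory_gb) so each group can share
    a cluster pool, as described above. Field names are illustrative."""
    pools: dict = defaultdict(list)
    for prob in problems:
        sig = (prob.get("gpu"), prob.get("cpus"), prob.get("memory_gb"))
        pools[sig].append(prob["id"])
    return dict(pools)
```

With this grouping, two A100 problems share one pool while a CPU-only problem lands in its own, cheaper pool.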
Sources: src/frontier_cs/batch/evaluator.py:35-1076, src/frontier_cs/batch/state.py:114-632
All problems (both tracks) define runtime parameters in config.yaml files loaded by load_problem_config() (src/frontier_cs/config.py:87-154).
Sources: algorithmic/README.md:106-113
The RuntimeConfig (src/frontier_cs/config.py:67-76) contains:

- timeout_seconds: Execution time limit
- docker: DockerConfig (src/frontier_cs/config.py:48-63) specifying image and GPU requirements
- resources: ResourcesConfig (src/frontier_cs/config.py:18-46) for SkyPilot cluster specifications
- language: Target language (affects solution file extension)

Dependencies specified in uv_project are automatically installed via uv pip install (src/frontier_cs/runner/base.py:203-216) before evaluation.
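A config.yaml shaped like the fields above might look as follows. This is a hypothetical fragment: the field names follow RuntimeConfig as described, but the nesting, value formats, and example values are assumptions, not a file from the repository:

```yaml
# Hypothetical research-problem config.yaml (illustrative values only)
timeout_seconds: 3600
language: python
docker:
  image: python:3.11-slim   # assumed image name
  gpu: false
resources:                  # SkyPilot cluster specification (assumed keys)
  cpus: 8
  memory: 32
uv_project: .               # dependencies installed via `uv pip install`
```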
Sources: src/frontier_cs/config.py:18-154, research/README.md:52-82
Solutions must be C++17 source files using standard I/O:

- Input is read from stdin
- Output is written to stdout
- One .cpp file per problem

The judge compiles with g++ -std=c++17 -O2 and runs the binary against the test cases.
Solutions must implement a Solution class with a solve() method. The signature varies by problem type:
- Kernel optimization problems (flash_attn, gemm_optimization)
- Training problems (imagenet_pareto)
- Strategy problems (cant_be_late, llm_router)
The evaluator loads the solution as a Python module, instantiates Solution, and calls solve() with problem-specific arguments defined in the problem's evaluator.py (research/problems/{problem}/evaluator.py).
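The shape of a research-track submission and the loading pattern just described can be sketched as follows. The solve() arguments shown are placeholders (each problem's evaluator.py defines the real ones), and load_solution is an illustrative reimplementation, not the evaluator's actual loader:

```python
# solution.py: minimal shape of a research-track submission
class Solution:
    def solve(self, **kwargs):
        """Problem-specific entry point; real arguments are defined
        by each problem's evaluator.py."""
        return kwargs

# How an evaluator might load a solution file dynamically (sketch):
import importlib.util

def load_solution(path: str):
    """Import a solution file as a module and instantiate its
    Solution class, mirroring the loading pattern described above."""
    spec = importlib.util.spec_from_file_location("solution", path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module.Solution()
```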
Sources: research/README.md:84-137, SUBMIT.md:216-226
Frontier-CS uses three repositories:
| Repository | Purpose | Contents |
|---|---|---|
| Public (Frontier-CS) | Problem definitions, evaluation tools | Algorithmic/research problems (partial test cases), CLI, runners, generation scripts |
| Internal (Frontier-CS-internal) | Complete test suites | Same structure as public but with full test cases |
| Results (Frontier-CS-Result) | Evaluation history | JSON files with historical scores (algorithmic.json, research.json) |
A weekly evaluation workflow (.github/workflows/weekly-eval.yml, driven by scripts/run_eval.sh:1-519) runs evaluations across these repositories and records scores in the Results repository.
This separation allows public access to problem definitions while keeping complete test suites private to prevent overfitting.
Sources: scripts/run_eval.sh:1-519, README.md:171-172
To begin using Frontier-CS:
Installation: Install the frontier-cs package. See Installation and Setup for detailed instructions including Python version requirements and Docker setup.
Quick Evaluation: Run a single evaluation to verify installation. See Quick Start Guide for example commands.
Understanding Architecture: Learn about the evaluation pipeline, runners, and problem formats. See System Architecture for detailed component descriptions.
Command Reference: Explore available CLI commands. See Command-Line Interface for complete command documentation.
Problem Contribution: Submit new problems to the benchmark. See Contributing Algorithmic Problems and Contributing Research Problems for guidelines.
Sources: README.md:77-139, SUBMIT.md:1-421