Frontier-CS is a benchmark system for evaluating AI models on challenging computer science problems. The system provides automated evaluation infrastructure that measures model performance on two problem tracks: algorithmic problems (competitive programming) and research problems (systems optimization). The benchmark is designed to be unsolved (no perfect scores), open-ended (unbounded scoring), verifiable (automated evaluation), and diverse (multiple domains).
For installation instructions, see Installation and Setup. For a quick tutorial on running evaluations, see Quick Start Guide. For detailed architectural design, see System Architecture.
Sources: README.md:1-172
Frontier-CS evaluates solutions across two distinct tracks, each with different problem types, evaluation methods, and scoring systems.
| Track | Problems | Language | Evaluation Method | Score Type |
|---|---|---|---|---|
| Algorithmic | 172 | C++17 | Testlib-based judge with go-judge sandboxing | Partial credit (0-100) based on test cases passed |
| Research | 68 | Python (default) or C++ | Docker/SkyPilot execution with custom evaluators | Continuous (0-100+) based on performance metrics |
Algorithmic problems use competitive programming formats with binary test data (.in/.ans files), testlib-based checkers and interactors, and a Node.js judge server backed by go-judge for sandboxed execution. Problems include standard, interactive, and partial-score types across categories like graph theory, dynamic programming, and data structures.
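The partial-credit scoring described above can be pictured with a small helper; this is an illustrative sketch of the idea (score proportional to test cases passed, scaled to 0-100), not the judge's actual formula:

```python
def partial_credit_score(passed: int, total: int) -> float:
    """Hypothetical partial-credit formula: fraction of test cases
    passed, scaled to the 0-100 range used by the algorithmic track."""
    if total <= 0:
        raise ValueError("problem must define at least one test case")
    return 100.0 * passed / total
```

For example, a solution passing 43 of 50 test cases would score 86.0 under this sketch.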
Key files:
- algorithmic/problems/{id}/config.yaml
- algorithmic/judge/
- AlgorithmicLocalRunner and AlgorithmicSkyPilotRunner

Research problems require domain expertise in systems optimization, covering six categories: OS, HPC (High-Performance Computing), AI, Database, Programming Languages, and Security. Each problem has a custom Python-based evaluator that benchmarks solution performance against baselines and computes continuous scores (e.g., speedup ratios, cost reductions).
Key files:
- research/problems/{problem}/config.yaml
- research/problems/{problem}/evaluator.py
- ResearchDockerRunner and ResearchSkyPilotRunner

Sources: README.md:24-42, algorithmic/README.md:1-125, research/README.md:1-138
The system consists of five primary layers:
- Command-Line Interface: The frontier command provides four subcommands (eval, batch, list, show) implemented in src/frontier_cs/cli.py:1-897. The CLI handles argument parsing, backend selection, and result formatting.
- Evaluation Engine: Two evaluator classes coordinate execution:
  - SingleEvaluator (src/frontier_cs/single_evaluator.py) for individual solution evaluation
  - BatchEvaluator (src/frontier_cs/batch/evaluator.py:35-1076) for parallel batch evaluation with worker pools and state management
- Execution Backends: The abstract Runner base class (src/frontier_cs/runner/base.py:46-220) has four implementations:
  - AlgorithmicLocalRunner for Docker-based local execution
  - AlgorithmicSkyPilotRunner (src/frontier_cs/runner/algorithmic_skypilot.py:24-330) for cloud execution
  - ResearchDockerRunner for local Docker containers
  - ResearchSkyPilotRunner (src/frontier_cs/runner/research_skypilot.py:42-487) for cloud VMs with GPU support
- Problem Repositories: Directories containing problem definitions, each with a config.yaml file specifying runtime parameters, resource requirements, and evaluation settings.
- Results & State: Evaluation results are captured in EvaluationResult dataclasses (src/frontier_cs/runner/base.py:24-44) for single evaluations and persisted as PairResult objects in EvaluationState (src/frontier_cs/batch/state.py:114-632) for batch evaluations.
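The relationship between the runner layer and the result dataclass described above can be sketched as follows. Class and field names mirror the ones named in the text, but the exact signatures are assumptions, not the repository's actual code:

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from pathlib import Path

@dataclass
class EvaluationResult:
    # Fields follow the description of src/frontier_cs/runner/base.py;
    # defaults and types here are assumptions.
    score: float
    status: str          # e.g. "SUCCESS", "ERROR", "TIMEOUT"
    message: str = ""
    logs: str = ""
    duration_seconds: float = 0.0

class Runner(ABC):
    """Abstract execution backend. In the real system the concrete
    subclasses would be AlgorithmicLocalRunner, AlgorithmicSkyPilotRunner,
    ResearchDockerRunner, and ResearchSkyPilotRunner."""

    @abstractmethod
    def run(self, problem_id: str, solution: Path) -> EvaluationResult:
        """Execute one solution against one problem."""
```

A backend then only has to implement run() and return an EvaluationResult, which is what lets single and batch evaluation share the same interface.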
Sources: src/frontier_cs/cli.py:1-897, src/frontier_cs/batch/evaluator.py:1-1076, src/frontier_cs/runner/base.py:1-220, src/frontier_cs/batch/state.py:1-632
1. Initialization: The user invokes the frontier eval command with a track, problem ID, and solution file. The CLI (src/frontier_cs/cli.py:799-869) parses arguments and determines the backend (Docker by default for algorithmic, SkyPilot by default for research).
2. Configuration Loading: The evaluator loads the problem's config.yaml using load_problem_config() (src/frontier_cs/config.py:87-154), which specifies timeout limits, resource requirements (CPU/GPU/memory), Docker image, and dependency configuration.
3. Runner Selection: Based on track and backend, the evaluator instantiates the appropriate runner:
   - AlgorithmicLocalRunner (src/frontier_cs/runner/algorithmic_local.py)
   - AlgorithmicSkyPilotRunner (src/frontier_cs/runner/algorithmic_skypilot.py:24-330)
   - ResearchDockerRunner (src/frontier_cs/runner/research_docker.py)
   - ResearchSkyPilotRunner (src/frontier_cs/runner/research_skypilot.py:42-487)
4. Execution: The runner executes evaluator.py in a Docker container or cloud VM, which loads the solution's solve() method and runs benchmarks.
5. Scoring: Raw metrics (test results, performance measurements) are converted to 0-100 scores based on problem-specific formulas defined in the evaluator or checker.
6. Result Capture: The runner returns an EvaluationResult dataclass (src/frontier_cs/runner/base.py:24-44) containing score, status (SUCCESS/ERROR/TIMEOUT), message, logs, and execution duration.
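The backend defaults described above (Docker for the algorithmic track, SkyPilot for the research track) can be captured in a tiny helper. This is an illustrative sketch, not the CLI's actual selection code:

```python
def default_backend(track: str) -> str:
    """Return the default execution backend for a track, following the
    defaults described in the initialization step (sketch only)."""
    defaults = {"algorithmic": "docker", "research": "skypilot"}
    try:
        return defaults[track]
    except KeyError:
        raise ValueError(f"unknown track: {track}") from None
```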
Sources: src/frontier_cs/cli.py:799-869, src/frontier_cs/single_evaluator.py, src/frontier_cs/runner/base.py:46-220, src/frontier_cs/config.py:87-154
For evaluating multiple solutions at scale, BatchEvaluator (src/frontier_cs/batch/evaluator.py:35-1076) provides parallel execution with incremental state tracking.
Key Features:
| Feature | Implementation |
|---|---|
| Parallel Execution | ThreadPoolExecutor with configurable worker count (default: 10) |
| State Persistence | JSON-based EvaluationState (src/frontier_cs/batch/state.py:114-632) with atomic writes |
| Cache Invalidation | Hash-based tracking of solution and problem changes (src/frontier_cs/batch/state.py:22-78) |
| Resume Capability | Automatically skips completed pairs, retries only modified solutions |
| Cluster Pooling | For SkyPilot research evaluations, maintains a pool of reusable clusters (src/frontier_cs/batch/evaluator.py:490-589) |
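The parallel-execution row above can be illustrated with a minimal worker-pool sketch. The real BatchEvaluator layers state tracking and cluster pooling on top of this pattern; evaluate_pair here is a placeholder, not the actual evaluation call:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def evaluate_pair(pair: tuple[str, str]) -> tuple[str, float]:
    # Placeholder for a real evaluation; returns (key, score).
    solution, problem = pair
    return f"{solution}:{problem}", 0.0

def run_batch(pairs, max_workers: int = 10) -> dict[str, float]:
    """Evaluate (solution, problem) pairs in parallel, mirroring the
    default worker count of 10 noted in the table above."""
    results: dict[str, float] = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(evaluate_pair, p): p for p in pairs}
        for fut in as_completed(futures):
            key, score = fut.result()
            results[key] = score
    return results
```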
The batch system tracks evaluation progress in .state.{track}.json files, which map {solution}:{problem} keys to PairResult entries recording score, status, and content hashes. When resuming, the system compares the stored hashes with current file hashes; if either the solution code or the problem configuration has changed, the pair is marked for re-evaluation (src/frontier_cs/batch/evaluator.py:210-243).
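The hash comparison behind this resume logic can be sketched as follows. The choice of SHA-256 and the stored-dict key names are assumptions for illustration; the state file's actual format may differ:

```python
import hashlib
from pathlib import Path

def file_hash(path: Path) -> str:
    """Content hash of a solution or problem file (SHA-256 assumed)."""
    return hashlib.sha256(path.read_bytes()).hexdigest()

def needs_rerun(stored: dict, solution: Path, problem_cfg: Path) -> bool:
    """Re-evaluate a pair when either side's current hash no longer
    matches what the state file recorded, per the resume logic above."""
    return (stored.get("solution_hash") != file_hash(solution)
            or stored.get("problem_hash") != file_hash(problem_cfg))
```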
For the research track with the SkyPilot backend, BatchEvaluator groups problems by resource signature (GPU type, CPU count, memory) and creates separate cluster pools for each group (src/frontier_cs/batch/evaluator.py:490-589). This ensures GPU problems get GPU clusters while CPU-only problems use cheaper instances.
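Grouping by resource signature amounts to keying problems on a tuple of their requirements. A minimal sketch, assuming hypothetical field names (gpu, cpus, memory_gb) rather than the actual config schema:

```python
from collections import defaultdict

def group_by_resources(problems: list[dict]) -> dict:
    """Group problems by (gpu, cpus, memory_gb) so each group can share
    a cluster pool, as described above. Field names are illustrative."""
    pools: dict = defaultdict(list)
    for prob in problems:
        sig = (prob.get("gpu"), prob.get("cpus"), prob.get("memory_gb"))
        pools[sig].append(prob["id"])
    return dict(pools)
```

With this grouping, two A100 problems share one pool while a CPU-only problem lands in its own, cheaper pool.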
Sources: src/frontier_cs/batch/evaluator.py:35-1076, src/frontier_cs/batch/state.py:114-632
All problems (both tracks) define runtime parameters in config.yaml files loaded by load_problem_config() (src/frontier_cs/config.py:87-154).
Sources: algorithmic/README.md:106-113
The RuntimeConfig (src/frontier_cs/config.py:67-76) contains:

- timeout_seconds: Execution time limit
- docker: DockerConfig (src/frontier_cs/config.py:48-63) specifying image and GPU requirements
- resources: ResourcesConfig (src/frontier_cs/config.py:18-46) for SkyPilot cluster specifications
- language: Target language (affects solution file extension)

Dependencies specified in uv_project are automatically installed via uv pip install (src/frontier_cs/runner/base.py:203-216) before evaluation.
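A config.yaml shaped like the fields above might look as follows. This is a hypothetical fragment: the field names follow RuntimeConfig as described, but the nesting, value formats, and example values are assumptions, not a file from the repository:

```yaml
# Hypothetical research-problem config.yaml (illustrative values only)
timeout_seconds: 3600
language: python
docker:
  image: python:3.11-slim   # assumed image name
  gpu: false
resources:                  # SkyPilot cluster specification (assumed keys)
  cpus: 8
  memory: 32
uv_project: .               # dependencies installed via `uv pip install`
```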
Sources: src/frontier_cs/config.py:18-154, research/README.md:52-82
Solutions must be C++17 source files using standard I/O:

- Input is read from stdin
- Output is written to stdout
- One .cpp file per problem

The judge compiles with g++ -std=c++17 -O2 and runs the binary against the test cases.
Solutions must implement a Solution class with a solve() method. The signature varies by problem type:
- Kernel optimization problems (flash_attn, gemm_optimization)
- Training problems (imagenet_pareto)
- Strategy problems (cant_be_late, llm_router)
The evaluator loads the solution as a Python module, instantiates Solution, and calls solve() with problem-specific arguments defined in the problem's evaluator.py (research/problems/{problem}/evaluator.py).
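The shape of a research-track submission and the loading pattern just described can be sketched as follows. The solve() arguments shown are placeholders (each problem's evaluator.py defines the real ones), and load_solution is an illustrative reimplementation, not the evaluator's actual loader:

```python
# solution.py: minimal shape of a research-track submission
class Solution:
    def solve(self, **kwargs):
        """Problem-specific entry point; real arguments are defined
        by each problem's evaluator.py."""
        return kwargs

# How an evaluator might load a solution file dynamically (sketch):
import importlib.util

def load_solution(path: str):
    """Import a solution file as a module and instantiate its
    Solution class, mirroring the loading pattern described above."""
    spec = importlib.util.spec_from_file_location("solution", path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module.Solution()
```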
Sources: research/README.md:84-137, SUBMIT.md:216-226
Frontier-CS uses three repositories:
| Repository | Purpose | Contents |
|---|---|---|
| Public (Frontier-CS) | Problem definitions, evaluation tools | Algorithmic/research problems (partial test cases), CLI, runners, generation scripts |
| Internal (Frontier-CS-internal) | Complete test suites | Same structure as public but with full test cases |
| Results (Frontier-CS-Result) | Evaluation history | JSON files with historical scores (algorithmic.json, research.json) |
A weekly evaluation workflow (.github/workflows/weekly-eval.yml, driven by scripts/run_eval.sh:1-519) runs evaluations across these repositories and records scores in the Results repository.
This separation allows public access to problem definitions while keeping complete test suites private to prevent overfitting.
Sources: scripts/run_eval.sh:1-519, README.md:171-172
To begin using Frontier-CS:
Installation: Install the frontier-cs package. See Installation and Setup for detailed instructions including Python version requirements and Docker setup.
Quick Evaluation: Run a single evaluation to verify installation. See Quick Start Guide for example commands.
Understanding Architecture: Learn about the evaluation pipeline, runners, and problem formats. See System Architecture for detailed component descriptions.
Command Reference: Explore available CLI commands. See Command-Line Interface for complete command documentation.
Problem Contribution: Submit new problems to the benchmark. See Contributing Algorithmic Problems and Contributing Research Problems for guidelines.
Sources: README.md:77-139, SUBMIT.md:1-421