APE-Bench: Evaluating Automated Proof Engineering for Formal Math Libraries

APE-Bench is the reference implementation for the paper "APE-Bench: Evaluating Automated Proof Engineering for Formal Math Libraries". This codebase implements the complete infrastructure described in the paper, including the APE-Bench automated pipeline, the APE-Harness task contract framework, and multi-version verification/retrieval services.

Paper to Code Mapping

Core Concepts

Paper Concept | Code Path
Task Contract Abstraction | src/ape/tasks/base.py (BaseTask, BaseTaskData, BaseTaskConfig, BaseTaskResult, EvaluationResult), src/ape/tasks/models.py (WorkspaceInfo)
APE-Harness Infrastructure | src/ape/ (orchestration, toolkits, runtime, scaffolds)
APE-Bench Pipeline | src/datasets/ape_bench/
Execute Services | src/ape/toolkits/execute/lean/
Retrieve Services | src/ape/toolkits/retrieve/lean/
APE-Agent Scaffold | src/ape/scaffolds/ape_agent/
Claude Code Scaffold | src/ape/scaffolds/claude_code/
Codex Scaffold | src/ape/scaffolds/codex/

Task Types

Paper Task Type | Code Implementation | Entry Point
Proof Engineering | src/ape/tasks/lean_tasks/formal_math/proof_engineering/ | LeanProofEngineeringTask
Theorem Proving | src/ape/tasks/lean_tasks/formal_math/theorem_proving/ | LeanTheoremProvingTask
Judgment | src/ape/tasks/lean_tasks/formal_math/judgment/ | LeanJudgmentTask
Instruction Synthesis | src/datasets/ape_bench/task.py | InstructionGenerationTask
Library Annotation | src/ape/toolkits/retrieve/lean/semantic_annotation/task.py | AnnotationTask

Infrastructure Services

Paper Component | Code Path | Key Function
Content-Addressable Storage | src/ape/toolkits/execute/lean/core/build_manager.py | BuildManager.build_workspace()
Semantic Search | src/ape/toolkits/retrieve/lean/ | LeanRetrieveToolsProvider
Workspace Isolation | src/ape/tasks/models.py, src/ape/runtime/ | WorkspaceInfo (access control), LocalRuntime, IsolatedLocalRuntime
Dual Verification | Task evaluation in src/ape/tasks/lean_tasks/formal_math/ | evaluate() methods

Installation

Prerequisites

Setup Steps

# 1. Clone repository
git clone https://github.com/xinhjBrant/APE-Bench.git
cd APE-Bench

# 2. Install repository
pip install -e .

Quick Start

Creating a custom task requires three components: TaskData (input), Task (logic), and register_task (registration).

See examples/arithmetic/task.py for a complete minimal example.

Example: Arithmetic Task

from ape.tasks.base import BaseTask, BaseTaskData, BaseTaskResult, EvaluationResult, register_task

# 1. Define input data
class ArithmeticTaskData(BaseTaskData):
    expression: str
    expected_result: float

# 2. Implement task logic
class ArithmeticTask(BaseTask):
    task_type = "arithmetic"
    data_class = ArithmeticTaskData
    task_result_class = BaseTaskResult

    async def create_user_prompt(self) -> str:
        return f"Compute: {self.data.expression}"

    async def evaluate(self, submission: str) -> EvaluationResult:
        try:
            result = float(submission.strip())
        except ValueError:
            # Treat non-numeric submissions as incorrect rather than raising.
            return EvaluationResult(success=False, score=0.0)
        is_correct = abs(result - self.data.expected_result) < 1e-6
        return EvaluationResult(success=is_correct, score=1.0 if is_correct else 0.0)

# 3. Register task
register_task("arithmetic", ArithmeticTask)

Input JSONL (input.jsonl):

{"task_id": "task_001", "task_type": "arithmetic", "expression": "1234 + 5678", "expected_result": 6912.0}

Usage:

python -m ape.scaffolds.ape_agent.main input.jsonl \
  --task_modules your.module.task \
  llm_config.model_name=gpt_5_mini

The --task_modules parameter imports your task module, which executes register_task() and makes the task type available.
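
Internally, registration can be thought of as a mapping from the task_type string in each JSONL record to the task class to instantiate. The sketch below illustrates that idea with a plain dict-based registry; it is not the actual register_task() implementation in src/ape/tasks/base.py.

# Illustrative dict-based task registry (conceptual sketch only).
from typing import Dict, Type

TASK_REGISTRY: Dict[str, Type] = {}

def register_task(task_type: str, task_class: Type) -> None:
    """Associate a task_type string with the class used to run such tasks."""
    TASK_REGISTRY[task_type] = task_class

def resolve_task(task_type: str) -> Type:
    """Look up the class for a loaded task record; fails if never registered."""
    try:
        return TASK_REGISTRY[task_type]
    except KeyError:
        raise ValueError(f"Unknown task_type {task_type!r}; was its module imported via --task_modules?")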

Usage Guide

1. Building Execute Database

Paper Reference: Section 4.2 "Multi-Version Execution"

The execute database provides compilation verification across multiple Mathlib4 versions using content-addressable storage.

# Build from single commit
python -m ape.toolkits.execute.lean.build \
  --target_repo "https://github.com/leanprover-community/mathlib4.git@@85eacf338f4140b402a3f970dde352b457e0dd5f"

# Build from JSONL file containing tasks
python -m ape.toolkits.execute.lean.build \
  --input_file inputs/ape_bench/ape_bench.jsonl \
  --num_processes 8

# Build from multiple datasets
python -m ape.toolkits.execute.lean.build \
  --input_file inputs/minictx_v2/*.jsonl \
  --num_processes 16

Key Parameters:

  • --target_repo: Single repository in format repo_url@@commit_hash
  • --input_file: JSONL file(s) containing tasks with target_workspace field
  • --num_processes: Parallel build processes (default: CPU count)

Output: Compiled artifacts stored in data/code_execute/storage/ with content deduplication.
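
Conceptually, content-addressable storage means each build artifact is stored under a key derived from its own bytes, so identical artifacts produced for different Mathlib4 versions occupy disk space once. The snippet below is a minimal sketch of that idea, not the actual BuildManager.build_workspace() logic; the directory layout shown is an assumption.

# Minimal sketch of content-addressable artifact storage (illustrative only).
import hashlib
import shutil
from pathlib import Path

STORAGE_ROOT = Path("data/code_execute/storage")  # assumed artifact root

def content_key(path: Path) -> str:
    """Address a file by the SHA-256 of its bytes so identical files dedupe."""
    return hashlib.sha256(path.read_bytes()).hexdigest()

def store_artifact(path: Path) -> Path:
    """Copy an artifact under its content hash, reusing an existing copy."""
    target = STORAGE_ROOT / content_key(path)
    if not target.exists():  # deduplication across builds and versions
        STORAGE_ROOT.mkdir(parents=True, exist_ok=True)
        shutil.copy2(path, target)
    return target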

2. Building Retrieval Index

Paper Reference: Section 4.2 "Version-Scoped Retrieval"

The retrieval index enables semantic search across library declarations at specific commits.

# Build index for single commit
python -m ape.toolkits.retrieve.lean.build \
  --target_repo "https://github.com/leanprover-community/mathlib4.git@@2df2f0150c275ad53cb3c90f7c98ec15a56a1a67@@Mathlib" \
  num_processes=32

# Build with reference workspaces
python -m ape.toolkits.retrieve.lean.build \
  --target_repo "https://github.com/fpvandoorn/carleson.git@@a5d265f109105809de4aaff16776b7c16b1c0bd5@@Carleson" \
  --reference_repo "https://github.com/leanprover-community/mathlib4.git@@79e94a093aff4a60fb1b1f92d9681e407124c2ca@@Mathlib" \
  num_processes=32

# Build from task file
python -m ape.toolkits.retrieve.lean.build \
  --input_file inputs/ape_bench/ape_bench.jsonl \
  num_processes=32

Key Parameters:

  • --target_repo: Target repository in format repo_url@@commit_hash@@default_target
    • default_target: Directory to extract (e.g., Mathlib), or omit for all files
  • --reference_repo: Reference repositories (can be specified multiple times)
  • num_processes: Parallel annotation processes

Output:

  • Embeddings database: data/lean_retrieve/
  • Annotated declarations: data/lean_retrieve/repos/{repo_name}/storage/annotated_ids.txt
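
At query time, the index is conceptually a nearest-neighbour search over embedding vectors of the annotated declarations. The snippet below sketches that retrieval step with plain cosine similarity; the embedding model, on-disk format, and LeanRetrieveToolsProvider API are not shown, and all names here are hypothetical.

# Illustrative cosine-similarity lookup over declaration embeddings
# (hypothetical names; not the LeanRetrieveToolsProvider API).
import numpy as np

def top_k_declarations(query_vec: np.ndarray, decl_vecs: np.ndarray,
                       decl_names: list[str], k: int = 5) -> list[tuple[str, float]]:
    """Return the k declarations whose embeddings are closest to the query."""
    q = query_vec / np.linalg.norm(query_vec)
    d = decl_vecs / np.linalg.norm(decl_vecs, axis=1, keepdims=True)
    scores = d @ q                          # cosine similarity per declaration
    order = np.argsort(scores)[::-1][:k]
    return [(decl_names[i], float(scores[i])) for i in order]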

3. Running Evaluation

Paper Reference: Section 5 "Evaluation"

Execute tasks using APE-Harness with different scaffolds and configurations.

Basic Execution

# Run with APE-Agent scaffold
python -m ape.scaffolds.ape_agent.main \
  inputs/minictx_v2/all_test.jsonl \
  llm_config.model_name=gemini_3_flash \
  execution.sample_max_cost=3.0 \
  execution.num_processes=8 \
  runtime_config.runtime_type=isolated_local \
  --orchestrator_id ape_agent_minictx_gemini_3_flash_isolated_local

# Run with Claude Code scaffold
python -m ape.scaffolds.claude_code.main \
  inputs/ape_bench/ape_bench.jsonl \
  llm_config.model_name=claude_3_opus \
  execution.sample_max_cost=3.0 \
  --orchestrator_id claude_code_ape_bench_opus

Configuration via CLI

All configuration parameters can be overridden using dot-notation:

# LLM configuration
llm_config.model_name=gpt_5.2
llm_config.temperature=0.7
llm_config.max_tokens=8000

# Execution configuration
execution.sample_max_cost=3.0          # Budget per task (USD)
execution.max_turns=100                # Conversation turn limit
execution.num_processes=8              # Parallel workers
execution.sample_count=1               # Samples per task

# Runtime configuration
runtime_config.runtime_type=isolated_local   # or: local, container
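
These overrides simply set nested keys in the scaffold's configuration tree. A minimal sketch of how key=value assignments in dot notation could be folded into a nested dict (illustrative only; the project's actual CLI parsing may differ):

# Illustrative parser for dot-notation overrides such as
# "llm_config.model_name=gpt_5.2" (not the project's CLI code).
from typing import Any, Dict

def apply_override(config: Dict[str, Any], assignment: str) -> None:
    """Set a nested config key from an 'a.b.c=value' style assignment."""
    dotted_key, raw_value = assignment.split("=", 1)
    *parents, leaf = dotted_key.split(".")
    node = config
    for part in parents:
        node = node.setdefault(part, {})
    try:  # best-effort literal parsing: numbers, then booleans, else string
        value: Any = float(raw_value) if "." in raw_value else int(raw_value)
    except ValueError:
        value = {"true": True, "false": False}.get(raw_value.lower(), raw_value)
    node[leaf] = value

config: Dict[str, Any] = {}
apply_override(config, "execution.num_processes=8")
apply_override(config, "runtime_config.runtime_type=isolated_local")
# config == {"execution": {"num_processes": 8},
#            "runtime_config": {"runtime_type": "isolated_local"}}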

Scaffold Comparison (Table 3 in paper)

# APE-Agent (integrated verification)
python -m ape.scaffolds.ape_agent.main inputs/ape_bench/ape_bench.jsonl \
  llm_config.model_name=gemini_3_flash \
  --orchestrator_id ape_agent_comparison

# Claude Code
python -m ape.scaffolds.claude_code.main inputs/ape_bench/ape_bench.jsonl \
  llm_config.model_name=gemini_3_flash \
  --orchestrator_id claude_code_comparison

# Codex CLI
python -m ape.scaffolds.codex.main inputs/ape_bench/ape_bench.jsonl \
  llm_config.model_name=gemini_3_flash \
  --orchestrator_id codex_comparison

4. Constructing APE-Bench

Paper Reference: Section 4 "APE-Bench: Automated Benchmark Construction"

The APE-Bench pipeline automatically extracts proof engineering tasks from Mathlib4 commit histories.

# Extract tasks from date range
python -m src.datasets.ape_bench.main \
  --repo_url "https://github.com/leanprover-community/mathlib4.git" \
  --start_date "2026-01-06" \
  --end_date "2026-01-12" \
  --num_processes 8 \
  model=gpt_5.2 \
  validate_generated_task=true

# Use custom configuration
python -m src.datasets.ape_bench.main \
  --config configs/ape_bench_config.yaml \
  --repo_url "https://github.com/leanprover-community/mathlib4.git" \
  --start_date "2026-01-06" \
  --end_date "2026-01-12"

Key Parameters:

  • --repo_url: Mathlib4 repository URL
  • --start_date, --end_date: Date range for commit extraction
  • model: LLM for instruction synthesis
  • validate_generated_task: Whether to validate tasks via dual verification
  • num_processes: Parallel processes for pipeline stages

Pipeline Stages (Section 4.1):

  1. Commit Filtering: Extract file-level modifications (5-100 lines)
  2. Instruction Synthesis: Generate natural-language task descriptions
  3. Validation: Verify tasks are solvable via dual verification

Output: Generated tasks are saved to the orchestrator run directory under .ape/runs/
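
As a rough illustration of Stage 1, commit filtering amounts to walking the repository history inside the date range and keeping per-file modifications whose diff size falls in the 5-100 line window. The sketch below uses GitPython with hypothetical helper names; the actual collector (src/datasets/ape_bench/collector.py) may apply further criteria.

# Rough sketch of Stage 1 (commit filtering); illustrative, not collector.py.
from git import Repo

def candidate_file_edits(repo_path: str, start_date: str, end_date: str,
                         min_lines: int = 5, max_lines: int = 100):
    """Yield (commit_sha, file_path) pairs whose diff size is in range."""
    repo = Repo(repo_path)
    for commit in repo.iter_commits(since=start_date, until=end_date):
        for file_path, stats in commit.stats.files.items():
            if not file_path.endswith(".lean"):
                continue  # only Lean source modifications are candidates
            if min_lines <= stats["lines"] <= max_lines:
                yield commit.hexsha, file_path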

Project Structure

.
├── inputs/                          # Benchmark datasets
│   ├── ape_bench/                  # APE-Bench (100 proof engineering tasks)
│   ├── ape_judge_benchmark/        # Semantic judge validation (64 tasks)
│   ├── minictx_v2/                 # Theorem proving benchmarks
│   └── minif2f/                    # Competition theorem proving
├── src/
│   ├── ape/
│   │   ├── tasks/                  # Task contract implementations
│   │   │   ├── base.py            # BaseTask, BaseTaskData
│   │   │   ├── models.py          # WorkspaceInfo, verification protocols
│   │   │   └── lean_tasks/        # Lean-specific tasks
│   │   │       └── formal_math/
│   │   │           ├── proof_engineering/   # APE tasks
│   │   │           ├── theorem_proving/     # Theorem proving
│   │   │           └── judgment/            # Semantic validation
│   │   ├── scaffolds/              # Agent implementations
│   │   │   ├── ape_agent/         # APE-Agent with integrated verification
│   │   │   ├── claude_code/       # Claude Code integration
│   │   │   └── codex/             # Codex CLI scaffold
│   │   ├── toolkits/               # Infrastructure services
│   │   │   ├── execute/           # Verification services
│   │   │   │   └── lean/
│   │   │   │       ├── build.py  # Execute database builder
│   │   │   │       └── core/     # Content-addressable storage
│   │   │   ├── retrieve/          # Semantic search
│   │   │   │   └── lean/
│   │   │   │       ├── build.py  # Retrieval index builder
│   │   │   │       └── semantic_annotation/
│   │   │   ├── file_system/       # File operations
│   │   │   └── code/              # Code navigation
│   │   ├── runtime/                # Execution environments
│   │   │   ├── local/             # Direct execution
│   │   │   ├── isolated_local/    # Workspace copying
│   │   │   └── container/         # Docker isolation
│   │   ├── orchestration/          # Batch execution
│   │   │   ├── orchestrator.py   # TaskOrchestrator
│   │   │   └── worker.py         # Task execution
│   │   └── llm_clients/            # LLM integrations
│   └── datasets/
│       ├── ape_bench/            # APE-Bench pipeline
│       │   ├── main.py            # Pipeline entry point
│       │   ├── collector.py       # Commit filtering
│       │   └── task.py            # Instruction synthesis
│       └── taxonomy/               # Task taxonomy
├── setup.py
├── requirements.txt
└── README.md

Configuration

Task Contract Specification

Task contracts are defined as JSONL files where each line is a JSON object. Tasks are loaded using load_tasks_from_file() from src/ape/tasks/utils.py:

from ape.tasks.utils import load_tasks_from_file
from pathlib import Path

# Load tasks from JSONL file
tasks = load_tasks_from_file(
    file_path=Path("inputs/minif2f/minif2f_test.jsonl"),
    config=scaffold_config,
    max_tasks=None,  # Load all tasks
    task_config_overrides={"enabled_tools": ["lean_verify"]}
)

JSONL Task Format Example (from inputs/minif2f/minif2f_test.jsonl):

{
  "task_id": "mathd_algebra_478",
  "task_type": "lean_theorem_proving",
  "metadata": {"split": "test"},
  "theorem_statement": "import Mathlib\n...\ntheorem mathd_algebra_478 ...",
  "target_workspace": {
    "name": "target",
    "commit_hash": "2df2f0150c275ad53cb3c90f7c98ec15a56a1a67",
    "repo_url": "https://github.com/leanprover-community/mathlib4.git",
    "default_target": "Mathlib",
    "toolchain": "leanprover/lean4:v4.22.0",
    "read_only_path_patterns": ["**/*"]
  }
}
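
Because every line is a standalone JSON object, a task file can be inspected with a few lines of Python, for example to count how many tasks target each workspace commit. This snippet is an illustrative helper, not part of the harness.

# Illustrative sanity check over a JSONL task file (not part of the harness).
import json
from collections import Counter
from pathlib import Path

def workspace_commits(jsonl_path: str) -> Counter:
    """Count tasks per target workspace commit in a JSONL task file."""
    counts: Counter = Counter()
    for line in Path(jsonl_path).read_text().splitlines():
        if not line.strip():
            continue
        record = json.loads(line)
        counts[record.get("target_workspace", {}).get("commit_hash", "<missing>")] += 1
    return counts

print(workspace_commits("inputs/minif2f/minif2f_test.jsonl"))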

Task Contract Components (from src/ape/tasks/base.py):

  • BaseTaskData: Input data (task_id, task_type, metadata, etc.)
  • BaseTaskConfig: Task-specific configuration (enabled_tools, etc.)
  • BaseTask: Task execution logic (setup, evaluate, etc.)
  • BaseTaskResult: Output results (success, score, metrics, etc.)
  • EvaluationResult: Intermediate evaluation state
  • WorkspaceInfo (from src/ape/tasks/models.py): Workspace metadata with access control (a sketch of how these patterns gate access follows this list):
    • blocked_path_patterns: Paths completely blocked from access
    • read_only_path_patterns: Paths that can be read but not written
    • no_read_path_patterns: Paths that can be written but not read
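
A minimal sketch of how such glob patterns might gate tool access, assuming fnmatch-style matching (illustrative only; the real checks live in WorkspaceInfo and the runtimes):

# Illustrative path-pattern access check (not the actual WorkspaceInfo logic).
from fnmatch import fnmatch
from typing import Iterable, List

def _matches(path: str, patterns: Iterable[str]) -> bool:
    return any(fnmatch(path, pattern) for pattern in patterns)

def can_read(path: str, blocked: List[str], no_read: List[str]) -> bool:
    """Reads are denied for blocked paths and for write-only (no_read) paths."""
    return not _matches(path, blocked) and not _matches(path, no_read)

def can_write(path: str, blocked: List[str], read_only: List[str]) -> bool:
    """Writes are denied for blocked paths and for read-only paths."""
    return not _matches(path, blocked) and not _matches(path, read_only)

# With read_only_path_patterns=["**/*"] (as in the JSONL example above),
# nested workspace files can be read but not modified:
can_write("Mathlib/Algebra/Basic.lean", blocked=[], read_only=["**/*"])  # False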

Scaffold Configuration

Scaffolds are configured via CLI or YAML:

# APE-Agent configuration
scaffold_type: ape_agent

llm_config:
  model_name: gpt_5.2
  temperature: 0.7
  max_tokens: 8000

runtime_config:
  runtime_type: isolated_local

execution:
  sample_count: 3
  sample_max_cost: 3.0
  max_turns: 100
  num_processes: 8

task_config:
  semantic_validation:
    enabled: true
    num_judges: 3

Multi-Version Infrastructure

Execute and retrieve services are configured in:

  • src/ape/toolkits/execute/lean/config.py - LeanVerifyToolConfig
  • src/ape/toolkits/retrieve/lean/config.py - LeanRetrieveToolConfig

Default paths:

  • Repository storage: data/code_execute/repos/
  • Build artifacts: data/code_execute/storage/
  • Retrieval database: data/lean_retrieve/
  • Execution runs: .ape/runs/

Datasets

APE-Bench

  • Path: inputs/ape_bench/ape_bench.jsonl
  • Size: 100 tasks from 67 Mathlib4 commits (2026-01-06 to 2026-01-12)
  • Task Type: Proof engineering (bug fixes, features, refactoring)
  • Zero Contamination: All commits after model training cutoffs

APE Judge Benchmark

  • Path: inputs/ape_judge_benchmark/ape_judge_benchmark.jsonl
  • Size: 64 expert-annotated tasks
  • Purpose: Validate LLM-as-Judge reliability for semantic equivalence

MiniCtx v2

  • Path: inputs/minictx_v2/
  • Size: 334 test tasks + 334 validation tasks
  • Task Type: Theorem proving from multiple domains
  • Subsets: mathlib, carleson, FLT, foundation, HepLean, Seymour, ConNF

MiniF2F

  • Path: inputs/minif2f/
  • Size: 488 tasks (244 test + 244 validation)
  • Task Type: Competition mathematics theorem proving

License

This project is licensed under the MIT License. See the LICENSE file for details.

Citation

If you use APE-Bench in your research, please cite:

@misc{xin2026apebenchevaluatingautomatedproof,
      title={APE-Bench: Evaluating Automated Proof Engineering for Formal Math Libraries},
      author={Huajian Xin and Luming Li and Xiaoran Jin and Jacques Fleuriot and Wenda Li},
      year={2026},
      eprint={2504.19110},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2504.19110v3},
}

