APE-Bench is the reference implementation for the paper "APE-Bench: Evaluating Automated Proof Engineering for Formal Math Libraries". This codebase implements the complete infrastructure described in the paper, including the APE-Bench automated pipeline, the APE-Harness task contract framework, and multi-version verification/retrieval services.
- Paper to Code Mapping
- Installation
- Quick Start
- Usage Guide
- Project Structure
- Configuration
- Datasets
- License
- Citation
| Paper Concept | Code Path |
|---|---|
| Task Contract Abstraction | src/ape/tasks/base.py (BaseTask, BaseTaskData, BaseTaskConfig, BaseTaskResult, EvaluationResult), src/ape/tasks/models.py (WorkspaceInfo) |
| APE-Harness Infrastructure | src/ape/ (orchestration, toolkits, runtime, scaffolds) |
| APE-Bench Pipeline | src/datasets/ape_bench/ |
| Execute Services | src/ape/toolkits/execute/lean/ |
| Retrieve Services | src/ape/toolkits/retrieve/lean/ |
| APE-Agent Scaffold | src/ape/scaffolds/ape_agent/ |
| Claude Code Scaffold | src/ape/scaffolds/claude_code/ |
| Codex Scaffold | src/ape/scaffolds/codex/ |
| Paper Task Type | Code Implementation | Entry Point |
|---|---|---|
| Proof Engineering | src/ape/tasks/lean_tasks/formal_math/proof_engineering/ | LeanProofEngineeringTask |
| Theorem Proving | src/ape/tasks/lean_tasks/formal_math/theorem_proving/ | LeanTheoremProvingTask |
| Judgment | src/ape/tasks/lean_tasks/formal_math/judgment/ | LeanJudgmentTask |
| Instruction Synthesis | src/datasets/ape_bench/task.py | InstructionGenerationTask |
| Library Annotation | src/ape/toolkits/retrieve/lean/semantic_annotation/task.py | AnnotationTask |
| Paper Component | Code Path | Key Function |
|---|---|---|
| Content-Addressable Storage | src/ape/toolkits/execute/lean/core/build_manager.py | BuildManager.build_workspace() |
| Semantic Search | src/ape/toolkits/retrieve/lean/ | LeanRetrieveToolsProvider |
| Workspace Isolation | src/ape/tasks/models.py, src/ape/runtime/ | WorkspaceInfo (access control), LocalRuntime, IsolatedLocalRuntime |
| Dual Verification | Task evaluation in src/ape/tasks/lean_tasks/formal_math/ | evaluate() methods |
# 1. Clone repository
git clone https://github.com/xinhjBrant/APE-Bench.git
cd APE-Bench
# 2. Install repository
pip install -e .

Creating a custom task requires three components: TaskData (input), Task (logic), and register_task (registration).
See examples/arithmetic/task.py for a complete minimal example.
from ape.tasks.base import BaseTask, BaseTaskData, BaseTaskResult, EvaluationResult, register_task
# 1. Define input data
class ArithmeticTaskData(BaseTaskData):
expression: str
expected_result: float
# 2. Implement task logic
class ArithmeticTask(BaseTask):
task_type = "arithmetic"
data_class = ArithmeticTaskData
task_result_class = BaseTaskResult
async def create_user_prompt(self) -> str:
return f"Compute: {self.data.expression}"
async def evaluate(self, submission: str) -> EvaluationResult:
result = float(submission.strip())
is_correct = abs(result - self.data.expected_result) < 1e-6
return EvaluationResult(success=is_correct, score=1.0 if is_correct else 0.0)
# 3. Register task
register_task("arithmetic", ArithmeticTask)

Input JSONL (input.jsonl):
{"task_id": "task_001", "task_type": "arithmetic", "expression": "1234 + 5678", "expected_result": 6912.0}

Usage:
python -m ape.scaffolds.ape_agent.main input.jsonl \
--task_modules your.module.task \
llm_config.model_name=gpt_5_mini

The --task_modules parameter imports your task module, which executes register_task() and makes the task type available.
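The registration mechanism can be illustrated with a minimal registry. This is a sketch only; the real `register_task` lives in `ape.tasks.base` and performs more validation than shown here.

```python
# Toy task registry illustrating what register_task()/--task_modules achieve.
# Names and error behavior are assumptions, not the repository's actual API.
_TASK_REGISTRY = {}

def register_task(task_type, task_class):
    """Map a task_type string to its implementing class."""
    if task_type in _TASK_REGISTRY:
        raise ValueError(f"task_type {task_type!r} already registered")
    _TASK_REGISTRY[task_type] = task_class

def resolve_task(task_type):
    """Look up the class for a task_type, as a scaffold would per JSONL line."""
    try:
        return _TASK_REGISTRY[task_type]
    except KeyError:
        raise KeyError(f"unknown task_type {task_type!r}; did you pass --task_modules?")

class ArithmeticTask:  # stand-in for the example task defined above
    pass

register_task("arithmetic", ArithmeticTask)
print(resolve_task("arithmetic").__name__)  # prints "ArithmeticTask"
```

Importing the module via --task_modules runs the `register_task` call at module top level, which is why the task type becomes resolvable afterwards.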
Paper Reference: Section 4.2 "Multi-Version Execution"
The execute database provides compilation verification across multiple Mathlib4 versions using content-addressable storage.
# Build from single commit
python -m ape.toolkits.execute.lean.build \
--target_repo "https://github.com/leanprover-community/mathlib4.git@@85eacf338f4140b402a3f970dde352b457e0dd5f"
# Build from JSONL file containing tasks
python -m ape.toolkits.execute.lean.build \
--input_file inputs/ape_bench/ape_bench.jsonl \
--num_processes 8
# Build from multiple datasets
python -m ape.toolkits.execute.lean.build \
--input_file inputs/minictx_v2/*.jsonl \
--num_processes 16

Key Parameters:
- `--target_repo`: Single repository in the format `repo_url@@commit_hash`
- `--input_file`: JSONL file(s) containing tasks with a `target_workspace` field
- `--num_processes`: Parallel build processes (default: CPU count)
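The `@@`-separated repository spec is simple to parse. The helper below is a sketch; its name and return shape are assumptions, not the repository's actual parsing code.

```python
def parse_repo_spec(spec):
    """Split 'repo_url@@commit_hash[@@default_target]' into its parts.

    Illustrative only; the repo's own parser may differ.
    """
    parts = spec.split("@@")
    if len(parts) == 2:
        repo_url, commit = parts
        target = None
    elif len(parts) == 3:
        repo_url, commit, target = parts
    else:
        raise ValueError(f"expected repo_url@@commit_hash[@@default_target], got {spec!r}")
    return {"repo_url": repo_url, "commit_hash": commit, "default_target": target}

spec = "https://github.com/leanprover-community/mathlib4.git@@85eacf338f4140b402a3f970dde352b457e0dd5f"
print(parse_repo_spec(spec)["commit_hash"])  # prints the pinned commit hash
```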
Output: Compiled artifacts stored in data/code_execute/storage/ with content deduplication.
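Content deduplication in a content-addressable store amounts to keying artifacts by a hash of their bytes, so identical builds are stored once. A minimal in-memory sketch follows; the class, layout, and method names are illustrative, not the actual `data/code_execute/storage/` schema.

```python
import hashlib

class ContentStore:
    """Toy content-addressable store: identical artifacts share one entry."""

    def __init__(self):
        self._blobs = {}  # sha256 hex digest -> bytes

    def put(self, data: bytes) -> str:
        """Store data under its content hash; storing a duplicate is a no-op."""
        digest = hashlib.sha256(data).hexdigest()
        self._blobs.setdefault(digest, data)
        return digest

    def get(self, digest: str) -> bytes:
        return self._blobs[digest]

store = ContentStore()
a = store.put(b"compiled artifact")
b = store.put(b"compiled artifact")  # duplicate content, deduplicated
assert a == b and len(store._blobs) == 1
```

The same idea lets builds at different commits share unchanged compiled files, which is what makes multi-version verification affordable.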
Paper Reference: Section 4.2 "Version-Scoped Retrieval"
The retrieval index enables semantic search across library declarations at specific commits.
# Build index for single commit
python -m ape.toolkits.retrieve.lean.build \
--target_repo "https://github.com/leanprover-community/mathlib4.git@@2df2f0150c275ad53cb3c90f7c98ec15a56a1a67@@Mathlib" \
num_processes=32
# Build with reference workspaces
python -m ape.toolkits.retrieve.lean.build \
--target_repo "https://github.com/fpvandoorn/carleson.git@@a5d265f109105809de4aaff16776b7c16b1c0bd5@@Carleson" \
--reference_repo "https://github.com/leanprover-community/mathlib4.git@@79e94a093aff4a60fb1b1f92d9681e407124c2ca@@Mathlib" \
num_processes=32
# Build from task file
python -m ape.toolkits.retrieve.lean.build \
--input_file inputs/ape_bench/ape_bench.jsonl \
num_processes=32

Key Parameters:

- `--target_repo`: Target repository in the format `repo_url@@commit_hash@@default_target`; `default_target` is the directory to extract (e.g., `Mathlib`), or omit it to extract all files
- `--reference_repo`: Reference repositories (can be specified multiple times)
- `num_processes`: Parallel annotation processes
Output:

- Embeddings database: `data/lean_retrieve/`
- Annotated declarations: `data/lean_retrieve/repos/{repo_name}/storage/annotated_ids.txt`
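At query time, semantic search over an embeddings database boils down to ranking declarations by vector similarity to the query embedding. The sketch below uses cosine similarity over made-up declaration names and vectors; the real index lives in the embeddings database above and is accessed through LeanRetrieveToolsProvider.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm

def top_k(query_vec, index, k=2):
    """Rank (decl_name, embedding) pairs by similarity to the query."""
    scored = [(cosine(query_vec, vec), name) for name, vec in index]
    scored.sort(reverse=True)
    return [name for _, name in scored[:k]]

# Hypothetical declarations with toy 3-dimensional embeddings.
index = [
    ("Nat.add_comm", [0.9, 0.1, 0.0]),
    ("List.map_append", [0.1, 0.8, 0.3]),
    ("Real.sqrt_nonneg", [0.0, 0.2, 0.9]),
]
print(top_k([1.0, 0.0, 0.1], index, k=1))  # prints ['Nat.add_comm']
```

Because the index is built per commit, results are scoped to declarations that actually exist at that version of the library.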
Paper Reference: Section 5 "Evaluation"
Execute tasks using APE-Harness with different scaffolds and configurations.
# Run with APE-Agent scaffold
python -m ape.scaffolds.ape_agent.main \
inputs/minictx_v2/all_test.jsonl \
llm_config.model_name=gemini_3_flash \
execution.sample_max_cost=3.0 \
execution.num_processes=8 \
runtime_config.runtime_type=isolated_local \
--orchestrator_id ape_agent_minictx_gemini_3_flash_isolated_local
# Run with Claude Code scaffold
python -m ape.scaffolds.claude_code.main \
inputs/ape_bench/ape_bench.jsonl \
llm_config.model_name=claude_3_opus \
execution.sample_max_cost=3.0 \
--orchestrator_id claude_code_ape_bench_opus

All configuration parameters can be overridden using dot-notation:
# LLM configuration
llm_config.model_name=gpt_5.2
llm_config.temperature=0.7
llm_config.max_tokens=8000
# Execution configuration
execution.sample_max_cost=3.0 # Budget per task (USD)
execution.max_turns=100 # Conversation turn limit
execution.num_processes=8 # Parallel workers
execution.sample_count=1 # Samples per task
# Runtime configuration
runtime_config.runtime_type=isolated_local # or: local, container

# APE-Agent (integrated verification)
python -m ape.scaffolds.ape_agent.main inputs/ape_bench/ape_bench.jsonl \
llm_config.model_name=gemini_3_flash \
--orchestrator_id ape_agent_comparison
# Claude Code
python -m ape.scaffolds.claude_code.main inputs/ape_bench/ape_bench.jsonl \
llm_config.model_name=gemini_3_flash \
--orchestrator_id claude_code_comparison
# Codex CLI
python -m ape.scaffolds.codex.main inputs/ape_bench/ape_bench.jsonl \
llm_config.model_name=gemini_3_flash \
--orchestrator_id codex_comparison

Paper Reference: Section 4 "APE-Bench: Automated Benchmark Construction"
The APE-Bench pipeline automatically extracts proof engineering tasks from Mathlib4 commit histories.
# Extract tasks from date range
python -m src.datasets.ape_bench.main \
--repo_url "https://github.com/leanprover-community/mathlib4.git" \
--start_date "2026-01-06" \
--end_date "2026-01-12" \
--num_processes 8 \
model=gpt_5.2 \
validate_generated_task=true
# Use custom configuration
python -m src.datasets.ape_bench.main \
--config configs/ape_bench_config.yaml \
--repo_url "https://github.com/leanprover-community/mathlib4.git" \
--start_date "2026-01-06" \
--end_date "2026-01-12"

Key Parameters:

- `--repo_url`: Mathlib4 repository URL
- `--start_date`, `--end_date`: Date range for commit extraction
- `model`: LLM for instruction synthesis
- `validate_generated_task`: Whether to validate tasks via dual verification
- `num_processes`: Parallel processes for pipeline stages
Pipeline Stages (Section 4.1):
- Commit Filtering: Extract file-level modifications (5-100 lines)
- Instruction Synthesis: Generate natural-language task descriptions
- Validation: Verify tasks are solvable via dual verification
Output: Generated tasks saved to orchestrator run directory under .ape/runs/
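The commit-filtering stage's size window can be sketched as a predicate over file-level diffs. The record shape and field names below are assumptions for illustration; see src/datasets/ape_bench/collector.py for the real logic.

```python
def within_size_window(changed_lines, low=5, high=100):
    """Keep file-level modifications whose diff touches 5-100 lines."""
    return low <= changed_lines <= high

# Hypothetical per-file diff records: (path, lines added + removed).
diffs = [
    ("Mathlib/Algebra/Group/Basic.lean", 42),
    ("Mathlib/Tactic/Linarith.lean", 3),      # too small: likely trivial
    ("Mathlib/Analysis/Calculus.lean", 250),  # too large: hard to scope as one task
]
kept = [path for path, n in diffs if within_size_window(n)]
print(kept)  # prints ['Mathlib/Algebra/Group/Basic.lean']
```

Modifications that survive this filter move on to instruction synthesis and validation.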
.
├── inputs/ # Benchmark datasets
│ ├── ape_bench/ # APE-Bench (100 proof engineering tasks)
│ ├── ape_judge_benchmark/ # Semantic judge validation (64 tasks)
│ ├── minictx_v2/ # Theorem proving benchmarks
│ └── minif2f/ # Competition theorem proving
├── src/
│ ├── ape/
│ │ ├── tasks/ # Task contract implementations
│ │ │ ├── base.py # BaseTask, BaseTaskData
│ │ │ ├── models.py # WorkspaceInfo, verification protocols
│ │ │ └── lean_tasks/ # Lean-specific tasks
│ │ │ └── formal_math/
│ │ │ ├── proof_engineering/ # APE tasks
│ │ │ ├── theorem_proving/ # Theorem proving
│ │ │ └── judgment/ # Semantic validation
│ │ ├── scaffolds/ # Agent implementations
│ │ │ ├── ape_agent/ # APE-Agent with integrated verification
│ │ │ ├── claude_code/ # Claude Code integration
│ │ │ └── codex/ # Codex CLI scaffold
│ │ ├── toolkits/ # Infrastructure services
│ │ │ ├── execute/ # Verification services
│ │ │ │ └── lean/
│ │ │ │ ├── build.py # Execute database builder
│ │ │ │ └── core/ # Content-addressable storage
│ │ │ ├── retrieve/ # Semantic search
│ │ │ │ └── lean/
│ │ │ │ ├── build.py # Retrieval index builder
│ │ │ │ └── semantic_annotation/
│ │ │ ├── file_system/ # File operations
│ │ │ └── code/ # Code navigation
│ │ ├── runtime/ # Execution environments
│ │ │ ├── local/ # Direct execution
│ │ │ ├── isolated_local/ # Workspace copying
│ │ │ └── container/ # Docker isolation
│ │ ├── orchestration/ # Batch execution
│ │ │ ├── orchestrator.py # TaskOrchestrator
│ │ │ └── worker.py # Task execution
│ │ └── llm_clients/ # LLM integrations
│ └── datasets/
│ ├── ape_bench/ # APE-Bench pipeline
│ │ ├── main.py # Pipeline entry point
│ │ ├── collector.py # Commit filtering
│ │ └── task.py # Instruction synthesis
│ └── taxonomy/ # Task taxonomy
├── setup.py
├── requirements.txt
└── README.md
Task contracts are defined as JSONL files where each line is a JSON object. Tasks are loaded using load_tasks_from_file() from src/ape/tasks/utils.py:
from ape.tasks.utils import load_tasks_from_file
from pathlib import Path
# Load tasks from JSONL file
tasks = load_tasks_from_file(
file_path=Path("inputs/minif2f/minif2f_test.jsonl"),
config=scaffold_config,
max_tasks=None, # Load all tasks
task_config_overrides={"enabled_tools": ["lean_verify"]}
)

JSONL Task Format Example (from inputs/minif2f/minif2f_test.jsonl):
{
"task_id": "mathd_algebra_478",
"task_type": "lean_theorem_proving",
"metadata": {"split": "test"},
"theorem_statement": "import Mathlib\n...\ntheorem mathd_algebra_478 ...",
"target_workspace": {
"name": "target",
"commit_hash": "2df2f0150c275ad53cb3c90f7c98ec15a56a1a67",
"repo_url": "https://github.com/leanprover-community/mathlib4.git",
"default_target": "Mathlib",
"toolchain": "leanprover/lean4:v4.22.0",
"read_only_path_patterns": ["**/*"]
}
}

Task Contract Components (from src/ape/tasks/base.py):
- `BaseTaskData`: Input data (task_id, task_type, metadata, etc.)
- `BaseTaskConfig`: Task-specific configuration (enabled_tools, etc.)
- `BaseTask`: Task execution logic (setup, evaluate, etc.)
- `BaseTaskResult`: Output results (success, score, metrics, etc.)
- `EvaluationResult`: Intermediate evaluation state
- `WorkspaceInfo` (from `src/ape/tasks/models.py`): Workspace metadata with access control:
  - `blocked_path_patterns`: Paths completely blocked from access
  - `read_only_path_patterns`: Paths that can be read but not written
  - `no_read_path_patterns`: Paths that can be written but not read
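The three pattern lists can be understood as a glob-based access check. The sketch below uses `fnmatch`; the precedence order (blocked patterns first) is an assumption of this sketch, not documented repository behavior.

```python
from fnmatch import fnmatch

def check_access(path, mode, blocked=(), read_only=(), no_read=()):
    """Illustrative access check for WorkspaceInfo-style pattern lists.

    mode is 'read' or 'write'. Precedence here (blocked wins) is an
    assumption, not the repo's documented semantics.
    """
    if any(fnmatch(path, p) for p in blocked):
        return False  # blocked paths allow neither reads nor writes
    if mode == "write" and any(fnmatch(path, p) for p in read_only):
        return False  # read-only paths reject writes
    if mode == "read" and any(fnmatch(path, p) for p in no_read):
        return False  # no-read paths reject reads
    return True

# With read_only_path_patterns=["**/*"] (as in the minif2f example above),
# every file is readable but nothing may be modified.
assert check_access("Mathlib/Data/Nat.lean", "read", read_only=["**/*"])
assert not check_access("Mathlib/Data/Nat.lean", "write", read_only=["**/*"])
```

The minif2f contract above uses exactly this read-only-everything shape: the agent can inspect Mathlib but must submit its proof rather than edit library files.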
Scaffolds are configured via CLI or YAML:
# APE-Agent configuration
scaffold_type: ape_agent
llm_config:
model_name: gpt_5.2
temperature: 0.7
max_tokens: 8000
runtime_config:
runtime_type: isolated_local
execution:
sample_count: 3
sample_max_cost: 3.0
max_turns: 100
num_processes: 8
task_config:
semantic_validation:
enabled: true
num_judges: 3

Execute and retrieve services are configured in:
- `src/ape/toolkits/execute/lean/config.py` - `LeanVerifyToolConfig`
- `src/ape/toolkits/retrieve/lean/config.py` - `LeanRetrieveToolConfig`
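The `num_judges` setting in the scaffold config above suggests that several LLM judgments are aggregated. A hypothetical majority-vote sketch follows; the actual aggregation in the judgment task may differ.

```python
from collections import Counter

def aggregate_judgments(verdicts):
    """Majority vote over boolean judge verdicts (hypothetical aggregation).

    With num_judges=3, two agreeing judges decide the outcome.
    """
    if not verdicts:
        raise ValueError("need at least one judge verdict")
    winner, _ = Counter(verdicts).most_common(1)[0]
    return winner

print(aggregate_judgments([True, True, False]))  # prints True
```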
Default paths:

- Repository storage: `data/code_execute/repos/`
- Build artifacts: `data/code_execute/storage/`
- Retrieval database: `data/lean_retrieve/`
- Execution runs: `.ape/runs/`
- Path: `inputs/ape_bench/ape_bench.jsonl`
- Size: 100 tasks from 67 Mathlib4 commits (2026-01-06 to 2026-01-12)
- Task Type: Proof engineering (bug fixes, features, refactoring)
- Zero Contamination: All commits after model training cutoffs
- Path: `inputs/ape_judge_benchmark/ape_judge_benchmark.jsonl`
- Size: 64 expert-annotated tasks
- Purpose: Validate LLM-as-Judge reliability for semantic equivalence
- Path: `inputs/minictx_v2/`
- Size: 334 test tasks + 334 validation tasks
- Task Type: Theorem proving from multiple domains
- Subsets: mathlib, carleson, FLT, foundation, HepLean, Seymour, ConNF
- Path: `inputs/minif2f/`
- Size: 488 tasks (244 test + 243 validation)
- Task Type: Competition mathematics theorem proving
This project is licensed under the MIT License. See the LICENSE file for details.
If you use APE-Bench in your research, please cite:
@misc{xin2026apebenchevaluatingautomatedproof,
title={APE-Bench: Evaluating Automated Proof Engineering for Formal Math Libraries},
author={Huajian Xin and Luming Li and Xiaoran Jin and Jacques Fleuriot and Wenda Li},
year={2026},
eprint={2504.19110},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2504.19110v3},
}