feat: add agent-based solution generation via Claude Agent SDK #104

Open
andylizf wants to merge 12 commits into main from feat/agent-eval-algorithmic

andylizf (Contributor) commented Apr 16, 2026

Summary

  • Add agent-based solution generation pipeline using Claude Agent SDK
  • Agent models identified by -agent suffix (e.g., claude-sonnet-4-5-agent)
  • Integrated into existing generate_solutions.py — same CLI, just pass an agent model name
  • Agent gets problem statement only, must self-test (no test data, no checker, no interactor)
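
The `-agent` suffix convention from the summary could be detected along these lines. This is a minimal sketch rather than the PR's actual code: `AGENT_SUFFIX`, `is_agent_model`, and `base_model_name` are hypothetical names, and the real handling in `src/frontier_cs/models.py` may differ.

```python
AGENT_SUFFIX = "-agent"  # suffix named in the PR summary


def is_agent_model(model: str) -> bool:
    """Return True when the model name requests agent mode (hypothetical helper)."""
    # Require a non-empty base name in front of the suffix.
    return model.endswith(AGENT_SUFFIX) and len(model) > len(AGENT_SUFFIX)


def base_model_name(model: str) -> str:
    """Strip the suffix so prefix/provider detection sees the plain model name."""
    return model[: -len(AGENT_SUFFIX)] if is_agent_model(model) else model
```

Under this assumption, `claude-sonnet-4-5-agent` would route to the agent pipeline while provider detection still runs on `claude-sonnet-4-5`.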

Files

  • src/frontier_cs/gen/agent_interface.py — core agent lifecycle: prompt construction, SDK invocation, streaming, transcript logging, timeout/cost control, solution extraction
  • src/frontier_cs/gen/agent_constants.py — prompt templates, helper shell scripts, CLAUDE.md content
  • src/frontier_cs/models.py — -agent model suffix handling in prefix/provider detection
  • algorithmic/scripts/generate_solutions.py — agent mode integration
  • tests/test_agent_interface.py — 18 tests

Test plan

  • pytest tests/test_agent_interface.py — 18/18 pass
  • End-to-end run on a few problems with the actual agent

andylizf added 10 commits April 6, 2026 11:40

Add agent model support to the solution generation pipeline:
- Detect -agent suffix models and store problem_dir in GenerationTask
- Add --agent-timeout and --agent-cost-limit CLI arguments
- Branch execute_task to call generate_agent_solution for agent models
- Save .meta.json alongside generated .cpp solutions
- Add import json for metadata serialization

- Copy problem dir to temp directory so agent doesn't pollute originals
- Makes concurrent runs on same problem safe
- Track token usage from streaming message_delta events (only reliable
  source when timeout kills run before ResultMessage arrives)
- Clean up temp dir after extraction
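
The copy-then-clean-up behaviour described above can be sketched with the standard library. This is an illustration under assumptions, not the code in `agent_interface.py`: the context manager `isolated_workspace` and the `agent-workdir-` prefix are hypothetical names.

```python
import shutil
import tempfile
from contextlib import contextmanager
from pathlib import Path
from typing import Iterator


@contextmanager
def isolated_workspace(problem_dir: Path) -> Iterator[Path]:
    """Yield a private copy of problem_dir and delete it afterwards (hypothetical helper)."""
    # A fresh temp root per call keeps concurrent runs on the same problem apart.
    tmp_root = Path(tempfile.mkdtemp(prefix="agent-workdir-"))
    workdir = tmp_root / "problem"
    try:
        shutil.copytree(problem_dir, workdir)
        yield workdir
    finally:
        # Runs whether the body finished normally or raised.
        shutil.rmtree(tmp_root, ignore_errors=True)
```

Anything the agent writes has to be read out inside the `with` block, since the copy is gone once the block exits; that matches the "clean up after extraction" ordering in the commit message.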
… for agent eval

Build dynamic agent prompts from problem config (time/memory limits,
subtask counts, interactive vs standard). Write test_all.sh and
run_interactive.sh into agent workdir. Embed small sample I/O directly
in prompt. Add CLAUDE.md with solving strategy guidance.

Parity mode (--parity flag) strips all test data, helper scripts, checker,
and interactor from the agent workspace, matching the Harbor adapter setup
where agents must self-test via brute-force cross-validation (对拍, i.e.
checking a solution against a brute-force reference on random inputs).
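
For readers unfamiliar with the technique, brute-force cross-validation can be sketched as follows. This is an illustrative Python example on a toy problem (maximum subarray sum), not the agent's actual workflow, which targets C++ solutions; all function names here are hypothetical.

```python
import random
from typing import Callable, List, Optional


def brute_max_subarray(a: List[int]) -> int:
    """Slow but obviously correct reference: try every non-empty contiguous slice."""
    n = len(a)
    return max(sum(a[i:j]) for i in range(n) for j in range(i + 1, n + 1))


def fast_max_subarray(a: List[int]) -> int:
    """Kadane's algorithm, the candidate solution under test."""
    best = cur = a[0]
    for x in a[1:]:
        cur = max(x, cur + x)
        best = max(best, cur)
    return best


def cross_validate(
    candidate: Callable[[List[int]], int],
    reference: Callable[[List[int]], int],
    rounds: int = 200,
    seed: int = 0,
) -> Optional[List[int]]:
    """Return the first random input where the two disagree, or None if all agree."""
    rng = random.Random(seed)
    for _ in range(rounds):
        # Small inputs keep the brute force cheap and counterexamples readable.
        a = [rng.randint(-5, 5) for _ in range(rng.randint(1, 6))]
        if candidate(a) != reference(a):
            return a
    return None
```

A returned input is a concrete counterexample to debug against; `None` only means no disagreement was found in the sampled inputs, not that the candidate is proven correct.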

Changes:
- agent_interface.py: parity-aware prompt, workspace setup, CLAUDE.md,
  _get_infra_git_hash(), and enriched build_metadata (timestamp, parity flag)
- generate_solutions.py: --parity CLI argument
- tests: parity prompt validation (standard + interactive)
- docs: solutions repo separation plan (infra_git_hash in meta.json)
- .gitignore: exclude .claude/ directory
- pyproject.toml: add pytest dev dependency

These belong to the solutions repo separation effort, which is docs-only
for now. Removed _get_infra_git_hash(), subprocess import, and the
infra_git_hash/timestamp/parity fields from build_metadata().

…n doc

Agent always runs without test data — no --parity flag needed.
The solutions repo separation plan is not ready to commit.

Move all large string constants (prompt templates, shell scripts, CLAUDE.md
content) out of agent_interface.py into a dedicated constants module.

andylizf changed the title from "feat: agent eval with parity mode for Harbor alignment" to "feat: add agent-based solution generation via Claude Agent SDK" on Apr 16, 2026

Prompt (initial message) is now lean: only problem-specific info (path,
type, limits). CLAUDE.md carries persistent guidance that survives context
compaction: self-testing methodology, workflow steps, common mistakes,
retreat strategy.